Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

At a glance

Base modelcerebras/DeepSeek-V3.2-REAP-508B-A37B
FormatNVFP4
Total params508B
Active / token37B
Experts / layer192
Layers61
Hidden size7168
Context163,840
On-disk size288 GB

Which variant should I pick?

VariantFormatLink
DeepSeek-V3.2-345B-W3A16W3A16link
DeepSeek-V3.2-508B-NVFP4 (this)NVFP4link

๐“Œณ REAP ๐“Œณ the Experts: Why Pruning Prevails for One-Shot MoE Compression

๐Ÿ“„ Paper โ€ข ๐Ÿ’ป Code

DeepSeek-V3.2-REAP-508B-A37B-NVFP4

REAP-pruned + NVFP4 quantized DeepSeek-V3.2 for efficient deployment on NVIDIA Blackwell (sm120).

This is the first publicly available NVFP4-quantized variant of the 508B-parameter REAP-pruned DeepSeek-V3.2, targeting 8x RTX PRO 6000 Blackwell 96GB deployments via sglang.

๐Ÿ“‹ Model Specifications

PropertyValue
Base Modelcerebras/DeepSeek-V3.2-REAP-508B-A37B (REAP-pruned from deepseek-ai/DeepSeek-V3.2)
ArchitectureDeepseekV3ForCausalLM (MoE with MLA)
Params508B total, ~37B active per token (top-8 of 384 routed + 1 shared)
Base precisionBF16 (source: ~1.0 TB)
QuantizationNVFP4 (4-bit weights + FP8 per-group scales, group=16)
Output size288 GB (~3.6x compression)
Experts per MoE layer384 routed + 1 shared
Layers61
Hidden size7168
Formatnvfp4-pack-quantized via compressed-tensors

๐Ÿš€ Deploy on sm120 (RTX PRO 6000 Blackwell)

Uses pre-built voipmonitor/sglang:cu130 Docker image with all sm120 patches applied.

bash

docker run --gpus all --ipc=host --shm-size=8g --network=host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v jit-cache:/cache/jit \
-e SGLANG_ENABLE_SPEC_V2=True \
-e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
-e SGLANG_ENABLE_DEEP_GEMM=0 \
-e NCCL_IB_DISABLE=1 \
-e NCCL_P2P_LEVEL=SYS \
-e NCCL_MIN_NCHANNELS=8 \
voipmonitor/sglang:cu130 \
python3 -m sglang.launch_server \
--model-path 0xSero/DeepSeek-V3.2-508B-NVFP4 \
--served-model-name deepseek-v32-reap-nvfp4 \
--tensor-parallel-size 8 \
--quantization modelopt_fp4 \
--kv-cache-dtype bf16 \
--trust-remote-code \
--attention-backend flashinfer \
--moe-runner-backend b12x \
--cuda-graph-max-bs 32 \
--mem-fraction-static 0.85 \
--host 0.0.0.0 --port 5000 \
--disable-custom-all-reduce

Critical flags:

  • --kv-cache-dtype bf16 โ€” mandatory; fp8_e4m3 produces garbled output on sm120
  • --attention-backend flashinfer โ€” sm120-compatible
  • --quantization modelopt_fp4 โ€” sglang's NVFP4 loader for compressed-tensors format
  • SGLANG_ENABLE_DEEP_GEMM=0 โ€” DeepGEMM needs WGMMA/TCGEN05 absent on sm120

Memory fit: 288 GB weights + KV cache fits on 8x 96GB (โ‰ˆ768 GB total VRAM).

Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.


๐Ÿ”ฌ Quantization Method

Produced via AutoRound 0.12.2 layerwise mode on 8x H100 80GB.

Settings

SettingValueNotes
--schemeNVFP44-bit weights + FP8 per-group scales
--iters200Full tuning (same hyperparameter as GPTQ variant)
--nsamples512Calibration samples
--seqlen2048Default
--batch_size8Default
--low_gpu_mem_usagetrueRequired for ~1TB source on 640GB VRAM
--group_size16Matches NVFP4 native 16-element block scale
--formatauto_round:llm_compressorProduces compressed-tensors (sglang/vLLM compatible)
--disable_amptrueAvoids autocast issues on BF16 source

Calibration Dataset

SourceSamplesContent
NeelNanda/pile-10k512General web text (distribution anchor)

Multi-dataset loading used AutoRound's :concat=true option to pack short samples into full-seqlen sequences.

Wall Time

  • Quantization tuning: 19h 38m (61 blocks, ~20 min/block)
  • Packing + save: ~7 min (58 safetensors shards, 288 GB)
  • Total: ~19.7 hours on 8x H100 80GB

Quality Characteristics

Layer-level loss trajectory (iter 0 โ†’ final):

Layer depthiter 0 lossfinal lossBehavior
0-101e-6 to 1e-250-80% reductionEarly layers, minimal drift
11-301e-2 to 1e-130-50% reductionSign-tuning active
31-501e-1 to 5e-120-30% reductionAccumulating
51-605e-1 to 1.910-20% reductionDeep-layer drift (layer 60: 1.86 โ†’ 1.50)

Weight-validity check (CPU dequant, pre-upload):

  • Cosine similarity vs BF16 source: 0.995+ across all tested layers
  • Relative MAE: ~9% uniformly (typical NVFP4 reconstruction error)

๐Ÿ“Š Benchmarks

Pending. Run on 8x RTX PRO 6000 sm120 and report:

TaskScoreNotes
MMLU (5-shot)โ€”
GSM8Kโ€”
MATHโ€”
HumanEvalโ€”
IFEval strictโ€”

Expected ranges (based on GLM-5.1-555B-A14B-REAP-NVFP4 precedent):

  • MMLU: 73-79% (BF16 base ~75-80%, โˆ’1 to โˆ’2 pp for NVFP4)
  • GSM8K: 78-88% (BF16 base ~80-90%)
  • Decode throughput: 50-70 tok/s @ conc=1, 120-180 tok/s @ conc=4

๐Ÿงพ Provenance

StepDetails
Source modelcerebras/DeepSeek-V3.2-REAP-508B-A37B (BF16, ~1.0 TB, 96 safetensors)
PruningREAP (Relative Expert Activation Pruning) โ€” 384 โ†’ 384 experts (structure preserved, 508B is a REAP variant)
Quantization computeNebius H100x8 via brev
Quant toolIntel AutoRound 0.12.2
Deploy toolvoipmonitor/sglang:cu130
Upload date2026-04-21

License & citation

License inherited from the base model.

bibtex

@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA ยท TNG Technology ยท Lambda ยท Prime Intellect ยท Hot Aisle.

Model provider

0xSero

Model tree

Base

cerebras/DeepSeek-V3.2-REAP-508B-A37B

Quantized

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today