At a glance
Which variant should I pick?
Table with columns: Variant, Format, Link| Variant | Format | Link |
|---|
DeepSeek-V3.2-345B-W3A16 | W3A16 | link |
DeepSeek-V3.2-508B-NVFP4 (this) | NVFP4 | link |
𓌳 REAP 𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression
📄 Paper • 💻 Code
DeepSeek-V3.2-REAP-508B-A37B-NVFP4
REAP-pruned + NVFP4 quantized DeepSeek-V3.2 for efficient deployment on NVIDIA Blackwell (sm120).
This is the first publicly available NVFP4-quantized variant of the 508B-parameter REAP-pruned DeepSeek-V3.2, targeting 8x RTX PRO 6000 Blackwell 96GB deployments via sglang.
📋 Model Specifications
Table with columns: Property, Value| Property | Value |
|---|
| Base Model | cerebras/DeepSeek-V3.2-REAP-508B-A37B (REAP-pruned from deepseek-ai/DeepSeek-V3.2) |
| Architecture | DeepseekV3ForCausalLM (MoE with MLA) |
| Params | 508B total, ~37B active per token (top-8 of 384 routed + 1 shared) |
| Base precision | BF16 (source: ~1.0 TB) |
| Quantization | NVFP4 (4-bit weights + FP8 per-group scales, group=16) |
| Output size | 288 GB (~3.6x compression) |
🚀 Deploy on sm120 (RTX PRO 6000 Blackwell)
Uses pre-built voipmonitor/sglang:cu130 Docker image with all sm120 patches applied.
docker run --gpus all --ipc=host --shm-size=8g --network=host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v jit-cache:/cache/jit \
-e SGLANG_ENABLE_SPEC_V2=True \
-e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
-e SGLANG_ENABLE_DEEP_GEMM=0 \
-e NCCL_IB_DISABLE=1 \
-e NCCL_P2P_LEVEL=SYS \
-e NCCL_MIN_NCHANNELS=8 \
voipmonitor/sglang:cu130 \
python3 -m sglang.launch_server \
--model-path 0xSero/DeepSeek-V3.2-508B-NVFP4 \
--served-model-name deepseek-v32-reap-nvfp4 \
--tensor-parallel-size 8 \
--quantization modelopt_fp4 \
--kv-cache-dtype bf16 \
--trust-remote-code \
--attention-backend flashinfer \
--moe-runner-backend b12x \
--cuda-graph-max-bs 32 \
--mem-fraction-static 0.85 \
--host 0.0.0.0 --port 5000 \
--disable-custom-all-reduce
Critical flags:
--kv-cache-dtype bf16 — mandatory; fp8_e4m3 produces garbled output on sm120
--attention-backend flashinfer — sm120-compatible
--quantization modelopt_fp4 — sglang's NVFP4 loader for compressed-tensors format
SGLANG_ENABLE_DEEP_GEMM=0 — DeepGEMM needs WGMMA/TCGEN05 absent on sm120
Memory fit: 288 GB weights + KV cache fits on 8x 96GB (≈768 GB total VRAM).
Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.
🔬 Quantization Method
Produced via AutoRound 0.12.2 layerwise mode on 8x H100 80GB.
Settings
Table with columns: Setting, Value, Notes| Setting | Value | Notes |
|---|
--scheme | NVFP4 | 4-bit weights + FP8 per-group scales |
--iters | 200 | Full tuning (same hyperparameter as GPTQ variant) |
--nsamples | 512 | Calibration samples |
--seqlen | 2048 | Default |
|
Calibration Dataset
Table with columns: Source, Samples, Content| Source | Samples | Content |
|---|
| NeelNanda/pile-10k | 512 | General web text (distribution anchor) |
Multi-dataset loading used AutoRound's :concat=true option to pack short samples into full-seqlen sequences.
Wall Time
- Quantization tuning: 19h 38m (61 blocks, ~20 min/block)
- Packing + save: ~7 min (58 safetensors shards, 288 GB)
- Total: ~19.7 hours on 8x H100 80GB
Quality Characteristics
Layer-level loss trajectory (iter 0 → final):
Table with columns: Layer depth, iter 0 loss, final loss, Behavior| Layer depth | iter 0 loss | final loss | Behavior |
|---|
| 0-10 | 1e-6 to 1e-2 | 50-80% reduction | Early layers, minimal drift |
| 11-30 | 1e-2 to 1e-1 | 30-50% reduction | Sign-tuning active |
| 31-50 | 1e-1 to 5e-1 | 20-30% reduction | Accumulating |
| 51-60 | 5e-1 to 1.9 | 10-20% reduction | Deep-layer drift (layer 60: 1.86 → 1.50) |
Weight-validity check (CPU dequant, pre-upload):
- Cosine similarity vs BF16 source: 0.995+ across all tested layers
- Relative MAE: ~9% uniformly (typical NVFP4 reconstruction error)
📊 Benchmarks
Pending. Run on 8x RTX PRO 6000 sm120 and report:
Table with columns: Task, Score, Notes| Task | Score | Notes |
|---|
| MMLU (5-shot) | — | |
| GSM8K | — | |
| MATH | — | |
| HumanEval | — | |
| IFEval strict | — | |
Expected ranges (based on GLM-5.1-555B-A14B-REAP-NVFP4 precedent):
- MMLU: 73-79% (BF16 base ~75-80%, −1 to −2 pp for NVFP4)
- GSM8K: 78-88% (BF16 base ~80-90%)
- Decode throughput: 50-70 tok/s @ conc=1, 120-180 tok/s @ conc=4
🧾 Provenance
Table with columns: Step, Details| Step | Details |
|---|
| Source model | cerebras/DeepSeek-V3.2-REAP-508B-A37B (BF16, ~1.0 TB, 96 safetensors) |
| Pruning | REAP (Relative Expert Activation Pruning) — 384 → 384 experts (structure preserved, 508B is a REAP variant) |
| Quantization compute | Nebius H100x8 via brev |
| Quant tool | Intel AutoRound 0.12.2 |
| Deploy tool | voipmonitor/sglang:cu130 |
|
License & citation
License inherited from the base model.
@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.