Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0At a glance
| Base model | cerebras/DeepSeek-V3.2-REAP-508B-A37B |
| Format | NVFP4 |
| Total params | 508B |
| Active / token | 37B |
| Experts / layer | 192 |
| Layers | 61 |
| Hidden size | 7168 |
| Context | 163,840 |
| On-disk size | 288 GB |
Which variant should I pick?
๐ณ REAP ๐ณ the Experts: Why Pruning Prevails for One-Shot MoE Compression
๐ Paper โข ๐ป Code
DeepSeek-V3.2-REAP-508B-A37B-NVFP4
REAP-pruned + NVFP4 quantized DeepSeek-V3.2 for efficient deployment on NVIDIA Blackwell (sm120).
This is the first publicly available NVFP4-quantized variant of the 508B-parameter REAP-pruned DeepSeek-V3.2, targeting 8x RTX PRO 6000 Blackwell 96GB deployments via sglang.
๐ Model Specifications
| Property | Value |
|---|---|
| Base Model | cerebras/DeepSeek-V3.2-REAP-508B-A37B (REAP-pruned from deepseek-ai/DeepSeek-V3.2) |
| Architecture | DeepseekV3ForCausalLM (MoE with MLA) |
| Params | 508B total, ~37B active per token (top-8 of 384 routed + 1 shared) |
| Base precision | BF16 (source: ~1.0 TB) |
| Quantization | NVFP4 (4-bit weights + FP8 per-group scales, group=16) |
| Output size | 288 GB (~3.6x compression) |
| Experts per MoE layer | 384 routed + 1 shared |
| Layers | 61 |
| Hidden size | 7168 |
| Format | nvfp4-pack-quantized via compressed-tensors |
๐ Deploy on sm120 (RTX PRO 6000 Blackwell)
Uses pre-built voipmonitor/sglang:cu130 Docker image with all sm120 patches applied.
bash
docker run --gpus all --ipc=host --shm-size=8g --network=host \--ulimit memlock=-1 --ulimit stack=67108864 \-v ~/.cache/huggingface:/root/.cache/huggingface \-v jit-cache:/cache/jit \-e SGLANG_ENABLE_SPEC_V2=True \-e SGLANG_ENABLE_JIT_DEEPGEMM=0 \-e SGLANG_ENABLE_DEEP_GEMM=0 \-e NCCL_IB_DISABLE=1 \-e NCCL_P2P_LEVEL=SYS \-e NCCL_MIN_NCHANNELS=8 \voipmonitor/sglang:cu130 \python3 -m sglang.launch_server \--model-path 0xSero/DeepSeek-V3.2-508B-NVFP4 \--served-model-name deepseek-v32-reap-nvfp4 \--tensor-parallel-size 8 \--quantization modelopt_fp4 \--kv-cache-dtype bf16 \--trust-remote-code \--attention-backend flashinfer \--moe-runner-backend b12x \--cuda-graph-max-bs 32 \--mem-fraction-static 0.85 \--host 0.0.0.0 --port 5000 \--disable-custom-all-reduce
Critical flags:
--kv-cache-dtype bf16โ mandatory; fp8_e4m3 produces garbled output on sm120--attention-backend flashinferโ sm120-compatible--quantization modelopt_fp4โ sglang's NVFP4 loader for compressed-tensors formatSGLANG_ENABLE_DEEP_GEMM=0โ DeepGEMM needs WGMMA/TCGEN05 absent on sm120
Memory fit: 288 GB weights + KV cache fits on 8x 96GB (โ768 GB total VRAM).
Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.
๐ฌ Quantization Method
Produced via AutoRound 0.12.2 layerwise mode on 8x H100 80GB.
Settings
| Setting | Value | Notes |
|---|---|---|
--scheme | NVFP4 | 4-bit weights + FP8 per-group scales |
--iters | 200 | Full tuning (same hyperparameter as GPTQ variant) |
--nsamples | 512 | Calibration samples |
--seqlen | 2048 | Default |
--batch_size | 8 | Default |
--low_gpu_mem_usage | true | Required for ~1TB source on 640GB VRAM |
--group_size | 16 | Matches NVFP4 native 16-element block scale |
--format | auto_round:llm_compressor | Produces compressed-tensors (sglang/vLLM compatible) |
--disable_amp | true | Avoids autocast issues on BF16 source |
Calibration Dataset
| Source | Samples | Content |
|---|---|---|
| NeelNanda/pile-10k | 512 | General web text (distribution anchor) |
Multi-dataset loading used AutoRound's :concat=true option to pack short samples into full-seqlen sequences.
Wall Time
- Quantization tuning: 19h 38m (61 blocks, ~20 min/block)
- Packing + save: ~7 min (58 safetensors shards, 288 GB)
- Total: ~19.7 hours on 8x H100 80GB
Quality Characteristics
Layer-level loss trajectory (iter 0 โ final):
| Layer depth | iter 0 loss | final loss | Behavior |
|---|---|---|---|
| 0-10 | 1e-6 to 1e-2 | 50-80% reduction | Early layers, minimal drift |
| 11-30 | 1e-2 to 1e-1 | 30-50% reduction | Sign-tuning active |
| 31-50 | 1e-1 to 5e-1 | 20-30% reduction | Accumulating |
| 51-60 | 5e-1 to 1.9 | 10-20% reduction | Deep-layer drift (layer 60: 1.86 โ 1.50) |
Weight-validity check (CPU dequant, pre-upload):
- Cosine similarity vs BF16 source: 0.995+ across all tested layers
- Relative MAE: ~9% uniformly (typical NVFP4 reconstruction error)
๐ Benchmarks
Pending. Run on 8x RTX PRO 6000 sm120 and report:
| Task | Score | Notes |
|---|---|---|
| MMLU (5-shot) | โ | |
| GSM8K | โ | |
| MATH | โ | |
| HumanEval | โ | |
| IFEval strict | โ |
Expected ranges (based on GLM-5.1-555B-A14B-REAP-NVFP4 precedent):
- MMLU: 73-79% (BF16 base ~75-80%, โ1 to โ2 pp for NVFP4)
- GSM8K: 78-88% (BF16 base ~80-90%)
- Decode throughput: 50-70 tok/s @ conc=1, 120-180 tok/s @ conc=4
๐งพ Provenance
| Step | Details |
|---|---|
| Source model | cerebras/DeepSeek-V3.2-REAP-508B-A37B (BF16, ~1.0 TB, 96 safetensors) |
| Pruning | REAP (Relative Expert Activation Pruning) โ 384 โ 384 experts (structure preserved, 508B is a REAP variant) |
| Quantization compute | Nebius H100x8 via brev |
| Quant tool | Intel AutoRound 0.12.2 |
| Deploy tool | voipmonitor/sglang:cu130 |
| Upload date | 2026-04-21 |
License & citation
License inherited from the base model.
bibtex
@misc{lasby2025reap,title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}}
Sponsors
Made possible by NVIDIA ยท TNG Technology ยท Lambda ยท Prime Intellect ยท Hot Aisle.
Model provider
0xSero
Model tree
Base
cerebras/DeepSeek-V3.2-REAP-508B-A37B
Quantized
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information