0xSero/DeepSeek-V3.2-508B-NVFP4 API & Inference Endpoint

At a glance

Table

Base model	cerebras/DeepSeek-V3.2-REAP-508B-A37B
Format	NVFP4
Total params	508B
Active / token	37B
Experts / layer	192
Layers	61
Hidden size	7168
Context	163,840
On-disk size	288 GB

Which variant should I pick?

Table with columns: Variant, Format, Link
Variant	Format	Link
`DeepSeek-V3.2-345B-W3A16`	W3A16	link
`DeepSeek-V3.2-508B-NVFP4` (this)	NVFP4	link

𓌳 REAP 𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression

📄 Paper • 💻 Code

DeepSeek-V3.2-REAP-508B-A37B-NVFP4

REAP-pruned + NVFP4 quantized DeepSeek-V3.2 for efficient deployment on NVIDIA Blackwell (sm120).

This is the first publicly available NVFP4-quantized variant of the 508B-parameter REAP-pruned DeepSeek-V3.2, targeting 8x RTX PRO 6000 Blackwell 96GB deployments via sglang.

📋 Model Specifications

Table with columns: Property, Value
Property	Value
Base Model	`cerebras/DeepSeek-V3.2-REAP-508B-A37B` (REAP-pruned from `deepseek-ai/DeepSeek-V3.2`)
Architecture	`DeepseekV3ForCausalLM` (MoE with MLA)
Params	508B total, ~37B active per token (top-8 of 384 routed + 1 shared)
Base precision	BF16 (source: ~1.0 TB)
Quantization	NVFP4 (4-bit weights + FP8 per-group scales, group=16)
Output size	288 GB (~3.6x compression)

🚀 Deploy on sm120 (RTX PRO 6000 Blackwell)

Uses pre-built voipmonitor/sglang:cu130 Docker image with all sm120 patches applied.

bash
docker run --gpus all --ipc=host --shm-size=8g --network=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v jit-cache:/cache/jit \
  -e SGLANG_ENABLE_SPEC_V2=True \
  -e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
  -e SGLANG_ENABLE_DEEP_GEMM=0 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_P2P_LEVEL=SYS \
  -e NCCL_MIN_NCHANNELS=8 \
  voipmonitor/sglang:cu130 \
  python3 -m sglang.launch_server \
    --model-path 0xSero/DeepSeek-V3.2-508B-NVFP4 \
    --served-model-name deepseek-v32-reap-nvfp4 \
    --tensor-parallel-size 8 \
    --quantization modelopt_fp4 \
    --kv-cache-dtype bf16 \
    --trust-remote-code \
    --attention-backend flashinfer \
    --moe-runner-backend b12x \
    --cuda-graph-max-bs 32 \
    --mem-fraction-static 0.85 \
    --host 0.0.0.0 --port 5000 \
    --disable-custom-all-reduce

Critical flags:

--kv-cache-dtype bf16 — mandatory; fp8_e4m3 produces garbled output on sm120
--attention-backend flashinfer — sm120-compatible
--quantization modelopt_fp4 — sglang's NVFP4 loader for compressed-tensors format
SGLANG_ENABLE_DEEP_GEMM=0 — DeepGEMM needs WGMMA/TCGEN05 absent on sm120

Memory fit: 288 GB weights + KV cache fits on 8x 96GB (≈768 GB total VRAM).

Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.

🔬 Quantization Method

Produced via AutoRound 0.12.2 layerwise mode on 8x H100 80GB.

Settings

Table with columns: Setting, Value, Notes
Setting	Value	Notes
`--scheme`	NVFP4	4-bit weights + FP8 per-group scales
`--iters`	200	Full tuning (same hyperparameter as GPTQ variant)
`--nsamples`	512	Calibration samples
`--seqlen`	2048	Default

Calibration Dataset

Table with columns: Source, Samples, Content
Source	Samples	Content
NeelNanda/pile-10k	512	General web text (distribution anchor)

Multi-dataset loading used AutoRound's :concat=true option to pack short samples into full-seqlen sequences.

Wall Time

Quantization tuning: 19h 38m (61 blocks, ~20 min/block)
Packing + save: ~7 min (58 safetensors shards, 288 GB)
Total: ~19.7 hours on 8x H100 80GB

Quality Characteristics

Layer-level loss trajectory (iter 0 → final):

Table with columns: Layer depth, iter 0 loss, final loss, Behavior
Layer depth	iter 0 loss	final loss	Behavior
0-10	1e-6 to 1e-2	50-80% reduction	Early layers, minimal drift
11-30	1e-2 to 1e-1	30-50% reduction	Sign-tuning active
31-50	1e-1 to 5e-1	20-30% reduction	Accumulating
51-60	5e-1 to 1.9	10-20% reduction	Deep-layer drift (layer 60: 1.86 → 1.50)

Weight-validity check (CPU dequant, pre-upload):

Cosine similarity vs BF16 source: 0.995+ across all tested layers
Relative MAE: ~9% uniformly (typical NVFP4 reconstruction error)

📊 Benchmarks

Pending. Run on 8x RTX PRO 6000 sm120 and report:

Table with columns: Task, Score, Notes
Task	Score	Notes
MMLU (5-shot)	—
GSM8K	—
MATH	—
HumanEval	—
IFEval strict	—

Expected ranges (based on GLM-5.1-555B-A14B-REAP-NVFP4 precedent):

MMLU: 73-79% (BF16 base ~75-80%, −1 to −2 pp for NVFP4)
GSM8K: 78-88% (BF16 base ~80-90%)
Decode throughput: 50-70 tok/s @ conc=1, 120-180 tok/s @ conc=4

🧾 Provenance

Table with columns: Step, Details
Step	Details
Source model	`cerebras/DeepSeek-V3.2-REAP-508B-A37B` (BF16, ~1.0 TB, 96 safetensors)
Pruning	REAP (Relative Expert Activation Pruning) — 384 → 384 experts (structure preserved, 508B is a REAP variant)
Quantization compute	Nebius H100x8 via brev
Quant tool	Intel AutoRound 0.12.2
Deploy tool	voipmonitor/sglang:cu130

License & citation

License inherited from the base model.

bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

At a glance

Table

Base model	cerebras/DeepSeek-V3.2-REAP-508B-A37B
Format	NVFP4
Total params	508B
Active / token	37B
Experts / layer	192
Layers	61
Hidden size	7168
Context	163,840
On-disk size	288 GB

Which variant should I pick?

Table with columns: Variant, Format, Link
Variant	Format	Link
`DeepSeek-V3.2-345B-W3A16`	W3A16	link
`DeepSeek-V3.2-508B-NVFP4` (this)	NVFP4	link

𓌳 REAP 𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression

📄 Paper • 💻 Code

DeepSeek-V3.2-REAP-508B-A37B-NVFP4

REAP-pruned + NVFP4 quantized DeepSeek-V3.2 for efficient deployment on NVIDIA Blackwell (sm120).

This is the first publicly available NVFP4-quantized variant of the 508B-parameter REAP-pruned DeepSeek-V3.2, targeting 8x RTX PRO 6000 Blackwell 96GB deployments via sglang.

📋 Model Specifications

Table with columns: Property, Value
Property	Value
Base Model	`cerebras/DeepSeek-V3.2-REAP-508B-A37B` (REAP-pruned from `deepseek-ai/DeepSeek-V3.2`)
Architecture	`DeepseekV3ForCausalLM` (MoE with MLA)
Params	508B total, ~37B active per token (top-8 of 384 routed + 1 shared)
Base precision	BF16 (source: ~1.0 TB)
Quantization	NVFP4 (4-bit weights + FP8 per-group scales, group=16)
Output size	288 GB (~3.6x compression)

🚀 Deploy on sm120 (RTX PRO 6000 Blackwell)

Uses pre-built voipmonitor/sglang:cu130 Docker image with all sm120 patches applied.

bash
docker run --gpus all --ipc=host --shm-size=8g --network=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v jit-cache:/cache/jit \
  -e SGLANG_ENABLE_SPEC_V2=True \
  -e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
  -e SGLANG_ENABLE_DEEP_GEMM=0 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_P2P_LEVEL=SYS \
  -e NCCL_MIN_NCHANNELS=8 \
  voipmonitor/sglang:cu130 \
  python3 -m sglang.launch_server \
    --model-path 0xSero/DeepSeek-V3.2-508B-NVFP4 \
    --served-model-name deepseek-v32-reap-nvfp4 \
    --tensor-parallel-size 8 \
    --quantization modelopt_fp4 \
    --kv-cache-dtype bf16 \
    --trust-remote-code \
    --attention-backend flashinfer \
    --moe-runner-backend b12x \
    --cuda-graph-max-bs 32 \
    --mem-fraction-static 0.85 \
    --host 0.0.0.0 --port 5000 \
    --disable-custom-all-reduce

Critical flags:

--kv-cache-dtype bf16 — mandatory; fp8_e4m3 produces garbled output on sm120
--attention-backend flashinfer — sm120-compatible
--quantization modelopt_fp4 — sglang's NVFP4 loader for compressed-tensors format
SGLANG_ENABLE_DEEP_GEMM=0 — DeepGEMM needs WGMMA/TCGEN05 absent on sm120

Memory fit: 288 GB weights + KV cache fits on 8x 96GB (≈768 GB total VRAM).

🔬 Quantization Method

Produced via AutoRound 0.12.2 layerwise mode on 8x H100 80GB.

Settings

Table with columns: Setting, Value, Notes
Setting	Value	Notes
`--scheme`	NVFP4	4-bit weights + FP8 per-group scales
`--iters`	200	Full tuning (same hyperparameter as GPTQ variant)
`--nsamples`	512	Calibration samples
`--seqlen`	2048	Default

Calibration Dataset

Table with columns: Source, Samples, Content
Source	Samples	Content
NeelNanda/pile-10k	512	General web text (distribution anchor)

Multi-dataset loading used AutoRound's :concat=true option to pack short samples into full-seqlen sequences.

Wall Time

Quantization tuning: 19h 38m (61 blocks, ~20 min/block)
Packing + save: ~7 min (58 safetensors shards, 288 GB)
Total: ~19.7 hours on 8x H100 80GB

Quality Characteristics

Layer-level loss trajectory (iter 0 → final):

Table with columns: Layer depth, iter 0 loss, final loss, Behavior
Layer depth	iter 0 loss	final loss	Behavior
0-10	1e-6 to 1e-2	50-80% reduction	Early layers, minimal drift
11-30	1e-2 to 1e-1	30-50% reduction	Sign-tuning active
31-50	1e-1 to 5e-1	20-30% reduction	Accumulating
51-60	5e-1 to 1.9	10-20% reduction	Deep-layer drift (layer 60: 1.86 → 1.50)

Weight-validity check (CPU dequant, pre-upload):

Cosine similarity vs BF16 source: 0.995+ across all tested layers
Relative MAE: ~9% uniformly (typical NVFP4 reconstruction error)

📊 Benchmarks

Pending. Run on 8x RTX PRO 6000 sm120 and report:

Table with columns: Task, Score, Notes
Task	Score	Notes
MMLU (5-shot)	—
GSM8K	—
MATH	—
HumanEval	—
IFEval strict	—

Expected ranges (based on GLM-5.1-555B-A14B-REAP-NVFP4 precedent):

MMLU: 73-79% (BF16 base ~75-80%, −1 to −2 pp for NVFP4)
GSM8K: 78-88% (BF16 base ~80-90%)
Decode throughput: 50-70 tok/s @ conc=1, 120-180 tok/s @ conc=4

🧾 Provenance

Table with columns: Step, Details
Step	Details
Source model	`cerebras/DeepSeek-V3.2-REAP-508B-A37B` (BF16, ~1.0 TB, 96 safetensors)
Pruning	REAP (Relative Expert Activation Pruning) — 384 → 384 experts (structure preserved, 508B is a REAP variant)
Quantization compute	Nebius H100x8 via brev
Quant tool	Intel AutoRound 0.12.2
Deploy tool	voipmonitor/sglang:cu130

License & citation

License inherited from the base model.

bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

DeepSeek-V3.2-508B-NVFP4

Get help setting up a custom Dedicated Endpoints.

README

At a glance

Which variant should I pick?

DeepSeek-V3.2-REAP-508B-A37B-NVFP4

📋 Model Specifications

🚀 Deploy on sm120 (RTX PRO 6000 Blackwell)

🔬 Quantization Method

Settings

Calibration Dataset

Wall Time

Quality Characteristics

📊 Benchmarks

🧾 Provenance

License & citation

Sponsors

Explore FriendliAI today

README

At a glance

Which variant should I pick?

DeepSeek-V3.2-REAP-508B-A37B-NVFP4

📋 Model Specifications

🚀 Deploy on sm120 (RTX PRO 6000 Blackwell)

🔬 Quantization Method

Settings

Calibration Dataset

Wall Time

Quality Characteristics

📊 Benchmarks

🧾 Provenance

License & citation

Sponsors