0xSero

GLM-5.1-555B-NVFP4

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

At a glance

Table

Base model	0xSero/GLM-5.1-555B
Format	NVFP4
Total params	555B
Active / token	14B
Experts / layer	192
Layers	78
Hidden size	6144
Context	202,752
On-disk size	320 GB

Which variant should I pick?

Table with columns: Variant, Format, Link
Variant	Format	Link
`GLM-5.1-444B`	BF16	link
`GLM-5.1-444B-GGUF`	GGUF	link
`GLM-5.1-478B-NVFP4`	NVFP4	link

NVFP4 quantization of 0xSero/GLM-5.1-555B — a REAP-pruned variant of GLM-5.1 (192 experts per MoE layer, down from 256).

Target hardware: 8× RTX PRO 6000 Blackwell 96GB (sm120) via sglang. See deploy recipe below.

Model details

Table with columns: Property, Value
Property	Value
Architecture	`GlmMoeDsaForCausalLM` (DeepSeek Sparse Attention + MLA)
Base precision	BF16 (source: 1.1 TB)
Quantization	NVFP4 (4-bit weights + FP8 per-group scales, group=16)
Output size	320 GB (~3.4× compression)
Experts per MoE layer	192 (REAP-pruned from 256)
Layers	78
Format	`nvfp4-pack-quantized` via `compressed-tensors`

Layers kept in BF16 (per AutoRound ignore pattern)

lm_head
model.layers.[0-2].mlp.{gate,up,down}_proj (first 3 layers' experts — most sensitive)
model.layers.[0-77].self_attn.indexer.weights_proj (DSA indexer, quant-sensitive)

Deploy on sm120 (RTX PRO 6000 Blackwell)

Uses pre-built voipmonitor/sglang:cu130 Docker image with all sm120 patches applied.

bash
docker run --gpus all --ipc=host --shm-size=8g --network=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v jit-cache:/cache/jit \
  -e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
  -e SGLANG_ENABLE_DEEP_GEMM=0 \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_P2P_LEVEL=SYS \
  -e NCCL_MIN_NCHANNELS=8 \
  voipmonitor/sglang:cu130 \
  python3 -m sglang.launch_server \
    --model-path 0xSero/GLM-5.1-555B-NVFP4 \
    --served-model-name glm-5.1-reap \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --tensor-parallel-size 8 \
    --quantization compressed-tensors \
    --kv-cache-dtype bf16 \
    --trust-remote-code \
    --mem-fraction-static 0.85 \
    --chunked-prefill-size 16384 \
    --attention-backend flashinfer \
    --fp4-gemm-backend b12x \
    --moe-runner-backend b12x \
    --host 0.0.0.0 --port 5000

Critical flags:

--kv-cache-dtype bf16 — mandatory; fp8_e4m3 produces garbled output on sm120
--attention-backend flashinfer — sm120-compatible (trtllm_mha, flashmla are not)
SGLANG_ENABLE_DEEP_GEMM=0 — DeepGEMM needs WGMMA/TCGEN05 absent on sm120

Memory fit: 320 GB weights + KV cache fits on 8× 96GB (≈768 GB total VRAM). Minimum viable: 6× RTX PRO 6000 with --tp 2 --pp 3.

Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.

Quantization method

Produced via AutoRound 0.12.2 layerwise mode on 8× H100 80GB.

Settings

Table with columns: Setting, Value, Notes
Setting	Value	Notes
`--scheme`	NVFP4	4-bit weights + FP8 per-group scales
`--iters`	50	Halved from default 200 (loss trajectory confirms iters 100+ produce negligible improvement)
`--nsamples`	512	Calibration samples
`--seqlen`	2048	Default (seqlen=4096 tried; most samples too short after tokenization)

Calibration dataset

Custom mix targeting realistic use cases (1,190 samples total → 505 valid after packing):

Table with columns: Source, Samples, Content
Source	Samples	Content
0xSero/structured-outputs-calibration-v1	430	JSON schemas, sharegpt-JSON, Mermaid diagrams
0xSero/reap-calibration-data-v1	560	100 long_context + 120 function_calling + 100 agentic + 60 coding + 40 cuda + 30 reasoning + 30 math + 40 terminal + 40 cybersecurity
NeelNanda/pile-10k	200	General web text (distribution anchor; provides long samples to compensate for short custom samples)

Multi-dataset loading used AutoRound's :concat=true option (patched during build; upstreamable) to pack short instruction samples into full-seqlen sequences.

Wall time

Model load + offload: ~55 min
Calibration + quant: 6h 34m
Save: 7 min
Total: ~7.5 hours on 8× H100 80GB (brev compute)

Quality characteristics

Layer-level loss (iter 0 → iter 49) trajectory:

Table with columns: Layer depth, iter 0 loss, iter 49 loss, Behavior
Layer depth	iter 0 loss	iter 49 loss	Behavior
0-2	0	0	Attention-only; MLP skipped
3-9	1e-6 to 1e-5	1e-6 to 1e-5	Iterative tuning minimal effect
10-30	1e-4 to 1e-2	30-50% reduction	Sign-tuning active
31-55	1e-2 to 1e-1	20-30% reduction	Accumulating

Expected quality impact: benchmarks on sm120 recommended to measure MMLU/GSM8K/IFEval gap vs BF16 source. Loss magnitudes alone suggest non-trivial degradation at deep layers; whether this matters in practice depends on task.

Provenance

Source model: 0xSero/GLM-5.1-555B (BF16, 1.1 TB, 26 safetensors)
Quantization compute: Nebius H100×8 via brev
Quant tool: Intel AutoRound 0.12.2
Deploy tool: voipmonitor/sglang:cu130

License

MIT (inherits from base model).

Acknowledgements

Cerebras REAP team for the pruning recipe
voipmonitor for the sm120 sglang deployment guide
Intel AutoRound team for the quantization toolkit
Nebius for the H100 compute

License & citation

License inherited from the base model.

bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Explore FriendliAI today

Get started Talk to an engineer

At a glance

Table

Base model	0xSero/GLM-5.1-555B
Format	NVFP4
Total params	555B
Active / token	14B
Experts / layer	192
Layers	78
Hidden size	6144
Context	202,752
On-disk size	320 GB

Which variant should I pick?

Table with columns: Variant, Format, Link
Variant	Format	Link
`GLM-5.1-444B`	BF16	link
`GLM-5.1-444B-GGUF`	GGUF	link
`GLM-5.1-478B-NVFP4`	NVFP4	link

NVFP4 quantization of 0xSero/GLM-5.1-555B — a REAP-pruned variant of GLM-5.1 (192 experts per MoE layer, down from 256).

Target hardware: 8× RTX PRO 6000 Blackwell 96GB (sm120) via sglang. See deploy recipe below.

Model details

Table with columns: Property, Value
Property	Value
Architecture	`GlmMoeDsaForCausalLM` (DeepSeek Sparse Attention + MLA)
Base precision	BF16 (source: 1.1 TB)
Quantization	NVFP4 (4-bit weights + FP8 per-group scales, group=16)
Output size	320 GB (~3.4× compression)
Experts per MoE layer	192 (REAP-pruned from 256)
Layers	78
Format	`nvfp4-pack-quantized` via `compressed-tensors`

Layers kept in BF16 (per AutoRound ignore pattern)

lm_head
model.layers.[0-2].mlp.{gate,up,down}_proj (first 3 layers' experts — most sensitive)
model.layers.[0-77].self_attn.indexer.weights_proj (DSA indexer, quant-sensitive)

Deploy on sm120 (RTX PRO 6000 Blackwell)

Uses pre-built voipmonitor/sglang:cu130 Docker image with all sm120 patches applied.

bash
docker run --gpus all --ipc=host --shm-size=8g --network=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v jit-cache:/cache/jit \
  -e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
  -e SGLANG_ENABLE_DEEP_GEMM=0 \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_P2P_LEVEL=SYS \
  -e NCCL_MIN_NCHANNELS=8 \
  voipmonitor/sglang:cu130 \
  python3 -m sglang.launch_server \
    --model-path 0xSero/GLM-5.1-555B-NVFP4 \
    --served-model-name glm-5.1-reap \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --tensor-parallel-size 8 \
    --quantization compressed-tensors \
    --kv-cache-dtype bf16 \
    --trust-remote-code \
    --mem-fraction-static 0.85 \
    --chunked-prefill-size 16384 \
    --attention-backend flashinfer \
    --fp4-gemm-backend b12x \
    --moe-runner-backend b12x \
    --host 0.0.0.0 --port 5000

Critical flags:

--kv-cache-dtype bf16 — mandatory; fp8_e4m3 produces garbled output on sm120
--attention-backend flashinfer — sm120-compatible (trtllm_mha, flashmla are not)
SGLANG_ENABLE_DEEP_GEMM=0 — DeepGEMM needs WGMMA/TCGEN05 absent on sm120

Memory fit: 320 GB weights + KV cache fits on 8× 96GB (≈768 GB total VRAM). Minimum viable: 6× RTX PRO 6000 with --tp 2 --pp 3.

Quantization method

Produced via AutoRound 0.12.2 layerwise mode on 8× H100 80GB.

Settings

Table with columns: Setting, Value, Notes
Setting	Value	Notes
`--scheme`	NVFP4	4-bit weights + FP8 per-group scales
`--iters`	50	Halved from default 200 (loss trajectory confirms iters 100+ produce negligible improvement)
`--nsamples`	512	Calibration samples
`--seqlen`	2048	Default (seqlen=4096 tried; most samples too short after tokenization)

Calibration dataset

Custom mix targeting realistic use cases (1,190 samples total → 505 valid after packing):

Table with columns: Source, Samples, Content
Source	Samples	Content
0xSero/structured-outputs-calibration-v1	430	JSON schemas, sharegpt-JSON, Mermaid diagrams
0xSero/reap-calibration-data-v1	560	100 long_context + 120 function_calling + 100 agentic + 60 coding + 40 cuda + 30 reasoning + 30 math + 40 terminal + 40 cybersecurity
NeelNanda/pile-10k	200	General web text (distribution anchor; provides long samples to compensate for short custom samples)

Multi-dataset loading used AutoRound's :concat=true option (patched during build; upstreamable) to pack short instruction samples into full-seqlen sequences.

Wall time

Model load + offload: ~55 min
Calibration + quant: 6h 34m
Save: 7 min
Total: ~7.5 hours on 8× H100 80GB (brev compute)

Quality characteristics

Layer-level loss (iter 0 → iter 49) trajectory:

Table with columns: Layer depth, iter 0 loss, iter 49 loss, Behavior
Layer depth	iter 0 loss	iter 49 loss	Behavior
0-2	0	0	Attention-only; MLP skipped
3-9	1e-6 to 1e-5	1e-6 to 1e-5	Iterative tuning minimal effect
10-30	1e-4 to 1e-2	30-50% reduction	Sign-tuning active
31-55	1e-2 to 1e-1	20-30% reduction	Accumulating

Provenance

Source model: 0xSero/GLM-5.1-555B (BF16, 1.1 TB, 26 safetensors)
Quantization compute: Nebius H100×8 via brev
Quant tool: Intel AutoRound 0.12.2
Deploy tool: voipmonitor/sglang:cu130

License

MIT (inherits from base model).

Acknowledgements

Cerebras REAP team for the pruning recipe
voipmonitor for the sm120 sglang deployment guide
Intel AutoRound team for the quantization toolkit
Nebius for the H100 compute

License & citation

License inherited from the base model.

bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

GLM-5.1-555B-NVFP4

Get help setting up a custom Dedicated Endpoints.

README

At a glance

Which variant should I pick?

Model details

Layers kept in BF16 (per AutoRound ignore pattern)

Deploy on sm120 (RTX PRO 6000 Blackwell)

Quantization method

Settings

Calibration dataset

Wall time

Quality characteristics

Provenance

License

Acknowledgements

License & citation

Sponsors

Explore FriendliAI today

README

At a glance

Which variant should I pick?

Model details

Layers kept in BF16 (per AutoRound ignore pattern)

Deploy on sm120 (RTX PRO 6000 Blackwell)

Quantization method

Settings

Calibration dataset

Wall time

Quality characteristics

Provenance

License

Acknowledgements

License & citation

Sponsors