Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: mitAt a glance
| Base model | 0xSero/GLM-5.1-555B |
| Format | NVFP4 |
| Total params | 555B |
| Active / token | 14B |
| Experts / layer | 192 |
| Layers | 78 |
| Hidden size | 6144 |
| Context | 202,752 |
| On-disk size | 320 GB |
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
GLM-5.1-444B | BF16 | link |
GLM-5.1-444B-GGUF | GGUF | link |
GLM-5.1-478B-NVFP4 | NVFP4 | link |
GLM-5.1-555B | BF16 | link |
GLM-5.1-555B-GGUF | GGUF | link |
GLM-5.1-555B-NVFP4 (this) | NVFP4 | link |
GLM-5.1-555B-W4A16 | W4A16 | link |
NVFP4 quantization of 0xSero/GLM-5.1-555B — a REAP-pruned variant of GLM-5.1 (192 experts per MoE layer, down from 256).
Target hardware: 8× RTX PRO 6000 Blackwell 96GB (sm120) via sglang. See deploy recipe below.
Model details
| Property | Value |
|---|---|
| Architecture | GlmMoeDsaForCausalLM (DeepSeek Sparse Attention + MLA) |
| Base precision | BF16 (source: 1.1 TB) |
| Quantization | NVFP4 (4-bit weights + FP8 per-group scales, group=16) |
| Output size | 320 GB (~3.4× compression) |
| Experts per MoE layer | 192 (REAP-pruned from 256) |
| Layers | 78 |
| Format | nvfp4-pack-quantized via compressed-tensors |
Layers kept in BF16 (per AutoRound ignore pattern)
lm_headmodel.layers.[0-2].mlp.{gate,up,down}_proj(first 3 layers' experts — most sensitive)model.layers.[0-77].self_attn.indexer.weights_proj(DSA indexer, quant-sensitive)
Deploy on sm120 (RTX PRO 6000 Blackwell)
Uses pre-built voipmonitor/sglang:cu130 Docker image with all sm120 patches applied.
bash
docker run --gpus all --ipc=host --shm-size=8g --network=host \--ulimit memlock=-1 --ulimit stack=67108864 \-v ~/.cache/huggingface:/root/.cache/huggingface \-v jit-cache:/cache/jit \-e SGLANG_ENABLE_JIT_DEEPGEMM=0 \-e SGLANG_ENABLE_DEEP_GEMM=0 \-e FLASHINFER_DISABLE_VERSION_CHECK=1 \-e NCCL_IB_DISABLE=1 \-e NCCL_P2P_LEVEL=SYS \-e NCCL_MIN_NCHANNELS=8 \voipmonitor/sglang:cu130 \python3 -m sglang.launch_server \--model-path 0xSero/GLM-5.1-555B-NVFP4 \--served-model-name glm-5.1-reap \--reasoning-parser glm45 \--tool-call-parser glm47 \--tensor-parallel-size 8 \--quantization compressed-tensors \--kv-cache-dtype bf16 \--trust-remote-code \--mem-fraction-static 0.85 \--chunked-prefill-size 16384 \--attention-backend flashinfer \--fp4-gemm-backend b12x \--moe-runner-backend b12x \--host 0.0.0.0 --port 5000
Critical flags:
--kv-cache-dtype bf16— mandatory; fp8_e4m3 produces garbled output on sm120--attention-backend flashinfer— sm120-compatible (trtllm_mha, flashmla are not)SGLANG_ENABLE_DEEP_GEMM=0— DeepGEMM needs WGMMA/TCGEN05 absent on sm120
Memory fit: 320 GB weights + KV cache fits on 8× 96GB (≈768 GB total VRAM). Minimum viable: 6× RTX PRO 6000 with --tp 2 --pp 3.
Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.
Quantization method
Produced via AutoRound 0.12.2 layerwise mode on 8× H100 80GB.
Settings
| Setting | Value | Notes |
|---|---|---|
--scheme | NVFP4 | 4-bit weights + FP8 per-group scales |
--iters | 50 | Halved from default 200 (loss trajectory confirms iters 100+ produce negligible improvement) |
--nsamples | 512 | Calibration samples |
--seqlen | 2048 | Default (seqlen=4096 tried; most samples too short after tokenization) |
--batch_size | 8 | Default |
--low_gpu_mem_usage | true | Required for 1.1TB source on 640GB VRAM |
--format | auto_round:llm_compressor | Produces compressed-tensors (sglang/vLLM compatible) |
Calibration dataset
Custom mix targeting realistic use cases (1,190 samples total → 505 valid after packing):
| Source | Samples | Content |
|---|---|---|
| 0xSero/structured-outputs-calibration-v1 | 430 | JSON schemas, sharegpt-JSON, Mermaid diagrams |
| 0xSero/reap-calibration-data-v1 | 560 | 100 long_context + 120 function_calling + 100 agentic + 60 coding + 40 cuda + 30 reasoning + 30 math + 40 terminal + 40 cybersecurity |
| NeelNanda/pile-10k | 200 | General web text (distribution anchor; provides long samples to compensate for short custom samples) |
Multi-dataset loading used AutoRound's :concat=true option (patched during build; upstreamable) to pack short instruction samples into full-seqlen sequences.
Wall time
- Model load + offload: ~55 min
- Calibration + quant: 6h 34m
- Save: 7 min
- Total: ~7.5 hours on 8× H100 80GB (brev compute)
Quality characteristics
Layer-level loss (iter 0 → iter 49) trajectory:
| Layer depth | iter 0 loss | iter 49 loss | Behavior |
|---|---|---|---|
| 0-2 | 0 | 0 | Attention-only; MLP skipped |
| 3-9 | 1e-6 to 1e-5 | 1e-6 to 1e-5 | Iterative tuning minimal effect |
| 10-30 | 1e-4 to 1e-2 | 30-50% reduction | Sign-tuning active |
| 31-55 | 1e-2 to 1e-1 | 20-30% reduction | Accumulating |
| 56-77 | 1e-1 to 8e-1 | 10-20% reduction | Deep-layer drift |
Expected quality impact: benchmarks on sm120 recommended to measure MMLU/GSM8K/IFEval gap vs BF16 source. Loss magnitudes alone suggest non-trivial degradation at deep layers; whether this matters in practice depends on task.
Provenance
- Source model: 0xSero/GLM-5.1-555B (BF16, 1.1 TB, 26 safetensors)
- Quantization compute: Nebius H100×8 via brev
- Quant tool: Intel AutoRound 0.12.2
- Deploy tool: voipmonitor/sglang:cu130
License
MIT (inherits from base model).
Acknowledgements
- Cerebras REAP team for the pruning recipe
- voipmonitor for the sm120 sglang deployment guide
- Intel AutoRound team for the quantization toolkit
- Nebius for the H100 compute
License & citation
License inherited from the base model.
bibtex
@misc{lasby2025reap,title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}}
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
Model provider
0xSero
Model tree
Base
0xSero/GLM-5.1-555B
Quantized
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information