Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: mit

At a glance

Base model0xSero/GLM-5.1-555B
FormatNVFP4
Total params555B
Active / token14B
Experts / layer192
Layers78
Hidden size6144
Context202,752
On-disk size320 GB

Which variant should I pick?

VariantFormatLink
GLM-5.1-444BBF16link
GLM-5.1-444B-GGUFGGUFlink
GLM-5.1-478B-NVFP4NVFP4link
GLM-5.1-555BBF16link
GLM-5.1-555B-GGUFGGUFlink
GLM-5.1-555B-NVFP4 (this)NVFP4link
GLM-5.1-555B-W4A16W4A16link

NVFP4 quantization of 0xSero/GLM-5.1-555B — a REAP-pruned variant of GLM-5.1 (192 experts per MoE layer, down from 256).

Target hardware: 8× RTX PRO 6000 Blackwell 96GB (sm120) via sglang. See deploy recipe below.

Model details

PropertyValue
ArchitectureGlmMoeDsaForCausalLM (DeepSeek Sparse Attention + MLA)
Base precisionBF16 (source: 1.1 TB)
QuantizationNVFP4 (4-bit weights + FP8 per-group scales, group=16)
Output size320 GB (~3.4× compression)
Experts per MoE layer192 (REAP-pruned from 256)
Layers78
Formatnvfp4-pack-quantized via compressed-tensors

Layers kept in BF16 (per AutoRound ignore pattern)

  • lm_head
  • model.layers.[0-2].mlp.{gate,up,down}_proj (first 3 layers' experts — most sensitive)
  • model.layers.[0-77].self_attn.indexer.weights_proj (DSA indexer, quant-sensitive)

Deploy on sm120 (RTX PRO 6000 Blackwell)

Uses pre-built voipmonitor/sglang:cu130 Docker image with all sm120 patches applied.

bash

docker run --gpus all --ipc=host --shm-size=8g --network=host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v jit-cache:/cache/jit \
-e SGLANG_ENABLE_JIT_DEEPGEMM=0 \
-e SGLANG_ENABLE_DEEP_GEMM=0 \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e NCCL_IB_DISABLE=1 \
-e NCCL_P2P_LEVEL=SYS \
-e NCCL_MIN_NCHANNELS=8 \
voipmonitor/sglang:cu130 \
python3 -m sglang.launch_server \
--model-path 0xSero/GLM-5.1-555B-NVFP4 \
--served-model-name glm-5.1-reap \
--reasoning-parser glm45 \
--tool-call-parser glm47 \
--tensor-parallel-size 8 \
--quantization compressed-tensors \
--kv-cache-dtype bf16 \
--trust-remote-code \
--mem-fraction-static 0.85 \
--chunked-prefill-size 16384 \
--attention-backend flashinfer \
--fp4-gemm-backend b12x \
--moe-runner-backend b12x \
--host 0.0.0.0 --port 5000

Critical flags:

  • --kv-cache-dtype bf16 — mandatory; fp8_e4m3 produces garbled output on sm120
  • --attention-backend flashinfer — sm120-compatible (trtllm_mha, flashmla are not)
  • SGLANG_ENABLE_DEEP_GEMM=0 — DeepGEMM needs WGMMA/TCGEN05 absent on sm120

Memory fit: 320 GB weights + KV cache fits on 8× 96GB (≈768 GB total VRAM). Minimum viable: 6× RTX PRO 6000 with --tp 2 --pp 3.

Will not run on sm_90 (H100): NVFP4 is Blackwell-native. Both vLLM (Marlin FP4 PTX mismatch) and sglang (NotImplementedError: Current platform does not support w4a4 nvfp4 quantization) explicitly block sm_90.

Quantization method

Produced via AutoRound 0.12.2 layerwise mode on 8× H100 80GB.

Settings

SettingValueNotes
--schemeNVFP44-bit weights + FP8 per-group scales
--iters50Halved from default 200 (loss trajectory confirms iters 100+ produce negligible improvement)
--nsamples512Calibration samples
--seqlen2048Default (seqlen=4096 tried; most samples too short after tokenization)
--batch_size8Default
--low_gpu_mem_usagetrueRequired for 1.1TB source on 640GB VRAM
--formatauto_round:llm_compressorProduces compressed-tensors (sglang/vLLM compatible)

Calibration dataset

Custom mix targeting realistic use cases (1,190 samples total → 505 valid after packing):

SourceSamplesContent
0xSero/structured-outputs-calibration-v1430JSON schemas, sharegpt-JSON, Mermaid diagrams
0xSero/reap-calibration-data-v1560100 long_context + 120 function_calling + 100 agentic + 60 coding + 40 cuda + 30 reasoning + 30 math + 40 terminal + 40 cybersecurity
NeelNanda/pile-10k200General web text (distribution anchor; provides long samples to compensate for short custom samples)

Multi-dataset loading used AutoRound's :concat=true option (patched during build; upstreamable) to pack short instruction samples into full-seqlen sequences.

Wall time

  • Model load + offload: ~55 min
  • Calibration + quant: 6h 34m
  • Save: 7 min
  • Total: ~7.5 hours on 8× H100 80GB (brev compute)

Quality characteristics

Layer-level loss (iter 0 → iter 49) trajectory:

Layer depthiter 0 lossiter 49 lossBehavior
0-200Attention-only; MLP skipped
3-91e-6 to 1e-51e-6 to 1e-5Iterative tuning minimal effect
10-301e-4 to 1e-230-50% reductionSign-tuning active
31-551e-2 to 1e-120-30% reductionAccumulating
56-771e-1 to 8e-110-20% reductionDeep-layer drift

Expected quality impact: benchmarks on sm120 recommended to measure MMLU/GSM8K/IFEval gap vs BF16 source. Loss magnitudes alone suggest non-trivial degradation at deep layers; whether this matters in practice depends on task.

Provenance

License

MIT (inherits from base model).

Acknowledgements

  • Cerebras REAP team for the pruning recipe
  • voipmonitor for the sm120 sglang deployment guide
  • Intel AutoRound team for the quantization toolkit
  • Nebius for the H100 compute

License & citation

License inherited from the base model.

bibtex

@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Model provider

0xSero

Model tree

Base

0xSero/GLM-5.1-555B

Quantized

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today