0xSero/GLM-5-381B-W3A16 API & Inference Endpoint

At a glance

Base model	zai-org/GLM-5
Format	W3A16
Total params	381B
Active / token	—
Experts / layer	128
Layers	78
Hidden size	6144
Context	202,752
On-disk size	154 GB
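
As a rough sanity check on the figures above, the sketch below compares the listed on-disk size with what a pure 3-bit payload would need. It assumes that "GB" means 10^9 bytes and uses the exact parameter count given under Checkpoint; the helper name `size_gb` is made up for illustration.

```python
# Figures copied from this card.
TOTAL_PARAMS = 381_464_351_232   # "Total parameters" (Checkpoint section)
ON_DISK_BYTES = 154 * 10**9      # "On-disk size"; assumes GB means 10**9 bytes


def size_gb(params: int, bits_per_weight: float) -> float:
    """Storage needed for `params` weights at the given bit width, in GB."""
    return params * bits_per_weight / 8 / 1e9


ideal_3bit_gb = size_gb(TOTAL_PARAMS, 3)    # every weight stored in 3 bits
bf16_gb = size_gb(TOTAL_PARAMS, 16)         # same weights stored in BF16
effective_bits = ON_DISK_BYTES * 8 / TOTAL_PARAMS

print(f"pure 3-bit payload : {ideal_3bit_gb:.2f} GB")
print(f"BF16 equivalent    : {bf16_gb:.2f} GB")
print(f"effective bits/wt  : {effective_bits:.2f}")
```

The pure 3-bit payload comes to roughly 143 GB, so the listed 154 GB corresponds to a little over 3.2 bits per weight on average. That gap would be consistent with per-group quantization metadata plus the tensors left unquantized, though the card does not break the size down.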

Which variant should I pick?

Variant	Format	Link
`GLM-5-381B`	BF16	link
`GLM-5-381B-GGUF-BF16`	GGUF	link
`GLM-5-381B-GGUF-IQ2_M`	GGUF	link

This repository contains the W3A16 AutoRound quantization of the 50% REAP-pruned GLM-5 checkpoint.

Checkpoint

Base family: GLM-5
Architecture: GlmMoeDsaForCausalLM
Total parameters: 381,464,351,232
Source prune: refusal_contrast_reap, compression ratio 0.50, seed 42, router renormalization true
Quantization method: AutoRound
Quantization scheme: W3A16
Group size: 128
Calibration dataset:
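
To make "W3A16" with group size 128 concrete, here is a minimal sketch of group-wise 3-bit weight quantization. It uses plain asymmetric round-to-nearest as a stand-in; AutoRound tunes the rounding instead, so this is not the method used to produce this checkpoint. The random weights and all function names are illustrative assumptions.

```python
import random

BITS = 3                  # "W3": weights stored as 3-bit integer codes
GROUP_SIZE = 128          # group size listed above
MAX_CODE = 2**BITS - 1    # largest code value, 7


def quantize_group(weights):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / MAX_CODE
    if scale == 0.0:      # constant group: any non-zero scale works
        scale = 1.0
    codes = [min(MAX_CODE, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to floating-point weights."""
    return [lo + c * scale for c in codes]


def quantize_row(row):
    """Quantize a weight row in consecutive groups of GROUP_SIZE values."""
    return [quantize_group(row[i:i + GROUP_SIZE])
            for i in range(0, len(row), GROUP_SIZE)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(6144)]  # 6144 = hidden size
groups = quantize_row(row)
restored = [w for g in groups for w in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), "groups, max abs error", max_err)
```

Each group of 128 weights keeps its own scale and offset, which is why a group-quantized checkpoint needs somewhat more than 3 bits per weight. The "A16" half of the name means activations stay in 16-bit, so the codes are expanded back to floating point at compute time.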

Output

Saved model shards: 29
Quantized tensors: 29,571 / 29,659
Quantization config file: quantization_config.json

Intentionally Unquantized

lm_head
model.layers.[0-2].mlp.down_proj
model.layers.[0-2].mlp.gate_proj
model.layers.[0-2].mlp.up_proj
model.layers.[0-77].self_attn.indexer.weights_proj
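
The list above can be cross-checked against the tensor counts under Output. The sketch below reads each `[a-b]` as an inclusive range of layer indices (an assumption; read as a regex character class, `[0-77]` would mean something else) and counts the modules it names. The `expand` helper is hypothetical.

```python
import re

QUANTIZED, TOTAL = 29_571, 29_659   # "Quantized tensors" under Output

PATTERNS = [
    "lm_head",
    "model.layers.[0-2].mlp.down_proj",
    "model.layers.[0-2].mlp.gate_proj",
    "model.layers.[0-2].mlp.up_proj",
    "model.layers.[0-77].self_attn.indexer.weights_proj",
]

LAYER_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand(pattern):
    """Expand one "[a-b]" layer range into concrete module names."""
    match = LAYER_RANGE.search(pattern)
    if match is None:
        return [pattern]
    first, last = int(match.group(1)), int(match.group(2))
    return [pattern[:match.start()] + str(i) + pattern[match.end():]
            for i in range(first, last + 1)]


skipped = [name for p in PATTERNS for name in expand(p)]
print(len(skipped), TOTAL - QUANTIZED)   # 88 88
```

Under that reading the list names 1 + 9 + 78 = 88 modules, which matches the 88 tensors (29,659 minus 29,571) reported as not quantized, so about 99.7% of the counted tensors are quantized.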

Provenance

Quantized artifact path: /data0/external_research/glm5-autoround/full/glm5-reap-50pct-w3a16-pile10k-20260405T182123Z/output/layerwise_refusal_contrast_reap-renorm_true-seed_42-0.50-w3g128
Quantization log: /data0/external_research/glm5-autoround/full/glm5-reap-50pct-w3a16-pile10k-20260405T182123Z/quant.log
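
The artifact path appears to encode the run settings in its directory names. The sketch below splits it apart and compares the pieces with the values listed under Checkpoint. How the tokens are interpreted (prune method, renormalization flag, seed, ratio, bit width and group size, UTC timestamp) is an assumption based only on the naming, not a documented convention.

```python
import re
from datetime import datetime, timezone
from pathlib import PurePosixPath

ARTIFACT = PurePosixPath(
    "/data0/external_research/glm5-autoround/full/"
    "glm5-reap-50pct-w3a16-pile10k-20260405T182123Z/output/"
    "layerwise_refusal_contrast_reap-renorm_true-seed_42-0.50-w3g128"
)

# Leaf directory: <method>-renorm_<bool>-seed_<int>-<ratio>-w<bits>g<group>
method, renorm, seed, ratio, scheme = ARTIFACT.name.split("-")
bits, group = map(int, re.fullmatch(r"w(\d+)g(\d+)", scheme).groups())

# Run directory (two levels up) ends in what looks like a UTC timestamp.
run_dir = ARTIFACT.parents[1].name
stamp_utc = datetime.strptime(
    run_dir.rsplit("-", 1)[1], "%Y%m%dT%H%M%SZ"
).replace(tzinfo=timezone.utc)

settings = {
    "method": method,
    "router_renorm": renorm.split("_")[1] == "true",
    "seed": int(seed.split("_")[1]),
    "compression_ratio": float(ratio),
    "bits": bits,
    "group_size": group,
    "run_stamp": stamp_utc.isoformat(),
}
print(settings)
```

Read this way, the directory name agrees with the Checkpoint section: seed 42, compression ratio 0.50, router renormalization true, 3-bit weights, and group size 128.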

Notes

The source checkpoint for this quantization is the BF16 50% REAP GLM-5 artifact.
AutoRound reported a total tuning time of 4549.26 s (about 76 minutes).

License & citation

License inherited from the base model.

bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
