At a glance
| | |
|---|---|
| Base model | zai-org/GLM-5 |
| Format | W3A16 |
| Total params | 381B |
| Active / token | — |
| Experts / layer | 128 |
| Layers | 78 |
| Hidden size | 6144 |
| Context | 202,752 |
| On-disk size | 154 GB |
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
| GLM-5-381B | BF16 | link |
| GLM-5-381B-GGUF-BF16 | GGUF | link |
| GLM-5-381B-GGUF-IQ2_M | GGUF | link |
This repository contains the W3A16 AutoRound quantization of the 50% REAP-pruned GLM-5 checkpoint.
Checkpoint
- Base family: GLM-5
- Architecture: GlmMoeDsaForCausalLM
- Total parameters: 381,464,351,232
- Source prune: refusal_contrast_reap, compression ratio 0.50, seed 42, router renormalization true
- Quantization method: AutoRound
- Quantization scheme: W3A16
- Group size: 128
- Calibration dataset:
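The scheme and group size above allow a rough sanity check of the on-disk size. The sketch below is a back-of-the-envelope estimate, not a description of how AutoRound actually packs tensors: it assumes every parameter is quantized, that each group of 128 weights carries one 16-bit scale and one 3-bit zero point, and that GB means 10^9 bytes. None of those three assumptions is stated in this card.

```python
# Rough size estimate for weight-only quantization with per-group metadata.
# Assumptions (not from the model card): one 16-bit scale and one 3-bit zero
# point per group, all parameters quantized, 1 GB = 1e9 bytes.

TOTAL_PARAMS = 381_464_351_232  # "Total parameters" from the checkpoint summary
WEIGHT_BITS = 3                 # W3A16: 3-bit weights, 16-bit activations
GROUP_SIZE = 128                # "Group size" from the checkpoint summary
SCALE_BITS = 16                 # assumed
ZERO_POINT_BITS = 3             # assumed


def effective_bits_per_weight(weight_bits, group_size, scale_bits, zero_point_bits):
    """Weight bits plus the per-group metadata amortized over the group."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def estimated_size_gb(num_params, bits_per_weight):
    """Convert a parameter count and bits per weight into gigabytes (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


bpw = effective_bits_per_weight(WEIGHT_BITS, GROUP_SIZE, SCALE_BITS, ZERO_POINT_BITS)
size_gb = estimated_size_gb(TOTAL_PARAMS, bpw)
print(f"{bpw:.3f} bits/weight, ~{size_gb:.0f} GB")
```

Under these assumptions the estimate comes out around 150 GB, in the same range as the 154 GB listed in the summary table. The gap would plausibly be explained by the tensors kept in higher precision and by packing overhead, but the card does not confirm that.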
Output
- Saved model shards: 29
- Quantized tensors: 29,571 / 29,659
- Quantization config file: quantization_config.json
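One way to confirm that a downloaded copy matches the settings listed in this card is to compare the config file against them. The sketch below uses only the standard library. The key names `bits` and `group_size` are assumptions based on common weight-only quantization configs and may differ in the actual file, and `check_quantization_config` is a hypothetical helper, not part of AutoRound.

```python
import json
from pathlib import Path

# Values taken from this card; the key names are assumed, not confirmed.
EXPECTED = {"bits": 3, "group_size": 128}


def check_quantization_config(path):
    """Return a list of mismatches between the config file and EXPECTED."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    mismatches = []
    for key, expected in EXPECTED.items():
        actual = config.get(key)  # None if the key is absent
        if actual != expected:
            mismatches.append(f"{key}: expected {expected!r}, found {actual!r}")
    return mismatches
```

Calling `check_quantization_config("quantization_config.json")` from the repository root would return an empty list if the file agrees with the card, and one message per differing or missing key otherwise.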
Intentionally Unquantized
- lm_head
- model.layers.[0-2].mlp.down_proj
- model.layers.[0-2].mlp.gate_proj
- model.layers.[0-2].mlp.up_proj
- model.layers.[0-77].self_attn.indexer.weights_proj
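The list above can be cross-checked against the tensor counts in the Output section. The sketch below expands each pattern into concrete module names and compares the total with the number of tensors left unquantized. It assumes that `[0-2]` and `[0-77]` denote inclusive layer-index ranges and that each listed module contributes exactly one tensor. Both are readings of the notation, not something the card states, and `expand` is a hypothetical helper.

```python
import re

UNQUANTIZED_PATTERNS = [
    "lm_head",
    "model.layers.[0-2].mlp.down_proj",
    "model.layers.[0-2].mlp.gate_proj",
    "model.layers.[0-2].mlp.up_proj",
    "model.layers.[0-77].self_attn.indexer.weights_proj",
]

# Matches a single "[lo-hi]" range, read here as inclusive layer indices.
RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand(pattern):
    """Expand one inclusive [lo-hi] range into concrete module names."""
    match = RANGE.search(pattern)
    if match is None:
        return [pattern]
    lo, hi = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{i}{suffix}" for i in range(lo, hi + 1)]


names = [name for pattern in UNQUANTIZED_PATTERNS for name in expand(pattern)]

total_tensors, quantized_tensors = 29_659, 29_571  # from the Output section
print(len(names), total_tensors - quantized_tensors)
```

With this reading, the patterns expand to 1 + 3 × 3 + 78 = 88 modules, which matches the 29,659 − 29,571 = 88 tensors reported as not quantized.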
Provenance
- Quantized artifact path: /data0/external_research/glm5-autoround/full/glm5-reap-50pct-w3a16-pile10k-20260405T182123Z/output/layerwise_refusal_contrast_reap-renorm_true-seed_42-0.50-w3g128
- Quantization log: /data0/external_research/glm5-autoround/full/glm5-reap-50pct-w3a16-pile10k-20260405T182123Z/quant.log
Notes
- The source checkpoint for this quantization is the BF16 50% REAP GLM-5 artifact.
- AutoRound reported a total tuning time of 4549.26 s (about 76 minutes).
License & citation
License inherited from the base model.
@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.