License: apache-2.0
At a glance
| Property | Value |
|---|---|
| Base model | — |
| Format | BF16 |
| Total params | 185B |
| Active / token | — |
| Experts / layer | 80 |
| Layers | 92 |
| Hidden size | 5120 |
| Context | 202,752 |
| On-disk size | 370 GB |
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
| GLM-4.7-185B (this) | BF16 | link |
| GLM-4.7-185B-W4A16 | W4A16 | link |
| GLM-4.7-202B | BF16 | link |
| GLM-4.7-218B-W4A16 | W4A16 | link |
| GLM-4.7-REAP-40-W4A16 | W4A16 | link |
GLM-4.7-REAP-50
✨ Highlights
50% Expert-Pruned GLM-4.7 optimized for code generation, function calling, and agentic workflows.
Created using REAP (Router-weighted Expert Activation Pruning) by Cerebras:
- 358B → 179B: 50% of MoE experts pruned (80/160 remaining)
- Calibrated for Code & Tools: Preserves coding and function-calling capabilities
- One-Shot Compression: No fine-tuning required
- Drop-in Compatible: Works with vLLM, Transformers, SGLang
📋 Model Specifications
| Property | Value |
|---|---|
| Base Model | zai/glm-4.7 |
| Architecture | Sparse Mixture-of-Experts (SMoE) |
| Original Parameters | 358B |
| Pruned Parameters | 179B |
| Compression | 50% experts removed |
| Experts per Layer | 80 (was 160) |
| MoE Layers | 92 |
| Activated Experts | 8 per token |
| Precision | BF16 |
| Disk Size | ~345GB |
| VRAM Required | ~345GB |
🔬 Calibration Dataset: Deep Dive
REAP's effectiveness depends critically on calibration data that represents the target use case. We specifically optimized for code generation, function/tool calling, and agentic workflows.
Why These 3 Datasets?
| Dataset | Samples | Purpose | Why It Matters |
|---|---|---|---|
| evol-codealpaca-v1 | 700 | Code generation | 51% of mix — Code tasks activate specific expert pathways; pruning without code calibration destroys coding ability |
| xlam-function-calling-60k | 330 | Function/tool calling | 24% of mix — Tool use requires structured JSON output; experts handling schema generation must be preserved |
| SWE-smith-trajectories | 330 | Agentic multi-turn | 24% of mix — Real SWE-bench trajectories with tool calls, file edits, and multi-step reasoning |
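The percentages in the table can be checked from the sample counts. The short sketch below recomputes each dataset's share; the dictionary and function names are illustrative, and only the counts come from the table.

```python
# Sample counts taken from the table above.
CALIBRATION_MIX = {
    "evol-codealpaca-v1": 700,
    "xlam-function-calling-60k": 330,
    "SWE-smith-trajectories": 330,
}


def mix_shares(counts):
    """Return each dataset's share of the calibration mix, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = mix_shares(CALIBRATION_MIX)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

The three counts add up to 1360 samples, and the rounded shares (51%, 24%, 24%) sum to 99% only because of rounding.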
The Science Behind Dataset Selection
```markdown
REAP Algorithm:
1. Forward pass calibration samples through model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune lowest-saliency experts

Key Insight: Experts are TASK-SPECIFIC
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context

If calibration lacks code → code-specialized experts appear "unused" → get pruned → model loses coding ability
```
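The four steps above can be sketched on toy data. This is a simplified illustration, not the Cerebras implementation: it assumes the saliency of an expert is the average of `router_weight * activation_norm` over the tokens routed to it, and that an expert never activated during calibration scores zero. The function names and the record format are hypothetical.

```python
from collections import defaultdict


def expert_saliency(records):
    """Average router_weight * activation_norm per expert.

    `records` holds (expert_id, router_weight, activation_norm) tuples,
    one per (token, activated expert) pair seen during calibration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, router_weight, activation_norm in records:
        totals[expert_id] += router_weight * activation_norm
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}


def experts_to_prune(saliency, all_experts, compression_ratio):
    """Pick the lowest-saliency experts; never-activated experts score 0."""
    n_prune = round(len(all_experts) * compression_ratio)
    ranked = sorted(all_experts, key=lambda e: (saliency.get(e, 0.0), e))
    return sorted(ranked[:n_prune])


# Toy layer with 4 experts; expert 3 is never activated by the calibration data.
records = [
    (0, 0.6, 2.0), (1, 0.4, 1.0),
    (0, 0.5, 3.0), (2, 0.5, 0.2),
    (1, 0.7, 1.5), (2, 0.3, 0.4),
]
scores = expert_saliency(records)
pruned = experts_to_prune(scores, all_experts=[0, 1, 2, 3], compression_ratio=0.5)
print(scores)
print(pruned)
```

Expert 3 is pruned first because no calibration sample activates it. This is the failure mode described above: an expert that only fires on code looks unused if the calibration set contains no code.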
Cerebras' Original Mix (from paper)
Cerebras used the same 3 datasets in their GLM-4.6 REAP experiments:
- evol-codealpaca-v1 for code generation
- xlam-function-calling-60k for tool calling
- SWE-smith-trajectories for agentic tasks
We followed this exact recipe for reproducibility.
Combined Dataset
Our calibration mix: 0xSero/glm47-reap-calibration-v2
📦 Related Models
| Model | Params | Experts | Size | Format |
|---|---|---|---|---|
| GLM-4.7-REAP-30 | 251B | 112 | ~470GB | BF16 |
| GLM-4.7-REAP-35 | 233B | 104 | ~439GB | BF16 |
| GLM-4.7-REAP-40 | 218B | 96 | ~407GB | BF16 |
| GLM-4.7-REAP-45 | 197B | 88 | ~370GB | BF16 |
| GLM-4.7-REAP-50 | 179B | 80 | ~345GB | BF16 |
| GLM-4.7-REAP-40-W4A16 | 218B | 96 | ~108GB | GPTQ |
| GLM-4.7-REAP-50-W4A16 | 179B | 80 | ~92GB | GPTQ |
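The expert counts in this table are consistent with reading the number in each model name as the percentage of experts pruned from the original 160 per layer. The sketch below checks that reading; the helper name is illustrative, and the naming interpretation is an assumption based on the REAP-50 description above.

```python
ORIGINAL_EXPERTS = 160  # experts per MoE layer before pruning, per the spec table


def experts_remaining(prune_percent, original=ORIGINAL_EXPERTS):
    """Experts kept per layer after pruning `prune_percent` percent of them."""
    return original * (100 - prune_percent) // 100


for pct in (30, 35, 40, 45, 50):
    print(f"REAP-{pct}: {experts_remaining(pct)} experts per layer")
```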
🚀 Deployment
vLLM (Recommended)
```bash
vllm serve 0xSero/GLM-4.7-185B \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --dtype bfloat16
```
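Once the server is up, it can be queried through vLLM's OpenAI-compatible chat completions route. The following is a minimal standard-library sketch. It assumes the server listens on vLLM's default port 8000 on localhost and that the served model name equals the repository id passed to `vllm serve`; the helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

# Assumptions: vLLM's default port (8000) on localhost, and a served model
# name equal to the repository id passed to `vllm serve`.
BASE_URL = "http://localhost:8000/v1"
MODEL = "0xSero/GLM-4.7-185B"


def build_chat_request(prompt, max_tokens=512, temperature=0.7):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Write a Python function to merge two sorted lists.")` returns the reply text. Adjust `BASE_URL` if the server was started with a different host or port.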
Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/GLM-4.7-185B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-185B", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)

# temperature only takes effect when sampling is enabled
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
🧩 Reproduction
REAP Pruning Script
```python
#!/usr/bin/env python3
"""
REAP Pruning Script for MoE Models
Adapted from: https://github.com/CerebrasResearch/reap
"""
import subprocess
import sys
from typing import Optional


def run_reap(
    model_path: str,
    compression_ratio: float,
    dataset: str = "0xSero/glm47-reap-calibration-v2",
    samples: int = 1360,
    seed: int = 42,
    distance: str = "angular",
    reuse_observations: Optional[str] = None,
):
    """
    Run REAP expert pruning.

    Args:
        model_path: Path to base model
        compression_ratio: 0.30 = prune 30%, keep 70%
        dataset: Calibration dataset (code + tools + agentic)
        samples: Number of calibration samples
        seed: Random seed for reproducibility
        distance: Distance metric for expert clustering
        reuse_observations: Path to pre-computed observations for instant pruning
    """
    cmd = [
        sys.executable, "src/reap/prune.py",
        "--model-name", model_path,
        "--dataset-name", dataset,
        "--compression-ratio", str(compression_ratio),
        "--prune-method", "reap",
        "--seed", str(seed),
        "--samples_per_category", str(samples),
        "--model_max_length", "2048",
        "--distance_measure", distance,
        "--record_pruning_metrics_only", "true",
    ]

    if reuse_observations:
        # Instant pruning: skip calibration, reuse precomputed expert scores
        cmd.extend(["--load_observations", reuse_observations])

    subprocess.run(cmd, check=True)


# Example: Create 40% pruned model
run_reap(
    model_path="/path/to/GLM-4.7",
    compression_ratio=0.40,  # Prune 40% of experts
)
```
Observation Reuse (Instant Multi-Ratio Pruning)
REAP computes expert saliency scores during calibration. These scores are compression-ratio independent, enabling instant pruning at any ratio:
```bash
# First run: compute observations (~5 hours)
python prune.py --compression-ratio 0.40 --output_file_name observations.pt

# Subsequent runs: instant pruning (<5 minutes)
python prune.py --compression-ratio 0.30 --load_observations observations.pt
python prune.py --compression-ratio 0.50 --load_observations observations.pt
```
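Reuse works because the compression ratio only decides where to cut an expert ranking that has already been computed. The toy sketch below illustrates that idea under simplifying assumptions: a single layer, made-up saliency values, and a hypothetical `prune_sets` helper. It does not read the real `observations.pt` format.

```python
def prune_sets(saliency, ratios):
    """Rank experts once, then slice the same ranking at every ratio."""
    ranked = sorted(saliency, key=lambda e: (saliency[e], e))  # lowest first
    return {r: set(ranked[:round(len(ranked) * r)]) for r in ratios}


# Made-up saliency scores for a 10-expert layer (expert id -> score).
saliency = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.3, 4: 0.7,
            5: 0.2, 6: 0.8, 7: 0.4, 8: 0.6, 9: 1.0}

pruned = prune_sets(saliency, ratios=[0.3, 0.4, 0.5])
for ratio, experts in pruned.items():
    print(ratio, sorted(experts))
```

Each larger ratio prunes a superset of the experts removed at the smaller one, so no new forward passes are needed once the scores are stored.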
⚖️ License
Apache 2.0 (inherited from GLM-4)
License & citation
License inherited from the base model.
```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}
```
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
Model provider
0xSero
Modalities
Input: Text
Output: Text