Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0At a glance
| Base model | cerebras/GLM-4.7-REAP-218B-A32B |
| Format | W4A16 |
| Total params | 218B |
| Active / token | 32B |
| Experts / layer | 96 |
| Layers | 92 |
| Hidden size | 5120 |
| Context | 202,752 |
| On-disk size | 116 GB |
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
GLM-4.7-185B | BF16 | link |
GLM-4.7-185B-W4A16 | W4A16 | link |
GLM-4.7-202B | BF16 | link |
GLM-4.7-218B-W4A16 (this) | W4A16 | link |
GLM-4.7-REAP-40-W4A16 | W4A16 | link |
40% Expert-Pruned + INT4 Quantized GLM-4 (218B total / 32B active params, ~116GB)
A highly compressed version of GLM-4.7 combining REAP expert pruning (40% experts removed) with INT4 weight quantization (AutoRound W4A16). This model is ~6.5x smaller than the original 700GB GLM-4.7.
Model Details
| Property | Value |
|---|---|
| Base Model | GLM-4.7-REAP-218B-A32B |
| Original (GLM-4.7) | 358B params, ~717GB |
| After REAP Pruning | 218B params, ~407GB |
| After W4A16 Quant | 218B params, ~108GB |
| Active Parameters | 32B per forward pass |
| Total Compression | ~6.5x from original |
| Quantization | INT4 weights, FP16 activations |
| Group Size | 128 |
| Format | AutoRound |
| VRAM Required | ~110GB |
Compression Pipeline
markdown
GLM-4.7 (358B, 700GB)|v REAP 40% pruning (96/160 experts)|GLM-4.7-REAP-218B-A32B (218B, 407GB)|v AutoRound W4A16 quantization|GLM-4.7-REAP-218B-A32B-W4A16 (218B, 108GB) <-- This modelTotal: 6.5x compression
Usage
📊 Benchmarks
Tested on 8x RTX 3090:
| Metric | Value |
|---|---|
| Prefill | 375 tps |
| Generation | 38.5 |
| Time to First Token | 3.82s |
Deployment
vLLM
bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \vllm serve GLM-4.7-REAP-218B-A32B-W4A16 \--tensor-parallel-size 4 \--pipeline-parallel-size 2 \--max-model-len 165000 \--max-num-seqs 4 \--gpu-memory-utilization 0.92 \--kv-cache-dtype fp8_e4m3 \--tool-call-parser glm47 \--served-model-name glm-4.7 \--enable-auto-tool-choice \--trust-remote-code \--host 0.0.0.0 \--port 8000
AutoRound Quantization Details
AutoRound is Intel's weight quantization method using signed gradient descent.
yaml
bits: 4group_size: 128format: auto_roundnsamples: 64seqlen: 512dataset: NeelNanda/pile-10k
Reproduce This Model
bash
# 1. Download the BF16 REAP modelhuggingface-cli download 0xSero/GLM-4.7-REAP-218B-A32B --local-dir ./GLM-4.7-REAP-218B-A32B# 2. Run AutoRound quantizationpip install auto-roundpython -c "from auto_round import AutoRoundar = AutoRound('./GLM-4.7-REAP-218B-A32B',device='cuda',device_map='auto',nsamples=64,seqlen=512,batch_size=1)ar.quantize_and_save('./GLM-4.7-REAP-218B-A32B-W4A16', format='auto_round')"# Takes ~2 hours on 8x H200
Related Models
| Model | Params | Size | Format | Link |
|---|---|---|---|---|
| GLM-4.7 (Base) | 358B | ~700GB | BF16 | zai-org/GLM-4.7 |
| GLM-4.7-REAP-218B-A32B | 218B | ~407GB | BF16 | 0xSero/GLM-4.7-REAP-218B-A32B |
| This Model | 218B | ~108GB | W4A16 | - |
Benchmarks
Benchmarks in progress
| Benchmark | GLM-4.7 Base | REAP BF16 | REAP W4A16 |
|---|---|---|---|
| HumanEval | - | - | - |
| MBPP | - | - | - |
| GSM8K | - | - | - |
License & citation
License inherited from the base model.
bibtex
@misc{lasby2025reap,title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}}
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
Model provider
0xSero
Model tree
Base
cerebras/GLM-4.7-REAP-218B-A32B
Quantized
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information