Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

At a glance

Base model0xSero/GLM-4.7-185B
FormatW4A16
Total params185B
Active / token
Experts / layer80
Layers92
Hidden size5120
Context202,752
On-disk size99 GB

Which variant should I pick?

VariantFormatLink
GLM-4.7-185BBF16link
GLM-4.7-185B-W4A16 (this)W4A16link
GLM-4.7-202BBF16link
GLM-4.7-218B-W4A16W4A16link
GLM-4.7-REAP-40-W4A16W4A16link

GLM-4.7-REAP-50-W4A16

✨ Highlights

50% Expert-Pruned + INT4 Quantized — Double compression for efficient deployment.

  • ~6.5x Total Compression: 700GB → ~92GB
  • REAP + AutoRound: Expert pruning + weight quantization
  • Optimized for Code & Tools: Calibrated on code generation and function calling
  • Lower VRAM: Fits on 2-4x fewer GPUs than BF16

📋 Model Specifications

PropertyValue
Base ModelGLM-4.7-REAP-50
Original (GLM-4.7)358B params, ~700GB
After REAP 50%179B params
After W4A16 Quant~92GB on disk
QuantizationINT4 weights, FP16 activations
Group Size128
FormatGPTQ (AutoRound)
Experts per Layer80 (was 160)
VRAM Required~100GB

Compression Pipeline

markdown

GLM-4.7 (358B, 700GB)
▼ REAP 50% expert pruning
GLM-4.7-REAP-50 (179B)
▼ AutoRound W4A16 quantization
GLM-4.7-REAP-50-W4A16 (~92GB) ◀── This model
Total: ~6.5x compression

🔬 Calibration Dataset: Deep Dive

REAP's effectiveness depends critically on calibration data that represents the target use case. We specifically optimized for code generation, function/tool calling, and agentic workflows.

Why These 3 Datasets?

DatasetSamplesPurposeWhy It Matters
evol-codealpaca-v1700Code generation51% of mix — Code tasks activate specific expert pathways; pruning without code calibration destroys coding ability
xlam-function-calling-60k330Function/tool calling24% of mix — Tool use requires structured JSON output; experts handling schema generation must be preserved
SWE-smith-trajectories330Agentic multi-turn24% of mix — Real SWE-bench trajectories with tool calls, file edits, and multi-step reasoning

The Science Behind Dataset Selection

markdown

REAP Algorithm:
1. Forward pass calibration samples through model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune lowest-saliency experts
Key Insight: Experts are TASK-SPECIFIC
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context
If calibration lacks code → code-specialized experts appear "unused" → get pruned → model loses coding ability

Cerebras' Original Mix (from paper)

Cerebras used the same 3 datasets in their GLM-4.6 REAP experiments:

  • evol-codealpaca-v1 for code generation
  • xlam-function-calling-60k for tool calling
  • SWE-smith-trajectories for agentic tasks

We followed this exact recipe for reproducibility.

Combined Dataset

Our calibration mix: 0xSero/glm47-reap-calibration-v2


🚀 Deployment

vLLM (Recommended)

bash

vllm serve 0xSero/GLM-4.7-185B-W4A16 \
--tensor-parallel-size 4 \
--trust-remote-code \
--quantization gptq

Transformers

python

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"0xSero/GLM-4.7-185B-W4A16",
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-185B-W4A16", trust_remote_code=True)

🧩 Reproduction

Step 1: REAP Pruning

python

#!/usr/bin/env python3
"""
REAP Pruning Script for MoE Models
Adapted from: https://github.com/CerebrasResearch/reap
"""
import subprocess
import sys
def run_reap(
model_path: str,
compression_ratio: float,
dataset: str = "0xSero/glm47-reap-calibration-v2",
samples: int = 1360,
seed: int = 42,
distance: str = "angular",
reuse_observations: str = None,
):
"""
Run REAP expert pruning.
Args:
model_path: Path to base model
compression_ratio: 0.30 = prune 30%, keep 70%
dataset: Calibration dataset (code + tools + agentic)
samples: Number of calibration samples
seed: Random seed for reproducibility
distance: Distance metric for expert clustering
reuse_observations: Path to pre-computed observations for instant pruning
"""
cmd = [
sys.executable, "src/reap/prune.py",
"--model-name", model_path,
"--dataset-name", dataset,
"--compression-ratio", str(compression_ratio),
"--prune-method", "reap",
"--seed", str(seed),
"--samples_per_category", str(samples),
"--model_max_length", "2048",
"--distance_measure", distance,
"--record_pruning_metrics_only", "true",
]
if reuse_observations:
# Instant pruning: skip calibration, reuse precomputed expert scores
cmd.extend(["--load_observations", reuse_observations])
subprocess.run(cmd, check=True)
# Example: Create 40% pruned model
run_reap(
model_path="/path/to/GLM-4.7",
compression_ratio=0.40, # Prune 40% of experts
)

Step 2: AutoRound Quantization

python

#!/usr/bin/env python3
"""
AutoRound W4A16 Quantization
Intel's state-of-the-art weight quantization using signed gradient descent.
"""
from auto_round import AutoRound
def quantize_w4a16(
model_path: str,
output_dir: str,
bits: int = 4,
group_size: int = 128,
format: str = "auto_gptq",
):
"""
Quantize model to INT4 weights with FP16 activations.
Args:
model_path: Path to REAP-pruned model
output_dir: Output directory
bits: Weight bit width (4 for W4A16)
group_size: Quantization group size (128 is optimal)
format: Output format (auto_gptq for vLLM compatibility)
"""
ar = AutoRound(
model_path,
scheme="W4A16",
device="cuda",
device_map="auto",
trust_remote_code=True,
batch_size=1,
seqlen=512,
nsamples=64,
)
ar.quantize_and_save(output_dir, format=format)
# Example: Quantize REAP-40 to W4A16
quantize_w4a16(
model_path="./GLM-4.7-REAP-40",
output_dir="./GLM-4.7-REAP-40-W4A16",
)

⚖️ License

Apache 2.0


License & citation

License inherited from the base model.

bibtex

@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Model provider

0xSero

Model tree

Base

0xSero/GLM-4.7-185B

Quantized

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today