Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

At a glance

Base modelcerebras/GLM-4.7-REAP-218B-A32B
FormatW4A16
Total params218B
Active / token32B
Experts / layer96
Layers92
Hidden size5120
Context202,752
On-disk size116 GB

Which variant should I pick?

VariantFormatLink
GLM-4.7-185BBF16link
GLM-4.7-185B-W4A16W4A16link
GLM-4.7-202BBF16link
GLM-4.7-218B-W4A16 (this)W4A16link
GLM-4.7-REAP-40-W4A16W4A16link

40% Expert-Pruned + INT4 Quantized GLM-4 (218B total / 32B active params, ~116GB)

A highly compressed version of GLM-4.7 combining REAP expert pruning (40% experts removed) with INT4 weight quantization (AutoRound W4A16). This model is ~6.5x smaller than the original 700GB GLM-4.7.

Model Details

PropertyValue
Base ModelGLM-4.7-REAP-218B-A32B
Original (GLM-4.7)358B params, ~717GB
After REAP Pruning218B params, ~407GB
After W4A16 Quant218B params, ~108GB
Active Parameters32B per forward pass
Total Compression~6.5x from original
QuantizationINT4 weights, FP16 activations
Group Size128
FormatAutoRound
VRAM Required~110GB

Compression Pipeline

markdown

GLM-4.7 (358B, 700GB)
|
v REAP 40% pruning (96/160 experts)
|
GLM-4.7-REAP-218B-A32B (218B, 407GB)
|
v AutoRound W4A16 quantization
|
GLM-4.7-REAP-218B-A32B-W4A16 (218B, 108GB) <-- This model
Total: 6.5x compression

Usage


📊 Benchmarks

Tested on 8x RTX 3090:

MetricValue
Prefill375 tps
Generation38.5
Time to First Token3.82s

Deployment

vLLM

bash

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve GLM-4.7-REAP-218B-A32B-W4A16 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2 \
--max-model-len 165000 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.92 \
--kv-cache-dtype fp8_e4m3 \
--tool-call-parser glm47 \
--served-model-name glm-4.7 \
--enable-auto-tool-choice \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000

AutoRound Quantization Details

AutoRound is Intel's weight quantization method using signed gradient descent.

yaml

bits: 4
group_size: 128
format: auto_round
nsamples: 64
seqlen: 512
dataset: NeelNanda/pile-10k

Reproduce This Model

bash

# 1. Download the BF16 REAP model
huggingface-cli download 0xSero/GLM-4.7-REAP-218B-A32B --local-dir ./GLM-4.7-REAP-218B-A32B
# 2. Run AutoRound quantization
pip install auto-round
python -c "
from auto_round import AutoRound
ar = AutoRound(
'./GLM-4.7-REAP-218B-A32B',
device='cuda',
device_map='auto',
nsamples=64,
seqlen=512,
batch_size=1
)
ar.quantize_and_save('./GLM-4.7-REAP-218B-A32B-W4A16', format='auto_round')
"
# Takes ~2 hours on 8x H200

Related Models

ModelParamsSizeFormatLink
GLM-4.7 (Base)358B~700GBBF16zai-org/GLM-4.7
GLM-4.7-REAP-218B-A32B218B~407GBBF160xSero/GLM-4.7-REAP-218B-A32B
This Model218B~108GBW4A16-

Benchmarks

Benchmarks in progress

BenchmarkGLM-4.7 BaseREAP BF16REAP W4A16
HumanEval---
MBPP---
GSM8K---

License & citation

License inherited from the base model.

bibtex

@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Model provider

0xSero

Model tree

Base

cerebras/GLM-4.7-REAP-218B-A32B

Quantized

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today