0xSero/GLM-4.7-218B-W4A16 API & Inference Endpoint

At a glance

Table

Base model	cerebras/GLM-4.7-REAP-218B-A32B
Format	W4A16
Total params	218B
Active / token	32B
Experts / layer	96
Layers	92
Hidden size	5120
Context	202,752
On-disk size	116 GB

Which variant should I pick?

Table with columns: Variant, Format, Link
Variant	Format	Link
`GLM-4.7-185B`	BF16	link
`GLM-4.7-185B-W4A16`	W4A16	link
`GLM-4.7-202B`	BF16	link

40% Expert-Pruned + INT4 Quantized GLM-4 (218B total / 32B active params, ~116GB)

A highly compressed version of GLM-4.7 combining REAP expert pruning (40% experts removed) with INT4 weight quantization (AutoRound W4A16). This model is ~6.5x smaller than the original 700GB GLM-4.7.

Model Details

Table with columns: Property, Value
Property	Value
Base Model	GLM-4.7-REAP-218B-A32B
Original (GLM-4.7)	358B params, ~717GB
After REAP Pruning	218B params, ~407GB
After W4A16 Quant	218B params, ~108GB
Active Parameters	32B per forward pass
Total Compression	~6.5x from original
Quantization	INT4 weights, FP16 activations

Compression Pipeline

markdown
GLM-4.7 (358B, 700GB)
        |
        v  REAP 40% pruning (96/160 experts)
        |
GLM-4.7-REAP-218B-A32B (218B, 407GB)
        |
        v  AutoRound W4A16 quantization
        |
GLM-4.7-REAP-218B-A32B-W4A16 (218B, 108GB)  <-- This model

Total: 6.5x compression

Usage

📊 Benchmarks

Tested on 8x RTX 3090:

Table with columns: Metric, Value
Metric	Value
Prefill	375 tps
Generation	38.5
Time to First Token	3.82s

Deployment

vLLM

bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve GLM-4.7-REAP-218B-A32B-W4A16 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --max-model-len 165000 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8_e4m3 \
  --tool-call-parser glm47 \
  --served-model-name glm-4.7 \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

AutoRound Quantization Details

AutoRound is Intel's weight quantization method using signed gradient descent.

yaml
bits: 4
group_size: 128
format: auto_round
nsamples: 64
seqlen: 512
dataset: NeelNanda/pile-10k

Reproduce This Model

bash
# 1. Download the BF16 REAP model
huggingface-cli download 0xSero/GLM-4.7-REAP-218B-A32B --local-dir ./GLM-4.7-REAP-218B-A32B

# 2. Run AutoRound quantization
pip install auto-round

python -c "
from auto_round import AutoRound
ar = AutoRound(
    './GLM-4.7-REAP-218B-A32B',
    device='cuda',
    device_map='auto',
    nsamples=64,
    seqlen=512,
    batch_size=1
)
ar.quantize_and_save('./GLM-4.7-REAP-218B-A32B-W4A16', format='auto_round')
"

# Takes ~2 hours on 8x H200

Table with columns: Model, Params, Size, Format, Link
Model	Params	Size	Format	Link
GLM-4.7 (Base)	358B	~700GB	BF16	zai-org/GLM-4.7
GLM-4.7-REAP-218B-A32B	218B	~407GB	BF16	0xSero/GLM-4.7-REAP-218B-A32B
This Model	218B

Benchmarks

Benchmarks in progress

Table with columns: Benchmark, GLM-4.7 Base, REAP BF16, REAP W4A16
Benchmark	GLM-4.7 Base	REAP BF16	REAP W4A16
HumanEval	-	-	-
MBPP	-	-	-
GSM8K	-	-	-

License & citation

License inherited from the base model.

bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Base model

cerebras/GLM-4.7-REAP-218B-A32B

Format

W4A16

Total params

218B

Active / token

32B

Experts / layer

Layers

Hidden size

5120

Context

202,752

On-disk size

116 GB

Variant

Format

Link

GLM-4.7-185B

BF16

link

GLM-4.7-185B-W4A16

W4A16

link

GLM-4.7-202B

BF16

link

Property

Value

Base Model

GLM-4.7-REAP-218B-A32B

Original (GLM-4.7)

358B params, ~717GB

After REAP Pruning

218B params, ~407GB

After W4A16 Quant

218B params, ~108GB

Active Parameters

32B per forward pass

Total Compression

~6.5x from original

Quantization

INT4 weights, FP16 activations

markdown

GLM-4.7 (358B, 700GB)
        |
        v  REAP 40% pruning (96/160 experts)
        |
GLM-4.7-REAP-218B-A32B (218B, 407GB)
        |
        v  AutoRound W4A16 quantization
        |
GLM-4.7-REAP-218B-A32B-W4A16 (218B, 108GB)  <-- This model

Total: 6.5x compression

Metric

Value

Prefill

375 tps

Generation

38.5

Time to First Token

3.82s

bash

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve GLM-4.7-REAP-218B-A32B-W4A16 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --max-model-len 165000 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8_e4m3 \
  --tool-call-parser glm47 \
  --served-model-name glm-4.7 \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

bash

# 1. Download the BF16 REAP model
huggingface-cli download 0xSero/GLM-4.7-REAP-218B-A32B --local-dir ./GLM-4.7-REAP-218B-A32B

# 2. Run AutoRound quantization
pip install auto-round

python -c "
from auto_round import AutoRound
ar = AutoRound(
    './GLM-4.7-REAP-218B-A32B',
    device='cuda',
    device_map='auto',
    nsamples=64,
    seqlen=512,
    batch_size=1
)
ar.quantize_and_save('./GLM-4.7-REAP-218B-A32B-W4A16', format='auto_round')
"

# Takes ~2 hours on 8x H200

Model

Params

Size

Format

Link

GLM-4.7 (Base)

358B

~700GB

BF16

zai-org/GLM-4.7

GLM-4.7-REAP-218B-A32B

218B

~407GB

BF16

0xSero/GLM-4.7-REAP-218B-A32B

This Model

218B

Benchmark

GLM-4.7 Base

REAP BF16

REAP W4A16

HumanEval

MBPP

GSM8K

bibtex

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

GLM-4.7-218B-W4A16

Get help setting up a custom Dedicated Endpoints.

README

At a glance

Which variant should I pick?

Model Details

Compression Pipeline

Usage

📊 Benchmarks

Deployment

vLLM

AutoRound Quantization Details

Reproduce This Model

Benchmarks

License & citation

Sponsors

Explore FriendliAI today

README

At a glance

Which variant should I pick?

Model Details

Compression Pipeline

Usage

📊 Benchmarks

Deployment

vLLM

AutoRound Quantization Details

Reproduce This Model

Benchmarks

License & citation

Sponsors

GLM-4.7-218B-W4A16

Get help setting up a custom Dedicated Endpoints.

At a glance

Which variant should I pick?

Model Details

Compression Pipeline

Usage

📊 Benchmarks

Deployment

vLLM

AutoRound Quantization Details

Reproduce This Model

Related Models

Benchmarks

License & citation

Sponsors

Explore FriendliAI today

At a glance

Which variant should I pick?

Model Details

Compression Pipeline

Usage

📊 Benchmarks

Deployment

vLLM

AutoRound Quantization Details

Reproduce This Model

Related Models

Benchmarks

License & citation

Sponsors