0xSero

Qwen3-Coder-64B

At a glance

Base model	Qwen/Qwen3-Coder-Next
Format	BF16
Total params	64B
Active / token	~4.2B
Experts / layer	410
Layers	48
Hidden size	2048
Context	262,144
On-disk size	129 GB

Which variant should I pick?

Variant	Format	Link
`Qwen3-Coder-57B`	BF16	link
`Qwen3-Coder-64B` (this)	BF16	link

This model is a 20% expert-pruned version of Qwen/Qwen3-Coder-Next, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).

	Original	This Model
Total params	~80B	64.26B
Experts	512	410
Active params/tok	~4.2B	~4.2B
Experts/tok	10	10
Format	BF16	BF16

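REAP removes 20% of the MoE experts in each layer (102 of 512) while aiming to preserve the model's routing behavior and output quality. The active parameter count per token is unchanged, since the router still selects 10 experts per token from the remaining pool of 410. Going by the totals above (~80B to 64.26B parameters), this is a reduction of roughly 20% in total disk and memory footprint. The overall benchmark average drops by 2.1 points, with the largest loss on math (-6.8 points); see Benchmark Results.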
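
As a quick sanity check, the arithmetic behind these figures can be reproduced in a few lines. This is a sketch that takes the numbers from the comparison table as given (512 experts, a 0.20 ratio, ~80B and 64.26B total parameters) and assumes 2 bytes per parameter for BF16; the variable names are illustrative only.

```python
# Figures taken from the comparison table; ~80B is an approximate value.
ORIGINAL_EXPERTS = 512
COMPRESSION_RATIO = 0.20
ORIGINAL_PARAMS_B = 80.0   # approximate total parameters, in billions
PRUNED_PARAMS_B = 64.26    # total parameters after pruning, in billions
BYTES_PER_PARAM_BF16 = 2   # BF16 stores each weight in 16 bits

# Experts removed and kept per layer.
removed = round(ORIGINAL_EXPERTS * COMPRESSION_RATIO)
remaining = ORIGINAL_EXPERTS - removed

# Relative reduction in total parameters.
param_reduction_pct = 100 * (ORIGINAL_PARAMS_B - PRUNED_PARAMS_B) / ORIGINAL_PARAMS_B

# Billions of parameters times bytes per parameter gives decimal gigabytes.
weights_gb = PRUNED_PARAMS_B * BYTES_PER_PARAM_BF16

print(f"experts removed per layer: {removed}, remaining: {remaining}")
print(f"total parameter reduction: {param_reduction_pct:.1f}%")
print(f"approximate BF16 weight size: {weights_gb:.1f} GB")
```

The weight-size estimate covers parameters only; it ignores KV cache and activation memory, so real serving needs more headroom than this number suggests.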

Method

REAP (ICLR 2026) prunes Mixture-of-Experts models by scoring expert importance using:

Router gate values -- how often and how strongly the router selects each expert
Expert activation norms -- magnitude of each expert's output contribution
Frequency-weighted saliency -- combining routing frequency with activation importance
Router logit renormalization -- maintains output distribution after expert removal
Layerwise application -- independent per-layer pruning decisions for stability
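
The scoring items in this list can be illustrated with a toy example. The sketch below is a simplified, pure-Python reading of the idea, not the Cerebras implementation: each expert's saliency is taken to be the mean of (gate weight × output norm) over the tokens routed to it, and the lowest-scoring experts in a layer are pruned. The function names, the toy inputs, and the choice to score never-selected experts as 0 are assumptions made for illustration.

```python
def reap_saliency(gates, norms):
    """Score each expert in one layer.

    gates[t][e]: router gate weight for token t and expert e (0.0 if not selected).
    norms[t][e]: L2 norm of expert e's output for token t.
    Returns one score per expert: mean of gate * norm over tokens routed to it.
    """
    num_experts = len(gates[0])
    scores = []
    for e in range(num_experts):
        # Only tokens that were actually routed to expert e contribute.
        active = [(g[e], n[e]) for g, n in zip(gates, norms) if g[e] > 0.0]
        if active:
            scores.append(sum(g * n for g, n in active) / len(active))
        else:
            scores.append(0.0)  # assumption: never-selected experts score lowest
    return scores


def experts_to_prune(scores, ratio):
    """Return the indices of the lowest-scoring experts for a given pruning ratio."""
    k = round(len(scores) * ratio)
    order = sorted(range(len(scores)), key=lambda e: scores[e])
    return sorted(order[:k])


# Toy layer: 3 tokens, 4 experts, top-2 routing.
gates = [
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.0, 0.4, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]
norms = [[2.0, 1.0, 3.0, 0.5]] * 3

scores = reap_saliency(gates, norms)
pruned = experts_to_prune(scores, ratio=0.25)
print(scores, pruned)
```

Because the decision is made per layer, the same two functions would be applied independently to each layer's statistics.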

Calibration Dataset

22,000 samples (no-refusal subset: 21,000), packed into 16,384-token sequences:

Category	Samples	Source
Coding (general)	4,096	`theblackcat102/evol-codealpaca-v1`
Reasoning (code)	~2,680	`open-r1/Mixture-of-Thoughts[code]`
Reasoning (math)	~2,778	`open-r1/Mixture-of-Thoughts[math]`
Reasoning (science)	~2,776	`open-r1/Mixture-of-Thoughts[science]`

Total tokens observed: ~90.5M across 6,391 packed sequences.
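
Packing is why the number of sequences is smaller than the number of samples: several tokenized samples are concatenated and cut into fixed-length windows. The card does not describe the exact packing strategy, so the following is only a minimal sketch of one common approach. The function name `pack_sequences`, the use of an end-of-sequence separator, and keeping the trailing partial chunk are all assumptions.

```python
from typing import Iterable, List


def pack_sequences(samples: Iterable[List[int]], seq_len: int, eos_id: int) -> List[List[int]]:
    """Concatenate tokenized samples, each followed by eos_id, and cut into seq_len chunks.

    A trailing partial chunk is kept so that no tokens are dropped.
    """
    stream: List[int] = []
    packed: List[List[int]] = []
    for tokens in samples:
        stream.extend(tokens)
        stream.append(eos_id)  # assumed separator between samples
        # Emit full windows as soon as enough tokens have accumulated.
        while len(stream) >= seq_len:
            packed.append(stream[:seq_len])
            stream = stream[seq_len:]
    if stream:
        packed.append(stream)
    return packed


# Toy example with a window of 4 tokens instead of 16,384.
toy_samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(toy_samples, seq_len=4, eos_id=0)
print(packed)
```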

Pruning Configuration

Parameter	Value
Compression ratio	0.20 (20% expert removal)
Original experts per layer	512
Remaining experts per layer	410
Pruning method	REAP
Distance measure	Angular (cosine)
Router weight renormalization	Yes
Seed	42
Observation batch size	8
Calibration batches

Benchmark Results

10-task lm-eval suite, 200 samples per task, tensor_parallel_size=4, vLLM eager mode:

Task	Metric	Original	REAP 0.20	Delta
ARC-Challenge	acc_norm	58.5%	64.0%	+5.5
BoolQ	acc	93.0%	91.0%	-2.0
CommonsenseQA	acc	89.0%	88.0%	-1.0

Aggregate:

Overall average: 66.7% -> 64.6% (-2.1 pts)
Reasoning average: 71.4% -> 70.5% (-0.9 pts)
Math average: 47.8% -> 41.0% (-6.8 pts)

Architecture

Qwen3-Coder-Next uses a hybrid linear/full attention architecture with 48 layers:

Full attention every 4th layer (12 layers)
Linear attention for remaining layers (36 layers)
MoE FFN with 410 remaining experts per layer, 10 active per token
Shared expert (intermediate size 512) in every layer
Context window: 262,144 tokens
Vocab size: 151,936
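
The attention layout above can be written out explicitly. This sketch assumes that the full-attention layer is the last one in each group of four (0-based indices 3, 7, 11, ...); the card only says "every 4th layer", so the offset and the constant name `FULL_ATTENTION_INTERVAL` are assumptions to check against the model's `config.json`.

```python
NUM_LAYERS = 48
FULL_ATTENTION_INTERVAL = 4  # illustrative name for "every 4th layer"

# Assumption: the 4th layer of each group of four uses full attention.
layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

full_layers = [i for i, kind in enumerate(layer_types) if kind == "full_attention"]
linear_count = layer_types.count("linear_attention")

print(f"full attention layers ({len(full_layers)}): {full_layers}")
print(f"linear attention layers: {linear_count}")
```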

Usage

Transformers

python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/Qwen3-Coder-64B"

# Load the tokenizer and weights. torch_dtype="auto" keeps the checkpoint's BF16 dtype;
# device_map="auto" spreads the model across the available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Render the conversation with the model's chat template, then tokenize it.
messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate, then decode only the new tokens (everything after the prompt).
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

vLLM

bash
vllm serve 0xSero/Qwen3-Coder-64B \
    --tensor-parallel-size 4 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768
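
Once the server is up, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server is reachable at `http://localhost:8000`, which is vLLM's default host and port; adjust `BASE_URL` if the server was started with different settings. The helper names `build_chat_request` and `send` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: vLLM's default host and port
MODEL_ID = "0xSero/Qwen3-Coder-64B"


def build_chat_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Write a quicksort in Python.")
# print(send(request))  # uncomment once the server is running
```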

Reproducing

bash
git clone https://github.com/cerebras/reap
cd reap

python -m reap.layerwise_prune \
    --model-name Qwen/Qwen3-Coder-Next \
    --dataset-name combined \
    --compression-ratio 0.20 \
    --prune-method reap \
    --seed 42 \
    --renormalize_router_weights true \
    --batch_size 8 \
    --batches_per_category 128

License & citation

License inherited from the base model.

bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
