License: apache-2.0

At a glance

Base model: Qwen/Qwen3-Coder-Next
Format: BF16
Total params: 64B
Active / token: ~4.2B
Experts / layer: 410
Layers: 48
Hidden size: 2048
Context: 262,144
On-disk size: 129 GB

Which variant should I pick?

| Variant | Format | Link |
| --- | --- | --- |
| Qwen3-Coder-57B | BF16 | link |
| Qwen3-Coder-64B (this) | BF16 | link |

This model is a 20% expert-pruned version of Qwen/Qwen3-Coder-Next, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).

| | Original | This Model |
| --- | --- | --- |
| Total params | ~80B | 64.26B |
| Experts | 512 | 410 |
| Active params/tok | ~4.2B | ~4.2B |
| Experts/tok | 10 | 10 |
| Format | BF16 | BF16 |
| Disk size | ~149 GB | ~129 GB |

REAP removes 20% of the MoE experts in each layer (102 of 512) while largely preserving the model's routing behavior and output quality. The active parameter count per token is unchanged because the router still selects 10 experts per token from the remaining pool of 410. This yields a ~14% reduction in disk and memory footprint, at a cost of 2.1 points on the overall benchmark average reported under Benchmark Results (the math tasks lose the most, 6.8 points).
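The figures above can be checked with a few lines of arithmetic. This is a sketch that uses only numbers quoted in this card; the variable names are illustrative, and the only outside fact is that BF16 stores each parameter in 2 bytes.

```python
# Figures quoted in this model card.
ORIGINAL_EXPERTS = 512
REMAINING_EXPERTS = 410
ORIGINAL_DISK_GB = 149          # approximate, as stated
PRUNED_DISK_GB = 129            # approximate, as stated
PRUNED_TOTAL_PARAMS = 64.26e9
BF16_BYTES_PER_PARAM = 2

# Experts removed per layer and the fraction that represents.
removed = ORIGINAL_EXPERTS - REMAINING_EXPERTS
expert_fraction_removed = removed / ORIGINAL_EXPERTS

# Relative size reduction computed from the two rounded disk sizes.
disk_reduction = 1 - PRUNED_DISK_GB / ORIGINAL_DISK_GB

# Weight size implied by the parameter count at 2 bytes per parameter.
bf16_size_gb = PRUNED_TOTAL_PARAMS * BF16_BYTES_PER_PARAM / 1e9

print(f"experts removed per layer: {removed} ({expert_fraction_removed:.1%})")
print(f"disk reduction from rounded sizes: {disk_reduction:.1%}")
print(f"BF16 weight size from parameter count: {bf16_size_gb:.1f} GB")
```

With the rounded sizes the reduction works out to roughly 13%, in line with the ~14% quoted. The footprint shrinks by less than the 20% of experts removed because attention layers, embeddings, and shared experts are not pruned.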

Method

REAP (ICLR 2026) prunes Mixture-of-Experts models by scoring each expert's importance and removing the lowest-scoring experts. The procedure combines:

  1. Router gate values -- how often and how strongly the router selects each expert
  2. Expert activation norms -- magnitude of each expert's output contribution
  3. Frequency-weighted saliency -- combining routing frequency with activation importance
  4. Router logit renormalization -- maintains output distribution after expert removal
  5. Layerwise application -- independent per-layer pruning decisions for stability
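Items 1 to 3 can be sketched as a single score per expert. The snippet below is a minimal illustration, assuming the saliency of an expert is the average, over the tokens routed to it, of the gate value multiplied by the L2 norm of that expert's output. The function name, array layout, and random demo data are hypothetical and are not taken from the cerebras/reap code.

```python
import numpy as np

def reap_saliency(gate_weights, expert_out_norms, top_k):
    """Score experts by mean (gate weight x output norm) over routed tokens.

    gate_weights:     (tokens, experts) router probabilities
    expert_out_norms: (tokens, experts) L2 norm of each expert's output
    top_k:            number of experts the router selects per token
    """
    n_tokens, n_experts = gate_weights.shape

    # Boolean mask of the top_k experts chosen for every token.
    chosen = np.argsort(-gate_weights, axis=1)[:, :top_k]
    routed = np.zeros((n_tokens, n_experts), dtype=bool)
    np.put_along_axis(routed, chosen, True, axis=1)

    # Only tokens that were actually routed to an expert contribute.
    contribution = np.where(routed, gate_weights * expert_out_norms, 0.0)
    counts = routed.sum(axis=0)

    # Experts that were never selected get a score of zero.
    return contribution.sum(axis=0) / np.maximum(counts, 1)

# Demo on random data: 1000 tokens, 16 experts, 2 active per token.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 16))
gates = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
norms = rng.uniform(0.5, 2.0, size=(1000, 16))
scores = reap_saliency(gates, norms, top_k=2)
print("lowest-saliency experts:", np.argsort(scores)[:3])
```

Averaging only over routed tokens means an expert that is selected rarely but contributes strongly when selected is not penalized for its low frequency alone.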

Calibration Dataset

22,000 samples (no-refusal subset: 21,000), packed into sequences of 16,384 tokens:

| Category | Samples | Source |
| --- | --- | --- |
| Coding (general) | 4,096 | theblackcat102/evol-codealpaca-v1 |
| Reasoning (code) | ~2,680 | open-r1/Mixture-of-Thoughts[code] |
| Reasoning (math) | ~2,778 | open-r1/Mixture-of-Thoughts[math] |
| Reasoning (science) | ~2,776 | open-r1/Mixture-of-Thoughts[science] |
| Tool calling | 4,096 | Salesforce/xlam-function-calling-60k |
| Agentic coding | 4,096 | SWE-bench/SWE-smith-trajectories |
| + extended domains | ~1,478 | Scientific, CUDA kernels, browser, advanced math, code correctness |

Total tokens observed: ~90.5M across 6,391 packed sequences.
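The card does not describe how samples were packed. One common approach is greedy sequential packing, sketched below; the function name and the truncation of over-long samples are assumptions made for illustration, not a description of the actual calibration pipeline.

```python
def pack_samples(sample_lengths, max_len=16_384):
    """Greedily pack samples (given by token length) into sequences of at most max_len.

    Returns a list of packed sequences, each a list of sample indices.
    Samples longer than max_len are counted as max_len (i.e. truncated).
    """
    sequences, current, used = [], [], 0
    for idx, length in enumerate(sample_lengths):
        length = min(length, max_len)
        # Start a new sequence when the next sample would not fit.
        if current and used + length > max_len:
            sequences.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        sequences.append(current)
    return sequences

# Average fill implied by the totals quoted above.
TOTAL_TOKENS = 90.5e6
PACKED_SEQUENCES = 6_391
avg_tokens = TOTAL_TOKENS / PACKED_SEQUENCES
print(f"average tokens per packed sequence: {avg_tokens:,.0f}")
print(f"average fill of a 16,384-token sequence: {avg_tokens / 16_384:.0%}")
```

The quoted totals imply that packed sequences are on average somewhat shorter than the 16,384-token limit, which is what a greedy packer that never splits a sample would produce.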

Pruning Configuration

| Parameter | Value |
| --- | --- |
| Compression ratio | 0.20 (20% expert removal) |
| Original experts per layer | 512 |
| Remaining experts per layer | 410 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |
| Observation batch size | 8 |
| Calibration batches | 128 per category |
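To make the configuration concrete, the sketch below shows how a compression ratio of 0.20 turns 512 experts into 410 in one layer, and how routing might look afterwards. Two points are assumptions: the removed count is rounded down (which matches the 102 experts reported here), and "router weight renormalization" is read as rescaling the selected top-k weights to sum to 1. The function names are hypothetical.

```python
import numpy as np

def experts_to_keep(saliency, compression_ratio):
    """Return sorted indices of the experts that survive pruning in one layer."""
    saliency = np.asarray(saliency, dtype=float)
    n_experts = saliency.shape[0]
    # Assumption: the number of removed experts is rounded down.
    n_keep = n_experts - int(n_experts * compression_ratio)
    keep = np.argsort(-saliency)[:n_keep]
    return np.sort(keep)

def route(router_logits, keep, top_k):
    """Top-k routing for one token, restricted to the kept experts.

    Returns (expert indices in the original numbering, weights summing to 1).
    """
    logits = np.asarray(router_logits, dtype=float)[keep]
    # Softmax over the surviving experts only.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(-probs)[:top_k]
    # Renormalize the selected weights so they sum to 1.
    weights = probs[top] / probs[top].sum()
    return keep[top], weights

rng = np.random.default_rng(42)
keep = experts_to_keep(rng.random(512), compression_ratio=0.20)
ids, weights = route(rng.normal(size=512), keep, top_k=10)
print(len(keep), "experts kept;", len(ids), "active for this token")
```

Because pruning is applied layer by layer, `experts_to_keep` would be called once per layer with that layer's own saliency scores, so different layers can drop different experts.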

Benchmark Results

10-task lm-eval suite, 200 samples per task, tensor_parallel_size=4, vLLM eager mode:

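| Task | Metric | Original | REAP 0.20 | Delta |
| --- | --- | --- | --- | --- |
| ARC-Challenge | acc_norm | 58.5% | 64.0% | +5.5 |
| BoolQ | acc | 93.0% | 91.0% | -2.0 |
| CommonsenseQA | acc | 89.0% | 88.0% | -1.0 |
| GSM8K | flexible_extract | 35.0% | 28.5% | -6.5 |
| HellaSwag | acc_norm | 72.0% | 66.0% | -6.0 |
| MathQA | acc_norm | 60.5% | 53.5% | -7.0 |
| OpenBookQA | acc_norm | 48.5% | 49.0% | +0.5 |
| PIQA | acc_norm | 80.0% | 80.5% | +0.5 |
| TruthfulQA MC2 | acc | 60.2% | 55.2% | -5.0 |
| WinoGrande | acc | 70.0% | 70.0% | +0.0 |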

Aggregate:

  • Overall average (all 10 tasks): 66.7% -> 64.6% (-2.1 pts)
  • Reasoning average (the 8 non-math tasks): 71.4% -> 70.5% (-0.9 pts)
  • Math average (GSM8K and MathQA): 47.8% -> 41.0% (-6.8 pts)
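The aggregates can be recomputed from the per-task table. In this sketch the scores are copied from the table above; treating the "reasoning" average as the eight non-math tasks is an inference, chosen because it is the grouping that reproduces the reported 71.4% and 70.5%.

```python
# (original %, REAP 0.20 %) per task, copied from the benchmark table.
RESULTS = {
    "ARC-Challenge": (58.5, 64.0),
    "BoolQ": (93.0, 91.0),
    "CommonsenseQA": (89.0, 88.0),
    "GSM8K": (35.0, 28.5),
    "HellaSwag": (72.0, 66.0),
    "MathQA": (60.5, 53.5),
    "OpenBookQA": (48.5, 49.0),
    "PIQA": (80.0, 80.5),
    "TruthfulQA MC2": (60.2, 55.2),
    "WinoGrande": (70.0, 70.0),
}
MATH_TASKS = {"GSM8K", "MathQA"}

def mean_pair(tasks):
    """Return (mean original score, mean pruned score) over the given tasks."""
    tasks = list(tasks)
    original = sum(RESULTS[t][0] for t in tasks) / len(tasks)
    pruned = sum(RESULTS[t][1] for t in tasks) / len(tasks)
    return original, pruned

overall = mean_pair(RESULTS)
non_math = mean_pair(t for t in RESULTS if t not in MATH_TASKS)
math_avg = mean_pair(t for t in RESULTS if t in MATH_TASKS)

for name, (original, pruned) in [
    ("overall", overall),
    ("non-math", non_math),
    ("math", math_avg),
]:
    print(f"{name}: {original:.1f}% -> {pruned:.1f}% ({pruned - original:+.1f} pts)")
```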

Architecture

Qwen3-Coder-Next uses a hybrid linear/full attention architecture with 48 layers:

  • Full attention every 4th layer (12 layers)
  • Linear attention for remaining layers (36 layers)
  • MoE FFN with 410 remaining experts per layer, 10 active per token
  • Shared expert (intermediate size 512) in every layer
  • Context window: 262,144 tokens
  • Vocab size: 151,936
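The layer layout described above can be written out explicitly. This is a sketch under one assumption: the full-attention layers are taken to be the 4th, 8th, 12th, and so on (1-based), since the card only says "every 4th layer". The constant and label names are illustrative.

```python
NUM_LAYERS = 48
FULL_ATTENTION_INTERVAL = 4     # "full attention every 4th layer"
EXPERTS_PER_LAYER = 410         # routed experts remaining after pruning

# Assumption: layers 4, 8, 12, ... (1-based) use full attention.
layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

n_full = layer_types.count("full_attention")
n_linear = layer_types.count("linear_attention")
total_routed_experts = NUM_LAYERS * EXPERTS_PER_LAYER

print(layer_types[:8])
print(f"{n_full} full-attention layers, {n_linear} linear-attention layers")
print(f"{total_routed_experts} routed experts across all layers")
```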

Usage

Transformers

python

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/Qwen3-Coder-64B"

# Load the tokenizer and the weights in the checkpoint's dtype,
# sharding the model across the available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

vLLM

bash

vllm serve 0xSero/Qwen3-Coder-64B \
    --tensor-parallel-size 4 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768
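Once the server is running it can be queried over vLLM's OpenAI-compatible HTTP API. The client below is a minimal sketch using only the standard library; it assumes the server listens on localhost at vLLM's default port 8000, and the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "0xSero/Qwen3-Coder-64B"
BASE_URL = "http://localhost:8000/v1"   # assumption: vLLM default host and port

def build_chat_request(prompt, max_tokens=512):
    """Build an OpenAI-style chat completion payload for the served model."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url=BASE_URL):
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example (requires the server started by the command above):
# print(chat("Write a quicksort in Python."))
```

Keep prompt length plus `max_tokens` within the `--max-model-len` passed to the server (32,768 in the command above); longer requests are rejected.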

Reproducing

bash

git clone https://github.com/cerebras/reap
cd reap
python -m reap.layerwise_prune \
    --model-name Qwen/Qwen3-Coder-Next \
    --dataset-name combined \
    --compression-ratio 0.20 \
    --prune-method reap \
    --seed 42 \
    --renormalize_router_weights true \
    --batch_size 8 \
    --batches_per_category 128

License & citation

License inherited from the base model.

bibtex

@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Model provider: 0xSero

Model tree: Qwen/Qwen3-Coder-Next (base) -> this model (fine-tuned)

Modalities: text input, text output
