License: apache-2.0
At a glance
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-Coder-Next |
| Format | BF16 |
| Total params | 64B |
| Active / token | ~4.2B |
| Experts / layer | 410 |
| Layers | 48 |
| Hidden size | 2048 |
| Context | 262,144 |
| On-disk size | 129 GB |
20% expert-pruned version of Qwen/Qwen3-Coder-Next using Cerebras REAP (Router-weighted Expert Activation Pruning).
| | Original | This Model |
|---|---|---|
| Total params | ~80B | 64.26B |
| Experts | 512 | 410 |
| Active params/tok | ~4.2B | ~4.2B |
| Experts/tok | 10 | 10 |
| Format | BF16 | BF16 |
| Disk size | ~149 GB | ~129 GB |
REAP removes 20% of MoE experts (102 of 512) while preserving the model's routing behavior and output quality. The active parameter count per token is unchanged since the router still selects 10 experts per token from the remaining pool. This yields a ~14% reduction in total disk/memory footprint with minimal quality loss.
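The expert counts above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the REAP tooling: it assumes the pruned count is the compression ratio times the expert count, rounded down, and it uses the rounded sizes from the table, so the disk figure it prints is only approximate and may differ slightly from the percentage quoted above.

```python
# Values copied from the comparison table.
ORIGINAL_EXPERTS = 512
COMPRESSION_RATIO = 0.20
EXPERTS_PER_TOKEN = 10

# Assumption: the number of pruned experts is rounded down to a whole expert.
removed = int(ORIGINAL_EXPERTS * COMPRESSION_RATIO)
remaining = ORIGINAL_EXPERTS - removed

# Share of the per-layer expert pool that a single token activates.
active_fraction_before = EXPERTS_PER_TOKEN / ORIGINAL_EXPERTS
active_fraction_after = EXPERTS_PER_TOKEN / remaining

# Rounded on-disk sizes in GB, taken from the table.
disk_before_gb, disk_after_gb = 149, 129
disk_reduction = (disk_before_gb - disk_after_gb) / disk_before_gb

print(f"removed={removed}, remaining={remaining}")
print(f"active fraction: {active_fraction_before:.2%} -> {active_fraction_after:.2%}")
print(f"disk reduction from rounded sizes: {disk_reduction:.1%}")
```

The per-token compute is unchanged because `EXPERTS_PER_TOKEN` stays at 10; only the pool the router chooses from shrinks.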
Method
REAP (ICLR 2026) prunes Mixture-of-Experts models by scoring each expert's importance and removing the lowest-scoring experts in every layer. Its main ingredients:
- Router gate values -- how often and how strongly the router selects each expert
- Expert activation norms -- magnitude of each expert's output contribution
- Frequency-weighted saliency -- combining routing frequency with activation importance
- Router logit renormalization -- maintains output distribution after expert removal
- Layerwise application -- independent per-layer pruning decisions for stability
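The scoring idea in the list above can be illustrated for a single layer. This is a simplified sketch and not the code from the REAP repository: the function names (`reap_style_saliency`, `experts_to_prune`, `route_after_pruning`) and the toy data are hypothetical. It assumes saliency is the router gate value times the expert output norm, averaged over the tokens that actually route to that expert, and that gate values of the surviving top-k experts are renormalized to sum to 1.

```python
def reap_style_saliency(gates, out_norms, top_k):
    """Per-expert saliency for one layer.

    gates[t][j]     -- router gate value of expert j for token t
    out_norms[t][j] -- L2 norm of expert j's output for token t
    Only the top_k experts selected for a token contribute to its score.
    """
    n_experts = len(gates[0])
    sums = [0.0] * n_experts
    counts = [0] * n_experts
    for g_tok, n_tok in zip(gates, out_norms):
        active = sorted(range(n_experts), key=lambda j: g_tok[j], reverse=True)[:top_k]
        for j in active:
            sums[j] += g_tok[j] * n_tok[j]
            counts[j] += 1
    # Experts that were never selected get a saliency of 0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def experts_to_prune(saliency, compression_ratio):
    """Indices of the lowest-saliency experts to remove from this layer."""
    n_prune = int(len(saliency) * compression_ratio)
    order = sorted(range(len(saliency)), key=lambda j: saliency[j])
    return sorted(order[:n_prune])


def route_after_pruning(gate_values, kept, top_k):
    """Pick the top_k surviving experts for one token and renormalize
    their gate values so the routing weights sum to 1."""
    chosen = sorted(kept, key=lambda j: gate_values[j], reverse=True)[:top_k]
    total = sum(gate_values[j] for j in chosen)
    return {j: gate_values[j] / total for j in chosen}


# Toy layer: 3 tokens, 4 experts, 2 active experts per token (hypothetical data).
gates = [
    [0.60, 0.30, 0.05, 0.05],
    [0.50, 0.10, 0.35, 0.05],
    [0.10, 0.20, 0.60, 0.10],
]
out_norms = [[2.0, 1.0, 1.0, 1.0]] * 3

saliency = reap_style_saliency(gates, out_norms, top_k=2)
pruned = experts_to_prune(saliency, compression_ratio=0.5)
kept = [j for j in range(4) if j not in pruned]
weights = route_after_pruning(gates[0], kept, top_k=2)
print(pruned)  # [1, 3]
```

Each layer would be scored and pruned independently in this way, which matches the layerwise bullet above.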
Calibration Dataset
22,000 samples (no-refusal subset: 21,000), packed into 16,384-token sequences:
| Category | Samples | Source |
|---|---|---|
| Coding (general) | 4,096 | theblackcat102/evol-codealpaca-v1 |
| Reasoning (code) | ~2,680 | open-r1/Mixture-of-Thoughts[code] |
| Reasoning (math) | ~2,778 | open-r1/Mixture-of-Thoughts[math] |
| Reasoning (science) | ~2,776 | open-r1/Mixture-of-Thoughts[science] |
| Tool calling | 4,096 | Salesforce/xlam-function-calling-60k |
| Agentic coding | 4,096 | SWE-bench/SWE-smith-trajectories |
| + extended domains | ~1,478 | Scientific, CUDA kernels, browser, advanced math, code correctness |
Total tokens observed: ~90.5M across 6,391 packed sequences.
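As a quick consistency check, the per-category counts can be summed and the average fill of a packed sequence estimated. This is only a sketch over the numbers in the table: several counts are marked approximate in the source, and the token total is rounded, so the averages are rough.

```python
# Sample counts per category, copied from the table (some are approximate).
samples_per_category = {
    "Coding (general)": 4096,
    "Reasoning (code)": 2680,
    "Reasoning (math)": 2778,
    "Reasoning (science)": 2776,
    "Tool calling": 4096,
    "Agentic coding": 4096,
    "Extended domains": 1478,
}
total_samples = sum(samples_per_category.values())

TOTAL_TOKENS = 90_500_000   # "~90.5M" in the text, so approximate
PACKED_SEQUENCES = 6391
MAX_SEQ_LEN = 16384

# Average number of real tokens per packed sequence and how full it is.
avg_tokens_per_sequence = TOTAL_TOKENS / PACKED_SEQUENCES
fill_ratio = avg_tokens_per_sequence / MAX_SEQ_LEN

print(f"total samples: {total_samples}")
print(f"avg tokens per packed sequence: {avg_tokens_per_sequence:.0f}")
print(f"avg fill of a {MAX_SEQ_LEN}-token sequence: {fill_ratio:.1%}")
```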
Pruning Configuration
| Parameter | Value |
|---|---|
| Compression ratio | 0.20 (20% expert removal) |
| Original experts per layer | 512 |
| Remaining experts per layer | 410 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |
| Observation batch size | 8 |
| Calibration batches | 128 per category |
Benchmark Results
10-task lm-eval suite, 200 samples per task, tensor_parallel_size=4, vLLM eager mode:
| Task | Metric | Original | REAP 0.20 | Delta |
|---|---|---|---|---|
| ARC-Challenge | acc_norm | 58.5% | 64.0% | +5.5 |
| BoolQ | acc | 93.0% | 91.0% | -2.0 |
| CommonsenseQA | acc | 89.0% | 88.0% | -1.0 |
| GSM8K | flexible_extract | 35.0% | 28.5% | -6.5 |
| HellaSwag | acc_norm | 72.0% | 66.0% | -6.0 |
| MathQA | acc_norm | 60.5% | 53.5% | -7.0 |
| OpenBookQA | acc_norm | 48.5% | 49.0% | +0.5 |
| PIQA | acc_norm | 80.0% | 80.5% | +0.5 |
| TruthfulQA MC2 | acc | 60.2% | 55.2% | -5.0 |
| WinoGrande | acc | 70.0% | 70.0% | +0.0 |
Aggregate:
- Overall average: 66.7% -> 64.6% (-2.1 pts)
- Reasoning average (the eight non-math tasks): 71.4% -> 70.5% (-0.9 pts)
- Math average (GSM8K, MathQA): 47.8% -> 41.0% (-6.8 pts)
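The aggregates can be recomputed from the per-task table. The sketch below assumes the math average covers GSM8K and MathQA and the reasoning average covers the remaining eight tasks; that grouping is inferred because it reproduces the reported figures, not because the source states it.

```python
# (original %, REAP 0.20 %) per task, copied from the results table.
scores = {
    "ARC-Challenge": (58.5, 64.0),
    "BoolQ": (93.0, 91.0),
    "CommonsenseQA": (89.0, 88.0),
    "GSM8K": (35.0, 28.5),
    "HellaSwag": (72.0, 66.0),
    "MathQA": (60.5, 53.5),
    "OpenBookQA": (48.5, 49.0),
    "PIQA": (80.0, 80.5),
    "TruthfulQA MC2": (60.2, 55.2),
    "WinoGrande": (70.0, 70.0),
}
MATH_TASKS = {"GSM8K", "MathQA"}  # assumed grouping, see the note above


def average(tasks, column):
    """Mean score over the given tasks; column 0 = original, 1 = pruned."""
    return sum(scores[t][column] for t in tasks) / len(tasks)


all_tasks = list(scores)
math_tasks = [t for t in all_tasks if t in MATH_TASKS]
reasoning_tasks = [t for t in all_tasks if t not in MATH_TASKS]

summary = {}
for name, tasks in [("overall", all_tasks),
                    ("reasoning", reasoning_tasks),
                    ("math", math_tasks)]:
    before, after = average(tasks, 0), average(tasks, 1)
    summary[name] = (before, after, after - before)
    print(f"{name:>9}: {before:.1f}% -> {after:.1f}% ({after - before:+.1f} pts)")
```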
Architecture
Qwen3-Coder-Next uses a hybrid linear/full attention architecture with 48 layers:
- Full attention every 4th layer (12 layers)
- Linear attention for remaining layers (36 layers)
- MoE FFN with 410 remaining experts per layer, 10 active per token
- Shared expert (intermediate size 512) in every layer
- Context window: 262,144 tokens
- Vocab size: 151,936
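The attention layout described above can be written out as a per-layer list. This is an illustrative sketch only: it assumes layers are counted from 1 and that layers 4, 8, ..., 48 are the full-attention ones. The source gives the counts, not the exact positions.

```python
NUM_LAYERS = 48
FULL_ATTENTION_INTERVAL = 4

# Assumption: every 4th layer (1-indexed) uses full attention, the rest linear.
layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

n_full = layer_types.count("full_attention")
n_linear = layer_types.count("linear_attention")
print(f"full attention layers: {n_full}, linear attention layers: {n_linear}")
```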
Usage
Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/Qwen3-Coder-64B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Build a chat-formatted prompt and generate a completion.
messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
vLLM
```bash
vllm serve 0xSero/Qwen3-Coder-64B \
  --tensor-parallel-size 4 \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768
```
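Once the server is up, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library and assumes the server is reachable at the default `http://localhost:8000`; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the URL should be adjusted if you pass a different host or port to `vllm serve`.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       base_url="http://localhost:8000/v1",  # assumed default address
                       model="0xSero/Qwen3-Coder-64B",
                       max_tokens=512):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request does not touch the network.
request = build_chat_request("Write a quicksort in Python.")
```

With the server running, `print(send_chat_request(request))` prints the model's reply.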
Reproducing
```bash
git clone https://github.com/cerebras/reap
cd reap
python -m reap.layerwise_prune \
  --model-name Qwen/Qwen3-Coder-Next \
  --dataset-name combined \
  --compression-ratio 0.20 \
  --prune-method reap \
  --seed 42 \
  --renormalize_router_weights true \
  --batch_size 8 \
  --batches_per_category 128
```
Links
- REAP paper: arxiv.org/abs/2510.13999
- REAP code: github.com/cerebras/reap
- Cerebras REAP collection: huggingface.co/collections/cerebras/cerebras-reap
- Base model: Qwen/Qwen3-Coder-Next
- 30% pruned variant: 0xSero/Qwen3-Coder-57B
License & citation
License inherited from the base model.
```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}
```
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
Model provider
0xSero
Model tree
Base: Qwen/Qwen3-Coder-Next
Fine-tuned: this model
Modalities
Input: Text
Output: Text