Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: otherAt a glance
| Base model | Qwen/Qwen3.5-122B-A10B |
| Format | BF16 |
| Total params | 99B |
| Active / token | 10B |
| Experts / layer | — |
| Layers | — |
| Hidden size | — |
| Context | — |
| On-disk size | 198 GB |
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
Qwen3.5-264B | BF16 | link |
Qwen3.5-264B-FP8 | FP8 | link |
Qwen3.5-264B-W4A16 | W4A16 | link |
Qwen3.5-28B | BF16 | link |
Qwen3.5-35B-EXL3-4bpw | EXL3-4bpw | link |
Qwen3.5-76B | BF16 | link |
Qwen3.5-76B-GGUF | GGUF | link |
Qwen3.5-88B | BF16 | link |
Qwen3.5-99B (this) | BF16 | link |
Qwen3.5-99B-GGUF | GGUF | link |
20% expert-pruned variant of Qwen3.5-122B-A10B using REAP (Routing-Enhanced Activation Pruning).
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-122B-A10B |
| Architecture | Qwen3.5 MoE (GDN + Full Attention) |
| Original Experts | 256 per layer |
| Pruned Experts | 205 per layer (20% removed) |
| Active Parameters | ~10B per token |
| Pruning Method | REAP with targeted refusal preservation |
| Preserve Threshold | 80% (super-expert protection) |
| Calibration | reap-calibration-data-v1 — 23k benchmark-free samples |
| Maintainer | 0xSero |
| Organization | Sybil Solutions |
| Project | REAP PR17 |
Benchmark Results
Code Generation (EvalPlus)
| Benchmark | Pass@1 |
|---|---|
| HumanEval (base) | 81.1% |
| HumanEval+ (base + extra) | 76.8% |
| MBPP (base) | 86.2% |
| MBPP+ (base + extra) | 73.0% |
Knowledge & Reasoning (lm-eval, 0-shot)
| Task | Baseline | REAP-20 | Retained |
|---|---|---|---|
| arc_challenge | 63.4% | 63.7% | 100.5% |
| boolq | 86.4% | 82.7% | 95.8% |
| hellaswag | 85.9% | 84.1% | 97.9% |
| mathqa | 68.5% | 67.3% | 98.1% |
| mmlu_world_religions | 91.2% | 86.0% | 94.2% |
| openbookqa | 46.4% | 45.6% | 98.3% |
| piqa | 83.8% | 82.3% | 98.1% |
| truthfulqa_mc2 | 51.9% | 52.4% | 100.8% |
| winogrande | 75.6% | 75.5% | 99.9% |
Average capability retained: 97.9% after removing 20% of experts.
Usage
bash
vllm serve 0xSero/Qwen3.5-99B \--tensor-parallel-size 4 \--enable-expert-parallel \--max-model-len 8192 \--trust-remote-code \--language-model-only \--dtype bfloat16
Important: Use --language-model-only flag — this is a text-only checkpoint pruned from the multimodal base model.
What is REAP?
REAP (Routing-Enhanced Activation Pruning) removes the least-activated experts from MoE models while preserving critical capabilities. It uses router activation patterns from a calibration dataset to identify dispensable experts, with special protection for safety-critical behaviors.
License
Same license as the base model (Qwen).
License & citation
License inherited from the base model.
bibtex
@misc{lasby2025reap,title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}}
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
Model provider
0xSero
Model tree
Base
Qwen/Qwen3.5-122B-A10B
Fine-tuned
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information