Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: other

At a glance

Base modelQwen/Qwen3.5-122B-A10B
FormatBF16
Total params88B
Active / token10B
Experts / layer
Layers
Hidden size
Context
On-disk size175 GB

Which variant should I pick?

VariantFormatLink
Qwen3.5-264BBF16link
Qwen3.5-264B-FP8FP8link
Qwen3.5-264B-W4A16W4A16link
Qwen3.5-28BBF16link
Qwen3.5-35B-EXL3-4bpwEXL3-4bpwlink
Qwen3.5-76BBF16link
Qwen3.5-76B-GGUFGGUFlink
Qwen3.5-88B (this)BF16link
Qwen3.5-99BBF16link
Qwen3.5-99B-GGUFGGUFlink

30% expert-pruned variant of Qwen3.5-122B-A10B using REAP (Routing-Enhanced Activation Pruning).

Model Details

PropertyValue
Base ModelQwen/Qwen3.5-122B-A10B
ArchitectureQwen3.5 MoE (GDN + Full Attention)
Original Experts256 per layer
Pruned Experts180 per layer (30% removed)
Active Parameters~10B per token
Pruning MethodREAP with targeted refusal preservation
Preserve Threshold80% (super-expert protection)
Calibrationreap-calibration-data-v1 — 23k benchmark-free samples
Maintainer0xSero
OrganizationSybil Solutions
ProjectREAP PR17

Usage

bash

vllm serve 0xSero/Qwen3.5-88B \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--max-model-len 8192 \
--trust-remote-code \
--language-model-only \
--dtype bfloat16

Important: Use --language-model-only flag — this is a text-only checkpoint pruned from the multimodal base model.

What is REAP?

REAP (Routing-Enhanced Activation Pruning) removes the least-activated experts from MoE models while preserving critical capabilities. It uses router activation patterns from a calibration dataset to identify dispensable experts, with special protection for safety-critical behaviors.

License

Same license as the base model (Qwen).

License & citation

License inherited from the base model.

bibtex

@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Model provider

0xSero

Model tree

Base

Qwen/Qwen3.5-122B-A10B

Fine-tuned

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today