License: apache-2.0
At a glance
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-35B-A3B-Base |
| Format | EXL3-4bpw |
| Total params | 35B |
| Active / token | 3B |
| Experts / layer | — |
| Layers | — |
| Hidden size | — |
| Context | — |
| On-disk size | 21 GB |
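As a rough sanity check on the figures above, the nominal weight footprint can be estimated from the parameter count and the bits per weight. The sketch below is illustrative only: it assumes "35B" means 35e9 parameters, that "GB" is decimal, and that "4bpw" is the average target for the quantized tensors. The helper name `quantized_size_gb` is hypothetical and not part of any library.

```python
def quantized_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB for a uniformly quantized model."""
    return total_params * bits_per_weight / 8 / 1e9


# Values taken from the table above; the interpretation of each is an assumption.
TOTAL_PARAMS = 35e9  # "Total params: 35B", treated as exactly 35e9
BPW = 4.0            # nominal bits per weight implied by "EXL3-4bpw"
ON_DISK_GB = 21.0    # "On-disk size: 21 GB", treated as decimal gigabytes

# Size if every parameter were stored at exactly 4 bits.
estimate = quantized_size_gb(TOTAL_PARAMS, BPW)

# Average bits per parameter implied by the reported on-disk size.
effective_bpw = ON_DISK_GB * 1e9 * 8 / TOTAL_PARAMS

print(f"nominal 4 bpw estimate: {estimate:.1f} GB")
print(f"effective bits per weight on disk: {effective_bpw:.2f}")
```

The nominal estimate comes out below the reported 21 GB. The card does not explain the difference; plausible causes include some tensors being stored at higher precision, non-weight files in the repository, or rounding in the headline parameter count.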
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
| Qwen3.5-264B | BF16 | link |
| Qwen3.5-264B-FP8 | FP8 | link |
| Qwen3.5-264B-W4A16 | W4A16 | link |
| Qwen3.5-28B | BF16 | link |
| Qwen3.5-35B-EXL3-4bpw (this) | EXL3-4bpw | link |
| Qwen3.5-76B | BF16 | link |
| Qwen3.5-76B-GGUF | GGUF | link |
| Qwen3.5-88B | BF16 | link |
| Qwen3.5-99B | BF16 | link |
| Qwen3.5-99B-GGUF | GGUF | link |
The full base-model documentation lives upstream; this card covers only the EXL3-4bpw build.
See the base model for architecture, benchmarks, and general usage.
License & citation
License inherited from the base model.
```bibtex
@misc{lasby2025reap,
  title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv}
}
```
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
Model provider: 0xSero
Model tree
Base: Qwen/Qwen3.5-35B-A3B-Base
Quantized: this model
Modalities
Input: Video, Text, Image
Output: Text