License: apache-2.0
At a glance
| Base model | Qwen/Qwen3.5-397B-A17B |
| Format | W4A16 |
| Total params | 264B |
| Active / token | — |
| Experts / layer | — |
| Layers | — |
| Hidden size | — |
| Context | — |
| On-disk size | 282 GB |
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
| Qwen3.5-264B | BF16 | link |
| Qwen3.5-264B-FP8 | FP8 | link |
| Qwen3.5-264B-W4A16 (this) | W4A16 | link |
| Qwen3.5-28B | BF16 | link |
| Qwen3.5-35B-EXL3-4bpw | EXL3-4bpw | link |
| Qwen3.5-76B | BF16 | link |
| Qwen3.5-76B-GGUF | GGUF | link |
| Qwen3.5-88B | BF16 | link |
| Qwen3.5-99B | BF16 | link |
| Qwen3.5-99B-GGUF | GGUF | link |
- Repository: 0xSero/Qwen3.5-264B-W4A16
- Base model: Qwen/Qwen3.5-397B-A17B
- Artifact kind: quantized
- Compression ratio: 34%
- Prune metric: reap
- Quantization scheme: W4A16
- Quantization format: auto_round:auto_gptq
- Parent artifact: 0xSero/Qwen3.5-264B
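As a rough sanity check on the 34% figure, the sketch below assumes the compression ratio means the fraction of parameters removed by pruning, and takes the nominal sizes from the model names (397B for the base model, 264B for the pruned parent). Both are assumptions; the card does not define the ratio, and it may instead refer to the share of experts removed.

```python
# Nominal parameter counts in billions, read off the model names (assumption).
BASE_PARAMS_B = 397
PRUNED_PARAMS_B = 264


def pruned_fraction(base: float, pruned: float) -> float:
    """Fraction of parameters removed by pruning."""
    return (base - pruned) / base


ratio = pruned_fraction(BASE_PARAMS_B, PRUNED_PARAMS_B)
print(f"{ratio:.1%} of parameters removed")
```

Under these assumptions the result rounds to the 34% listed above.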
Details
- Maintainer: 0xSero
- Organization: Sybil Solutions
- Project: REAP PR17
- Hub owner: 0xSero
- Summary: AutoRound W4A16 GPTQ quantization of Qwen3.5-264B-REAP with vision encoder transplanted from the 262B variant.
Architecture
Hybrid MoE + Linear Attention (GDN/Mamba-style):
- 60 layers with mixed linear_attention and full_attention layer types
- 336 experts, 10 active per token
- Vision encoder: ViT with 27 blocks, 1152 hidden size, spatial merge, transplanted from atbender/Qwen3.5-REAP-262B-A17B-W4A16
- Composite multimodal format: Qwen3_5MoeForConditionalGeneration architecture
Vision Encoder
The vision encoder (visual-encoder.safetensors, 870 MB, 333 tensor keys) was transplanted from the 262B variant. The original 264B model was text-only; the vision weights are from the same Qwen3.5 architecture family and are fully compatible. Vision supports image understanding via the standard OpenAI image_url content format.
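One way to confirm the tensor-key count on a downloaded copy, without loading any weights, is to read only the safetensors header. The sketch below relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function name `list_tensor_names` is a hypothetical helper, not part of any library.

```python
import json
import struct


def list_tensor_names(path: str) -> list[str]:
    """Return tensor names from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]
```

Calling `len(list_tensor_names("visual-encoder.safetensors"))` on the file named above should then report the number of tensor keys it contains.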
Provenance
- Observer state: /home/ubuntu/qwen397-full/observer-calibv1/qwen397-pr17-calibv1-23k-16k-observer-state.raw.pt
- Detail state: /home/ubuntu/qwen397-full/observer-calibv1/qwen397-pr17-calibv1-23k-16k-detail-state.raw.pt
Benchmarks
Evaluated on 8x RTX 3090 (24 GB each) with vLLM, TP=8, expert parallel, fp8 KV cache.
| Benchmark | Samples | Score |
|---|---|---|
| HumanEval (coding) | 50 | 100% |
| MATH-500 (competition math) | 54 | 89% |
| Reasoning & Logic | 2 | 100% |
| Terminal/CLI | 2 | 100% |
| SWE (bug fixing) | 2 | 100% |
| Cybersecurity | 2 | 100% |
| Philosophy | 2 | 100% |
| MMLU (general knowledge) | 2 | 100% |
Generation speed: ~62 tokens/s at batch_size=1.
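To put the throughput figure in practical terms, the sketch below estimates how long a long single-stream response would take. It assumes a steady decode rate of 62 tokens/s and ignores prefill time, both simplifications; `decode_seconds` is a hypothetical helper.

```python
def decode_seconds(num_tokens: int, tokens_per_second: float = 62.0) -> float:
    """Estimated wall-clock time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second


# 8192 matches the max_tokens used in the text-generation example in this card.
print(f"{decode_seconds(8192):.0f} s")
```

At this rate a response that uses the full 8192-token budget takes a little over two minutes.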
Serving with vLLM
Requirements
markdown
Python 3.12
CUDA 12.8
8x GPU with 24+ GB each (tested on RTX 3090)
Exact working dependency versions
markdown
vllm==0.19.0
torch==2.10.0+cu128
transformers==4.57.6
flashinfer-python==0.6.6
flashinfer-cubin==0.6.6
quack-kernels==0.3.10
nvidia-cutlass-dsl==4.4.2
nvidia-cutlass-dsl-libs-base==4.4.2
triton==3.6.0
xgrammar==0.1.33
conch-triton-kernels==1.3
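If the server misbehaves, it can help to compare the installed packages against the pins above. The following is a minimal sketch using the standard-library `importlib.metadata`; the `PINS` subset and the `find_mismatches` helper are illustrative choices, not part of the original setup.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above (subset shown for brevity).
PINS = {
    "vllm": "0.19.0",
    "torch": "2.10.0+cu128",
    "transformers": "4.57.6",
    "triton": "3.6.0",
}


def find_mismatches(pins: dict[str, str], get_version=version) -> dict[str, str]:
    """Map package name to installed version (or 'missing') for each pin that differs."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = "missing"
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_mismatches(PINS).items():
        print(f"{name}: want {PINS[name]}, found {installed}")
```

Run it with the environment's own interpreter (for example `vllm-env/bin/python3`) so that it inspects the right site-packages.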
Installation
bash
uv venv vllm-env --python 3.12
uv pip install --python vllm-env/bin/python3 'vllm==0.19.0' conch-triton-kernels
Tokenizer fix
The tokenizer_config.json shipped with this model uses "tokenizer_class": "Qwen2Tokenizer". If you encounter tokenizer errors, verify this field is set correctly:
python
import json

with open("tokenizer_config.json") as f:
    cfg = json.load(f)

cfg["tokenizer_class"] = "Qwen2Tokenizer"

with open("tokenizer_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
Launch command
bash
vllm serve 0xSero/Qwen3.5-264B-W4A16 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --enable-prefix-caching \
  --max-model-len 262144 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.9 \
  --kv-cache-dtype fp8_e4m3 \
  --dtype bfloat16 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --served-model-name qwen35-264b
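Loading a model of this size can take a while, so a client may want to wait until the server lists the served model before sending requests. The sketch below polls the OpenAI-compatible `/v1/models` route using only the standard library. The base URL assumes vLLM's default port 8000, and the timeout and polling interval are arbitrary illustrative values.

```python
import json
import time
import urllib.error
import urllib.request


def served_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def wait_for_model(name: str, base_url: str = "http://localhost:8000/v1",
                   timeout_s: float = 1800.0, poll_s: float = 10.0) -> bool:
    """Poll /v1/models until `name` is listed or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                if name in served_model_ids(json.load(resp)):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server is not accepting connections yet.
        time.sleep(poll_s)
    return False
```

With the launch command above, `wait_for_model("qwen35-264b")` returns `True` once the name given to `--served-model-name` appears in the model list.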
Known issues
- Mamba cache align mode: vLLM auto-enables experimental Mamba cache "align" mode when prefix caching is on. vLLM 0.19.0 includes a fix for Mamba state corruption (PR #37728) that improves stability. If you experience hangs after sustained usage on 0.18.x, upgrade to 0.19.0.
- PCIe riser instability: On systems with PCIe risers (e.g., mining rigs repurposed for ML), sustained multi-GPU NCCL traffic can cause AER errors. Mask AER with setpci -s <addr> ECAP_AER+0x08.l=0xFFFFFFFF on affected slots.
- CUDA graph memory: If CUDA graph capture fails, add --max-cudagraph-capture-size 256 or --enforce-eager.
Usage
Text generation
python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{"role": "user", "content": "Solve: what is the integral of x^2 * e^x dx?"}],
    max_tokens=8192,
)
print(response.choices[0].message.content)
Vision
python
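import base64

# Reuses `client` from the text-generation example.
# "image.png" is a placeholder path; substitute your own image.
with open("image.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
    max_tokens=4096,
)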
Tool calling
python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
}]

response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
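The request above only asks the model to propose a tool call; the client still has to execute it. The sketch below shows one way to dispatch a call by name, given the function name and the JSON-encoded arguments string that the OpenAI chat format returns. The `get_weather` body, `TOOL_REGISTRY`, and `run_tool_call` are hypothetical stand-ins for illustration.

```python
import json
from typing import Any, Callable


def get_weather(city: str) -> dict[str, Any]:
    """Hypothetical stand-in; a real implementation would query a weather service."""
    return {"city": city, "forecast": "unknown"}


# Registry keyed by the function name declared in `tools`.
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call and return its result as a JSON string."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json or "{}")
        return json.dumps(func(**kwargs))
    except (json.JSONDecodeError, TypeError) as exc:
        # Model-produced arguments can be malformed or miss required fields.
        return json.dumps({"error": str(exc)})
```

In use, each entry in `response.choices[0].message.tool_calls` supplies `call.function.name` and `call.function.arguments`. The returned string would then go back to the model as a message with `"role": "tool"` and the matching `tool_call_id`.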
License & citation
License inherited from the base model.
bibtex
@misc{lasby2025reap,
  title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv}
}
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
- Model provider: 0xSero
- Model tree: base Qwen/Qwen3.5-397B-A17B; quantized: this model
- Input modalities: Video, Text, Image
- Output modalities: Text