At a glance
| | |
|---|---|
| Base model | Qwen/Qwen3.5-397B-A17B |
| Format | W4A16 |
| Total params | 264B |
| Active / token | — |
| Experts / layer | — |
| Layers | — |
| Hidden size | — |
| Context | — |
| On-disk size | 282 GB |
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
| Qwen3.5-264B | BF16 | link |
| Qwen3.5-264B-FP8 | FP8 | link |
| Qwen3.5-264B-W4A16 (this) | W4A16 | link |
- Repository: 0xSero/Qwen3.5-264B-W4A16
- Base model: Qwen/Qwen3.5-397B-A17B
- Artifact kind: quantized
- Compression ratio: 34%
- Prune metric: reap
- Quantization scheme: W4A16
- Quantization format: auto_round:auto_gptq
- Parent artifact: 0xSero/Qwen3.5-264B
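As a quick sanity check, the 34% figure is consistent with the parameter counts in the model names. This sketch assumes the ratio is the fraction of total parameters removed by pruning; the card does not state how it is computed.

```python
# Assumption: "compression ratio" = fraction of total parameters removed by pruning.
base_params_b = 397    # Qwen3.5-397B-A17B, taken from the base model name
pruned_params_b = 264  # Qwen3.5-264B, taken from the parent artifact name

removed_fraction = 1 - pruned_params_b / base_params_b
print(f"{removed_fraction:.1%} of parameters removed")
```

Under that assumption the result is about 33.5%, which rounds to the listed 34%.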
Details
- Maintainer: 0xSero
- Organization: Sybil Solutions
- Project: REAP PR17
- Hub owner: 0xSero
- Summary: AutoRound W4A16 GPTQ quantization of Qwen3.5-264B-REAP with vision encoder transplanted from the 262B variant.
Architecture
Hybrid MoE + Linear Attention (GDN/Mamba-style):
- 60 layers with mixed linear_attention and full_attention layer types
- 336 experts, 10 active per token
- Vision encoder: ViT with 27 blocks, 1152 hidden size, spatial merge, transplanted from atbender/Qwen3.5-REAP-262B-A17B-W4A16
- Composite multimodal format: Qwen3_5MoeForConditionalGeneration architecture
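To see how the 60 layers split between the two attention types, one option is to count the entries in the model's config.json. This is a minimal sketch: it assumes the per-layer list is stored under a `layer_types` key, either at the top level or inside a nested `text_config`. Both key names are assumptions about this checkpoint's config, not something the card confirms.

```python
import json
from collections import Counter


def count_layer_types(config: dict) -> Counter:
    """Count how many layers use each attention type.

    Assumption: the per-layer list lives under "layer_types", either at the
    top level or inside a nested "text_config" (composite multimodal configs).
    """
    text_cfg = config.get("text_config", config)
    return Counter(text_cfg.get("layer_types", []))


def load_config(path: str = "config.json") -> dict:
    """Read a config.json from a local copy of the repository."""
    with open(path) as f:
        return json.load(f)
```

Calling `count_layer_types(load_config())` from inside a local copy of the repository returns a counter keyed by layer type. An empty counter means the assumed key is not present.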
Vision Encoder
The vision encoder (visual-encoder.safetensors, 870 MB, 333 tensor keys) was transplanted from the 262B variant. The original 264B model was text-only; the vision weights are from the same Qwen3.5 architecture family and are fully compatible. Vision supports image understanding via the standard OpenAI image_url content format.
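The tensor count quoted above can be checked without loading any weights, because a safetensors file begins with an 8-byte little-endian length followed by a JSON header that names every tensor. The sketch below reads only that header; the function name is illustrative.

```python
import json
import struct


def list_tensor_keys(path: str) -> list[str]:
    """Return tensor names from a .safetensors file without loading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(k for k in header if k != "__metadata__")
```

Running `len(list_tensor_keys("visual-encoder.safetensors"))` on the downloaded file should report 333 if it matches the description above.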
Provenance
- Observer state: /home/ubuntu/qwen397-full/observer-calibv1/qwen397-pr17-calibv1-23k-16k-observer-state.raw.pt
- Detail state: /home/ubuntu/qwen397-full/observer-calibv1/qwen397-pr17-calibv1-23k-16k-detail-state.raw.pt
Benchmarks
Evaluated on 8x RTX 3090 (24 GB each) with vLLM, TP=8, expert parallel, fp8 KV cache.
| Benchmark | Samples | Score |
|---|---|---|
| HumanEval (coding) | 50 | 100% |
| MATH-500 (competition math) | 54 | 89% |
| Reasoning & Logic | 2 | 100% |
| Terminal/CLI | 2 | 100% |
| SWE (bug fixing) | 2 | 100% |
| Cybersecurity | 2 |
Generation speed: ~62 tokens/s at batch_size=1.
Serving with vLLM
Requirements
Python 3.12
CUDA 12.8
8x GPU with 24+ GB each (tested on RTX 3090)
Exact working dependency versions
vllm==0.19.0
torch==2.10.0+cu128
transformers==4.57.6
flashinfer-python==0.6.6
flashinfer-cubin==0.6.6
quack-kernels==0.3.10
nvidia-cutlass-dsl==4.4.2
nvidia-cutlass-dsl-libs-base==4.4.2
triton==3.6.0
xgrammar==0.1.33
conch-triton-kernels==1.3
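Because the pins above are exact, it can help to compare them against the active environment before launching. The following is a minimal sketch using only the standard library. `EXPECTED` copies a subset of the pins above, and the helper name `find_mismatches` is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

# Subset of the pins listed above; extend with the remaining packages as needed.
EXPECTED = {
    "vllm": "0.19.0",
    "torch": "2.10.0+cu128",
    "transformers": "4.57.6",
    "flashinfer-python": "0.6.6",
    "triton": "3.6.0",
    "xgrammar": "0.1.33",
}


def find_mismatches(expected: dict[str, str], get_version=version) -> dict[str, str]:
    """Map each package whose installed version differs to what was found."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = "not installed"
        if found != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    for name, found in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {EXPECTED[name]}, found {found}")
```

Run it with the interpreter from the environment that will serve the model; no output means every checked package matches.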
Installation
uv venv vllm-env --python 3.12
uv pip install --python vllm-env/bin/python3 'vllm==0.19.0' conch-triton-kernels
Tokenizer fix
The tokenizer_config.json shipped with this model uses "tokenizer_class": "Qwen2Tokenizer". If you encounter tokenizer errors, verify this field is set correctly:
import json

# Load the tokenizer config, set the expected class, and write it back.
with open("tokenizer_config.json") as f:
    cfg = json.load(f)
cfg["tokenizer_class"] = "Qwen2Tokenizer"
with open("tokenizer_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
Launch command
vllm serve 0xSero/Qwen3.5-264B-W4A16 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--enable-prefix-caching \
--max-model-len 262144 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.9 \
--kv-cache-dtype fp8_e4m3 \
--dtype bfloat16 \
--trust-remote-code \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--served-model-name qwen35-264b
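Loading a model of this size across eight GPUs takes a while, so a client script may want to wait until the server lists the served model name before sending requests. The sketch below polls the OpenAI-compatible `/v1/models` endpoint with the standard library. The timeout and poll interval are arbitrary assumptions, and the port assumes vLLM's default of 8000.

```python
import json
import time
import urllib.error
import urllib.request


def served_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]


def wait_until_ready(base_url="http://localhost:8000/v1", model="qwen35-264b",
                     timeout_s=1800, poll_s=10) -> bool:
    """Poll /v1/models until `model` is listed or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                if model in served_model_ids(json.load(resp)):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(poll_s)
    return False
```

`wait_until_ready()` returns `True` once `qwen35-264b` appears in the model list and `False` if the timeout passes first.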
Known issues
- Mamba cache align mode: vLLM auto-enables experimental Mamba cache "align" mode when prefix caching is on. vLLM 0.19.0 includes a fix for Mamba state corruption (PR #37728) that improves stability. If you experience hangs after sustained usage on 0.18.x, upgrade to 0.19.0.
- PCIe riser instability: On systems with PCIe risers (e.g., mining rigs repurposed for ML), sustained multi-GPU NCCL traffic can cause AER errors. Mask AER with setpci -s <addr> ECAP_AER+0x08.l=0xFFFFFFFF on affected slots.
- CUDA graph memory: If CUDA graph capture fails, add --max-cudagraph-capture-size 256 or --enforce-eager.
Usage
Text generation
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{"role": "user", "content": "Solve: what is the integral of x^2 * e^x dx?"}],
    max_tokens=8192,
)
print(response.choices[0].message.content)
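At roughly 62 tokens/s, a response that uses the full `max_tokens=8192` budget takes on the order of two minutes, so streaming the output is often more pleasant than waiting for the whole completion. The sketch below uses the OpenAI client's `stream=True` option. The helper names `stream_text` and `demo` and the example prompt are illustrative.

```python
def stream_text(chunks) -> str:
    """Print streamed answer text as it arrives and return the full string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        piece = chunk.choices[0].delta.content
        if piece:  # deltas without answer text have content set to None
            print(piece, end="", flush=True)
            parts.append(piece)
    print()
    return "".join(parts)


def demo() -> str:
    # Imported here so stream_text itself has no third-party dependency.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
    stream = client.chat.completions.create(
        model="qwen35-264b",
        messages=[{"role": "user", "content": "Explain integration by parts briefly."}],
        max_tokens=8192,
        stream=True,
    )
    return stream_text(stream)
```

Calling `demo()` against a running server prints tokens as they are generated and returns the assembled answer.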
Vision
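import base64

# Encode a local image as base64; "image.png" is a placeholder path.
with open("image.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
    max_tokens=4096,
)
print(response.choices[0].message.content)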
Tool calling
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
}]
response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
# With --enable-auto-tool-choice, requested calls appear here instead of in content.
print(response.choices[0].message.tool_calls)
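The request above only asks the model which tool to call; the client still has to run the tool and send the result back. The sketch below shows one way to do that with the standard OpenAI tool-message format. The `get_weather` body is a hypothetical stand-in, and `TOOL_REGISTRY` and `run_tool_calls` are illustrative names.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical stand-in; a real version would query a weather service."""
    return f"(no live data) weather requested for {city}"


# Maps tool names advertised in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message, registry=TOOL_REGISTRY) -> list[dict]:
    """Execute each requested tool and build the `tool` messages to send back."""
    results = []
    for call in message.tool_calls or []:
        # Arguments arrive as a JSON-encoded string, not a dict.
        args = json.loads(call.function.arguments)
        output = registry[call.function.name](**args)
        results.append({"role": "tool", "tool_call_id": call.id, "content": output})
    return results
```

To finish the exchange, append `response.choices[0].message` and then the messages returned by `run_tool_calls` to the original `messages` list, and call `client.chat.completions.create` again with the same `tools`.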
License & citation
License inherited from the base model.
@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.