0xSero

Qwen3.5-264B-W4A16

License: apache-2.0

At a glance

Base model	Qwen/Qwen3.5-397B-A17B
Format	W4A16
Total params	264B
Active / token	—
Experts / layer	—
Layers	—
Hidden size	—
Context	—
On-disk size	282 GB

Which variant should I pick?

Variant	Format	Link
`Qwen3.5-264B`	BF16	link
`Qwen3.5-264B-FP8`	FP8	link
`Qwen3.5-264B-W4A16` (this)	W4A16	link

Repository: 0xSero/Qwen3.5-264B-W4A16
Base model: Qwen/Qwen3.5-397B-A17B
Artifact kind: quantized
Compression ratio: 34%
Prune metric: reap
Quantization scheme: W4A16
Quantization format: auto_round:auto_gptq
Parent artifact: 0xSero/Qwen3.5-264B

Details

Maintainer: 0xSero
Organization: Sybil Solutions
Project: REAP PR17
Hub owner: 0xSero
Summary: AutoRound W4A16 GPTQ quantization of Qwen3.5-264B-REAP with vision encoder transplanted from the 262B variant.

Architecture

Hybrid MoE + Linear Attention (GDN/Mamba-style):

60 layers with mixed linear_attention and full_attention layer types
336 experts, 10 active per token
Vision encoder: ViT with 27 blocks, 1152 hidden size, spatial merge, transplanted from atbender/Qwen3.5-REAP-262B-A17B-W4A16
Composite multimodal format: Qwen3_5MoeForConditionalGeneration architecture
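
To see how the 60 layers split between linear_attention and full_attention, one option is to tally the layer types in the model's config.json. This is a minimal sketch that assumes the config exposes a `layer_types` list, either at the top level or under `text_config`; both key names are assumptions and are not confirmed by this card.

```python
from collections import Counter


def count_layer_types(config: dict) -> Counter:
    """Tally layer types from a loaded config.json dict.

    Assumes a `layer_types` list, either top-level or nested under
    `text_config` (both key names are assumptions, not confirmed here).
    """
    text_cfg = config.get("text_config", config)
    return Counter(text_cfg.get("layer_types", []))
```

Load config.json with `json.load` and pass the resulting dict to `count_layer_types`; an empty Counter means the assumed keys are not present.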

Vision Encoder

The vision encoder (visual-encoder.safetensors, 870 MB, 333 tensor keys) was transplanted from the 262B variant. The original 264B model was text-only; the vision weights are from the same Qwen3.5 architecture family and are fully compatible. Vision supports image understanding via the standard OpenAI image_url content format.
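
To confirm that a downloaded visual-encoder.safetensors is intact, the tensor key count can be read from the file header without loading any weights. A minimal sketch, relying only on the safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function name `list_tensor_keys` is illustrative.

```python
import json
import struct


def list_tensor_keys(path: str) -> list[str]:
    """Return tensor names from a .safetensors file without loading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header.
    return [key for key in header if key != "__metadata__"]
```

If the file matches the description above, `len(list_tensor_keys("visual-encoder.safetensors"))` should come out at 333.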

Provenance

Observer state: /home/ubuntu/qwen397-full/observer-calibv1/qwen397-pr17-calibv1-23k-16k-observer-state.raw.pt
Detail state: /home/ubuntu/qwen397-full/observer-calibv1/qwen397-pr17-calibv1-23k-16k-detail-state.raw.pt

Benchmarks

Evaluated on 8x RTX 3090 (24 GB each) with vLLM, TP=8, expert parallel, fp8 KV cache.

Benchmark	Samples	Score
HumanEval (coding)	50	100%
MATH-500 (competition math)	54	89%
Reasoning & Logic	2	100%
Terminal/CLI	2	100%
SWE (bug fixing)	2	100%
Cybersecurity	2

Generation speed: ~62 tokens/s at batch_size=1.
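
The card does not say how the ~62 tokens/s figure was measured. One rough way to get a comparable single-request number is to time a completion and divide by the generated token count. This is a sketch under that assumption; `measure` and `tokens_per_second` are illustrative names, and `client` is any OpenAI-compatible client object pointed at the server.

```python
import time


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Throughput for one request: generated tokens / wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(client, model: str, prompt: str, max_tokens: int = 512) -> float:
    """Time one non-streaming chat completion and return tokens per second."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    # usage.completion_tokens is the server-reported count of generated tokens.
    return tokens_per_second(response.usage.completion_tokens, elapsed)
```

Because the timer also covers prompt processing, this slightly understates pure decode speed; use a short prompt and a long completion to keep the gap small.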

Serving with vLLM

Requirements

markdown
Python 3.12
CUDA 12.8
8x GPU with 24+ GB each (tested on RTX 3090)
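
Before installing, it can help to check that the machine actually exposes eight GPUs of the required size. A minimal sketch that parses `nvidia-smi` output; the 24,000 MiB default threshold is a loose assumption for "24+ GB" (some 24 GB cards report slightly under 24,576 MiB), and the function names are illustrative.

```python
import subprocess


def parse_gpu_memory_mib(smi_output: str) -> list[int]:
    """Parse one memory.total value (MiB) per line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def meets_requirements(memory_mib: list[int], min_gpus: int = 8,
                       min_mib: int = 24_000) -> bool:
    """True if at least `min_gpus` GPUs each report `min_mib` MiB or more."""
    big_enough = [m for m in memory_mib if m >= min_mib]
    return len(big_enough) >= min_gpus


def query_gpu_memory_mib() -> list[int]:
    """Ask nvidia-smi for total memory per GPU (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory_mib(out)
```

On the target machine, `meets_requirements(query_gpu_memory_mib())` returns True when the requirement above is met.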

Exact working dependency versions

markdown
vllm==0.19.0
torch==2.10.0+cu128
transformers==4.57.6
flashinfer-python==0.6.6
flashinfer-cubin==0.6.6
quack-kernels==0.3.10
nvidia-cutlass-dsl==4.4.2
nvidia-cutlass-dsl-libs-base==4.4.2
triton==3.6.0
xgrammar==0.1.33
conch-triton-kernels==1.3
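
To check an existing environment against these pins, the installed versions can be compared using the standard library. A minimal sketch; `PINS` holds only a subset of the list above, and `find_mismatches` is an illustrative name.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the pins listed above; extend with the rest as needed.
PINS = {
    "vllm": "0.19.0",
    "transformers": "4.57.6",
    "triton": "3.6.0",
    "xgrammar": "0.1.33",
}


def find_mismatches(pins: dict[str, str]) -> dict[str, str]:
    """Map package name to installed version for every pin that differs."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = "not installed"
        if installed != wanted:
            mismatches[name] = installed
    return mismatches
```

Run `find_mismatches(PINS)` inside the virtual environment; an empty dict means every checked package matches its pin.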

Installation

bash
uv venv vllm-env --python 3.12
uv pip install --python vllm-env/bin/python3 'vllm==0.19.0' conch-triton-kernels

Tokenizer fix

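The tokenizer_config.json shipped with this model sets "tokenizer_class": "Qwen2Tokenizer". If you encounter tokenizer errors, check that the field still has this value. The script below rewrites it in place; run it from the local model directory:

python
import json

# Load the tokenizer config from the current (model) directory.
with open("tokenizer_config.json") as f:
    cfg = json.load(f)

# Force the tokenizer class this model expects.
cfg["tokenizer_class"] = "Qwen2Tokenizer"

# Write the config back in place.
with open("tokenizer_config.json", "w") as f:
    json.dump(cfg, f, indent=2)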

Launch command

bash
vllm serve 0xSero/Qwen3.5-264B-W4A16 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --enable-prefix-caching \
  --max-model-len 262144 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.9 \
  --kv-cache-dtype fp8_e4m3 \
  --dtype bfloat16 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --served-model-name qwen35-264b
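
Loading a model of this size across eight GPUs takes a while, so client scripts may want to wait until the server is answering before sending requests. A minimal sketch that polls the OpenAI-compatible `/v1/models` endpoint; the port (8000) matches the usage examples in this card, while the 30-minute default timeout and the name `wait_for_server` are assumptions.

```python
import time
import urllib.request


def wait_for_server(base_url: str = "http://localhost:8000",
                    timeout_s: float = 1800.0, poll_s: float = 5.0) -> bool:
    """Poll GET {base_url}/v1/models until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, HTTP error, or timeout: still loading.
            pass
        time.sleep(poll_s)
    return False
```

Call `wait_for_server()` after starting `vllm serve` and only proceed when it returns True.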

Known issues

Mamba cache align mode: vLLM auto-enables experimental Mamba cache "align" mode when prefix caching is on. vLLM 0.19.0 includes a fix for Mamba state corruption (PR #37728) that improves stability. If you experience hangs after sustained usage on 0.18.x, upgrade to 0.19.0.
PCIe riser instability: On systems with PCIe risers (e.g., mining rigs repurposed for ML), sustained multi-GPU NCCL traffic can cause AER errors. Mask AER with setpci -s <addr> ECAP_AER+0x08.l=0xFFFFFFFF on affected slots.
CUDA graph memory: If CUDA graph capture fails, add --max-cudagraph-capture-size 256 or --enforce-eager.

Usage

Text generation

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{"role": "user", "content": "Solve: what is the integral of x^2 * e^x dx?"}],
    max_tokens=8192,
)
print(response.choices[0].message.content)
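
With `max_tokens=8192`, a non-streaming call can sit silent for some time. Streaming the response shows tokens as they are generated. A minimal sketch using the `stream=True` option of the chat completions API; `stream_text` is an illustrative name, and `client` is the OpenAI client created above.

```python
def stream_text(client, model: str, prompt: str, max_tokens: int = 1024) -> str:
    """Stream a chat completion, printing text as it arrives; return the full text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # delta.content can be None on chunks that carry no answer text.
        if delta.content:
            print(delta.content, end="", flush=True)
            parts.append(delta.content)
    return "".join(parts)
```

For example: `stream_text(client, "qwen35-264b", "Explain expert parallelism in two sentences.")`.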

Vision

python
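import base64

# Reuses `client` from the text-generation example above.
# "image.png" is a placeholder path; point it at your own image.
with open("image.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
            {"type": "text", "text": "What's in this image?"}
        ]
    }],
    max_tokens=4096,
)
print(response.choices[0].message.content)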
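
The example above hard-codes `image/png` in the data URL. For other image types, the MIME type can be derived from the file extension. A minimal sketch using the standard library; `image_content_part` is an illustrative helper name, not part of any client library.

```python
import base64
import mimetypes


def image_content_part(path: str) -> dict:
    """Build an OpenAI-style image_url content part from a local image file."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}
```

The returned dict can be placed directly in the `content` list of a user message, next to the text part.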

Tool calling

python
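# Reuses `client` from the text-generation example above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
}]
response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
# With --enable-auto-tool-choice, the model decides whether to call the tool.
print(response.choices[0].message.tool_calls)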
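
The request above only returns the model's tool call; the caller still has to run the tool and send the result back. A minimal sketch of that step, following the OpenAI chat completions message format; the local `get_weather` implementation, `REGISTRY`, and `run_tool_calls` are hypothetical names, and the weather string is a placeholder, not real data.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical local implementation of the declared tool."""
    return f"Sunny in {city}"  # placeholder result, not real data


REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message, registry=REGISTRY) -> list[dict]:
    """Execute each tool call on an assistant message; return `tool` role messages."""
    results = []
    for call in message.tool_calls or []:
        # Arguments arrive as a JSON-encoded string.
        args = json.loads(call.function.arguments)
        output = registry[call.function.name](**args)
        results.append({"role": "tool", "tool_call_id": call.id, "content": str(output)})
    return results
```

To finish the exchange, append `response.choices[0].message` and the returned tool messages to `messages`, then call `client.chat.completions.create` again to get the final answer.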

License & citation

License: apache-2.0, inherited from the base model.

bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
