License: apache-2.0

At a glance

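| Field | Value |
| --- | --- |
| Base model | Qwen/Qwen3.5-397B-A17B |
| Format | W4A16 |
| Total params | 264B |
| Active / token | |
| Experts / layer | 336 (10 active per token) |
| Layers | 60 |
| Hidden size | |
| Context | |
| On-disk size | 282 GB |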

Which variant should I pick?

| Variant | Format | Link |
| --- | --- | --- |
| Qwen3.5-264B | BF16 | link |
| Qwen3.5-264B-FP8 | FP8 | link |
| Qwen3.5-264B-W4A16 (this) | W4A16 | link |
| Qwen3.5-28B | BF16 | link |
| Qwen3.5-35B-EXL3-4bpw | EXL3-4bpw | link |
| Qwen3.5-76B | BF16 | link |
| Qwen3.5-76B-GGUF | GGUF | link |
| Qwen3.5-88B | BF16 | link |
| Qwen3.5-99B | BF16 | link |
| Qwen3.5-99B-GGUF | GGUF | link |

  • Repository: 0xSero/Qwen3.5-264B-W4A16
  • Base model: Qwen/Qwen3.5-397B-A17B
  • Artifact kind: quantized
  • Compression ratio: 34%
  • Prune metric: reap
  • Quantization scheme: W4A16
  • Quantization format: auto_round:auto_gptq
  • Parent artifact: 0xSero/Qwen3.5-264B
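
The 34% compression ratio is consistent with the parameter counts in the model names, assuming the ratio means the fraction of the base model's parameters removed by pruning. That reading is an assumption; the card does not define the term. A quick check:

```python
# Parameter counts in billions, taken from the model names on this card.
BASE_PARAMS_B = 397    # Qwen/Qwen3.5-397B-A17B
PRUNED_PARAMS_B = 264  # 0xSero/Qwen3.5-264B


def removed_fraction(base: float, pruned: float) -> float:
    """Fraction of the base model's parameters that pruning removed."""
    return 1.0 - pruned / base


ratio = removed_fraction(BASE_PARAMS_B, PRUNED_PARAMS_B)
print(f"removed: {ratio:.1%}")
```

Under that assumption the result is roughly one third of the parameters, which rounds to the stated 34%.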

Details

  • Maintainer: 0xSero
  • Organization: Sybil Solutions
  • Project: REAP PR17
  • Hub owner: 0xSero
  • Summary: AutoRound W4A16 quantization (exported in GPTQ format) of Qwen3.5-264B-REAP, with the vision encoder transplanted from the 262B variant.

Architecture

Hybrid MoE + Linear Attention (GDN/Mamba-style):

  • 60 layers with mixed linear_attention and full_attention layer types
  • 336 experts, 10 active per token
  • Vision encoder: ViT with 27 blocks, 1152 hidden size, spatial merge, transplanted from atbender/Qwen3.5-REAP-262B-A17B-W4A16
  • Composite multimodal format: Qwen3_5MoeForConditionalGeneration architecture
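
To see how these numbers relate, the sketch below summarizes a config dictionary: it counts layers per type and computes the fraction of experts active per token. The key names (`layer_types`, `num_experts`, `num_experts_per_tok`) are assumptions modeled on common Hugging Face MoE configs, and the linear/full split in the example input is invented for illustration, since the card does not state it. Check the shipped config.json for the real values.

```python
from collections import Counter


def summarize_architecture(text_config: dict) -> dict:
    """Summarize layer mix and expert sparsity from a config dictionary.

    The key names used here are assumptions, not confirmed field names.
    """
    layer_counts = Counter(text_config["layer_types"])
    experts = text_config["num_experts"]
    active = text_config["num_experts_per_tok"]
    return {
        "layers": sum(layer_counts.values()),
        "layer_counts": dict(layer_counts),
        "active_expert_fraction": active / experts,
    }


# Illustrative input only: 60 layers, 336 experts and 10 active experts come
# from the card; the 45/15 split between layer types is a made-up placeholder.
example = {
    "layer_types": ["linear_attention"] * 45 + ["full_attention"] * 15,
    "num_experts": 336,
    "num_experts_per_tok": 10,
}
summary = summarize_architecture(example)
print(summary)
```

With 10 of 336 experts routed per token, about 3% of the expert weights in each MoE layer take part in any single token's forward pass.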

Vision Encoder

The vision encoder (visual-encoder.safetensors, 870 MB, 333 tensor keys) was transplanted from the 262B variant. The original 264B model was text-only; the vision weights are from the same Qwen3.5 architecture family and are fully compatible. Image understanding is available through the standard OpenAI image_url content format.
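
One way to confirm the tensor-key count after downloading is to read only the safetensors header, which avoids loading 870 MB of weights. The sketch below relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function name is hypothetical.

```python
import json
import struct


def list_tensor_keys(path: str) -> list[str]:
    """Return tensor names from a .safetensors file without loading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header.
    return [name for name in header if name != "__metadata__"]


# Usage (path is relative to the downloaded model directory):
# keys = list_tensor_keys("visual-encoder.safetensors")
# print(len(keys))
```

If the file matches this card, the printed count should equal the 333 tensor keys quoted above.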

Provenance

  • Observer state: /home/ubuntu/qwen397-full/observer-calibv1/qwen397-pr17-calibv1-23k-16k-observer-state.raw.pt
  • Detail state: /home/ubuntu/qwen397-full/observer-calibv1/qwen397-pr17-calibv1-23k-16k-detail-state.raw.pt

Benchmarks

Evaluated on 8x RTX 3090 (24 GB each) with vLLM, TP=8, expert parallel, fp8 KV cache.

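| Benchmark | Samples | Score |
| --- | --- | --- |
| HumanEval (coding) | 50 | 100% |
| MATH-500 (competition math) | 54 | 89% |
| Reasoning & Logic | 2 | 100% |
| Terminal/CLI | 2 | 100% |
| SWE (bug fixing) | 2 | 100% |
| Cybersecurity | 2 | 100% |
| Philosophy | 2 | 100% |
| MMLU (general knowledge) | 2 | 100% |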

Generation speed: ~62 tokens/s at batch_size=1.
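
Because several rows above use only two samples, it can help to look at the uncertainty around each score. The sketch below computes a 95% Wilson score interval; the correct-answer counts are back-computed from the reported percentages (for example, 89% of 54 is taken to mean 48 correct), which is an assumption rather than data from the evaluation logs.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate of correct/total."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# (benchmark, correct, samples); counts inferred from the table, not measured.
results = [
    ("HumanEval", 50, 50),
    ("MATH-500", 48, 54),
    ("MMLU", 2, 2),
]
for name, correct, total in results:
    low, high = wilson_interval(correct, total)
    print(f"{name}: {correct}/{total}, 95% interval {low:.0%} to {high:.0%}")
```

For a two-sample row, the lower bound falls to roughly one third even at 100% accuracy, so those rows are better read as smoke tests than as benchmark scores.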

Serving with vLLM

Requirements

markdown

Python 3.12
CUDA 12.8
8x GPU with 24+ GB each (tested on RTX 3090)

Exact working dependency versions

markdown

vllm==0.19.0
torch==2.10.0+cu128
transformers==4.57.6
flashinfer-python==0.6.6
flashinfer-cubin==0.6.6
quack-kernels==0.3.10
nvidia-cutlass-dsl==4.4.2
nvidia-cutlass-dsl-libs-base==4.4.2
triton==3.6.0
xgrammar==0.1.33
conch-triton-kernels==1.3

Installation

bash

uv venv vllm-env --python 3.12
uv pip install --python vllm-env/bin/python3 'vllm==0.19.0' conch-triton-kernels

Tokenizer fix

The tokenizer_config.json shipped with this model sets "tokenizer_class": "Qwen2Tokenizer". If you encounter tokenizer errors, check that the field still has this value; the snippet below rewrites it in place:

python

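import json

# Run from the directory that contains the downloaded model files.
with open("tokenizer_config.json") as f:
    cfg = json.load(f)

# Force the tokenizer class this model expects.
cfg["tokenizer_class"] = "Qwen2Tokenizer"

with open("tokenizer_config.json", "w") as f:
    json.dump(cfg, f, indent=2)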

Launch command

bash

vllm serve 0xSero/Qwen3.5-264B-W4A16 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --enable-prefix-caching \
  --max-model-len 262144 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.9 \
  --kv-cache-dtype fp8_e4m3 \
  --dtype bfloat16 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --served-model-name qwen35-264b
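
Loading a model of this size can take a while, so client scripts may want to wait until the server answers before sending requests. The sketch below polls a URL until it returns HTTP 200. It assumes the server listens on localhost:8000 (as in the usage examples) and exposes a `/health` route; the helper names and timeouts are hypothetical.

```python
import time
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError and connection failures
        return False


def wait_until_ready(url: str, timeout_s: float = 1800.0,
                     interval_s: float = 10.0, probe=http_ok) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage, once `vllm serve` has been started in another terminal:
# if wait_until_ready("http://localhost:8000/health"):
#     print("server is ready")
```

The `probe` parameter exists so the polling loop can be exercised without a running server.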

Known issues

  • Mamba cache align mode: vLLM auto-enables experimental Mamba cache "align" mode when prefix caching is on. vLLM 0.19.0 includes a fix for Mamba state corruption (PR #37728) that improves stability. If you experience hangs after sustained usage on 0.18.x, upgrade to 0.19.0.
  • PCIe riser instability: On systems with PCIe risers (e.g., mining rigs repurposed for ML), sustained multi-GPU NCCL traffic can cause AER errors. Mask AER with setpci -s <addr> ECAP_AER+0x08.l=0xFFFFFFFF on affected slots.
  • CUDA graph memory: If CUDA graph capture fails, add --max-cudagraph-capture-size 256 or --enforce-eager.

Usage

Text generation

python

from openai import OpenAI

# The local vLLM server does not check the API key, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen35-264b",  # must match --served-model-name
    messages=[{"role": "user", "content": "Solve: what is the integral of x^2 * e^x dx?"}],
    max_tokens=8192,
)
print(response.choices[0].message.content)

Vision

python

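import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Encode a local PNG as base64; "image.png" is a placeholder path.
with open("image.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
    max_tokens=4096,
)
print(response.choices[0].message.content)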
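
For images that are not PNGs, the MIME type in the data URL should match the file. A minimal helper, assuming the type can be inferred from the file extension (the function name is hypothetical):

```python
import base64
import mimetypes


def image_to_data_url(path: str) -> str:
    """Build a data URL for a local image, inferring the MIME type from its name."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path!r}")
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

The returned string can be passed directly as the `url` value of an `image_url` content part.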

Tool calling

python

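# Reuses `client` from the text generation example.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
}]

response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
# The requested calls, if any, are returned on the assistant message.
print(response.choices[0].message.tool_calls)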
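
The request above only returns the model's tool call; the application still has to run the function and send the result back. The sketch below shows one way to dispatch the calls, following the OpenAI chat-completions message shape. The `get_weather` implementation, the registry, and the stand-in message are hypothetical and exist only so the example runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> str:
    """Hypothetical local implementation; returns a canned string."""
    return f"(placeholder) weather report for {city}"


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message) -> list[dict]:
    """Execute each tool call on an assistant message and build `tool` replies."""
    replies = []
    for call in message.tool_calls or []:
        func = TOOL_REGISTRY[call.function.name]
        kwargs = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        replies.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": func(**kwargs),
        })
    return replies


# Stand-in for response.choices[0].message, so the sketch runs offline.
fake_message = SimpleNamespace(tool_calls=[
    SimpleNamespace(
        id="call_0",
        function=SimpleNamespace(name="get_weather", arguments='{"city": "Tokyo"}'),
    )
])
replies = run_tool_calls(fake_message)
print(replies)
```

In a real exchange, append the assistant message and then these replies to `messages`, and call `client.chat.completions.create` again so the model can produce its final answer.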

License & citation

License inherited from the base model.

bibtex

@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Model provider

0xSero

Model tree

Base

Qwen/Qwen3.5-397B-A17B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text
