What's in the box
Table | |
|---|
| Base model | Qwen3.6-35B-A3B (hybrid Gated-DeltaNet + MoE, 35B total / ~3B active, multimodal) |
| Quant method | GPTQ, symmetric |
| Format | compressed-tensors, W4A16, group_size 32 (uint4b8) |
| Scope | language model only — vision tower kept at full precision |
| Architecture on disk | Qwen3_5MoeForConditionalGeneration (multimodal wrapper preserved), served --language-model-only |
| Size | ~20 GB |
| Calibration | 512 samples (ultrachat_200k), 2048 tokens |
Why these specific choices
This quant is unusual on three axes, each forced by a hard requirement — see
QUANT_GFX1151.md for the full story.
- Symmetric (GPTQ), not AWQ. vLLM's
CompressedTensorsWNA16MoEMethod asserts symmetric
quantization for MoE experts (Only symmetric quantization is supported for MoE). AWQ's
asymmetric (zero-point) W4A16 loads fine as AutoAWQ but is rejected on the
compressed-tensors path. GPTQ's W4A16 preset is symmetric and maps to the kernel's
uint4b8 type.
- group_size 32. The custom MMQ HIP prefill kernel binds only to compressed-tensors W4A16
at g32 (not the usual g128).
- Multimodal-preserving packaging. Quantized via
AutoModelForImageTextToText so the
checkpoint keeps the Qwen3_5MoeForConditionalGeneration wrapper + vision_config. A
text-only repackaging (AutoModelForCausalLM) produces Qwen3_5MoeForCausalLM and crashes
vLLM's VL renderer (Expected Qwen3_5MoeConfig, found Qwen3_5MoeTextConfig).
Benchmarked against the prior AutoAWQ build on the same node, ~30K-token agentic-coding prompt,
400-token output, warm:
Table with columns: metric, AutoAWQ baseline, this model (W4A16-sym + kernel), delta| metric | AutoAWQ baseline | this model (W4A16-sym + kernel) | delta |
|---|
| TTFT (prefill) | 43.8 s | 3.5 s | 12.4× faster |
| decode ITL p50 | 356 ms (2.8 tok/s) | 325 ms (3.1 tok/s) | ~10% |
| total wall-clock | 97.5 s | 49.7 s | ~2× |
The prefill speedup comes from
hec-ovi/vllm-awq4-qwen's AWQ-INT4 MMQ HIP kernel
for gfx1151 (WMMA iu8 inner loop), which binds to this checkpoint's symmetric W4A16 g32 layers
(can_implement -> True on all 380 expert layers, wt=uint4b8).
Usage
This checkpoint is built for a custom gfx1151 vLLM image (the hec-ovi lineage + the Patch-16 MMQ
kernel registration). The headline numbers require that kernel. On standard vLLM / other
hardware it still loads and serves as an ordinary compressed-tensors W4A16 model — you just
get the stock Triton W4A16 path, not the custom prefill kernel.
# gfx1151 (Strix Halo) custom build — see the homelab-ops vllm-strix-halo image
VLLM_USE_TRITON_AWQ=1 VLLM_ROCM_USE_AITER=0 \
vllm serve twigboy2000/Qwen3.6-35B-A3B-W4A16-g32 \
--served-model-name qwen3.6-35b \
--enforce-eager \ # HIP graphs freeze on gfx1151 (vllm#32180)
--language-model-only \ # text-only serve; skips the vision encoder
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--max-model-len 131072
Pairs well with the z-lab/Qwen3.6-35B-A3B-DFlash drafter for speculative decoding (see
limitations re: acceptance).
Limitations
- DFlash speculative-decode acceptance regresses vs AutoAWQ (~44% → ~20%). The symmetric
quant's token distribution diverges from what the
z-lab DFlash drafter expects, so
speculation lands less often. Decode throughput held up in testing, but the spec-decode margin
shrinks. An AWQ-symmetric variant may recover this.
- First-prompt cold start: the MMQ kernel JIT-autotunes on the first M≥32 prefill (~46 s,
once per process); warm thereafter. A startup warmup request hides it.
- Vision tower is not quantized — this is a language-model-only serving quant.
- Calibrated on general chat (ultrachat); a code-heavy calibration set may shift quality slightly
for coding tasks.
- Quality vs the FP16 base: standard ~4-bit GPTQ g32 loss; the small group size keeps it modest.
Reproduction
Recipe: quant_ct_w4a16.py (loads via AutoModelForImageTextToText, GPTQModifier W4A16 g32
symmetric, ignores vision tower + MoE router/gate + GDN state params; pre-flight + post-verify
guards). Produced on a CUDA host (NVIDIA GB10 / RTX 4090) in ~30–60 min — the output
compressed-tensors checkpoint is portable and serves on the gfx1151 ROCm node.
Acknowledgements