twigboy2000
Qwen3.6-35B-A3B-W4A16-g32
License: apache-2.0
What's in the box
| Property | Value |
|---|---|
| Base model | Qwen3.6-35B-A3B (hybrid Gated-DeltaNet + MoE, 35B total / ~3B active, multimodal) |
| Quant method | GPTQ, symmetric |
| Format | compressed-tensors, W4A16, group_size 32 (uint4b8) |
| Scope | language model only — vision tower kept at full precision |
| Architecture on disk | Qwen3_5MoeForConditionalGeneration (multimodal wrapper preserved), served --language-model-only |
| Size | ~20 GB |
| Calibration | 512 samples (ultrachat_200k), 2048 tokens |
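As a rough sanity check on the ~20 GB figure, the sketch below estimates storage for 4-bit weights with one scale per group. It assumes 16-bit scales and no zero-points (symmetric), and it pretends every parameter is quantized, which is not true here (the vision tower and other ignored modules stay at higher precision). Treat it as back-of-envelope arithmetic, not a description of the actual checkpoint layout.

```python
def bits_per_weight(weight_bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Effective bits per weight for symmetric group quantization.

    Assumes one scale per group and no zero-point (symmetric).
    scale_bits=16 is an assumption, not read from the checkpoint.
    """
    return weight_bits + scale_bits / group_size


def estimated_size_gb(n_params: float, weight_bits: int, group_size: int) -> float:
    """Approximate size in GB (1e9 bytes) if every parameter were quantized."""
    return n_params * bits_per_weight(weight_bits, group_size) / 8 / 1e9


g32 = bits_per_weight(4, 32)     # overhead of one 16-bit scale per 32 weights
g128 = bits_per_weight(4, 128)   # the more common group size, for comparison
size_g32 = estimated_size_gb(35e9, 4, 32)
print(f"g32: {g32} bpw, g128: {g128} bpw, ~{size_g32:.1f} GB at 35B params")
```

The smaller group size costs roughly 0.4 extra bits per weight compared with g128, which is the price paid for binding to the g32-only kernel.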
Why these specific choices
This quant is unusual on three axes, each forced by a hard requirement — see QUANT_GFX1151.md for the full story.
- Symmetric (GPTQ), not AWQ. vLLM's `CompressedTensorsWNA16MoEMethod` asserts symmetric quantization for MoE experts (`Only symmetric quantization is supported for MoE`). AWQ's asymmetric (zero-point) W4A16 loads fine as AutoAWQ but is rejected on the compressed-tensors path. GPTQ's W4A16 preset is symmetric and maps to the kernel's `uint4b8` type.
- group_size 32. The custom MMQ HIP prefill kernel binds only to compressed-tensors W4A16 at g32 (not the usual g128).
- Multimodal-preserving packaging. Quantized via `AutoModelForImageTextToText` so the checkpoint keeps the `Qwen3_5MoeForConditionalGeneration` wrapper + `vision_config`. A text-only repackaging (`AutoModelForCausalLM`) produces `Qwen3_5MoeForCausalLM` and crashes vLLM's VL renderer (`Expected Qwen3_5MoeConfig, found Qwen3_5MoeTextConfig`).
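To make "symmetric", "g32" and `uint4b8` concrete, here is a minimal round-to-nearest sketch for a single group of 32 weights. It is not GPTQ itself (GPTQ additionally compensates rounding error using calibration data), and the function names and the `max|w| / 7` scale convention are illustrative assumptions. The point is only that symmetric quantization needs no zero-point: the signed value is stored as an unsigned 4-bit code with a fixed bias of 8.

```python
from typing import List, Tuple

BIAS = 8            # uint4b8: stored code = signed value + 8
QMIN, QMAX = -8, 7  # signed 4-bit range


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one group.

    Returns unsigned 4-bit codes (0..15) and a single scale.
    There is no zero-point: a real 0.0 always maps to code 8.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0  # assumed convention
    codes = []
    for w in weights:
        q = round(w / scale)
        q = max(QMIN, min(QMAX, q))  # clamp to the signed 4-bit range
        codes.append(q + BIAS)
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Invert the mapping: remove the bias, then multiply by the group scale."""
    return [(c - BIAS) * scale for c in codes]


group = [(-1) ** i * i / 31 for i in range(32)]  # 32 toy weights in [-1, 1]
codes, scale = quantize_group(group)
restored = dequantize_group(codes, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.4f}, max abs error={max_err:.4f}")
```

An asymmetric (zero-point) scheme would store an extra per-group offset instead of the fixed bias, which is the variant the MoE path rejects.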
Performance (AMD Strix Halo gfx1151, vLLM + the MMQ prefill kernel)
Benchmarked against the prior AutoAWQ build on the same node, ~30K-token agentic-coding prompt, 400-token output, warm:
| metric | AutoAWQ baseline | this model (W4A16-sym + kernel) | delta |
|---|---|---|---|
| TTFT (prefill) | 43.8 s | 3.5 s | 12.4× faster |
| decode ITL p50 | 356 ms (2.8 tok/s) | 325 ms (3.1 tok/s) | ~10% |
| total wall-clock | 97.5 s | 49.7 s | ~2× |
The prefill speedup comes from
hec-ovi/vllm-awq4-qwen's AWQ-INT4 MMQ HIP kernel
for gfx1151 (WMMA iu8 inner loop), which binds to this checkpoint's symmetric W4A16 g32 layers
(can_implement -> True on all 380 expert layers, wt=uint4b8).
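The derived columns in the table can be recomputed from the reported timings. The sketch below does that arithmetic; the prompt length of exactly 30,000 tokens is an assumption (the text only says "~30K"), so the prefill throughput figures are approximate. With the rounded table values the TTFT ratio comes out near 12.5×, so the table's 12.4× presumably reflects unrounded timings.

```python
# Reported figures from the benchmark table.
ttft_baseline_s, ttft_new_s = 43.8, 3.5
itl_baseline_ms, itl_new_ms = 356.0, 325.0
wall_baseline_s, wall_new_s = 97.5, 49.7
prompt_tokens = 30_000  # "~30K" in the text; the exact count is an assumption


def tokens_per_second(itl_ms: float) -> float:
    """Convert an inter-token latency in milliseconds to tokens per second."""
    return 1000.0 / itl_ms


ttft_speedup = ttft_baseline_s / ttft_new_s
wall_speedup = wall_baseline_s / wall_new_s
itl_reduction = 1 - itl_new_ms / itl_baseline_ms

# Rough prefill throughput, treating TTFT as pure prefill time.
prefill_tps_baseline = prompt_tokens / ttft_baseline_s
prefill_tps_new = prompt_tokens / ttft_new_s

print(f"TTFT speedup: {ttft_speedup:.1f}x, wall-clock speedup: {wall_speedup:.2f}x")
print(f"ITL reduction: {itl_reduction:.1%}")
print(f"decode: {tokens_per_second(itl_baseline_ms):.1f} -> "
      f"{tokens_per_second(itl_new_ms):.1f} tok/s")
print(f"prefill: ~{prefill_tps_baseline:.0f} -> ~{prefill_tps_new:.0f} tok/s")
```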
Usage
This checkpoint is built for a custom gfx1151 vLLM image (the hec-ovi lineage + the Patch-16 MMQ kernel registration). The headline numbers require that kernel. On standard vLLM / other hardware it still loads and serves as an ordinary compressed-tensors W4A16 model — you just get the stock Triton W4A16 path, not the custom prefill kernel.
```bash
# gfx1151 (Strix Halo) custom build; see the homelab-ops vllm-strix-halo image
# --enforce-eager: HIP graphs freeze on gfx1151 (vllm#32180)
# --language-model-only: text-only serve; skips the vision encoder
VLLM_USE_TRITON_AWQ=1 VLLM_ROCM_USE_AITER=0 \
vllm serve twigboy2000/Qwen3.6-35B-A3B-W4A16-g32 \
  --served-model-name qwen3.6-35b \
  --enforce-eager \
  --language-model-only \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --max-model-len 131072
```
Pairs well with the z-lab/Qwen3.6-35B-A3B-DFlash drafter for speculative decoding (see
limitations re: acceptance).
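For a feel of what the acceptance figures in the Limitations section (~44% vs ~20%) mean for speculative decoding, the sketch below uses the standard simplified model in which each drafted token is accepted independently with probability α, so a step with k drafted tokens yields (1 − α^(k+1)) / (1 − α) tokens on average. This is an idealization: how acceptance was measured here is not stated, real acceptance is not independent per position, and the draft length of 4 is a hypothetical value, not the DFlash setting.

```python
def expected_tokens_per_step(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each drafted token is accepted independently with probability
    `acceptance` and that the first rejection ends the step (the target
    model still contributes one token of its own).
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    if acceptance == 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


DRAFT_LEN = 4  # hypothetical draft length, for illustration only
before = expected_tokens_per_step(0.44, DRAFT_LEN)  # AutoAWQ-era acceptance
after = expected_tokens_per_step(0.20, DRAFT_LEN)   # acceptance with this quant
print(f"tokens per step: {before:.2f} -> {after:.2f}")
```

Under these assumptions the drafter still helps, but the gain over plain decoding is noticeably smaller.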
Limitations
- DFlash speculative-decode acceptance regresses vs AutoAWQ (~44% → ~20%). The symmetric quant's token distribution diverges from what the `z-lab` DFlash drafter expects, so speculation lands less often. Decode throughput held up in testing, but the spec-decode margin shrinks. An AWQ-symmetric variant may recover this.
- First-prompt cold start: the MMQ kernel JIT-autotunes on the first M≥32 prefill (~46 s, once per process); warm thereafter. A startup warmup request hides it.
- Vision tower is not quantized — this is a language-model-only serving quant.
- Calibrated on general chat (ultrachat); a code-heavy calibration set may shift quality slightly for coding tasks.
- Quality vs the FP16 base: standard ~4-bit GPTQ g32 loss; the small group size keeps it modest.
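The startup warmup mentioned in the cold-start item could look something like the sketch below. It assumes the server exposes vLLM's OpenAI-compatible `/v1/completions` route on `localhost:8000` and uses the `qwen3.6-35b` served name from the Usage section. The helper name is hypothetical, and treating each repeated word as at least one token is an assumption, so the prompt is sized well above the M≥32 threshold.

```python
import json
import urllib.request


def build_warmup_request(base_url: str, model: str,
                         min_tokens: int = 64) -> urllib.request.Request:
    """Build a one-token completion request with a prompt long enough
    to exercise the prefill path (hypothetical helper)."""
    payload = {
        "model": model,
        "prompt": " ".join(["warmup"] * min_tokens),
        "max_tokens": 1,  # only the prefill matters for the warmup
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_warmup_request("http://localhost:8000", "qwen3.6-35b")

# To send it once the server is up (the first call may take a while):
# with urllib.request.urlopen(request, timeout=300) as response:
#     response.read()
```

Running this once from the container entrypoint or a readiness hook would move the one-time autotune cost out of the first user request.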
Reproduction
Recipe: quant_ct_w4a16.py (loads via AutoModelForImageTextToText, GPTQModifier W4A16 g32
symmetric, ignores vision tower + MoE router/gate + GDN state params; pre-flight + post-verify
guards). Produced on a CUDA host (NVIDIA GB10 / RTX 4090) in ~30–60 min — the output
compressed-tensors checkpoint is portable and serves on the gfx1151 ROCm node.
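The ignore list plus pre-flight guard described above can be illustrated with a small stdlib-only sketch. It is not taken from `quant_ct_w4a16.py`: the regex patterns and module names are hypothetical stand-ins, and the real names depend on the checkpoint. The idea shown is that every ignore pattern must match at least one module, so a typo cannot silently cause the vision tower or router to be quantized.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical patterns; real module names depend on the checkpoint.
IGNORE_PATTERNS = [
    r"(^|\.)visual\.",    # vision tower
    r"\.mlp\.gate$",      # MoE router/gate
    r"\.linear_attn\.",   # Gated-DeltaNet (GDN) parameters
]


def split_modules(names: Iterable[str],
                  patterns: List[str]) -> Tuple[List[str], List[str]]:
    """Split module names into (to_quantize, ignored).

    Raises ValueError if any pattern matches nothing (pre-flight guard).
    """
    hits = {p: 0 for p in patterns}
    to_quantize, ignored = [], []
    for name in names:
        matched = [p for p in patterns if re.search(p, name)]
        if matched:
            ignored.append(name)
            for p in matched:
                hits[p] += 1
        else:
            to_quantize.append(name)
    unused = [p for p, count in hits.items() if count == 0]
    if unused:
        raise ValueError(f"ignore patterns matched nothing: {unused}")
    return to_quantize, ignored


# Hypothetical module names, for illustration only.
modules = [
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.experts.0.gate_proj",
    "model.language_model.layers.0.linear_attn.A_log",
    "model.language_model.layers.3.self_attn.q_proj",
]
to_quantize, ignored = split_modules(modules, IGNORE_PATTERNS)
print("quantize:", to_quantize)
print("ignore:  ", ignored)
```

Note that the router pattern is anchored at the end of the name so that expert projections such as `gate_proj` are still quantized.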
Acknowledgements
- Qwen for the base model.
- hec-ovi/vllm-awq4-qwen for the gfx1151 vLLM patch bundle and the AWQ-INT4 MMQ HIP kernel that makes the prefill win possible.
- llm-compressor / `compressed-tensors`.