twigboy2000

Qwen3.6-35B-A3B-W4A16-g32

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

What's in the box

Table
Base modelQwen3.6-35B-A3B (hybrid Gated-DeltaNet + MoE, 35B total / ~3B active, multimodal)
Quant methodGPTQ, symmetric
Formatcompressed-tensors, W4A16, group_size 32 (uint4b8)
Scopelanguage model only — vision tower kept at full precision
Architecture on diskQwen3_5MoeForConditionalGeneration (multimodal wrapper preserved), served --language-model-only
Size~20 GB
Calibration512 samples (ultrachat_200k), 2048 tokens

Why these specific choices

This quant is unusual on three axes, each forced by a hard requirement — see QUANT_GFX1151.md for the full story.

  1. Symmetric (GPTQ), not AWQ. vLLM's CompressedTensorsWNA16MoEMethod asserts symmetric quantization for MoE experts (Only symmetric quantization is supported for MoE). AWQ's asymmetric (zero-point) W4A16 loads fine as AutoAWQ but is rejected on the compressed-tensors path. GPTQ's W4A16 preset is symmetric and maps to the kernel's uint4b8 type.
  2. group_size 32. The custom MMQ HIP prefill kernel binds only to compressed-tensors W4A16 at g32 (not the usual g128).
  3. Multimodal-preserving packaging. Quantized via AutoModelForImageTextToText so the checkpoint keeps the Qwen3_5MoeForConditionalGeneration wrapper + vision_config. A text-only repackaging (AutoModelForCausalLM) produces Qwen3_5MoeForCausalLM and crashes vLLM's VL renderer (Expected Qwen3_5MoeConfig, found Qwen3_5MoeTextConfig).

Performance (AMD Strix Halo gfx1151, vLLM + the MMQ prefill kernel)

Benchmarked against the prior AutoAWQ build on the same node, ~30K-token agentic-coding prompt, 400-token output, warm:

Table
metricAutoAWQ baselinethis model (W4A16-sym + kernel)delta
TTFT (prefill)43.8 s3.5 s12.4× faster
decode ITL p50356 ms (2.8 tok/s)325 ms (3.1 tok/s)~10%
total wall-clock97.5 s49.7 s~2×

The prefill speedup comes from hec-ovi/vllm-awq4-qwen's AWQ-INT4 MMQ HIP kernel for gfx1151 (WMMA iu8 inner loop), which binds to this checkpoint's symmetric W4A16 g32 layers (can_implement -> True on all 380 expert layers, wt=uint4b8).

Usage

This checkpoint is built for a custom gfx1151 vLLM image (the hec-ovi lineage + the Patch-16 MMQ kernel registration). The headline numbers require that kernel. On standard vLLM / other hardware it still loads and serves as an ordinary compressed-tensors W4A16 model — you just get the stock Triton W4A16 path, not the custom prefill kernel.

bash

# gfx1151 (Strix Halo) custom build — see the homelab-ops vllm-strix-halo image
VLLM_USE_TRITON_AWQ=1 VLLM_ROCM_USE_AITER=0 \
vllm serve twigboy2000/Qwen3.6-35B-A3B-W4A16-g32 \
--served-model-name qwen3.6-35b \
--enforce-eager \ # HIP graphs freeze on gfx1151 (vllm#32180)
--language-model-only \ # text-only serve; skips the vision encoder
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--max-model-len 131072

Pairs well with the z-lab/Qwen3.6-35B-A3B-DFlash drafter for speculative decoding (see limitations re: acceptance).

Limitations

  • DFlash speculative-decode acceptance regresses vs AutoAWQ (~44% → ~20%). The symmetric quant's token distribution diverges from what the z-lab DFlash drafter expects, so speculation lands less often. Decode throughput held up in testing, but the spec-decode margin shrinks. An AWQ-symmetric variant may recover this.
  • First-prompt cold start: the MMQ kernel JIT-autotunes on the first M≥32 prefill (~46 s, once per process); warm thereafter. A startup warmup request hides it.
  • Vision tower is not quantized — this is a language-model-only serving quant.
  • Calibrated on general chat (ultrachat); a code-heavy calibration set may shift quality slightly for coding tasks.
  • Quality vs the FP16 base: standard ~4-bit GPTQ g32 loss; the small group size keeps it modest.

Reproduction

Recipe: quant_ct_w4a16.py (loads via AutoModelForImageTextToText, GPTQModifier W4A16 g32 symmetric, ignores vision tower + MoE router/gate + GDN state params; pre-flight + post-verify guards). Produced on a CUDA host (NVIDIA GB10 / RTX 4090) in ~30–60 min — the output compressed-tensors checkpoint is portable and serves on the gfx1151 ROCm node.

Acknowledgements

Model provider

twigboy2000

Model tree

Base

Qwen/Qwen3.6-35B-A3B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today