twigboy2000

Qwen3.6-35B-A3B-W4A16-g32

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

What's in the box

Table

Base model	Qwen3.6-35B-A3B (hybrid Gated-DeltaNet + MoE, 35B total / ~3B active, multimodal)
Quant method	GPTQ, symmetric
Format	`compressed-tensors`, W4A16, group_size 32 (`uint4b8`)
Scope	language model only — vision tower kept at full precision
Architecture on disk	`Qwen3_5MoeForConditionalGeneration` (multimodal wrapper preserved), served `--language-model-only`
Size	~20 GB
Calibration	512 samples (ultrachat_200k), 2048 tokens

Why these specific choices

This quant is unusual on three axes, each forced by a hard requirement — see QUANT_GFX1151.md for the full story.

Symmetric (GPTQ), not AWQ. vLLM's CompressedTensorsWNA16MoEMethod asserts symmetric quantization for MoE experts (Only symmetric quantization is supported for MoE). AWQ's asymmetric (zero-point) W4A16 loads fine as AutoAWQ but is rejected on the compressed-tensors path. GPTQ's W4A16 preset is symmetric and maps to the kernel's uint4b8 type.
group_size 32. The custom MMQ HIP prefill kernel binds only to compressed-tensors W4A16 at g32 (not the usual g128).
Multimodal-preserving packaging. Quantized via AutoModelForImageTextToText so the checkpoint keeps the Qwen3_5MoeForConditionalGeneration wrapper + vision_config. A text-only repackaging (AutoModelForCausalLM) produces Qwen3_5MoeForCausalLM and crashes vLLM's VL renderer (Expected Qwen3_5MoeConfig, found Qwen3_5MoeTextConfig).

Performance (AMD Strix Halo gfx1151, vLLM + the MMQ prefill kernel)

Benchmarked against the prior AutoAWQ build on the same node, ~30K-token agentic-coding prompt, 400-token output, warm:

Table with columns: metric, AutoAWQ baseline, this model (W4A16-sym + kernel), delta
metric	AutoAWQ baseline	this model (W4A16-sym + kernel)	delta
TTFT (prefill)	43.8 s	3.5 s	12.4× faster
decode ITL p50	356 ms (2.8 tok/s)	325 ms (3.1 tok/s)	~10%
total wall-clock	97.5 s	49.7 s	~2×

The prefill speedup comes from hec-ovi/vllm-awq4-qwen's AWQ-INT4 MMQ HIP kernel for gfx1151 (WMMA iu8 inner loop), which binds to this checkpoint's symmetric W4A16 g32 layers (can_implement -> True on all 380 expert layers, wt=uint4b8).

Usage

This checkpoint is built for a custom gfx1151 vLLM image (the hec-ovi lineage + the Patch-16 MMQ kernel registration). The headline numbers require that kernel. On standard vLLM / other hardware it still loads and serves as an ordinary compressed-tensors W4A16 model — you just get the stock Triton W4A16 path, not the custom prefill kernel.

bash
# gfx1151 (Strix Halo) custom build — see the homelab-ops vllm-strix-halo image
VLLM_USE_TRITON_AWQ=1 VLLM_ROCM_USE_AITER=0 \
vllm serve twigboy2000/Qwen3.6-35B-A3B-W4A16-g32 \
  --served-model-name qwen3.6-35b \
  --enforce-eager \                 # HIP graphs freeze on gfx1151 (vllm#32180)
  --language-model-only \           # text-only serve; skips the vision encoder
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --max-model-len 131072

Pairs well with the z-lab/Qwen3.6-35B-A3B-DFlash drafter for speculative decoding (see limitations re: acceptance).

Limitations

DFlash speculative-decode acceptance regresses vs AutoAWQ (~44% → ~20%). The symmetric quant's token distribution diverges from what the z-lab DFlash drafter expects, so speculation lands less often. Decode throughput held up in testing, but the spec-decode margin shrinks. An AWQ-symmetric variant may recover this.
First-prompt cold start: the MMQ kernel JIT-autotunes on the first M≥32 prefill (~46 s, once per process); warm thereafter. A startup warmup request hides it.
Vision tower is not quantized — this is a language-model-only serving quant.
Calibrated on general chat (ultrachat); a code-heavy calibration set may shift quality slightly for coding tasks.
Quality vs the FP16 base: standard ~4-bit GPTQ g32 loss; the small group size keeps it modest.

Reproduction

Recipe: quant_ct_w4a16.py (loads via AutoModelForImageTextToText, GPTQModifier W4A16 g32 symmetric, ignores vision tower + MoE router/gate + GDN state params; pre-flight + post-verify guards). Produced on a CUDA host (NVIDIA GB10 / RTX 4090) in ~30–60 min — the output compressed-tensors checkpoint is portable and serves on the gfx1151 ROCm node.

Acknowledgements

Qwen for the base model.
hec-ovi/vllm-awq4-qwen for the gfx1151 vLLM patch bundle and the AWQ-INT4 MMQ HIP kernel that makes the prefill win possible.
llm-compressor / compressed-tensors.

Model provider

twigboy2000

Model tree

Base

Qwen/Qwen3.6-35B-A3B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

What's in the box

Table

Base model	Qwen3.6-35B-A3B (hybrid Gated-DeltaNet + MoE, 35B total / ~3B active, multimodal)
Quant method	GPTQ, symmetric
Format	`compressed-tensors`, W4A16, group_size 32 (`uint4b8`)
Scope	language model only — vision tower kept at full precision
Architecture on disk	`Qwen3_5MoeForConditionalGeneration` (multimodal wrapper preserved), served `--language-model-only`
Size	~20 GB
Calibration	512 samples (ultrachat_200k), 2048 tokens

Why these specific choices

This quant is unusual on three axes, each forced by a hard requirement — see QUANT_GFX1151.md for the full story.

Symmetric (GPTQ), not AWQ. vLLM's CompressedTensorsWNA16MoEMethod asserts symmetric quantization for MoE experts (Only symmetric quantization is supported for MoE). AWQ's asymmetric (zero-point) W4A16 loads fine as AutoAWQ but is rejected on the compressed-tensors path. GPTQ's W4A16 preset is symmetric and maps to the kernel's uint4b8 type.
group_size 32. The custom MMQ HIP prefill kernel binds only to compressed-tensors W4A16 at g32 (not the usual g128).
Multimodal-preserving packaging. Quantized via AutoModelForImageTextToText so the checkpoint keeps the Qwen3_5MoeForConditionalGeneration wrapper + vision_config. A text-only repackaging (AutoModelForCausalLM) produces Qwen3_5MoeForCausalLM and crashes vLLM's VL renderer (Expected Qwen3_5MoeConfig, found Qwen3_5MoeTextConfig).

Performance (AMD Strix Halo gfx1151, vLLM + the MMQ prefill kernel)

Benchmarked against the prior AutoAWQ build on the same node, ~30K-token agentic-coding prompt, 400-token output, warm:

Table with columns: metric, AutoAWQ baseline, this model (W4A16-sym + kernel), delta
metric	AutoAWQ baseline	this model (W4A16-sym + kernel)	delta
TTFT (prefill)	43.8 s	3.5 s	12.4× faster
decode ITL p50	356 ms (2.8 tok/s)	325 ms (3.1 tok/s)	~10%
total wall-clock	97.5 s	49.7 s	~2×

Usage

bash
# gfx1151 (Strix Halo) custom build — see the homelab-ops vllm-strix-halo image
VLLM_USE_TRITON_AWQ=1 VLLM_ROCM_USE_AITER=0 \
vllm serve twigboy2000/Qwen3.6-35B-A3B-W4A16-g32 \
  --served-model-name qwen3.6-35b \
  --enforce-eager \                 # HIP graphs freeze on gfx1151 (vllm#32180)
  --language-model-only \           # text-only serve; skips the vision encoder
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --max-model-len 131072

Pairs well with the z-lab/Qwen3.6-35B-A3B-DFlash drafter for speculative decoding (see limitations re: acceptance).

Limitations

DFlash speculative-decode acceptance regresses vs AutoAWQ (~44% → ~20%). The symmetric quant's token distribution diverges from what the z-lab DFlash drafter expects, so speculation lands less often. Decode throughput held up in testing, but the spec-decode margin shrinks. An AWQ-symmetric variant may recover this.
First-prompt cold start: the MMQ kernel JIT-autotunes on the first M≥32 prefill (~46 s, once per process); warm thereafter. A startup warmup request hides it.
Vision tower is not quantized — this is a language-model-only serving quant.
Calibrated on general chat (ultrachat); a code-heavy calibration set may shift quality slightly for coding tasks.
Quality vs the FP16 base: standard ~4-bit GPTQ g32 loss; the small group size keeps it modest.

Reproduction

Acknowledgements

Qwen for the base model.
hec-ovi/vllm-awq4-qwen for the gfx1151 vLLM patch bundle and the AWQ-INT4 MMQ HIP kernel that makes the prefill win possible.
llm-compressor / compressed-tensors.

Qwen3.6-35B-A3B-W4A16-g32

Get help setting up a custom Dedicated Endpoints.

README

What's in the box

Why these specific choices

Performance (AMD Strix Halo gfx1151, vLLM + the MMQ prefill kernel)

Usage

Limitations

Reproduction

Acknowledgements

Explore FriendliAI today

README

What's in the box

Why these specific choices

Performance (AMD Strix Halo gfx1151, vLLM + the MMQ prefill kernel)

Usage

Limitations

Reproduction

Acknowledgements