waher

Qwen3.6-27B-W8W4A16-G128

License: apache-2.0

Description

This mixed-precision GPTQ quantization is intended for use with vLLM PR https://github.com/vllm-project/vllm/pull/41394, which adds an RDNA3W4A16LinearKernel.

W8W4A16 = 8-bit weights on attention/SSM projections, 4-bit weights on MLP layers, and 16-bit (BF16) activations. MTP is also kept in BF16.

G128 = per-group symmetric quantization, group size 128
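
To make the G128 notation concrete, here is a minimal NumPy sketch of per-group symmetric quantization at 4 and 8 bits. It uses plain round-to-nearest, not the GPTQ procedure used to produce this checkpoint. The function names, the random test matrix, and the 16-bit scale size are assumptions made for illustration.

```python
import numpy as np

def quantize_symmetric_grouped(w, bits=4, group_size=128):
    """Round-to-nearest symmetric quantization with one scale per group.

    Groups run along the input dimension (axis 1) of a 2-D weight matrix.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "in_features must divide into groups"
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # Symmetric: a single scale per group and no zero point.
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Map integer codes back to floats and restore the 2-D shape."""
    return (q.astype(np.float32) * scales).reshape(q.shape[0], -1)

def effective_bits_per_weight(bits, group_size=128, scale_bits=16):
    """Storage cost per weight, assuming one 16-bit scale per group."""
    return bits + scale_bits / group_size

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 256)).astype(np.float32)  # stand-in weight matrix

q4, s4 = quantize_symmetric_grouped(w, bits=4)
q8, s8 = quantize_symmetric_grouped(w, bits=8)
err4 = np.abs(w - dequantize(q4, s4)).mean()
err8 = np.abs(w - dequantize(q8, s8)).mean()
print(f"mean abs error: 4-bit {err4:.4f}, 8-bit {err8:.4f}")
print("bits per weight:", effective_bits_per_weight(4), effective_bits_per_weight(8))
```

The 8-bit path gives a smaller reconstruction error than the 4-bit path on the same matrix, which is the trade-off behind a mixed W8/W4 layout.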

Quantization Error by Layer

[Figure: per-layer quantization error]

Calibration Data

Source	N	MeanLen	MinLen	MaxLen	StdDev
agentic_coding	26	26343	23867	28529	1429
c4	41	1696	1071	10671	1478
cauldron	41	219	67	1309	294
gsm8k	31	449	255	1152	173
lambda_hermes	31	70179	27155	177338	45201
multilingual	31	910	56	3280	764
openorca	41	648	110	2204	453
tool_calling	152	5734	2182	20475	2721
ultrachat	41	2697	1117	5319	909
zake7749_qwen36	77	14969	9108	32785	4694

Aggregate Metrics

Evaluated at a sequence length of 2048 (prompts ranged from 16 to 793 tokens).

Metric	Value	Interpretation
Full KLD	0.129	KL divergence over full vocabulary
Top-20 KLD	0.107	KL divergence over top-20 tokens (generation-relevant)
Normalized KLD	0.213	KLD / BF16 entropy — comparable across configurations
Top-1 Accuracy	93.4%	% of tokens where top prediction matches BF16
Top-5 Accuracy	99.4%	% of tokens where BF16 top-1 is in quantized top-5

Reference BF16 statistics: Mean entropy 0.603 nats/token, max probability 0.807
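
As a rough illustration of how metrics of this kind can be computed from reference (BF16) and quantized logits, here is a NumPy sketch. It is not the evaluation script behind the numbers above. The function names are hypothetical, and two details are assumptions: the top-k KLD renormalizes both distributions over the reference model's top-k tokens, and the normalized KLD is mean KLD divided by mean reference entropy. The second assumption is at least consistent with the reported values, since 0.129 / 0.603 ≈ 0.214.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def divergence_metrics(ref_logits, quant_logits, k=20):
    """Compare quantized logits to reference logits, both shaped (tokens, vocab)."""
    ref_lp = log_softmax(ref_logits)
    q_lp = log_softmax(quant_logits)
    ref_p = np.exp(ref_lp)

    # Full KLD: KL(reference || quantized) per token, over the whole vocabulary.
    full_kld = (ref_p * (ref_lp - q_lp)).sum(axis=-1)

    # Top-k KLD (assumed definition): restrict both models to the reference
    # top-k tokens and renormalize before taking the KL divergence.
    topk = np.argsort(-ref_logits, axis=-1)[:, :k]
    ref_top_lp = log_softmax(np.take_along_axis(ref_logits, topk, axis=-1))
    q_top_lp = log_softmax(np.take_along_axis(quant_logits, topk, axis=-1))
    topk_kld = (np.exp(ref_top_lp) * (ref_top_lp - q_top_lp)).sum(axis=-1)

    # Reference entropy in nats per token, used for normalization.
    entropy = -(ref_p * ref_lp).sum(axis=-1)

    # Top-1: quantized argmax matches the reference argmax.
    ref_top1 = ref_logits.argmax(axis=-1)
    top1 = quant_logits.argmax(axis=-1) == ref_top1
    # Top-5: the reference top-1 token appears in the quantized top-5.
    q_top5 = np.argsort(-quant_logits, axis=-1)[:, :5]
    top5 = (q_top5 == ref_top1[:, None]).any(axis=-1)

    return {
        "full_kld": float(full_kld.mean()),
        "topk_kld": float(topk_kld.mean()),
        "normalized_kld": float(full_kld.mean() / entropy.mean()),
        "top1_acc": float(top1.mean()),
        "top5_acc": float(top5.mean()),
    }

# Synthetic logits stand in for real model outputs.
rng = np.random.default_rng(0)
ref = rng.standard_normal((16, 100)) * 3.0
quant = ref + 0.1 * rng.standard_normal((16, 100))
metrics = divergence_metrics(ref, quant)
print(metrics)
```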

Performance by Task Category

Category	Samples	Mean KLD	Normalized KLD	Top-1 Acc	Top-5 Acc
Tool selection	3	0.019	0.029	97.7%	100.0%
Tool definitions	3	0.052	0.083	94.5%	99.8%
Error recovery	3

Known Limitations

Four samples exhibit elevated KLD (> 0.25), where weight quantization introduces measurable divergence from BF16 behavior:

Sample	Description	KLD	nKLD	Context
#16	Multi-location weather (parallel calls)	0.452	1.134	High-confidence but quantization-sensitive
#18	Tool registry with schemas	0.375	0.657	Deep nested structures
#20	Monitoring alert config	0.344	0.542	Complex nested JSON

Model Details

Model Provider: waher

Model Tree: Base Qwen/Qwen3.6-27B; Quantized: this model

Input Modalities: Text, Image, Video

Output Modalities