waher

Qwen3.6-27B-W8W4A16-G128

License: apache-2.0

Description

This mixed-precision GPTQ checkpoint is intended for use with vLLM PR https://github.com/vllm-project/vllm/pull/41394, which enables an RDNA3W4A16LinearKernel.

W8W4A16 = 8-bit weights on the attention/SSM projections, 4-bit weights on the MLP, and BF16 for activations and MTP.

G128 = per-group symmetric quantization with a group size of 128.
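To make the naming concrete, here is a minimal sketch of per-group symmetric quantization at 8 and 4 bits with a group size of 128. It uses plain round-to-nearest rather than GPTQ's error-compensating procedure, and the function names, the clamp range of [-qmax, qmax], and the toy weight row are assumptions made for illustration, not taken from this checkpoint or from vLLM.

```python
import random

def quantize_group_symmetric(weights, bits, group_size=128):
    """Round-to-nearest symmetric quantization with one scale per group.

    Returns (codes, scales). "Symmetric" means no zero-point is stored:
    each group is described by a single scale and signed integer codes.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit (assumed range)
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales

def dequantize(codes, scales, group_size=128):
    """Map integer codes back to floats using the scale of their group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

def rms_error(a, b):
    """Root-mean-square difference between two equal-length lists."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]  # toy weight row: 4 groups of 128

for bits in (8, 4):
    codes, scales = quantize_group_symmetric(row, bits)
    err = rms_error(row, dequantize(codes, scales))
    print(f"{bits}-bit: {len(scales)} scales, RMS error {err:.6f}")
```

On this toy row the 8-bit reconstruction error is much smaller than the 4-bit one, which is the usual motivation for spending 8 bits on the more sensitive projections and 4 bits on the MLP.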

Quantization Error by Layer

[Figure: per-layer quantization error]

Calibration Data

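| Source | N | MeanLen | MinLen | MaxLen | StdDev |
|---|---|---|---|---|---|
| agentic_coding | 26 | 26343 | 23867 | 28529 | 1429 |
| c4 | 41 | 1696 | 1071 | 10671 | 1478 |
| cauldron | 41 | 219 | 67 | 1309 | 294 |
| gsm8k | 31 | 449 | 255 | 1152 | 173 |
| lambda_hermes | 31 | 70179 | 27155 | 177338 | 45201 |
| multilingual | 31 | 910 | 56 | 3280 | 764 |
| openorca | 41 | 648 | 110 | 2204 | 453 |
| tool_calling | 152 | 5734 | 2182 | 20475 | 2721 |
| ultrachat | 41 | 2697 | 1117 | 5319 | 909 |
| zake7749_qwen36 | 77 | 14969 | 9108 | 32785 | 4694 |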

Aggregate Metrics

Evaluated at a sequence length of 2048 tokens (prompts ranged from 16 to 793 tokens).

| Metric | Value | Interpretation |
|---|---|---|
| Full KLD | 0.129 | KL divergence over full vocabulary |
| Top-20 KLD | 0.107 | KL divergence over top-20 tokens (generation-relevant) |
| Normalized KLD | 0.213 | KLD / BF16 entropy; comparable across configurations |
| Top-1 Accuracy | 93.4% | % of tokens where top prediction matches BF16 |
| Top-5 Accuracy | 99.4% | % of tokens where BF16 top-1 is in quantized top-5 |

Reference BF16 statistics: mean entropy 0.603 nats/token, max probability 0.807.
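The metrics above can be sketched per token as follows. This is an illustrative reading of the definitions, not the evaluation script behind the reported numbers: the function names are hypothetical, the top-k KLD is assumed to renormalize both distributions over the BF16 top-k set, and the toy logits are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the BF16 reference, q the quantized model."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def top_k(p, k):
    """Indices of the k largest probabilities, highest first."""
    return sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]

def renormalize(p, idx):
    """Restrict p to the given indices and rescale to sum to 1."""
    total = sum(p[i] for i in idx)
    return [p[i] / total for i in idx]

def token_metrics(ref_logits, quant_logits, k=20):
    p, q = softmax(ref_logits), softmax(quant_logits)
    ref_top = top_k(p, k)  # assumption: the top-k set is chosen by the BF16 model
    return {
        "full_kld": kl_divergence(p, q),
        "topk_kld": kl_divergence(renormalize(p, ref_top), renormalize(q, ref_top)),
        "entropy": entropy(p),
        "top1_match": top_k(q, 1)[0] == ref_top[0],
        "top5_hit": ref_top[0] in top_k(q, 5),
    }

def normalized_kld(mean_kld, mean_entropy):
    """KLD divided by the BF16 entropy, as in the table's definition."""
    return mean_kld / mean_entropy

# Toy six-token vocabulary; real use would pass full-vocabulary logits per position.
ref = [4.0, 2.0, 1.0, 0.0, -1.0, -2.0]
quant = [3.6, 2.3, 1.1, 0.1, -1.2, -1.9]
print(token_metrics(ref, quant, k=3))
print(round(normalized_kld(0.129, 0.603), 3))
```

Dividing the reported Full KLD by the reported mean BF16 entropy gives roughly 0.21, in line with the Normalized KLD row; the small gap in the third decimal is presumably rounding of the published inputs.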

Performance by Task Category

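| Category | Samples | Mean KLD | Normalized KLD | Top-1 Acc | Top-5 Acc |
|---|---|---|---|---|---|
| Tool selection | 3 | 0.019 | 0.029 | 97.7% | 100.0% |
| Tool definitions | 3 | 0.052 | 0.083 | 94.5% | 99.8% |
| Error recovery | 3 | 0.052 | 0.080 | 94.7% | 99.9% |
| Orchestration | 3 | 0.102 | 0.185 | 94.2% | 99.4% |
| Multi-turn | 3 | 0.146 | 0.287 | 92.1% | 99.5% |
| Batch operations | 3 | 0.176 | 0.425 | 93.9% | 99.2% |
| Edge cases | 5 | 0.116 | 0.196 | 91.7% | 99.5% |
| Nested JSON | 3 | 0.278 | 0.451 | 90.7% | 98.1% |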

Known Limitations

Four samples exhibit elevated KLD (>0.25) where the model's weight quantization introduces measurable divergence from BF16 behavior:

| Sample | Description | KLD | nKLD | Context |
|---|---|---|---|---|
| #16 | Multi-location weather (parallel calls) | 0.452 | 1.134 | High-confidence but quantization-sensitive |
| #18 | Tool registry with schemas | 0.375 | 0.657 | Deep nested structures |
| #20 | Monitoring alert config | 0.344 | 0.542 | Complex nested JSON |
| #25 | Overlapping tool calls | 0.460 | 0.870 | Multiple calls in single message |
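As a rough cross-check of the figures above, the sketch below applies the 0.25 threshold and backs out the BF16 entropy each sample would need for its KLD and nKLD to be consistent. It assumes that the per-sample nKLD is simply the sample's KLD divided by its BF16 entropy, which this page does not confirm; the variable and function names are illustrative.

```python
# (sample id, KLD, nKLD) copied from the table of elevated-KLD samples
samples = [
    ("#16", 0.452, 1.134),
    ("#18", 0.375, 0.657),
    ("#20", 0.344, 0.542),
    ("#25", 0.460, 0.870),
]

KLD_THRESHOLD = 0.25        # threshold quoted in the text
AGGREGATE_ENTROPY = 0.603   # reported mean BF16 entropy, nats/token

def implied_entropy(kld, nkld):
    """BF16 entropy implied by nKLD = KLD / entropy (an assumption)."""
    return kld / nkld

for sample_id, kld, nkld in samples:
    h = implied_entropy(kld, nkld)
    flag = "elevated" if kld > KLD_THRESHOLD else "ok"
    relation = "below" if h < AGGREGATE_ENTROPY else "above"
    print(f"{sample_id}: KLD {kld:.3f} ({flag}), implied entropy {h:.3f} nats, "
          f"{relation} the aggregate mean")
```

Under that assumption, sample #16 has the lowest implied entropy of the four, which would fit its "high-confidence" note: a modest absolute divergence is large relative to a confident reference distribution, pushing nKLD above 1.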

Model provider: waher

Base model: Qwen/Qwen3.6-27B (this model is a quantized derivative)

Input modalities: Video, Text, Image

Output modalities: Text
