waher
Qwen3.6-27B-W8W4A16-G128
License: apache-2.0
Description
This mixed-precision GPTQ quantization is intended for use with vLLM PR https://github.com/vllm-project/vllm/pull/41394, which enables an RDNA3W4A16LinearKernel.
W8W4A16 = 8-bit weights on attention/SSM projections, 4-bit on MLP, BF16 activations/MTP.
G128 = per-group symmetric quantization, group size 128
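To make the G128 notation concrete, the following is a minimal sketch of per-group symmetric quantization with one scale per 128 input weights. It uses plain round-to-nearest, not the GPTQ error-compensation procedure used to produce this checkpoint, and the function names, the clipping range of ±(2^(bits-1) - 1), and the grouping along the input dimension are assumptions made for illustration.

```python
import numpy as np

def quantize_symmetric_per_group(w, bits, group_size=128):
    """Round-to-nearest symmetric quantization with one scale per group.

    Hypothetical helper: groups are taken along the input (last) dimension.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "input dim must be a multiple of the group size"
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    groups = w.reshape(out_features, in_features // group_size, group_size)
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    # Symmetric: no zero point, the scale maps max |w| in the group to qmax.
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(groups / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct a float matrix from integer codes and per-group scales."""
    return (q.astype(np.float32) * scale).reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 256)).astype(np.float32)

# 4-bit, as on the MLP weights; 8-bit, as on the attention/SSM projections.
q4, s4 = quantize_symmetric_per_group(w, bits=4)
q8, s8 = quantize_symmetric_per_group(w, bits=8)
err4 = np.abs(w - dequantize(q4, s4)).mean()
err8 = np.abs(w - dequantize(q8, s8)).mean()
print(f"mean abs error: 4-bit {err4:.4f}, 8-bit {err8:.4f}")
```

The 8-bit reconstruction error is much smaller than the 4-bit one, which is the motivation for spending the extra bits on the projections that are more sensitive to quantization.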
Quantization Error by Layer

Calibration Data
| Source | N | MeanLen | MinLen | MaxLen | StdDev |
|---|---|---|---|---|---|
| agentic_coding | 26 | 26343 | 23867 | 28529 | 1429 |
| c4 | 41 | 1696 | 1071 | 10671 | 1478 |
| cauldron | 41 | 219 | 67 | 1309 | 294 |
| gsm8k | 31 | 449 | 255 | 1152 | 173 |
| lambda_hermes | 31 | 70179 | 27155 | 177338 | 45201 |
| multilingual | 31 | 910 | 56 | 3280 | 764 |
| openorca | 41 | 648 | 110 | 2204 | 453 |
| tool_calling | 152 | 5734 | 2182 | 20475 | 2721 |
| ultrachat | 41 | 2697 | 1117 | 5319 | 909 |
| zake7749_qwen36 | 77 | 14969 | 9108 | 32785 | 4694 |
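As a quick sanity check on the calibration mix, the sketch below totals the sample counts from the table and estimates each source's share of the overall length budget. The length unit is taken as reported (presumably tokens, which the table does not state), and the totals are approximate because MeanLen is rounded.

```python
# (source, N, MeanLen) copied from the calibration table above.
calibration = [
    ("agentic_coding", 26, 26343),
    ("c4", 41, 1696),
    ("cauldron", 41, 219),
    ("gsm8k", 31, 449),
    ("lambda_hermes", 31, 70179),
    ("multilingual", 31, 910),
    ("openorca", 41, 648),
    ("tool_calling", 152, 5734),
    ("ultrachat", 41, 2697),
    ("zake7749_qwen36", 77, 14969),
]

total_samples = sum(n for _, n, _ in calibration)
# N * MeanLen approximates the summed length contributed by each source.
approx_length = {name: n * mean_len for name, n, mean_len in calibration}
total_length = sum(approx_length.values())

print(f"total samples: {total_samples}")
for name, length in sorted(approx_length.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {length / total_length:6.1%} of total length")
```

By sample count, tool_calling is the largest source, but by length the long-context sources (lambda_hermes, zake7749_qwen36, agentic_coding) dominate.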
Aggregate Metrics
Evaluated at a sequence length of 2048 (prompts ranged from 16 to 793 tokens).
| Metric | Value | Interpretation |
|---|---|---|
| Full KLD | 0.129 | KL divergence over full vocabulary |
| Top-20 KLD | 0.107 | KL divergence over top-20 tokens (generation-relevant) |
| Normalized KLD | 0.213 | KLD / BF16 entropy — comparable across configurations |
| Top-1 Accuracy | 93.4% | % of tokens where top prediction matches BF16 |
| Top-5 Accuracy | 99.4% | % of tokens where BF16 top-1 is in quantized top-5 |
Reference BF16 statistics: Mean entropy 0.603 nats/token, max probability 0.807
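The metrics in the table can be computed from paired logits of the BF16 and quantized models. The sketch below is one plausible implementation, not the evaluation script used for this card: the function names are hypothetical, the top-20 KLD is assumed to renormalize both distributions over the BF16 model's top-20 tokens, and the normalized KLD is assumed to be the ratio of mean KLD to mean BF16 entropy.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def compare_to_reference(ref_logits, quant_logits, k=20):
    """Compare quantized logits with reference logits, both shaped (tokens, vocab)."""
    ref_logp = log_softmax(ref_logits)
    q_logp = log_softmax(quant_logits)
    ref_p = np.exp(ref_logp)

    # Full KLD: KL(reference || quantized) over the whole vocabulary, in nats.
    full_kld = (ref_p * (ref_logp - q_logp)).sum(axis=-1).mean()

    # Top-k KLD: restrict both models to the reference top-k tokens and renormalize.
    topk = np.argsort(-ref_logits, axis=-1)[:, :k]
    ref_k = log_softmax(np.take_along_axis(ref_logits, topk, axis=-1))
    q_k = log_softmax(np.take_along_axis(quant_logits, topk, axis=-1))
    topk_kld = (np.exp(ref_k) * (ref_k - q_k)).sum(axis=-1).mean()

    # Reference entropy in nats/token, used to normalize the KLD.
    entropy = -(ref_p * ref_logp).sum(axis=-1).mean()

    ref_top1 = ref_logits.argmax(axis=-1)
    top1_acc = (quant_logits.argmax(axis=-1) == ref_top1).mean()
    # Top-5: the reference top-1 token appears among the quantized top-5.
    q_top5 = np.argsort(-quant_logits, axis=-1)[:, :5]
    top5_acc = (q_top5 == ref_top1[:, None]).any(axis=-1).mean()

    return {
        "full_kld": float(full_kld),
        "topk_kld": float(topk_kld),
        "normalized_kld": float(full_kld / entropy),
        "top1_acc": float(top1_acc),
        "top5_acc": float(top5_acc),
    }

# Synthetic logits stand in for real model outputs.
rng = np.random.default_rng(0)
ref = rng.standard_normal((8, 50))
noisy = ref + 0.5 * rng.standard_normal((8, 50))
print(compare_to_reference(ref, noisy))
```

With the aggregate values reported here, the same normalization gives 0.129 / 0.603 ≈ 0.214, in line with the tabulated normalized KLD of 0.213.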
Performance by Task Category
| Category | Samples | Mean KLD | Normalized KLD | Top-1 Acc | Top-5 Acc |
|---|---|---|---|---|---|
| Tool selection | 3 | 0.019 | 0.029 | 97.7% | 100.0% |
| Tool definitions | 3 | 0.052 | 0.083 | 94.5% | 99.8% |
| Error recovery | 3 | 0.052 | 0.080 | 94.7% | 99.9% |
| Orchestration | 3 | 0.102 | 0.185 | 94.2% | 99.4% |
| Multi-turn | 3 | 0.146 | 0.287 | 92.1% | 99.5% |
| Batch operations | 3 | 0.176 | 0.425 | 93.9% | 99.2% |
| Edge cases | 5 | 0.116 | 0.196 | 91.7% | 99.5% |
| Nested JSON | 3 | 0.278 | 0.451 | 90.7% | 98.1% |
Known Limitations
Four samples show elevated KLD (>0.25), where weight quantization introduces measurable divergence from BF16 behavior:
| Sample | Description | KLD | nKLD | Context |
|---|---|---|---|---|
| #16 | Multi-location weather (parallel calls) | 0.452 | 1.134 | High-confidence but quantization-sensitive |
| #18 | Tool registry with schemas | 0.375 | 0.657 | Deep nested structures |
| #20 | Monitoring alert config | 0.344 | 0.542 | Complex nested JSON |
| #25 | Overlapping tool calls | 0.460 | 0.870 | Multiple calls in single message |
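If the per-sample nKLD follows the same definition as the aggregate one (KLD divided by BF16 entropy, which is an assumption here), the BF16 entropy of each flagged sample can be recovered from the two columns. The short sketch below does that arithmetic; the dictionary and variable names are illustrative.

```python
# sample id -> (KLD, nKLD), copied from the table above.
flagged = {
    16: (0.452, 1.134),
    18: (0.375, 0.657),
    20: (0.344, 0.542),
    25: (0.460, 0.870),
}

KLD_THRESHOLD = 0.25

# Assumed relation: nKLD = KLD / entropy, so entropy = KLD / nKLD (nats/token).
implied_entropy = {sid: kld / nkld for sid, (kld, nkld) in flagged.items()}

for sid, (kld, nkld) in flagged.items():
    assert kld > KLD_THRESHOLD
    print(f"#{sid}: KLD {kld:.3f}, nKLD {nkld:.3f}, "
          f"implied BF16 entropy {implied_entropy[sid]:.3f} nats/token")
```

Under that assumption, sample #16 has the lowest implied entropy of the four (about 0.40 nats/token), which fits its "high-confidence" note: a similar absolute KLD weighs more heavily when the reference distribution is sharply peaked.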
Model provider: waher
Base model: Qwen/Qwen3.6-27B (this model is a quantized derivative)
Input modalities: Video, Text, Image
Output modalities: Text