Author: Prashant Takale
Model Compression

| | BF16 baseline | GPTQ INT4 (this model) |
|---|---|---|
| VRAM at load | ~54 GB | ~14 GB (3.9× smaller) |
| Bits / weight | 16 | 4.29 (3.7× fewer) |
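
As a rough sanity check on the figures in the table, weight memory can be estimated as parameters × bits per weight ÷ 8. The sketch below assumes about 27 billion parameters (inferred from the model name, not stated in this card) and ignores activations, KV cache, and the BF16 vision encoder, so it will not match measured VRAM exactly.

```python
# Back-of-the-envelope weight-memory estimate.
# Assumption: ~27e9 parameters, inferred from the model name.
N_PARAMS = 27e9


def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


bf16_gb = weight_gb(N_PARAMS, 16)    # 16 bits per weight
int4_gb = weight_gb(N_PARAMS, 4.29)  # effective bits per weight from the table

print(f"BF16 weights: {bf16_gb:.1f} GB")
print(f"INT4 weights: {int4_gb:.1f} GB")
print(f"Bits-per-weight ratio: {16 / 4.29:.2f}x")
print(f"Reported VRAM ratio:   {54 / 14:.2f}x")
```

Both estimates land close to the reported ~54 GB and ~14 GB, which suggests the VRAM figures are dominated by weight storage.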
Benchmarks
Note: MMLU-Redux uses a 1500-sample subset; other tasks are full. Decoding/prompts/filters are lm-eval-harness defaults, so absolute scores may differ from the official Qwen3.6-27B numbers. The goal is the BF16↔INT4 delta under identical conditions, not exact replication of the baseline.
Both models were evaluated under identical conditions with lm-evaluation-harness: greedy decoding (temperature=0), enable_thinking=False, seed=0. Long-CoT tasks use max_gen_toks=4096. HumanEval is served via /v1/completions (raw, no chat template) so that the harness's \ndef / \nclass stop sequences fire correctly.
| Section | Task | Metric | N | BF16 | INT4 | Δ (pp) |
|---|---|---|---|---|---|---|
| Multiple-choice (Science) | ARC-Challenge | acc_norm | 1172 | 63.91 | 64.08 | +0.17 |
| Math (Word problems) | GSM8K | exact_match (strict) | 1319 |
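
To illustrate the raw-completions setup used for HumanEval, here is a minimal sketch of such a request against an OpenAI-compatible server. The base URL, the `max_tokens` value, and both helper names are assumptions made for illustration; they do not come from the evaluation setup itself. The point is that the prompt is sent verbatim, with no chat template, and that generation stops as soon as the model begins a new top-level definition.

```python
import json
import urllib.request

# Hypothetical endpoint: adjust to your own OpenAI-compatible server.
BASE_URL = "http://localhost:8000/v1"
MODEL = "AxisQuant/Qwen3.6-27b-gptq-int4"


def build_completion_payload(prompt: str, max_tokens: int = 512) -> dict:
    """Request body for a raw completion: greedy decoding plus code stop sequences."""
    return {
        "model": MODEL,
        "prompt": prompt,              # sent as-is, no chat template applied
        "temperature": 0,              # greedy decoding
        "seed": 0,
        "max_tokens": max_tokens,      # assumed value, not from the card
        "stop": ["\ndef", "\nclass"],  # stop when a new top-level definition starts
    }


def post_completion(prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(build_completion_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]


# Inspect the payload without contacting a server.
payload = build_completion_payload("def add(a, b):\n    ")
print(json.dumps(payload, indent=2))
```

With a chat endpoint the server would wrap the prompt in role markers, and the model would tend to answer in prose or fenced code, so stop sequences tuned for bare Python continuations might never match.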

Single-stream measurement on the same hardware, identical request (337 input / 42 output tokens):
| Metric | BF16 | INT4 | Δ |
|---|---|---|---|
| Output token throughput (tok/s) | 25.55 | 62.34 | +143.99% |
| Request throughput (req/s) | 0.61 | 1.48 | +142.62% |
| Time to first token (ms) | 79.98 | 77.43 | −3.19% |
| Time per output token (ms) | 38.11 | 14.52 | −61.90% |
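INT4 delivers ~2.4× higher output throughput and ~2.6× lower time per output token in single-stream serving, while time to first token is essentially unchanged (about −3%). The bandwidth savings from 4-bit weights translate mainly into decode-time speed-up (output tok/s and TPOT), not into faster prefill.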
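
The relative changes and speed-up factors can be recomputed directly from the BF16 and INT4 columns. The short sketch below does this using the values in the table; the function names are illustrative.

```python
# (BF16, INT4) pairs copied from the table above.
measurements = {
    "Output tok/s": (25.55, 62.34),
    "Requests/s": (0.61, 1.48),
    "TTFT (ms)": (79.98, 77.43),
    "TPOT (ms)": (38.11, 14.52),
}


def pct_change(baseline: float, new: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100


def factor(larger: float, smaller: float) -> float:
    """How many times larger one value is than the other."""
    return larger / smaller


for name, (bf16, int4) in measurements.items():
    print(f"{name:>13}: {pct_change(bf16, int4):+.2f}%")

throughput_gain = factor(62.34, 25.55)  # INT4 tok/s over BF16 tok/s
tpot_gain = factor(38.11, 14.52)        # BF16 TPOT over INT4 TPOT
print(f"Throughput: {throughput_gain:.2f}x higher, TPOT: {tpot_gain:.2f}x lower")
```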
Quantization recipe
| Setting | Value |
|---|---|
| Method | GPTQ |
| Bits | 4 (weight-only) |
| Group size | 128 |
| desc_act | True (activation-order) |
| damp_percent | 0.01 |
| Symmetric | True |
| Calibration | C4 (en), 256 samples × 2048 tokens |
The vision encoder (model.visual.*) is intentionally left in BF16 — only the language-model weights are quantized.
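
To make the "4-bit, group size 128, symmetric" settings concrete, here is a minimal pure-Python sketch of group-wise symmetric quantization. It uses plain round-to-nearest, not GPTQ: GPTQ additionally uses calibration data to compensate for rounding error, which this sketch omits. The integer range [-7, 7] and all function names are assumptions chosen for illustration; real kernels may store values differently.

```python
import random

GROUP_SIZE = 128  # one scale per 128 weights, as in the recipe
QMAX = 7          # assumed symmetric 4-bit grid: integers in [-7, 7]


def quantize_group(weights):
    """Round-to-nearest symmetric quantization of one group; returns (ints, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    ints = [max(-QMAX, min(QMAX, round(w / scale))) for w in weights]
    return ints, scale


def dequantize_group(ints, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in ints]


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a weight row into groups and quantize each with its own scale."""
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]  # synthetic weights
groups = quantize_row(row)
restored = [w for ints, scale in groups for w in dequantize_group(ints, scale)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_err:.5f}")
```

Because each group has its own scale, an outlier weight only coarsens the grid for its own 128 neighbours, not for the whole row. The rounding error per weight is bounded by half of that group's scale.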
Usage
With GPTQModel (recommended)
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model_id = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = GPTQModel.load(model_id, device_map="auto", trust_remote_code=True)

# Build a chat prompt with thinking mode disabled
messages = [{"role": "user", "content": "Explain GPTQ in one sentence."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding; decode only the newly generated tokens
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
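With Transformers
# Loading a GPTQ checkpoint through Transformers generally needs a GPTQ backend
# installed (for example the gptqmodel package)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True,
)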
Hardware
- Weights: 18 GB on disk · ~14 GB VRAM at load
- Single-GPU friendly: comfortably fits on a 24 GB consumer card (RTX 3090 / 4090) for short-to-mid context
- Long context (64K+ tokens): H100 80 GB or A100 80 GB recommended
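
The long-context recommendation follows from KV-cache growth, which scales linearly with sequence length. The sketch below uses the standard formula for full attention in every layer. The layer count, KV-head count, and head dimension are hypothetical placeholders, not values taken from the Qwen3.6-27B config, and the formula may overestimate memory if the model uses a different attention layout.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2, batch: int = 1) -> int:
    """KV-cache size for full attention: keys and values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Hypothetical architecture values, for illustration only.
cache = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, seq_len=65_536)

weights_gb = 14  # VRAM at load, from the list above
print(f"KV cache at 64K tokens: {cache / 2**30:.1f} GiB "
      f"on top of ~{weights_gb} GB of weights")
```

Under these assumed values, a single 64K-token sequence with a 16-bit cache already pushes the total past 24 GB, which is consistent with the recommendation of an 80 GB card for long contexts.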
Limitations
- Only the language-model weights are quantized; the vision encoder remains in BF16
- Calibration set was English C4 — heavy non-English or domain-specific workloads may benefit from re-quantizing on a matching corpus
- Thinking mode (enable_thinking=True) works but is significantly slower; enable it only when reasoning quality matters more than latency
License
Inherits the license of the base model. See the Qwen/Qwen3.6-27B model page for terms.
Citation
Base model
@misc{qwen3.6-27b,
title = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
author = {{Qwen Team}},
month = {April},
year = {2026},
url = {https://qwen.ai/blog?id=qwen3.6-27b}
}
Quantization method
@article{frantar2022gptq,
title = {{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
author = {Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
journal = {arXiv preprint arXiv:2210.17323},
year = {2022}
}