License: apache-2.0 · Author: Prashant Takale
Qwen3.6-27b-gptq-int4
GPTQ INT4 quantization of Qwen/Qwen3.6-27B. 3× smaller. ~2.4× faster.
Model Compression

| Metric | BF16 baseline | GPTQ INT4 (this model) |
|---|---|---|
| VRAM at load | ~54 GB | ~14 GB (3.9× smaller) |
| Bits / weight | 16 | 4.29 (3.7× fewer) |
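
As a rough sanity check on the table above, weight memory scales with parameter count times bits per weight. The sketch below assumes a nominal 27 billion parameters (taken from the model name, not an exact count) and ignores the KV cache, activations, and framework overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Assumption: nominal parameter count from the model name, not an exact figure.
NUM_PARAMS = 27e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)    # 54.0, in line with "~54 GB"
int4_gb = weight_memory_gb(NUM_PARAMS, 4.29)  # about 14.5, in line with "~14 GB"

print(f"BF16: {bf16_gb:.1f} GB, INT4: {int4_gb:.1f} GB")
print(f"Bits-per-weight ratio: {16 / 4.29:.1f}x")
```

Both estimates land close to the reported load-time figures, which suggests the VRAM numbers are dominated by the weights themselves.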
Benchmarks
Note: MMLU-Redux uses a 1500-sample subset; other tasks are full. Decoding/prompts/filters are lm-eval-harness defaults, so absolute scores may differ from the official Qwen3.6-27B numbers. The goal is the BF16↔INT4 delta under identical conditions, not exact replication of the baseline.
Both models were evaluated under identical conditions with lm-evaluation-harness: greedy decoding (`temperature=0`), `enable_thinking=False`, `seed=0`. Long-CoT tasks use `max_gen_toks=4096`; HumanEval is served via `/v1/completions` (raw, no chat template) so that the harness's `\ndef` / `\nclass` stop sequences fire correctly.
| Section | Task | Metric | N | BF16 | INT4 | Δ (pp) |
|---|---|---|---|---|---|---|
| Multiple-choice (Science) | ARC-Challenge | acc_norm | 1172 | 63.91 | 64.08 | +0.17 |
| Math (Word problems) | GSM8K | exact_match (strict) | 1319 | 96.36 | 96.82 | +0.46 |
| Knowledge | MMLU-Redux | exact_match (strict-match) | 1500 | 89.19 | 88.42 | −0.77 |
| STEM Reasoning | GPQA-Diamond | exact_match (flexible-extract) | 198 | 71.72 | 68.69 | −3.03 |
| Coding | HumanEval | pass@1 (create_test) | 164 | 85.98 | 77.44 | −8.54 |
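
With evaluation sets this small, part of each delta may be sampling noise. A minimal sketch, assuming each item is an independent pass/fail trial (a binomial approximation that is not part of the original evaluation), estimates the standard error of each BF16 score in percentage points for comparison with the Δ column.

```python
import math

def binomial_se_pp(score_pct: float, n: int) -> float:
    """Standard error of an accuracy, in percentage points, under a binomial model."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# (task, N, BF16 score, INT4 score) copied from the table above.
results = [
    ("ARC-Challenge", 1172, 63.91, 64.08),
    ("GSM8K", 1319, 96.36, 96.82),
    ("MMLU-Redux", 1500, 89.19, 88.42),
    ("GPQA-Diamond", 198, 71.72, 68.69),
    ("HumanEval", 164, 85.98, 77.44),
]

for task, n, bf16, int4 in results:
    delta = int4 - bf16
    se = binomial_se_pp(bf16, n)
    print(f"{task:14s} delta={delta:+.2f} pp  SE~{se:.2f} pp  |delta|/SE={abs(delta) / se:.1f}")
```

Under this approximation, the deltas on ARC-Challenge, GSM8K, MMLU-Redux, and GPQA-Diamond are each within about one standard error, while the HumanEval drop is roughly three. Because both models answer the same items, a paired test would give a tighter comparison than this rough estimate.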
Inference Performance

Single-stream measurement on the same hardware, identical request (337 input / 42 output tokens):
| Metric | BF16 | INT4 | Δ |
|---|---|---|---|
| Output token throughput (tok/s) | 25.55 | 62.34 | +143.99% |
| Request throughput (req/s) | 0.61 | 1.48 | +142.62% |
| Time to first token (ms) | 79.98 | 77.43 | −3.19% |
| Time per output token (ms) | 38.11 | 14.52 | −61.91% |
| End-to-end latency (ms) | 1642.66 | 672.70 | −59.05% |
INT4 delivers ~2.4× higher throughput and ~2.4× lower end-to-end latency at single-stream. The gain comes almost entirely from decode (output tok/s and TPOT, the latter ~2.6× lower), where 4-bit weights cut the memory traffic per generated token; time to first token is essentially unchanged.
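
The latency rows in the table are mutually consistent under a simple single-stream model: end-to-end latency is approximately the time to first token plus one TPOT for each remaining output token, and throughput follows from that latency. The sketch below is an assumed decomposition for illustration, not the benchmark tool's own formula.

```python
def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """First token arrives after TTFT; each later token adds one TPOT."""
    return ttft_ms + (output_tokens - 1) * tpot_ms

OUTPUT_TOKENS = 42  # from the benchmark request (337 input / 42 output tokens)

for name, ttft, tpot in [("BF16", 79.98, 38.11), ("INT4", 77.43, 14.52)]:
    latency_ms = e2e_latency_ms(ttft, tpot, OUTPUT_TOKENS)
    tok_per_s = OUTPUT_TOKENS / (latency_ms / 1000.0)
    req_per_s = 1000.0 / latency_ms
    print(f"{name}: {latency_ms:.0f} ms end-to-end, {tok_per_s:.1f} tok/s, {req_per_s:.2f} req/s")
```

The estimates fall within a fraction of a percent of the measured end-to-end latencies (1642.66 ms and 672.70 ms). Since TTFT barely changes between the two models, the lower TPOT accounts for nearly all of the speed-up.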
Quantization recipe
| Setting | Value |
|---|---|
| Method | GPTQ |
| Bits | 4 (weight-only) |
| Group size | 128 |
| desc_act | True (activation-order) |
| damp_percent | 0.01 |
| Symmetric | True |
| Calibration | C4 (en), 256 samples × 2048 tokens |
| Tool | GPTQModel v7 |
| Effective bits / weight | 4.29 BPW |
The vision encoder (model.visual.*) is intentionally left in BF16 — only the language-model weights are quantized.
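
To make the group size and symmetric settings concrete, the sketch below applies plain round-to-nearest 4-bit quantization to one weight row, with one scale per group of 128 values. It is a simplified illustration only: GPTQ additionally reorders columns (desc_act) and compensates rounding error using second-order statistics from the calibration data, none of which is shown here. The function names are hypothetical and not part of GPTQModel.

```python
import random

def quantize_group_symmetric(weights, bits=4):
    """Round-to-nearest symmetric quantization of one group; returns (ints, scale)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Symmetric: the zero point is fixed at 0, so only the scale is kept per group.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def fake_quantize_row(row, group_size=128, bits=4):
    """Quantize then dequantize a row group by group, to inspect the rounding error."""
    restored = []
    for start in range(0, len(row), group_size):
        q, scale = quantize_group_symmetric(row[start:start + group_size], bits)
        restored.extend(v * scale for v in q)
    return restored

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # two groups of 128 synthetic weights
restored = fake_quantize_row(row)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"max abs rounding error: {max_err:.5f}")
```

Smaller groups give each scale fewer values to cover, which lowers rounding error at the cost of storing more scales; a group size of 128 is a common middle ground.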
Usage
With GPTQModel (recommended)
```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model_id = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = GPTQModel.load(model_id, device_map="auto", trust_remote_code=True)

# Build a chat prompt with thinking mode disabled.
messages = [{"role": "user", "content": "Explain GPTQ in one sentence."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Greedy decoding; print only the newly generated tokens.
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
With transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
```
Hardware
- Weights: 18 GB on disk · ~14 GB VRAM at load
- Single-GPU friendly: comfortably fits on a 24 GB consumer card (RTX 3090 / 4090) for short-to-mid context
- Long context (64K+ tokens): H100 80 GB or A100 80 GB recommended
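
The context-length guidance follows from how much VRAM is left after loading the weights. A back-of-the-envelope sketch is below; the per-token KV-cache size and the overhead reserve are placeholder assumptions chosen only for illustration, not measured values for this model, and should be replaced with figures from your own serving stack.

```python
def max_context_tokens(total_vram_gb, weights_gb, kv_bytes_per_token, overhead_gb=2.0):
    """Tokens of KV cache that fit in the VRAM left after weights and a fixed reserve."""
    free_bytes = (total_vram_gb - weights_gb - overhead_gb) * 1e9
    return max(0, int(free_bytes // kv_bytes_per_token))

WEIGHTS_GB = 14               # "~14 GB VRAM at load" from the list above
KV_BYTES_PER_TOKEN = 200_000  # hypothetical placeholder, NOT a measured value

for card, vram_gb in [("24 GB card", 24), ("80 GB card", 80)]:
    tokens = max_context_tokens(vram_gb, WEIGHTS_GB, KV_BYTES_PER_TOKEN)
    print(f"{card}: roughly {tokens:,} tokens of KV cache under these assumptions")
```

Whatever the true per-token cost, the available headroom grows several-fold between a 24 GB and an 80 GB card, which is why the larger cards are recommended for long contexts.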
Limitations
- Only the language-model weights are quantized; the vision encoder remains in BF16
- Calibration set was English C4 — heavy non-English or domain-specific workloads may benefit from re-quantizing on a matching corpus
- Thinking mode (`enable_thinking=True`) works but is significantly slower; enable it only when reasoning quality matters more than latency
License
Inherits the license of the base model. See the Qwen/Qwen3.6-27B model page for terms.
Citation
Base model
```bibtex
@misc{qwen3.6-27b,
  title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
  author = {{Qwen Team}},
  month  = {April},
  year   = {2026},
  url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}
```
Quantization method
```bibtex
@article{frantar2022gptq,
  title   = {{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
  author  = {Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
  journal = {arXiv preprint arXiv:2210.17323},
  year    = {2022}
}
```
Model details
- Model provider: AxisQuant
- Base model: Qwen/Qwen3.6-27B (this model is its quantized derivative)
- Input modalities: Video, Text, Image
- Output modalities: Text