License: apache-2.0

Author: Prashant Takale

Qwen3.6-27b-gptq-int4

GPTQ INT4 quantization of Qwen/Qwen3.6-27B. About 3× smaller on disk and ~2.4× faster at single-stream inference.


Model Compression

Memory & storage reduction

| | BF16 baseline | GPTQ INT4 (this model) |
|---|---|---|
| VRAM at load | ~54 GB | ~14 GB (3.9× smaller) |
| Bits / weight | 16 | 4.29 (3.7× fewer) |
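
The bits-per-weight row roughly accounts for the VRAM row. A back-of-the-envelope check, assuming a round 27 billion parameters (taken from the model name, not an exact count) and ignoring runtime overhead such as activations and the KV cache:

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 27e9  # assumption: nominal size from the model name

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4.29)

print(f"BF16: {bf16_gb:.1f} GB")
print(f"INT4: {int4_gb:.1f} GB")
print(f"ratio: {bf16_gb / int4_gb:.2f}x")
```

The estimate lands close to the ~54 GB and ~14 GB figures in the table; the small gap to the reported 3.9× is within the rounding of the "~" values.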

Benchmarks

Note: MMLU-Redux uses a 1500-sample subset; other tasks are full. Decoding/prompts/filters are lm-eval-harness defaults, so absolute scores may differ from the official Qwen3.6-27B numbers. The goal is the BF16↔INT4 delta under identical conditions, not exact replication of the baseline.

Both models were evaluated under identical conditions with lm-evaluation-harness: greedy decoding (temperature=0), enable_thinking=False, seed=0. Long-CoT tasks use max_gen_toks=4096. HumanEval is served via /v1/completions (raw, no chat template) so that the harness's \ndef / \nclass stop sequences fire correctly.
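
To illustrate why the stop sequences matter for HumanEval: a raw completion can keep going after the requested function body and start writing further definitions, and that extra text should not be scored. A minimal sketch of stop-sequence truncation follows; `truncate_at_stop` is a hypothetical helper written for this illustration, not the harness's actual code.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# The two stop sequences named above.
STOPS = ["\ndef", "\nclass"]

# A completion that finishes the function body, then starts a new function.
completion = "    return a + b\n\ndef unrelated():\n    pass\n"
print(repr(truncate_at_stop(completion, STOPS)))
```

Only the body of the requested function survives the cut; a completion with no stop sequence in it is returned unchanged.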

| Section | Task | Metric | N | BF16 | INT4 | Δ (pp) |
|---|---|---|---|---|---|---|
| Multiple-choice (Science) | ARC-Challenge | acc_norm | 1172 | 63.91 | 64.08 | +0.17 |
| Math (Word problems) | GSM8K | exact_match (strict) | 1319 | 96.36 | 96.82 | +0.46 |
| Knowledge | MMLU-Redux | exact_match (strict-match) | 1500 | 89.19 | 88.42 | −0.77 |
| STEM Reasoning | GPQA-Diamond | exact_match (flexible-extract) | 198 | 71.72 | 68.69 | −3.03 |
| Coding | HumanEval | pass@1 (create_test) | 164 | 85.98 | 77.44 | −8.54 |
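
The Δ column is the INT4 score minus the BF16 score, in percentage points. Multiplying a delta by N gives a rough sense of how many individual items changed outcome on net, which helps when reading deltas on small test sets. A sketch using the numbers from the table (the item counts are derived estimates, not reported results):

```python
# (task, N, BF16 score, INT4 score), copied from the benchmark table
RESULTS = [
    ("ARC-Challenge", 1172, 63.91, 64.08),
    ("GSM8K", 1319, 96.36, 96.82),
    ("MMLU-Redux", 1500, 89.19, 88.42),
    ("GPQA-Diamond", 198, 71.72, 68.69),
    ("HumanEval", 164, 85.98, 77.44),
]

def delta_pp(bf16: float, int4: float) -> float:
    """INT4 minus BF16, in percentage points."""
    return round(int4 - bf16, 2)

def net_items(n: int, delta: float) -> int:
    """Approximate net number of items whose outcome changed."""
    return round(n * delta / 100)

for task, n, bf16, int4 in RESULTS:
    d = delta_pp(bf16, int4)
    print(f"{task:13s} {d:+.2f} pp  ~{net_items(n, d):+d} items")
```

On the two smallest sets a single item is worth roughly 0.5 to 0.6 pp, so the GPQA-Diamond and HumanEval deltas correspond to about 6 and 14 items respectively.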

Inference Performance

Single-stream measurement on the same hardware, identical request (337 input / 42 output tokens):

| Metric | BF16 | INT4 | Δ |
|---|---|---|---|
| Output token throughput (tok/s) | 25.55 | 62.34 | +143.99% |
| Request throughput (req/s) | 0.61 | 1.48 | +142.62% |
| Time to first token (ms) | 79.98 | 77.43 | −3.19% |
| Time per output token (ms) | 38.11 | 14.52 | −61.91% |
| End-to-end latency (ms) | 1642.66 | 672.70 | −59.05% |

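INT4 delivers ~2.4× higher throughput and ~2.4× lower end-to-end latency at single-stream. Time to first token is essentially unchanged, so the gain comes from decoding: time per output token drops ~2.6×, which means most of the 3.7× reduction in bits per weight shows up as decode-time speed-up.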
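
The latency figures are consistent with the usual single-stream decomposition, end-to-end latency ≈ TTFT + TPOT × (output tokens − 1). A short check using the table values (the decomposition is a standard approximation assumed here, not something the benchmark output states):

```python
OUTPUT_TOKENS = 42  # from the request description above

def e2e_latency_ms(ttft_ms: float, tpot_ms: float, n_out: int = OUTPUT_TOKENS) -> float:
    """First token arrives after TTFT; each later token costs one TPOT."""
    return ttft_ms + tpot_ms * (n_out - 1)

bf16_est = e2e_latency_ms(79.98, 38.11)  # measured: 1642.66 ms
int4_est = e2e_latency_ms(77.43, 14.52)  # measured: 672.70 ms

print(f"BF16 estimated E2E: {bf16_est:.2f} ms")
print(f"INT4 estimated E2E: {int4_est:.2f} ms")
print(f"E2E speed-up:  {1642.66 / 672.70:.2f}x")
print(f"TPOT speed-up: {38.11 / 14.52:.2f}x")
```

Both estimates fall within a fraction of a millisecond of the measured end-to-end latencies, which confirms that decode time dominates this request.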


Quantization recipe

| Setting | Value |
|---|---|
| Method | GPTQ |
| Bits | 4 (weight-only) |
| Group size | 128 |
| desc_act | True (activation-order) |
| damp_percent | 0.01 |
| Symmetric | True |
| Calibration | C4 (en), 256 samples × 2048 tokens |
| Tool | GPTQModel v7 |
| Effective bits / weight | 4.29 BPW |

The vision encoder (model.visual.*) is intentionally left in BF16 — only the language-model weights are quantized.
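
To make the group size and symmetric settings concrete, here is a minimal sketch of group-wise symmetric 4-bit quantization using plain round-to-nearest. This is deliberately not GPTQ itself, which additionally uses calibration data to compensate for rounding error. The function names are hypothetical, and a signed [-8, 7] integer range is assumed for simplicity.

```python
def quantize_group(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization of one group (one shared scale)."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]

def quantize_row(row: list[float], group_size: int = 128) -> list[tuple[list[int], float]]:
    """Split a weight row into groups of `group_size`, each with its own scale."""
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]

# Toy row of 256 weights in [-0.1, 0.1] -> 2 groups of 128.
row = [((i * 37) % 101 - 50) / 500 for i in range(256)]
groups = quantize_row(row)
restored = [w for q, s in groups for w in dequantize_group(q, s)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), f"max abs error = {max_err:.4f}")
```

Each group carries its own scale, so the per-weight cost is 4 bits plus group metadata. Assuming a 16-bit scale per group of 128, that metadata adds 16/128 = 0.125 bits per weight, which is one reason the effective figure is above 4.0.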


Usage

With GPTQModel (recommended)

```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model_id = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = GPTQModel.load(model_id, device_map="auto", trust_remote_code=True)

# Build a chat prompt with thinking mode disabled (the setting used in the benchmarks).
messages = [{"role": "user", "content": "Explain GPTQ in one sentence."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding; decode only the newly generated tokens.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

With transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True,
)
```

Hardware

  • Weights: 18 GB on disk · ~14 GB VRAM at load
  • Single-GPU friendly: comfortably fits on a 24 GB consumer card (RTX 3090 / 4090) for short-to-mid context
  • Long context (64K+ tokens): H100 80 GB or A100 80 GB recommended
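
The context-length guidance follows from the KV cache growing linearly with sequence length on top of the ~14 GB of weights. A rough estimator for standard attention layers is sketched below. The layer count, KV-head count, and head dimension are placeholder values chosen for illustration, not the actual Qwen3.6-27B configuration, so read the real values from the model's config before relying on the numbers.

```python
def kv_cache_gb(n_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: keys and values for every layer and token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens
    return total_bytes / 1e9

WEIGHTS_GB = 14.0  # from the Hardware list above

# Hypothetical architecture values, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128

for n_tokens in (8_192, 65_536):
    kv = kv_cache_gb(n_tokens, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    print(f"{n_tokens:>6} tokens: KV ~{kv:.1f} GB, weights + KV ~{WEIGHTS_GB + kv:.1f} GB")
```

Under these placeholder values a short context leaves headroom on a 24 GB card, while a 64K context pushes the total past 24 GB, which matches the shape of the recommendation above.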

Limitations

  • Only the language-model weights are quantized; the vision encoder remains in BF16
  • Calibration set was English C4 — heavy non-English or domain-specific workloads may benefit from re-quantizing on a matching corpus
  • Thinking mode (enable_thinking=True) works but is significantly slower — enable only when reasoning quality matters more than latency

License

Inherits the license of the base model. See the Qwen/Qwen3.6-27B model page for terms.


Citation

Base model

```bibtex
@misc{qwen3.6-27b,
  title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
  author = {{Qwen Team}},
  month  = {April},
  year   = {2026},
  url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}
```

Quantization method

```bibtex
@article{frantar2022gptq,
  title   = {{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
  author  = {Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
  journal = {arXiv preprint arXiv:2210.17323},
  year    = {2022}
}
```

Model provider: AxisQuant

Model tree: base Qwen/Qwen3.6-27B; quantized: this model

Modalities: input Video, Text, Image; output Text
