AxisQuant

Qwen3.6-27b-gptq-int4

License: apache-2.0

Author: Prashant Takale

Model Compression

Metric	BF16 baseline	GPTQ INT4 (this model)
VRAM at load	~54 GB	~14 GB (3.9× smaller)
Bits / weight	16	4.29 (3.7× fewer)
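
The figures in the table follow from simple arithmetic on parameter count and bits per weight. The sketch below reproduces them, assuming a nominal 27B parameters (taken from the model name, not from an exact count) and 1 GB = 1e9 bytes; the function name is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 27e9  # assumption: nominal parameter count implied by the model name

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4.29)

print(f"BF16: {bf16_gb:.1f} GB, INT4: {int4_gb:.1f} GB")
print(f"bits-per-weight ratio: {bf16_gb / int4_gb:.2f}x")
```

The 3.9× figure in the table comes from dividing the rounded VRAM numbers (54 / 14), while the 3.7× figure is the exact ratio 16 / 4.29.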

Benchmarks

Note: MMLU-Redux uses a 1500-sample subset; other tasks are full. Decoding/prompts/filters are lm-eval-harness defaults, so absolute scores may differ from the official Qwen3.6-27B numbers. The goal is the BF16↔INT4 delta under identical conditions, not exact replication of the baseline.

Both models were evaluated under identical conditions with lm-evaluation-harness: greedy decoding (temperature=0), enable_thinking=False, seed=0. Long-CoT tasks use max_gen_toks=4096. HumanEval is served via /v1/completions (raw, no chat template) so that the harness's \ndef / \nclass stop sequences fire correctly.
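
To see why the raw completions endpoint matters for HumanEval, here is a minimal sketch of stop-sequence truncation. It is not the harness's own code; `truncate_at_stop` and both sample strings are made up for illustration. A raw completion continues the function body, so the stop sequence only fires once the model starts a new definition. A chat-style answer tends to restate the function, so the same stop sequence can fire before any useful code is kept.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


STOPS = ["\ndef", "\nclass"]

# Hypothetical raw completion: continues the prompt's function body.
raw_completion = "    return a + b\n\ndef unrelated():\n    pass\n"

# Hypothetical chat-style answer: restates the whole function after a preamble.
chat_style = "Sure, here is the code:\ndef add(a, b):\n    return a + b"

print(repr(truncate_at_stop(raw_completion, STOPS)))
print(repr(truncate_at_stop(chat_style, STOPS)))
```

In the first case the function body survives and the trailing definition is dropped; in the second, everything after the preamble is cut.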

Inference Performance

Single-stream measurement on the same hardware, identical request (337 input / 42 output tokens):

INT4 delivers ~2.4× higher throughput and ~2.4× lower latency in single-stream serving: the bandwidth savings from 4-bit weights translate almost 1:1 into decode-time speed-up (output tok/s and TPOT).
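
For readers who want to reproduce the two metrics named above, this is a small sketch using the common definitions: TPOT as decode time averaged over the tokens after the first, and output tok/s as output tokens over total latency. The function names are illustrative, and the timings are hypothetical placeholders, not measurements from this model.

```python
def tpot_seconds(total_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Time per output token: decode time averaged over tokens after the first."""
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (total_latency_s - ttft_s) / (output_tokens - 1)


def output_tokens_per_second(total_latency_s: float, output_tokens: int) -> float:
    """End-to-end output throughput for a single request."""
    return output_tokens / total_latency_s


OUTPUT_TOKENS = 42  # output length of the request described above

# Hypothetical timings for illustration only; these are NOT measured results.
baseline = {"total_s": 4.2, "ttft_s": 0.1}
quantized = {"total_s": 2.15, "ttft_s": 0.1}

base_tpot = tpot_seconds(baseline["total_s"], baseline["ttft_s"], OUTPUT_TOKENS)
quant_tpot = tpot_seconds(quantized["total_s"], quantized["ttft_s"], OUTPUT_TOKENS)

print(f"TPOT speed-up: {base_tpot / quant_tpot:.2f}x")
print(f"baseline tok/s: {output_tokens_per_second(baseline['total_s'], OUTPUT_TOKENS):.1f}")
```

Substitute your own measured total latency and time-to-first-token to compare the two checkpoints.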

Quantization recipe

Setting	Value
Method	GPTQ
Bits	4 (weight-only)
Group size	128
`desc_act`	True (activation-order)
`damp_percent`	0.01
Symmetric	True
Calibration	C4 (`en`), 256 samples × 2048 tokens
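
To make the "Bits", "Group size" and "Symmetric" settings concrete, here is a minimal pure-Python sketch of symmetric 4-bit quantization with one scale per group of 128 weights. It uses plain round-to-nearest and is not GPTQ itself: the error-compensating weight updates, activation ordering (`desc_act`) and damping are left out, and real kernels pack and store codes differently. All function names are illustrative.

```python
import random


def quantize_group_symmetric(weights, bits=4):
    """Round-to-nearest symmetric quantization of one group sharing a single scale."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups and quantize each group independently."""
    return [
        quantize_group_symmetric(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weights

groups = quantize_row(row)
restored = [w for codes, scale in groups for w in dequantize_group(codes, scale)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_err:.5f}")
```

Because each group carries its own scale, an outlier weight only coarsens the grid for its 128 neighbours rather than for the whole row.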

The vision encoder (model.visual.*) is intentionally left in BF16 — only the language-model weights are quantized.

Usage

With GPTQModel (recommended)

```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model_id  = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model     = GPTQModel.load(model_id, device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "Explain GPTQ in one sentence."}]
text   = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out    = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

With transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id  = "AxisQuant/Qwen3.6-27b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model     = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True,
)
```

Hardware

Weights: 18 GB on disk · ~14 GB VRAM at load
Single-GPU friendly: comfortably fits on a 24 GB consumer card (RTX 3090 / 4090) for short-to-mid context
Long context (64K+ tokens): H100 80 GB or A100 80 GB recommended
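
The jump from a 24 GB card to an 80 GB card at long context is driven mostly by the KV cache, which grows linearly with sequence length. The sketch below uses the generic formula for a standard full-attention transformer with a 16-bit cache. The architecture values are hypothetical placeholders, not the real Qwen3.6-27B configuration, and the actual model may use a different attention layout, so treat the output as an order-of-magnitude illustration only.

```python
def kv_cache_gb(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """KV-cache size in GB (1e9 bytes): keys plus values for every layer and token."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch
    return total_bytes / 1e9


# Hypothetical architecture values for illustration; NOT the real model config.
HYPOTHETICAL = dict(num_layers=64, num_kv_heads=8, head_dim=128)

for seq_len in (8_192, 65_536):
    size = kv_cache_gb(seq_len, **HYPOTHETICAL)
    print(f"{seq_len:>6} tokens -> {size:.2f} GB of KV cache")
```

Read the real layer count, KV-head count and head dimension from the model's config.json before relying on the estimate, and add the result to the ~14 GB of weights.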

Limitations

Only the language-model weights are quantized; the vision encoder remains in BF16
Calibration set was English C4 — heavy non-English or domain-specific workloads may benefit from re-quantizing on a matching corpus
Thinking mode (enable_thinking=True) works but is significantly slower — enable only when reasoning quality matters more than latency

License

Inherits the license of the base model. See the Qwen/Qwen3.6-27B model page for terms.

Citation

Base model

```bibtex
@misc{qwen3.6-27b,
    title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
    author = {{Qwen Team}},
    month  = {April},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}
```

Quantization method

```bibtex
@article{frantar2022gptq,
    title   = {{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
    author  = {Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
    journal = {arXiv preprint arXiv:2210.17323},
    year    = {2022}
}
```

Model Details

Model Provider

AxisQuant

Model Tree

Base

Qwen/Qwen3.6-27B

Quantized

this model

Input Modalities

Text, Image, Video

Output Modalities

Text

Supported Functionality