vrfai/Qwen3.6-27B-FP8

License: apache-2.0

FP8 Quantization Details

Base model	Qwen/Qwen3.6-27B
Quantization	W8A8 FP8 — weights FP8 static, activations FP8 static
Strategy	`tensor` (per-tensor symmetric, memoryless minmax)
Format	`compressed-tensors` (native vLLM support)
Tool	vllm-project/llm-compressor
Requires	NVIDIA Ada Lovelace / Hopper / Blackwell (SM 89+)
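
To make the "per-tensor symmetric, memoryless minmax" strategy concrete, here is a minimal pure-Python sketch of how a single static scale could be derived and applied. It assumes the FP8 E4M3 format (largest finite value 448), which this card does not state explicitly, and the helper names are illustrative only, not llm-compressor's API.

```python
import math

FP8_E4M3_MAX = 448.0  # assumption: E4M3 format, largest finite value


def round_to_e4m3(v: float) -> float:
    """Round v (already within +/-448) to the nearest E4M3-representable value."""
    if v == 0.0:
        return 0.0
    _, e = math.frexp(abs(v))   # abs(v) = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)        # clamp to the minimum normal exponent (subnormals)
    step = 2.0 ** (exp - 3)     # 3 mantissa bits: 8 steps per power of two
    return round(v / step) * step


def minmax_scale(values):
    """One scale for the whole tensor, from the current min/max only (memoryless)."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Symmetric quantization: divide by the scale, clamp, round to the FP8 grid."""
    out = []
    for v in values:
        q = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale))
        out.append(round_to_e4m3(q))
    return out


def dequantize(quantized, scale):
    """Map FP8 grid values back to the original range."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 2.0]
scale = minmax_scale(weights)
restored = dequantize(quantize(weights, scale), scale)
```

"Static" here means the scale is computed once at calibration time and stored in the checkpoint, rather than recomputed per request as in dynamic activation quantization.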

What's Quantized / What's Not

Same selective strategy as the NVFP4 variant — sensitive components are preserved in BF16:

Component	Precision	Reason
FFN / MLP — all 64 transformer layers	FP8	High parameter density, stable under quantization
Full-attention projections (q/k/v/o) — 16 GQA layers	FP8	Standard attention, tolerant to 8-bit
DeltaNet / Linear-attention projections — 48 layers	BF16	Gated linear recurrence sensitive to numerical errors
Vision encoder — all 27 blocks + merger	BF16	Vision tower preserved for multimodal quality
`lm_head`	BF16	Excluded via the recipe's `ignore` list

Quantization Config (llm-compressor)

yaml
# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: FP8
  # static W8A8, per-tensor symmetric
  ignore:
    - lm_head
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    - re:model\.language_model\.layers\.\d+\.linear_attn\..*
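
The `ignore` list above mixes exact module names with `re:`-prefixed regular expressions. The sketch below shows one way to check which modules such a list would exclude. It assumes `re:` entries are matched from the start of the module name and that other entries must match exactly; the real matching rules in llm-compressor may differ, and the module names in the usage lines are hypothetical.

```python
import re

# Ignore entries copied from the recipe above.
IGNORE = [
    "lm_head",
    r"re:model\.visual\.blocks\.\d+\..*",
    "model.visual.merger.linear_fc1",
    "model.visual.merger.linear_fc2",
    r"re:model\.language_model\.layers\.\d+\.linear_attn\..*",
]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if module_name would be left unquantized (kept in BF16)."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Assumption: regex entries are anchored at the start of the name.
            if re.match(entry[3:], module_name):
                return True
        elif module_name == entry:
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "model.language_model.layers.3.linear_attn.in_proj",
    "model.language_model.layers.3.mlp.down_proj",
]:
    print(name, "->", "BF16" if is_ignored(name) else "FP8")
```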

Quick Start (vLLM)

bash
vllm serve vrfai/Qwen3.6-27B-FP8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --dtype auto \
  --trust-remote-code \
  --tensor-parallel-size 2

Single GPU (SM 89+, with enough VRAM for the ~34 GB of weights plus KV cache; a 24 GB card is not sufficient):

bash
vllm serve vrfai/Qwen3.6-27B-FP8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --dtype auto \
  --trust-remote-code

Quantization Script

The recipes and scripts used to quantize this model can be found in the following repository:

VinRobotics/model-quantization-recipes

Python (Transformers)

python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "vrfai/Qwen3.6-27B-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain quantization in one paragraph."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

OpenAI-compatible API

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="vrfai/Qwen3.6-27B-FP8",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

NVFP4 vs FP8 Comparison

	NVFP4	FP8 (this)
Weight bits	4	8
Activation bits	4 (dynamic)	8 (static)
Model size	~26 GB	~34 GB
Hardware	Blackwell only (SM 120+)	Ada Lovelace / Hopper / Blackwell (SM 89+)
Speed	Faster	Slightly slower
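
The ~34 GB figure exceeds the 27 GB one would expect from 27B parameters at one byte each, because part of the model stays in BF16. The following back-of-the-envelope sketch estimates the implied BF16 share. It assumes a nominal 27e9 parameters and decimal gigabytes, and it ignores quantization scales and file metadata, so the result is a rough estimate and not a measured value.

```python
def weight_bytes(total_params, bf16_fraction, quant_bits=8):
    """Approximate checkpoint size for a mixed BF16 / quantized model."""
    bf16 = total_params * bf16_fraction * 2                   # 2 bytes per BF16 weight
    quant = total_params * (1 - bf16_fraction) * quant_bits / 8
    return bf16 + quant


def implied_bf16_fraction(total_params, size_bytes, quant_bits=8):
    """Invert weight_bytes: which BF16 share explains an observed size?"""
    q = quant_bits / 8
    return (size_bytes / total_params - q) / (2 - q)


fraction = implied_bf16_fraction(27e9, 34e9)
print(f"implied BF16 share: {fraction:.1%}")
```

Under these assumptions roughly a quarter of the parameters remain in BF16, in line with the linear-attention projections, vision encoder, and `lm_head` being excluded from quantization.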

Tested Environment

Component	Version
vLLM	0.19.1
Transformers	5.6.2
PyTorch	2.10.0+cu128
CUDA	12.8 (nvcc 12.8.61)
llm-compressor	compressed-tensors 0.14.0.1
GPU	2× NVIDIA RTX 5090 (tensor-parallel-size 2)

Best Practices

Mode	temperature	top_p	top_k	presence_penalty
Thinking — general	1.0	0.95	20	0.0
Thinking — coding	0.6	0.95	20	0.0
Non-thinking / instruct	0.7	0.80	20	1.5

Thinking mode:

python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # passed straight to the chat template as a variable
)
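
When calling the vLLM server through the OpenAI-compatible API, the recommended settings from the table can be kept in one place. The sketch below builds the keyword arguments for `client.chat.completions.create`. It assumes the server accepts `top_k` and `chat_template_kwargs` through `extra_body`, since neither is a standard OpenAI parameter; the mode keys and the `request_kwargs` helper are illustrative names.

```python
# Sampling presets copied from the Best Practices table.
SAMPLING = {
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "presence_penalty": 1.5},
}


def request_kwargs(mode, messages, model="vrfai/Qwen3.6-27B-FP8"):
    """Build keyword arguments for client.chat.completions.create(**kwargs)."""
    preset = SAMPLING[mode]
    return {
        "model": model,
        "messages": messages,
        "temperature": preset["temperature"],
        "top_p": preset["top_p"],
        "presence_penalty": preset["presence_penalty"],
        # Non-standard parameters travel in extra_body (assumed vLLM behavior).
        "extra_body": {
            "top_k": preset["top_k"],
            "chat_template_kwargs": {"enable_thinking": mode != "non_thinking"},
        },
    }


kwargs = request_kwargs("thinking_coding", [{"role": "user", "content": "Hello!"}])
# response = client.chat.completions.create(**kwargs)  # needs a running server
```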

Credits

Original model: Qwen Team (Alibaba Group)
FP8 quantization: vrfai
Quantization framework: vllm-project/llm-compressor

Below is the original model card from Qwen/Qwen3.6-27B:

Note: This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

Qwen3.6 Highlights

Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
Thinking Preservation: reasoning context from historical messages is retained, streamlining iterative development.

Benchmark Results

For more details, please refer to our blog post Qwen3.6-27B.

Model Overview

Type: Causal Language Model with Vision Encoder
Number of Parameters: 27B
Context Length: 262,144 natively and extensible up to 1,010,000 tokens

Citation

bibtex
@misc{qwen3.6-27b,
    title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
    author = {{Qwen Team}},
    month  = {April},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}

Model Details

Model provider: vrfai
Base model: Qwen/Qwen3.6-27B (this model is a quantized derivative)
Input modalities: Text, Image, Video
Output modalities: Text

Supported Functionality