vrfai
Qwen3.6-27B-FP8
License: apache-2.0
FP8 Quantization Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Quantization | W8A8 FP8: weights FP8 static, activations FP8 static |
| Strategy | tensor (per-tensor symmetric, memoryless minmax) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor |
| Requires | NVIDIA Ada Lovelace / Hopper / Blackwell (SM 89+) |
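As a rough illustration of what "per-tensor symmetric, memoryless minmax" means, the sketch below derives a single scale from a tensor's absolute maximum and rounds the scaled values onto the FP8 E4M3 grid (largest finite value 448). It is plain Python for readability and is not llm-compressor's implementation. The helper names `round_to_e4m3`, `quantize_per_tensor`, and `dequantize` are hypothetical, and E4M3 is assumed to be the FP8 format in use.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 (e4m3fn) format


def round_to_e4m3(value):
    """Round a float to the nearest E4M3-representable value, saturating at the max."""
    if value == 0.0:
        return 0.0
    sign = -1.0 if value < 0 else 1.0
    magnitude = min(abs(value), FP8_E4M3_MAX)
    # frexp returns magnitude = m * 2**exp with 0.5 <= m < 1, so the binade is exp - 1.
    _, exp = math.frexp(magnitude)
    binade = max(exp - 1, -6)    # -6 is the smallest normal exponent; below it are subnormals
    step = 2.0 ** (binade - 3)   # 3 mantissa bits give 8 steps per binade
    rounded = round(magnitude / step) * step  # Python's round() is round-half-to-even
    return sign * min(rounded, FP8_E4M3_MAX)


def quantize_per_tensor(values):
    """Symmetric per-tensor quantization: one scale, taken from this tensor's min/max only."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(codes, scale):
    """Map FP8 codes back to real values with the shared scale."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 2.0, 0.3]  # toy tensor, not real model weights
codes, scale = quantize_per_tensor(weights)
restored = dequantize(codes, scale)
max_abs_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With static activation quantization the activation scale is computed once from calibration data and then reused at inference time, whereas a dynamic scheme would recompute `amax` for every batch.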
What's Quantized / What's Not
Same selective strategy as the NVFP4 variant — sensitive components are preserved in BF16:
| Component | Precision | Reason |
|---|---|---|
| FFN / MLP — all 64 transformer layers | FP8 | High parameter density, stable under quantization |
| Full-attention projections (q/k/v/o) — 16 GQA layers | FP8 | Standard attention, tolerant to 8-bit |
| DeltaNet / Linear-attention projections — 48 layers | BF16 | Gated linear recurrence sensitive to numerical errors |
| Vision encoder — all 27 blocks + merger | BF16 | Vision tower preserved for multimodal quality |
| lm_head | BF16 | Output logits preserved for generation stability |
Quantization Config (llm-compressor)
```yaml
# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: FP8  # static W8A8, per-tensor symmetric
  ignore:
    - lm_head
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    - re:model\.language_model\.layers\.\d+\.linear_attn\..*
```
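To see which modules the `ignore` list above leaves in BF16, the following sketch mimics the matching rule: entries prefixed with `re:` are treated as regular expressions and all other entries as exact module names. The helper `is_ignored` and the example module names are hypothetical, and anchoring the regex at the start of the name is an assumption about llm-compressor's behaviour, not a description of its code.

```python
import re

# Patterns copied from the recipe; "re:" marks a regular expression.
IGNORE = [
    "lm_head",
    r"re:model\.visual\.blocks\.\d+\..*",
    "model.visual.merger.linear_fc1",
    "model.visual.merger.linear_fc2",
    r"re:model\.language_model\.layers\.\d+\.linear_attn\..*",
]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if the module would be skipped (kept in BF16) under these patterns."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Assumption: the regex is matched from the start of the module name.
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, chosen only to exercise each rule.
examples = [
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.down_proj",
    "model.visual.blocks.12.attn.qkv",
    "lm_head",
]
kept_in_bf16 = [name for name in examples if is_ignored(name)]
```

In this toy run the linear-attention projection, the vision block, and `lm_head` are skipped, while the full-attention and MLP projections fall through to FP8, which matches the component table.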
Quick Start (vLLM)
```bash
vllm serve vrfai/Qwen3.6-27B-FP8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --dtype auto \
  --trust-remote-code \
  --tensor-parallel-size 2
```
Single GPU (SM 89+, with enough VRAM for the ~34 GB of weights plus KV cache):
```bash
vllm serve vrfai/Qwen3.6-27B-FP8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --dtype auto \
  --trust-remote-code
```
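Whether a given GPU setup can hold the model is back-of-the-envelope arithmetic: vLLM claims the `--gpu-memory-utilization` fraction of each GPU's memory, tensor parallelism splits the weights roughly evenly, and what remains goes to KV cache and activations. The sketch below uses the ~34 GB weight size from the comparison table in this card. The function name `kv_headroom_gb` is hypothetical, and the estimate ignores CUDA graphs, any unsharded parts of the model, and other runtime overheads.

```python
def kv_headroom_gb(vram_gb, utilization, weights_gb, tensor_parallel=1):
    """Approximate per-GPU memory (GB) left for KV cache after loading the weights."""
    budget = vram_gb * utilization                  # memory vLLM may use on each GPU
    per_gpu_weights = weights_gb / tensor_parallel  # assumes an even split across GPUs
    return budget - per_gpu_weights


WEIGHTS_GB = 34.0  # approximate FP8 checkpoint size quoted in this card

# Two 32 GB GPUs with --tensor-parallel-size 2 and --gpu-memory-utilization 0.9
two_gpus = kv_headroom_gb(32, 0.90, WEIGHTS_GB, tensor_parallel=2)

# One 24 GB GPU with --gpu-memory-utilization 0.92
one_24gb = kv_headroom_gb(24, 0.92, WEIGHTS_GB)
```

A negative result means the weights alone exceed the budget. Under these assumptions the two-GPU setup leaves roughly 11.8 GB per GPU, while a single 24 GB card comes out negative.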
Quantization Script
The recipes and scripts used to quantize this model can be found in the following repository:
Python (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "vrfai/Qwen3.6-27B-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain quantization in one paragraph."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
OpenAI-compatible API
```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not check the API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="vrfai/Qwen3.6-27B-FP8",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
NVFP4 vs FP8 Comparison
| | NVFP4 | FP8 (this) |
|---|---|---|
| Weight bits | 4 | 8 |
| Activation bits | 4 (dynamic) | 8 (static) |
| Model size | ~26 GB | ~34 GB |
| Hardware | Blackwell only (SM 120+) | Ada Lovelace / Hopper / Blackwell |
| Speed | Faster | Slightly slower |
| Quality | Slightly lower | Higher |
Tested Environment
| Component | Version |
|---|---|
| vLLM | 0.19.1 |
| Transformers | 5.6.2 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 (nvcc 12.8.61) |
| llm-compressor | compressed-tensors 0.14.0.1 |
| GPU | 2× NVIDIA RTX 5090 (tensor-parallel-size 2) |
Best Practices
| Mode | temperature | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking — general | 1.0 | 0.95 | 20 | 0.0 |
| Thinking — coding | 0.6 | 0.95 | 20 | 0.0 |
| Non-thinking / instruct | 0.7 | 0.80 | 20 | 1.5 |
Thinking mode:
```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # template variables are passed as keyword arguments
)
```
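The sampling presets in the table above can be kept in one place and turned into request arguments. The sketch below builds keyword arguments for `client.chat.completions.create` against a vLLM server. `top_k` and `chat_template_kwargs` are not part of the standard OpenAI request schema, so they are passed through `extra_body`, which vLLM's OpenAI-compatible server accepts. The names `SAMPLING_PRESETS` and `build_chat_kwargs` are hypothetical, and mapping the non-thinking row to `enable_thinking=False` is an assumption.

```python
# Values copied from the Best Practices table.
SAMPLING_PRESETS = {
    "thinking-general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking-coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "non-thinking": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "presence_penalty": 1.5},
}


def build_chat_kwargs(mode, messages, model="vrfai/Qwen3.6-27B-FP8", max_tokens=512):
    """Return keyword arguments for client.chat.completions.create(**kwargs)."""
    preset = SAMPLING_PRESETS[mode]
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": preset["temperature"],
        "top_p": preset["top_p"],
        "presence_penalty": preset["presence_penalty"],
        # Non-standard fields travel in extra_body when talking to vLLM.
        "extra_body": {
            "top_k": preset["top_k"],
            "chat_template_kwargs": {"enable_thinking": mode.startswith("thinking")},
        },
    }


kwargs = build_chat_kwargs("thinking-coding", [{"role": "user", "content": "Hello!"}])
# With a client created as in the OpenAI-compatible API example:
# response = client.chat.completions.create(**kwargs)
```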
Credits
- Original model: Qwen Team (Alibaba Group)
- FP8 quantization: vrfai
- Quantization framework: vllm-project/llm-compressor
Below is the original model card from Qwen/Qwen3.6-27B:
> [!NOTE]
> This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.
Qwen3.6 Highlights
- Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
- Thinking Preservation: reasoning context from historical messages is retained, streamlining iterative development.

For more details, please refer to our blog post Qwen3.6-27B.
Model Overview
- Type: Causal Language Model with Vision Encoder
- Number of Parameters: 27B
- Context Length: 262,144 natively and extensible up to 1,010,000 tokens
Citation
```bibtex
@misc{qwen3.6-27b,
  title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
  author = {{Qwen Team}},
  month  = {April},
  year   = {2026},
  url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}
```
Modalities: video, text, and image input; text output.