Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Quick start
bash
pip install "vllm>=0.17"huggingface-cli download vadery/Qwen3.5-27B-W8A8 --local-dir ./Qwen3.5-27B-W8A8vllm serve ./Qwen3.5-27B-W8A8 \--max-model-len 262144 \--dtype bfloat16 \--gpu-memory-utilization 0.92 \--trust-remote-code \--enable-auto-tool-choice --tool-call-parser qwen3_coder \--reasoning-parser qwen3 \--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'
No additional patches required — both config.json.quantization_config.ignore (covers MTP Linear modules) and actorder field are already fixed.
Performance (single H200 SXM, vLLM 0.17.1, temperature=0)
| Workload | Concurrency | Throughput | TPOT p50 | MTP Accept rate | Mean accept length |
|---|---|---|---|---|---|
| JSON structured output | 1 | 131 tok/s | 7.5 ms | 93.1 % | 1.93 / 2.0 |
Same quantization recipe as the GRM-2.6-Plus fine-tune (vadery/qwen36-27b-ft-grm-w8a8) — performance numbers are within noise.
Architecture preserved
| Component | Status |
|---|---|
| Language model Linear (q/k/v/o, MLP) on the 16 full-attention layers | INT8 (W8A8 channelwise weight + dynamic per-token activation) |
| MLP on every layer (64 total) | INT8 |
linear_attn.* (Gated DeltaNet / SSM) — 48 layers | BF16 (excluded — Mamba state numerics matter) |
Vision tower (model.visual.*) | BF16 (excluded) |
MTP head (mtp.*, 1 layer) | BF16 (excluded; correctly listed in quantization_config.ignore) |
lm_head, embeddings, norms | BF16 / FP32 (excluded) |
Quantization recipe
python
SmoothQuantModifier(smoothing_strength=0.8, mappings=SQ_MAPPINGS,ignore=[...vision, mtp, linear_attn, embed, lm_head...])GPTQModifier(targets="Linear", scheme="W8A8",ignore=[same as above],dampening_frac=0.01)
SmoothQuant mappings explicitly cover only the 16 full-attention layers (indices 3, 7, …, 63 out of 64) plus MLP on every layer — to avoid SmoothQuant trying to fuse into the linear_attn projections which have non-standard shapes.
Calibration: 512 samples × 2048 tokens from HuggingFaceH4/ultrachat_200k.
Post-process steps (already applied; documented for reproducers)
llm-compressor 0.10 drops the MTP tensors from the saved state and writes a quantization_config.ignore that doesn't cover MTP Linear modules. We post-process:
- Restore MTP tensors — copy 15
mtp.*tensors from the BF16 sourcemodel-*.safetensorsshards into the W8A8 single-shard safetensors. - Patch
config.json— add 8 MTP Linear module names toquantization_config.ignoreand clear the spuriousactorder=staticfield, so vLLM treats the MTP head as un-quantized BF16 on load.
Without these two steps, vLLM either drops the MTP head (0 % acceptance) or loads garbage values (also 0 % acceptance after weights are corrupted on load).
File size
| Size | |
|---|---|
BF16 source (Qwen/Qwen3.5-27B) | 52 GB |
| This W8A8 model | 35 GB |
Reasoning + tool calling
Same parser flags as the BF16 source:
--reasoning-parser qwen3— separates<think>segments intoreasoningfield--tool-call-parser qwen3_coder+--enable-auto-tool-choice— OpenAI tool-call API
Notes
- vLLM ≥ 0.17 required (
qwen3_5_mtpspeculative method only landed there). transformers≥ 5.x is required forqwen3_5model_type.- Tested on H200 (compute capability 9.0). H100 should also work.
- The Qwen3.5 series emits
<think>blocks by default — givemax_tokens >= 4096or pass"chat_template_kwargs": {"enable_thinking": false}to skip.
License
Inherits Apache 2.0 from the base model.
Model provider
vadery
Model tree
Base
Qwen/Qwen3.5-27B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information