vadery

Qwen3.5-27B-W8A8

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

Quick start

bash
pip install "vllm>=0.17"
huggingface-cli download vadery/Qwen3.5-27B-W8A8 --local-dir ./Qwen3.5-27B-W8A8

vllm serve ./Qwen3.5-27B-W8A8 \
  --max-model-len 262144 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'

No additional patches required — both config.json.quantization_config.ignore (covers MTP Linear modules) and actorder field are already fixed.

Performance (single H200 SXM, vLLM 0.17.1, temperature=0)

Table with columns: Workload, Concurrency, Throughput, TPOT p50, MTP Accept rate, Mean accept length
Workload	Concurrency	Throughput	TPOT p50	MTP Accept rate	Mean accept length
JSON structured output	1	131 tok/s	7.5 ms	93.1 %	1.93 / 2.0

Same quantization recipe as the GRM-2.6-Plus fine-tune (vadery/qwen36-27b-ft-grm-w8a8) — performance numbers are within noise.

Architecture preserved

Table with columns: Component, Status
Component	Status
Language model Linear (q/k/v/o, MLP) on the 16 full-attention layers	INT8 (W8A8 channelwise weight + dynamic per-token activation)
MLP on every layer (64 total)	INT8
`linear_attn.*` (Gated DeltaNet / SSM) — 48 layers	BF16 (excluded — Mamba state numerics matter)
Vision tower (`model.visual.*`)	BF16 (excluded)
MTP head (`mtp.*`, 1 layer)	BF16 (excluded; correctly listed in `quantization_config.ignore`)
, embeddings, norms

Quantization recipe

python
SmoothQuantModifier(smoothing_strength=0.8, mappings=SQ_MAPPINGS,
                    ignore=[...vision, mtp, linear_attn, embed, lm_head...])
GPTQModifier(targets="Linear", scheme="W8A8",
             ignore=[same as above],
             dampening_frac=0.01)

SmoothQuant mappings explicitly cover only the 16 full-attention layers (indices 3, 7, …, 63 out of 64) plus MLP on every layer — to avoid SmoothQuant trying to fuse into the linear_attn projections which have non-standard shapes.

Calibration: 512 samples × 2048 tokens from HuggingFaceH4/ultrachat_200k.

Post-process steps (already applied; documented for reproducers)

llm-compressor 0.10 drops the MTP tensors from the saved state and writes a quantization_config.ignore that doesn't cover MTP Linear modules. We post-process:

Restore MTP tensors — copy 15 mtp.* tensors from the BF16 source model-*.safetensors shards into the W8A8 single-shard safetensors.
Patch config.json — add 8 MTP Linear module names to quantization_config.ignore and clear the spurious actorder=static field, so vLLM treats the MTP head as un-quantized BF16 on load.

Without these two steps, vLLM either drops the MTP head (0 % acceptance) or loads garbage values (also 0 % acceptance after weights are corrupted on load).

File size

Table with columns: Size
	Size
BF16 source (`Qwen/Qwen3.5-27B`)	52 GB
This W8A8 model	35 GB

Reasoning + tool calling

Same parser flags as the BF16 source:

--reasoning-parser qwen3 — separates <think> segments into reasoning field
--tool-call-parser qwen3_coder + --enable-auto-tool-choice — OpenAI tool-call API

Notes

vLLM ≥ 0.17 required (qwen3_5_mtp speculative method only landed there).
transformers ≥ 5.x is required for qwen3_5 model_type.
Tested on H200 (compute capability 9.0). H100 should also work.
The Qwen3.5 series emits <think> blocks by default — give max_tokens >= 4096 or pass "chat_template_kwargs": {"enable_thinking": false} to skip.

License

Inherits Apache 2.0 from the base model.

Model provider

vadery

Model tree

Base

Qwen/Qwen3.5-27B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

Quick start

bash
pip install "vllm>=0.17"
huggingface-cli download vadery/Qwen3.5-27B-W8A8 --local-dir ./Qwen3.5-27B-W8A8

vllm serve ./Qwen3.5-27B-W8A8 \
  --max-model-len 262144 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'

No additional patches required — both config.json.quantization_config.ignore (covers MTP Linear modules) and actorder field are already fixed.

Performance (single H200 SXM, vLLM 0.17.1, temperature=0)

Table with columns: Workload, Concurrency, Throughput, TPOT p50, MTP Accept rate, Mean accept length
Workload	Concurrency	Throughput	TPOT p50	MTP Accept rate	Mean accept length
JSON structured output	1	131 tok/s	7.5 ms	93.1 %	1.93 / 2.0

Same quantization recipe as the GRM-2.6-Plus fine-tune (vadery/qwen36-27b-ft-grm-w8a8) — performance numbers are within noise.

Architecture preserved

Table with columns: Component, Status
Component	Status
Language model Linear (q/k/v/o, MLP) on the 16 full-attention layers	INT8 (W8A8 channelwise weight + dynamic per-token activation)
MLP on every layer (64 total)	INT8
`linear_attn.*` (Gated DeltaNet / SSM) — 48 layers	BF16 (excluded — Mamba state numerics matter)
Vision tower (`model.visual.*`)	BF16 (excluded)
MTP head (`mtp.*`, 1 layer)	BF16 (excluded; correctly listed in `quantization_config.ignore`)
, embeddings, norms

Quantization recipe

python
SmoothQuantModifier(smoothing_strength=0.8, mappings=SQ_MAPPINGS,
                    ignore=[...vision, mtp, linear_attn, embed, lm_head...])
GPTQModifier(targets="Linear", scheme="W8A8",
             ignore=[same as above],
             dampening_frac=0.01)

Calibration: 512 samples × 2048 tokens from HuggingFaceH4/ultrachat_200k.

Post-process steps (already applied; documented for reproducers)

llm-compressor 0.10 drops the MTP tensors from the saved state and writes a quantization_config.ignore that doesn't cover MTP Linear modules. We post-process:

Restore MTP tensors — copy 15 mtp.* tensors from the BF16 source model-*.safetensors shards into the W8A8 single-shard safetensors.
Patch config.json — add 8 MTP Linear module names to quantization_config.ignore and clear the spurious actorder=static field, so vLLM treats the MTP head as un-quantized BF16 on load.

Without these two steps, vLLM either drops the MTP head (0 % acceptance) or loads garbage values (also 0 % acceptance after weights are corrupted on load).

File size

Table with columns: Size
	Size
BF16 source (`Qwen/Qwen3.5-27B`)	52 GB
This W8A8 model	35 GB

Reasoning + tool calling

Same parser flags as the BF16 source:

--reasoning-parser qwen3 — separates <think> segments into reasoning field
--tool-call-parser qwen3_coder + --enable-auto-tool-choice — OpenAI tool-call API

Notes

vLLM ≥ 0.17 required (qwen3_5_mtp speculative method only landed there).
transformers ≥ 5.x is required for qwen3_5 model_type.
Tested on H200 (compute capability 9.0). H100 should also work.
The Qwen3.5 series emits <think> blocks by default — give max_tokens >= 4096 or pass "chat_template_kwargs": {"enable_thinking": false} to skip.

License

Inherits Apache 2.0 from the base model.

Qwen3.5-27B-W8A8

Get help setting up a custom Dedicated Endpoints.

README

Quick start

Performance (single H200 SXM, vLLM 0.17.1, temperature=0)

Architecture preserved

Quantization recipe

Post-process steps (already applied; documented for reproducers)

File size

Reasoning + tool calling

Notes

License

Explore FriendliAI today

README

Quick start

Performance (single H200 SXM, vLLM 0.17.1, temperature=0)

Architecture preserved

Quantization recipe

Post-process steps (already applied; documented for reproducers)

File size

Reasoning + tool calling

Notes

License