Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Quick start

bash

pip install "vllm>=0.17"
huggingface-cli download vadery/Qwen3.5-27B-W8A8 --local-dir ./Qwen3.5-27B-W8A8
vllm serve ./Qwen3.5-27B-W8A8 \
--max-model-len 262144 \
--dtype bfloat16 \
--gpu-memory-utilization 0.92 \
--trust-remote-code \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'

No additional patches required — both config.json.quantization_config.ignore (covers MTP Linear modules) and actorder field are already fixed.

Performance (single H200 SXM, vLLM 0.17.1, temperature=0)

WorkloadConcurrencyThroughputTPOT p50MTP Accept rateMean accept length
JSON structured output1131 tok/s7.5 ms93.1 %1.93 / 2.0

Same quantization recipe as the GRM-2.6-Plus fine-tune (vadery/qwen36-27b-ft-grm-w8a8) — performance numbers are within noise.

Architecture preserved

ComponentStatus
Language model Linear (q/k/v/o, MLP) on the 16 full-attention layersINT8 (W8A8 channelwise weight + dynamic per-token activation)
MLP on every layer (64 total)INT8
linear_attn.* (Gated DeltaNet / SSM) — 48 layersBF16 (excluded — Mamba state numerics matter)
Vision tower (model.visual.*)BF16 (excluded)
MTP head (mtp.*, 1 layer)BF16 (excluded; correctly listed in quantization_config.ignore)
lm_head, embeddings, normsBF16 / FP32 (excluded)

Quantization recipe

python

SmoothQuantModifier(smoothing_strength=0.8, mappings=SQ_MAPPINGS,
ignore=[...vision, mtp, linear_attn, embed, lm_head...])
GPTQModifier(targets="Linear", scheme="W8A8",
ignore=[same as above],
dampening_frac=0.01)

SmoothQuant mappings explicitly cover only the 16 full-attention layers (indices 3, 7, …, 63 out of 64) plus MLP on every layer — to avoid SmoothQuant trying to fuse into the linear_attn projections which have non-standard shapes.

Calibration: 512 samples × 2048 tokens from HuggingFaceH4/ultrachat_200k.

Post-process steps (already applied; documented for reproducers)

llm-compressor 0.10 drops the MTP tensors from the saved state and writes a quantization_config.ignore that doesn't cover MTP Linear modules. We post-process:

  1. Restore MTP tensors — copy 15 mtp.* tensors from the BF16 source model-*.safetensors shards into the W8A8 single-shard safetensors.
  2. Patch config.json — add 8 MTP Linear module names to quantization_config.ignore and clear the spurious actorder=static field, so vLLM treats the MTP head as un-quantized BF16 on load.

Without these two steps, vLLM either drops the MTP head (0 % acceptance) or loads garbage values (also 0 % acceptance after weights are corrupted on load).

File size

Size
BF16 source (Qwen/Qwen3.5-27B)52 GB
This W8A8 model35 GB

Reasoning + tool calling

Same parser flags as the BF16 source:

  • --reasoning-parser qwen3 — separates <think> segments into reasoning field
  • --tool-call-parser qwen3_coder + --enable-auto-tool-choice — OpenAI tool-call API

Notes

  • vLLM ≥ 0.17 required (qwen3_5_mtp speculative method only landed there).
  • transformers ≥ 5.x is required for qwen3_5 model_type.
  • Tested on H200 (compute capability 9.0). H100 should also work.
  • The Qwen3.5 series emits <think> blocks by default — give max_tokens >= 4096 or pass "chat_template_kwargs": {"enable_thinking": false} to skip.

License

Inherits Apache 2.0 from the base model.

Model provider

vadery

vadery

Model tree

Base

Qwen/Qwen3.5-27B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today