vadery

qwen36-27b-ft-grm-w8a8

README

License: apache-2.0

Quick start

bash
pip install "vllm>=0.17"
huggingface-cli download vadery/qwen36-27b-ft-grm-w8a8 --local-dir ./GRM-2.6-Plus-W8A8

vllm serve ./GRM-2.6-Plus-W8A8 \
  --max-model-len 262144 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'

No additional patches required — both config.json.quantization_config.ignore (covers MTP Linear modules) and actorder field are already fixed.

Performance (single H200 SXM, vLLM 0.17.1, temperature=0)

Table with columns: Workload, Concurrency, Throughput, TPOT p50, MTP Accept rate
Workload	Concurrency	Throughput	TPOT p50	MTP Accept rate
Code generation	1	131 tok/s	7.6 ms	90.8 %
Code generation	8	859 tok/s	8.5 ms	91.1 %
JSON structured output	1	132 tok/s	7.6 ms	94.2 %
JSON structured output	8	942 tok/s	8.3 ms	93.7 %

For reference, the BF16 source with the same MTP recipe reaches 102 tok/s single-stream and 749 tok/s at concurrency 8 on W2 — i.e. this W8A8 model is 26-29 % faster while using ~18 GB less weight memory.

Architecture preserved

Table with columns: Component, Status
Component	Status
Language model Linear (q/k/v/o, MLP) on the 16 full-attention layers	INT8 (W8A8 channelwise weight + dynamic per-token activation)
MLP on every layer (64 total)	INT8
`linear_attn.*` (Gated DeltaNet / SSM) — 48 layers	BF16 (excluded — Mamba state numerics matter)
Vision tower (`model.visual.*`)	BF16 (excluded)
MTP head (`mtp.*`, 1 layer)	BF16 (excluded; correctly listed in `quantization_config.ignore`)
, embeddings, norms

Quantization recipe

python
SmoothQuantModifier(smoothing_strength=0.8, mappings=SQ_MAPPINGS,
                    ignore=[...vision, mtp, linear_attn, embed, lm_head...])
GPTQModifier(targets="Linear", scheme="W8A8",
             ignore=[same as above],
             dampening_frac=0.01)

SmoothQuant mappings explicitly cover only the 16 full-attention layers (indices 3, 7, 11, …, 63 out of 64) plus MLP on every layer — to avoid SmoothQuant trying to fuse into the linear_attn projections which have non-standard shapes.

Calibration: 512 samples × 2048 tokens from HuggingFaceH4/ultrachat_200k.

Post-process steps (already applied; documented for reproducers)

llm-compressor 0.10 drops the MTP tensors from the saved state and writes a quantization_config.ignore that doesn't cover MTP Linear modules. We post-process:

Restore MTP tensors — copy 15 mtp.* tensors from the BF16 source model-*.safetensors shards into the W8A8 single-shard safetensors.
Patch config.json — add 8 MTP Linear module names to quantization_config.ignore so vLLM treats them as un-quantized BF16 on load.

Without these two steps, vLLM either drops the MTP head (0 % acceptance) or loads garbage values (also 0 % acceptance after weights are corrupted on load).

File size

Table with columns: Size
	Size
BF16 source (`OrionLLM/GRM-2.6-Plus`)	52 GB
This W8A8 model	35 GB

Reasoning + tool calling

Same parser flags as the BF16 source:

--reasoning-parser qwen3 — separates <think> segments into reasoning field
--tool-call-parser qwen3_coder + --enable-auto-tool-choice — OpenAI tool-call API

Notes

vLLM ≥ 0.17 required (qwen3_5_mtp speculative method only landed there).
transformers ≥ 5.x is required for qwen3_5 model_type.
Tested on H200 (compute capability 9.0). H100 should also work. SM80 (A100) cutlass INT8 path exists but is slower per dollar than H100/H200.
For multi-stream serving raise --max-num-seqs to taste; observed peak throughput stays linear in concurrency up to at least 8.

License

Inherits Apache 2.0 from the base model.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

vadery

Model Tree

Base

OrionLLM/GRM-2.6-Plus

Quantized

this model

Input Modalities

Text

Image

Video

Output Modalities