vadery

vadery

qwen36-27b-ft-grm-w8a8

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Quick start

bash

pip install "vllm>=0.17"
huggingface-cli download vadery/qwen36-27b-ft-grm-w8a8 --local-dir ./GRM-2.6-Plus-W8A8
vllm serve ./GRM-2.6-Plus-W8A8 \
--max-model-len 262144 \
--dtype bfloat16 \
--gpu-memory-utilization 0.92 \
--trust-remote-code \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'

No additional patches required — both config.json.quantization_config.ignore (covers MTP Linear modules) and actorder field are already fixed.

Performance (single H200 SXM, vLLM 0.17.1, temperature=0)

Table
WorkloadConcurrencyThroughputTPOT p50MTP Accept rate
Code generation1131 tok/s7.6 ms90.8 %
Code generation8859 tok/s8.5 ms91.1 %
JSON structured output1132 tok/s7.6 ms94.2 %
JSON structured output8942 tok/s8.3 ms93.7 %

For reference, the BF16 source with the same MTP recipe reaches 102 tok/s single-stream and 749 tok/s at concurrency 8 on W2 — i.e. this W8A8 model is 26-29 % faster while using ~18 GB less weight memory.

Architecture preserved

Table
ComponentStatus
Language model Linear (q/k/v/o, MLP) on the 16 full-attention layersINT8 (W8A8 channelwise weight + dynamic per-token activation)
MLP on every layer (64 total)INT8
linear_attn.* (Gated DeltaNet / SSM) — 48 layersBF16 (excluded — Mamba state numerics matter)
Vision tower (model.visual.*)BF16 (excluded)
MTP head (mtp.*, 1 layer)BF16 (excluded; correctly listed in quantization_config.ignore)
lm_head, embeddings, normsBF16 / FP32 (excluded)

Quantization recipe

python

SmoothQuantModifier(smoothing_strength=0.8, mappings=SQ_MAPPINGS,
ignore=[...vision, mtp, linear_attn, embed, lm_head...])
GPTQModifier(targets="Linear", scheme="W8A8",
ignore=[same as above],
dampening_frac=0.01)

SmoothQuant mappings explicitly cover only the 16 full-attention layers (indices 3, 7, 11, …, 63 out of 64) plus MLP on every layer — to avoid SmoothQuant trying to fuse into the linear_attn projections which have non-standard shapes.

Calibration: 512 samples × 2048 tokens from HuggingFaceH4/ultrachat_200k.

Post-process steps (already applied; documented for reproducers)

llm-compressor 0.10 drops the MTP tensors from the saved state and writes a quantization_config.ignore that doesn't cover MTP Linear modules. We post-process:

  1. Restore MTP tensors — copy 15 mtp.* tensors from the BF16 source model-*.safetensors shards into the W8A8 single-shard safetensors.
  2. Patch config.json — add 8 MTP Linear module names to quantization_config.ignore so vLLM treats them as un-quantized BF16 on load.

Without these two steps, vLLM either drops the MTP head (0 % acceptance) or loads garbage values (also 0 % acceptance after weights are corrupted on load).

File size

Table
Size
BF16 source (OrionLLM/GRM-2.6-Plus)52 GB
This W8A8 model35 GB

Reasoning + tool calling

Same parser flags as the BF16 source:

  • --reasoning-parser qwen3 — separates <think> segments into reasoning field
  • --tool-call-parser qwen3_coder + --enable-auto-tool-choice — OpenAI tool-call API

Notes

  • vLLM ≥ 0.17 required (qwen3_5_mtp speculative method only landed there).
  • transformers ≥ 5.x is required for qwen3_5 model_type.
  • Tested on H200 (compute capability 9.0). H100 should also work. SM80 (A100) cutlass INT8 path exists but is slower per dollar than H100/H200.
  • For multi-stream serving raise --max-num-seqs to taste; observed peak throughput stays linear in concurrency up to at least 8.

License

Inherits Apache 2.0 from the base model.

Model provider

vadery

vadery

Model tree

Base

OrionLLM/GRM-2.6-Plus

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today