vadery

vadery

Qwen3.5-0.8B-W8A8

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Quick start

bash

pip install "vllm>=0.17"
huggingface-cli download vadery/Qwen3.5-0.8B-W8A8 --local-dir ./Qwen3.5-0.8B-W8A8
vllm serve ./Qwen3.5-0.8B-W8A8 \
--max-model-len 32768 \
--dtype bfloat16 \
--trust-remote-code \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'

No additional patches required — both config.json.quantization_config.ignore (covers MTP Linear modules) and actorder field are already fixed.

Performance (single H200 SXM, single-stream)

Table
WorkloadThroughputTPOT p50MTP Accept rateMean accept length
JSON gen (concurrency 1)392 tok/s2.6 ms95.4 %1.95 / 2.0

3-4× faster than the BF16 source running the same MTP recipe.

Architecture preserved

Table
ComponentStatus
Language model Linear (q/k/v/o, MLP) on full-attn layersINT8 (W8A8 channelwise weight + dynamic per-token activation)
linear_attn.* (Gated DeltaNet / SSM) layersBF16 (excluded — Mamba state numerics matter)
Vision tower (model.visual.*)BF16 (excluded)
MTP head (mtp.*, 1 layer)BF16 (excluded; correctly listed in quantization_config.ignore)
lm_head, embeddings, normsBF16 / FP32 (excluded)

Quantization recipe

python

SmoothQuantModifier(smoothing_strength=0.8, mappings=SQ_MAPPINGS,
ignore=[...vision, mtp, linear_attn, embed, lm_head...])
GPTQModifier(targets="Linear", scheme="W8A8",
ignore=[same as above],
dampening_frac=0.01)

SmoothQuant mappings explicitly cover only the 6 full-attention layers (indices 3, 7, 11, 15, 19, 23 out of 24) plus MLP on every layer — to avoid SmoothQuant trying to fuse into the linear_attn projections which have a non-standard shape.

Calibration: 256 samples × 2048 tokens from HuggingFaceH4/ultrachat_200k.

File size

Table
Size
BF16 source (Qwen/Qwen3.5-0.8B)1.7 GB
This W8A8 model1.4 GB

Notes / gotchas

  • vLLM ≥ 0.17 required (qwen3_5_mtp speculative method only landed there).
  • transformers ≥ 5.x is required for qwen3_5 model_type.
  • The MTP head weights are stored as mtp.* keys in the safetensors file; do not delete or re-quantize them. The companion quantization_config.ignore list explicitly excludes the 8 MTP Linear modules so vLLM treats them as float.
  • For multi-stream serving raise --max-model-len and --max-num-seqs to taste.

Reproducing

The quantization script is at https://huggingface.co/vadery/qwen36-27b-ft-grm-w8a8 (sibling 27B model), parameterized for the 0.8B's layer count.

License

Inherits Apache 2.0 from the base model.

Model provider

vadery

vadery

Model tree

Base

Qwen/Qwen3.5-0.8B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today