vadery

Qwen3.5-0.8B-W8A8

README

License: apache-2.0

Quick start

bash
pip install "vllm>=0.17"
huggingface-cli download vadery/Qwen3.5-0.8B-W8A8 --local-dir ./Qwen3.5-0.8B-W8A8

vllm serve ./Qwen3.5-0.8B-W8A8 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'

No additional patches required — both config.json.quantization_config.ignore (covers MTP Linear modules) and actorder field are already fixed.

Performance (single H200 SXM, single-stream)

Table with columns: Workload, Throughput, TPOT p50, MTP Accept rate, Mean accept length
Workload	Throughput	TPOT p50	MTP Accept rate	Mean accept length
JSON gen (concurrency 1)	392 tok/s	2.6 ms	95.4 %	1.95 / 2.0

3-4× faster than the BF16 source running the same MTP recipe.

Architecture preserved

Table with columns: Component, Status
Component	Status
Language model Linear (q/k/v/o, MLP) on full-attn layers	INT8 (W8A8 channelwise weight + dynamic per-token activation)
`linear_attn.*` (Gated DeltaNet / SSM) layers	BF16 (excluded — Mamba state numerics matter)
Vision tower (`model.visual.*`)	BF16 (excluded)
MTP head (`mtp.*`, 1 layer)	BF16 (excluded; correctly listed in `quantization_config.ignore`)
`lm_head`, embeddings, norms	BF16 / FP32 (excluded)

Quantization recipe

python
SmoothQuantModifier(smoothing_strength=0.8, mappings=SQ_MAPPINGS,
                    ignore=[...vision, mtp, linear_attn, embed, lm_head...])
GPTQModifier(targets="Linear", scheme="W8A8",
             ignore=[same as above],
             dampening_frac=0.01)

SmoothQuant mappings explicitly cover only the 6 full-attention layers (indices 3, 7, 11, 15, 19, 23 out of 24) plus MLP on every layer — to avoid SmoothQuant trying to fuse into the linear_attn projections which have a non-standard shape.

Calibration: 256 samples × 2048 tokens from HuggingFaceH4/ultrachat_200k.

File size

Table with columns: Size
	Size
BF16 source (`Qwen/Qwen3.5-0.8B`)	1.7 GB
This W8A8 model	1.4 GB

Notes / gotchas

vLLM ≥ 0.17 required (qwen3_5_mtp speculative method only landed there).
transformers ≥ 5.x is required for qwen3_5 model_type.
The MTP head weights are stored as mtp.* keys in the safetensors file; do not delete or re-quantize them. The companion quantization_config.ignore list explicitly excludes the 8 MTP Linear modules so vLLM treats them as float.
For multi-stream serving raise --max-model-len and --max-num-seqs to taste.

Reproducing

The quantization script is at https://huggingface.co/vadery/qwen36-27b-ft-grm-w8a8 (sibling 27B model), parameterized for the 0.8B's layer count.

License

Inherits Apache 2.0 from the base model.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

vadery

Model Tree

Base

Qwen/Qwen3.5-0.8B

Quantized

this model

Input Modalities

Text

Image

Video

Output Modalities