vadery
Qwen3.5-0.8B-W8A8
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Quick start
bash
pip install "vllm>=0.17"huggingface-cli download vadery/Qwen3.5-0.8B-W8A8 --local-dir ./Qwen3.5-0.8B-W8A8vllm serve ./Qwen3.5-0.8B-W8A8 \--max-model-len 32768 \--dtype bfloat16 \--trust-remote-code \--reasoning-parser qwen3 \--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'
No additional patches required — both config.json.quantization_config.ignore (covers MTP Linear modules) and actorder field are already fixed.
Performance (single H200 SXM, single-stream)
| Workload | Throughput | TPOT p50 | MTP Accept rate | Mean accept length |
|---|---|---|---|---|
| JSON gen (concurrency 1) | 392 tok/s | 2.6 ms | 95.4 % | 1.95 / 2.0 |
3-4× faster than the BF16 source running the same MTP recipe.
Architecture preserved
| Component | Status |
|---|---|
| Language model Linear (q/k/v/o, MLP) on full-attn layers | INT8 (W8A8 channelwise weight + dynamic per-token activation) |
linear_attn.* (Gated DeltaNet / SSM) layers | BF16 (excluded — Mamba state numerics matter) |
Vision tower (model.visual.*) | BF16 (excluded) |
MTP head (mtp.*, 1 layer) | BF16 (excluded; correctly listed in quantization_config.ignore) |
lm_head, embeddings, norms | BF16 / FP32 (excluded) |
Quantization recipe
python
SmoothQuantModifier(smoothing_strength=0.8, mappings=SQ_MAPPINGS,ignore=[...vision, mtp, linear_attn, embed, lm_head...])GPTQModifier(targets="Linear", scheme="W8A8",ignore=[same as above],dampening_frac=0.01)
SmoothQuant mappings explicitly cover only the 6 full-attention layers (indices 3, 7, 11, 15, 19, 23 out of 24) plus MLP on every layer — to avoid SmoothQuant trying to fuse into the linear_attn projections which have a non-standard shape.
Calibration: 256 samples × 2048 tokens from HuggingFaceH4/ultrachat_200k.
File size
| Size | |
|---|---|
BF16 source (Qwen/Qwen3.5-0.8B) | 1.7 GB |
| This W8A8 model | 1.4 GB |
Notes / gotchas
- vLLM ≥ 0.17 required (
qwen3_5_mtpspeculative method only landed there). transformers≥ 5.x is required forqwen3_5model_type.- The MTP head weights are stored as
mtp.*keys in the safetensors file; do not delete or re-quantize them. The companionquantization_config.ignorelist explicitly excludes the 8 MTP Linear modules so vLLM treats them as float. - For multi-stream serving raise
--max-model-lenand--max-num-seqsto taste.
Reproducing
The quantization script is at https://huggingface.co/vadery/qwen36-27b-ft-grm-w8a8 (sibling 27B model), parameterized for the 0.8B's layer count.
License
Inherits Apache 2.0 from the base model.
Model provider
vadery
Model tree
Base
Qwen/Qwen3.5-0.8B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information