Quantisation Details
- Scheme: FP8_DYNAMIC (W8A8: FP8 per-channel weights, dynamic per-token activation quantisation)
- Target layers: all Linear modules except those listed below
- Ignored layers:
  - lm_head (output projection to vocabulary)
  - re:.*visual.* (entire vision encoder: patch embed, attention, MLP, merger)
  - re:.*linear_attn.* (GatedDeltaNet hybrid linear attention layers, a Qwen3.5-specific architecture)
These layers remain in BF16 as they are sensitive to quantisation. In particular, the vision encoder's
merger layers are a bottleneck between the visual and language representations, and the GatedDeltaNet
layers contain small 32-dimensional projections that lose significant precision under FP8.
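To make "per-channel weights" and "dynamic per-token activations" more concrete, here is a minimal pure-Python sketch of how the scales could be chosen. It assumes the E4M3 variant of FP8 (largest finite value 448) and skips the final rounding onto the FP8 grid, so it only illustrates the scaling idea and is not the kernel that vLLM or llm-compressor actually runs. The function names and the sample numbers are made up for illustration.

```python
# Illustrative only: how FP8 scales might be chosen (rounding to the FP8 grid is omitted).

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format (assumed variant)


def row_scales(matrix):
    """Return one scale per row so the row's largest magnitude maps to FP8_E4M3_MAX."""
    scales = []
    for row in matrix:
        amax = max(abs(v) for v in row)
        # Fall back to 1.0 for all-zero rows to avoid dividing by zero later.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_rows(matrix, scales):
    """Divide each row by its scale; results lie within [-448, 448]."""
    return [[v / s for v in row] for row, s in zip(matrix, scales)]


# Weights, shape (out_channels, in_features): one scale per output channel,
# computed once when the checkpoint is quantised.
weights = [[0.02, -0.50, 0.10], [1.20, 0.30, -0.05]]
weight_scales = row_scales(weights)

# Activations, shape (tokens, in_features): one scale per token,
# recomputed at inference time from the values actually seen ("dynamic").
activations = [[3.0, -7.0, 1.5], [0.1, 0.2, -0.4]]
activation_scales = row_scales(activations)

scaled_activations = scale_rows(activations, activation_scales)
```

Because each row gets its own scale, a single outlier only costs precision within its own channel or token. With very narrow projections there are few values to share each scale, which may be part of why the small layers above are left in BF16.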
Usage with vLLM
from vllm import LLM
model = LLM("depop-ml/Qwen3.5-9B-FP8-Dynamic")
vLLM auto-detects the quantisation config from the checkpoint — no --quantization flag needed.
Usage with Transformers
from transformers import AutoModelForImageTextToText, AutoProcessor
model = AutoModelForImageTextToText.from_pretrained("depop-ml/Qwen3.5-9B-FP8-Dynamic")
processor = AutoProcessor.from_pretrained("depop-ml/Qwen3.5-9B-FP8-Dynamic")
Quantisation Recipe
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the unquantised base checkpoint and its processor.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-9B", dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)

# Quantise every Linear module to FP8 except the ignored layers.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        "re:.*visual.*",
        "re:.*linear_attn.*",
    ],
)

# FP8_DYNAMIC needs no calibration data, so oneshot runs without a dataset.
oneshot(model=model, recipe=recipe)

# Save the quantised model and processor together.
model.save_pretrained("Qwen3.5-9B-FP8-Dynamic")
processor.save_pretrained("Qwen3.5-9B-FP8-Dynamic")
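The ignore list mixes a plain name (lm_head) with entries prefixed by re:, which are treated as regular expressions over module names. The sketch below approximates that matching with the standard re module so you can check which layers a pattern would catch. It is a simplified stand-in, not llm-compressor's own matching code, and the module names are hypothetical examples rather than names read from the checkpoint.

```python
import re

IGNORE = ["lm_head", "re:.*visual.*", "re:.*linear_attn.*"]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if module_name matches any entry in the ignore list."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Entries prefixed with "re:" are regular expressions.
            if re.match(entry[3:], module_name):
                return True
        elif module_name == entry:
            # Plain entries are compared by exact name in this sketch.
            return True
    return False


# Hypothetical module names, chosen only to exercise each pattern.
example_names = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.linear_attn.in_proj_a",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.down_proj",
]

kept_in_bf16 = [name for name in example_names if is_ignored(name)]
quantised = [name for name in example_names if not is_ignored(name)]
```

To check the patterns against the real model, one option is to pass the names from model.named_modules() through is_ignored before running oneshot.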
Notes
- Qwen3.5 uses a hybrid architecture with both standard self-attention and GatedDeltaNet linear attention
layers. The linear attention layers are excluded from quantisation because their small projections
(the 32-dimensional in_proj_a and in_proj_b) appear particularly sensitive to precision loss. llm-compressor
recommends this exclusion.
- Qwen3.5 uses a 16px patch size (vs 14px in Qwen2.5), allowing ~30% more pixels per visual token at the
same inference token cost.
- Tested on NVIDIA L4 (24GB) GPUs.
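The ~30% figure in the patch-size note can be checked with a short calculation. This sketch assumes square patches and that both models merge the same number of patches into one visual token, so the per-token ratio equals the per-patch ratio. The helper name is illustrative.

```python
def pixels_per_patch(patch_size):
    """Number of pixels covered by one square patch of the given side length."""
    return patch_size * patch_size


old = pixels_per_patch(14)  # 14px patches, as stated for Qwen2.5
new = pixels_per_patch(16)  # 16px patches, as stated for Qwen3.5

# Relative increase in pixels covered per patch.
increase = new / old - 1
print(f"{old} -> {new} pixels per patch ({increase:.1%} more)")
```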