depop-ml

Qwen3.5-9B-FP8-Dynamic


License: apache-2.0

Quantisation Details

Scheme: FP8_DYNAMIC (W8A8 — FP8 per-channel weights, dynamic per-token activation quantisation)
Target layers: All Linear modules except those listed below
Ignored layers:
- lm_head — output projection to vocabulary
- re:.*visual.* — entire vision encoder (patch embed, attention, MLP, merger)
- re:.*linear_attn.* — GatedDeltaNet hybrid linear attention layers (Qwen3.5-specific architecture)

These layers remain in BF16 as they are sensitive to quantisation. In particular, the vision encoder's merger layers are a bottleneck between the visual and language representations, and the GatedDeltaNet layers contain small 32-dimensional projections that lose significant precision under FP8.
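To make the scheme above concrete, the following is a minimal sketch of how the two kinds of scale could be derived. It assumes symmetric abs-max scaling against the FP8 E4M3 maximum of 448; the function names and the toy matrices are hypothetical, and the actual rounding to FP8 is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def per_channel_weight_scales(weight):
    """One scale per output channel (row); computed once, offline."""
    return [max(abs(w) for w in row) / FP8_E4M3_MAX for row in weight]


def per_token_activation_scales(activations):
    """One scale per token (row); recomputed for every input at run time."""
    return [max(abs(x) for x in token) / FP8_E4M3_MAX for token in activations]


# Toy data: 2 output channels x 3 inputs, and 2 tokens x 3 features.
weight = [[0.5, -2.24, 1.0], [0.1, 0.05, -0.448]]
activations = [[4.48, -1.0, 0.2], [0.1, 0.2, -0.3]]

w_scales = per_channel_weight_scales(weight)
a_scales = per_token_activation_scales(activations)
print(w_scales, a_scales)
```

Because each row gets its own scale, a channel or token with small values is not forced to share the range of a much larger one. This sketch does not guard against an all-zero row, which would yield a scale of zero.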

Usage with vLLM

```python
from vllm import LLM

model = LLM("depop-ml/Qwen3.5-9B-FP8-Dynamic")
```

vLLM auto-detects the quantisation config from the checkpoint — no --quantization flag needed.
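As a rough illustration of what that auto-detection relies on, the sketch below looks for a `quantization_config` entry in a checkpoint's `config.json`. The helper name and the example dictionary are hypothetical; the dictionary is an illustrative stand-in, not the actual contents of this checkpoint's config.

```python
def quantisation_summary(config):
    """Return the quantisation method and ignore list, or None if unquantised."""
    qcfg = config.get("quantization_config")
    if qcfg is None:
        return None
    return {
        "quant_method": qcfg.get("quant_method"),
        "ignore": qcfg.get("ignore", []),
    }


# Illustrative stand-in for a parsed config.json (assumed layout).
example_config = {
    "model_type": "example",
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "ignore": ["lm_head"],
    },
}

summary = quantisation_summary(example_config)
print(summary)
```

For a real check, the dictionary could be loaded from a downloaded `config.json` with `json.load`.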

Usage with Transformers

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained("depop-ml/Qwen3.5-9B-FP8-Dynamic")
processor = AutoProcessor.from_pretrained("depop-ml/Qwen3.5-9B-FP8-Dynamic")
```

Quantisation Recipe

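```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the unquantised base model and its processor.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-9B", dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)

# Quantise every Linear module to FP8 except the ignored ones, which stay in BF16.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",             # output projection to vocabulary
        "re:.*visual.*",       # entire vision encoder
        "re:.*linear_attn.*",  # GatedDeltaNet linear attention layers
    ],
)

# FP8_DYNAMIC needs no calibration data, so oneshot is called without a dataset.
oneshot(model=model, recipe=recipe)

# Save the quantised checkpoint together with the processor.
model.save_pretrained("Qwen3.5-9B-FP8-Dynamic")
processor.save_pretrained("Qwen3.5-9B-FP8-Dynamic")
```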
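
Before running the recipe, it can help to preview which modules the ignore list would skip. The sketch below is a simplified approximation of the matching, assuming that `re:` entries are regular expressions matched from the start of the module name and that other entries must equal the name exactly. The helper `is_ignored` and the example module names are hypothetical, not taken from the real model.

```python
import re

IGNORE = ["lm_head", "re:.*visual.*", "re:.*linear_attn.*"]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if module_name matches any entry in the ignore list."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            # Strip the "re:" prefix and treat the rest as a regex.
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False


# Hypothetical module names, for illustration only.
example_names = [
    "lm_head",
    "model.visual.merger.linear_fc1",
    "model.language_model.layers.0.linear_attn.in_proj_a",
    "model.language_model.layers.3.self_attn.q_proj",
]

skipped = [name for name in example_names if is_ignored(name)]
print(skipped)
```

With a loaded model, the real names could be taken from `model.named_modules()` instead of the hard-coded list.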

Notes

- Qwen3.5 uses a hybrid architecture that combines standard self-attention with GatedDeltaNet linear attention layers. The linear attention layers are excluded from quantisation because their small projections (the 32-dim in_proj_a and in_proj_b) appear particularly sensitive to precision loss. llm-compressor recommends this exclusion.
- Qwen3.5 uses a 16px patch size (vs 14px in Qwen2.5), so each visual token covers ~30% more pixels at the same inference token cost.
- Tested on NVIDIA L4 (24GB) GPUs.
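
The ~30% figure in the patch size note follows from the ratio of patch areas. A quick check, assuming square patches and the same number of patches merged into each visual token in both models:

```python
def pixels_per_patch(patch_size):
    """Number of pixels covered by one square patch."""
    return patch_size * patch_size


old_patch, new_patch = 14, 16
ratio = pixels_per_patch(new_patch) / pixels_per_patch(old_patch)  # 256 / 196
increase_pct = (ratio - 1) * 100
print(f"{increase_pct:.1f}% more pixels per patch")
```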


Model Details

Model Provider

depop-ml

Model Tree

Base

Qwen/Qwen3.5-9B

Quantized

this model

Input Modalities

Text

Image

Video

Output Modalities