README
License: apache-2.0

Quantisation Details
- Scheme: `FP8_DYNAMIC` (W8A8: FP8 per-channel weights, dynamic per-token activation quantisation)
- Target layers: All `Linear` modules except those listed below
- Ignored layers:
  - `lm_head`: output projection to vocabulary
  - `re:.*visual.*`: entire vision encoder (patch embed, attention, MLP, merger)
  - `re:.*linear_attn.*`: GatedDeltaNet hybrid linear attention layers (Qwen3.5-specific architecture)
These layers remain in BF16 as they are sensitive to quantisation. In particular, the vision encoder's merger layers are a bottleneck between the visual and language representations, and the GatedDeltaNet layers contain small 32-dimensional projections that lose significant precision under FP8.
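To illustrate what the scheme described above computes, here is a minimal NumPy sketch of W8A8 quantisation with per-channel weight scales and dynamic per-token activation scales. It is not the llm-compressor or vLLM implementation: the helper names are hypothetical, and `fake_fp8` only roughly emulates FP8 E4M3 rounding (it ignores subnormals and exponent limits).

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def fake_fp8(x):
    """Rough E4M3 emulation (assumption): keep 4 significant bits, clip to +-448."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    mantissa, exponent = np.frexp(x)         # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    mantissa = np.round(mantissa * 16) / 16  # 1 implicit + 3 stored mantissa bits
    return np.ldexp(mantissa, exponent)


def quantise_weights_per_channel(w):
    # One scale per output channel (row of w), computed once offline.
    scale = np.abs(w).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    return fake_fp8(w / scale), scale


def quantise_activations_per_token(x):
    # One scale per token (row of x), computed at runtime: the "dynamic" part.
    scale = np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    return fake_fp8(x / scale), scale


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))  # (out_features, in_features)
x = rng.normal(size=(4, 16))  # (tokens, in_features)

w_q, w_scale = quantise_weights_per_channel(w)
x_q, x_scale = quantise_activations_per_token(x)

# Matmul in the quantised domain, then rescale by both sets of scales.
y_q = (x_q @ w_q.T) * x_scale * w_scale.T
y_ref = x @ w.T
rel_err = np.abs(y_q - y_ref).max() / np.abs(y_ref).max()
```

Because the activation scales are derived from each incoming token, no calibration data is needed for them, which is consistent with the recipe below calling `oneshot` without a dataset.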
Usage with vLLM
```python
from vllm import LLM

# Load the FP8 checkpoint; the quantisation config is read from the checkpoint itself.
model = LLM("depop-ml/Qwen3.5-9B-FP8-Dynamic")
```
vLLM auto-detects the quantisation config from the checkpoint, so no `--quantization` flag is needed.
Usage with Transformers
```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained("depop-ml/Qwen3.5-9B-FP8-Dynamic")
processor = AutoProcessor.from_pretrained("depop-ml/Qwen3.5-9B-FP8-Dynamic")
```
Quantisation Recipe
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the unquantised base model and its processor.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-9B", dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)

# Quantise all Linear modules to FP8_DYNAMIC, except the ignored layers.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        "re:.*visual.*",
        "re:.*linear_attn.*",
    ],
)

oneshot(model=model, recipe=recipe)

# Save the quantised model and processor side by side.
model.save_pretrained("Qwen3.5-9B-FP8-Dynamic")
processor.save_pretrained("Qwen3.5-9B-FP8-Dynamic")
```
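As a rough illustration of how the ignore list in the recipe above selects modules, the sketch below applies the same three patterns to a handful of module names using Python's `re`. The module names are hypothetical examples, not read from the checkpoint, and the matching rule (exact name for plain entries, `re.match` for `re:`-prefixed entries) is an assumption about how llm-compressor interprets the list.

```python
import re

IGNORE = ["lm_head", "re:.*visual.*", "re:.*linear_attn.*"]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if the module would be left in BF16 (assumed matching rule)."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Regex entry: strip the "re:" prefix and match against the full name.
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            # Plain entry: exact name comparison.
            return True
    return False


# Hypothetical module names, chosen only to exercise each pattern.
names = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.visual.merger.mlp.0",
    "model.language_model.layers.0.linear_attn.in_proj_a",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.down_proj",
]

kept_in_bf16 = [n for n in names if is_ignored(n)]
quantised = [n for n in names if not is_ignored(n)]
```

Under these assumptions, only the standard self-attention and MLP projections of the language model end up in `quantised`.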
Notes
- Qwen3.5 uses a hybrid architecture with both standard self-attention and GatedDeltaNet linear attention layers. The linear attention layers are excluded from quantisation because their small projection dimensions (32-dim in_proj_a and in_proj_b) appear particularly sensitive to precision loss. This exclusion is recommended by llm-compressor.
- Qwen3.5 uses a 16px patch size (vs 14px in Qwen2.5), allowing ~30% more pixels per visual token at the same inference token cost.
- Tested on NVIDIA L4 (24GB) GPUs.
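The ~30% figure in the patch-size note can be checked with a short calculation. This sketch compares the pixel area of a single square patch and assumes both models merge the same number of patches into each visual token, so the per-patch ratio carries over to the per-token ratio.

```python
QWEN25_PATCH_PX = 14  # patch side length in pixels, from the note above
QWEN35_PATCH_PX = 16

pixels_per_patch_old = QWEN25_PATCH_PX ** 2  # 196
pixels_per_patch_new = QWEN35_PATCH_PX ** 2  # 256

ratio = pixels_per_patch_new / pixels_per_patch_old
extra_pixels_pct = (ratio - 1) * 100

print(f"{extra_pixels_pct:.1f}% more pixels per patch")  # prints "30.6% more pixels per patch"
```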
Model provider: depop-ml

Model tree
- Base: Qwen/Qwen3.5-9B
- Quantized: this model

Modalities
- Input: Video, Text, Image
- Output: Text