Primary Focus: Vision Accuracy Preservation
Vision-Language Models (VLMs) are highly sensitive to quantization in their visual perception components. Quantizing the vision encoder typically degrades performance in spatial recognition, OCR, object counting, and visual grid analysis.
To avoid this degradation, the model uses a mixed-precision quantization recipe:
- 🎯 Unquantized Vision Tower: All visual transformer layers and vision projections are excluded from quantization and kept in native float16 precision, as are the linear attention modules. Visual feature extraction is therefore identical to that of the original unquantized model.
- 💾 Quantized Language Layers: Only the standard linear projections in the language model are compressed to FP8, using static weight scaling and dynamic activation scaling.
This combination keeps the vision encoder at native accuracy while cutting the memory footprint roughly in half.
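The memory claim follows from simple byte counting: an FP8 parameter takes 1 byte and a 16-bit parameter takes 2. The sketch below is only an estimate under stated assumptions. The split between quantized and unquantized parameters (`KEPT_16BIT`) is hypothetical, since the exact split is not given here, and the calculation ignores quantization scales, the KV cache, and activations.

```python
def weight_footprint_gb(params_fp8: float, params_16bit: float) -> float:
    """Approximate weight memory in GB (10^9 bytes): 1 byte per FP8 parameter, 2 per 16-bit parameter."""
    return (params_fp8 * 1 + params_16bit * 2) / 1e9

# Hypothetical parameter split, for illustration only.
TOTAL_PARAMS = 9.0e9
KEPT_16BIT = 0.5e9  # assumed: vision tower, lm_head, linear attention blocks

baseline_gb = weight_footprint_gb(0, TOTAL_PARAMS)
mixed_gb = weight_footprint_gb(TOTAL_PARAMS - KEPT_16BIT, KEPT_16BIT)
print(f"16-bit baseline: {baseline_gb:.1f} GB, mixed FP8/16-bit: {mixed_gb:.1f} GB")
```

With these assumed numbers the estimate lands on about 18 GB for the 16-bit baseline and about 9.5 GB for the mixed-precision weights. Actual VRAM use at inference time will be higher once the KV cache and activations are included.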
Key Benefits
- 💾 VRAM Savings: Cuts the VRAM footprint from ~18 GB (BF16) to ~9.5 GB, allowing the model to fit on common 12 GB and 16 GB GPUs.
- 🎯 Zero Vision-Encoder Accuracy Loss: Because the vision tower is untouched, its visual features are identical to those of the original Qwen/Qwen3.5-9B model, preserving the coordinate grounding, bounding-box, grid-reading, and OCR capabilities that depend on them.
- ⚡ Hardware Acceleration: Faster inference through native FP8 operations on NVIDIA Ada Lovelace, Hopper, and Blackwell Tensor Cores (e.g., RTX 40-series, L4, H100).
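Whether a given GPU can run FP8 matrix multiplications natively depends on its CUDA compute capability: FP8 Tensor Cores first appear at capability 8.9 (Ada Lovelace). The helper below is a minimal sketch of that check. The function name `supports_native_fp8` is made up for illustration and is not part of any library.

```python
def supports_native_fp8(major: int, minor: int) -> bool:
    """Return True if a CUDA compute capability has FP8 Tensor Cores (8.9 and above)."""
    return (major, minor) >= (8, 9)

# Compute capabilities: Ampere is 8.0 / 8.6, Ada Lovelace is 8.9, Hopper is 9.0.
for name, cap in [("Ampere", (8, 0)), ("Ada Lovelace", (8, 9)), ("Hopper", (9, 0))]:
    print(name, supports_native_fp8(*cap))
```

On a machine with PyTorch and a CUDA device, the capability tuple can be read with `torch.cuda.get_device_capability()` and passed straight in as `supports_native_fp8(*torch.cuda.get_device_capability())`. On GPUs below 8.9 the memory savings may still apply, but the FP8 speedup should not be expected.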
Quantization Methodology
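Quantization was performed with the one-shot method in llmcompressor, using the FP8_DYNAMIC scheme: FP8 weight scales are static (computed once at quantization time), while FP8 activation scales are dynamic (computed at runtime for each input).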
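The difference between static and dynamic scaling can be illustrated with plain Python. This is a toy sketch, not the llmcompressor implementation. It assumes one scale per output channel for weights and one scale per token for activations, which is how the FP8_DYNAMIC preset is commonly described, and it skips the final rounding to FP8 values. The function names and the toy matrices are invented for illustration.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def static_weight_scales(weight):
    """One scale per output channel (row), computed once at quantization time."""
    return [max(abs(w) for w in row) / FP8_E4M3_MAX for row in weight]

def dynamic_activation_scales(activations):
    """One scale per token (row), recomputed for every input at inference time."""
    return [max(abs(a) for a in token) / FP8_E4M3_MAX for token in activations]

weight = [[0.5, -2.24], [4.48, 1.0]]  # toy 2x2 weight matrix
tokens = [[1.0, -8.96], [0.1, 0.2]]   # toy activations for two tokens

w_scales = static_weight_scales(weight)
a_scales = dynamic_activation_scales(tokens)

# Dividing a row by its scale maps it into the FP8 range [-448, 448];
# a real kernel would then round each value to the nearest FP8 number.
scaled_row = [w / w_scales[0] for w in weight[0]]
print(w_scales, a_scales, scaled_row)
```

Because activation scales are derived from each incoming input, this scheme does not depend on activation statistics collected ahead of time. The toy code does not guard against an all-zero row, which a real implementation would need to handle.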
The following components were explicitly excluded from quantization to protect vision and text quality:
- Vision Encoder (re:.*visual.*): Keeps the entire image-processing pipeline in float16.
- Language Model Head (lm_head): Kept in native precision to preserve textual coherence.
- Linear Attention Blocks (re:.*linear_attn.*): Preserved in native precision.
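To see how the exclusion patterns above select modules, the sketch below emulates the matching in a simplified way: entries prefixed with `re:` are treated as regular expressions against the module name, and other entries must equal the name exactly. This is an illustration, not the library's own matching code, and the module names in `names` are hypothetical.

```python
import re

IGNORE = ["lm_head", "re:.*visual.*", "re:.*linear_attn.*"]

def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Simplified emulation: 're:' entries are regexes, others must equal the name."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False

# Hypothetical module names, for illustration only.
names = [
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.down_proj",
    "lm_head",
]
for name in names:
    print(f"{name}: {'kept in 16-bit' if is_ignored(name) else 'quantized to FP8'}")
```

Running a check like this over the real module names of a loaded model (for example, from `model.named_modules()`) is a quick way to confirm that nothing in the vision tower is quantized by accident.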
Quantization Recipe Used:
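from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForImageTextToText

# Load the base model (assumed here to be Qwen/Qwen3.5-9B, in its checkpoint dtype).
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.5-9B", torch_dtype="auto")

# Quantize every Linear layer to FP8 except the LM head, the vision tower,
# and the linear attention blocks.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*visual.*", "re:.*linear_attn.*"],
)

# FP8_DYNAMIC needs no calibration data, so no dataset is passed.
oneshot(model=model, recipe=recipe)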
How to Load and Use
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

model_id = "YOUR_HF_USERNAME/Qwen3.5-9B-FP8-Dynamic"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
Primary Use Cases
- VRAM-constrained deployments where visual analysis accuracy is critical (e.g., edge surveillance, object counting, OCR, and automated grid-labeling).
- Low-latency batch analysis on affordable single-GPU servers.