ishaqinu

Qwen3.5-9B-FP8-Dynamic

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

Primary Focus: Vision Accuracy Preservation

Vision-Language Models (VLMs) are highly sensitive to quantization in their visual perception components. Quantizing the vision encoder typically degrades performance in spatial recognition, OCR, object counting, and visual grid analysis.

To solve this, this model uses a mixed-precision quantization recipe:

🎯 Unquantized Vision Tower: All visual transformer layers, vision projections, and linear attention modules are entirely bypassed and kept in native float16 precision. Visual feature extraction quality remains identical to the original unquantized model.
💾 Quantized Language layers: Only standard linear projections in the language model are compressed to FP8 using dynamic activation scaling and static weight scaling.

This combination yields the best of both worlds: native vision accuracy at half the memory footprint.

Key Benefits

💾 VRAM Savings: Cuts active VRAM footprint from ~18 GB (BF16) down to ~9.5 GB, allowing it to fit easily on standard 12GB/16GB VRAM GPUs.
🎯 Zero Visual Accuracy Loss: Retains the exact native coordinates, bounding box capabilities, grid reading, and visual OCR precision of the original Qwen/Qwen3.5-9B model.
⚡ Hardware Acceleration: Faster inference on NVIDIA Ada Lovelace, Hopper, and Blackwell Tensor Cores (e.g., RTX 40-series, L4, A100, H100) using FP8 operations.

Quantization Methodology

Quantization was performed via the one-shot method in llmcompressor with a Dynamic FP8 Activation scaling and Static FP8 Weight scaling scheme.

The following components were explicitly ignored/exempted from quantization to guarantee vision performance:

Vision Encoder (re:.*visual.*): Keeps the entire image-processing pipeline in float16.
Language Model Head (lm_head): Mapped to native precision to preserve textual coherence.
Linear Attention Blocks (re:.*linear_attn.*): Preserved in native precision.

Quantization Recipe Used:

python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForImageTextToText

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*visual.*", "re:.*linear_attn.*"]
)
oneshot(model=model, recipe=recipe)

How to Load and Use

python
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

model_id = "YOUR_HF_USERNAME/Qwen3.5-9B-FP8-Dynamic"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True
)

Primary Use Cases

VRAM-constrained deployments where visual analysis accuracy is critical (e.g., edge surveillance, object counting, OCR, and automated grid-labeling).
Low-latency batch analysis on affordable single-GPU servers.

Model provider

ishaqinu

Model tree

Base

Qwen/Qwen3.5-9B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

Primary Focus: Vision Accuracy Preservation

To solve this, this model uses a mixed-precision quantization recipe:

🎯 Unquantized Vision Tower: All visual transformer layers, vision projections, and linear attention modules are entirely bypassed and kept in native float16 precision. Visual feature extraction quality remains identical to the original unquantized model.
💾 Quantized Language layers: Only standard linear projections in the language model are compressed to FP8 using dynamic activation scaling and static weight scaling.

This combination yields the best of both worlds: native vision accuracy at half the memory footprint.

Key Benefits

💾 VRAM Savings: Cuts active VRAM footprint from ~18 GB (BF16) down to ~9.5 GB, allowing it to fit easily on standard 12GB/16GB VRAM GPUs.
🎯 Zero Visual Accuracy Loss: Retains the exact native coordinates, bounding box capabilities, grid reading, and visual OCR precision of the original Qwen/Qwen3.5-9B model.
⚡ Hardware Acceleration: Faster inference on NVIDIA Ada Lovelace, Hopper, and Blackwell Tensor Cores (e.g., RTX 40-series, L4, A100, H100) using FP8 operations.

Quantization Methodology

Quantization was performed via the one-shot method in llmcompressor with a Dynamic FP8 Activation scaling and Static FP8 Weight scaling scheme.

The following components were explicitly ignored/exempted from quantization to guarantee vision performance:

Vision Encoder (re:.*visual.*): Keeps the entire image-processing pipeline in float16.
Language Model Head (lm_head): Mapped to native precision to preserve textual coherence.
Linear Attention Blocks (re:.*linear_attn.*): Preserved in native precision.

Quantization Recipe Used:

python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForImageTextToText

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*visual.*", "re:.*linear_attn.*"]
)
oneshot(model=model, recipe=recipe)

How to Load and Use

python
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

model_id = "YOUR_HF_USERNAME/Qwen3.5-9B-FP8-Dynamic"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True
)

Primary Use Cases

VRAM-constrained deployments where visual analysis accuracy is critical (e.g., edge surveillance, object counting, OCR, and automated grid-labeling).
Low-latency batch analysis on affordable single-GPU servers.

Qwen3.5-9B-FP8-Dynamic

Get help setting up a custom Dedicated Endpoints.

README

Primary Focus: Vision Accuracy Preservation

Key Benefits

Quantization Methodology

Quantization Recipe Used:

How to Load and Use

Primary Use Cases

Explore FriendliAI today

README

Primary Focus: Vision Accuracy Preservation

Key Benefits

Quantization Methodology

Quantization Recipe Used:

How to Load and Use

Primary Use Cases