Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Huihui-Qwen3.5-9B-abliterated-AWQ-4bit

This is an AWQ 4-bit quantized version of huihui-ai/Huihui-Qwen3.5-9B-abliterated.

The goal is to preserve the original model’s vision-language capabilities—particularly video understanding—while making it practical for consumer GPUs. Unlike llama.cpp-based solutions, which currently offer limited support for video input in VLMs, this quantization allows direct, efficient inference using the vLLM.

Quantization details

The quantization configuration (layer selection, etc.) follows cyankiwi/Qwen3.5-9B-AWQ-4bit.

The calibration dataset used for AWQ is mit-han-lab/pile-val-backup.

You can find the original quantization script in the model repository. This is my first time doing something like this.

Running full 262K context on 16GB VRAM

On an RTX 5060 Ti 16GB (Blackwell architecture), the full 262,144-token context window can be used by enabling FP8 KV-cache quantization (set --kv-cache-dtype fp8" when loading, requires a compatible vLLM version).

Note on GPU architectures: This has been tested and confirmed working on the RTX 5060 Ti. Due to architectural differences, the same cannot be guaranteed for RTX 40-series or older 16GB GPUs when vision capabilities are also loaded—OOM (out of memory) is still possible. Adjust batch size and context length accordingly.

Example vLLM launch command

Below is the launch configuration I use on Windows. Replace the model path and media directory with your own.

cmd

set VIDEO_MAX_PIXELS=200704
set FPS=2.0
set FPS_MAX_FRAMES=2590
set FPS_MIN_FRAMES=4
set FORCE_QWENVL_VIDEO_READER=torchcodec
python -m vllm.entrypoints.openai.api_server ^
--model /path/to/your/model/HuiHui-Qwen3.5-9B-abliterated-AWQ-W4A16 ^
--served-model-name HuiHui-Qwen3.5-9B-abliterated-AWQ-W4A16 ^
--trust-remote-code ^
--enforce-eager ^
--dtype auto ^
--max-model-len 262144 ^
--kv-cache-dtype fp8 ^
--gpu-memory-utilization 0.92 ^
--port 8000 ^
--allowed-local-media-path /path/to/your/media

Acknowledgements

Model provider

nemozxy123

Model tree

Base

huihui-ai/Huihui-Qwen3.5-9B-abliterated

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today