Huihui-Qwen3.5-9B-abliterated-AWQ-4bit API & Inference Endpoint

Huihui-Qwen3.5-9B-abliterated-AWQ-4bit

This is an AWQ 4-bit quantized version of huihui-ai/Huihui-Qwen3.5-9B-abliterated.

The goal is to preserve the original model’s vision-language capabilities—particularly video understanding—while making it practical for consumer GPUs. Unlike llama.cpp-based solutions, which currently offer limited support for video input in VLMs, this quantization allows direct, efficient inference using the vLLM.

Quantization details

The quantization configuration (layer selection, etc.) follows cyankiwi/Qwen3.5-9B-AWQ-4bit.

The calibration dataset used for AWQ is mit-han-lab/pile-val-backup.

You can find the original quantization script in the model repository. This is my first time doing something like this.

Running full 262K context on 16GB VRAM

On an RTX 5060 Ti 16GB (Blackwell architecture), the full 262,144-token context window can be used by enabling FP8 KV-cache quantization (set --kv-cache-dtype fp8" when loading, requires a compatible vLLM version).

Note on GPU architectures: This has been tested and confirmed working on the RTX 5060 Ti. Due to architectural differences, the same cannot be guaranteed for RTX 40-series or older 16GB GPUs when vision capabilities are also loaded—OOM (out of memory) is still possible. Adjust batch size and context length accordingly.

Example vLLM launch command

Below is the launch configuration I use on Windows. Replace the model path and media directory with your own.

cmd
set VIDEO_MAX_PIXELS=200704
set FPS=2.0
set FPS_MAX_FRAMES=2590
set FPS_MIN_FRAMES=4
set FORCE_QWENVL_VIDEO_READER=torchcodec

python -m vllm.entrypoints.openai.api_server ^
    --model /path/to/your/model/HuiHui-Qwen3.5-9B-abliterated-AWQ-W4A16 ^
    --served-model-name HuiHui-Qwen3.5-9B-abliterated-AWQ-W4A16 ^
    --trust-remote-code ^
    --enforce-eager ^
    --dtype auto ^
    --max-model-len 262144 ^
    --kv-cache-dtype fp8 ^
    --gpu-memory-utilization 0.92 ^
    --port 8000 ^
    --allowed-local-media-path /path/to/your/media

Acknowledgements

Original model: huihui-ai/Huihui-Qwen3.5-9B-abliterated
AWQ quantization reference: cyankiwi/Qwen3.5-9B-AWQ-4bit
Calibration dataset: mit-han-lab/pile-val-backup

Huihui-Qwen3.5-9B-abliterated-AWQ-4bit

Quantization details

The quantization configuration (layer selection, etc.) follows cyankiwi/Qwen3.5-9B-AWQ-4bit.

The calibration dataset used for AWQ is mit-han-lab/pile-val-backup.

You can find the original quantization script in the model repository. This is my first time doing something like this.

Running full 262K context on 16GB VRAM

Example vLLM launch command

Below is the launch configuration I use on Windows. Replace the model path and media directory with your own.

cmd

set VIDEO_MAX_PIXELS=200704
set FPS=2.0
set FPS_MAX_FRAMES=2590
set FPS_MIN_FRAMES=4
set FORCE_QWENVL_VIDEO_READER=torchcodec

python -m vllm.entrypoints.openai.api_server ^
    --model /path/to/your/model/HuiHui-Qwen3.5-9B-abliterated-AWQ-W4A16 ^
    --served-model-name HuiHui-Qwen3.5-9B-abliterated-AWQ-W4A16 ^
    --trust-remote-code ^
    --enforce-eager ^
    --dtype auto ^
    --max-model-len 262144 ^
    --kv-cache-dtype fp8 ^
    --gpu-memory-utilization 0.92 ^
    --port 8000 ^
    --allowed-local-media-path /path/to/your/media

Acknowledgements

Huihui-Qwen3.5-9B-abliterated-AWQ-4bit

README