Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-4bit

This is an AWQ 4-bit quantized version of huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated.

The primary goal of this quantization is to retain the original model's video analysis capabilities. By converting the model to AWQ-4bit, it becomes possible to launch the model with vLLM and directly pass video inputs—preserving temporal information and continuous-frame understanding. In contrast, llama.cpp-based solutions rely on frame sampling, which inevitably discards fine-grained temporal dynamics and inter-frame coherence.

Quantization details

The quantization configuration (layer selection, etc.) follows cyankiwi/Qwen3.5-9B-AWQ-4bit.

The calibration dataset used for AWQ is mit-han-lab/pile-val-backup.

You can find the original quantization script in the model repository. This is my first time doing something like this.

Running 65K context on 16GB VRAM

On an RTX 5060 Ti 16GB (Blackwell architecture), the 65,536-token context window can be used by enabling FP8 KV-cache quantization (set --kv-cache-dtype fp8" when loading, requires a compatible vLLM version).

Note on GPU architectures: This has been tested and confirmed working on the RTX 5060 Ti. Due to architectural differences, the same cannot be guaranteed for RTX 40-series or older 16GB GPUs when vision capabilities are also loaded—OOM (out of memory) is still possible. Adjust batch size and context length accordingly.

Example vLLM launch command

Below is the launch configuration I use on Windows. Replace the model path and media directory with your own.

cmd

set VIDEO_MAX_PIXELS=200704
set FPS=2.0
set FPS_MAX_FRAMES=2590
set FPS_MIN_FRAMES=4
set FORCE_QWENVL_VIDEO_READER=torchcodec
python -m vllm.entrypoints.openai.api_server ^
--model /path/to/your/model/Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-W4A16 ^
--served-model-name Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-W4A16 ^
--trust-remote-code ^
--enforce-eager ^
--dtype auto ^
--max-model-len 65536 ^
--kv-cache-dtype fp8 ^
--gpu-memory-utilization 0.92 ^
--port 8000 ^
--allowed-local-media-path /path/to/your/media

Acknowledgements

Model provider

nemozxy123

Model tree

Base

huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated

Quantized

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today