Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-4bit
This is an AWQ 4-bit quantized version of huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated.
The primary goal of this quantization is to retain the original model's video analysis capabilities. By converting the model to AWQ-4bit, it becomes possible to launch the model with vLLM and directly pass video inputs—preserving temporal information and continuous-frame understanding. In contrast, llama.cpp-based solutions rely on frame sampling, which inevitably discards fine-grained temporal dynamics and inter-frame coherence.
Quantization details
The quantization configuration (layer selection, etc.) follows cyankiwi/Qwen3.5-9B-AWQ-4bit.
The calibration dataset used for AWQ is mit-han-lab/pile-val-backup.
You can find the original quantization script in the model repository. This is my first time doing something like this.
Running 65K context on 16GB VRAM
On an RTX 5060 Ti 16GB (Blackwell architecture), the 65,536-token context window can be used by enabling FP8 KV-cache quantization (set --kv-cache-dtype fp8" when loading, requires a compatible vLLM version).
Note on GPU architectures: This has been tested and confirmed working on the RTX 5060 Ti. Due to architectural differences, the same cannot be guaranteed for RTX 40-series or older 16GB GPUs when vision capabilities are also loaded—OOM (out of memory) is still possible. Adjust batch size and context length accordingly.
Example vLLM launch command
Below is the launch configuration I use on Windows. Replace the model path and media directory with your own.
cmd
set VIDEO_MAX_PIXELS=200704set FPS=2.0set FPS_MAX_FRAMES=2590set FPS_MIN_FRAMES=4set FORCE_QWENVL_VIDEO_READER=torchcodecpython -m vllm.entrypoints.openai.api_server ^--model /path/to/your/model/Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-W4A16 ^--served-model-name Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-W4A16 ^--trust-remote-code ^--enforce-eager ^--dtype auto ^--max-model-len 65536 ^--kv-cache-dtype fp8 ^--gpu-memory-utilization 0.92 ^--port 8000 ^--allowed-local-media-path /path/to/your/media
Acknowledgements
- Original model: huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated
- AWQ quantization reference: cyankiwi/Qwen3.5-9B-AWQ-4bit
- Calibration dataset: mit-han-lab/pile-val-backup
Model provider
nemozxy123
Model tree
Base
huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated
Quantized
this model
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information