nemozxy123

Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-W4A16

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-4bit

This is an AWQ 4-bit quantized version of huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated.

The primary goal of this quantization is to retain the original model's video analysis capabilities. By converting the model to AWQ-4bit, it becomes possible to launch the model with vLLM and directly pass video inputs—preserving temporal information and continuous-frame understanding. In contrast, llama.cpp-based solutions rely on frame sampling, which inevitably discards fine-grained temporal dynamics and inter-frame coherence.

Quantization details

The quantization configuration (layer selection, etc.) follows cyankiwi/Qwen3.5-9B-AWQ-4bit.

The calibration dataset used for AWQ is mit-han-lab/pile-val-backup.

You can find the original quantization script in the model repository. This is my first time doing something like this.

Running 65K context on 16GB VRAM

On an RTX 5060 Ti 16GB (Blackwell architecture), the 65,536-token context window can be used by enabling FP8 KV-cache quantization (set --kv-cache-dtype fp8" when loading, requires a compatible vLLM version).

Model provider

nemozxy123

Model tree

Base

huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated

Quantized

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-4bit

This is an AWQ 4-bit quantized version of huihui-ai/Huihui-Qwen3-VL-8B-Thinking-abliterated.

Quantization details

The quantization configuration (layer selection, etc.) follows cyankiwi/Qwen3.5-9B-AWQ-4bit.

The calibration dataset used for AWQ is mit-han-lab/pile-val-backup.

You can find the original quantization script in the model repository. This is my first time doing something like this.

Running 65K context on 16GB VRAM

cmd

set VIDEO_MAX_PIXELS=200704
set FPS=2.0
set FPS_MAX_FRAMES=2590
set FPS_MIN_FRAMES=4
set FORCE_QWENVL_VIDEO_READER=torchcodec

python -m vllm.entrypoints.openai.api_server ^
    --model /path/to/your/model/Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-W4A16 ^
    --served-model-name Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-W4A16 ^
    --trust-remote-code ^
    --enforce-eager ^
    --dtype auto ^
    --max-model-len 65536 ^
    --kv-cache-dtype fp8 ^
    --gpu-memory-utilization 0.92 ^
    --port 8000 ^
    --allowed-local-media-path /path/to/your/media

Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-W4A16

Get help setting up a custom Dedicated Endpoints.

README

Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-4bit

Quantization details

Running 65K context on 16GB VRAM

Explore FriendliAI today

README

Huihui-Qwen3-VL-8B-Thinking-abliterated-AWQ-4bit

Quantization details

Running 65K context on 16GB VRAM

Example vLLM launch command

Acknowledgements