dark-side-of-the-code

Qwen3-VL-30B-A3B-Instruct-AWQ

License: apache-2.0

Quantization details

| Field | Value |
| --- | --- |
| Method | AWQ (Activation-aware Weight Quantization) |
| Scheme | W4A16 (4-bit weights, 16-bit activations) |
| Group size | 128 |
| Format | `compressed-tensors` (vLLM `CompressedTensorsWNA16MarlinMoEMethod`) |
| MoE calibration | `moe_calibrate_all_experts=True`: every expert receives calibration data, not only the experts the router selects |
| Ignored layers | `lm_head` (full precision), `visual.*` (vision tower full precision), `mlp.gate$` (MoE router full precision) |
| Tool | llmcompressor (`AWQModifier`, sequential pipeline) |
| Calibration dataset | `HuggingFaceH4/ultrachat_200k` (`train_sft` split), text-only |
| Calibration samples | 256 |
| Max sequence length | 1024 tokens |
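
The "Ignored layers" patterns can be sanity-checked with a few lines of Python. This is a minimal sketch, not the quantization tool's own matching code: it assumes the patterns are regular expressions searched against the full module name, and the module names below are hypothetical examples rather than names read from the checkpoint.

```python
import re

# Patterns copied from the "Ignored layers" row. Treating them as regexes
# searched against the module name is an assumption of this sketch.
IGNORE_PATTERNS = ["lm_head", "visual.*", "mlp.gate$"]


def is_ignored(module_name: str) -> bool:
    """Return True if any ignore pattern matches the module name."""
    return any(re.search(pattern, module_name) for pattern in IGNORE_PATTERNS)


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.experts.3.gate_proj",
    "model.language_model.layers.0.self_attn.q_proj",
]

for name in examples:
    status = "kept at full precision" if is_ignored(name) else "quantized"
    print(f"{name}: {status}")
```

The trailing `$` in `mlp.gate$` is what separates the router (`...mlp.gate`) from the per-expert `gate_proj` projections, which do not end in `gate` and are therefore quantized.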

Total on-disk size: ~17.8 GB across four safetensors shards.
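
For a rough sense of what W4A16 with group size 128 costs per weight, the arithmetic can be sketched as below. The sketch assumes one 16-bit scale per group and, optionally, one 4-bit zero point per group; whether this checkpoint stores zero points is not stated above. It does not reproduce the on-disk total, which also includes the layers kept at full precision.

```python
def effective_bits_per_weight(
    weight_bits: int = 4,
    group_size: int = 128,
    scale_bits: int = 16,
    zero_point_bits: int = 0,
) -> float:
    """Bits per quantized weight, with per-group metadata amortized."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage in GB (10^9 bytes) for n_params quantized weights."""
    return n_params * bits_per_weight / 8 / 1e9


symmetric = effective_bits_per_weight()                    # scale only
asymmetric = effective_bits_per_weight(zero_point_bits=4)  # scale + zero point
print(f"symmetric:  {symmetric} bits/weight")
print(f"asymmetric: {asymmetric} bits/weight")

# Hypothetical parameter count, chosen only to show the scaling.
print(f"1e9 weights at {symmetric} bits: {quantized_size_gb(1e9, symmetric):.3f} GB")
```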

Serving with vLLM

Recipe validated on an RTX 4090 (24 GB) running vLLM 0.21:

```bash
vllm serve dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ \
  --max-model-len 49152 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image": 8, "video": 0}' \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 \
  --port 8001
```

Notes:

- `--kv-cache-dtype fp8` raises the usable context on a 24 GB card from ~32K to 48K tokens. No quality regression was observed on the text, structured-output, vision-OCR, tool-calling, or 14K-token checks, and single-stream decode was marginally faster. Drop the flag if you would rather keep the KV cache in fp16.
- `--tool-call-parser hermes` is the correct parser for Qwen3-VL's tool-call format.
- The served model id is the repo id passed to `vllm serve` (`dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ`); use it as the `model` field in API requests. Add `--served-model-name <short-label>` to expose a shorter id.
- The vision tower runs at full precision regardless of the weight quantization, so image (and, if enabled, video) feature extraction is not touched by the 4-bit compression. The language model that consumes those features is still quantized.
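
The effect of the fp8 KV cache on context length comes down to bytes per cached token. The sketch below shows the arithmetic only. The layer count, KV-head count, and head dimension are placeholder assumptions, so read the real values from the checkpoint's `config.json`. It also ignores vLLM's block allocation and any memory needed for activations.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int,
) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Placeholder shape values (assumptions, not read from this checkpoint).
SHAPE = dict(num_layers=48, num_kv_heads=4, head_dim=128)

GIB = 1024 ** 3
for label, tokens, elem_bytes in [
    ("fp16 @ 32K", 32_768, 2),
    ("fp16 @ 48K", 49_152, 2),
    ("fp8  @ 48K", 49_152, 1),
]:
    size = kv_cache_bytes(tokens, bytes_per_elem=elem_bytes, **SHAPE)
    print(f"{label}: {size / GIB:.2f} GiB for one full-length sequence")
```

Whatever the real shape values are, halving `bytes_per_elem` halves the cache, which is why a fixed memory budget holds roughly twice as many tokens with fp8.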

Python (OpenAI client)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Briefly: what is photosynthesis?"}],
    max_tokens=120,
)
print(resp.choices[0].message.content)
```
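
Since the serve recipe enables `--enable-auto-tool-choice`, a tool-calling request can be sketched along the same lines. This is a minimal illustration under stated assumptions: `get_weather`, its schema, and the `REGISTRY` dispatcher are hypothetical, and `ask_with_tools` only works against a running endpoint started as shown above.

```python
import json

# Hypothetical tool schema in the OpenAI "tools" format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def get_weather(city: str) -> dict:
    """Hypothetical stub; a real tool would query a weather service."""
    return {"city": city, "forecast": "sunny"}


REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments: str) -> str:
    """Dispatch one tool call; `arguments` is the JSON string the model emitted."""
    result = REGISTRY[name](**json.loads(arguments))
    return json.dumps(result)


def ask_with_tools(prompt: str) -> list[str]:
    """Send a prompt with tools attached and execute any calls the model makes."""
    # Imported here so the helpers above can be used without the openai package.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
        tool_choice="auto",
        max_tokens=200,
    )
    calls = resp.choices[0].message.tool_calls or []
    return [run_tool_call(c.function.name, c.function.arguments) for c in calls]
```

With the server running, `ask_with_tools("What's the weather in Oslo?")` would return the stub's JSON output for each call the model chose to make, or an empty list if it answered directly.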

Multi-image example

```python
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

def as_data_url(path: str) -> str:
    data = Path(path).read_bytes()
    return f"data:image/jpeg;base64,{base64.b64encode(data).decode()}"

resp = client.chat.completions.create(
    model="dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe each frame and any text visible."},
            {"type": "image_url", "image_url": {"url": as_data_url("frame_0.jpg")}},
            {"type": "image_url", "image_url": {"url": as_data_url("frame_1.jpg")}},
        ],
    }],
    max_tokens=300,
)
print(resp.choices[0].message.content)
```
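
The helper above hardcodes `image/jpeg`, which is fine for `.jpg` inputs. If you also send PNG or WebP files, one possible variant guesses the MIME type from the file extension and checks the image count against the `"image": 8` limit from the serve recipe before any request is built. `MAX_IMAGES` and `image_parts` are illustrative names, not part of any API.

```python
import base64
import mimetypes
from pathlib import Path

MAX_IMAGES = 8  # mirrors "image": 8 in --limit-mm-per-prompt above


def as_data_url(path: str) -> str:
    """Encode an image file as a data URL, guessing the MIME type from its name."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {path!r}")
    data = Path(path).read_bytes()
    return f"data:{mime};base64,{base64.b64encode(data).decode()}"


def image_parts(paths: list[str]) -> list[dict]:
    """Build the image_url content parts for one user message."""
    if len(paths) > MAX_IMAGES:
        raise ValueError(f"{len(paths)} images exceeds the limit of {MAX_IMAGES}")
    return [{"type": "image_url", "image_url": {"url": as_data_url(p)}} for p in paths]
```

The returned list can be appended to the `content` array after the text part, in place of the two hand-written `image_url` entries.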

Throughput

On a single RTX 4090 with the recipe above:

| Metric | Value |
| --- | --- |
| Decode (single-stream) | ~225 tok/s |
| TTFT (small prompt) | ~0.1 s |
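
To get comparable numbers on your own hardware, one option is to time a streamed response. The sketch below is not the method used to produce the table; it assumes each streamed content chunk is roughly one token, which is an approximation, and `measure` needs the endpoint from the serve recipe to be running.

```python
import time


def summarize(t_start: float, chunk_times: list[float]) -> dict:
    """Compute time-to-first-token and decode rate from chunk arrival times."""
    if not chunk_times:
        raise ValueError("no content chunks were received")
    ttft = chunk_times[0] - t_start
    decode_window = chunk_times[-1] - chunk_times[0]
    # Rate over the chunks that arrived after the first one.
    rate = (len(chunk_times) - 1) / decode_window if decode_window > 0 else float("nan")
    return {"ttft_s": ttft, "decode_chunks_per_s": rate}


def measure(prompt: str, max_tokens: int = 256) -> dict:
    """Stream one completion and record when each content chunk arrives."""
    # Imported here so summarize() can be used without the openai package.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
    t_start = time.perf_counter()
    stream = client.chat.completions.create(
        model="dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    chunk_times = []
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            chunk_times.append(time.perf_counter())
    return summarize(t_start, chunk_times)
```

Client-side timing includes network and HTTP overhead, so expect slightly worse figures than server-side metrics would report.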

Validation

Five end-to-end checks against an OpenAI-compatible vLLM endpoint serving this checkpoint (fp8 KV cache, 48K context):

| Check | Result |
| --- | --- |
| text coherence | pass: coherent answer to a knowledge question |
| structured JSON | pass: valid JSON with all expected keys |
| vision + OCR | pass: reads on-image text and names a drawn shape |
| tool calling | pass: emits a correct function call |
| 14K-token context | pass: coherent reply with full prompt context loaded |
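
The scripts behind these checks are not included here. As an illustration of what a "structured JSON" check could look like, the hypothetical helper below accepts a model reply only if it parses as a JSON object containing every expected key.

```python
import json


def check_structured_json(reply: str, expected_keys: set[str]) -> bool:
    """Return True if reply is a JSON object that contains all expected keys."""
    try:
        obj = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and expected_keys.issubset(obj)


# Hypothetical replies, for illustration only.
print(check_structured_json('{"name": "Ada", "age": 36}', {"name", "age"}))
print(check_structured_json('{"name": "Ada"}', {"name", "age"}))
print(check_structured_json("Sure! Here is the JSON you asked for.", {"name"}))
```

A stricter variant could validate value types as well, for example against a JSON Schema, but key presence is usually enough for a smoke test.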

Limitations and accuracy

- Quantization introduces a small accuracy degradation compared to the bf16 base model. The checks above confirm task-level competence on common multimodal workloads (vision, structured output, tool calling, long context) but do not constitute a formal benchmark suite (MMLU, MMMU, etc.).
- The vision tower and MoE router are kept at full precision, so visual feature extraction and routing behaviour should be unchanged.
- The optional `--kv-cache-dtype fp8` serve flag carries a small theoretical accuracy risk on very long contexts. The 14K-token check did not show degradation, but be cautious with workloads above 32K tokens of context.
- Inherits all limitations and intended-use restrictions from the base model.

License

Inherits from the base model; see Qwen/Qwen3-VL-30B-A3B-Instruct for the authoritative terms (Apache 2.0 as of publication). The quantized weights are a derivative work, so verify that the base model's license applies to your intended use before commercial deployment.

Model Details

Model Provider

dark-side-of-the-code

Model Tree

Base

Qwen/Qwen3-VL-30B-A3B-Instruct

Quantized

this model

Input Modalities

Text

Image

Output Modalities