dark-side-of-the-code

Qwen3-VL-30B-A3B-Instruct-AWQ

README

License: apache-2.0

Quantization details

| Field | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Scheme | W4A16: 4-bit weights, 16-bit activations |
| Group size | 128 |
| Format | compressed-tensors (vLLM CompressedTensorsWNA16MarlinMoEMethod) |
| MoE calibration | moe_calibrate_all_experts=True: every expert receives calibration data, not only routed-to experts |
| Ignored layers | lm_head (full precision), visual.* (vision tower full precision), mlp.gate$ (MoE router full precision) |
| Tool | llmcompressor (AWQModifier, sequential pipeline) |
| Calibration dataset | HuggingFaceH4/ultrachat_200k (train_sft split), text-only |
| Calibration samples | 256 |
| Max sequence length | 1024 tokens |
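
The `$` anchor in the `mlp.gate$` ignore pattern is what separates the MoE router from the experts' `gate_proj` layers, which are quantized. The sketch below illustrates that distinction. It assumes the patterns behave like regular expressions searched within a module name (llmcompressor's own matching rules may differ), and the module names are hypothetical examples rather than names read from this checkpoint.

```python
import re

# Patterns copied from the "Ignored layers" row. Treating them as regular
# expressions searched anywhere in the module name is an assumption.
IGNORE_PATTERNS = ["lm_head", "visual.*", "mlp.gate$"]

# Hypothetical module names, for illustration only.
MODULE_NAMES = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.experts.3.gate_proj",
    "model.language_model.layers.0.self_attn.q_proj",
]


def is_ignored(name, patterns=IGNORE_PATTERNS):
    """Return True if any ignore pattern matches the module name."""
    return any(re.search(pattern, name) for pattern in patterns)


kept_full_precision = [n for n in MODULE_NAMES if is_ignored(n)]
quantized = [n for n in MODULE_NAMES if not is_ignored(n)]

print("full precision:", kept_full_precision)
print("quantized:     ", quantized)
```

Without the trailing `$`, the router pattern would also match `gate_proj` and leave every expert's gate projection unquantized.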

Total on-disk size: ~17.8 GB across four safetensors shards.
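
To make "W4A16, group size 128" concrete, here is a minimal sketch of group-wise 4-bit round-to-nearest quantization in pure Python. It is not AWQ itself: AWQ additionally rescales weight channels using activation statistics before rounding, and that step is omitted here. Symmetric quantization and a 16-bit scale per group are assumptions made for the illustration, and all function names are hypothetical.

```python
def quantize_group(weights):
    """Symmetric 4-bit round-to-nearest for one group; returns (ints, scale)."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 7 so every value fits the signed 4-bit range.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate weights from 4-bit integers and the group scale."""
    return [v * scale for v in q]


def quantize(weights, group_size=128):
    """Split a flat weight list into groups and quantize each independently."""
    return [
        quantize_group(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


def bits_per_weight(group_size=128, weight_bits=4, scale_bits=16):
    """Storage cost per weight: the 4-bit value plus its share of the group scale."""
    return weight_bits + scale_bits / group_size


weights = [(i - 128) / 100 for i in range(256)]  # toy data, two groups of 128
groups = quantize(weights)
print(len(groups), "groups,", bits_per_weight(), "bits per quantized weight")
```

Under these assumptions each quantized weight costs slightly more than 4 bits. The layers kept at full precision (lm_head, vision tower, router) are stored separately and are not covered by this estimate.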

Serving with vLLM

Recipe validated on an RTX 4090 (24 GB) running vLLM 0.21:

bash

vllm serve dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ \
  --max-model-len 49152 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image": 8, "video": 0}' \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 \
  --port 8001

Notes:

  • --kv-cache-dtype fp8 raises the context ceiling on a 24 GB card from ~32K to 48K tokens. The validation checks (text, structured output, vision OCR, tool calling, 14K-token context) showed no observable quality regression, and single-stream decode was marginally faster. Drop the flag if you'd rather keep the KV cache in fp16.
  • --tool-call-parser hermes is the correct parser for Qwen3-VL's tool-call format.
  • The served model id is the repo id you passed to vllm serve (dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ) — use that as the model field in API requests. Add --served-model-name <short-label> if you'd rather expose a shorter id.
  • The vision tower runs at full precision regardless of the weight quant — image (and, if enabled, video) understanding is unaffected by 4-bit compression.
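
The memory effect of the fp8 KV cache can be estimated with simple arithmetic: each token stores one key and one value vector per KV head per layer, so halving the bytes per value halves the cache. The sketch below shows the calculation. The decoder shape (48 layers, 4 KV heads, head dimension 128) is an assumption and should be checked against the repo's config.json; the figure is per sequence and ignores vLLM's allocator overhead.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes needed to cache keys and values for one sequence of num_tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed text-decoder shape; verify against config.json before relying on it.
ASSUMED_SHAPE = dict(num_layers=48, num_kv_heads=4, head_dim=128)

GIB = 1024 ** 3
CONTEXT = 49152  # matches --max-model-len in the serve command

fp16_gib = kv_cache_bytes(CONTEXT, bytes_per_value=2, **ASSUMED_SHAPE) / GIB
fp8_gib = kv_cache_bytes(CONTEXT, bytes_per_value=1, **ASSUMED_SHAPE) / GIB

print(f"fp16 KV cache at {CONTEXT} tokens: {fp16_gib:.2f} GiB")
print(f"fp8  KV cache at {CONTEXT} tokens: {fp8_gib:.2f} GiB")
```

Whatever the exact shape turns out to be, the 2:1 ratio between fp16 and fp8 holds, which is why the flag frees room for a longer context on the same card.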

Python (OpenAI client)

python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Briefly: what is photosynthesis?"}],
    max_tokens=120,
)
print(resp.choices[0].message.content)
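
Since the serve command enables `--enable-auto-tool-choice` with the hermes parser, a tool-calling request can go through the same client. The following is a sketch rather than the author's validation script: the `get_weather` tool and the helper names are hypothetical, and the helpers take the client as an argument so nothing touches the network until you call them.

```python
import json

MODEL_ID = "dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ"

# Hypothetical tool definition, for illustration only.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def build_request(question):
    """Assemble keyword arguments for client.chat.completions.create."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": question}],
        "tools": TOOLS,
        "tool_choice": "auto",
        "max_tokens": 200,
    }


def extract_tool_calls(message):
    """Return (name, arguments dict) pairs from a chat completion message."""
    calls = message.tool_calls or []
    return [(c.function.name, json.loads(c.function.arguments)) for c in calls]


def ask_with_tools(client, question):
    """Send the request and return any tool calls the model emitted."""
    resp = client.chat.completions.create(**build_request(question))
    return extract_tool_calls(resp.choices[0].message)
```

With the server running, `ask_with_tools(OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY"), "What's the weather in Paris?")` should return a list of tool calls, or an empty list if the model answered in plain text.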

Multi-image example

python

import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")


def as_data_url(path: str) -> str:
    """Read a JPEG file and return it as a base64 data URL."""
    data = Path(path).read_bytes()
    return f"data:image/jpeg;base64,{base64.b64encode(data).decode()}"


# One text part followed by one image_url part per frame.
resp = client.chat.completions.create(
    model="dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe each frame and any text visible."},
            {"type": "image_url", "image_url": {"url": as_data_url("frame_0.jpg")}},
            {"type": "image_url", "image_url": {"url": as_data_url("frame_1.jpg")}},
        ],
    }],
    max_tokens=300,
)
print(resp.choices[0].message.content)
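
The example above hard-codes the JPEG MIME type and does not guard against the 8-image cap set by `--limit-mm-per-prompt`. A slightly more defensive variant might look like the sketch below. The helper names and the client-side check are assumptions for illustration; the server enforces its own limit regardless.

```python
import base64
import mimetypes
from pathlib import Path

# Mirrors the "image": 8 value in --limit-mm-per-prompt from the serve command.
MAX_IMAGES = 8


def as_data_url(path):
    """Encode an image file as a data URL, guessing the MIME type from the suffix."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode()
    return f"data:{mime};base64,{encoded}"


def build_content(prompt, image_paths):
    """Build the multimodal content list: one text part, then one part per image."""
    if len(image_paths) > MAX_IMAGES:
        raise ValueError(f"{len(image_paths)} images exceeds the limit of {MAX_IMAGES}")
    parts = [{"type": "text", "text": prompt}]
    parts += [
        {"type": "image_url", "image_url": {"url": as_data_url(p)}}
        for p in image_paths
    ]
    return parts
```

The returned list can be passed as the `content` of a user message in place of the hand-written list above.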

Throughput

On a single RTX 4090 with the recipe above:

| Metric | Value |
|---|---|
| Decode (single-stream) | ~225 tok/s |
| TTFT (small prompt) | ~0.1 s |
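
To reproduce figures like these on your own hardware, one option is to time a streamed response. The sketch below is not the author's benchmark. It counts streamed chunks as a proxy for tokens, which is only approximate because a chunk can carry more or less than one token, and the function names are hypothetical.

```python
import time


def summarize(request_start, chunk_times):
    """Compute TTFT and decode rate from the arrival time of each streamed chunk."""
    if not chunk_times:
        raise ValueError("no chunks received")
    ttft = chunk_times[0] - request_start
    decode_span = chunk_times[-1] - chunk_times[0]
    # Chunks after the first, divided by the time they took to arrive.
    rate = (len(chunk_times) - 1) / decode_span if decode_span > 0 else float("nan")
    return {"ttft_s": ttft, "decode_chunks_per_s": rate}


def measure(client, model, prompt, max_tokens=256):
    """Stream one completion and record when each content chunk arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    chunk_times = []
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            chunk_times.append(time.perf_counter())
    return summarize(start, chunk_times)
```

Call `measure(client, "dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ", "Briefly: what is photosynthesis?")` with the client from the earlier examples, and average several runs before comparing against the table.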

Validation

Five end-to-end checks against an OpenAI-compatible vLLM endpoint serving this checkpoint (fp8 KV cache, 48K context):

| Check | Result |
|---|---|
| text coherence | pass: coherent answer to a knowledge question |
| structured JSON | pass: valid JSON with all expected keys |
| vision + OCR | pass: reads on-image text and names a drawn shape |
| tool calling | pass: emits a correct function call |
| 14K-token context | pass: coherent reply with full prompt context loaded |
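
The check scripts themselves are not part of this card. As an illustration of what a "valid JSON with all expected keys" check could look like, here is a minimal sketch; the function name and the example keys are hypothetical.

```python
import json


def check_json_reply(text, expected_keys):
    """Return (ok, detail); ok is True when text is a JSON object with all expected keys."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(obj, dict):
        return False, "top-level value is not an object"
    missing = sorted(set(expected_keys) - set(obj))
    if missing:
        return False, f"missing keys: {missing}"
    return True, "ok"


# Hypothetical model reply and key set, for illustration only.
reply = '{"name": "Ada", "age": 36}'
print(check_json_reply(reply, ["name", "age"]))
```

Feeding it `resp.choices[0].message.content` from a request that asks for JSON gives a quick pass/fail signal without depending on the exact wording of the reply.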

Limitations and accuracy

  • Quantization introduces a small accuracy degradation compared to the bf16 base model. The checks above confirm task-level competence on common multimodal workloads (vision, structured output, tool calling, long context) but do not constitute a formal benchmark suite (MMLU, MMMU, etc.).
  • The vision tower and MoE router are kept at full precision — image / video quality and routing behaviour should be unchanged.
  • The optional --kv-cache-dtype fp8 serve flag carries a small theoretical accuracy risk on very long contexts. The 14K-token check showed no degradation, but the checks above do not go beyond that length, so be cautious with workloads above 32K tokens of context.
  • Inherits all limitations and intended-use restrictions from the base model.

License

Inherited from the base model: see Qwen/Qwen3-VL-30B-A3B-Instruct for the authoritative terms (Apache 2.0 as of publication). The quantized weights are a derivative work; verify that the base model's license covers your intended use before commercial deployment.

Model provider: dark-side-of-the-code

Model tree

Base: Qwen/Qwen3-VL-30B-A3B-Instruct
Quantized: this model

Modalities

Input: Text, Image
Output: Text
