dark-side-of-the-code
Qwen3-VL-30B-A3B-Instruct-AWQ
License: apache-2.0
Quantization details
| Field | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Scheme | W4A16 — 4-bit weights, 16-bit activations |
| Group size | 128 |
| Format | compressed-tensors (vLLM CompressedTensorsWNA16MarlinMoEMethod) |
| MoE calibration | moe_calibrate_all_experts=True — every expert receives calibration data, not only routed-to experts |
| Ignored layers | lm_head (full precision), visual.* (vision tower full precision), mlp.gate$ (MoE router full precision) |
| Tool | llmcompressor (AWQModifier, sequential pipeline) |
| Calibration dataset | HuggingFaceH4/ultrachat_200k (train_sft split) — text-only |
| Calibration samples | 256 |
| Max sequence length | 1024 tokens |
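The entries in the "Ignored layers" row read as regular expressions, and the trailing `$` in `mlp.gate$` is what keeps the router pattern from also catching expert projections such as `gate_proj`. The sketch below illustrates this with Python's `re` module. The module names are illustrative assumptions rather than names read from the checkpoint, and llmcompressor's own matching logic may differ in detail.

```python
import re

# Patterns copied from the "Ignored layers" row of the table.
IGNORE_PATTERNS = ["lm_head", "visual.*", "mlp.gate$"]


def is_ignored(module_name: str) -> bool:
    """Return True if any ignore pattern matches somewhere in the module name."""
    return any(re.search(pattern, module_name) for pattern in IGNORE_PATTERNS)


# Hypothetical module names, for illustration only.
EXAMPLE_MODULES = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.experts.0.gate_proj",
    "model.language_model.layers.0.self_attn.q_proj",
]

for name in EXAMPLE_MODULES:
    status = "full precision" if is_ignored(name) else "quantized"
    print(f"{name}: {status}")
```

Without the `$` anchor, the router pattern would also match `...mlp.gate_up_proj`-style names in architectures that have them, which is presumably why the anchor is there.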
Total on-disk size: ~17.8 GB across four safetensors shards.
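As a rough sanity check on the size, group-wise quantization stores per-group metadata on top of the 4-bit weights. The sketch below assumes one 16-bit scale per group of 128 weights and no stored zero point, which is an assumption about the storage layout, not something the table states. The parameter split passed to `estimate_size_gb` is hypothetical and only shows how the arithmetic works.

```python
def effective_bits_per_weight(weight_bits: int, group_size: int,
                              scale_bits: int = 16, zero_point_bits: int = 0) -> float:
    """Average stored bits per quantized weight, including per-group metadata."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def estimate_size_gb(quantized_params: float, full_precision_params: float,
                     bits_per_weight: float, full_precision_bits: int = 16) -> float:
    """Rough checkpoint size in GB (1 GB = 1e9 bytes)."""
    total_bits = (quantized_params * bits_per_weight
                  + full_precision_params * full_precision_bits)
    return total_bits / 8 / 1e9


bits = effective_bits_per_weight(weight_bits=4, group_size=128)

# Hypothetical split between quantized and full-precision parameters.
size = estimate_size_gb(quantized_params=29e9, full_precision_params=1.5e9,
                        bits_per_weight=bits)
print(f"{bits:.3f} bits per quantized weight, roughly {size:.1f} GB")
```

The full-precision term covers layers on the ignore list (vision tower, `lm_head`, routers) plus anything else left unquantized, such as embeddings and norms.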
Serving with vLLM
Recipe validated on an RTX 4090 (24 GB) running vLLM 0.21:
```bash
vllm serve dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ \
  --max-model-len 49152 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image": 8, "video": 0}' \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 \
  --port 8001
```
Notes:
- `--kv-cache-dtype fp8` lifts the context ceiling on a 24 GB card from ~32K to 48K with no observable quality regression on text / structured-output / vision-OCR / tool-calling / 14K-token tasks (single-stream decode is actually marginally faster). Drop it if you'd rather keep KV cache in fp16.
- `--tool-call-parser hermes` is the correct parser for Qwen3-VL's tool-call format.
- The served model id is the repo id you passed to `vllm serve` (`dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ`); use that as the `model` field in API requests. Add `--served-model-name <short-label>` if you'd rather expose a shorter id.
- The vision tower runs at full precision regardless of the weight quant, so image (and, if enabled, video) understanding is unaffected by 4-bit compression.
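The effect of the fp8 KV cache on the context ceiling can be approximated with simple arithmetic: each token stores one key and one value vector per layer and KV head, so halving the bytes per value roughly halves the cache. The layer count, KV head count, and head dimension below are assumptions about the base architecture and should be checked against the checkpoint's `config.json`. The estimate also ignores vLLM's block allocation overhead and any fp8 scaling metadata.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: float) -> float:
    """Bytes of KV cache needed for one token of one sequence."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed architecture values; verify against config.json before relying on them.
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128

for label, bytes_per_value in (("fp16", 2), ("fp8", 1)):
    per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value)
    for context_tokens in (32768, 49152):
        gib = per_token * context_tokens / 2**30
        print(f"{label} KV cache, {context_tokens} tokens: {gib:.2f} GiB")
```

The figures are per sequence at full context length. With `--max-num-seqs 8`, concurrent requests share the same cache budget, so long prompts reduce how many sequences fit at once.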
Python (OpenAI client)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Briefly: what is photosynthesis?"}],
    max_tokens=120,
)
print(resp.choices[0].message.content)
```
Multi-image example
```python
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")


def as_data_url(path: str) -> str:
    data = Path(path).read_bytes()
    return f"data:image/jpeg;base64,{base64.b64encode(data).decode()}"


resp = client.chat.completions.create(
    model="dark-side-of-the-code/Qwen3-VL-30B-A3B-Instruct-AWQ",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe each frame and any text visible."},
                {"type": "image_url", "image_url": {"url": as_data_url("frame_0.jpg")}},
                {"type": "image_url", "image_url": {"url": as_data_url("frame_1.jpg")}},
            ],
        }
    ],
    max_tokens=300,
)
print(resp.choices[0].message.content)
```
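The multi-image example hard-codes `image/jpeg` and sends two frames. For mixed file types or longer frame lists, a small helper can guess the MIME type and enforce the 8-image cap from the serve recipe on the client side. This is a sketch under those assumptions: `build_content` and `MAX_IMAGES` are hypothetical names introduced here, not part of any library.

```python
import base64
import mimetypes
from pathlib import Path

# Mirrors --limit-mm-per-prompt '{"image": 8, ...}' from the serve recipe.
MAX_IMAGES = 8


def as_data_url(path: str) -> str:
    """Encode an image file as a data URL, guessing the MIME type from its extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode()
    return f"data:{mime};base64,{encoded}"


def build_content(prompt: str, image_paths: list[str]) -> list[dict]:
    """Build the `content` list for one user message: text first, then images."""
    # Check the count before reading any files so oversized requests fail fast.
    if len(image_paths) > MAX_IMAGES:
        raise ValueError(f"at most {MAX_IMAGES} images per prompt, got {len(image_paths)}")
    content = [{"type": "text", "text": prompt}]
    content += [
        {"type": "image_url", "image_url": {"url": as_data_url(p)}}
        for p in image_paths
    ]
    return content
```

The returned list can be passed as the `content` of a user message in `client.chat.completions.create(...)`, in place of the hand-written list in the example above.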
Throughput
On a single RTX 4090 with the recipe above:
| Metric | Value |
|---|---|
| Decode (single-stream) | ~225 tok/s |
| TTFT (small prompt) | ~0.1 s |
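The card does not say how these figures were measured. One common way to obtain comparable numbers is to stream a response, record the arrival time of each chunk, and derive TTFT and decode rate from the timestamps. The helper below is a sketch of that arithmetic only; `summarize_stream` is a hypothetical name, and streamed chunks do not always correspond one-to-one to tokens, so the rate is an approximation.

```python
def summarize_stream(request_start: float, chunk_times: list[float]) -> dict[str, float]:
    """Derive time-to-first-token and decode rate from chunk arrival timestamps."""
    if len(chunk_times) < 2:
        raise ValueError("need at least two chunks to compute a decode rate")
    ttft = chunk_times[0] - request_start
    # The decode rate excludes the first chunk, whose latency is dominated by prefill.
    decode_seconds = chunk_times[-1] - chunk_times[0]
    return {
        "ttft_s": ttft,
        "decode_chunks_per_s": (len(chunk_times) - 1) / decode_seconds,
    }


# Synthetic timestamps (seconds), for illustration only.
stats = summarize_stream(request_start=0.0, chunk_times=[0.5, 1.0, 1.5, 2.0, 2.5])
print(stats)
```

In practice the timestamps would come from calling `time.perf_counter()` once before the request and once per chunk while iterating over a response created with `stream=True`.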
Validation
Five end-to-end checks against an OpenAI-compatible vLLM endpoint serving this checkpoint (fp8 KV cache, 48K context):
| Check | Result |
|---|---|
| text coherence | pass — coherent answer to a knowledge question |
| structured JSON | pass — valid JSON with all expected keys |
| vision + OCR | pass — reads on-image text and names a drawn shape |
| tool calling | pass — emits a correct function call |
| 14K-token context | pass — coherent reply with full prompt context loaded |
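The exact prompts and pass criteria behind the table are not published. The sketch below shows one plausible way to automate the structured-JSON and tool-calling checks, assuming the assistant message has been converted to a plain dict in Chat Completions format. The function names, the tool name `get_weather`, and its arguments are hypothetical.

```python
import json


def has_expected_keys(reply_text: str, expected_keys: set[str]) -> bool:
    """True if the reply parses as a JSON object containing every expected key."""
    try:
        parsed = json.loads(reply_text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and expected_keys.issubset(parsed)


def is_expected_tool_call(message: dict, name: str, required_args: set[str]) -> bool:
    """True if the first tool call targets `name` and carries the required arguments."""
    calls = message.get("tool_calls") or []
    if not calls:
        return False
    function = calls[0]["function"]
    try:
        # In Chat Completions format, arguments arrive as a JSON-encoded string.
        arguments = json.loads(function["arguments"])
    except json.JSONDecodeError:
        return False
    return (
        function["name"] == name
        and isinstance(arguments, dict)
        and required_args.issubset(arguments)
    )


# Hypothetical assistant message, shaped like a Chat Completions tool call.
example_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Oslo"}'},
        }
    ],
}
print(is_expected_tool_call(example_message, "get_weather", {"city"}))
```

Checks like these confirm the shape of the output, not whether the content is right, which matches the "task-level competence" framing of the validation table.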
Limitations and accuracy
- Quantization introduces a small accuracy degradation compared to the bf16 base model. The checks above confirm task-level competence on common multimodal workloads (vision, structured output, tool calling, long context) but do not constitute a formal benchmark suite (MMLU, MMMU, etc.).
- The vision tower and MoE router are kept at full precision — image / video quality and routing behaviour should be unchanged.
- The optional `--kv-cache-dtype fp8` serve flag carries a small theoretical accuracy risk on very long contexts; the 14K-token bench check did not show degradation, but be cautious for >32K context workloads.
- Inherits all limitations and intended-use restrictions from the base model.
License
Inherits from the base model; see Qwen/Qwen3-VL-30B-A3B-Instruct for the authoritative terms (Apache 2.0 as of publication). Quantized weights are a derivative work, so verify that the base model's license applies to your intended use before commercial deployment.
Model provider: dark-side-of-the-code
Model tree
- Base: Qwen/Qwen3-VL-30B-A3B-Instruct
- Quantized: this model
Modalities
- Input: Text, Image
- Output: Text