Avesed
Qwen3.6-27B-W4A8
Why W4A8
W4A8 pairs int4 weights, which cut weight-memory bandwidth and speed up decode, with int8 activations, which run on int8 tensor cores and speed up prefill. That makes it the best serving quant on the NVIDIA Ampere line (A100 / RTX 3090).
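To make the int4-weight / int8-activation split concrete, here is a minimal pure-Python sketch of a W4A8 dot product. It is an illustration only, under simplifying assumptions: symmetric quantization with a single scale per vector, and unpacked integer codes. The helper names `quantize_symmetric` and `w4a8_dot` are hypothetical, and production kernels such as Marlin use packed weights, finer-grained scales, and fused tensor-core instructions instead.

```python
def quantize_symmetric(values, bits):
    """Symmetric quantization: map floats to signed integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for int4, 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def w4a8_dot(weights, activations):
    """Dot product with int4 weight codes and int8 activation codes."""
    w_codes, w_scale = quantize_symmetric(weights, bits=4)
    a_codes, a_scale = quantize_symmetric(activations, bits=8)
    # Integer multiply-accumulate; this is the part int8 tensor cores speed up.
    accumulator = sum(w * a for w, a in zip(w_codes, a_codes))
    # One floating-point rescale at the end recovers the real-valued result.
    return accumulator * w_scale * a_scale


# Made-up example values, chosen only to show the approximation error.
weights = [0.12, -0.40, 0.33, 0.05]
activations = [1.5, -2.0, 0.25, 3.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = w4a8_dot(weights, activations)
print(f"exact={exact:.4f} w4a8={approx:.4f} abs_error={abs(exact - approx):.4f}")
```

The weights only need 4 bits each in memory, which is where the decode-time bandwidth saving comes from, while the arithmetic itself stays in integers until the final rescale.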
Serving on Ampere (RTX 3090 / A100)
vLLM gates its W4A8 kernels to Hopper. On Ampere, the Marlin kernel can run W4A8-int8 but needs a small enablement patch: use vllm-ampere-optimized (prebuilt wheel + Docker image, or the standalone hot-patch). On Hopper, the model runs out of the box.
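As a rough sketch of what launching the server might look like once a suitable vLLM build is installed (the patched build on Ampere, stock vLLM on Hopper), the snippet below assembles a `vllm serve` command with tensor parallelism across two GPUs. The repository id `Avesed/Qwen3.6-27B-W4A8` is an assumption inferred from the provider and model names on this page, and `build_serve_command` is a hypothetical helper; adjust both to your setup.

```python
import shlex

# Assumption: the exact Hub repository id is not stated on this page.
MODEL_ID = "Avesed/Qwen3.6-27B-W4A8"


def build_serve_command(model_id, tensor_parallel_size):
    """Assemble the argument list for vLLM's OpenAI-compatible server."""
    return [
        "vllm",
        "serve",
        model_id,
        "--tensor-parallel-size",
        str(tensor_parallel_size),
    ]


# Two GPUs, one tensor-parallel shard per card.
command = build_serve_command(MODEL_ID, 2)
print(shlex.join(command))
```

The script only prints the command; run the printed line in a shell on the GPU host. Any further flags, such as a context-length limit, depend on your workload and are left out here.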
Throughput (2× RTX 3090, vLLM tp2, 1024-in / 1024-out)
| concurrency | output tok/s | median TTFT | median TPOT |
|---|---|---|---|
| 1 (single-user) | 46.8 | 0.84 s | 19.8 ms |
| 32 (saturated) | 416 | 14.4 s | 63.6 ms |
Peak VRAM is ~22.8 GiB per card. A single user gets ~47 tok/s with sub-second TTFT; at 32 concurrent requests the server saturates at ~416 tok/s aggregate.
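The three metrics in the table are related, and a quick sanity check can show how. The sketch below estimates output throughput from the median TTFT and TPOT, assuming every request generates 1024 tokens, all `concurrency` requests stay in flight for the whole run, and the medians are representative of a typical request. The function name `estimated_output_tok_per_s` is hypothetical and this is not how the benchmark itself computes throughput.

```python
def estimated_output_tok_per_s(concurrency, out_tokens, ttft_s, tpot_ms):
    """Estimate aggregate output tokens/s from per-request latency figures."""
    # First token arrives after TTFT; each later token takes one TPOT.
    request_seconds = ttft_s + (out_tokens - 1) * tpot_ms / 1000.0
    return concurrency * out_tokens / request_seconds


# (label, concurrency, median TTFT in s, median TPOT in ms, reported tok/s)
rows = [
    ("single-user", 1, 0.84, 19.8, 46.8),
    ("saturated", 32, 14.4, 63.6, 416.0),
]

for label, concurrency, ttft_s, tpot_ms, reported in rows:
    estimate = estimated_output_tok_per_s(concurrency, 1024, ttft_s, tpot_ms)
    print(f"{label}: estimated {estimate:.1f} tok/s, reported {reported} tok/s")
```

Under these assumptions both estimates land within a few percent of the reported figures, which suggests the table is internally consistent.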
Model provider: Avesed
Model tree: base Qwen/Qwen3.6-27B; quantized: this model
Modalities: input Video, Text, Image; output Text