Qwen3.6-27B-W4A8 API & Inference Endpoint

Why W4A8

W4A8 pairs int4 weights (lower memory bandwidth per token, so faster decode) with int8 tensor-core compute (faster prefill), which makes it the best serving quant on the NVIDIA Ampere line (A100 / RTX 3090).
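To put the weight-bandwidth argument in numbers, here is a rough sketch of the weight footprint at different precisions. It assumes "27B" means 27 billion parameters and ignores quantization scales, zero-points, and any layers left unquantized, so treat the results as a lower bound rather than a measured size.

```python
GIB = 1024 ** 3

# Assumption: "27B" in the model name is read as 27 billion parameters.
N_PARAMS = 27e9


def weight_footprint_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB.

    Ignores quantization scales, zero-points and unquantized layers.
    """
    return n_params * bits_per_weight / 8 / GIB


fp16_gib = weight_footprint_gib(N_PARAMS, 16)
int4_gib = weight_footprint_gib(N_PARAMS, 4)

print(f"fp16 weights: {fp16_gib:.1f} GiB")
print(f"int4 weights: {int4_gib:.1f} GiB ({fp16_gib / int4_gib:.0f}x fewer bytes read per decode step)")
```

Decode has to stream the weights for every generated token, so a 4x smaller footprint translates fairly directly into less memory traffic per token.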

Serving on Ampere (RTX 3090 / A100)

vLLM gates its W4A8 kernels to Hopper. On Ampere, the Marlin kernel can run W4A8-int8 but needs a small enablement patch: use vllm-ampere-optimized (available as a prebuilt wheel plus Docker image, or as a standalone hot-patch). On Hopper, it runs out of the box.
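As a minimal sketch of what a two-GPU launch might look like, the helper below assembles a `vllm serve` command with tensor parallelism set to 2, matching the tp2 benchmark setup. The model id `your-org/Qwen3.6-27B-W4A8` and the helper name `build_serve_command` are hypothetical placeholders, and on Ampere the command presumes the patched vLLM build described above is already installed.

```python
import shlex

# Hypothetical placeholder: substitute the real repository id or a local path.
MODEL = "your-org/Qwen3.6-27B-W4A8"


def build_serve_command(model: str, tensor_parallel_size: int = 2, port: int = 8000) -> list[str]:
    """Assemble the argv for an OpenAI-compatible vLLM server."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),  # one shard per GPU
        "--port", str(port),
    ]


cmd = build_serve_command(MODEL)
print(shlex.join(cmd))
# To actually launch (needs vLLM and two GPUs): subprocess.run(cmd, check=True)
```

Building the argv as a list keeps the launch scriptable; further options (context length, memory utilization) would be appended the same way if the deployment needs them.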

Throughput (2× RTX 3090, vLLM tp2, 1024-in / 1024-out)

concurrency	output tok/s	median TTFT	median TPOT
1 (single-user)	46.8	0.84 s	19.8 ms
32 (saturated)	416	14.4 s	63.6 ms

Peak VRAM is about 22.8 GiB per card. A single user gets about 47 tok/s with sub-second TTFT; aggregate throughput saturates at about 416 tok/s at concurrency 32.
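The three metrics in the table are related, and a quick sanity check shows they are roughly self-consistent. The sketch below assumes every request produces 1024 output tokens, that the median TTFT and TPOT are representative of a typical request, and that all concurrent requests stay active for the whole run; the function name is illustrative only.

```python
def estimated_output_tok_s(concurrency: int, ttft_s: float, tpot_s: float,
                           output_tokens: int = 1024) -> float:
    """Estimate aggregate output tok/s from TTFT and TPOT.

    One request takes TTFT for its first token plus TPOT for each
    of the remaining tokens; `concurrency` such requests overlap.
    """
    request_s = ttft_s + (output_tokens - 1) * tpot_s
    return concurrency * output_tokens / request_s


# (label, concurrency, median TTFT in s, median TPOT in s, reported tok/s)
rows = [
    ("1 (single-user)", 1, 0.84, 0.0198, 46.8),
    ("32 (saturated)", 32, 14.4, 0.0636, 416.0),
]

for label, conc, ttft, tpot, reported in rows:
    est = estimated_output_tok_s(conc, ttft, tpot)
    print(f"{label}: estimated {est:.1f} tok/s vs reported {reported} tok/s")
```

Under these assumptions the estimates come out near 48.5 and 412 tok/s, within a few percent of the reported 46.8 and 416; the gap is expected because medians are not means and scheduling overhead is ignored.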
