Avesed

Qwopus3.6-27B-v2-W4A8

README

Why W4A8

W4A8 pairs int4 weights, which cut the memory bandwidth needed per token (fast decode), with int8 tensor-core compute (fast prefill). The author presents it as the best serving quantization on the NVIDIA Ampere line (A100 / RTX 3090).
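To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch of how much weight data a dense model holds at 16-bit versus 4-bit precision. The 27e9 parameter count is an assumption read off the "27B" in the model name, and the estimate deliberately ignores quantization scales and zero-points, any layers left unquantized, the KV cache, and activations, so it is a lower bound on real memory use rather than a measurement.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Raw storage for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


# Assumption: nominal parameter count taken from the "27B" in the model name.
PARAMS = 27e9

bf16_gib = weight_bytes(PARAMS, 16) / GIB  # 16-bit baseline
int4_gib = weight_bytes(PARAMS, 4) / GIB   # int4 weights, as in W4A8

print(f"16-bit weights: {bf16_gib:.2f} GiB")
print(f"int4 weights:   {int4_gib:.2f} GiB")
print(f"reduction:      {bf16_gib / int4_gib:.1f}x")
```

Decode reads the weights once per generated token, so a 4x smaller weight footprint means roughly 4x less weight traffic per token under these assumptions.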

Serving on Ampere (RTX 3090 / A100)

vLLM restricts its W4A8 kernels to Hopper GPUs, where this model runs out of the box. On Ampere, the Marlin kernel can run W4A8-int8 but needs a small enablement patch: use vllm-ampere-optimized (prebuilt wheel and Docker image, or the standalone hot-patch).
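The Ampere versus Hopper distinction above can be checked from the GPU's CUDA compute capability. The following is a minimal sketch, not part of vLLM or vllm-ampere-optimized: the function name `w4a8_serving_path` and its return labels are hypothetical. It relies only on the published compute capabilities of the GPUs named in this README (A100 is 8.0, RTX 3090 is 8.6) and of Hopper (9.0), and it reports anything else as not covered here. On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns a tuple in the same `(major, minor)` form.

```python
def w4a8_serving_path(capability: tuple[int, int]) -> str:
    """Map a CUDA compute capability to the serving path described above.

    Hypothetical helper for illustration; the labels are not vLLM output.
    """
    major, minor = capability
    if major == 9:
        # Hopper (for example H100): W4A8 kernels are enabled in stock vLLM.
        return "stock vLLM"
    if major == 8 and minor in (0, 6):
        # Ampere: A100 is 8.0, RTX 3090 is 8.6. Needs the enablement patch.
        return "vllm-ampere-optimized"
    # The README makes no claim about other architectures.
    return "not covered by this README"


print(w4a8_serving_path((8, 6)))  # RTX 3090
print(w4a8_serving_path((9, 0)))  # Hopper
```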

Throughput

This model has the same dense Qwen3.6-27B architecture as its base, so its serving profile should match the numbers measured for Avesed/Qwen3.6-27B-W4A8 on 2× RTX 3090 (tp2): ~47 tok/s single-user with sub-second TTFT, ~416 tok/s saturated, and ~22.8 GiB/card peak memory.
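A few derived figures help when sizing a deployment from these numbers. The sketch below only does arithmetic on the values quoted above; the 500-token response length is an illustrative assumption, and the derived figures inherit whatever conditions the original measurements were taken under.

```python
# Figures quoted in this README for 2x RTX 3090 (tp2).
SINGLE_USER_TOK_S = 47.0
SATURATED_TOK_S = 416.0
NUM_GPUS = 2

# Average time between tokens for one user.
ms_per_token = 1000 / SINGLE_USER_TOK_S             # about 21.3 ms

# How much aggregate throughput batching adds over a single stream.
batching_gain = SATURATED_TOK_S / SINGLE_USER_TOK_S  # about 8.9x

# Aggregate throughput attributed evenly to each card.
per_gpu_saturated = SATURATED_TOK_S / NUM_GPUS       # 208 tok/s

# Assumption: a 500-token response, chosen only for illustration.
RESPONSE_TOKENS = 500
seconds_per_response = RESPONSE_TOKENS / SINGLE_USER_TOK_S  # about 10.6 s

print(f"{ms_per_token:.1f} ms/token, {batching_gain:.1f}x from batching")
print(f"{per_gpu_saturated:.0f} tok/s per GPU at saturation")
print(f"{seconds_per_response:.1f} s for {RESPONSE_TOKENS} tokens")
```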

Model provider

Avesed

Model tree

Base

Jackrong/Qwopus3.6-27B-v2

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Supported Functionality

Model APIs

Dedicated Endpoints

Container
