Avesed

Qwopus3.6-27B-v2-abliterated-int4-w4a8

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

Why W4A8 — best of both for serving

int4 weight bandwidth (fast decode) + int8 compute (fast prefill). Measured on 2×RTX 3090 (cudagraph):

Table
single-stream decodebatch-16 decodeprefill
W4A16 (int4)50.63871045 tok/s
W8A8 (int8)38.83431250 tok/s
W4A8 (this)50.43931229 tok/s

= W4A16's decode + W8A8's prefill, and small enough to keep CUDA graphs.

Running on Ampere (RTX 3090 / A100)

On Hopper it runs out of the box (Cutlass/Machete W4A8 kernels). On Ampere, vLLM's W4A8-int Marlin path is gated by a small config bug (vllm#38064); the #38066 fix (3 Python files, no recompile) enables the Ampere Marlin W4A8-int8 kernel — packaged at vllm-ampere-optimized.

Model provider

Avesed

Model tree

Base

Avesed/Qwopus3.6-27B-v2-abliterated

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today