Avesed

Qwopus3.6-27B-v2-abliterated-int4-w4a8

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

Why W4A8 — best of both for serving

int4 weight bandwidth (fast decode) + int8 compute (fast prefill). Measured on 2×RTX 3090 (cudagraph):

Table
	single-stream decode	batch-16 decode	prefill
W4A16 (int4)	50.6	387	1045 tok/s
W8A8 (int8)	38.8	343	1250 tok/s
W4A8 (this)	50.4	393	1229 tok/s

= W4A16's decode + W8A8's prefill, and small enough to keep CUDA graphs.

Running on Ampere (RTX 3090 / A100)

On Hopper it runs out of the box (Cutlass/Machete W4A8 kernels). On Ampere, vLLM's W4A8-int Marlin path is gated by a small config bug (vllm#38064); the #38066 fix (3 Python files, no recompile) enables the Ampere Marlin W4A8-int8 kernel — packaged at vllm-ampere-optimized.

Model provider

Avesed

Model tree

Base

Avesed/Qwopus3.6-27B-v2-abliterated

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text