Avesed
Qwopus3.6-27B-v2-abliterated-int4-w4a8
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
Why W4A8 — best of both for serving
int4 weight bandwidth (fast decode) + int8 compute (fast prefill). Measured on 2×RTX 3090 (cudagraph):
| single-stream decode | batch-16 decode | prefill | |
|---|---|---|---|
| W4A16 (int4) | 50.6 | 387 | 1045 tok/s |
| W8A8 (int8) | 38.8 | 343 | 1250 tok/s |
| W4A8 (this) | 50.4 | 393 | 1229 tok/s |
= W4A16's decode + W8A8's prefill, and small enough to keep CUDA graphs.
Running on Ampere (RTX 3090 / A100)
On Hopper it runs out of the box (Cutlass/Machete W4A8 kernels). On Ampere, vLLM's W4A8-int Marlin path is gated by a small config bug (vllm#38064); the #38066 fix (3 Python files, no recompile) enables the Ampere Marlin W4A8-int8 kernel — packaged at vllm-ampere-optimized.
Model provider
Avesed
Model tree
Base
Avesed/Qwopus3.6-27B-v2-abliterated
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information