redashes

QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: other

Model Description

This is the AWQ (Activation-aware Weight Quantization) 4-bit quantized version of SC117/QwenPaw-Flash-9B-heretic.

QwenPaw-Flash-9B-heretic is based on Qwen3.5-9B with a Hybrid Attention architecture:

  • 24 Linear Attention layers (Gated DeltaNet)
  • 8 Full Attention layers (traditional Softmax Attention)
  • 1 MTP (Multi-Token Prediction) Head
  • 27 Vision Encoder layers (multimodal)

After quantization, the model size is reduced from ~38GB (FP32) to 13GB (AWQ INT4), making it runnable on consumer GPUs with 20GB+ VRAM.

Quantization Details

Table
ParameterValue
Toolllmcompressor 0.12.1 + compressed-tensors 0.17.2
FormatW4A16 (symmetric int4)
Group Size128
AWQ Grid20
Calibrationwikitext-2-raw-v1 (128 samples)
Sequence Length2048
Inference Precisionbfloat16

Quantization Scope

Table
ComponentPrecisionNotes
MLP (layers 1-31) — gate/up/down projINT431 layers, ~4.68B params
Layer 0 (entire)BF16First layer kept at full precision
Linear Attention (24 layers)BF16Includes conv1d, in_proj_qkv, etc.
Full Attention (8 layers)BF16Q/K/V/O projections
Vision Encoder (27 layers)FP32Original precision preserved
MTP HeadBF16Speculative decoding preserved
Embed Tokens + LM HeadBF16Input/output embeddings

AWQ Smoothing

AWQ smoothing is applied only to MLP components:

  • post_attention_layernormmlp.gate_proj, mlp.up_proj

Inference Compatibility

Table
FrameworkStatus
SGLang ≥ 0.5.12✅ Tested and verified
vLLM❌ Not yet tested
HuggingFace Transformers✅ Supported

SGLang Launch Example

bash

sglang serve \
--trust-remote-code \
--model-path /path/to/QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP \
--host 0.0.0.0 --port 8001 \
--dtype auto \
--kv-cache-dtype fp8_e4m3 \
--mem-fraction-static 0.85

Python Load Example

python

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"redashes/QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP",
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
"redashes/QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP",
trust_remote_code=True,
)

Model Files

Table
FileSizeDescription
model.safetensors10 GBQuantized text backbone (INT4 + BF16)
visual_mtp.safetensors2.2 GBVision encoder (FP32) + MTP head (BF16)
model.safetensors.index.json76 KBWeight index

Memory Usage

Table
ComponentSize
Model weights~13 GB
KV Cache (fp8, 131K tokens)~2 GB
Mamba Cache~1 GB
Total~16 GB

Recommended GPU: 20GB+ VRAM (RTX 3080 20GB / RTX 3090 / A100).

Disclaimer

This model is a quantized version of the source model, without additional training or fine-tuning. Please comply with the source model's license agreement.

Model provider

redashes

Model tree

Base

SC117/QwenPaw-Flash-9B-heretic

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today