redashes/QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP API & Inference Endpoint

Model Description

This is the AWQ (Activation-aware Weight Quantization) 4-bit quantized version of SC117/QwenPaw-Flash-9B-heretic.

QwenPaw-Flash-9B-heretic is based on Qwen3.5-9B with a Hybrid Attention architecture:

24 Linear Attention layers (Gated DeltaNet)
8 Full Attention layers (traditional Softmax Attention)
1 MTP (Multi-Token Prediction) Head
27 Vision Encoder layers (multimodal)

After quantization, the model size is reduced from ~38GB (FP32) to 13GB (AWQ INT4), making it runnable on consumer GPUs with 20GB+ VRAM.

Quantization Details

Table
Parameter	Value
Tool	llmcompressor 0.12.1 + compressed-tensors 0.17.2
Format	W4A16 (symmetric int4)
Group Size	128
AWQ Grid	20
Calibration	wikitext-2-raw-v1 (128 samples)
Sequence Length	2048
Inference Precision	bfloat16

Quantization Scope

Table
Component	Precision	Notes
MLP (layers 1-31) — gate/up/down proj	INT4	31 layers, ~4.68B params
Layer 0 (entire)	BF16	First layer kept at full precision
Linear Attention (24 layers)	BF16	Includes conv1d, in_proj_qkv, etc.
Full Attention (8 layers)	BF16	Q/K/V/O projections
Vision Encoder (27 layers)	FP32	Original precision preserved
MTP Head	BF16	Speculative decoding preserved
Embed Tokens + LM Head	BF16	Input/output embeddings

AWQ Smoothing

AWQ smoothing is applied only to MLP components:

post_attention_layernorm → mlp.gate_proj, mlp.up_proj

Inference Compatibility

Table
Framework	Status
SGLang ≥ 0.5.12	✅ Tested and verified
vLLM	❌ Not yet tested
HuggingFace Transformers	✅ Supported

SGLang Launch Example

bash
sglang serve \
  --trust-remote-code \
  --model-path /path/to/QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP \
  --host 0.0.0.0 --port 8001 \
  --dtype auto \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.85

Python Load Example

python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "redashes/QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "redashes/QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP",
    trust_remote_code=True,
)

Model Files

Table
File	Size	Description
`model.safetensors`	10 GB	Quantized text backbone (INT4 + BF16)
`visual_mtp.safetensors`	2.2 GB	Vision encoder (FP32) + MTP head (BF16)
`model.safetensors.index.json`	76 KB	Weight index

Memory Usage

Table
Component	Size
Model weights	~13 GB
KV Cache (fp8, 131K tokens)	~2 GB
Mamba Cache	~1 GB
Total	~16 GB

Recommended GPU: 20GB+ VRAM (RTX 3080 20GB / RTX 3090 / A100).

Disclaimer

This model is a quantized version of the source model, without additional training or fine-tuning. Please comply with the source model's license agreement.

QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP

Get help setting up a custom Dedicated Endpoints.

README