redashes
QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: otherModel Description
This is the AWQ (Activation-aware Weight Quantization) 4-bit quantized version of SC117/QwenPaw-Flash-9B-heretic.
QwenPaw-Flash-9B-heretic is based on Qwen3.5-9B with a Hybrid Attention architecture:
- 24 Linear Attention layers (Gated DeltaNet)
- 8 Full Attention layers (traditional Softmax Attention)
- 1 MTP (Multi-Token Prediction) Head
- 27 Vision Encoder layers (multimodal)
After quantization, the model size is reduced from ~38GB (FP32) to 13GB (AWQ INT4), making it runnable on consumer GPUs with 20GB+ VRAM.
Quantization Details
| Parameter | Value |
|---|---|
| Tool | llmcompressor 0.12.1 + compressed-tensors 0.17.2 |
| Format | W4A16 (symmetric int4) |
| Group Size | 128 |
| AWQ Grid | 20 |
| Calibration | wikitext-2-raw-v1 (128 samples) |
| Sequence Length | 2048 |
| Inference Precision | bfloat16 |
Quantization Scope
| Component | Precision | Notes |
|---|---|---|
| MLP (layers 1-31) — gate/up/down proj | INT4 | 31 layers, ~4.68B params |
| Layer 0 (entire) | BF16 | First layer kept at full precision |
| Linear Attention (24 layers) | BF16 | Includes conv1d, in_proj_qkv, etc. |
| Full Attention (8 layers) | BF16 | Q/K/V/O projections |
| Vision Encoder (27 layers) | FP32 | Original precision preserved |
| MTP Head | BF16 | Speculative decoding preserved |
| Embed Tokens + LM Head | BF16 | Input/output embeddings |
AWQ Smoothing
AWQ smoothing is applied only to MLP components:
post_attention_layernorm→mlp.gate_proj,mlp.up_proj
Inference Compatibility
| Framework | Status |
|---|---|
| SGLang ≥ 0.5.12 | ✅ Tested and verified |
| vLLM | ❌ Not yet tested |
| HuggingFace Transformers | ✅ Supported |
SGLang Launch Example
bash
sglang serve \--trust-remote-code \--model-path /path/to/QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP \--host 0.0.0.0 --port 8001 \--dtype auto \--kv-cache-dtype fp8_e4m3 \--mem-fraction-static 0.85
Python Load Example
python
from transformers import AutoModelForCausalLM, AutoTokenizermodel = AutoModelForCausalLM.from_pretrained("redashes/QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP",device_map="auto",trust_remote_code=True,)tokenizer = AutoTokenizer.from_pretrained("redashes/QwenPaw-Flash-9B-heretic-AWQ-INT4-MTP",trust_remote_code=True,)
Model Files
| File | Size | Description |
|---|---|---|
model.safetensors | 10 GB | Quantized text backbone (INT4 + BF16) |
visual_mtp.safetensors | 2.2 GB | Vision encoder (FP32) + MTP head (BF16) |
model.safetensors.index.json | 76 KB | Weight index |
Memory Usage
| Component | Size |
|---|---|
| Model weights | ~13 GB |
| KV Cache (fp8, 131K tokens) | ~2 GB |
| Mamba Cache | ~1 GB |
| Total | ~16 GB |
Recommended GPU: 20GB+ VRAM (RTX 3080 20GB / RTX 3090 / A100).
Disclaimer
This model is a quantized version of the source model, without additional training or fine-tuning. Please comply with the source model's license agreement.
Model provider
redashes
Model tree
Base
SC117/QwenPaw-Flash-9B-heretic
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information