Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0- MTP acceptance rate: ~43%
- Speedup: ~1.5-1.9x decode throughput
For MTP-enabled GGUF inference, see the MTP GGUF repo below.
- Standard GGUF (no MTP): SC117/QwenPaw-Flash-9B-heretic-GGUF
- MTP GGUF (with Multi-Token Prediction head): SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF
model = AutoModelForCausalLM.from_pretrained( "SC117/QwenPaw-Flash-9B-heretic", torch_dtype=torch.float32, device_map="auto", ) tokenizer = AutoTokenizer.from_pretrained("SC117/QwenPaw-Flash-9B-heretic")
markdown
</div></div><div style="border: 1px solid #cbd5e1; border-radius: 12px; overflow: hidden; background: #ffffff; box-shadow: 0 2px 4px rgba(0,0,0,0.02); margin-bottom: 20px;"><div style="background: linear-gradient(135deg, #ff6b35 0%, #f7931e 100%); padding: 12px 16px; color: white; font-weight: 700; font-size: 14px; display: flex; align-items: center; gap: 8px;"><span>📄</span> License</div><div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">Same as base model (Qwen3.5-9B).</div></div>
Model provider
SC117
Model tree
Base
Qwen/Qwen3.5-9B
Fine-tuned
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information