XReyRobert
Qwopus3.6-27B-v2-GPTQ-Pro-MTP-BF16
License: other
What changed
- Parent GPTQ-Pro artifact: XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1
- Source for restored MTP tensors: Jackrong/Qwopus3.6-27B-v2
- Restored mtp.* tensors: 15
- Restored MTP dtype counts: {"bfloat16": 15}
- Local tensor shard size after patch: 18.22 GiB total safetensors
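The tensor and dtype counts above can be re-derived from the shard headers without loading any weights. The following is a minimal sketch, not the script used to build this artifact: the function names are hypothetical, and it relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header). Note that safetensors headers label the dtype as `BF16`, whereas the counts above use the torch-style name `bfloat16`.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def count_mtp_dtypes(shard_paths, prefix="mtp."):
    """Count dtypes of tensors whose name starts with `prefix` across shards."""
    counts = Counter()
    for path in shard_paths:
        header = read_safetensors_header(path)
        for name, info in header.items():
            if name == "__metadata__":  # optional free-form metadata entry
                continue
            if name.startswith(prefix):
                counts[info["dtype"]] += 1
    return dict(counts)
```

Calling `count_mtp_dtypes` on the local `*.safetensors` shard paths should report 15 tensors in total if the patch matches the figures listed above.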
Practical note
This artifact is meant for loader and vLLM speculative-decoding experiments. Previous testing on 1x RTX 3090 showed that restoring MTP made draft acceptance work, but did not improve throughput versus the non-MTP GPTQ-Pro baseline. The likely bottleneck was overhead in the vLLM/GPTQ-Marlin speculative path rather than MTP tensor precision.
For practical long-context 1x RTX 3090 serving, the non-MTP baseline remains the recommended artifact:
XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1
Validation status
Structural checks performed during patch creation:
- model.safetensors.index.json contains restored mtp.* keys.
- All indexed shard files exist locally.
- The added MTP shard is a standalone safetensors file.
- config.json records mtp_num_hidden_layers >= 1.
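The structural checks above could be reproduced with a short script along these lines. This is a sketch under stated assumptions, not the check that was actually run: `check_patched_repo` is a hypothetical helper, "standalone" is interpreted as "the shard holding mtp.* keys holds no other keys", and the fallback to a nested `text_config` block is a guess about where `mtp_num_hidden_layers` may live.

```python
import json
from pathlib import Path


def check_patched_repo(repo_dir, prefix="mtp."):
    """Run structural checks on a local checkpoint directory; returns a dict."""
    repo = Path(repo_dir)
    index = json.loads((repo / "model.safetensors.index.json").read_text())
    weight_map = index["weight_map"]  # tensor name -> shard file name

    # Check 1: the index lists restored mtp.* keys.
    mtp_keys = [k for k in weight_map if k.startswith(prefix)]

    # Check 2: every shard named in the index exists locally.
    missing = sorted({s for s in weight_map.values() if not (repo / s).is_file()})

    # Check 3 (assumed meaning of "standalone"): shards that hold mtp.* keys
    # hold nothing else.
    mtp_shards = {weight_map[k] for k in mtp_keys}
    standalone = bool(mtp_shards) and all(
        k.startswith(prefix) for k, s in weight_map.items() if s in mtp_shards
    )

    # Check 4: config records mtp_num_hidden_layers >= 1. Top level first,
    # then a nested text_config block (assumption).
    config = json.loads((repo / "config.json").read_text())
    mtp_layers = config.get("mtp_num_hidden_layers")
    if mtp_layers is None:
        mtp_layers = config.get("text_config", {}).get("mtp_num_hidden_layers", 0)

    return {
        "mtp_key_count": len(mtp_keys),
        "missing_shards": missing,
        "mtp_shard_standalone": standalone,
        "mtp_layers_ok": mtp_layers >= 1,
    }
```

These checks only inspect file layout and metadata; they say nothing about whether a runtime actually loads or uses the MTP tensors.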
Runtime serving validation is still required before treating this as a working MTP deployment artifact.
References
- Source model: Jackrong/Qwopus3.6-27B-v2
- Parent GPTQ-Pro artifact: XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1
Model provider
XReyRobert
Model tree
Base
Jackrong/Qwopus3.6-27B-v2
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text