XReyRobert

Qwopus3.6-27B-v2-GPTQ-Pro-MTP-BF16


License: other

What changed

Parent GPTQ-Pro artifact: XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1
Source for restored MTP tensors: Jackrong/Qwopus3.6-27B-v2
Restored mtp.* tensors: 15
Restored MTP dtype counts: {"bfloat16": 15}
Total size of local safetensors shards after the patch: 18.22 GiB
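The tensor and dtype counts above can be re-derived from the shard headers alone, without loading any weights. The following is a minimal sketch, assuming the shards sit in one local directory; `count_mtp_dtypes` and `read_safetensors_header` are illustrative helper names, not part of any library. It relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header). Note that the header spells dtypes in safetensors style (for example `BF16`), whereas the count reported above uses the framework-style name `bfloat16`.

```python
import json
import struct
from collections import Counter
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def count_mtp_dtypes(model_dir, prefix="mtp."):
    """Count tensors whose name starts with `prefix`, grouped by header dtype."""
    counts = Counter()
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        header = read_safetensors_header(shard)
        for name, meta in header.items():
            if name == "__metadata__":
                continue  # optional free-form metadata entry, not a tensor
            if name.startswith(prefix):
                counts[meta["dtype"]] += 1
    return dict(counts)
```

Calling `count_mtp_dtypes("path/to/local/model")` (a placeholder path) returns a mapping from dtype string to tensor count that can be compared with the figures listed above.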

Practical note

This artifact is intended for loader and vLLM speculative-decoding experiments. Previous testing on 1x RTX 3090 showed that restoring MTP made draft acceptance work, but throughput did not improve over the non-MTP GPTQ-Pro baseline. The likely bottleneck was overhead in the vLLM/GPTQ-Marlin speculative path rather than MTP tensor precision.
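Why working draft acceptance need not translate into higher throughput can be illustrated with a back-of-the-envelope model. The sketch below is an assumption-laden simplification, not a description of vLLM internals: it treats each draft token as accepted independently with a fixed probability, and it folds all speculative-path overhead into a single `relative_step_cost` factor (cost of one speculative step divided by the cost of one plain decode step). The function name and the numbers in the usage note are illustrative, not measurements from this model.

```python
def speculative_speedup(acceptance_rate, relative_step_cost, num_draft_tokens=1):
    """Estimate throughput relative to plain decoding (1.0 means no change).

    Assumes each draft token is accepted independently with probability
    `acceptance_rate`, and that a draft token only counts if every earlier
    draft token in the same step was also accepted.
    """
    # Expected tokens emitted per verification step: 1 + a + a^2 + ... + a^k.
    expected_tokens = sum(
        acceptance_rate ** i for i in range(num_draft_tokens + 1)
    )
    return expected_tokens / relative_step_cost
```

Under this model, with one draft token and a hypothetical acceptance rate of 0.7, a speculative step that costs 1.7 times a plain step yields no net gain, and any higher overhead makes speculation slower than the baseline. This is consistent with the observation above that acceptance alone did not improve throughput.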

For practical long-context 1x RTX 3090 serving, the non-MTP baseline remains the recommended artifact:

XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1

Validation status

Structural checks performed during patch creation:

model.safetensors.index.json contains restored mtp.* keys.
All indexed shard files exist locally.
The added MTP shard is a standalone safetensors file.
config.json records mtp_num_hidden_layers >= 1.
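Most of the structural checks listed above can be reproduced with a short script. This is a minimal sketch under stated assumptions, not the procedure actually used during patch creation: `check_mtp_patch` and `find_key` are hypothetical helper names, the index is assumed to follow the usual Hugging Face layout with a `weight_map` entry, and `mtp_num_hidden_layers` is searched for anywhere in `config.json` because its nesting level is not stated here. The "standalone safetensors file" check is not covered.

```python
import json
from pathlib import Path


def find_key(obj, key):
    """Recursively yield every value stored under `key` in nested dicts/lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)


def check_mtp_patch(model_dir):
    """Return a dict mapping each structural check to True or False."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    weight_map = index["weight_map"]  # tensor name -> shard filename
    config = json.loads((root / "config.json").read_text())

    mtp_keys = [name for name in weight_map if name.startswith("mtp.")]
    shards = set(weight_map.values())
    layer_counts = [
        v for v in find_key(config, "mtp_num_hidden_layers") if isinstance(v, int)
    ]

    return {
        "index_has_mtp_keys": len(mtp_keys) > 0,
        "all_shards_exist": all((root / s).is_file() for s in shards),
        "config_mtp_layers_ge_1": any(v >= 1 for v in layer_counts),
    }
```

A result in which every value is `True` only confirms the files are structurally consistent; it says nothing about whether a serving runtime will load or use the MTP head.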

Runtime serving validation is still required before treating this as a working MTP deployment artifact.

References

Source model: Jackrong/Qwopus3.6-27B-v2
Parent GPTQ-Pro artifact: XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1


Model tree

Base: Jackrong/Qwopus3.6-27B-v2
Quantized: this model

Modalities

Input: Video, Text, Image
Output: Text
