XReyRobert
Qwopus3.6-27B-Coder-GPTQ-Pro
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Source And Credits
Source model:
Quantization tooling and reference recipe:
Thanks to Jackrong for the Qwopus3.6 models and to groxaxo for GPTQ-Pro and the Qwen3.6 GPTQ-Pro recipe this run was aligned with.
Artifact Summary
| Field | Value |
|---|---|
| Source model | Jackrong/Qwopus3.6-27B-Coder |
| Architecture | Qwen3_5ForConditionalGeneration |
| Model type | qwen3_5 |
| Tensor files | 6 |
| Safetensors size | 17.63 GiB |
| Indexed tensors | 2423 |
Quantized qweight tensors | 408 |
mtp.* tensors in index | true |
| vision/visual tensors in index | true |
| Index metadata size matches shards | true |
This upload includes an MTP-aware GPTQ patch shard:
model-mtp-aware-gptq.safetensorsMTP_AWARE_GPTQ_PATCH.json
That means the artifact has MTP tensors present and quantized MTP linears, but it does not yet mean speculative decoding is a recommended serving mode. See the MTP status notes below.
Quantization Recipe
| Setting | Value |
|---|---|
| Method | GPTQ-Pro / GPTQModel |
| Quantizer | gptqmodel:6.1.0-dev |
| Bits | 4 |
| Group size | 128 |
| Symmetric quantization | true |
| Desc act | false |
| True sequential | true |
| Calibration dataset | WikiText |
| Calibration samples | 256 |
| Calibration sequence length | 2048 |
| MSE | 2.0 |
| Damp percent | 0.05 |
| Damp auto increment | 0.01 |
| FOEM alpha | 0.25 |
| FOEM beta | 0.2 |
| FOEM device | auto |
| Dense VRAM strategy | exclusive |
| MoE VRAM strategy | exclusive |
| Disk offload | true |
| Pack implementation | cpu |
MTP-aware patch metadata:
| Field | Value |
|---|---|
| Patch type | mtp-aware-gptq-pro-core |
| MTP bits | 4 |
| MTP group size | 128 |
| MTP calibration samples | 256 |
| MTP calibration length | 2048 |
| Quantized MTP key count | 32 |
Quantized MTP modules:
mtp.fcmtp.layers.0.self_attn.q_projmtp.layers.0.self_attn.k_projmtp.layers.0.self_attn.v_projmtp.layers.0.self_attn.o_projmtp.layers.0.mlp.gate_projmtp.layers.0.mlp.up_projmtp.layers.0.mlp.down_proj
Intended Serving Shape
This checkpoint is intended for text-only vLLM serving as a local coding-agent model.
Recommended starting point:
bash
vllm serve XReyRobert/Qwopus3.6-27B-Coder-GPTQ-Pro \--served-model-name qwopus3.6-27b-coder-gptq-pro \--language-model-only \--dtype float16 \--quantization gptq_marlin \--tensor-parallel-size 1 \--max-model-len 131072 \--max-num-seqs 1 \--kv-cache-dtype fp8_e5m2 \--reasoning-parser qwen3 \--enable-auto-tool-choice \--tool-call-parser qwen3_coder \--enable-prefix-caching \--gpu-memory-utilization 0.95 \--trust-remote-code
For initial production-style testing, keep speculative decoding off until you have validated MTP behavior with your exact vLLM version and workload.
Validation And Benchmarks
Completed artifact checks:
- Local shard index inspection completed before upload.
- Remote file list verified after upload.
- Remote
model.safetensors.index.jsonverified after upload. - Index metadata total size matches the local safetensor shards.
- The remote artifact contains the expected safetensor shards.
Terminal-Bench 2.0 Smoke24 result and associated vLLM serving measurements.
This Smoke24 run used max_model_len=131072 for apples-to-apples comparison
with the other local models in this publication batch:
| Run | Score | Success rate | Wall-time | Output tokens | Observed decode | LLM API time |
|---|---|---|---|---|---|---|
qwopus3.6-27b-coder-gptq-pro-foem-4bit-g128-ns256 | 16/24 | 66.7% | 218.8m | 202.2k | 38.9 tok/s | 86.7m |
Smoke24 is a fixed 24-task Terminal-Bench 2.0 comparison corpus, not a full Terminal-Bench leaderboard run.
In this local harness, the coder artifact:
- tied
Qwopus3.6-27B-v2-GPTQ-Pro-v1on solved tasks at16/24; - had the fastest wall time among the compared local runs at
218.8m; - emitted the fewest output tokens among the compared local runs at
202.2k; - had the lowest LLM API time among the 16/24 Smoke24 runs in this local batch.
Task list and harness shape:
MTP And Vision Status
- The artifact contains
mtp.*tensors. - The MTP large linears listed above were quantized with an MTP-aware GPTQ-Pro core capture path.
- MTP speculative decoding is not yet published as the recommended serving mode for this artifact; validate it separately before relying on it.
- Vision/visual tensors are present because of the source checkpoint structure, but this release is positioned and validated as text-only.
Limitations
- Experimental quantization.
- Terminal-Bench Smoke24 is a small local comparison corpus, not a full benchmark submission.
- The coder Smoke24 result is assembled from a smoke12 run plus a missing12 complement run over the same fixed 24-task corpus.
- MTP tensors are present, but speculative decoding is not yet a supported recommendation.
- Vision tensors are present, but vision behavior has not been validated.
- Loader behavior may vary across vLLM, Transformers, GPTQModel, and GPTQ-Marlin versions.
Files
Key files:
model.safetensors.index.jsonmodel-00001-of-00005.safetensorsthroughmodel-00005-of-00005.safetensorsmodel-mtp-aware-gptq.safetensorsMTP_AWARE_GPTQ_PATCH.jsonconfig.jsonquantize_config.jsonprocessor_config.jsontokenizer.jsonUPLOAD_MANIFEST.json
UPLOAD_MANIFEST.json records the upload guardrail checks and artifact
inspection summary.
References
- Source model:
Jackrong/Qwopus3.6-27B-Coder - GPTQ-Pro tooling:
groxaxo/GPTQ-Pro - Reference recipe:
groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit - Terminal-Bench:
laude-institute/terminal-bench
Individual Project Notice
This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.
Model provider
XReyRobert
Model tree
Base
Jackrong/Qwopus3.6-27B-Coder
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information