XReyRobert/Qwopus3.6-27B-Coder-GPTQ-Pro API & Inference Endpoint

Source And Credits

Source model:

Jackrong/Qwopus3.6-27B-Coder

Quantization tooling and reference recipe:

Thanks to Jackrong for the Qwopus3.6 models and to groxaxo for GPTQ-Pro and the Qwen3.6 GPTQ-Pro recipe this run was aligned with.

Artifact Summary

Table
Field	Value
Source model	`Jackrong/Qwopus3.6-27B-Coder`
Architecture	`Qwen3_5ForConditionalGeneration`
Model type	`qwen3_5`
Tensor files	`6`
Safetensors size	`17.63 GiB`
Indexed tensors	`2423`
Quantized `qweight` tensors	`408`
`mtp.*` tensors in index	`true`
vision/visual tensors in index	`true`
Index metadata size matches shards	`true`

This upload includes an MTP-aware GPTQ patch shard:

model-mtp-aware-gptq.safetensors
MTP_AWARE_GPTQ_PATCH.json

That means the artifact has MTP tensors present and quantized MTP linears, but it does not yet mean speculative decoding is a recommended serving mode. See the MTP status notes below.

Quantization Recipe

Table
Setting	Value
Method	GPTQ-Pro / GPTQModel
Quantizer	`gptqmodel:6.1.0-dev`
Bits	`4`
Group size	`128`
Symmetric quantization	`true`
Desc act	`false`
True sequential	`true`
Calibration dataset	WikiText
Calibration samples	`256`
Calibration sequence length	`2048`
MSE	`2.0`
Damp percent	`0.05`
Damp auto increment	`0.01`
FOEM alpha	`0.25`
FOEM beta	`0.2`
FOEM device	`auto`
Dense VRAM strategy	`exclusive`
MoE VRAM strategy	`exclusive`
Disk offload	`true`
Pack implementation	`cpu`

MTP-aware patch metadata:

Table
Field	Value
Patch type	`mtp-aware-gptq-pro-core`
MTP bits	`4`
MTP group size	`128`
MTP calibration samples	`256`
MTP calibration length	`2048`
Quantized MTP key count	`32`

Quantized MTP modules:

mtp.fc
mtp.layers.0.self_attn.q_proj
mtp.layers.0.self_attn.k_proj
mtp.layers.0.self_attn.v_proj
mtp.layers.0.self_attn.o_proj
mtp.layers.0.mlp.gate_proj
mtp.layers.0.mlp.up_proj
mtp.layers.0.mlp.down_proj

Intended Serving Shape

This checkpoint is intended for text-only vLLM serving as a local coding-agent model.

Recommended starting point:

bash
vllm serve XReyRobert/Qwopus3.6-27B-Coder-GPTQ-Pro \
  --served-model-name qwopus3.6-27b-coder-gptq-pro \
  --language-model-only \
  --dtype float16 \
  --quantization gptq_marlin \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8_e5m2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code

For initial production-style testing, keep speculative decoding off until you have validated MTP behavior with your exact vLLM version and workload.

Validation And Benchmarks

Completed artifact checks:

Local shard index inspection completed before upload.
Remote file list verified after upload.
Remote model.safetensors.index.json verified after upload.
Index metadata total size matches the local safetensor shards.
The remote artifact contains the expected safetensor shards.

Terminal-Bench 2.0 Smoke24 result and associated vLLM serving measurements. This Smoke24 run used max_model_len=131072 for apples-to-apples comparison with the other local models in this publication batch:

Table
Run	Score	Success rate	Wall-time	Output tokens	Observed decode	LLM API time
`qwopus3.6-27b-coder-gptq-pro-foem-4bit-g128-ns256`	`16/24`	`66.7%`	`218.8m`	`202.2k`	`38.9 tok/s`	`86.7m`

Smoke24 is a fixed 24-task Terminal-Bench 2.0 comparison corpus, not a full Terminal-Bench leaderboard run.

In this local harness, the coder artifact:

tied Qwopus3.6-27B-v2-GPTQ-Pro-v1 on solved tasks at 16/24;
had the fastest wall time among the compared local runs at 218.8m;
emitted the fewest output tokens among the compared local runs at 202.2k;
had the lowest LLM API time among the 16/24 Smoke24 runs in this local batch.

Task list and harness shape:

benchmarks/terminal-bench-2.0/smoke24_task_list_20260616.md

MTP And Vision Status

The artifact contains mtp.* tensors.
The MTP large linears listed above were quantized with an MTP-aware GPTQ-Pro core capture path.
MTP speculative decoding is not yet published as the recommended serving mode for this artifact; validate it separately before relying on it.
Vision/visual tensors are present because of the source checkpoint structure, but this release is positioned and validated as text-only.

Limitations

Experimental quantization.
Terminal-Bench Smoke24 is a small local comparison corpus, not a full benchmark submission.
The coder Smoke24 result is assembled from a smoke12 run plus a missing12 complement run over the same fixed 24-task corpus.
MTP tensors are present, but speculative decoding is not yet a supported recommendation.
Vision tensors are present, but vision behavior has not been validated.
Loader behavior may vary across vLLM, Transformers, GPTQModel, and GPTQ-Marlin versions.

Files

Key files:

model.safetensors.index.json
model-00001-of-00005.safetensors through model-00005-of-00005.safetensors
model-mtp-aware-gptq.safetensors
MTP_AWARE_GPTQ_PATCH.json
config.json
quantize_config.json
processor_config.json
tokenizer.json
UPLOAD_MANIFEST.json

UPLOAD_MANIFEST.json records the upload guardrail checks and artifact inspection summary.

References

Source model: Jackrong/Qwopus3.6-27B-Coder
GPTQ-Pro tooling: groxaxo/GPTQ-Pro
Reference recipe: groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit
Terminal-Bench: laude-institute/terminal-bench

Individual Project Notice

This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.

Qwopus3.6-27B-Coder-GPTQ-Pro

Get help setting up a custom Dedicated Endpoints.

README