XReyRobert

Qwopus3.6-27B-Coder-GPTQ-Pro

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Source And Credits

Source model:

Quantization tooling and reference recipe:

Thanks to Jackrong for the Qwopus3.6 models and to groxaxo for GPTQ-Pro and the Qwen3.6 GPTQ-Pro recipe this run was aligned with.

Artifact Summary

Table
FieldValue
Source modelJackrong/Qwopus3.6-27B-Coder
ArchitectureQwen3_5ForConditionalGeneration
Model typeqwen3_5
Tensor files6
Safetensors size17.63 GiB
Indexed tensors2423
Quantized qweight tensors408
mtp.* tensors in indextrue
vision/visual tensors in indextrue
Index metadata size matches shardstrue

This upload includes an MTP-aware GPTQ patch shard:

  • model-mtp-aware-gptq.safetensors
  • MTP_AWARE_GPTQ_PATCH.json

That means the artifact has MTP tensors present and quantized MTP linears, but it does not yet mean speculative decoding is a recommended serving mode. See the MTP status notes below.

Quantization Recipe

Table
SettingValue
MethodGPTQ-Pro / GPTQModel
Quantizergptqmodel:6.1.0-dev
Bits4
Group size128
Symmetric quantizationtrue
Desc actfalse
True sequentialtrue
Calibration datasetWikiText
Calibration samples256
Calibration sequence length2048
MSE2.0
Damp percent0.05
Damp auto increment0.01
FOEM alpha0.25
FOEM beta0.2
FOEM deviceauto
Dense VRAM strategyexclusive
MoE VRAM strategyexclusive
Disk offloadtrue
Pack implementationcpu

MTP-aware patch metadata:

Table
FieldValue
Patch typemtp-aware-gptq-pro-core
MTP bits4
MTP group size128
MTP calibration samples256
MTP calibration length2048
Quantized MTP key count32

Quantized MTP modules:

  • mtp.fc
  • mtp.layers.0.self_attn.q_proj
  • mtp.layers.0.self_attn.k_proj
  • mtp.layers.0.self_attn.v_proj
  • mtp.layers.0.self_attn.o_proj
  • mtp.layers.0.mlp.gate_proj
  • mtp.layers.0.mlp.up_proj
  • mtp.layers.0.mlp.down_proj

Intended Serving Shape

This checkpoint is intended for text-only vLLM serving as a local coding-agent model.

Recommended starting point:

bash

vllm serve XReyRobert/Qwopus3.6-27B-Coder-GPTQ-Pro \
--served-model-name qwopus3.6-27b-coder-gptq-pro \
--language-model-only \
--dtype float16 \
--quantization gptq_marlin \
--tensor-parallel-size 1 \
--max-model-len 131072 \
--max-num-seqs 1 \
--kv-cache-dtype fp8_e5m2 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--trust-remote-code

For initial production-style testing, keep speculative decoding off until you have validated MTP behavior with your exact vLLM version and workload.

Validation And Benchmarks

Completed artifact checks:

  • Local shard index inspection completed before upload.
  • Remote file list verified after upload.
  • Remote model.safetensors.index.json verified after upload.
  • Index metadata total size matches the local safetensor shards.
  • The remote artifact contains the expected safetensor shards.

Terminal-Bench 2.0 Smoke24 result and associated vLLM serving measurements. This Smoke24 run used max_model_len=131072 for apples-to-apples comparison with the other local models in this publication batch:

Table
RunScoreSuccess rateWall-timeOutput tokensObserved decodeLLM API time
qwopus3.6-27b-coder-gptq-pro-foem-4bit-g128-ns25616/2466.7%218.8m202.2k38.9 tok/s86.7m

Smoke24 is a fixed 24-task Terminal-Bench 2.0 comparison corpus, not a full Terminal-Bench leaderboard run.

In this local harness, the coder artifact:

  • tied Qwopus3.6-27B-v2-GPTQ-Pro-v1 on solved tasks at 16/24;
  • had the fastest wall time among the compared local runs at 218.8m;
  • emitted the fewest output tokens among the compared local runs at 202.2k;
  • had the lowest LLM API time among the 16/24 Smoke24 runs in this local batch.

Task list and harness shape:

MTP And Vision Status

  • The artifact contains mtp.* tensors.
  • The MTP large linears listed above were quantized with an MTP-aware GPTQ-Pro core capture path.
  • MTP speculative decoding is not yet published as the recommended serving mode for this artifact; validate it separately before relying on it.
  • Vision/visual tensors are present because of the source checkpoint structure, but this release is positioned and validated as text-only.

Limitations

  • Experimental quantization.
  • Terminal-Bench Smoke24 is a small local comparison corpus, not a full benchmark submission.
  • The coder Smoke24 result is assembled from a smoke12 run plus a missing12 complement run over the same fixed 24-task corpus.
  • MTP tensors are present, but speculative decoding is not yet a supported recommendation.
  • Vision tensors are present, but vision behavior has not been validated.
  • Loader behavior may vary across vLLM, Transformers, GPTQModel, and GPTQ-Marlin versions.

Files

Key files:

  • model.safetensors.index.json
  • model-00001-of-00005.safetensors through model-00005-of-00005.safetensors
  • model-mtp-aware-gptq.safetensors
  • MTP_AWARE_GPTQ_PATCH.json
  • config.json
  • quantize_config.json
  • processor_config.json
  • tokenizer.json
  • UPLOAD_MANIFEST.json

UPLOAD_MANIFEST.json records the upload guardrail checks and artifact inspection summary.

References

Individual Project Notice

This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.

Model provider

XReyRobert

Model tree

Base

Jackrong/Qwopus3.6-27B-Coder

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today