XReyRobert

Qwopus3.6-27B-v2-GPTQ-Pro-v1

README

License: apache-2.0

Source and credits

Source model:

Jackrong/Qwopus3.6-27B-v2

Quantization methodology and reference recipe:

Thanks to Jackrong for the original Qwopus3.6 model, and to groxaxo for GPTQ-Pro and the Qwen3.6 GPTQ-Pro recipe this quantization was aligned with.

Quantization recipe

Table with columns: Setting, Value
Setting	Value
Method	GPTQ-Pro / GPTQModel
Bits	`4`
Group size	`128`
Symmetric quantization	`true`
Desc act	`false`
True sequential	`true`
Calibration dataset	WikiText-2 raw train
Calibration samples	`256`
Sequence length	`2048`
MSE	`2.0`
Damp percent	`0.05`
Damp auto increment	`0.01`
FOEM alpha	`0.25`
FOEM beta	`0.2`
Batch size	`1`

Preserved modules include vision, lm_head, embeddings, and norms.

Validation showed that this artifact preserves MTP-related configuration metadata, but does not include actual mtp.* tensors in model.safetensors.index.json, so this release should be treated as non-MTP for vLLM speculative decoding.

Post-save compatibility patch:

pad_token_id=248055
tokenizer class patched to Qwen2TokenizerFast when needed for vLLM compatibility

Intended serving setup

This checkpoint is intended for text-only vLLM serving on RTX 3090-class hardware.

Recommended vLLM options:

bash
vllm serve XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-FOEM-4bit-g128-ns256-v2 \
  --served-model-name qwopus3.6-27b-v2-gptq-pro-foem-4bit-g128-ns256-v2 \
  --language-model-only \
  --dtype float16 \
  --quantization gptq_marlin \
  --disable-custom-all-reduce \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8_e5m2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --max-cudagraph-capture-size 32 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code

Serving context for the published vLLM measurements:

The vLLM numbers were collected on an internal llm-residency deployment. The custom image recipe is not published yet, so this card does not present that image as a public reproduction target. The stable serving knobs captured from the run are listed for context.

Table with columns: Field, Value
Field	Value
Nomad job profile	`vllm-qwopus36-base`
Served model name	`qwopus3.6-27b-v2-gptq-pro-foem-4bit-g128-ns256-v2`
Critical flags	`--dtype float16`, `--quantization gptq_marlin`, `--kv-cache-dtype fp8_e5m2`, `--reasoning-parser qwen3`, `--tool-call-parser qwen3_coder`, `--max-model-len 131072`; `--max-num-batched-tokens` was not set explicitly.

Public vLLM Reproducibility

This artifact has a public reproducibility path on the unmodified upstream vLLM OpenAI image:

image: docker.io/vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
vLLM version observed in validation: 0.20.1rc1.dev16+g7a1eb8ac2
GPU class: single RTX 3090 24 GB / Ampere
--enforce-eager was not used
no local sleep/wake patch or localhost/*sleepwake* image is required for the validation below

Validated serving shape:

context: --max-model-len 131072
--language-model-only, --dtype float16, --quantization gptq_marlin
--kv-cache-dtype fp8_e5m2, --enable-prefix-caching, --max-num-seqs 1
--max-cudagraph-capture-size 32, --gpu-memory-utilization 0.95
--reasoning-parser qwen3, --tool-call-parser qwen3_coder
--enable-sleep-mode was included in the validation command

Startup note: this dense 27B profile can fail the first cold start after torch compile/profiling with a pessimistic KV-cache check. The public validation passed on the second start when reusing persistent vLLM/Nomad-style cache directories such as TORCHINDUCTOR_CACHE_DIR=/data/vllm-qwopus36-base/torch_compile_cache and VLLM_CACHE_ROOT=/data/vllm-qwopus36-base. Treat startup retry plus persistent compile cache as part of the serving recipe.

Reasoning / thinking mode

This model preserves Qwen3-style reasoning behavior. The validation workload below was run with thinking enabled.

MTP / speculative decoding status

This ns256-v2 artifact should be considered text-only and non-MTP for vLLM speculative decoding as published. config.json advertises mtp_num_hidden_layers=1, but the weight index does not contain source mtp.* tensors. Enabling vLLM MTP against this unpatched artifact produced essentially zero accepted draft tokens and poor throughput.

A separate experimental follow-up artifact restores real MTP tensors and quantizes the large MTP linears:

text
XReyRobert/Qwopus3.6-27B-v2-MTP-GPTQ-Pro-v1

That MTP-GPTQ artifact works and reaches good draft acceptance, but it was still slower than this non-MTP baseline on a single RTX 3090. For practical 100k-131k serving on 1x RTX 3090, this ns256-v2 non-MTP artifact remains the preferred choice.

RTX 3090 validation status

This checkpoint was validated on an RTX 3090 24GB with vLLM, max_model_len=131072, kv_cache_dtype=fp8_e5m2, prefix caching enabled, and thinking enabled.

Observed vLLM multi-turn agent workload metrics:

Table with columns: Metric, Observed value, Notes
Metric	Observed value	Notes
Requests observed	`15`	Multi-turn agent session calls
vLLM request success count	`15/15`	No vLLM errors observed during the sample
Average prompt size	`33,172` tokens	Real multi-turn workload
Average output size	`322` tokens	Real generated responses

These are practical multi-turn serving metrics, not a synthetic benchmark. They are useful for RTX 3090-class long-context serving expectations, especially multi-turn usage with prefix caching.

📊 4. Evaluation & Benchmarks

Paired comparison on the same 350 question IDs is positive but should be treated cautiously because the subset is small and the local result is not a single-pass uniform run. Against Qwopus3.6-27B-v2, this run has 20 local-only correct answers and 9 Qwopus-only correct answers (+3.14 pp, McNemar p≈0.061). Against Qwen3.6-27B-v2, it has 32 local-only correct answers and 12 Qwen-only correct answers (+5.71 pp, McNemar p≈0.0037).

Compatibility notes

This artifact was built and validated for text-only vLLM serving without speculative decoding. Do not enable MTP on this artifact as published; the mtp.* tensors are absent from the weight index. Vision-related modules were not validated for vision use in this release.

Limitations

Experimental quantization.
MTP/speculative decoding is not supported by this published artifact because mtp.* tensors are missing.
Quality has been checked on Jackrong's 350-question MMLU-Pro subset only; this is not a full MMLU-Pro evaluation or an official leaderboard submission.
The subset result uses the official MMLU-Pro answer extractor, but the prompt style is the MMLU-Pro run_gpt4o.py OpenAI-compatible template, not the current local/vLLM template.
RTX 3090 metrics above are observed workload numbers, not a controlled benchmark suite.
Long-context and tool-calling workflows were validated on the described local vLLM/Hermes setup; behavior may vary on other serving stacks, hardware, or generation settings.

References

Source model: Jackrong/Qwopus3.6-27B-v2
GPTQ-Pro tooling: groxaxo/GPTQ-Pro
Reference GPTQ-Pro recipe: groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit
MMLU-Pro benchmark repository: TIGER-AI-Lab/MMLU-Pro
MMLU-Pro HF Space / leaderboard: TIGER-Lab/MMLU-Pro

Individual project notice

This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider