Source and credits
Source model:
Quantization methodology and reference recipe:
Thanks to Jackrong for the original Qwopus3.6 model, and to groxaxo for GPTQ-Pro and the Qwen3.6 GPTQ-Pro recipe this quantization was aligned with.
Quantization recipe
Table with columns: Setting, Value| Setting | Value |
|---|
| Method | GPTQ-Pro / GPTQModel |
| Bits | 4 |
| Group size | 128 |
| Symmetric quantization | true |
| Desc act | false |
| True sequential | true |
| Calibration dataset | WikiText-2 raw train |
| Calibration samples | 256 |
| Sequence length | 2048 |
| MSE | 2.0 |
| Damp percent | 0.05 |
| Damp auto increment | 0.01 |
| FOEM alpha | 0.25 |
| FOEM beta | 0.2 |
| Batch size | 1 |
Preserved modules include vision, lm_head, embeddings, and norms.
Validation showed that this artifact preserves MTP-related configuration metadata, but does not include actual mtp.* tensors in model.safetensors.index.json, so this release should be treated as non-MTP for vLLM speculative decoding.
Post-save compatibility patch:
pad_token_id=248055
- tokenizer class patched to
Qwen2TokenizerFast when needed for vLLM compatibility
Intended serving setup
This checkpoint is intended for text-only vLLM serving on RTX 3090-class hardware.
Recommended vLLM options:
vllm serve XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-FOEM-4bit-g128-ns256-v2 \
--served-model-name qwopus3.6-27b-v2-gptq-pro-foem-4bit-g128-ns256-v2 \
--language-model-only \
--dtype float16 \
--quantization gptq_marlin \
--disable-custom-all-reduce \
--tensor-parallel-size 1 \
--max-model-len 131072 \
--max-num-seqs 1 \
--kv-cache-dtype fp8_e5m2 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--enable-prefix-caching \
--max-cudagraph-capture-size 32 \
--gpu-memory-utilization 0.95 \
--trust-remote-code
Reasoning / thinking mode
This model preserves Qwen3-style reasoning behavior. The validation workload below was run with thinking enabled.
MTP / speculative decoding status
This ns256-v2 artifact should be considered text-only and non-MTP for vLLM speculative decoding as published. config.json advertises mtp_num_hidden_layers=1, but the weight index does not contain source mtp.* tensors. Enabling vLLM MTP against this unpatched artifact produced essentially zero accepted draft tokens and poor throughput.
A separate experimental follow-up artifact restores real MTP tensors and quantizes the large MTP linears:
XReyRobert/Qwopus3.6-27B-v2-MTP-GPTQ-Pro-v1
That MTP-GPTQ artifact works and reaches good draft acceptance, but it was still slower than this non-MTP baseline on a single RTX 3090. For practical 100k-131k serving on 1x RTX 3090, this ns256-v2 non-MTP artifact remains the preferred choice.
RTX 3090 validation status
This checkpoint was validated on an RTX 3090 24GB with vLLM, max_model_len=131072, kv_cache_dtype=fp8_e5m2, prefix caching enabled, and thinking enabled.
Observed vLLM multi-turn agent workload metrics:
Table with columns: Metric, Observed value, Notes| Metric | Observed value | Notes |
|---|
| Requests observed | 15 | Multi-turn agent session calls |
| vLLM request success count | 15/15 | No vLLM errors observed during the sample |
| Average prompt size | 33,172 tokens | Real multi-turn workload |
| Average output size | 322 tokens | Real generated responses |
These are practical multi-turn serving metrics, not a synthetic benchmark. They are useful for RTX 3090-class long-context serving expectations, especially multi-turn usage with prefix caching.
📊 4. Evaluation & Benchmarks
Paired comparison on the same 350 question IDs is positive but should be treated cautiously because the subset is small and the local result is not a single-pass uniform run. Against Qwopus3.6-27B-v2, this run has 20 local-only correct answers and 9 Qwopus-only correct answers (+3.14 pp, McNemar p≈0.061). Against Qwen3.6-27B-v2, it has 32 local-only correct answers and 12 Qwen-only correct answers (+5.71 pp, McNemar p≈0.0037).
Compatibility notes
This artifact was built and validated for text-only vLLM serving without speculative decoding. Do not enable MTP on this artifact as published; the mtp.* tensors are absent from the weight index. Vision-related modules were not validated for vision use in this release.
Limitations
- Experimental quantization.
- MTP/speculative decoding is not supported by this published artifact because
mtp.* tensors are missing.
- Quality has been checked on Jackrong's 350-question MMLU-Pro subset only; this is not a full MMLU-Pro evaluation or an official leaderboard submission.
- The subset result uses the official MMLU-Pro answer extractor, but the prompt style is the MMLU-Pro
run_gpt4o.py OpenAI-compatible template, not the current local/vLLM template.
- RTX 3090 metrics above are observed workload numbers, not a controlled benchmark suite.
- Long-context and tool-calling workflows were validated on the described local vLLM/Hermes setup; behavior may vary on other serving stacks, hardware, or generation settings.
References
Individual project notice
This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.