Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Source and credits
Source model:
Quantization methodology and reference recipe:
Thanks to Jackrong for the original Qwopus3.6 model, and to groxaxo for GPTQ-Pro and the Qwen3.6 GPTQ-Pro recipe this quantization was aligned with.
Quantization recipe
| Setting | Value |
|---|---|
| Method | GPTQ-Pro / GPTQModel |
| Bits | 4 |
| Group size | 128 |
| Symmetric quantization | true |
| Desc act | false |
| True sequential | true |
| Calibration dataset | WikiText-2 raw train |
| Calibration samples | 256 |
| Sequence length | 2048 |
| MSE | 2.0 |
| Damp percent | 0.05 |
| Damp auto increment | 0.01 |
| FOEM alpha | 0.25 |
| FOEM beta | 0.2 |
| Batch size | 1 |
Preserved modules include vision, lm_head, embeddings, and norms.
Validation showed that this artifact preserves MTP-related configuration metadata, but does not include actual mtp.* tensors in model.safetensors.index.json, so this release should be treated as non-MTP for vLLM speculative decoding.
Post-save compatibility patch:
pad_token_id=248055- tokenizer class patched to
Qwen2TokenizerFastwhen needed for vLLM compatibility
Intended serving setup
This checkpoint is intended for text-only vLLM serving on RTX 3090-class hardware.
Recommended vLLM options:
bash
vllm serve XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-FOEM-4bit-g128-ns256-v2 \--served-model-name qwopus3.6-27b-v2-gptq-pro-foem-4bit-g128-ns256-v2 \--language-model-only \--dtype float16 \--quantization gptq_marlin \--disable-custom-all-reduce \--tensor-parallel-size 1 \--max-model-len 131072 \--max-num-seqs 1 \--kv-cache-dtype fp8_e5m2 \--reasoning-parser qwen3 \--enable-auto-tool-choice \--tool-call-parser qwen3_coder \--enable-prefix-caching \--max-cudagraph-capture-size 32 \--gpu-memory-utilization 0.95 \--trust-remote-code
Reasoning / thinking mode
This model preserves Qwen3-style reasoning behavior. The validation workload below was run with thinking enabled.
MTP / speculative decoding status
This ns256-v2 artifact should be considered text-only and non-MTP for vLLM speculative decoding as published. config.json advertises mtp_num_hidden_layers=1, but the weight index does not contain source mtp.* tensors. Enabling vLLM MTP against this unpatched artifact produced essentially zero accepted draft tokens and poor throughput.
A separate experimental follow-up artifact restores real MTP tensors and quantizes the large MTP linears:
text
XReyRobert/Qwopus3.6-27B-v2-MTP-GPTQ-Pro-v1
That MTP-GPTQ artifact works and reaches good draft acceptance, but it was still slower than this non-MTP baseline on a single RTX 3090. For practical 100k-131k serving on 1x RTX 3090, this ns256-v2 non-MTP artifact remains the preferred choice.
RTX 3090 validation status
This checkpoint was validated on an RTX 3090 24GB with vLLM, max_model_len=131072, kv_cache_dtype=fp8_e5m2, prefix caching enabled, and thinking enabled.
Observed vLLM multi-turn agent workload metrics:
| Metric | Observed value | Notes |
|---|---|---|
| Requests observed | 15 | Multi-turn agent session calls |
| vLLM request success count | 15/15 | No vLLM errors observed during the sample |
| Average prompt size | 33,172 tokens | Real multi-turn workload |
| Average output size | 322 tokens | Real generated responses |
| Average time to first token | 5.70s | Prometheus TTFT summary |
| Average end-to-end request latency | 13.07s | Includes prefill, decode, and serving overhead |
| Average time per output token | 0.0230s/token | vLLM TPOT summary |
| Decode throughput from TPOT | about 43.5 tok/s | Decode-only estimate |
| Prefix cache hit ratio | 83.2% cumulative | vLLM prefix-cache counters |
| Live 60s prompt throughput | about 1,917 prompt tok/s | Aggregate observed window |
| Live 60s generation throughput | about 19.1 generated tok/s | Aggregate over full window, including prefill and idle mix |
| Live 60s prefix-cache hit ratio | 78.9% | Delta over the observed window |
These are practical multi-turn serving metrics, not a synthetic benchmark. They are useful for RTX 3090-class long-context serving expectations, especially multi-turn usage with prefix caching.
📊 4. Evaluation & Benchmarks
Compatibility notes
This artifact was built and validated for text-only vLLM serving without speculative decoding. Do not enable MTP on this artifact as published; the mtp.* tensors are absent from the weight index. Vision-related modules were not validated for vision use in this release.
Limitations
- Experimental quantization.
- MTP/speculative decoding is not supported by this published artifact because
mtp.*tensors are missing. - Quality has been checked on Jackrong's 350-question MMLU-Pro selected subset only; this is not a full MMLU-Pro evaluation or an official leaderboard submission.
- The single-pass unrestricted run used no explicit
max_tokensand exposed one pathological long generation; bounded output limits are recommended for practical serving. - RTX 3090 metrics above are observed workload numbers, not a controlled benchmark suite.
- Long-context and tool-calling workflows were validated on the described local vLLM/Hermes setup; behavior may vary on other serving stacks, hardware, or generation settings.
References
- Source model: Jackrong/Qwopus3.6-27B-v2
- GPTQ-Pro tooling: groxaxo/GPTQ-Pro
- Reference GPTQ-Pro recipe: groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit
- MMLU-Pro benchmark repository: TIGER-AI-Lab/MMLU-Pro
- MMLU-Pro HF Space / leaderboard: TIGER-Lab/MMLU-Pro
Individual project notice
This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.
Model provider
XReyRobert
Model tree
Base
Jackrong/Qwopus3.6-27B-v2
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information