XReyRobert
Qwopus3.6-35B-A3B-v1-GPTQ-Pro
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Source And Credits
Source model:
Quantization tooling and reference recipe:
Thanks to Jackrong for the Qwopus3.6 model and to groxaxo for the GPTQ-Pro tooling and Qwen3.6 GPTQ-Pro recipe this run was aligned with.
Artifact Summary
| Field | Value |
|---|---|
| Source model | Jackrong/Qwopus3.6-35B-A3B-v1 |
| Architecture | Qwen3_5MoeForConditionalGeneration |
| Model type | qwen3_5_moe |
| Tensor files | 6 |
| Safetensors size | 20.81 GiB |
| Indexed tensors | 124595 |
Quantized qweight tensors | 30970 |
mtp.* tensors in index | true |
| vision/visual tensors in index | true |
| Index metadata size matches shards | true |
The artifact contains source MTP and vision/visual tensors in its weight index. That does not mean MTP speculative decoding or multimodal serving is already recommended. The validated use so far is text-oriented GPTQ serving and Terminal-Bench agent evaluation.
Quantization Recipe
| Setting | Value |
|---|---|
| Method | GPTQ-Pro / GPTQModel |
| Bits | 4 |
| Group size | 128 |
| Symmetric quantization | true |
| Desc act | false |
| True sequential | true |
| Calibration dataset | WikiText |
| Calibration samples | 256 |
| Calibration sequence length | 2048 |
| MSE | 2.0 |
| Damp percent | 0.05 |
| Damp auto increment | 0.01 |
| FOEM alpha | 0.25 |
| FOEM beta | 0.2 |
| FOEM device | cuda:0 |
| MoE routing | ExpertsRoutingBypass |
| MoE bypass batch size | 192 |
| Pack implementation | cpu |
Dynamic skip rules preserved these module families instead of quantizing them:
embed_tokenslm_headmtpnormvisionvisual
In practical terms, the language tower linear layers are the intended GPTQ-Pro payload, while embeddings, norms, MTP, and vision-related tensors remain preserved as non-quantized tensors.
Intended Serving Shape
This checkpoint is intended for advanced users testing text-only vLLM or GPTQ-compatible serving for Qwen/Qwopus MoE checkpoints.
A starting vLLM shape for text-only testing:
bash
vllm serve XReyRobert/Qwopus3.6-35B-A3B-v1-GPTQ-Pro \--served-model-name qwopus3.6-35b-a3b-v1-gptq-pro \--language-model-only \--dtype float16 \--quantization gptq_marlin \--tensor-parallel-size 1 \--max-model-len 262144 \--max-num-seqs 1 \--kv-cache-dtype fp8_e5m2 \--reasoning-parser qwen3 \--enable-auto-tool-choice \--tool-call-parser qwen3_coder \--enable-prefix-caching \--gpu-memory-utilization 0.95 \--trust-remote-code
Treat the command as a serving starting point, not a compatibility guarantee for every vLLM release. GPTQ-Marlin, Qwen3.6 MoE handling, and multimodal processor behavior are all loader-version sensitive.
The RTX 3090 image above reflects separate 262k-context serving validation.
Validation And Benchmarks
Completed artifact checks:
- Local shard index inspection completed before upload.
- Remote file list verified after upload.
- Remote
model.safetensors.index.jsonverified after upload. - Index metadata total size matches the local safetensor shards.
- The remote artifact contains the expected six safetensor shards.
Terminal-Bench 2.0 Smoke24 result and associated vLLM serving measurements.
This Smoke24 run used max_model_len=131072 for apples-to-apples comparison
with the other local models in this publication batch:
| Run | Score | Success rate | Wall-time | Output tokens | Observed decode | LLM API time |
|---|---|---|---|---|---|---|
qwopus3.6-35b-a3b-v1-gptq-pro-foem-4bit-g128-ns256 | 12/24 | 50.0% | 226.7m | 622.8k | 138.6 tok/s | 74.9m |
Smoke24 is a fixed 24-task Terminal-Bench 2.0 comparison corpus, not a full Terminal-Bench leaderboard run. The score above is useful for fast regression and local serving comparison, not for broad model ranking.
Task list and harness shape:
MTP And Vision Status
config.jsonadvertises MTP support, and the index containsmtp.*tensors.- MTP tensors were preserved, not the primary quantization target for this release.
- MTP speculative decoding has not yet been validated as a recommended path for this artifact.
- Vision/visual tensors are present, but multimodal serving has not yet been validated for this quantized artifact.
For now, publish and use this as a text-first GPTQ-Pro MoE artifact.
Limitations
- Experimental quantization.
- Terminal-Bench Smoke24 is a small local comparison corpus, not a full benchmark submission.
- MTP speculative decoding is not yet a supported recommendation for this artifact.
- Vision tensors are preserved, but vision behavior has not been validated.
- Loader behavior may vary across vLLM, Transformers, GPTQModel, and GPTQ-Marlin versions.
Files
Key files:
model.safetensors.index.jsonmodel-00001-of-00006.safetensorsthroughmodel-00006-of-00006.safetensorsconfig.jsonquantize_config.jsonprocessor_config.jsontokenizer.jsonUPLOAD_MANIFEST.json
UPLOAD_MANIFEST.json records the upload guardrail checks and artifact
inspection summary.
References
- Source model:
Jackrong/Qwopus3.6-35B-A3B-v1 - GPTQ-Pro tooling:
groxaxo/GPTQ-Pro - Reference recipe:
groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit - Terminal-Bench:
laude-institute/terminal-bench
Individual Project Notice
This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.
Model provider
XReyRobert
Model tree
Base
Jackrong/Qwopus3.6-35B-A3B-v1
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information