Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Source and credits

Source model:

Quantization methodology and reference recipe:

Thanks to Jackrong for the original Qwopus3.6 model, and to groxaxo for GPTQ-Pro and the Qwen3.6 GPTQ-Pro recipe this quantization was aligned with.

Quantization recipe

SettingValue
MethodGPTQ-Pro / GPTQModel
Bits4
Group size128
Symmetric quantizationtrue
Desc actfalse
True sequentialtrue
Calibration datasetWikiText-2 raw train
Calibration samples256
Sequence length2048
MSE2.0
Damp percent0.05
Damp auto increment0.01
FOEM alpha0.25
FOEM beta0.2
Batch size1

Preserved modules include vision, lm_head, embeddings, and norms.

Validation showed that this artifact preserves MTP-related configuration metadata, but does not include actual mtp.* tensors in model.safetensors.index.json, so this release should be treated as non-MTP for vLLM speculative decoding.

Post-save compatibility patch:

  • pad_token_id=248055
  • tokenizer class patched to Qwen2TokenizerFast when needed for vLLM compatibility

Intended serving setup

This checkpoint is intended for text-only vLLM serving on RTX 3090-class hardware.

Recommended vLLM options:

bash

vllm serve XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-FOEM-4bit-g128-ns256-v2 \
--served-model-name qwopus3.6-27b-v2-gptq-pro-foem-4bit-g128-ns256-v2 \
--language-model-only \
--dtype float16 \
--quantization gptq_marlin \
--disable-custom-all-reduce \
--tensor-parallel-size 1 \
--max-model-len 131072 \
--max-num-seqs 1 \
--kv-cache-dtype fp8_e5m2 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--enable-prefix-caching \
--max-cudagraph-capture-size 32 \
--gpu-memory-utilization 0.95 \
--trust-remote-code

Reasoning / thinking mode

This model preserves Qwen3-style reasoning behavior. The validation workload below was run with thinking enabled.

MTP / speculative decoding status

This ns256-v2 artifact should be considered text-only and non-MTP for vLLM speculative decoding as published. config.json advertises mtp_num_hidden_layers=1, but the weight index does not contain source mtp.* tensors. Enabling vLLM MTP against this unpatched artifact produced essentially zero accepted draft tokens and poor throughput.

A separate experimental follow-up artifact restores real MTP tensors and quantizes the large MTP linears:

text

XReyRobert/Qwopus3.6-27B-v2-MTP-GPTQ-Pro-v1

That MTP-GPTQ artifact works and reaches good draft acceptance, but it was still slower than this non-MTP baseline on a single RTX 3090. For practical 100k-131k serving on 1x RTX 3090, this ns256-v2 non-MTP artifact remains the preferred choice.

RTX 3090 validation status

This checkpoint was validated on an RTX 3090 24GB with vLLM, max_model_len=131072, kv_cache_dtype=fp8_e5m2, prefix caching enabled, and thinking enabled.

Observed vLLM multi-turn agent workload metrics:

MetricObserved valueNotes
Requests observed15Multi-turn agent session calls
vLLM request success count15/15No vLLM errors observed during the sample
Average prompt size33,172 tokensReal multi-turn workload
Average output size322 tokensReal generated responses
Average time to first token5.70sPrometheus TTFT summary
Average end-to-end request latency13.07sIncludes prefill, decode, and serving overhead
Average time per output token0.0230s/tokenvLLM TPOT summary
Decode throughput from TPOTabout 43.5 tok/sDecode-only estimate
Prefix cache hit ratio83.2% cumulativevLLM prefix-cache counters
Live 60s prompt throughputabout 1,917 prompt tok/sAggregate observed window
Live 60s generation throughputabout 19.1 generated tok/sAggregate over full window, including prefill and idle mix
Live 60s prefix-cache hit ratio78.9%Delta over the observed window

These are practical multi-turn serving metrics, not a synthetic benchmark. They are useful for RTX 3090-class long-context serving expectations, especially multi-turn usage with prefix caching.

📊 4. Evaluation & Benchmarks

Compatibility notes

This artifact was built and validated for text-only vLLM serving without speculative decoding. Do not enable MTP on this artifact as published; the mtp.* tensors are absent from the weight index. Vision-related modules were not validated for vision use in this release.

Limitations

  • Experimental quantization.
  • MTP/speculative decoding is not supported by this published artifact because mtp.* tensors are missing.
  • Quality has been checked on Jackrong's 350-question MMLU-Pro selected subset only; this is not a full MMLU-Pro evaluation or an official leaderboard submission.
  • The single-pass unrestricted run used no explicit max_tokens and exposed one pathological long generation; bounded output limits are recommended for practical serving.
  • RTX 3090 metrics above are observed workload numbers, not a controlled benchmark suite.
  • Long-context and tool-calling workflows were validated on the described local vLLM/Hermes setup; behavior may vary on other serving stacks, hardware, or generation settings.

References

Individual project notice

This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.

Model provider

XReyRobert

Model tree

Base

Jackrong/Qwopus3.6-27B-v2

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today