sakamakismile
Qwen3.6-27B-MTP-pi-tune-NVFP4
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Architecture
Qwen3_5ForConditionalGeneration (model_type qwen3_5), dense 27.8 B:
- Hybrid attention — 48 Gated-DeltaNet (linear) + 16 full-attention layers (64 total,
full_attention_interval=4), hidden 5120, partial RoPE 0.25, 262 K native context. - Vision — Qwen3-VL ViT (depth 27, 1152→5120), kept bf16; serve text-only with
--limit-mm-per-prompt. - Native MTP (
mtp_num_hidden_layers=1), kept bf16 → drives vLLM speculative decoding. - Thinking-by-default reasoning model (
<think>…</think>, use--reasoning-parser qwen3).
Quantization recipe
markdown
QuantizationModifier(targets="Linear", scheme="NVFP4", # W4A4, group_size 16ignore=["lm_head", "re:.*visual.*", "re:.*conv1d.*", "re:.*mtp.*"])
- Vision tower, DeltaNet causal
conv1d,lm_head, and the entire MTP head stay bf16; everything else is NVFP4 W4A4. 32 calibration samples (neuralmagic/calibration), seq 8192. transformersdropsmtp.*on load (_keys_to_ignore_on_load_unexpected), so the 15 bf16 MTP tensors are grafted back intomodel-mtp-bf16.safetensorspost-quantization and spliced into the safetensors index.- Note for re-bakers: the grafted MTP modules must also be added to
quantization_config.ignore, otherwise vLLM matchesmtp.*_projagainsttargets=["Linear"], expects NVFP4 scales that do not exist, and loads theQwen3_5MTPdraft as garbage → 0 % spec-decode acceptance. With the fix, acceptance is ~74 %.
Serving (vLLM ≥ 0.22)
bash
vllm serve sakamakismile/Qwen3.6-27B-MTP-pi-tune-NVFP4 \--tensor-parallel-size 4 --max-model-len 131072 \--max-num-seqs 16 --gpu-memory-utilization 0.90 --kv-cache-dtype fp8 \--reasoning-parser qwen3 --limit-mm-per-prompt '{"image":0,"video":0}' \--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
On NVLink-less boxes add NCCL_P2P_DISABLE=1 + --disable-custom-all-reduce. Drop --speculative-config for plain decode. The hybrid model's KV is light (only the 16 full-attention layers cache), so full 128 K context fits even at TP=2.
Benchmarks
Measured on 7× RTX PRO 2000 Blackwell (16 GB), vLLM 0.22, KV fp8, --max-model-len 131072, 512-token greedy generations. single = 1 stream; aggr = total tokens/s across N concurrent streams.
| Config | single (c1) | aggr c2 | aggr c4 | aggr c8 |
|---|---|---|---|---|
| TP=4, no MTP | 49.9 t/s | 93.1 | 184.3 | 340.6 |
| TP=4, MTP n=3 | 85.2 t/s¹ | – | 237.7 | – |
| TP=2, no MTP | 29.0 t/s | 55.7 | 108.2 | 201.1 |
| TP=2, MTP n=3 | 50.0 t/s | – | – | – |
¹ generic ignore_eos text; real code generation reaches ~107 t/s (MTP acceptance is content-dependent; measured ~74 %).
128 K context fits at both tensor-parallel sizes: TP=4 → KV pool 1,025,977 tokens (7.83× concurrency at full 128 K); TP=2 → KV pool 198,867 tokens (1.52× at full 128 K — comfortable at normal context lengths). MTP roughly 1.7× single-stream (TP=2: 29→50; TP=4 on code: ~49→~107).
Quality
A 13-agent adversarial verification panel rated it ship / 8 of 10, no quality collapse under W4A4:
- Python coding 10/10, arithmetic reasoning 10/10, defensive-security helpfulness 9/10.
- A
pandasSMA-crossover backtest signal was lookahead-safe (.shift(1), NaN-warmup flat) — i.e. it does not use same-bar/future data to set the tradeable position.
Caveats
- Reasoning model → set
max_tokens≥ 4096 (prefer 8192+). At 2048 it can spend the whole budget inside<think>and return empty content. - Do not produce a W4A16 / NVFP4A16 variant — it fails to serve on vLLM 0.22 (
gptq_marlin_repack: size_n=24 not divisible by tile_n_size=64; the 24 attention-heads / DeltaNet odd dims violate Marlin's tile constraint). W4A4 avoids Marlin (NVFP4 cutlass/FlashInfer path). - Bare-format outputs may carry leading whitespace;
.strip()if you parse for an exact string.
License & attribution
Apache-2.0, inherited from the base models. Quantization by sakamakismile (Lna-Lab). Sibling reference: Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP (ModelOpt-produced, same architecture).
Model provider
sakamakismile
Model tree
Base
Qwen/Qwen3.6-27B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information