sakamakismile

Qwen3.6-27B-MTP-pi-tune-NVFP4


License: apache-2.0

Architecture

Qwen3_5ForConditionalGeneration (model_type qwen3_5), dense 27.8 B:

Hybrid attention — 48 Gated-DeltaNet (linear) + 16 full-attention layers (64 total, full_attention_interval=4), hidden 5120, partial RoPE 0.25, 262 K native context.
Vision — Qwen3-VL ViT (depth 27, 1152→5120), kept bf16; serve text-only with --limit-mm-per-prompt.
Native MTP (mtp_num_hidden_layers=1), kept bf16 → drives vLLM speculative decoding.
Thinking-by-default reasoning model (<think>…</think>, use --reasoning-parser qwen3).
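
The 48/16 split follows directly from the interval. A minimal sketch, assuming every fourth layer (counting from 1) is the full-attention one; that placement is an assumption made for illustration and is not stated in this card, and `layer_types` is a hypothetical helper:

```python
def layer_types(num_layers: int = 64, full_attention_interval: int = 4) -> list[str]:
    """Return one label per layer of the hybrid stack.

    Assumption: layer i (0-indexed) is full attention when (i + 1) is a
    multiple of the interval; every other layer is Gated-DeltaNet (linear).
    """
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


layout = layer_types()
print(layout.count("linear_attention"), layout.count("full_attention"))
```

Only the layers labelled `full_attention` keep a growing KV cache, which is why long contexts stay cheap on this architecture.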

Quantization recipe

python
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(targets="Linear", scheme="NVFP4",  # W4A4, group_size 16
  ignore=["lm_head", "re:.*visual.*", "re:.*conv1d.*", "re:.*mtp.*"])

Vision tower, DeltaNet causal conv1d, lm_head, and the entire MTP head stay bf16; everything else is NVFP4 W4A4. 32 calibration samples (neuralmagic/calibration), seq 8192.
transformers drops mtp.* on load (_keys_to_ignore_on_load_unexpected), so after quantization the 15 bf16 MTP tensors are written to a separate shard, model-mtp-bf16.safetensors, and spliced into the safetensors index.
Note for re-bakers: the grafted MTP modules must also be added to quantization_config.ignore, otherwise vLLM matches mtp.*_proj against targets=["Linear"], expects NVFP4 scales that do not exist, and loads the Qwen3_5MTP draft as garbage → 0 % spec-decode acceptance. With the fix, acceptance is ~74 %.
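
Both post-quantization steps can be checked with plain Python before serving. The sketch below is an approximation, not the loader's actual code: `is_ignored` mimics the `re:` prefix convention of the ignore list, `splice_into_index` edits the `weight_map` of a safetensors index, and the module, tensor, and shard names other than model-mtp-bf16.safetensors are hypothetical.

```python
import re


def is_ignored(module_name: str, ignore: list[str]) -> bool:
    """Approximate ignore matching: 're:' entries are regexes, others exact names."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


def splice_into_index(index: dict, tensor_names: list[str], shard: str) -> dict:
    """Return a copy of a safetensors index with tensor_names mapped to shard."""
    weight_map = dict(index.get("weight_map", {}))
    for name in tensor_names:
        weight_map[name] = shard
    return {**index, "weight_map": weight_map}


IGNORE = ["lm_head", "re:.*visual.*", "re:.*conv1d.*", "re:.*mtp.*"]

# Hypothetical module names, for illustration only.
print(is_ignored("mtp.layers.0.self_attn.q_proj", IGNORE))  # MTP head: must stay bf16
print(is_ignored("model.layers.0.mlp.down_proj", IGNORE))   # regular Linear: quantized

index = {"metadata": {}, "weight_map": {"lm_head.weight": "model-00001.safetensors"}}
patched = splice_into_index(index, ["mtp.fc.weight"], "model-mtp-bf16.safetensors")
print(patched["weight_map"]["mtp.fc.weight"])
```

If any grafted MTP module name fails the `is_ignored` check against the ignore list in quantization_config, that is the configuration in which the draft head loads as garbage.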

Serving (vLLM ≥ 0.22)

bash
vllm serve sakamakismile/Qwen3.6-27B-MTP-pi-tune-NVFP4 \
  --tensor-parallel-size 4 --max-model-len 131072 \
  --max-num-seqs 16 --gpu-memory-utilization 0.90 --kv-cache-dtype fp8 \
  --reasoning-parser qwen3 --limit-mm-per-prompt '{"image":0,"video":0}' \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

On boxes without NVLink, set NCCL_P2P_DISABLE=1 and pass --disable-custom-all-reduce. Drop --speculative-config for plain decode. The hybrid model's KV cache is light (only the 16 full-attention layers cache keys and values), so the full 128 K context fits even at TP=2.
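
The variants described above (with or without MTP, with or without NVLink) can be generated from one helper. A minimal sketch using only the flags shown in this section; `build_serve_command` is a hypothetical name, and the function only builds the command without launching anything:

```python
import json

MODEL = "sakamakismile/Qwen3.6-27B-MTP-pi-tune-NVFP4"


def build_serve_command(tp: int = 4, use_mtp: bool = True, nvlink: bool = True):
    """Return (extra_env, argv) for the vllm serve invocation described above."""
    argv = [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", str(tp),
        "--max-model-len", "131072",
        "--max-num-seqs", "16",
        "--gpu-memory-utilization", "0.90",
        "--kv-cache-dtype", "fp8",
        "--reasoning-parser", "qwen3",
        "--limit-mm-per-prompt", json.dumps({"image": 0, "video": 0}),
    ]
    if use_mtp:
        # Drop this flag pair for plain (non-speculative) decode.
        spec = {"method": "qwen3_5_mtp", "num_speculative_tokens": 3}
        argv += ["--speculative-config", json.dumps(spec)]
    extra_env = {}
    if not nvlink:
        extra_env["NCCL_P2P_DISABLE"] = "1"
        argv.append("--disable-custom-all-reduce")
    return extra_env, argv


extra_env, argv = build_serve_command(tp=2, use_mtp=False, nvlink=False)
print(extra_env)
print(" ".join(argv))
```

To launch, pass `argv` to `subprocess.run` with `env={**os.environ, **extra_env}`.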

Benchmarks

Measured on 7× RTX PRO 2000 Blackwell (16 GB), vLLM 0.22, KV fp8, --max-model-len 131072, 512-token greedy generations. single = 1 stream; aggr = total tokens/s across N concurrent streams.

Config	single c1 (t/s)	aggr c2 (t/s)	aggr c4 (t/s)	aggr c8 (t/s)
TP=4, no MTP	49.9	93.1	184.3	340.6
TP=4, MTP n=3	85.2¹	–	237.7	–
TP=2, no MTP	29.0	55.7	108.2	–

¹ generic ignore_eos text; real code generation reaches ~107 t/s (MTP acceptance is content-dependent; measured ~74 %).
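
To read the table as scaling behaviour rather than raw throughput, divide each aggregate figure by the single-stream figure and by the stream count. A small sketch over the no-MTP rows, with the numbers copied from the table; `scaling` is a hypothetical helper:

```python
# Throughput figures copied from the benchmark table (tokens/s).
single = {"TP=4, no MTP": 49.9, "TP=2, no MTP": 29.0}
aggregate = {
    "TP=4, no MTP": {2: 93.1, 4: 184.3, 8: 340.6},
    "TP=2, no MTP": {2: 55.7, 4: 108.2},
}


def scaling(config: str) -> dict[int, tuple[float, float]]:
    """Map concurrency -> (speedup over one stream, per-stream tokens/s)."""
    base = single[config]
    return {
        streams: (round(total / base, 2), round(total / streams, 1))
        for streams, total in aggregate[config].items()
    }


for config in single:
    print(config, scaling(config))
```

On these numbers, eight concurrent streams at TP=4 deliver about 6.8× the single-stream throughput, so batching costs each stream comparatively little.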

128 K context fits at both tensor-parallel sizes: TP=4 → KV pool 1,025,977 tokens (7.83× concurrency at full 128 K); TP=2 → KV pool 198,867 tokens (1.52× at full 128 K, comfortable at normal context lengths). MTP gives roughly 1.7× single-stream throughput on generic text (TP=2: 29→50 t/s; TP=4: 49.9→85.2 t/s) and roughly 2× on code at TP=4 (~49→~107 t/s).
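
The concurrency multipliers are the KV pool size divided by the maximum sequence length. A short check of that arithmetic, using the pool sizes quoted above; `full_context_concurrency` is a hypothetical helper:

```python
MAX_MODEL_LEN = 131072  # --max-model-len used in the benchmark runs


def full_context_concurrency(kv_pool_tokens: int, max_model_len: int = MAX_MODEL_LEN) -> float:
    """How many maximum-length sequences fit in the KV pool at once."""
    return kv_pool_tokens / max_model_len


print(round(full_context_concurrency(1_025_977), 2))  # TP=4 → 7.83
print(round(full_context_concurrency(198_867), 2))    # TP=2 → 1.52
```

At shorter contexts the same pool holds proportionally more sequences, so the TP=2 figure is a worst case.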

Quality

A 13-agent adversarial verification panel rated it "ship" (8 of 10) and found no quality collapse under W4A4:

Python coding 10/10, arithmetic reasoning 10/10, defensive-security helpfulness 9/10.
A pandas SMA-crossover backtest signal was lookahead-safe (.shift(1), NaN-warmup flat) — i.e. it does not use same-bar/future data to set the tradeable position.
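
For readers unfamiliar with the lookahead check, the pattern looks roughly like the sketch below. It is an illustration written for this card, not the model's actual output; the function name, window sizes, and price series are all hypothetical.

```python
import pandas as pd


def sma_crossover_position(close: pd.Series, fast: int = 2, slow: int = 3) -> pd.Series:
    """Long (1) or flat (0) position from a fast/slow SMA crossover, lookahead-safe."""
    fast_sma = close.rolling(fast).mean()
    slow_sma = close.rolling(slow).mean()
    # Comparisons against NaN are False, so the warm-up bars come out flat (0).
    raw_signal = (fast_sma > slow_sma).astype(int)
    # shift(1): the position held on bar t is decided from data up to bar t-1.
    return raw_signal.shift(1).fillna(0).astype(int)


close = pd.Series([1, 2, 3, 4, 5, 6, 5, 4, 3, 2], dtype=float)
position = sma_crossover_position(close)
print(position.tolist())
```

Without the `.shift(1)`, the position on a bar would depend on that bar's own close, which cannot be traded on.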

Caveats

Reasoning model → set max_tokens ≥ 4096 (prefer 8192+). At 2048 it can spend the whole budget inside <think> and return empty content.
Do not produce a W4A16 / NVFP4A16 variant — it fails to serve on vLLM 0.22 (gptq_marlin_repack: size_n=24 not divisible by tile_n_size=64; the 24 attention-heads / DeltaNet odd dims violate Marlin's tile constraint). W4A4 avoids Marlin (NVFP4 cutlass/FlashInfer path).
Bare-format outputs may carry leading whitespace; .strip() if you parse for an exact string.
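
The first and third caveats can be handled on the client side. A minimal sketch for raw text that still contains the <think> tags (with --reasoning-parser qwen3 the server already separates reasoning from content, so only the whitespace stripping applies); `split_thinking` and `safe_max_tokens` are hypothetical helpers:

```python
import re

MIN_MAX_TOKENS = 4096  # floor recommended above; 8192+ preferred

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Split raw model text into (reasoning, answer), stripping whitespace.

    If </think> never arrives (token budget exhausted), the answer is empty.
    """
    match = THINK_RE.search(raw)
    if match:
        return match.group(1).strip(), raw[match.end():].strip()
    if "<think>" in raw:
        # Unterminated reasoning: everything after the tag is still thinking.
        return raw.split("<think>", 1)[1].strip(), ""
    return "", raw.strip()


def safe_max_tokens(requested: int) -> int:
    """Raise a too-small max_tokens to the recommended floor."""
    return max(requested, MIN_MAX_TOKENS)


reasoning, answer = split_thinking("<think>2 + 2</think>\n\n 4 ")
print(repr(answer), safe_max_tokens(2048))
```

An empty `answer` alongside non-empty `reasoning` is the signal to retry with a larger max_tokens.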

License & attribution

Apache-2.0, inherited from the base models. Quantization by sakamakismile (Lna-Lab). Sibling reference: Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP (ModelOpt-produced, same architecture).

Model Details

Model Provider

sakamakismile

Model Tree

Base

Qwen/Qwen3.6-27B

Quantized

this model

Input Modalities

Text

Image

Video

Output Modalities