sakamakismile

Qwen3.6-27B-MTP-pi-tune-NVFP4

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Architecture

Qwen3_5ForConditionalGeneration (model_type qwen3_5), dense 27.8 B:

  • Hybrid attention — 48 Gated-DeltaNet (linear) + 16 full-attention layers (64 total, full_attention_interval=4), hidden 5120, partial RoPE 0.25, 262 K native context.
  • Vision — Qwen3-VL ViT (depth 27, 1152→5120), kept bf16; serve text-only with --limit-mm-per-prompt.
  • Native MTP (mtp_num_hidden_layers=1), kept bf16 → drives vLLM speculative decoding.
  • Thinking-by-default reasoning model (<think>…</think>, use --reasoning-parser qwen3).

Quantization recipe

markdown

QuantizationModifier(targets="Linear", scheme="NVFP4", # W4A4, group_size 16
ignore=["lm_head", "re:.*visual.*", "re:.*conv1d.*", "re:.*mtp.*"])
  • Vision tower, DeltaNet causal conv1d, lm_head, and the entire MTP head stay bf16; everything else is NVFP4 W4A4. 32 calibration samples (neuralmagic/calibration), seq 8192.
  • transformers drops mtp.* on load (_keys_to_ignore_on_load_unexpected), so the 15 bf16 MTP tensors are grafted back into model-mtp-bf16.safetensors post-quantization and spliced into the safetensors index.
  • Note for re-bakers: the grafted MTP modules must also be added to quantization_config.ignore, otherwise vLLM matches mtp.*_proj against targets=["Linear"], expects NVFP4 scales that do not exist, and loads the Qwen3_5MTP draft as garbage → 0 % spec-decode acceptance. With the fix, acceptance is ~74 %.

Serving (vLLM ≥ 0.22)

bash

vllm serve sakamakismile/Qwen3.6-27B-MTP-pi-tune-NVFP4 \
--tensor-parallel-size 4 --max-model-len 131072 \
--max-num-seqs 16 --gpu-memory-utilization 0.90 --kv-cache-dtype fp8 \
--reasoning-parser qwen3 --limit-mm-per-prompt '{"image":0,"video":0}' \
--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

On NVLink-less boxes add NCCL_P2P_DISABLE=1 + --disable-custom-all-reduce. Drop --speculative-config for plain decode. The hybrid model's KV is light (only the 16 full-attention layers cache), so full 128 K context fits even at TP=2.

Benchmarks

Measured on 7× RTX PRO 2000 Blackwell (16 GB), vLLM 0.22, KV fp8, --max-model-len 131072, 512-token greedy generations. single = 1 stream; aggr = total tokens/s across N concurrent streams.

Table
Configsingle (c1)aggr c2aggr c4aggr c8
TP=4, no MTP49.9 t/s93.1184.3340.6
TP=4, MTP n=385.2 t/s¹237.7
TP=2, no MTP29.0 t/s55.7108.2201.1
TP=2, MTP n=350.0 t/s

¹ generic ignore_eos text; real code generation reaches ~107 t/s (MTP acceptance is content-dependent; measured ~74 %).

128 K context fits at both tensor-parallel sizes: TP=4 → KV pool 1,025,977 tokens (7.83× concurrency at full 128 K); TP=2 → KV pool 198,867 tokens (1.52× at full 128 K — comfortable at normal context lengths). MTP roughly 1.7× single-stream (TP=2: 29→50; TP=4 on code: ~49→~107).

Quality

A 13-agent adversarial verification panel rated it ship / 8 of 10, no quality collapse under W4A4:

  • Python coding 10/10, arithmetic reasoning 10/10, defensive-security helpfulness 9/10.
  • A pandas SMA-crossover backtest signal was lookahead-safe (.shift(1), NaN-warmup flat) — i.e. it does not use same-bar/future data to set the tradeable position.

Caveats

  • Reasoning model → set max_tokens ≥ 4096 (prefer 8192+). At 2048 it can spend the whole budget inside <think> and return empty content.
  • Do not produce a W4A16 / NVFP4A16 variant — it fails to serve on vLLM 0.22 (gptq_marlin_repack: size_n=24 not divisible by tile_n_size=64; the 24 attention-heads / DeltaNet odd dims violate Marlin's tile constraint). W4A4 avoids Marlin (NVFP4 cutlass/FlashInfer path).
  • Bare-format outputs may carry leading whitespace; .strip() if you parse for an exact string.

License & attribution

Apache-2.0, inherited from the base models. Quantization by sakamakismile (Lna-Lab). Sibling reference: Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP (ModelOpt-produced, same architecture).

Model provider

sakamakismile

Model tree

Base

Qwen/Qwen3.6-27B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today