Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Contents
model.safetensors(~20.6 GB): single shard- NVFP4-packed body weights (
uint8packed + per-blockfloat8_e4m3fnweight_scale+ per-tensorfloat32weight_scale_2) - BF16 vision tower (333
model.visual.*tensors) - BF16 MTP head (15
mtp.*tensors) - BF16
lm_head.weight - BF16
linear_attn.conv1dandin_proj_*projections
- NVFP4-packed body weights (
config.json—quantization_config.ignorelists the 65 entries kept in BF16 (50 vision blocks, 15 MTP modules)hf_quant_config.json— modelopt metadatachat_template.jinja—froggeric/Qwen-Fixed-Chat-Templates(see Patches)tokenizer.json,tokenizer_config.json,preprocessor_config.json,video_preprocessor_config.json,generation_config.json
Input checkpoint size: 55.6 GB BF16 → output 20.6 GB (0.37×).
Base recipe
The 5-step graft procedure is from lna-lab/GGUF-to-NVFP4-SM120 — credit to Tonoken / LNA-LAB. Recipe doc: docs/MTP_GRAFT_RECIPE.md. VLM-preserving variant: src/quantize/qwen36_27b_vlm_mtp.py.
Step 1 — Quant config
NVFP4_DEFAULT_CFG already excludes linear_attn.conv1d, lm_head, router, mlp.gate, block_sparse_moe.gate. Two ignores added on top:
python
import modelopt.torch.quantization as mtqconfig = mtq.NVFP4_DEFAULT_CFGquant_cfg = dict(config["quant_cfg"])quant_cfg["*visual*"] = {"enable": False} # keep vision tower BF16quant_cfg["*mtp*"] = {"enable": False} # keep MTP head BF16build_config = {"quant_cfg": quant_cfg, "algorithm": config["algorithm"]}
Step 2 — Calibration
20 samples from neuralmagic/calibration (name="LLM", split="train[:20]") at max_seq_len=8192, applied via tokenizer.apply_chat_template(...). Forward-pass calibration with torch.no_grad() and the model in inference mode.
Step 3 — Export
python
from modelopt.torch.export import export_hf_checkpointmtq.quantize(model, build_config, forward_loop=...)export_hf_checkpoint(model, export_dir=OUT)
compressed-tensors.oneshot does not produce a working SM120 NVFP4 checkpoint per lna-lab's notes; modelopt is the path used here.
Step 4 — Graft mtp.* (15 tensors for Qwen3.6-27B dense)
python
from safetensors import safe_openfrom safetensors.torch import load_file, save_file# Walk base BF16 shards, collect mtp.* tensorsshard_to_keys = {...} # via base index.jsonmtp_tensors = {}for shard, keys in shard_to_keys.items():with safe_open(BASE/shard, framework="pt") as f:for k in keys: mtp_tensors[k] = f.get_tensor(k)# Append into the last quantized shard, BF16target = sorted(OUT.glob("model*.safetensors"))[-1]existing = load_file(str(target))for k, v in mtp_tensors.items():existing[k] = v.to(torch.bfloat16).contiguous()save_file(existing, str(target), metadata=meta)# Update index.json weight_map + total_size if multi-shard
Step 5 — Patch config.json
python
mtp_modules = sorted({".".join(k.split(".")[:-1]) for k in mtp_keys if k.endswith(".weight")})cfg["quantization_config"].setdefault("ignore", []).extend(mtp_modules)# vision_config stays; language_model_only stays False
Patches applied on top of the lna-lab recipe
-
Chat template — replaced upstream
Qwen/Qwen3.6-27Bchat_template.jinjawithfroggeric/Qwen-Fixed-Chat-Templates(top-level current version). The upstream template has known silent tool-call drops and<|think_on|>/enable_thinking=falseissues; see Qwen/Qwen3.6-27B/discussions/16, discussions/20, and froggeric/.../discussions/2 (the kraka40 / openclaw tool-call fix). Pair with--tool-call-parser qwen3_xmlif you serve tool-call workloads. -
tokenizer_config.jsonbackendkey strip —tokenizer.save_pretrained()fromtransformers>=5emits"backend": "tokenizers".transformers==4.57.6(the pin in our serving image) does not recognize this field. The recipe strips it post-export:python
import jsoncfg = json.loads(open("tokenizer_config.json").read())cfg.pop("backend", None)open("tokenizer_config.json", "w").write(json.dumps(cfg, indent=2)) -
No other source modifications. Weights are the base modelopt NVFP4 quant + base
mtp.*graft; no fine-tuning, no abliteration, no distillation.
Serving with vLLM
bash
vllm serve <local-path-or-repo-id> \--port 8000 \--max-model-len 65536 \--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}' \--kv-cache-dtype fp8_e4m3 \--mamba-cache-mode align \--trust-remote-code
Notes for vLLM:
- The method string
qwen3_5_mtpis what readstext_config.mtp_num_hidden_layers(whichQwen/Qwen3.6-27Bships as1). The base config does not carrynum_nextn_predict_layers, so theqwen3_next_mtpmethod shown on some Qwen model cards resolves ton_predict=0on this checkpoint. - The hybrid attention layout (16 full + 48 linear) is recognized via the
qwen3_5model_type. - Vision is retained on disk. To skip vision at serve time, add
--language-model-only --limit-mm-per-prompt '{"image": 0, "video": 0}'.
Context
The artifact is published here so the work can be referenced, not because it has been evaluated for production correctness or performance.
Acknowledgments
lna-lab/GGUF-to-NVFP4-SM120— published the modelopt + MTP-graft recipe used here.sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP— same recipe, text-only variant.froggeric/Qwen-Fixed-Chat-Templates— chat template.Qwen/Qwen3.6-27B— base weights.
Model provider
natfii
Model tree
Base
Qwen/Qwen3.6-27B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information