natfii

Qwen3.6-27B-VLM-NVFP4-MTP

README

License: apache-2.0

model.safetensors (~20.6 GB): single shard
- NVFP4-packed body weights (uint8 packed + per-block float8_e4m3fn weight_scale + per-tensor float32 weight_scale_2)
- BF16 vision tower (333 model.visual.* tensors)
- BF16 MTP head (15 mtp.* tensors)
- BF16 lm_head.weight
- BF16 linear_attn.conv1d and in_proj_* projections
config.json — quantization_config.ignore lists the 65 entries kept in BF16 (50 vision blocks, 15 MTP modules)
hf_quant_config.json — modelopt metadata
chat_template.jinja — froggeric/Qwen-Fixed-Chat-Templates (see Patches)
tokenizer.json, tokenizer_config.json, preprocessor_config.json, video_preprocessor_config.json, generation_config.json

Input checkpoint size: 55.6 GB BF16 → output 20.6 GB (0.37×).

Base recipe

The 5-step graft procedure is from lna-lab/GGUF-to-NVFP4-SM120 — credit to Tonoken / LNA-LAB. Recipe doc: docs/MTP_GRAFT_RECIPE.md. VLM-preserving variant: src/quantize/qwen36_27b_vlm_mtp.py.

Step 1 — Quant config

NVFP4_DEFAULT_CFG already excludes linear_attn.conv1d, lm_head, router, mlp.gate, block_sparse_moe.gate. Two ignores added on top:

python
import modelopt.torch.quantization as mtq
config = mtq.NVFP4_DEFAULT_CFG
quant_cfg = dict(config["quant_cfg"])
quant_cfg["*visual*"] = {"enable": False}   # keep vision tower BF16
quant_cfg["*mtp*"]    = {"enable": False}   # keep MTP head BF16
build_config = {"quant_cfg": quant_cfg, "algorithm": config["algorithm"]}

Step 2 — Calibration

20 samples from neuralmagic/calibration (name="LLM", split="train[:20]") at max_seq_len=8192, applied via tokenizer.apply_chat_template(...). Forward-pass calibration with torch.no_grad() and the model in inference mode.

Step 3 — Export

python
from modelopt.torch.export import export_hf_checkpoint
mtq.quantize(model, build_config, forward_loop=...)
export_hf_checkpoint(model, export_dir=OUT)

compressed-tensors.oneshot does not produce a working SM120 NVFP4 checkpoint per lna-lab's notes; modelopt is the path used here.

Step 4 — Graft `mtp.*` (15 tensors for Qwen3.6-27B dense)

python
from safetensors import safe_open
from safetensors.torch import load_file, save_file
# Walk base BF16 shards, collect mtp.* tensors
shard_to_keys = {...}    # via base index.json
mtp_tensors = {}
for shard, keys in shard_to_keys.items():
    with safe_open(BASE/shard, framework="pt") as f:
        for k in keys: mtp_tensors[k] = f.get_tensor(k)
# Append into the last quantized shard, BF16
target = sorted(OUT.glob("model*.safetensors"))[-1]
existing = load_file(str(target))
for k, v in mtp_tensors.items():
    existing[k] = v.to(torch.bfloat16).contiguous()
save_file(existing, str(target), metadata=meta)
# Update index.json weight_map + total_size if multi-shard

Step 5 — Patch `config.json`

python
mtp_modules = sorted({".".join(k.split(".")[:-1]) for k in mtp_keys if k.endswith(".weight")})
cfg["quantization_config"].setdefault("ignore", []).extend(mtp_modules)
# vision_config stays; language_model_only stays False

Patches applied on top of the lna-lab recipe

Chat template — replaced upstream Qwen/Qwen3.6-27B chat_template.jinja with froggeric/Qwen-Fixed-Chat-Templates (top-level current version). The upstream template has known silent tool-call drops and <|think_on|>/enable_thinking=false issues; see Qwen/Qwen3.6-27B/discussions/16, discussions/20, and froggeric/.../discussions/2 (the kraka40 / openclaw tool-call fix). Pair with --tool-call-parser qwen3_xml if you serve tool-call workloads.
tokenizer_config.json key strip — from emits . (the pin in our serving image) does not recognize this field. The recipe strips it post-export:

Serving with vLLM

bash
vllm serve <local-path-or-repo-id> \
  --port 8000 \
  --max-model-len 65536 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}' \
  --kv-cache-dtype fp8_e4m3 \
  --mamba-cache-mode align \
  --trust-remote-code

Notes for vLLM:

The method string qwen3_5_mtp is what reads text_config.mtp_num_hidden_layers (which Qwen/Qwen3.6-27B ships as 1). The base config does not carry num_nextn_predict_layers, so the qwen3_next_mtp method shown on some Qwen model cards resolves to n_predict=0 on this checkpoint.
The hybrid attention layout (16 full + 48 linear) is recognized via the qwen3_5 model_type.
Vision is retained on disk. To skip vision at serve time, add --language-model-only --limit-mm-per-prompt '{"image": 0, "video": 0}'.

Context

The artifact is published here so the work can be referenced, not because it has been evaluated for production correctness or performance.

Acknowledgments

lna-lab/GGUF-to-NVFP4-SM120 — published the modelopt + MTP-graft recipe used here.
sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP — same recipe, text-only variant.
froggeric/Qwen-Fixed-Chat-Templates — chat template.
Qwen/Qwen3.6-27B — base weights.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

natfii

Model Tree

Base

Qwen/Qwen3.6-27B

Quantized

this model

Input Modalities

TextImageVideo

Output Modalities

Text

Supported Functionality

Dedicated Endpoints

Explore FriendliAI today

Get started Talk to an engineer

README

License: apache-2.0

model.safetensors (~20.6 GB): single shard
- NVFP4-packed body weights (uint8 packed + per-block float8_e4m3fn weight_scale + per-tensor float32 weight_scale_2)
- BF16 vision tower (333 model.visual.* tensors)
- BF16 MTP head (15 mtp.* tensors)
- BF16 lm_head.weight
- BF16 linear_attn.conv1d and in_proj_* projections
config.json — quantization_config.ignore lists the 65 entries kept in BF16 (50 vision blocks, 15 MTP modules)
hf_quant_config.json — modelopt metadata
chat_template.jinja — froggeric/Qwen-Fixed-Chat-Templates (see Patches)
tokenizer.json, tokenizer_config.json, preprocessor_config.json, video_preprocessor_config.json, generation_config.json

Input checkpoint size: 55.6 GB BF16 → output 20.6 GB (0.37×).

Base recipe

The 5-step graft procedure is from lna-lab/GGUF-to-NVFP4-SM120 — credit to Tonoken / LNA-LAB. Recipe doc: docs/MTP_GRAFT_RECIPE.md. VLM-preserving variant: src/quantize/qwen36_27b_vlm_mtp.py.

Step 1 — Quant config

NVFP4_DEFAULT_CFG already excludes linear_attn.conv1d, lm_head, router, mlp.gate, block_sparse_moe.gate. Two ignores added on top:

python
import modelopt.torch.quantization as mtq
config = mtq.NVFP4_DEFAULT_CFG
quant_cfg = dict(config["quant_cfg"])
quant_cfg["*visual*"] = {"enable": False}   # keep vision tower BF16
quant_cfg["*mtp*"]    = {"enable": False}   # keep MTP head BF16
build_config = {"quant_cfg": quant_cfg, "algorithm": config["algorithm"]}

Step 2 — Calibration

Step 3 — Export

python
from modelopt.torch.export import export_hf_checkpoint
mtq.quantize(model, build_config, forward_loop=...)
export_hf_checkpoint(model, export_dir=OUT)

compressed-tensors.oneshot does not produce a working SM120 NVFP4 checkpoint per lna-lab's notes; modelopt is the path used here.

Step 4 — Graft `mtp.*` (15 tensors for Qwen3.6-27B dense)

python
from safetensors import safe_open
from safetensors.torch import load_file, save_file
# Walk base BF16 shards, collect mtp.* tensors
shard_to_keys = {...}    # via base index.json
mtp_tensors = {}
for shard, keys in shard_to_keys.items():
    with safe_open(BASE/shard, framework="pt") as f:
        for k in keys: mtp_tensors[k] = f.get_tensor(k)
# Append into the last quantized shard, BF16
target = sorted(OUT.glob("model*.safetensors"))[-1]
existing = load_file(str(target))
for k, v in mtp_tensors.items():
    existing[k] = v.to(torch.bfloat16).contiguous()
save_file(existing, str(target), metadata=meta)
# Update index.json weight_map + total_size if multi-shard

Step 5 — Patch `config.json`

python
mtp_modules = sorted({".".join(k.split(".")[:-1]) for k in mtp_keys if k.endswith(".weight")})
cfg["quantization_config"].setdefault("ignore", []).extend(mtp_modules)
# vision_config stays; language_model_only stays False

Patches applied on top of the lna-lab recipe

Chat template — replaced upstream Qwen/Qwen3.6-27B chat_template.jinja with froggeric/Qwen-Fixed-Chat-Templates (top-level current version). The upstream template has known silent tool-call drops and <|think_on|>/enable_thinking=false issues; see Qwen/Qwen3.6-27B/discussions/16, discussions/20, and froggeric/.../discussions/2 (the kraka40 / openclaw tool-call fix). Pair with --tool-call-parser qwen3_xml if you serve tool-call workloads.
tokenizer_config.json key strip — from emits . (the pin in our serving image) does not recognize this field. The recipe strips it post-export:

Serving with vLLM

bash
vllm serve <local-path-or-repo-id> \
  --port 8000 \
  --max-model-len 65536 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}' \
  --kv-cache-dtype fp8_e4m3 \
  --mamba-cache-mode align \
  --trust-remote-code

Notes for vLLM:

The method string qwen3_5_mtp is what reads text_config.mtp_num_hidden_layers (which Qwen/Qwen3.6-27B ships as 1). The base config does not carry num_nextn_predict_layers, so the qwen3_next_mtp method shown on some Qwen model cards resolves to n_predict=0 on this checkpoint.
The hybrid attention layout (16 full + 48 linear) is recognized via the qwen3_5 model_type.
Vision is retained on disk. To skip vision at serve time, add --language-model-only --limit-mm-per-prompt '{"image": 0, "video": 0}'.

Context

The artifact is published here so the work can be referenced, not because it has been evaluated for production correctness or performance.

Acknowledgments

lna-lab/GGUF-to-NVFP4-SM120 — published the modelopt + MTP-graft recipe used here.
sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP — same recipe, text-only variant.
froggeric/Qwen-Fixed-Chat-Templates — chat template.
Qwen/Qwen3.6-27B — base weights.

python

import json
cfg = json.loads(open("tokenizer_config.json").read())
cfg.pop("backend", None)
open("tokenizer_config.json", "w").write(json.dumps(cfg, indent=2))

Qwen3.6-27B-VLM-NVFP4-MTP

README

Contents

Base recipe

Step 1 — Quant config

Step 2 — Calibration

Step 3 — Export

Step 4 — Graft `mtp.*` (15 tensors for Qwen3.6-27B dense)

Step 5 — Patch `config.json`

Patches applied on top of the lna-lab recipe

Serving with vLLM

Context

Acknowledgments

Explore FriendliAI today

README

Contents

Base recipe

Step 1 — Quant config

Step 2 — Calibration

Step 3 — Export

Step 4 — Graft `mtp.*` (15 tensors for Qwen3.6-27B dense)

Step 5 — Patch `config.json`

Patches applied on top of the lna-lab recipe

Serving with vLLM

Context

Acknowledgments

Qwen3.6-27B-VLM-NVFP4-MTP

README

Contents

Base recipe

Step 1 — Quant config

Step 2 — Calibration

Step 3 — Export

Step 4 — Graft mtp.* (15 tensors for Qwen3.6-27B dense)

Step 5 — Patch config.json

Patches applied on top of the lna-lab recipe

Serving with vLLM

Context

Acknowledgments

Explore FriendliAI today

README

Contents

Base recipe

Step 1 — Quant config

Step 2 — Calibration

Step 3 — Export

Step 4 — Graft mtp.* (15 tensors for Qwen3.6-27B dense)

Step 5 — Patch config.json

Patches applied on top of the lna-lab recipe

Serving with vLLM

Context

Acknowledgments

Step 4 — Graft `mtp.*` (15 tensors for Qwen3.6-27B dense)

Step 5 — Patch `config.json`

Step 4 — Graft `mtp.*` (15 tensors for Qwen3.6-27B dense)

Step 5 — Patch `config.json`