Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Contents

  • model.safetensors (~20.6 GB): single shard
    • NVFP4-packed body weights (uint8 packed + per-block float8_e4m3fn weight_scale + per-tensor float32 weight_scale_2)
    • BF16 vision tower (333 model.visual.* tensors)
    • BF16 MTP head (15 mtp.* tensors)
    • BF16 lm_head.weight
    • BF16 linear_attn.conv1d and in_proj_* projections
  • config.jsonquantization_config.ignore lists the 65 entries kept in BF16 (50 vision blocks, 15 MTP modules)
  • hf_quant_config.json — modelopt metadata
  • chat_template.jinjafroggeric/Qwen-Fixed-Chat-Templates (see Patches)
  • tokenizer.json, tokenizer_config.json, preprocessor_config.json, video_preprocessor_config.json, generation_config.json

Input checkpoint size: 55.6 GB BF16 → output 20.6 GB (0.37×).

Base recipe

The 5-step graft procedure is from lna-lab/GGUF-to-NVFP4-SM120 — credit to Tonoken / LNA-LAB. Recipe doc: docs/MTP_GRAFT_RECIPE.md. VLM-preserving variant: src/quantize/qwen36_27b_vlm_mtp.py.

Step 1 — Quant config

NVFP4_DEFAULT_CFG already excludes linear_attn.conv1d, lm_head, router, mlp.gate, block_sparse_moe.gate. Two ignores added on top:

python

import modelopt.torch.quantization as mtq
config = mtq.NVFP4_DEFAULT_CFG
quant_cfg = dict(config["quant_cfg"])
quant_cfg["*visual*"] = {"enable": False} # keep vision tower BF16
quant_cfg["*mtp*"] = {"enable": False} # keep MTP head BF16
build_config = {"quant_cfg": quant_cfg, "algorithm": config["algorithm"]}

Step 2 — Calibration

20 samples from neuralmagic/calibration (name="LLM", split="train[:20]") at max_seq_len=8192, applied via tokenizer.apply_chat_template(...). Forward-pass calibration with torch.no_grad() and the model in inference mode.

Step 3 — Export

python

from modelopt.torch.export import export_hf_checkpoint
mtq.quantize(model, build_config, forward_loop=...)
export_hf_checkpoint(model, export_dir=OUT)

compressed-tensors.oneshot does not produce a working SM120 NVFP4 checkpoint per lna-lab's notes; modelopt is the path used here.

Step 4 — Graft mtp.* (15 tensors for Qwen3.6-27B dense)

python

from safetensors import safe_open
from safetensors.torch import load_file, save_file
# Walk base BF16 shards, collect mtp.* tensors
shard_to_keys = {...} # via base index.json
mtp_tensors = {}
for shard, keys in shard_to_keys.items():
with safe_open(BASE/shard, framework="pt") as f:
for k in keys: mtp_tensors[k] = f.get_tensor(k)
# Append into the last quantized shard, BF16
target = sorted(OUT.glob("model*.safetensors"))[-1]
existing = load_file(str(target))
for k, v in mtp_tensors.items():
existing[k] = v.to(torch.bfloat16).contiguous()
save_file(existing, str(target), metadata=meta)
# Update index.json weight_map + total_size if multi-shard

Step 5 — Patch config.json

python

mtp_modules = sorted({".".join(k.split(".")[:-1]) for k in mtp_keys if k.endswith(".weight")})
cfg["quantization_config"].setdefault("ignore", []).extend(mtp_modules)
# vision_config stays; language_model_only stays False

Patches applied on top of the lna-lab recipe

  1. Chat template — replaced upstream Qwen/Qwen3.6-27B chat_template.jinja with froggeric/Qwen-Fixed-Chat-Templates (top-level current version). The upstream template has known silent tool-call drops and <|think_on|>/enable_thinking=false issues; see Qwen/Qwen3.6-27B/discussions/16, discussions/20, and froggeric/.../discussions/2 (the kraka40 / openclaw tool-call fix). Pair with --tool-call-parser qwen3_xml if you serve tool-call workloads.

  2. tokenizer_config.json backend key striptokenizer.save_pretrained() from transformers>=5 emits "backend": "tokenizers". transformers==4.57.6 (the pin in our serving image) does not recognize this field. The recipe strips it post-export:

    python

    import json
    cfg = json.loads(open("tokenizer_config.json").read())
    cfg.pop("backend", None)
    open("tokenizer_config.json", "w").write(json.dumps(cfg, indent=2))
  3. No other source modifications. Weights are the base modelopt NVFP4 quant + base mtp.* graft; no fine-tuning, no abliteration, no distillation.

Serving with vLLM

bash

vllm serve <local-path-or-repo-id> \
--port 8000 \
--max-model-len 65536 \
--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}' \
--kv-cache-dtype fp8_e4m3 \
--mamba-cache-mode align \
--trust-remote-code

Notes for vLLM:

  • The method string qwen3_5_mtp is what reads text_config.mtp_num_hidden_layers (which Qwen/Qwen3.6-27B ships as 1). The base config does not carry num_nextn_predict_layers, so the qwen3_next_mtp method shown on some Qwen model cards resolves to n_predict=0 on this checkpoint.
  • The hybrid attention layout (16 full + 48 linear) is recognized via the qwen3_5 model_type.
  • Vision is retained on disk. To skip vision at serve time, add --language-model-only --limit-mm-per-prompt '{"image": 0, "video": 0}'.

Context

The artifact is published here so the work can be referenced, not because it has been evaluated for production correctness or performance.

Acknowledgments

Model provider

natfii

Model tree

Base

Qwen/Qwen3.6-27B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today