natfii

Qwen3.6-27B-VLM-Cascade-NVFP4-MTP

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

Lineage

Table

Base	`Qwen/Qwen3.6-27B` (VLM, image-text-to-text)
Post-training	Cascade-style: reasoning SFT → RL (RLVR + on-policy distillation), vision frozen
Quantization	NVFP4 body via `nvidia-modelopt`; `lm_head` + MTP head + vision tower kept BF16
Speculative decoding	`qwen3_5_mtp` 1-layer draft head (verbatim base head, kept BF16)

Architecture (from `config.json`)

27B params, hybrid attention: 16 full-attention + 48 linear-attention layers (full_attention_interval=4), hidden_size=5120, num_hidden_layers=64.
Full attention: 24 query / 4 KV heads, head_dim=256 (GQA).
Linear attention: 16 key / 48 value heads, head_dim 128, conv kernel 4 — constant-size recurrent state (context-length independent).
Vision tower (model.visual.*) retained in BF16; skip at serve time with --language-model-only / ENABLE_VISION=0.
vocab_size=248320.

Reasoning modes

ChatML with toggleable thinking, à la Cascade. Thinking is off by default — without enable_thinking the template emits an empty <think></think> and the model answers directly.

Instruct (default): adjacent empty <think></think>; no visible reasoning trace.
Thinking (opt-in): pass chat_template_kwargs={"enable_thinking": true} (or <|think_on|> in the system message); generation then begins <think>.
Termination handoff (thinking mode only): the template appends a brief reasoning→answer instruction to the system prompt (reason fully, verify, then close </think> and answer; don't re-confirm settled work) — curbs the runaway re-verification loops; not applied in instruct mode or when tools are passed (the tool path has its own handoff).

Recommended sampling: temperature=0.7, top_p=0.95, top_k=20, repetition_penalty=1.1 — and never greedy (temperature=0 loops). The Cascade-2 paper uses 1.0 for its avg@k evals, but at 1.0 this model rambles (9k–60k-token traces) in single-sample use; 0.7 (top of DeepSeek-R1's 0.5–0.7 band) is the deployment recommendation. The repetition_penalty=1.1 curbs the re-verification loops this model is prone to in thinking mode — it lets the model close </think> and answer (clean termination, no measured accuracy loss on math checks); lowering temperature does not help (it deepens the loop).

Serving (vLLM, NVFP4 + MTP)

Edit for your use. Agentic workflows require more memory.

bash
# REQUIRED on GB10: the auto-selected FlashInfer NVFP4 GEMM leaks a ~394 MB non-torch
# workspace per linear layer (~100 GiB during profile_run) → fills the unified pool →
# hard reboot. CUTLASS uses a torch-managed workspace — no leak.
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export TORCH_CUDA_ARCH_LIST=12.1               # sm_121

vllm serve /path/to/qwen36-vlm-cascade-nvfp4-mtp \
  --served-model-name qwen3.6-27b-vlm-cascade-nvfp4-mtp \
  --host 0.0.0.0 --port 8002 \
  --quantization modelopt_fp4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.7 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --kv-cache-dtype fp8 \
  --attention-backend TRITON_ATTN \
  --enable-chunked-prefill \
  --language-model-only \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}' \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --trust-remote-code

--language-model-only loads only the language model — the text benchmarks skip the vision tower. Omit it to serve the full vision-language model.
Keep TRITON_ATTN main attention and PIECEWISE cuda graphs on this hybrid mamba / full-attention architecture. --gpu-memory-utilization 0.8 trips on the inference-time Triton JIT of the spec/GDN decode kernels (non-torch headroom, not KV, is the limit at this util) — keep 0.7 and run a MemAvailable watchdog as a guard.
MTP/NEXTN spec-decode (num_speculative_tokens: 2) is lossless — pure decode speedup, identical outputs (~80% draft acceptance measured).
Add --reasoning-parser qwen3 to split <think> traces out of content — exposed as message.reasoning on (SGLang uses ); left off here so the trace reads in full. if you use (below).

Thinking is off by default (see Reasoning modes): pass chat_template_kwargs={"enable_thinking": true} per request to enable reasoning, or put <|think_on|> in the system message (<|think_off|> / enable_thinking=false forces it off). This model reasons at length, so enabling thinking under a small max_tokens can return an only-reasoning, truncated reply — budget accordingly, or hard-cap it: pass thinking_token_budget=N (vLLM sampling param; requires --reasoning-parser qwen3) to force-close </think> after N reasoning tokens. Set it generously (~3000–4000 — genuine hard problems use ~2800) so it only catches runaway loops, not legitimate reasoning. (SGLang: --enable-strict-thinking + per-request custom_params={"thinking_budget": N}.) The template ships Qwen-native XML tool calling (<tool_call><function=…>) — add --enable-auto-tool-choice --tool-call-parser qwen3_xml to the serve command to enable it.

Performance (GB10)

Decode is memory-bandwidth bound (~273 GB/s unified memory). Measured on the serve above (NVFP4 body, fp8 KV, MTP num_speculative_tokens: 2, 131072 context):

Single stream: ~16 tok/s.
64-way concurrent: ~400–490 tok/s aggregate — the GB10 throughput ceiling (raising --max-num-seqs past 64 is flat, ~+2%).
MTP/NEXTN spec-decode: ~2.6 mean accepted tokens per step (~80% draft acceptance) — a lossless decode speedup, not a quality change.

Evaluation

Benchmarking was time-gated for this release. We recommend running full benchmarks for a thorough evaluation.

Intended use & limitations

Use: local reasoning + vision-language + agentic/tool use on GB10.
Not production-evaluated beyond the light benchmark above — validate for your use case.
Heavy text-reasoning RL can erode visual grounding even with the vision tower frozen; evaluate vision before relying on it.
License: Apache-2.0 with attribution — see License, attribution & data provenance below. All training-data licenses are attribution-only and commercial-OK.

The two-repo pattern

Table with columns: Repo, Artifact, For
Repo	Artifact	For
`natfii/Qwen3.6-27B-VLM-Cascade`	BF16 master + base `mtp.*` draft head	Re-quantizing to any format (NVFP4 / FP8 / AWQ / GGUF…), further fine-tuning, BF16 inference, the QAD/distill teacher
`natfii/Qwen3.6-27B-VLM-Cascade-NVFP4-MTP` (this one)	NVFP4 body + BF16 `lm_head` + MTP head	Drop-in GB10 / DGX Spark deployment build (vLLM NEXTN spec-decode)

License, attribution & data provenance

License — Apache-2.0. This NVFP4 deployment build is a derivative of Qwen/Qwen3.6-27B (released under Apache-2.0) and is itself published under Apache-2.0. You may use it commercially or non-commercially, provided you retain the LICENSE and NOTICE files and the attributions below. The full-precision BF16 master (the re-quantization source) is at natfii/Qwen3.6-27B-VLM-Cascade.

Non-binding note. This is a personal homelab project, provided as-is with no warranty or support and not commercially maintained. This is courtesy context only — it does not add any restriction to the Apache-2.0 grant.

Attribution.

Base model Qwen/Qwen3.6-27B © Alibaba Cloud / the Qwen team — Apache-2.0.
Cascade-style post-training, NVFP4 quantization, and MTP-head graft + re-align, packaged by natfii.
Method attribution: the recipe emulates Nemotron-Cascade-2 (NVIDIA; arXiv 2603.19220) — method emulation only, not a redistribution of NVIDIA's pipeline or weights.

Training-data provenance. Every dataset in the lineage is attribution-only and commercial-OK; the OML-licensed 593 GB Nemotron SFT corpus was deliberately not used, so no OML obligation attaches.

Table with columns: Stage, Dataset(s), License
Stage	Dataset(s)	License
SFT cold-start (~10k `<think>` traces; ~6k math + ~4k code)	`open-thoughts/OpenThoughts-114k` + `open-r1/OpenR1-Math-220k`	Apache-2.0 (both)
Math RLVR prompts	`nvidia/AceReason-Math` (← NuminaMath-1.5 + DeepScaleR-Preview)	CC-BY-4.0

The SFT traces are DeepSeek-R1-distilled (via the two open datasets above); DeepSeek-R1 is MIT-licensed and expressly permits distillation, and both datasets relicense their traces under Apache-2.0 — disclosed for transparency; no extra obligation attaches. Full attributions are reproduced in the repo NOTICE file.

Provenance

Base quant + MTP graft by natfii (lna-lab NVFP4-SM120 recipe). Cascade-style post-training + re-quant + MTP re-graft via the qwen-cascade pipeline.

Model provider

natfii

Model tree

Base

Qwen/Qwen3.6-27B

Fine-tuned

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

Lineage

Table

Base	`Qwen/Qwen3.6-27B` (VLM, image-text-to-text)
Post-training	Cascade-style: reasoning SFT → RL (RLVR + on-policy distillation), vision frozen
Quantization	NVFP4 body via `nvidia-modelopt`; `lm_head` + MTP head + vision tower kept BF16
Speculative decoding	`qwen3_5_mtp` 1-layer draft head (verbatim base head, kept BF16)

Architecture (from `config.json`)

27B params, hybrid attention: 16 full-attention + 48 linear-attention layers (full_attention_interval=4), hidden_size=5120, num_hidden_layers=64.
Full attention: 24 query / 4 KV heads, head_dim=256 (GQA).
Linear attention: 16 key / 48 value heads, head_dim 128, conv kernel 4 — constant-size recurrent state (context-length independent).
Vision tower (model.visual.*) retained in BF16; skip at serve time with --language-model-only / ENABLE_VISION=0.
vocab_size=248320.

Reasoning modes

ChatML with toggleable thinking, à la Cascade. Thinking is off by default — without enable_thinking the template emits an empty <think></think> and the model answers directly.

Instruct (default): adjacent empty <think></think>; no visible reasoning trace.
Thinking (opt-in): pass chat_template_kwargs={"enable_thinking": true} (or <|think_on|> in the system message); generation then begins <think>.
Termination handoff (thinking mode only): the template appends a brief reasoning→answer instruction to the system prompt (reason fully, verify, then close </think> and answer; don't re-confirm settled work) — curbs the runaway re-verification loops; not applied in instruct mode or when tools are passed (the tool path has its own handoff).

Serving (vLLM, NVFP4 + MTP)

Edit for your use. Agentic workflows require more memory.

bash
# REQUIRED on GB10: the auto-selected FlashInfer NVFP4 GEMM leaks a ~394 MB non-torch
# workspace per linear layer (~100 GiB during profile_run) → fills the unified pool →
# hard reboot. CUTLASS uses a torch-managed workspace — no leak.
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export TORCH_CUDA_ARCH_LIST=12.1               # sm_121

vllm serve /path/to/qwen36-vlm-cascade-nvfp4-mtp \
  --served-model-name qwen3.6-27b-vlm-cascade-nvfp4-mtp \
  --host 0.0.0.0 --port 8002 \
  --quantization modelopt_fp4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.7 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --kv-cache-dtype fp8 \
  --attention-backend TRITON_ATTN \
  --enable-chunked-prefill \
  --language-model-only \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}' \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --trust-remote-code

--language-model-only loads only the language model — the text benchmarks skip the vision tower. Omit it to serve the full vision-language model.
Keep TRITON_ATTN main attention and PIECEWISE cuda graphs on this hybrid mamba / full-attention architecture. --gpu-memory-utilization 0.8 trips on the inference-time Triton JIT of the spec/GDN decode kernels (non-torch headroom, not KV, is the limit at this util) — keep 0.7 and run a MemAvailable watchdog as a guard.
MTP/NEXTN spec-decode (num_speculative_tokens: 2) is lossless — pure decode speedup, identical outputs (~80% draft acceptance measured).
Add --reasoning-parser qwen3 to split <think> traces out of content — exposed as message.reasoning on (SGLang uses ); left off here so the trace reads in full. if you use (below).

Performance (GB10)

Decode is memory-bandwidth bound (~273 GB/s unified memory). Measured on the serve above (NVFP4 body, fp8 KV, MTP num_speculative_tokens: 2, 131072 context):

Single stream: ~16 tok/s.
64-way concurrent: ~400–490 tok/s aggregate — the GB10 throughput ceiling (raising --max-num-seqs past 64 is flat, ~+2%).
MTP/NEXTN spec-decode: ~2.6 mean accepted tokens per step (~80% draft acceptance) — a lossless decode speedup, not a quality change.

Evaluation

Benchmarking was time-gated for this release. We recommend running full benchmarks for a thorough evaluation.

Intended use & limitations

Use: local reasoning + vision-language + agentic/tool use on GB10.
Not production-evaluated beyond the light benchmark above — validate for your use case.
Heavy text-reasoning RL can erode visual grounding even with the vision tower frozen; evaluate vision before relying on it.
License: Apache-2.0 with attribution — see License, attribution & data provenance below. All training-data licenses are attribution-only and commercial-OK.

The two-repo pattern

Table with columns: Repo, Artifact, For
Repo	Artifact	For
`natfii/Qwen3.6-27B-VLM-Cascade`	BF16 master + base `mtp.*` draft head	Re-quantizing to any format (NVFP4 / FP8 / AWQ / GGUF…), further fine-tuning, BF16 inference, the QAD/distill teacher
`natfii/Qwen3.6-27B-VLM-Cascade-NVFP4-MTP` (this one)	NVFP4 body + BF16 `lm_head` + MTP head	Drop-in GB10 / DGX Spark deployment build (vLLM NEXTN spec-decode)

License, attribution & data provenance

Non-binding note. This is a personal homelab project, provided as-is with no warranty or support and not commercially maintained. This is courtesy context only — it does not add any restriction to the Apache-2.0 grant.

Attribution.

Cascade-style post-training, NVFP4 quantization, and MTP-head graft + re-align, packaged by natfii.
Method attribution: the recipe emulates Nemotron-Cascade-2 (NVIDIA; arXiv 2603.19220) — method emulation only, not a redistribution of NVIDIA's pipeline or weights.

Table with columns: Stage, Dataset(s), License
Stage	Dataset(s)	License
SFT cold-start (~10k `<think>` traces; ~6k math + ~4k code)	`open-thoughts/OpenThoughts-114k` + `open-r1/OpenR1-Math-220k`	Apache-2.0 (both)
Math RLVR prompts	`nvidia/AceReason-Math` (← NuminaMath-1.5 + DeepScaleR-Preview)	CC-BY-4.0

Provenance

Base quant + MTP graft by natfii (lna-lab NVFP4-SM120 recipe). Cascade-style post-training + re-quant + MTP re-graft via the qwen-cascade pipeline.

Qwen3.6-27B-VLM-Cascade-NVFP4-MTP

Get help setting up a custom Dedicated Endpoints.

README

Lineage

Architecture (from `config.json`)

Reasoning modes

Serving (vLLM, NVFP4 + MTP)

Performance (GB10)

Evaluation

Intended use & limitations

The two-repo pattern

License, attribution & data provenance

Provenance

Explore FriendliAI today

README

Lineage

Architecture (from `config.json`)

Reasoning modes

Serving (vLLM, NVFP4 + MTP)

Performance (GB10)

Evaluation

Intended use & limitations

The two-repo pattern

License, attribution & data provenance

Provenance

Qwen3.6-27B-VLM-Cascade-NVFP4-MTP

Get help setting up a custom Dedicated Endpoints.

Lineage

Architecture (from config.json)

Reasoning modes

Serving (vLLM, NVFP4 + MTP)

Performance (GB10)

Evaluation

Intended use & limitations

The two-repo pattern

License, attribution & data provenance

Provenance

Explore FriendliAI today

Lineage

Architecture (from config.json)

Reasoning modes

Serving (vLLM, NVFP4 + MTP)

Performance (GB10)

Evaluation

Intended use & limitations

The two-repo pattern

License, attribution & data provenance

Provenance

Architecture (from `config.json`)

Architecture (from `config.json`)