natfii

Qwen3.6-27B-VLM-Cascade

README

License: apache-2.0

The two-repo pattern

Table with columns: Repo, Artifact, For
Repo	Artifact	For
`natfii/Qwen3.6-27B-VLM-Cascade` (this one)	BF16 master + base `mtp.*` draft head	Re-quantizing to any format (NVFP4 / FP8 / AWQ / GGUF…), further fine-tuning, BF16 inference, the QAD/distill teacher
`natfii/Qwen3.6-27B-VLM-Cascade-NVFP4-MTP`	NVFP4 body + BF16 `lm_head` + BF16 MTP head	Drop-in GB10 / DGX Spark deployment build (vLLM NEXTN spec-decode)

Lineage

Table

Base	`Qwen/Qwen3.6-27B` (VLM, image-text-to-text), apache-2.0
Post-training	Cascade-style: reasoning SFT → sequential RLVR + MOPD self-distillation, vision tower frozen
Precision	BF16 throughout (this is the master; not quantized)
MTP draft head	1-layer `qwen3_5_mtp` head (verbatim base head, kept BF16)

Architecture (from `config.json`)

27B params, hybrid attention: 16 full-attention + 48 linear-attention layers (full_attention_interval=4), hidden_size=5120, num_hidden_layers=64. The layer_types list places full attention at indices 3, 7, 11, …, 63; the other 48 are GatedDeltaNet (linear-attention) blocks with a constant-size recurrent state (context-length independent).
Full attention: 24 query / 4 KV heads, head_dim=256 (GQA).
Vision tower (model.visual.*) in BF16; frozen during all post-training. Skip at serve time for text-only workloads if your runtime supports it.
MTP: 1 draft-head layer (mtp_num_hidden_layers=1, mtp_use_dedicated_embeddings=False) — fuses [previous-token embedding ; target hidden state] through a small FC, runs one decoder block, and reuses lm_head. Here the head is the verbatim base draft head, kept BF16.
.

The MTP head

This repo ships the verbatim base qwen3_5_mtp draft head — the original 1-layer head, kept BF16, grafted additively onto the post-trained body for NEXTN speculative decoding. Spec-decode is lossless (the draft head only affects decode speed, never the output), so the base head is a safe default; re-measure accepted length on your serving stack, and optionally re-align the head to this target if you want higher acceptance.

Fusion: the head uses single-final-hidden NEXTN (--fusion final), not EAGLE-3 multi-layer fusion.

Reasoning modes

ChatML with toggleable thinking, à la Cascade. Thinking is off by default — when a request does not set enable_thinking, the template emits an empty <think></think> and the model answers directly.

Instruct (default): adjacent empty <think></think>; no visible reasoning trace.
Thinking (opt-in): pass chat_template_kwargs={"enable_thinking": true} (or put <|think_on|> in the system message); generation then begins <think>\n and the model reasons before answering. <|think_off|> / enable_thinking=false forces it off.
Termination handoff (thinking mode only): the template appends a brief reasoning→answer instruction to the system prompt (reason fully, verify, then close </think> and answer; don't re-confirm settled work) — curbs runaway re-verification loops; not applied in instruct mode or when tools are passed.

This model reasons at length, so enabling thinking under a small max_tokens can return an only-reasoning, truncated reply — budget the completion accordingly. When serving via vLLM or SGLang you can hard-cap the thinking: vLLM thinking_token_budget=N (needs --reasoning-parser qwen3), or SGLang --enable-strict-thinking + custom_params={"thinking_budget": N}, force-close </think> after N reasoning tokens — set it generously (~3000–4000; genuine hard problems use ~2800) so it only catches runaway loops.

Recommended sampling: temperature=0.7, top_p=0.95, top_k=20, repetition_penalty=1.1 — and never greedy (temperature=0 loops; at 1.0 it rambles — the paper's 1.0 is for avg@k eval only). The repetition_penalty=1.1 curbs the re-verification loops this model is prone to in thinking mode — it lets the model close </think> and answer (clean termination, no measured accuracy loss); lowering temperature does not help (it deepens the loop). To split the <think> trace into a separate reasoning channel, use your runtime's qwen3 reasoning parser (the separated trace is message.reasoning on vLLM 0.22.0, reasoning_content on SGLang).

Usage (BF16, transformers)

python
# Qwen3.6 VLM loads as Qwen3_5ForConditionalGeneration; AutoModelForImageTextToText
# with trust_remote_code is the portable fallback.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "natfii/Qwen3.6-27B-VLM-Cascade"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype="bfloat16", device_map="auto", trust_remote_code=True
)
# Thinking is OFF by default (empty "<think></think>"); pass
# apply_chat_template(..., enable_thinking=True) to get the reasoning trace.

Spec-decode / NEXTN: the BF16 mtp.* head is present and aligned to this BF16 target, so runtimes that support the qwen3_5_mtp / NEXTN draft method can speculate directly against this repo. (For a turnkey, memory-bandwidth-friendly GB10 deployment, prefer the NVFP4-MTP repo.)

Re-quantizing this master (e.g. to NVFP4 for GB10)

This BF16 master is the source the NVFP4-MTP deployment build is made from. To reproduce that build, re-quant with nvidia-modelopt and keep the BF16-head invariant ignore-list byte-for-byte (pipeline S4): exclude *model.visual*, *linear_attn.conv1d*, *lm_head*, and *mtp* from NVFP4 (note: linear_attn.in_proj_* and out_proj ARE NVFP4-quantized — re-verify in_proj against hf_quant_config.json at S4 build), and keep the KV-cache FP8 setting identical. Keeping the output and draft heads out of FP4 is what protects both answer quality and speculative acceptance. Graft the mtp.* head into the quantized export (kept BF16, out of the FP4 body); the base head transfers, but re-measure accepted length and optionally re-align it to the quantized target for higher acceptance.

License, attribution & data provenance

License — Apache-2.0. This model is a derivative of Qwen/Qwen3.6-27B (released under Apache-2.0) and is itself published under Apache-2.0. You may use it commercially or non-commercially, provided you retain the LICENSE and NOTICE files and the attributions below.

Non-binding note. This is a personal homelab project, provided as-is with no warranty or support and not commercially maintained. This is courtesy context only — it does not add any restriction to the Apache-2.0 grant.

Attribution.

Base model Qwen/Qwen3.6-27B © Alibaba Cloud / the Qwen team — Apache-2.0.
Cascade-style post-training, MTP-head graft + re-align, and packaging by natfii.
Method attribution: the recipe emulates Nemotron-Cascade-2 (NVIDIA; arXiv 2603.19220) — method emulation only, not a redistribution of NVIDIA's pipeline or weights.

Training-data provenance. Every dataset in the lineage is attribution-only and commercial-OK; the OML-licensed 593 GB Nemotron SFT corpus was deliberately not used, so no OML obligation attaches.

Table with columns: Stage, Dataset(s), License
Stage	Dataset(s)	License
SFT cold-start (~10k `<think>` traces; ~6k math + ~4k code)	`open-thoughts/OpenThoughts-114k` + `open-r1/OpenR1-Math-220k`	Apache-2.0 (both)
Math RLVR prompts	`nvidia/AceReason-Math` (← NuminaMath-1.5 + DeepScaleR-Preview)	CC-BY-4.0

The SFT traces are DeepSeek-R1-distilled (via the two open datasets above); DeepSeek-R1 is MIT-licensed and expressly permits distillation, and both datasets relicense their traces under Apache-2.0 — disclosed for transparency; no extra obligation attaches. Full attributions are reproduced in the repo NOTICE file.

Intended use & limitations

Intended use: local/homelab reasoning + vision-language + agentic/tool use; a re-quantizable BF16 master for building deployment variants.
Not production-evaluated beyond the light benchmark above — validate for your use case.
Visual grounding can erode silently under heavy text-reasoning RL even with the vision tower frozen (grounding lives in LM weights); evaluate vision before relying on it.
MTP acceptance is empirical: the draft head is the verbatim base head, so accepted-length should be re-measured on your serving stack (fusion-index is RESOLVED: single-final-hidden NEXTN, --fusion final).
Inherits all base-model limitations (hallucination, bias, knowledge cutoff).

Evaluation

Benchmarking was time-gated for this release. We recommend running full benchmarks for a thorough evaluation.

Provenance

Cascade-style post-training, MTP-head graft, and packaging by natfii via the qwen-cascade pipeline (single GB10 / DGX Spark, SM121). The NVFP4-MTP deployment repo is re-quantized from this master with the BF16-head invariant.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

natfii

Model Tree

Base

Qwen/Qwen3.6-27B

Fine-tuned

this model

Input Modalities

Text

Image

Video

Output Modalities

Text

Supported Functionality

Dedicated Endpoints

Explore FriendliAI today

Get started Talk to an engineer

README

License: apache-2.0

The two-repo pattern

Table with columns: Repo, Artifact, For
Repo	Artifact	For
`natfii/Qwen3.6-27B-VLM-Cascade` (this one)	BF16 master + base `mtp.*` draft head	Re-quantizing to any format (NVFP4 / FP8 / AWQ / GGUF…), further fine-tuning, BF16 inference, the QAD/distill teacher
`natfii/Qwen3.6-27B-VLM-Cascade-NVFP4-MTP`	NVFP4 body + BF16 `lm_head` + BF16 MTP head	Drop-in GB10 / DGX Spark deployment build (vLLM NEXTN spec-decode)

Lineage

Table

Base	`Qwen/Qwen3.6-27B` (VLM, image-text-to-text), apache-2.0
Post-training	Cascade-style: reasoning SFT → sequential RLVR + MOPD self-distillation, vision tower frozen
Precision	BF16 throughout (this is the master; not quantized)
MTP draft head	1-layer `qwen3_5_mtp` head (verbatim base head, kept BF16)

Architecture (from `config.json`)

27B params, hybrid attention: 16 full-attention + 48 linear-attention layers (full_attention_interval=4), hidden_size=5120, num_hidden_layers=64. The layer_types list places full attention at indices 3, 7, 11, …, 63; the other 48 are GatedDeltaNet (linear-attention) blocks with a constant-size recurrent state (context-length independent).
Full attention: 24 query / 4 KV heads, head_dim=256 (GQA).
Vision tower (model.visual.*) in BF16; frozen during all post-training. Skip at serve time for text-only workloads if your runtime supports it.
MTP: 1 draft-head layer (mtp_num_hidden_layers=1, mtp_use_dedicated_embeddings=False) — fuses [previous-token embedding ; target hidden state] through a small FC, runs one decoder block, and reuses lm_head. Here the head is the verbatim base draft head, kept BF16.
.

The MTP head

Fusion: the head uses single-final-hidden NEXTN (--fusion final), not EAGLE-3 multi-layer fusion.

Reasoning modes

Instruct (default): adjacent empty <think></think>; no visible reasoning trace.
Thinking (opt-in): pass chat_template_kwargs={"enable_thinking": true} (or put <|think_on|> in the system message); generation then begins <think>\n and the model reasons before answering. <|think_off|> / enable_thinking=false forces it off.
Termination handoff (thinking mode only): the template appends a brief reasoning→answer instruction to the system prompt (reason fully, verify, then close </think> and answer; don't re-confirm settled work) — curbs runaway re-verification loops; not applied in instruct mode or when tools are passed.

Usage (BF16, transformers)

python
# Qwen3.6 VLM loads as Qwen3_5ForConditionalGeneration; AutoModelForImageTextToText
# with trust_remote_code is the portable fallback.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "natfii/Qwen3.6-27B-VLM-Cascade"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype="bfloat16", device_map="auto", trust_remote_code=True
)
# Thinking is OFF by default (empty "<think></think>"); pass
# apply_chat_template(..., enable_thinking=True) to get the reasoning trace.

Re-quantizing this master (e.g. to NVFP4 for GB10)

License, attribution & data provenance

Non-binding note. This is a personal homelab project, provided as-is with no warranty or support and not commercially maintained. This is courtesy context only — it does not add any restriction to the Apache-2.0 grant.

Attribution.

Cascade-style post-training, MTP-head graft + re-align, and packaging by natfii.
Method attribution: the recipe emulates Nemotron-Cascade-2 (NVIDIA; arXiv 2603.19220) — method emulation only, not a redistribution of NVIDIA's pipeline or weights.

Table with columns: Stage, Dataset(s), License
Stage	Dataset(s)	License
SFT cold-start (~10k `<think>` traces; ~6k math + ~4k code)	`open-thoughts/OpenThoughts-114k` + `open-r1/OpenR1-Math-220k`	Apache-2.0 (both)
Math RLVR prompts	`nvidia/AceReason-Math` (← NuminaMath-1.5 + DeepScaleR-Preview)	CC-BY-4.0

Intended use & limitations

Intended use: local/homelab reasoning + vision-language + agentic/tool use; a re-quantizable BF16 master for building deployment variants.
Not production-evaluated beyond the light benchmark above — validate for your use case.
Visual grounding can erode silently under heavy text-reasoning RL even with the vision tower frozen (grounding lives in LM weights); evaluate vision before relying on it.
MTP acceptance is empirical: the draft head is the verbatim base head, so accepted-length should be re-measured on your serving stack (fusion-index is RESOLVED: single-final-hidden NEXTN, --fusion final).
Inherits all base-model limitations (hallucination, bias, knowledge cutoff).

Evaluation

Benchmarking was time-gated for this release. We recommend running full benchmarks for a thorough evaluation.

Qwen3.6-27B-VLM-Cascade

README

The two-repo pattern

Lineage

Architecture (from config.json)

The MTP head

Reasoning modes

Usage (BF16, transformers)

Re-quantizing this master (e.g. to NVFP4 for GB10)

License, attribution & data provenance

Intended use & limitations

Evaluation

Provenance

Explore FriendliAI today

README

The two-repo pattern

Lineage

Architecture (from config.json)

The MTP head

Reasoning modes

Usage (BF16, transformers)

Re-quantizing this master (e.g. to NVFP4 for GB10)

License, attribution & data provenance

Intended use & limitations

Evaluation

Provenance

Architecture (from `config.json`)

Architecture (from `config.json`)