Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0The two-repo pattern
| Repo | Artifact | For |
|---|---|---|
natfii/Qwen3.6-27B-VLM-Cascade (this one) | BF16 master + base mtp.* draft head | Re-quantizing to any format (NVFP4 / FP8 / AWQ / GGUF…), further fine-tuning, BF16 inference, the QAD/distill teacher |
natfii/Qwen3.6-27B-VLM-Cascade-NVFP4-MTP | NVFP4 body + BF16 lm_head + BF16 MTP head | Drop-in GB10 / DGX Spark deployment build (vLLM NEXTN spec-decode) |
Lineage
| Base | Qwen/Qwen3.6-27B (VLM, image-text-to-text), apache-2.0 |
| Post-training | Cascade-style: reasoning SFT → sequential RLVR + MOPD self-distillation, vision tower frozen |
| Precision | BF16 throughout (this is the master; not quantized) |
| MTP draft head | 1-layer qwen3_5_mtp head (verbatim base head, kept BF16) |
Architecture (from config.json)
- 27B params, hybrid attention: 16 full-attention + 48 linear-attention
layers (
full_attention_interval=4),hidden_size=5120,num_hidden_layers=64. Thelayer_typeslist places full attention at indices 3, 7, 11, …, 63; the other 48 are GatedDeltaNet (linear-attention) blocks with a constant-size recurrent state (context-length independent). - Full attention: 24 query / 4 KV heads,
head_dim=256(GQA). - Vision tower (
model.visual.*) in BF16; frozen during all post-training. Skip at serve time for text-only workloads if your runtime supports it. - MTP: 1 draft-head layer (
mtp_num_hidden_layers=1,mtp_use_dedicated_embeddings=False) — fuses [previous-token embedding ; target hidden state] through a small FC, runs one decoder block, and reuseslm_head. Here the head is the verbatim base draft head, kept BF16. vocab_size=248320.
The MTP head
This repo ships the verbatim base qwen3_5_mtp draft head — the original
1-layer head, kept BF16, grafted additively onto the post-trained body for NEXTN
speculative decoding. Spec-decode is lossless (the draft head only affects
decode speed, never the output), so the base head is a safe default; re-measure
accepted length on your serving stack, and optionally re-align the head to this
target if you want higher acceptance.
Fusion: the head uses single-final-hidden NEXTN (
--fusion final), not EAGLE-3 multi-layer fusion.
Reasoning modes
ChatML with toggleable thinking, à la Cascade. Thinking is off by default — when
a request does not set enable_thinking, the template emits an empty <think></think>
and the model answers directly.
- Instruct (default): adjacent empty
<think></think>; no visible reasoning trace. - Thinking (opt-in): pass
chat_template_kwargs={"enable_thinking": true}(or put<|think_on|>in the system message); generation then begins<think>\nand the model reasons before answering.<|think_off|>/enable_thinking=falseforces it off. - Termination handoff (thinking mode only): the template appends a brief reasoning→answer
instruction to the system prompt (reason fully, verify, then close
</think>and answer; don't re-confirm settled work) — curbs runaway re-verification loops; not applied in instruct mode or when tools are passed.
This model reasons at length, so enabling thinking under a small max_tokens can
return an only-reasoning, truncated reply — budget the completion accordingly. When serving via
vLLM or SGLang you can hard-cap the thinking: vLLM thinking_token_budget=N (needs
--reasoning-parser qwen3), or SGLang --enable-strict-thinking + custom_params={"thinking_budget": N},
force-close </think> after N reasoning tokens — set it generously (~3000–4000; genuine hard
problems use ~2800) so it only catches runaway loops.
Recommended sampling: temperature=0.7, top_p=0.95, top_k=20, repetition_penalty=1.1 — and never greedy
(temperature=0 loops; at 1.0 it rambles — the paper's 1.0 is for avg@k eval only). The
repetition_penalty=1.1 curbs the re-verification loops this model is prone to in thinking
mode — it lets the model close </think> and answer (clean termination, no measured accuracy
loss); lowering temperature does not help (it deepens the loop).
To split the <think> trace into a separate reasoning channel, use your runtime's qwen3
reasoning parser (the separated trace is message.reasoning on vLLM 0.22.0, reasoning_content
on SGLang).
Usage (BF16, transformers)
python
# Qwen3.6 VLM loads as Qwen3_5ForConditionalGeneration; AutoModelForImageTextToText# with trust_remote_code is the portable fallback.from transformers import AutoProcessor, AutoModelForImageTextToTextmodel_id = "natfii/Qwen3.6-27B-VLM-Cascade"processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="bfloat16", device_map="auto", trust_remote_code=True)# Thinking is OFF by default (empty "<think></think>"); pass# apply_chat_template(..., enable_thinking=True) to get the reasoning trace.
Spec-decode / NEXTN: the BF16 mtp.* head is present and aligned to this
BF16 target, so runtimes that support the qwen3_5_mtp / NEXTN draft method can
speculate directly against this repo. (For a turnkey, memory-bandwidth-friendly
GB10 deployment, prefer the NVFP4-MTP repo.)
Re-quantizing this master (e.g. to NVFP4 for GB10)
This BF16 master is the source the NVFP4-MTP deployment build is made from. To
reproduce that build, re-quant with nvidia-modelopt and keep the
BF16-head invariant ignore-list byte-for-byte (pipeline S4): exclude
*model.visual*, *linear_attn.conv1d*, *lm_head*, and *mtp*
from NVFP4 (note: linear_attn.in_proj_* and out_proj ARE NVFP4-quantized —
re-verify in_proj against hf_quant_config.json at S4 build), and keep the
KV-cache FP8 setting identical. Keeping the output and
draft heads out of FP4 is what protects both answer quality and speculative
acceptance. Graft the mtp.* head into the quantized export (kept BF16, out of the
FP4 body); the base head transfers, but re-measure accepted length and optionally
re-align it to the quantized target for higher acceptance.
License, attribution & data provenance
License — Apache-2.0. This model is a derivative of
Qwen/Qwen3.6-27B (released under
Apache-2.0) and is itself published under Apache-2.0. You may use it
commercially or non-commercially, provided you retain the LICENSE and NOTICE
files and the attributions below.
Non-binding note. This is a personal homelab project, provided as-is with no warranty or support and not commercially maintained. This is courtesy context only — it does not add any restriction to the Apache-2.0 grant.
Attribution.
- Base model
Qwen/Qwen3.6-27B© Alibaba Cloud / the Qwen team — Apache-2.0. - Cascade-style post-training, MTP-head graft + re-align, and packaging by
natfii. - Method attribution: the recipe emulates Nemotron-Cascade-2 (NVIDIA; arXiv 2603.19220) — method emulation only, not a redistribution of NVIDIA's pipeline or weights.
Training-data provenance. Every dataset in the lineage is attribution-only and commercial-OK; the OML-licensed 593 GB Nemotron SFT corpus was deliberately not used, so no OML obligation attaches.
| Stage | Dataset(s) | License |
|---|---|---|
SFT cold-start (~10k <think> traces; ~6k math + ~4k code) | open-thoughts/OpenThoughts-114k + open-r1/OpenR1-Math-220k | Apache-2.0 (both) |
| Math RLVR prompts | nvidia/AceReason-Math (← NuminaMath-1.5 + DeepScaleR-Preview) | CC-BY-4.0 |
| IF-RL / MOPD / multi-domain prompts + verifiers | nvidia/Nemotron-Cascade-2-RL-data | ODC-BY-1.0 |
| MOPD + MTP-head self-distillation | the model's own frozen checkpoint (no third-party teacher) | — |
The SFT traces are DeepSeek-R1-distilled (via the two open datasets above);
DeepSeek-R1 is MIT-licensed and expressly permits distillation, and both datasets
relicense their traces under Apache-2.0 — disclosed for transparency; no extra
obligation attaches. Full attributions are reproduced in the repo NOTICE file.
Intended use & limitations
- Intended use: local/homelab reasoning + vision-language + agentic/tool use; a re-quantizable BF16 master for building deployment variants.
- Not production-evaluated beyond the light benchmark above — validate for your use case.
- Visual grounding can erode silently under heavy text-reasoning RL even with the vision tower frozen (grounding lives in LM weights); evaluate vision before relying on it.
- MTP acceptance is empirical: the draft head is the verbatim base head, so
accepted-length should be re-measured on your serving stack (fusion-index is
RESOLVED: single-final-hidden NEXTN,
--fusion final). - Inherits all base-model limitations (hallucination, bias, knowledge cutoff).
Evaluation
Benchmarking was time-gated for this release. We recommend running full benchmarks for a thorough evaluation.
Provenance
Cascade-style post-training, MTP-head graft, and packaging by natfii
via the qwen-cascade pipeline (single GB10 / DGX Spark,
SM121). The NVFP4-MTP deployment repo is re-quantized from this master with the
BF16-head invariant.
Model provider
natfii
Model tree
Base
Qwen/Qwen3.6-27B
Fine-tuned
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information