Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Lineage

BaseQwen/Qwen3.6-27B (VLM, image-text-to-text)
Post-trainingCascade-style: reasoning SFT → RL (RLVR + on-policy distillation), vision frozen
QuantizationNVFP4 body via nvidia-modelopt; lm_head + MTP head + vision tower kept BF16
Speculative decodingqwen3_5_mtp 1-layer draft head (verbatim base head, kept BF16)

Architecture (from config.json)

  • 27B params, hybrid attention: 16 full-attention + 48 linear-attention layers (full_attention_interval=4), hidden_size=5120, num_hidden_layers=64.
  • Full attention: 24 query / 4 KV heads, head_dim=256 (GQA).
  • Linear attention: 16 key / 48 value heads, head_dim 128, conv kernel 4 — constant-size recurrent state (context-length independent).
  • Vision tower (model.visual.*) retained in BF16; skip at serve time with --language-model-only / ENABLE_VISION=0.
  • MTP: 1 draft-head layer (mtp_num_hidden_layers=1), BF16.
  • vocab_size=248320.

Quantization — the BF16-head invariant

NVFP4 (packed uint8 weights + per-block float8_e4m3 scales + per-tensor float32 scales) on the body only. Quant never touches:

  • lm_head.weight — final logits stay BF16.
  • mtp.* (15 tensors) — draft-verification path stays BF16.
  • model.visual.* — vision tower stays BF16.
  • linear_attn.conv1d (a redundant non-Linear ignore). Note: linear_attn.in_proj_* and out_proj ARE NVFP4-quantized — they are not kept BF16. (re-verify in_proj against hf_quant_config.json at S4 build.)

quantization_config.ignore lists 4 glob patterns (*model.visual*, *linear_attn.conv1d*, *lm_head*, *mtp*) — it does not preserve in_proj. Keeping the output and draft heads out of FP4 is what protects both answer quality and speculative acceptance — the quant's edge over blanket-quantized builds.

Re-quant + MTP re-graft procedure (pipeline S4–S5)

  1. Quant config excludes vision + MTP: quant_cfg["*visual*"]={"enable":False}, quant_cfg["*mtp*"]={"enable":False}, plus *lm_head*, *linear_attn.conv1d*.
  2. Calibrate (e.g. 20 samples @ max_seq_len=8192), export via modelopt.torch.export.export_hf_checkpoint().
  3. Graft the verbatim base mtp.* head back in BF16 (additive, kept out of the FP4 body) — it verifies against the quantized target at serve time.
  4. Patch config.json to list the BF16-preserved modules in quantization_config.ignore.

Reasoning modes

ChatML with toggleable thinking, à la Cascade. Thinking is off by default — without enable_thinking the template emits an empty <think></think> and the model answers directly.

  • Instruct (default): adjacent empty <think></think>; no visible reasoning trace.
  • Thinking (opt-in): pass chat_template_kwargs={"enable_thinking": true} (or <|think_on|> in the system message); generation then begins <think>.
  • Termination handoff (thinking mode only): the template appends a brief reasoning→answer instruction to the system prompt (reason fully, verify, then close </think> and answer; don't re-confirm settled work) — curbs the runaway re-verification loops; not applied in instruct mode or when tools are passed (the tool path has its own handoff).

Recommended sampling: temperature=0.7, top_p=0.95, top_k=20, repetition_penalty=1.1 — and never greedy (temperature=0 loops). The Cascade-2 paper uses 1.0 for its avg@k evals, but at 1.0 this model rambles (9k–60k-token traces) in single-sample use; 0.7 (top of DeepSeek-R1's 0.5–0.7 band) is the deployment recommendation. The repetition_penalty=1.1 curbs the re-verification loops this model is prone to in thinking mode — it lets the model close </think> and answer (clean termination, no measured accuracy loss on math checks); lowering temperature does not help (it deepens the loop).

Serving (vLLM, NVFP4 + MTP)

Edit for your use. Agentic workflows require more memory.

bash

# REQUIRED on GB10: the auto-selected FlashInfer NVFP4 GEMM leaks a ~394 MB non-torch
# workspace per linear layer (~100 GiB during profile_run) → fills the unified pool →
# hard reboot. CUTLASS uses a torch-managed workspace — no leak.
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export TORCH_CUDA_ARCH_LIST=12.1 # sm_121
vllm serve /path/to/qwen36-vlm-cascade-nvfp4-mtp \
--served-model-name qwen3.6-27b-vlm-cascade-nvfp4-mtp \
--host 0.0.0.0 --port 8002 \
--quantization modelopt_fp4 \
--max-model-len 131072 \
--gpu-memory-utilization 0.7 \
--max-num-seqs 64 \
--max-num-batched-tokens 8192 \
--kv-cache-dtype fp8 \
--attention-backend TRITON_ATTN \
--enable-chunked-prefill \
--language-model-only \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 2}' \
--compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
--trust-remote-code
  • --language-model-only loads only the language model — the text benchmarks skip the vision tower. Omit it to serve the full vision-language model.
  • Keep TRITON_ATTN main attention and PIECEWISE cuda graphs on this hybrid mamba / full-attention architecture. --gpu-memory-utilization 0.8 trips on the inference-time Triton JIT of the spec/GDN decode kernels (non-torch headroom, not KV, is the limit at this util) — keep 0.7 and run a MemAvailable watchdog as a guard.
  • MTP/NEXTN spec-decode (num_speculative_tokens: 2) is lossless — pure decode speedup, identical outputs (~80% draft acceptance measured).
  • Add --reasoning-parser qwen3 to split <think> traces out of content — exposed as message.reasoning on vLLM 0.22.0 (SGLang uses reasoning_content); left off here so the trace reads in full. Required if you use thinking_token_budget (below).

Thinking is off by default (see Reasoning modes): pass chat_template_kwargs={"enable_thinking": true} per request to enable reasoning, or put <|think_on|> in the system message (<|think_off|> / enable_thinking=false forces it off). This model reasons at length, so enabling thinking under a small max_tokens can return an only-reasoning, truncated reply — budget accordingly, or hard-cap it: pass thinking_token_budget=N (vLLM sampling param; requires --reasoning-parser qwen3) to force-close </think> after N reasoning tokens. Set it generously (~3000–4000 — genuine hard problems use ~2800) so it only catches runaway loops, not legitimate reasoning. (SGLang: --enable-strict-thinking + per-request custom_params={"thinking_budget": N}.) The template ships Qwen-native XML tool calling (<tool_call><function=…>) — add --enable-auto-tool-choice --tool-call-parser qwen3_xml to the serve command to enable it.

Performance (GB10)

Decode is memory-bandwidth bound (~273 GB/s unified memory). Measured on the serve above (NVFP4 body, fp8 KV, MTP num_speculative_tokens: 2, 131072 context):

  • Single stream: ~16 tok/s.
  • 64-way concurrent: ~400–490 tok/s aggregate — the GB10 throughput ceiling (raising --max-num-seqs past 64 is flat, ~+2%).
  • MTP/NEXTN spec-decode: ~2.6 mean accepted tokens per step (~80% draft acceptance) — a lossless decode speedup, not a quality change.

Evaluation

Benchmarking was time-gated for this release. We recommend running full benchmarks for a thorough evaluation.

Intended use & limitations

  • Use: local reasoning + vision-language + agentic/tool use on GB10.
  • Not production-evaluated beyond the light benchmark above — validate for your use case.
  • Heavy text-reasoning RL can erode visual grounding even with the vision tower frozen; evaluate vision before relying on it.
  • License: Apache-2.0 with attribution — see License, attribution & data provenance below. All training-data licenses are attribution-only and commercial-OK.

The two-repo pattern

RepoArtifactFor
natfii/Qwen3.6-27B-VLM-CascadeBF16 master + base mtp.* draft headRe-quantizing to any format (NVFP4 / FP8 / AWQ / GGUF…), further fine-tuning, BF16 inference, the QAD/distill teacher
natfii/Qwen3.6-27B-VLM-Cascade-NVFP4-MTP (this one)NVFP4 body + BF16 lm_head + BF16 MTP headDrop-in GB10 / DGX Spark deployment build (vLLM NEXTN spec-decode)

License, attribution & data provenance

License — Apache-2.0. This NVFP4 deployment build is a derivative of Qwen/Qwen3.6-27B (released under Apache-2.0) and is itself published under Apache-2.0. You may use it commercially or non-commercially, provided you retain the LICENSE and NOTICE files and the attributions below. The full-precision BF16 master (the re-quantization source) is at natfii/Qwen3.6-27B-VLM-Cascade.

Non-binding note. This is a personal homelab project, provided as-is with no warranty or support and not commercially maintained. This is courtesy context only — it does not add any restriction to the Apache-2.0 grant.

Attribution.

  • Base model Qwen/Qwen3.6-27B © Alibaba Cloud / the Qwen team — Apache-2.0.
  • Cascade-style post-training, NVFP4 quantization, and MTP-head graft + re-align, packaged by natfii.
  • Method attribution: the recipe emulates Nemotron-Cascade-2 (NVIDIA; arXiv 2603.19220) — method emulation only, not a redistribution of NVIDIA's pipeline or weights.

Training-data provenance. Every dataset in the lineage is attribution-only and commercial-OK; the OML-licensed 593 GB Nemotron SFT corpus was deliberately not used, so no OML obligation attaches.

StageDataset(s)License
SFT cold-start (~10k <think> traces; ~6k math + ~4k code)open-thoughts/OpenThoughts-114k + open-r1/OpenR1-Math-220kApache-2.0 (both)
Math RLVR promptsnvidia/AceReason-Math (← NuminaMath-1.5 + DeepScaleR-Preview)CC-BY-4.0
IF-RL / MOPD / multi-domain prompts + verifiersnvidia/Nemotron-Cascade-2-RL-dataODC-BY-1.0
MOPD + MTP-head self-distillationthe model's own frozen checkpoint (no third-party teacher)

The SFT traces are DeepSeek-R1-distilled (via the two open datasets above); DeepSeek-R1 is MIT-licensed and expressly permits distillation, and both datasets relicense their traces under Apache-2.0 — disclosed for transparency; no extra obligation attaches. Full attributions are reproduced in the repo NOTICE file.

Provenance

Base quant + MTP graft by natfii (lna-lab NVFP4-SM120 recipe). Cascade-style post-training + re-quant + MTP re-graft via the qwen-cascade pipeline.

Model provider

natfii

Model tree

Base

Qwen/Qwen3.6-27B

Fine-tuned

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today