Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Lineage
| Base | Qwen/Qwen3.6-27B (VLM, image-text-to-text) |
| Post-training | Cascade-style: reasoning SFT → RL (RLVR + on-policy distillation), vision frozen |
| Quantization | NVFP4 body via nvidia-modelopt; lm_head + MTP head + vision tower kept BF16 |
| Speculative decoding | qwen3_5_mtp 1-layer draft head (verbatim base head, kept BF16) |
Architecture (from config.json)
- 27B params, hybrid attention: 16 full-attention + 48 linear-attention
layers (
full_attention_interval=4),hidden_size=5120,num_hidden_layers=64. - Full attention: 24 query / 4 KV heads,
head_dim=256(GQA). - Linear attention: 16 key / 48 value heads, head_dim 128, conv kernel 4 — constant-size recurrent state (context-length independent).
- Vision tower (
model.visual.*) retained in BF16; skip at serve time with--language-model-only/ENABLE_VISION=0. - MTP: 1 draft-head layer (
mtp_num_hidden_layers=1), BF16. vocab_size=248320.
Quantization — the BF16-head invariant
NVFP4 (packed uint8 weights + per-block float8_e4m3 scales + per-tensor
float32 scales) on the body only. Quant never touches:
lm_head.weight— final logits stay BF16.mtp.*(15 tensors) — draft-verification path stays BF16.model.visual.*— vision tower stays BF16.linear_attn.conv1d(a redundant non-Linear ignore). Note:linear_attn.in_proj_*andout_projARE NVFP4-quantized — they are not kept BF16. (re-verify in_proj againsthf_quant_config.jsonat S4 build.)
quantization_config.ignore lists 4 glob patterns (*model.visual*,
*linear_attn.conv1d*, *lm_head*, *mtp*) — it does not preserve in_proj. Keeping the
output and draft heads out of FP4 is what protects both answer quality and
speculative acceptance — the quant's edge over blanket-quantized builds.
Re-quant + MTP re-graft procedure (pipeline S4–S5)
- Quant config excludes vision + MTP:
quant_cfg["*visual*"]={"enable":False},quant_cfg["*mtp*"]={"enable":False}, plus*lm_head*,*linear_attn.conv1d*. - Calibrate (e.g. 20 samples @
max_seq_len=8192), export viamodelopt.torch.export.export_hf_checkpoint(). - Graft the verbatim base
mtp.*head back in BF16 (additive, kept out of the FP4 body) — it verifies against the quantized target at serve time. - Patch
config.jsonto list the BF16-preserved modules inquantization_config.ignore.
Reasoning modes
ChatML with toggleable thinking, à la Cascade. Thinking is off by default — without
enable_thinking the template emits an empty <think></think> and the model answers directly.
- Instruct (default): adjacent empty
<think></think>; no visible reasoning trace. - Thinking (opt-in): pass
chat_template_kwargs={"enable_thinking": true}(or<|think_on|>in the system message); generation then begins<think>. - Termination handoff (thinking mode only): the template appends a brief reasoning→answer
instruction to the system prompt (reason fully, verify, then close
</think>and answer; don't re-confirm settled work) — curbs the runaway re-verification loops; not applied in instruct mode or when tools are passed (the tool path has its own handoff).
Recommended sampling: temperature=0.7, top_p=0.95, top_k=20, repetition_penalty=1.1 — and never greedy
(temperature=0 loops). The Cascade-2 paper uses 1.0 for its avg@k evals, but at 1.0
this model rambles (9k–60k-token traces) in single-sample use; 0.7 (top of DeepSeek-R1's
0.5–0.7 band) is the deployment recommendation. The repetition_penalty=1.1 curbs the
re-verification loops this model is prone to in thinking mode — it lets the model close
</think> and answer (clean termination, no measured accuracy loss on math checks);
lowering temperature does not help (it deepens the loop).
Serving (vLLM, NVFP4 + MTP)
Edit for your use. Agentic workflows require more memory.
bash
# REQUIRED on GB10: the auto-selected FlashInfer NVFP4 GEMM leaks a ~394 MB non-torch# workspace per linear layer (~100 GiB during profile_run) → fills the unified pool →# hard reboot. CUTLASS uses a torch-managed workspace — no leak.export VLLM_NVFP4_GEMM_BACKEND=cutlassexport TORCH_CUDA_ARCH_LIST=12.1 # sm_121vllm serve /path/to/qwen36-vlm-cascade-nvfp4-mtp \--served-model-name qwen3.6-27b-vlm-cascade-nvfp4-mtp \--host 0.0.0.0 --port 8002 \--quantization modelopt_fp4 \--max-model-len 131072 \--gpu-memory-utilization 0.7 \--max-num-seqs 64 \--max-num-batched-tokens 8192 \--kv-cache-dtype fp8 \--attention-backend TRITON_ATTN \--enable-chunked-prefill \--language-model-only \--speculative-config '{"method": "mtp", "num_speculative_tokens": 2}' \--compilation-config '{"cudagraph_mode": "PIECEWISE"}' \--trust-remote-code
--language-model-onlyloads only the language model — the text benchmarks skip the vision tower. Omit it to serve the full vision-language model.- Keep
TRITON_ATTNmain attention andPIECEWISEcuda graphs on this hybrid mamba / full-attention architecture.--gpu-memory-utilization 0.8trips on the inference-time Triton JIT of the spec/GDN decode kernels (non-torch headroom, not KV, is the limit at this util) — keep0.7and run a MemAvailable watchdog as a guard. - MTP/NEXTN spec-decode (
num_speculative_tokens: 2) is lossless — pure decode speedup, identical outputs (~80% draft acceptance measured). - Add
--reasoning-parser qwen3to split<think>traces out ofcontent— exposed asmessage.reasoningon vLLM 0.22.0 (SGLang usesreasoning_content); left off here so the trace reads in full. Required if you usethinking_token_budget(below).
Thinking is off by default (see Reasoning modes): pass
chat_template_kwargs={"enable_thinking": true} per request to enable reasoning, or put
<|think_on|> in the system message (<|think_off|> / enable_thinking=false forces it
off). This model reasons at length, so enabling thinking under a small max_tokens can
return an only-reasoning, truncated reply — budget accordingly, or hard-cap it: pass
thinking_token_budget=N (vLLM sampling param; requires --reasoning-parser qwen3) to
force-close </think> after N reasoning tokens. Set it generously (~3000–4000 — genuine
hard problems use ~2800) so it only catches runaway loops, not legitimate reasoning. (SGLang:
--enable-strict-thinking + per-request custom_params={"thinking_budget": N}.) The template ships
Qwen-native XML tool calling (<tool_call><function=…>) — add
--enable-auto-tool-choice --tool-call-parser qwen3_xml to the serve command to enable it.
Performance (GB10)
Decode is memory-bandwidth bound (~273 GB/s unified memory). Measured on the serve
above (NVFP4 body, fp8 KV, MTP num_speculative_tokens: 2, 131072 context):
- Single stream: ~16 tok/s.
- 64-way concurrent: ~400–490 tok/s aggregate — the GB10 throughput ceiling
(raising
--max-num-seqspast 64 is flat, ~+2%). - MTP/NEXTN spec-decode: ~2.6 mean accepted tokens per step (~80% draft acceptance) — a lossless decode speedup, not a quality change.
Evaluation
Benchmarking was time-gated for this release. We recommend running full benchmarks for a thorough evaluation.
Intended use & limitations
- Use: local reasoning + vision-language + agentic/tool use on GB10.
- Not production-evaluated beyond the light benchmark above — validate for your use case.
- Heavy text-reasoning RL can erode visual grounding even with the vision tower frozen; evaluate vision before relying on it.
- License: Apache-2.0 with attribution — see License, attribution & data provenance below. All training-data licenses are attribution-only and commercial-OK.
The two-repo pattern
| Repo | Artifact | For |
|---|---|---|
natfii/Qwen3.6-27B-VLM-Cascade | BF16 master + base mtp.* draft head | Re-quantizing to any format (NVFP4 / FP8 / AWQ / GGUF…), further fine-tuning, BF16 inference, the QAD/distill teacher |
natfii/Qwen3.6-27B-VLM-Cascade-NVFP4-MTP (this one) | NVFP4 body + BF16 lm_head + BF16 MTP head | Drop-in GB10 / DGX Spark deployment build (vLLM NEXTN spec-decode) |
License, attribution & data provenance
License — Apache-2.0. This NVFP4 deployment build is a derivative of
Qwen/Qwen3.6-27B (released under
Apache-2.0) and is itself published under Apache-2.0. You may use it
commercially or non-commercially, provided you retain the LICENSE and NOTICE
files and the attributions below. The full-precision BF16 master (the
re-quantization source) is at
natfii/Qwen3.6-27B-VLM-Cascade.
Non-binding note. This is a personal homelab project, provided as-is with no warranty or support and not commercially maintained. This is courtesy context only — it does not add any restriction to the Apache-2.0 grant.
Attribution.
- Base model
Qwen/Qwen3.6-27B© Alibaba Cloud / the Qwen team — Apache-2.0. - Cascade-style post-training, NVFP4 quantization, and MTP-head graft + re-align,
packaged by
natfii. - Method attribution: the recipe emulates Nemotron-Cascade-2 (NVIDIA; arXiv 2603.19220) — method emulation only, not a redistribution of NVIDIA's pipeline or weights.
Training-data provenance. Every dataset in the lineage is attribution-only and commercial-OK; the OML-licensed 593 GB Nemotron SFT corpus was deliberately not used, so no OML obligation attaches.
| Stage | Dataset(s) | License |
|---|---|---|
SFT cold-start (~10k <think> traces; ~6k math + ~4k code) | open-thoughts/OpenThoughts-114k + open-r1/OpenR1-Math-220k | Apache-2.0 (both) |
| Math RLVR prompts | nvidia/AceReason-Math (← NuminaMath-1.5 + DeepScaleR-Preview) | CC-BY-4.0 |
| IF-RL / MOPD / multi-domain prompts + verifiers | nvidia/Nemotron-Cascade-2-RL-data | ODC-BY-1.0 |
| MOPD + MTP-head self-distillation | the model's own frozen checkpoint (no third-party teacher) | — |
The SFT traces are DeepSeek-R1-distilled (via the two open datasets above);
DeepSeek-R1 is MIT-licensed and expressly permits distillation, and both datasets
relicense their traces under Apache-2.0 — disclosed for transparency; no extra
obligation attaches. Full attributions are reproduced in the repo NOTICE file.
Provenance
Base quant + MTP graft by natfii (lna-lab NVFP4-SM120 recipe). Cascade-style
post-training + re-quant + MTP re-graft via the qwen-cascade pipeline.
Model provider
natfii
Model tree
Base
Qwen/Qwen3.6-27B
Fine-tuned
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information