maci0

Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4


License: apache-2.0

About the base model

A 40B dense (not MoE) vision-language model expanded from Qwen3.6-27B, made uncensored via Heretic, trained on the internal Deckard/PKD datasets (character, depth, point of view) and on a Claude 4.6 Opus high-reasoning distillation set to sharpen and stabilize reasoning.

  • 96 decoder layers: hybrid gated delta-net linear attention (72) plus full attention (24), dense MLP, plus a vision tower for image and video input.
  • 256K context (max_position_embeddings 262144).
  • Thinking mode by default (variable-length reasoning), with an instruct toggle.

Quantization

| Setting | Value |
|---|---|
| Scheme | NVFP4, W4A4 |
| Weight rounding | GPTQ (Hessian-based error compensation), MSE observer |
| Weights | FP4 (E2M1), group_size=16, tensor_group, symmetric, FP8 (E4M3) group scales |
| Activations | FP4, dynamic per-group (dynamic: local), FP8 (E4M3) scales |
| Targets | all language-model Linear layers, 744 modules (360 linear-attn projections + 288 MLP + 96 full-attn) |
| Kept in bf16 | vision tower (model.visual.*), lm_head |
| Untouched | gated delta-net Conv1d and SSM params (A_log, dt_bias); these are not Linear modules, so they are never targeted |
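The totals in the table can be sanity-checked with a little arithmetic. The sketch below uses only figures from this card (layer counts, module counts, group size, scale width). The per-layer module counts are derived here by division and are not stated in the card, and the per-tensor scale implied by tensor_group is ignored as negligible.

```python
# Figures taken from the model description and the quantization table.
LINEAR_ATTN_LAYERS = 72
FULL_ATTN_LAYERS = 24
DECODER_LAYERS = LINEAR_ATTN_LAYERS + FULL_ATTN_LAYERS  # 96

TARGETS = {"linear_attn": 360, "mlp": 288, "full_attn": 96}

# Derived, not stated in the card: quantized Linear modules per layer.
per_layer = {
    "linear_attn": TARGETS["linear_attn"] // LINEAR_ATTN_LAYERS,
    "mlp": TARGETS["mlp"] // DECODER_LAYERS,
    "full_attn": TARGETS["full_attn"] // FULL_ATTN_LAYERS,
}

total_modules = sum(TARGETS.values())

# Each weight takes 4 bits, plus one 8-bit FP8 scale shared by a group of 16.
GROUP_SIZE = 16
bits_per_weight = 4 + 8 / GROUP_SIZE

print(total_modules, per_layer, bits_per_weight)
```

Under these assumptions the quantized Linear weights cost about 4.5 bits each, before counting the bf16 vision tower and lm_head.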

GPTQ is a quantization-time cost only. The output is the same nvfp4-pack-quantized format with identical inference speed; GPTQ just chooses better 4-bit values than plain round-to-nearest.
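For intuition about what "plain round-to-nearest" means here, the following is a minimal sketch of group-wise RTN onto the FP4 (E2M1) value grid with group_size=16. It is not the recipe used for this build: it omits GPTQ's error compensation, keeps the group scale as a Python float instead of FP8 (E4M3), and skips the per-tensor scale. The function names are illustrative only.

```python
# Magnitudes representable in FP4 E2M1 (a sign bit covers the negatives).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_rtn(values, group_size=16):
    """Round each group of weights to the nearest E2M1 value (plain RTN)."""
    if len(values) % group_size != 0:
        raise ValueError("length must be a multiple of group_size")
    groups = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        # Map the largest magnitude in the group onto the top grid value (6.0).
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        codes = []
        for v in group:
            mag = abs(v) / scale
            level = min(E2M1_GRID, key=lambda g: abs(mag - g))
            codes.append(-level if v < 0 else level)
        groups.append((scale, codes))
    return groups


def dequantize(groups):
    """Reconstruct approximate weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in groups for code in codes]


weights = [(i - 15.5) / 10 for i in range(32)]
restored = dequantize(quantize_rtn(weights))
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

GPTQ writes the same kind of (scale, code) data; it differs only in how the codes are chosen, which is why the packed format and inference speed do not change.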

Calibration

512 samples, domain-matched to the model's actual traffic, max_seq_len=2048, text-only path through the VL model:

| source | samples | domain |
|---|---|---|
| TeichAI/claude-4.5-opus-high-reasoning-250x | 250 | long reasoning (the base model's own training data) |
| HuggingFaceH4/ultrachat_200k | 150 | general chat |
| m-a-p/Code-Feedback | 112 | code |
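As a quick check of the mix, the sketch below tallies the sample counts and each source's share. The token figure is only an upper bound, assuming every sample fills max_seq_len, which real samples may not.

```python
# Sample counts per calibration source, as listed in the table above.
CALIBRATION_MIX = {
    "TeichAI/claude-4.5-opus-high-reasoning-250x": 250,
    "HuggingFaceH4/ultrachat_200k": 150,
    "m-a-p/Code-Feedback": 112,
}
MAX_SEQ_LEN = 2048

total_samples = sum(CALIBRATION_MIX.values())
shares = {
    name: round(100 * count / total_samples, 1)
    for name, count in CALIBRATION_MIX.items()
}
# Upper bound only: assumes every sample is truncated at MAX_SEQ_LEN.
max_calibration_tokens = total_samples * MAX_SEQ_LEN

print(total_samples, shares, max_calibration_tokens)
```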

Quality (wikitext-2, 65,504 tokens, vLLM prompt-logprobs)

| build | perplexity |
|---|---|
| plain RTN + minmax + generic calibration | 7.4773 |
| GPTQ + MSE + domain-matched calibration (this build) | 7.4111 |

About 0.9% lower perplexity on generic English. The gap is expected to be larger on the model's own domains (reasoning, creative, code), which wikitext does not cover.
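The relative gap follows directly from the two table values. The sketch below also shows the usual definition of perplexity from per-token log-probabilities, assuming natural-log values such as those returned by a prompt-logprobs run. The helper names are illustrative, and the exact evaluation script for this card is not shown here.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean log-probability over the scored tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_reduction_pct(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Values from the quality table: RTN baseline vs this build.
gap = relative_reduction_pct(7.4773, 7.4111)
print(f"{gap:.2f}% lower perplexity")  # roughly 0.9%
```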

Usage

Offline (vLLM)

NVFP4 activation acceleration needs a Blackwell-class GPU. The if __name__ == "__main__" guard is required for offline LLM(...) because the vLLM v1 engine spawns workers.

```python
from vllm import LLM, SamplingParams


def main():
    # Quantization settings are read from the checkpoint's config.json.
    llm = LLM(
        model="maci0/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4",
        max_model_len=16384,
    )
    msgs = [{"role": "user", "content": "Write the opening paragraph of a noir short story."}]
    # Sampling values for general thinking mode.
    sp = SamplingParams(temperature=1.0, top_p=0.95, top_k=20, max_tokens=2048)
    out = llm.chat(msgs, sp)
    print(out[0].outputs[0].text)


# Required: the vLLM v1 engine spawns worker processes.
if __name__ == "__main__":
    main()
```

Server (OpenAI-compatible)

Recommended baseline for a single Blackwell GPU. The NVFP4 quantization is auto-detected from config.json (compressed-tensors), so no quantization flag is needed. --reasoning-parser qwen3 splits the <think> block into a separate reasoning_content field.

```bash
vllm serve maci0/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4 \
  --served-model-name qwen3.6-40b-nvfp4 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder
```

These parser flags are not auto-detected; you must pass them explicitly. Drop the last line if you do not need tool calling; --enable-auto-tool-choice requires --tool-call-parser.
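With --reasoning-parser qwen3 enabled, a client can read the reasoning and the final answer from separate fields. The sketch below assumes the response has already been parsed into a dict in the OpenAI chat-completions shape. The helper name and the sample response are made up for illustration.

```python
def split_reasoning(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat-completions response dict."""
    message = response["choices"][0]["message"]
    # reasoning_content is filled by the server-side reasoning parser;
    # it may be missing or None when the parser is off or thinking is disabled.
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer


# Hypothetical response, trimmed to the fields used above.
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "The user wants a short greeting.",
                "content": "Hello!",
            }
        }
    ]
}
reasoning, answer = split_reasoning(sample)
print(answer)
```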

Verified flag values for this model on vLLM 0.23.0:

| Goal | Add |
|---|---|
| Tool / function calling | --enable-auto-tool-choice --tool-call-parser qwen3_coder |
| Faster decode (MTP speculative) | --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' (the base model ships an MTP head) |
| Text-only (skip vision tower, free KV cache) | --language-model-only |
| Bound multimodal inputs | --limit-mm-per-prompt '{"image":4,"video":1}' |
| Hour-scale video | --media-io-kwargs '{"video":{"num_frames":-1}}' (and raise longest_edge in video_preprocessor_config.json) |

Context notes:

  • The model supports up to 262144 tokens. Upstream guidance is to keep at least 128K to preserve thinking quality, so --max-model-len 131072 is the recommended default. Go to 262144 if memory allows, or lower it if you hit OOM.
  • On unified-memory parts (e.g. GB10), --gpu-memory-utilization carves from RAM shared with the rest of the system. Use about 0.90 when this is the only model, and leave more headroom (about 0.80) when co-hosting other processes.

KV-cache quantization

The NVFP4 here quantizes weights and activations, not the KV cache (the checkpoint ships kv_cache_scheme: null). KV-cache quantization is a separate runtime vLLM option. Only the 24 full-attention layers hold a standard KV cache; the gated delta-net linear-attention layers use recurrent state and are unaffected.

  • FP8 KV cache (recommended, safe): --kv-cache-dtype fp8 (or fp8_e4m3). About 2x KV savings with small quality cost. Used in the baseline command above.
  • TurboQuant (vLLM 0.23.0, experimental here): lower-bit KV quant via Hadamard rotation plus per-coordinate Lloyd-Max scalar quantization. Values: turboquant_k8v4 (FP8/4-bit, 2.6x, +1.17% PPL), turboquant_4bit_nc (3.8x), turboquant_k3v4_nc (~3.5x), turboquant_3bit_nc (4.9x). turboquant_k8v4 is the quality sweet spot. Caveat: TurboQuant uses a dedicated attention backend whose interaction with this model's linear-attention layers was not verified. Treat as experimental; prefer fp8 for a known-good KV quant.
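To see why the KV dtype matters even though only 24 layers hold a KV cache, here is a rough sizing sketch. The num_kv_heads and head_dim defaults are hypothetical placeholders, not values from this card; read the real ones from the checkpoint's config.json. Scale overhead and the recurrent state of the linear-attention layers are ignored.

```python
def kv_cache_bytes(tokens, kv_layers=24, num_kv_heads=4, head_dim=256, bytes_per_elem=2):
    """Approximate KV-cache size for `tokens` cached positions.

    num_kv_heads and head_dim are placeholder values for illustration only.
    """
    # Factor 2: one key tensor plus one value tensor per layer.
    return 2 * kv_layers * num_kv_heads * head_dim * bytes_per_elem * tokens


context = 131072
bf16_gib = kv_cache_bytes(context, bytes_per_elem=2) / 2**30
fp8_gib = kv_cache_bytes(context, bytes_per_elem=1) / 2**30
print(f"bf16: {bf16_gib:.1f} GiB, fp8: {fp8_gib:.1f} GiB")
```

Whatever the real head dimensions are, the bf16-to-fp8 ratio stays at 2x, in line with the savings quoted for --kv-cache-dtype fp8.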

Performance / backend notes (verified on vLLM 0.23.0)

  • FlashInfer is bundled and autotune is on by default. It is used automatically for the full-attention and NVFP4 GEMM paths on Blackwell; there is nothing to enable. Optional: VLLM_USE_FLASHINFER_SAMPLER=1 for faster sampling.
  • NVFP4 GEMM auto-selects cutlass FP4 on Blackwell. Do not set VLLM_NVFP4_GEMM_BACKEND (deprecated in 0.23.0). Leave VLLM_USE_NVFP4_CT_EMULATIONS=0 (the default; emulation is for pre-Blackwell).
  • Attention backend: leave on auto. This is a hybrid model, so vLLM assigns the per-layer backends (GDNAttentionBackend / LinearAttentionBackend) automatically. Forcing a single global attention backend breaks the linear-attention layers.
  • No sparse-attention knob applies. The efficiency comes from the hybrid 3:1 linear:full attention layout, handled automatically.

Thinking mode is the default. Sampling settings by mode:

  • Thinking, general: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, repetition_penalty=1.0
  • Thinking, precise coding: temperature=0.6, top_p=0.95, top_k=20
  • Instruct / non-thinking: temperature=0.7, top_p=0.80, top_k=20, presence_penalty≈1.5
  • If the model loops on thin prompts, add a one-line system prompt (e.g. "Be vivid and precise.") and/or set repetition_penalty to a value between 1.05 and 1.1.

To run instruct (non-thinking), set {%- set enable_thinking = false %} in the Jinja chat template, or pass extra_body={"chat_template_kwargs": {"enable_thinking": false}} on OpenAI-compatible endpoints.
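The sampling settings and the thinking toggle can be combined into a request body for an OpenAI-compatible endpoint. The sketch below only builds the JSON body and sends nothing. The preset names and build_chat_body are illustrative, the model name assumes the --served-model-name from the server command, and top_k, min_p and repetition_penalty are vLLM extensions to the OpenAI schema (with the official OpenAI client they would go through extra_body).

```python
import json

# Sampling values copied from the list above; preset names are illustrative.
SAMPLING_PRESETS = {
    "thinking_general": {
        "temperature": 1.0, "top_p": 0.95, "top_k": 20,
        "min_p": 0.0, "repetition_penalty": 1.0,
    },
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "instruct": {
        "temperature": 0.7, "top_p": 0.80, "top_k": 20, "presence_penalty": 1.5,
    },
}


def build_chat_body(prompt, preset="thinking_general",
                    model="qwen3.6-40b-nvfp4", system=None):
    """Build a chat-completions JSON body for the chosen preset."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    body = {"model": model, "messages": messages, **SAMPLING_PRESETS[preset]}
    if preset == "instruct":
        # Disable thinking through the chat template, as described above.
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body


print(json.dumps(build_chat_body("Summarize NVFP4 in two sentences.", "instruct"), indent=2))
```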

Reproduction

See scripts/quantize_nvfp4.py for the full recipe and QUANTIZATION.md for the end-to-end methodology.

Toolchain: llmcompressor==0.12.0, compressed-tensors==0.17.1, transformers==5.10.1, torch==2.11.0+cu130, on an NVIDIA GB10 (Blackwell, sm_121).

License

Apache-2.0, following the base model. Intended use and all responsibility for use follow the base model.

Credits

Benchmarks: NVFP4 vs bf16

Full head-to-head (27B/40B, bf16/NVFP4) on lm-eval and agentic SWE-bench Lite: see BENCHMARKS.md.

TL;DR: NVFP4 matches bf16 on quality: within about 1 instance on SWE-bench once bf16 has time to finish, and under 1.5 points on average on lm-eval. Its smaller weights also decode faster, which is what makes the 40B thinking model practical. Under a 60-minute per-instance cap, NVFP4-40B resolves 13/20 instances versus 6 for bf16; bf16 needs a 3-hour cap to reach 14.

Model provider: maci0

Base model: DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking (this model is a quantized version of it)

Input modalities: Video, Text, Image

Output modalities: Text
