maci0
Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4
License: apache-2.0
About the base model
A 40B dense (not MoE) vision-language model expanded from Qwen3.6-27B, made uncensored via Heretic, trained on the internal Deckard/PKD datasets (character, depth, point of view) and on a Claude 4.6 Opus high-reasoning distillation set to sharpen and stabilize reasoning.
- 96 decoder layers: hybrid gated delta-net linear attention (72) plus full attention (24), dense MLP, plus a vision tower for image and video input.
- 256K context (`max_position_embeddings` = 262144).
- Thinking mode by default (variable-length reasoning), with an instruct toggle.
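The 72/24 split above works out to a 3:1 linear-to-full ratio. The following is a minimal sketch of one layout consistent with those counts. It assumes, without confirmation from this card, that every fourth layer uses full attention; the real per-layer order is whatever the checkpoint's config.json declares, and the names below are illustrative.

```python
from collections import Counter

NUM_LAYERS = 96
FULL_ATTENTION_INTERVAL = 4  # assumption: one full-attention layer per block of four


def layer_types(num_layers: int = NUM_LAYERS,
                interval: int = FULL_ATTENTION_INTERVAL) -> list[str]:
    """Return a per-layer type list where every `interval`-th layer is full attention."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


# Tally the two layer kinds to compare against the counts stated in the card.
counts = Counter(layer_types())
```

Under that assumption the tally reproduces the card's 72 linear-attention and 24 full-attention layers.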
Quantization
| Setting | Value |
|---|---|
| Scheme | NVFP4, W4A4 |
| Weight rounding | GPTQ (Hessian-based error compensation), MSE observer |
| Weights | FP4 (E2M1), group_size=16, tensor_group, symmetric, FP8 (E4M3) group scales |
| Activations | FP4, dynamic per-group (dynamic: local), FP8 (E4M3) scales |
| Targets | all language-model Linear layers, 744 modules (360 linear-attn projections + 288 MLP + 96 full-attn) |
| Kept in bf16 | vision tower (model.visual.*), lm_head |
| Untouched | gated delta-net Conv1d and SSM params (A_log, dt_bias), not Linear, never targeted |
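The module total in the Targets row can be cross-checked against the layer counts. The per-layer projection counts below are not stated in this card; they are inferred by dividing each subtotal by its layer count, so treat them as assumptions.

```python
# Layer counts from the architecture description.
LINEAR_ATTN_LAYERS = 72
FULL_ATTN_LAYERS = 24
ALL_LAYERS = LINEAR_ATTN_LAYERS + FULL_ATTN_LAYERS  # every layer has a dense MLP

# Assumed Linear modules per layer, inferred from the subtotals (360/72, 288/96, 96/24).
LINEAR_ATTN_PROJECTIONS = 5
MLP_PROJECTIONS = 3
FULL_ATTN_PROJECTIONS = 4

targets = {
    "linear-attn": LINEAR_ATTN_LAYERS * LINEAR_ATTN_PROJECTIONS,
    "mlp": ALL_LAYERS * MLP_PROJECTIONS,
    "full-attn": FULL_ATTN_LAYERS * FULL_ATTN_PROJECTIONS,
}
total_targets = sum(targets.values())
```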
GPTQ is a quantization-time cost only. The output is the same
nvfp4-pack-quantized format with identical inference speed; GPTQ just chooses
better 4-bit values than plain round-to-nearest.
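To make "plain round-to-nearest" concrete, here is a simplified sketch of the baseline for one group of 16 weights. It is not the GPTQ procedure and not the llm-compressor implementation: the group scale is kept in full precision rather than rounded to FP8 (E4M3), and the tensor-level scale is omitted.

```python
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable |x| in FP4 E2M1
GROUP_SIZE = 16


def quantize_group_rtn(group: list[float]) -> tuple[list[float], float]:
    """Round one group to the nearest E2M1 value under a shared symmetric scale."""
    max_abs = max(abs(x) for x in group)
    # Map the largest magnitude in the group onto the largest FP4 value (6.0).
    scale = max_abs / E2M1_MAGNITUDES[-1] if max_abs > 0 else 1.0
    dequantized = []
    for x in group:
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        dequantized.append((nearest if x >= 0 else -nearest) * scale)
    return dequantized, scale


# Toy weights spanning roughly -0.75 .. 0.75.
weights = [0.1 * i - 0.75 for i in range(GROUP_SIZE)]
deq, scale = quantize_group_rtn(weights)
mse = sum((w - d) ** 2 for w, d in zip(weights, deq)) / GROUP_SIZE
```

GPTQ keeps the same grid and storage format but adjusts the remaining weights to compensate for each rounding error, which is why it changes quality and not inference speed.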
Calibration
512 samples, domain-matched to the model's actual traffic, max_seq_len=2048,
text-only path through the VL model:
| source | samples | domain |
|---|---|---|
| TeichAI/claude-4.5-opus-high-reasoning-250x | 250 | long reasoning (the base model's own training data) |
| HuggingFaceH4/ultrachat_200k | 150 | general chat |
| m-a-p/Code-Feedback | 112 | code |
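The mix above can be written down as a small config to confirm that it sums to the 512 calibration samples and to see each domain's share. The dictionary name is illustrative; the actual recipe lives in scripts/quantize_nvfp4.py.

```python
# Sample counts per calibration source, copied from the table above.
CALIBRATION_MIX = {
    "TeichAI/claude-4.5-opus-high-reasoning-250x": 250,
    "HuggingFaceH4/ultrachat_200k": 150,
    "m-a-p/Code-Feedback": 112,
}

total_samples = sum(CALIBRATION_MIX.values())
# Percentage share of each source, rounded to one decimal place.
shares = {name: round(100 * n / total_samples, 1) for name, n in CALIBRATION_MIX.items()}
```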
Quality (wikitext-2, 65,504 tokens, vLLM prompt-logprobs)
| build | perplexity |
|---|---|
| plain RTN + minmax + generic calibration | 7.4773 |
| GPTQ + MSE + domain-matched calibration (this build) | 7.4111 |
About 0.9% lower perplexity on generic English. The gap is expected to be larger on the model's own domains (reasoning, creative, code), which wikitext does not cover.
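For reference, perplexity is the exponential of the mean negative log-likelihood per token, and the "about 0.9%" figure is the relative difference between the two table rows. A minimal sketch (the helper names are illustrative, not part of the evaluation script):

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_drop(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` is lower than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


rtn_ppl, gptq_ppl = 7.4773, 7.4111  # values from the table above
improvement = relative_drop(rtn_ppl, gptq_ppl)  # roughly 0.89 percent
```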
Usage
Offline (vLLM)
NVFP4 activation acceleration needs a Blackwell-class GPU. The
`if __name__ == "__main__":` guard is required around `LLM(...)` because the vLLM v1 engine
spawns workers.
```python
from vllm import LLM, SamplingParams


def main():
    llm = LLM(
        model="maci0/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4",
        max_model_len=16384,
    )
    msgs = [{"role": "user", "content": "Write the opening paragraph of a noir short story."}]
    sp = SamplingParams(temperature=1.0, top_p=0.95, top_k=20, max_tokens=2048)
    out = llm.chat(msgs, sp)
    print(out[0].outputs[0].text)


# Guard the entry point: the v1 engine spawns worker processes.
if __name__ == "__main__":
    main()
```
Server (OpenAI-compatible)
Recommended baseline for a single Blackwell GPU. The NVFP4 quantization is
auto-detected from config.json (compressed-tensors), so no quantization flag is
needed. --reasoning-parser qwen3 splits the <think> block into a separate
reasoning_content field.
```bash
vllm serve maci0/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4 \
  --served-model-name qwen3.6-40b-nvfp4 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder
```
These parser flags are not auto-detected; you must pass them explicitly. Drop the
last line if you do not need tool calling; --enable-auto-tool-choice requires
--tool-call-parser.
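On the server, the reasoning parser performs the split described above. In offline use no parser runs, so the raw text may still contain the reasoning. A minimal sketch of a manual split follows; it assumes the output carries a literal `</think>` closing tag, and `split_reasoning` is a hypothetical helper, not a vLLM API.

```python
def split_reasoning(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split raw model output into (reasoning, answer) at the closing think tag."""
    reasoning, sep, answer = text.partition(close_tag)
    if not sep:
        # No think block found: treat the whole text as the answer.
        return "", text.strip()
    # The opening tag may be absent if the chat template already emitted it in the prompt.
    return reasoning.replace("<think>", "", 1).strip(), answer.strip()
```

For example, `split_reasoning(out[0].outputs[0].text)` would separate the two parts of an offline `llm.chat` result.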
Verified flag values for this model on vLLM 0.23.0:
| Goal | Add |
|---|---|
| Tool / function calling | --enable-auto-tool-choice --tool-call-parser qwen3_coder |
| Faster decode (MTP speculative) | --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}', the base model ships an MTP head |
| Text-only (skip vision tower, free KV cache) | --language-model-only |
| Bound multimodal inputs | --limit-mm-per-prompt '{"image":4,"video":1}' |
| Hour-scale video | --media-io-kwargs '{"video":{"num_frames":-1}}' (and raise longest_edge in video_preprocessor_config.json) |
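Several flags in the table take JSON values, which are easy to misquote in a shell. One way to avoid that when generating launch commands from a script is to serialize and quote them programmatically. This is only a sketch; `json_flag` is a hypothetical helper.

```python
import json
import shlex


def json_flag(name: str, value: dict) -> str:
    """Render `--name '<compact json>'`, quoted for a POSIX shell."""
    return f"{name} {shlex.quote(json.dumps(value, separators=(',', ':')))}"


# Flag names and values are taken from the table above.
spec_flag = json_flag(
    "--speculative-config",
    {"method": "qwen3_next_mtp", "num_speculative_tokens": 2},
)
mm_flag = json_flag("--limit-mm-per-prompt", {"image": 4, "video": 1})
```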
Context notes:
- The model supports up to 262144 tokens. Upstream guidance is to keep at least 128K to preserve thinking quality, so `--max-model-len 131072` is the recommended default. Go to 262144 if memory allows, or lower it if you hit OOM.
- On unified-memory parts (e.g. GB10), `--gpu-memory-utilization` carves from RAM shared with the rest of the system. Use about 0.90 when this is the only model, and leave more headroom (about 0.80) when co-hosting other processes.
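The two context notes can be folded into a small launcher helper. This is a sketch only: `serve_args` and its parameters are hypothetical, and it uses just the flags and values already given in this README.

```python
MODEL = "maci0/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4"
MAX_CONTEXT = 262144  # model limit stated above


def serve_args(max_model_len: int = 131072, co_hosted: bool = False) -> list[str]:
    """Build a `vllm serve` argument list following the context notes above."""
    if not 0 < max_model_len <= MAX_CONTEXT:
        raise ValueError(f"max_model_len must be in 1..{MAX_CONTEXT}")
    # Leave more headroom when other processes share the same memory pool.
    utilization = 0.80 if co_hosted else 0.90
    return [
        "vllm", "serve", MODEL,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", f"{utilization:.2f}",
        "--kv-cache-dtype", "fp8",
        "--reasoning-parser", "qwen3",
    ]
```

The resulting list can be passed to `subprocess.run`; lower `max_model_len` if the server runs out of memory.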
KV-cache quantization
The NVFP4 here quantizes weights and activations, not the KV cache (the checkpoint
ships kv_cache_scheme: null). KV-cache quantization is a separate runtime vLLM
option. Only the 24 full-attention layers hold a standard KV cache; the gated
delta-net linear-attention layers use recurrent state and are unaffected.
- FP8 KV cache (recommended, safe): `--kv-cache-dtype fp8` (or `fp8_e4m3`). About 2x KV savings with small quality cost. Used in the baseline command above.
- TurboQuant (vLLM 0.23.0, experimental here): lower-bit KV quant via Hadamard rotation plus per-coordinate Lloyd-Max scalar quantization. Values: `turboquant_k8v4` (FP8/4-bit, 2.6x, +1.17% PPL), `turboquant_4bit_nc` (3.8x), `turboquant_k3v4_nc` (~3.5x), `turboquant_3bit_nc` (4.9x). `turboquant_k8v4` is the quality sweet spot. Caveat: TurboQuant uses a dedicated attention backend whose interaction with this model's linear-attention layers was not verified. Treat as experimental; prefer `fp8` for a known-good KV quant.
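A back-of-the-envelope estimate shows why KV quantization matters at long context even though only 24 layers hold a KV cache. The head count and head dimension below are hypothetical placeholders, not values from this card; read the real ones from the checkpoint's config.json. Scale and metadata overhead are ignored.

```python
FULL_ATTENTION_LAYERS = 24  # from this card: only these layers hold a KV cache
NUM_KV_HEADS = 4            # hypothetical placeholder, check config.json
HEAD_DIM = 256              # hypothetical placeholder, check config.json


def kv_cache_gib(tokens: int, bytes_per_element: float, compression: float = 1.0) -> float:
    """Estimate KV-cache size in GiB for `tokens` cached positions."""
    # Factor 2 accounts for storing both keys and values.
    elements = 2 * FULL_ATTENTION_LAYERS * NUM_KV_HEADS * HEAD_DIM * tokens
    return elements * bytes_per_element / compression / 2**30


bf16_gib = kv_cache_gib(131072, bytes_per_element=2)
fp8_gib = kv_cache_gib(131072, bytes_per_element=1)
# TurboQuant ratios quoted above are relative to the unquantized cache.
k8v4_gib = kv_cache_gib(131072, bytes_per_element=2, compression=2.6)
```

With these placeholder dimensions the FP8 cache is half the bf16 size, matching the "about 2x" figure above; the absolute numbers change with the real head configuration.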
Performance / backend notes (verified on vLLM 0.23.0)
- FlashInfer is bundled and autotune is on by default. It is used automatically for the full-attention and NVFP4 GEMM paths on Blackwell; there is nothing to enable. Optional: `VLLM_USE_FLASHINFER_SAMPLER=1` for faster sampling.
- NVFP4 GEMM auto-selects cutlass FP4 on Blackwell. Do not set `VLLM_NVFP4_GEMM_BACKEND` (deprecated in 0.23.0). Leave `VLLM_USE_NVFP4_CT_EMULATIONS=0` (the default; emulation is for pre-Blackwell).
- Attention backend: leave on auto. This is a hybrid model, so vLLM assigns the per-layer backends (`GDNAttentionBackend` / `LinearAttentionBackend`) automatically. Forcing a single global attention backend breaks the linear-attention layers.
- No sparse-attention knob applies. The efficiency comes from the hybrid 3:1 linear:full attention layout, handled automatically.
Recommended sampling
Thinking mode is the default.
- Thinking, general: `temperature=1.0`, `top_p=0.95`, `top_k=20`, `min_p=0.0`, `repetition_penalty=1.0`
- Thinking, precise coding: `temperature=0.6`, `top_p=0.95`, `top_k=20`
- Instruct / non-thinking: `temperature=0.7`, `top_p=0.80`, `top_k=20`, `presence_penalty≈1.5`
- If the model loops on thin prompts, add a one-line system prompt (e.g. `Be vivid and precise.`) and/or set `repetition_penalty` to between 1.05 and 1.1.
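The presets above can be kept in one place and expanded into keyword arguments. This is a sketch: the preset names and the `sampling_kwargs` helper are hypothetical, and the anti-loop option uses the low end of the suggested repetition-penalty range.

```python
SAMPLING_PRESETS = {
    "thinking_general": {
        "temperature": 1.0, "top_p": 0.95, "top_k": 20,
        "min_p": 0.0, "repetition_penalty": 1.0,
    },
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "instruct": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "presence_penalty": 1.5},
}


def sampling_kwargs(preset: str, anti_loop: bool = False, max_tokens: int = 2048) -> dict:
    """Return sampling keyword arguments for a preset, optionally damping loops."""
    kwargs = dict(SAMPLING_PRESETS[preset], max_tokens=max_tokens)  # copy, never mutate
    if anti_loop:
        kwargs["repetition_penalty"] = 1.05  # low end of the suggested range
    return kwargs
```

In the offline path the result can be unpacked as `SamplingParams(**sampling_kwargs("thinking_coding"))`.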
To run instruct (non-thinking), set {%- set enable_thinking = false %} in the
Jinja chat template, or pass
extra_body={"chat_template_kwargs": {"enable_thinking": false}} on OpenAI-compatible
endpoints.
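As an illustration of the second option, the request body for the OpenAI-compatible endpoint can be built with the standard library alone. The sketch assumes the server was started with `--served-model-name qwen3.6-40b-nvfp4` as in the baseline command and listens on `http://localhost:8000/v1`; both the URL and the helper names are assumptions. Note that Python spells the JSON value `false` as `False`.

```python
import json
import urllib.request


def build_chat_request(prompt: str, thinking: bool = True,
                       model: str = "qwen3.6-40b-nvfp4") -> dict:
    """Build a chat-completions body, disabling thinking via chat_template_kwargs."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0 if thinking else 0.7,
        "top_p": 0.95 if thinking else 0.80,
    }
    if not thinking:
        body["presence_penalty"] = 1.5
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body


def send(body: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the body to the chat-completions route and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the OpenAI Python client, the same `chat_template_kwargs` entry goes into `extra_body`, which is merged into the request body.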
Reproduction
See scripts/quantize_nvfp4.py for the full recipe
and QUANTIZATION.md for the end-to-end methodology.
Toolchain: llmcompressor==0.12.0, compressed-tensors==0.17.1,
transformers==5.10.1, torch==2.11.0+cu130, on an NVIDIA GB10 (Blackwell, sm_121).
License
Apache-2.0, following the base model. Intended use and all responsibility for use follow the base model.
Credits
- Base model: DavidAU
- Quantization tooling: llm-compressor / compressed-tensors
Benchmarks: NVFP4 vs bf16
Full head-to-head (27B/40B, bf16/NVFP4) on lm-eval and agentic SWE-bench Lite:
see BENCHMARKS.md.

TL;DR: NVFP4 matches bf16 on quality (within ~1 instance on SWE-bench once bf16 has time to finish; <1.5 pt avg on lm-eval). Its smaller weights also decode faster, which is what makes the 40B thinking model practical: under a 60-min per-instance cap, NVFP4-40B resolves 13/20 vs bf16's 6 (bf16 needs a 3-h cap to reach 14).
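Expressed as resolve rates, the SWE-bench figures in the TL;DR look as follows. The sketch assumes every run used the same 20 instances, which the summary implies but does not state outright; see BENCHMARKS.md for the authoritative numbers.

```python
INSTANCES = 20  # assumption: all runs share the same 20 SWE-bench Lite instances

# Resolved counts quoted in the TL;DR above.
resolved = {
    "NVFP4 @ 60 min": 13,
    "bf16 @ 60 min": 6,
    "bf16 @ 3 h": 14,
}
resolve_rates = {run: 100 * n / INSTANCES for run, n in resolved.items()}
```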
Model provider
maci0
Model tree
Base
DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text