maci0

Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4

Deploy Dedicated

README

License: apache-2.0

Benchmarks

benchmarks

Near-lossless versus the bf16 source:

SWE-bench Lite (agentic, mini-swe-agent, instances 0:20): resolves land within one instance of bf16 at every size (NVFP4 15/13, bf16 16/14 of 20).
lm-eval (27B pair, the clean apples-to-apples): average accuracy gap under 1.5 points.
wikitext-2 perplexity (this 40B build, vLLM prompt-logprobs): 6.89.

Full head-to-head tables and method in BENCHMARKS.md.

Fidelity

Near-lossless versus the bf16 source: wikitext-2 perplexity for this build is 6.89.

Table with columns: Metric, Value
Metric	Value
wikitext-2 PPL	6.89
Weights	NVFP4 W4A4, group 16
Size	25.6 GB vs 73.7 GB bf16 (~35%)

NVFP4 uses GPTQ error compensation, an MSE observer, and shared fused-layer scales, so the drop from bf16 is minimal.

Quickstart

Offline (vLLM)

NVFP4 activation acceleration needs a Blackwell-class GPU. The

markdown

if __name__ == "__main__"

guard is required for offline LLM(...) because the vLLM v1 engine spawns workers.

python
from vllm import LLM, SamplingParams

def main():
    llm = LLM(
        model="maci0/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4",
        max_model_len=16384,
    )
    msgs = [{"role": "user", "content": "Write the opening paragraph of a noir short story."}]
    sp = SamplingParams(temperature=1.0, top_p=0.95, top_k=20, max_tokens=2048)
    out = llm.chat(msgs, sp)
    print(out[0].outputs[0].text)

if __name__ == "__main__":
    main()

Server (OpenAI-compatible)

Recommended baseline for a single Blackwell GPU. The NVFP4 quantization is auto-detected from config.json (compressed-tensors), so no quantization flag is needed. --reasoning-parser qwen3 splits the <think> block into a separate reasoning_content field.

bash
vllm serve maci0/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4 \
  --served-model-name qwen3.6-40b-nvfp4 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder

These parser flags are not auto-detected; you must pass them explicitly. Drop the last line if you do not need tool calling; --enable-auto-tool-choice requires --tool-call-parser.

Verified flag values for this model on vLLM 0.23.0:

Table with columns: Goal, Add
Goal	Add
Tool / function calling	`--enable-auto-tool-choice --tool-call-parser qwen3_coder`
Faster decode (MTP speculative)	`--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'`, the base model ships an MTP head
Text-only (skip vision tower, free KV cache)	`--language-model-only`
Bound multimodal inputs	`--limit-mm-per-prompt '{"image":4,"video":1}'`
Hour-scale video	`--media-io-kwargs '{"video":{"num_frames":-1}}'` (and raise `longest_edge` in )

Context notes:

The model supports up to 262144 tokens. Upstream guidance is to keep at least 128K to preserve thinking quality, so --max-model-len 131072 is the recommended default. Go to 262144 if memory allows, or lower it if you hit OOM.
On unified-memory parts (e.g. GB10), --gpu-memory-utilization carves from RAM shared with the rest of the system. Use about 0.90 when this is the only model, and leave more headroom (about 0.80) when co-hosting other processes.

Python (OpenAI client)

python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
r = client.chat.completions.create(
    model="qwen3.6-40b-nvfp4",
    messages=[{"role": "user", "content": "Write the opening paragraph of a noir short story."}],
)
print(r.choices[0].message.content)

curl

bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen3.6-40b-nvfp4",
  "messages": [{"role": "user", "content": "Write the opening paragraph of a noir short story."}]
}'

KV-cache quantization

The NVFP4 here quantizes weights and activations, not the KV cache (the checkpoint ships kv_cache_scheme: null). KV-cache quantization is a separate runtime vLLM option. Only the 24 full-attention layers hold a standard KV cache; the gated delta-net linear-attention layers use recurrent state and are unaffected.

FP8 KV cache (recommended, safe): --kv-cache-dtype fp8 (or fp8_e4m3). About 2x KV savings with small quality cost. Used in the baseline command above.
TurboQuant (vLLM 0.23.0, experimental here): lower-bit KV quant via Hadamard rotation plus per-coordinate Lloyd-Max scalar quantization. Values: turboquant_k8v4 (FP8/4-bit, 2.6x, +1.17% PPL), turboquant_4bit_nc (3.8x), turboquant_k3v4_nc (~3.5x), turboquant_3bit_nc (4.9x). turboquant_k8v4 is the quality sweet spot. Caveat: TurboQuant uses a dedicated attention backend whose interaction with this model's linear-attention layers was not verified. Treat as experimental; prefer fp8 for a known-good KV quant.

Performance / backend notes (verified on vLLM 0.23.0)

FlashInfer is bundled and autotune is on by default. It is used automatically for the full-attention and NVFP4 GEMM paths on Blackwell; there is nothing to enable. Optional: VLLM_USE_FLASHINFER_SAMPLER=1 for faster sampling.
NVFP4 GEMM auto-selects cutlass FP4 on Blackwell. Do not set VLLM_NVFP4_GEMM_BACKEND (deprecated in 0.23.0). Leave VLLM_USE_NVFP4_CT_EMULATIONS=0 (the default; emulation is for pre-Blackwell).
Attention backend: leave on auto. This is a hybrid model, so vLLM assigns the per-layer backends (GDNAttentionBackend / LinearAttentionBackend) automatically. Forcing a single global attention backend breaks the linear-attention layers.
No sparse-attention knob applies. The efficiency comes from the hybrid 3:1 linear:full attention layout, handled automatically.

About the base model

A 40B dense (not MoE) vision-language model expanded from Qwen3.6-27B, made uncensored via Heretic, trained on the internal Deckard/PKD datasets (character, depth, point of view) and on a Claude 4.6 Opus high-reasoning distillation set to sharpen and stabilize reasoning.

96 decoder layers: hybrid gated delta-net linear attention (72) plus full attention (24), dense MLP, plus a vision tower for image and video input.
256K context (max_position_embeddings 262144).
Thinking mode by default (variable-length reasoning), with an instruct toggle.

Quantization

Table

Scheme	NVFP4, W4A4
Weight rounding	GPTQ (Hessian-based error compensation), MSE observer
Weights	FP4 (E2M1), `group_size=16`, `tensor_group`, symmetric, FP8 (E4M3) group scales
Activations	FP4, dynamic per-group (`dynamic: local`), FP8 (E4M3) scales
Targets	all language-model `Linear` layers, 744 modules (360 linear-attn projections + 288 MLP + 96 full-attn)
Kept in bf16	vision tower (),

GPTQ is a quantization-time cost only. The output is the same nvfp4-pack-quantized format with identical inference speed; GPTQ just chooses better 4-bit values than plain round-to-nearest.

Calibration

512 samples, domain-matched to the model's actual traffic, max_seq_len=2048, text-only path through the VL model:

Table with columns: source, samples, domain
source	samples	domain
`TeichAI/claude-4.5-opus-high-reasoning-250x`	250	long reasoning (the base model's own training data)
`HuggingFaceH4/ultrachat_200k`	150	general chat
`m-a-p/Code-Feedback`	112	code

Quality

GPTQ with the domain-matched calibration measurably beats plain round-to-nearest. The fused layers (q/k/v, gate/up) share one NVFP4 global scale, so vLLM does not warn or fall back. Measured wikitext-2 perplexity for this build is 6.89 (see Benchmarks); the gain is expected to be larger on the model's own domains (reasoning, creative, code), which wikitext does not cover.

Recommended sampling

Thinking mode is the default.

Thinking, general: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, repetition_penalty=1.0
Thinking, precise coding: temperature=0.6, top_p=0.95, top_k=20
Instruct / non-thinking: temperature=0.7, top_p=0.80, top_k=20, presence_penalty≈1.5
If the model loops on thin prompts, add a one-line system prompt (e.g.
markdown
```
Be vivid and precise.
```
) and/or set repetition_penalty 1.05 to 1.1.

To run instruct (non-thinking), set {%- set enable_thinking = false %} in the Jinja chat template, or pass extra_body={"chat_template_kwargs": {"enable_thinking": false}} on OpenAI-compatible endpoints.

Reproduction

See scripts/quantize_nvfp4.py for the full recipe and QUANTIZATION.md for the end-to-end methodology.

Toolchain: llmcompressor==0.12.0, compressed-tensors==0.17.1, transformers==5.12.1, torch==2.11.0+cu130, on an NVIDIA GB10 (Blackwell, sm_121).

Base model: DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking
Space: Rogue Quants
Collection: NVFP4 Quants
Sibling NVFP4 quants:

Notes

Needs NVIDIA Blackwell (sm_121, e.g. GB10) for accelerated W4A4; pre-Blackwell GPUs run it weight-only.
--reasoning-parser and --tool-call-parser are not auto-detected; pass them explicitly.
Thinking mode is on by default; toggle it via the chat template or chat_template_kwargs.

License

Apache-2.0, following the base model. Intended use and all responsibility for use follow the base model.

Credits

Base model: DavidAU
Quantization tooling: llm-compressor / compressed-tensors

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

maci0

Model Tree

Base

DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking

Quantized

this model

Input Modalities

Text

Image

Video

Output Modalities

Text

Supported Functionality

Dedicated Endpoints

Explore FriendliAI today

Get started Talk to an engineer

README

License: apache-2.0

Benchmarks

benchmarks

Near-lossless versus the bf16 source:

SWE-bench Lite (agentic, mini-swe-agent, instances 0:20): resolves land within one instance of bf16 at every size (NVFP4 15/13, bf16 16/14 of 20).
lm-eval (27B pair, the clean apples-to-apples): average accuracy gap under 1.5 points.
wikitext-2 perplexity (this 40B build, vLLM prompt-logprobs): 6.89.

Full head-to-head tables and method in BENCHMARKS.md.

Fidelity

Near-lossless versus the bf16 source: wikitext-2 perplexity for this build is 6.89.

Table with columns: Metric, Value
Metric	Value
wikitext-2 PPL	6.89
Weights	NVFP4 W4A4, group 16
Size	25.6 GB vs 73.7 GB bf16 (~35%)

NVFP4 uses GPTQ error compensation, an MSE observer, and shared fused-layer scales, so the drop from bf16 is minimal.

Quickstart

Offline (vLLM)

NVFP4 activation acceleration needs a Blackwell-class GPU. The

markdown

if __name__ == "__main__"

guard is required for offline LLM(...) because the vLLM v1 engine spawns workers.

python
from vllm import LLM, SamplingParams

def main():
    llm = LLM(
        model="maci0/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4",
        max_model_len=16384,
    )
    msgs = [{"role": "user", "content": "Write the opening paragraph of a noir short story."}]
    sp = SamplingParams(temperature=1.0, top_p=0.95, top_k=20, max_tokens=2048)
    out = llm.chat(msgs, sp)
    print(out[0].outputs[0].text)

if __name__ == "__main__":
    main()

Server (OpenAI-compatible)

bash
vllm serve maci0/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4 \
  --served-model-name qwen3.6-40b-nvfp4 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder

These parser flags are not auto-detected; you must pass them explicitly. Drop the last line if you do not need tool calling; --enable-auto-tool-choice requires --tool-call-parser.

Verified flag values for this model on vLLM 0.23.0:

Table with columns: Goal, Add
Goal	Add
Tool / function calling	`--enable-auto-tool-choice --tool-call-parser qwen3_coder`
Faster decode (MTP speculative)	`--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'`, the base model ships an MTP head
Text-only (skip vision tower, free KV cache)	`--language-model-only`
Bound multimodal inputs	`--limit-mm-per-prompt '{"image":4,"video":1}'`
Hour-scale video	`--media-io-kwargs '{"video":{"num_frames":-1}}'` (and raise `longest_edge` in )

Context notes:

The model supports up to 262144 tokens. Upstream guidance is to keep at least 128K to preserve thinking quality, so --max-model-len 131072 is the recommended default. Go to 262144 if memory allows, or lower it if you hit OOM.
On unified-memory parts (e.g. GB10), --gpu-memory-utilization carves from RAM shared with the rest of the system. Use about 0.90 when this is the only model, and leave more headroom (about 0.80) when co-hosting other processes.

Python (OpenAI client)

python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
r = client.chat.completions.create(
    model="qwen3.6-40b-nvfp4",
    messages=[{"role": "user", "content": "Write the opening paragraph of a noir short story."}],
)
print(r.choices[0].message.content)

curl

bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen3.6-40b-nvfp4",
  "messages": [{"role": "user", "content": "Write the opening paragraph of a noir short story."}]
}'

KV-cache quantization

FP8 KV cache (recommended, safe): --kv-cache-dtype fp8 (or fp8_e4m3). About 2x KV savings with small quality cost. Used in the baseline command above.
TurboQuant (vLLM 0.23.0, experimental here): lower-bit KV quant via Hadamard rotation plus per-coordinate Lloyd-Max scalar quantization. Values: turboquant_k8v4 (FP8/4-bit, 2.6x, +1.17% PPL), turboquant_4bit_nc (3.8x), turboquant_k3v4_nc (~3.5x), turboquant_3bit_nc (4.9x). turboquant_k8v4 is the quality sweet spot. Caveat: TurboQuant uses a dedicated attention backend whose interaction with this model's linear-attention layers was not verified. Treat as experimental; prefer fp8 for a known-good KV quant.

Performance / backend notes (verified on vLLM 0.23.0)

FlashInfer is bundled and autotune is on by default. It is used automatically for the full-attention and NVFP4 GEMM paths on Blackwell; there is nothing to enable. Optional: VLLM_USE_FLASHINFER_SAMPLER=1 for faster sampling.
NVFP4 GEMM auto-selects cutlass FP4 on Blackwell. Do not set VLLM_NVFP4_GEMM_BACKEND (deprecated in 0.23.0). Leave VLLM_USE_NVFP4_CT_EMULATIONS=0 (the default; emulation is for pre-Blackwell).
Attention backend: leave on auto. This is a hybrid model, so vLLM assigns the per-layer backends (GDNAttentionBackend / LinearAttentionBackend) automatically. Forcing a single global attention backend breaks the linear-attention layers.
No sparse-attention knob applies. The efficiency comes from the hybrid 3:1 linear:full attention layout, handled automatically.

About the base model

96 decoder layers: hybrid gated delta-net linear attention (72) plus full attention (24), dense MLP, plus a vision tower for image and video input.
256K context (max_position_embeddings 262144).
Thinking mode by default (variable-length reasoning), with an instruct toggle.

Quantization

Table

Scheme	NVFP4, W4A4
Weight rounding	GPTQ (Hessian-based error compensation), MSE observer
Weights	FP4 (E2M1), `group_size=16`, `tensor_group`, symmetric, FP8 (E4M3) group scales
Activations	FP4, dynamic per-group (`dynamic: local`), FP8 (E4M3) scales
Targets	all language-model `Linear` layers, 744 modules (360 linear-attn projections + 288 MLP + 96 full-attn)
Kept in bf16	vision tower (),

GPTQ is a quantization-time cost only. The output is the same nvfp4-pack-quantized format with identical inference speed; GPTQ just chooses better 4-bit values than plain round-to-nearest.

Calibration

512 samples, domain-matched to the model's actual traffic, max_seq_len=2048, text-only path through the VL model:

Table with columns: source, samples, domain
source	samples	domain
`TeichAI/claude-4.5-opus-high-reasoning-250x`	250	long reasoning (the base model's own training data)
`HuggingFaceH4/ultrachat_200k`	150	general chat
`m-a-p/Code-Feedback`	112	code

Quality

Recommended sampling

Thinking mode is the default.

Thinking, general: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, repetition_penalty=1.0
Thinking, precise coding: temperature=0.6, top_p=0.95, top_k=20
Instruct / non-thinking: temperature=0.7, top_p=0.80, top_k=20, presence_penalty≈1.5
If the model loops on thin prompts, add a one-line system prompt (e.g.
markdown
```
Be vivid and precise.
```
) and/or set repetition_penalty 1.05 to 1.1.

Reproduction

See scripts/quantize_nvfp4.py for the full recipe and QUANTIZATION.md for the end-to-end methodology.

Toolchain: llmcompressor==0.12.0, compressed-tensors==0.17.1, transformers==5.12.1, torch==2.11.0+cu130, on an NVIDIA GB10 (Blackwell, sm_121).

Base model: DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking
Space: Rogue Quants
Collection: NVFP4 Quants
Sibling NVFP4 quants:

Notes

Needs NVIDIA Blackwell (sm_121, e.g. GB10) for accelerated W4A4; pre-Blackwell GPUs run it weight-only.
--reasoning-parser and --tool-call-parser are not auto-detected; pass them explicitly.
Thinking mode is on by default; toggle it via the chat template or chat_template_kwargs.

License

Apache-2.0, following the base model. Intended use and all responsibility for use follow the base model.

Credits

Base model: DavidAU
Quantization tooling: llm-compressor / compressed-tensors

Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4

README

Benchmarks

Fidelity

Quickstart

Offline (vLLM)

Server (OpenAI-compatible)

Python (OpenAI client)

curl

KV-cache quantization

Performance / backend notes (verified on vLLM 0.23.0)

About the base model

Quantization

Calibration

Quality

Recommended sampling

Reproduction

Related

Notes

License

Credits

Explore FriendliAI today

README

Benchmarks

Fidelity

Quickstart

Offline (vLLM)

Server (OpenAI-compatible)

Python (OpenAI client)

curl

KV-cache quantization

Performance / backend notes (verified on vLLM 0.23.0)

About the base model

Quantization

Calibration

Quality

Recommended sampling

Reproduction

Related

Notes

License

Credits