kyaky

Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

Quality matches the BF16 source

Measured BF16 source vs this NVFP4 on the same hardware. GSM8K / ARC-Challenge / HellaSwag use the EleutherAI lm-evaluation-harness (5-shot, 500 samples, chat template applied); HumanEval is sandboxed pass@1 over all 164 problems.

Benchmark — BF16 source vs this NVFP4 across GSM8K, ARC-Challenge, HellaSwag, HumanEval, and on-disk size

Table with columns: Benchmark, Tool, BF16 (source), This NVFP4, Δ
Benchmark	Tool	BF16 (source)	This NVFP4	Δ
GSM8K (5-shot, exact-match)	lm-eval-harness	98.4%	97.6%	−0.8
ARC-Challenge (5-shot, acc_norm)	lm-eval-harness	51.6%	51.4%	−0.2
HellaSwag (5-shot, acc_norm)	lm-eval-harness	67.8%	67.8%	0.0
HumanEval (pass@1, 164)	sandboxed exec	94.51%	94.51%	0.0
Disk size	—	104 GB	47.7 GB	−54%

Every delta is within statistical noise (±0.6% on GSM8K, ±2.2% on ARC/HellaSwag) — NVFP4 is at BF16 parity across reasoning, science-MC, commonsense-MC, and code. For reference, the base author's own GGUF quants report IQ4_XS ≈ 94% and Q8_0 ≈ 98.4% of BF16; this NVFP4 is Q8-class at ~46% of the size. A 14-prompt diverse spot-check (math/code/reasoning/factual/creative/uncensored) was additionally 14/14 semantically equivalent to BF16, with short deterministic answers token-identical.

ARC-Challenge / HellaSwag are loglikelihood multiple-choice tasks, which under-measure a reasoning model tuned to "think then answer" (note the strong generative GSM8K). They are included here mainly as a BF16-vs-NVFP4 fidelity check, which they pass cleanly.

Quantization recipe

Mixed precision, chosen data-drivenly (per-layer outlier/kurtosis scan) and for serving robustness:

Table with columns: Component, Precision
Component	Precision
Dense MLP `gate/up/down_proj` (75 layers)	NVFP4 — W4A4, `tensor_group` group-size 16, FP8-E4M3 scales
Full-attention `q/k/v/o_proj` (24 layers)	FP8 — W8A8, block 128×128 weights / group-128 dynamic activations
Gated-DeltaNet `linear_attn.*` (all 72 layers)	BF16 (fp32-sensitive recurrence; most outlier-heavy)
`lm_head`, `embed_tokens`, vision tower	BF16
MLP of layers (endpoints + flagged seams/outlier cluster)

Calibration: 768 samples @ 4096 tokens, domain-matched mix — long-CoT reasoning with literal <think> traces (OpenThoughts, reformatted), code (Magicoder-Evol-Instruct, CodeAlpaca), general chat (UltraChat), and an uncensored slice (Dolphin) — to preserve code, reasoning, and the model's thinking/uncensored behavior.

Serving (vLLM)

bash
vllm serve kyaky/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4 \
  --max-model-len 16384 --reasoning-parser qwen3 --trust-remote-code

Served text-only (the vision tower is kept BF16 and unused). Fits on a single 96 GB card.

Recommended sampling (from the base model card):

Thinking (general): temp 1.0, top_p 0.95, top_k 20
Thinking (coding): temp 0.6, top_p 0.95, top_k 20
Non-thinking: temp 0.7, top_p 0.80, top_k 20, presence_penalty 1.5
On low-quant looping, set repetition_penalty 1.05–1.1 or add a short system prompt.

Notes & limitations

Thinking is on by default (<think>…</think>); pass chat_template_kwargs={"enable_thinking": false} to disable.
Single-stream throughput is modest (dense 40B with BF16 Gated-DeltaNet layers) — this build optimizes for quality, not speed.
For tool-calling the base author suggests higher-bit quants; NVFP4 (4-bit MLP) may be weaker there.

Credits

Base model & training: DavidAU
Architecture: Qwen (Qwen3.5/3.6 hybrid), Apache-2.0
Quantization: llm-compressor / compressed-tensors. NVFP4 self-quant + evaluation by kyaky.

Model provider

kyaky

Model tree

Base

DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

Quality matches the BF16 source

Benchmark — BF16 source vs this NVFP4 across GSM8K, ARC-Challenge, HellaSwag, HumanEval, and on-disk size

Table with columns: Benchmark, Tool, BF16 (source), This NVFP4, Δ
Benchmark	Tool	BF16 (source)	This NVFP4	Δ
GSM8K (5-shot, exact-match)	lm-eval-harness	98.4%	97.6%	−0.8
ARC-Challenge (5-shot, acc_norm)	lm-eval-harness	51.6%	51.4%	−0.2
HellaSwag (5-shot, acc_norm)	lm-eval-harness	67.8%	67.8%	0.0
HumanEval (pass@1, 164)	sandboxed exec	94.51%	94.51%	0.0
Disk size	—	104 GB	47.7 GB	−54%

ARC-Challenge / HellaSwag are loglikelihood multiple-choice tasks, which under-measure a reasoning model tuned to "think then answer" (note the strong generative GSM8K). They are included here mainly as a BF16-vs-NVFP4 fidelity check, which they pass cleanly.

Quantization recipe

Mixed precision, chosen data-drivenly (per-layer outlier/kurtosis scan) and for serving robustness:

Table with columns: Component, Precision
Component	Precision
Dense MLP `gate/up/down_proj` (75 layers)	NVFP4 — W4A4, `tensor_group` group-size 16, FP8-E4M3 scales
Full-attention `q/k/v/o_proj` (24 layers)	FP8 — W8A8, block 128×128 weights / group-128 dynamic activations
Gated-DeltaNet `linear_attn.*` (all 72 layers)	BF16 (fp32-sensitive recurrence; most outlier-heavy)
`lm_head`, `embed_tokens`, vision tower	BF16
MLP of layers (endpoints + flagged seams/outlier cluster)

Serving (vLLM)

bash
vllm serve kyaky/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4 \
  --max-model-len 16384 --reasoning-parser qwen3 --trust-remote-code

Served text-only (the vision tower is kept BF16 and unused). Fits on a single 96 GB card.

Recommended sampling (from the base model card):

Thinking (general): temp 1.0, top_p 0.95, top_k 20
Thinking (coding): temp 0.6, top_p 0.95, top_k 20
Non-thinking: temp 0.7, top_p 0.80, top_k 20, presence_penalty 1.5
On low-quant looping, set repetition_penalty 1.05–1.1 or add a short system prompt.

Notes & limitations

Thinking is on by default (<think>…</think>); pass chat_template_kwargs={"enable_thinking": false} to disable.
Single-stream throughput is modest (dense 40B with BF16 Gated-DeltaNet layers) — this build optimizes for quality, not speed.
For tool-calling the base author suggests higher-bit quants; NVFP4 (4-bit MLP) may be weaker there.

Credits

Base model & training: DavidAU
Architecture: Qwen (Qwen3.5/3.6 hybrid), Apache-2.0
Quantization: llm-compressor / compressed-tensors. NVFP4 self-quant + evaluation by kyaky.

Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4

Get help setting up a custom Dedicated Endpoints.

README

Quality matches the BF16 source

Quantization recipe

Serving (vLLM)

Notes & limitations

Credits

Explore FriendliAI today

README

Quality matches the BF16 source

Quantization recipe

Serving (vLLM)

Notes & limitations

Credits