Quality matches the BF16 source
Measured BF16 source vs this NVFP4 on the same hardware. GSM8K / ARC-Challenge / HellaSwag use the
EleutherAI lm-evaluation-harness (5-shot, 500
samples, chat template applied); HumanEval is sandboxed pass@1 over all 164 problems.

Table with columns: Benchmark, Tool, BF16 (source), This NVFP4, Δ| Benchmark | Tool | BF16 (source) | This NVFP4 | Δ |
|---|
| GSM8K (5-shot, exact-match) | lm-eval-harness | 98.4% | 97.6% | −0.8 |
| ARC-Challenge (5-shot, acc_norm) | lm-eval-harness | 51.6% | 51.4% | −0.2 |
| HellaSwag (5-shot, acc_norm) | lm-eval-harness | 67.8% | 67.8% | 0.0 |
| HumanEval (pass@1, 164) | sandboxed exec | 94.51% | 94.51% | 0.0 |
| Disk size | — | 104 GB | 47.7 GB | −54% |
Every delta is within statistical noise (±0.6% on GSM8K, ±2.2% on ARC/HellaSwag) — NVFP4 is at BF16
parity across reasoning, science-MC, commonsense-MC, and code. For reference, the base author's own GGUF
quants report IQ4_XS ≈ 94% and Q8_0 ≈ 98.4% of BF16; this NVFP4 is Q8-class at ~46% of the size.
A 14-prompt diverse spot-check (math/code/reasoning/factual/creative/uncensored) was additionally
14/14 semantically equivalent to BF16, with short deterministic answers token-identical.
ARC-Challenge / HellaSwag are loglikelihood multiple-choice tasks, which under-measure a reasoning model
tuned to "think then answer" (note the strong generative GSM8K). They are included here mainly as a
BF16-vs-NVFP4 fidelity check, which they pass cleanly.
Quantization recipe
Mixed precision, chosen data-drivenly (per-layer outlier/kurtosis scan) and for serving robustness:
Table with columns: Component, Precision| Component | Precision |
|---|
Dense MLP gate/up/down_proj (75 layers) | NVFP4 — W4A4, tensor_group group-size 16, FP8-E4M3 scales |
Full-attention q/k/v/o_proj (24 layers) | FP8 — W8A8, block 128×128 weights / group-128 dynamic activations |
Gated-DeltaNet linear_attn.* (all 72 layers) | BF16 (fp32-sensitive recurrence; most outlier-heavy) |
lm_head, embed_tokens, vision tower | BF16 |
| MLP of layers (endpoints + flagged seams/outlier cluster) |
Calibration: 768 samples @ 4096 tokens, domain-matched mix — long-CoT reasoning with literal
<think> traces (OpenThoughts, reformatted), code (Magicoder-Evol-Instruct, CodeAlpaca), general chat
(UltraChat), and an uncensored slice (Dolphin) — to preserve code, reasoning, and the model's
thinking/uncensored behavior.
Serving (vLLM)
vllm serve kyaky/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NVFP4 \
--max-model-len 16384 --reasoning-parser qwen3 --trust-remote-code
Served text-only (the vision tower is kept BF16 and unused). Fits on a single 96 GB card.
Recommended sampling (from the base model card):
- Thinking (general):
temp 1.0, top_p 0.95, top_k 20
- Thinking (coding):
temp 0.6, top_p 0.95, top_k 20
- Non-thinking:
temp 0.7, top_p 0.80, top_k 20, presence_penalty 1.5
- On low-quant looping, set
repetition_penalty 1.05–1.1 or add a short system prompt.
Notes & limitations
- Thinking is on by default (
<think>…</think>); pass chat_template_kwargs={"enable_thinking": false} to disable.
- Single-stream throughput is modest (dense 40B with BF16 Gated-DeltaNet layers) — this build optimizes for quality, not speed.
- For tool-calling the base author suggests higher-bit quants; NVFP4 (4-bit MLP) may be weaker there.
Credits
- Base model & training: DavidAU
- Architecture: Qwen (Qwen3.5/3.6 hybrid), Apache-2.0
- Quantization: llm-compressor / compressed-tensors. NVFP4 self-quant + evaluation by kyaky.