Qwen3.6-40B-DeckardUncensored-OpusDistilled-HermesCalibrated-W4A16-AWQ API & Inference Endpoint

Quantization Notes (updated 2026-05-27)

This release uses the mse observer (mean-squared-error reconstruction) rather than the previous memoryless_minmax. The earlier memoryless_minmax build exhibited a deterministic typo class on certain low-margin token choices (rare/internal-vocabulary identifiers) at temp=0, even with thinking-mode enabled — verified clean on BF16 base, broken on the prior quant. Switching to mse resolves the canary: 0/12 typos on the fidelity bench prompt that previously failed 6/6.

Recipe (single-variable change from prior version):

Table with columns: Field, This version, Prior version
Field	This version	Prior version
observer	`mse`	`memoryless_minmax`
group_size	128	128
num_bits	4 (INT)	4 (INT)
symmetric	False	False
calibration	`NousResearch/hermes-function-calling-v1` (`func_calling_singleturn`, 256 samples)	identical
ignore-list	471-entry (same FP-keep policy)	identical

Only the observer changed. All other recipe knobs and the calibration dataset are unchanged from the prior Hermes-calibrated release. The prior version remains in this repo's git history at the previous commit for comparison.

Qwen3.6-40B DeckardUncensored OpusDistilled — Hermes-Calibrated W4A16-AWQ

Built for 2× V100-PCIE running 1Cat-vLLM, at full native 256k context. The recipe targets the 1Cat fork's SM70TurboMind W4A16 kernel — group_size=128 and pack-quantized format are inside its supported set, and the ignore list keeps Qwen3.5's hybrid GatedDeltaNet layers + vision tower out of quantization so the kernel never sees an input it can't handle. No mixed-precision keeps the full 262144-token window viable on V100 — but the model lives close to the VRAM ceiling and requires specific tuning to avoid real-prompt OOMs. See Tuning notes for 2× V100 32 GB below before deploying. Should also work on stock vLLM on Ampere/Hopper/etc.

Table

Base model	`DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking` (BF16)
Target stack	1Cat-vLLM on 2× V100-PCIE 32 GB, TP=2
Context window	262144 tokens (full native 256k) — Qwen3.6's hybrid attention + `fp8_e5m2` KV cache let the full window fit on 2× V100
Quantization	AWQ pack-quantized W4A16, `compressed-tensors` format

What this is

A 4-bit AWQ quantization of the DavidAU Deckard-40B model (an Opus-style distill of Qwen3.6-40B), produced with llm-compressor. Calibration data is the Hermes function-calling v1 corpus — chosen for activation-distribution coverage of ChatML tool-call dialogs.

This is the HermesCalibrated variant of a small family of quants that hold every other quantization knob fixed (group_size=128, MSE observer, no mixed-precision, no SpinQuant) and vary only the calibration dataset, in order to isolate the effect of calibration choice on agentic tool-use quality. See "Bench results" below.

Recipe

Hybrid-architecture aware AWQ — the ignore list excludes Qwen3.5's 72 GatedDeltaNet linear-attention layers, the entire vision tower, MTP head, MoE gates, and lm_head. Without those exclusions the quant breaks the model. See recipe.yaml for the exact AWQModifier config.

python
recipe = AWQModifier(
    targets=["Linear"],
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "group_size": 128,
                "strategy": "group",
                "symmetric": False,
                "dynamic": False,
                "observer": "mse",
                "type": "int",
            },
        },
    },
    ignore=[
        "re:.*lm_head", "re:visual.*", "re:model.visual.*",
        "re:.*mlp.gate$", "re:.*embed_tokens$", "re:.*shared_expert_gate$",
        "re:.*linear_attn.*", "re:.*mtp.*",
    ],
)

Calibration loop: oneshot(model, recipe, dataset=ds, max_seq_length=2048, num_calibration_samples=256, sequential_targets=["Qwen3_5DecoderLayer"], moe_calibrate_all_experts=True). Each Hermes conversation is rendered through processor.apply_chat_template with role mapping gpt→assistant, human→user.

Hardware compatibility

Built for V100 / SM70TurboMind serving via the 1Cat-vLLM fork. The pack-quantized format + group_size=128 falls inside the SM70 kernel's supported set ({32, 64, 128}). Inference also works on Ampere/Hopper/etc. via stock vLLM.

Tested serving: 2× V100-PCIE (32 GB), TP=2, --kv-cache-dtype fp8_e5m2, --attention-backend FLASH_ATTN_V100. Supports up to the model's full native 262144-token (256k) context window on this hardware — Qwen3.6-40B's hybrid architecture (24 attention layers × 4 KV heads × 256 head_dim, plus 72 constant-size GatedDeltaNet state caches) keeps KV growth modest, and fp8_e5m2 halves it again, so at full 256k the KV cache lands ~12.6 GB. The catch: even though the static profile fits, the model lives at ~97% VRAM under load and requires the tuning in the next section to avoid real-prompt OOMs. Startup will succeed even when inference will not.

Tuning notes for 2× V100 32 GB

The model lives close to the VRAM ceiling at long context. Without the tuning below, inference will OOM even when startup succeeds — vLLM's static profile passes, but PyTorch's caching allocator fragments under burst activation allocation and exceeds the leftover headroom.

Required environment variables

bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # MANDATORY for 40B-class W4A16 on V100
NCCL_P2P_DISABLE=1                                  # MANDATORY for PHB/SYS topology (cross-socket V100s without NVLink)
VLLM_WORKER_MULTIPROC_METHOD=spawn                  # recommended
CUDA_VISIBLE_DEVICES=0,1                            # pin to the GPUs you want — matters if other workloads share the box

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True is the single biggest knob. Without it, the PyTorch caching allocator fragments under burst allocation and OOMs even with the same nominal VRAM budget. Triton OOMs at high --max-num-batched-tokens are the textbook symptom.

Required vLLM args

bash
--kv-cache-dtype fp8_e5m2                                                          # V100 cannot do fp8_e4m3
--attention-backend FLASH_ATTN_V100
--compilation-config '{"cudagraph_mode":"piecewise","cudagraph_capture_sizes":[1,2,4]}'   # lean — full_and_piecewise eats more VRAM
--max-num-batched-tokens 8192

cudagraph_mode=piecewise (not full_and_piecewise) and capture sizes [1,2,4] (not larger) save meaningful activation memory. Don't disable --kv-cache-auto-trim-ratio — let blocks recycle.

Context window ladder

Table with columns: Context, gpu-memory-utilization, max-num-seqs, Notes
Context	`gpu-memory-utilization`	`max-num-seqs`	Notes
200000	0.85	4	Safe default — start here. Comfortable on validated tuning.
240000	0.87	8	Tight but workable with full tuning.
262144	0.89	16	Bare ceiling. Requires all env vars + lean cudagraph + real-prompt validation. Production setup uses this with prefix caching absorbing burst pressure.

Push knobs one at a time if you want to climb past 200000. Validate with a real prompt after each step — startup success is not a stability signal.

Validation

Startup success does not mean stability. Validate with a multi-thousand-token prompt (ideally ≥100k tokens) before declaring the config production-ready. The production setup (262144 / util 0.89 / max-num-seqs 16) was validated against a ~200k token prompt with verbatim needle retrieval at ~5% context depth.

OOM during inference (not startup)?

Almost always one of:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True missing
--gpu-memory-utilization > 0.85 without the env var set
cudagraph_mode=full_and_piecewise or capture sizes > 4
Another workload sharing one of the TP GPUs (check nvidia-smi for other processes on your TP devices)

Serving (vLLM)

Safe-default config — start here, then climb the ladder above if your hardware allows.

bash
# Env (in systemd [Service] block, or `export` before launch)
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_P2P_DISABLE=1
VLLM_WORKER_MULTIPROC_METHOD=spawn

python3 -m vllm.entrypoints.openai.api_server \
  --model philbert440/Qwen3.6-40B-DeckardUncensored-OpusDistilled-HermesCalibrated-W4A16-AWQ \
  --served-model-name deckard-40b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 200000 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 8192 \
  --kv-cache-dtype fp8_e5m2 \
  --attention-backend FLASH_ATTN_V100 \
  --compilation-config '{"cudagraph_mode":"piecewise","cudagraph_capture_sizes":[1,2,4]}' \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3

Replace FLASH_ATTN_V100 with the appropriate backend on non-V100 hardware. Drop --kv-cache-dtype fp8_e5m2 if the deployment target supports fp8_e4m3 (V100/SM70 does not). On non-V100 hardware most of the V100-specific tuning can be relaxed.

Recommended sampling

Inherited from DavidAU's source model guidance, with notes specific to this quant.

General use

temperature: 0.7
repetition_penalty: 1.0 (off)
top_p: 0.95
top_k: 20

Creative use (fiction, dialog)

This quant is the "lower quant" case DavidAU specifically calls out for creative repetition_penalty — apply it here:

temperature: 0.7
repetition_penalty: 1.05–1.10
top_p: 0.95
top_k: 20

Tool-calling / agentic

DavidAU recommends Q5/Q6 minimum quants for tool-use reliability per Qwen guidance. This quant is W4A16 (~Q4 territory), below his floor for tool-calling. The Hermes function-calling calibration mitigates but does not eliminate the gap.

temperature: 0.3–0.5 (lower for more deterministic tool selection)
repetition_penalty: 1.0
Server: --enable-auto-tool-choice --tool-call-parser qwen3_coder

Looping mitigation

If you see token repetition or stalled output (more common at lower quants like this one), per DavidAU:

Add even a one-sentence system prompt — often fixes it on its own: Be vivid and precise.
If still looping, bump repetition_penalty to 1.05–1.10.

License

Apache-2.0 (inherited from base model).

Acknowledgements

DavidAU — base Qwen3.6-40B Deckard distillation
Qwen team — Qwen3.6 base architecture
NousResearch — Hermes function-calling calibration corpus
llm-compressor — quantization framework
1CatAI/1Cat-vLLM — V100/SM70TurboMind W4A16 serving kernel

Serving on V100 / SM70 - 1Cat-vLLM 1.2.1

Serve on 1Cat-vLLM 1.2.1 + the two SM70 patches in 1Cat-vLLM PR #88 (P7: fp8_e5m2 KV on compressed-tensors W4A16; and the fast fp8-KV prefill gather). The complete recommended config - W4A16 + fp8_e5m2 KV + Qwen3.5 MTP speculative decoding on 2x V100, plus the mtp.fc-must-stay-fp16 packed-head recipe - is documented in that PR's docs/v100-sm70-serving.md.

Required on Volta: --attention-backend FLASH_ATTN_V100 and VLLM_SM70_QUANT_BACKEND=turbomind. This is the 40B Deckard variant - set --max-model-len / --gpu-memory-utilization to your VRAM budget (the doc's 27B config is a starting template; the Deckard MTP head runs K=4).

Quantization Notes (updated 2026-05-27)

Recipe (single-variable change from prior version):

Table with columns: Field, This version, Prior version
Field	This version	Prior version
observer	`mse`	`memoryless_minmax`
group_size	128	128
num_bits	4 (INT)	4 (INT)
symmetric	False	False
calibration	`NousResearch/hermes-function-calling-v1` (`func_calling_singleturn`, 256 samples)	identical
ignore-list	471-entry (same FP-keep policy)	identical

Qwen3.6-40B DeckardUncensored OpusDistilled — Hermes-Calibrated W4A16-AWQ

Built for 2× V100-PCIE running 1Cat-vLLM, at full native 256k context. The recipe targets the 1Cat fork's SM70TurboMind W4A16 kernel — group_size=128 and pack-quantized format are inside its supported set, and the ignore list keeps Qwen3.5's hybrid GatedDeltaNet layers + vision tower out of quantization so the kernel never sees an input it can't handle. No mixed-precision keeps the full 262144-token window viable on V100 — but the model lives close to the VRAM ceiling and requires specific tuning to avoid real-prompt OOMs. See Tuning notes for 2× V100 32 GB below before deploying. Should also work on stock vLLM on Ampere/Hopper/etc.

Table

Base model	`DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking` (BF16)
Target stack	1Cat-vLLM on 2× V100-PCIE 32 GB, TP=2
Context window	262144 tokens (full native 256k) — Qwen3.6's hybrid attention + `fp8_e5m2` KV cache let the full window fit on 2× V100
Quantization	AWQ pack-quantized W4A16, `compressed-tensors` format

What this is

Recipe

python
recipe = AWQModifier(
    targets=["Linear"],
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "group_size": 128,
                "strategy": "group",
                "symmetric": False,
                "dynamic": False,
                "observer": "mse",
                "type": "int",
            },
        },
    },
    ignore=[
        "re:.*lm_head", "re:visual.*", "re:model.visual.*",
        "re:.*mlp.gate$", "re:.*embed_tokens$", "re:.*shared_expert_gate$",
        "re:.*linear_attn.*", "re:.*mtp.*",
    ],
)

Hardware compatibility

Tuning notes for 2× V100 32 GB

Required environment variables

bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # MANDATORY for 40B-class W4A16 on V100
NCCL_P2P_DISABLE=1                                  # MANDATORY for PHB/SYS topology (cross-socket V100s without NVLink)
VLLM_WORKER_MULTIPROC_METHOD=spawn                  # recommended
CUDA_VISIBLE_DEVICES=0,1                            # pin to the GPUs you want — matters if other workloads share the box

Required vLLM args

bash
--kv-cache-dtype fp8_e5m2                                                          # V100 cannot do fp8_e4m3
--attention-backend FLASH_ATTN_V100
--compilation-config '{"cudagraph_mode":"piecewise","cudagraph_capture_sizes":[1,2,4]}'   # lean — full_and_piecewise eats more VRAM
--max-num-batched-tokens 8192

cudagraph_mode=piecewise (not full_and_piecewise) and capture sizes [1,2,4] (not larger) save meaningful activation memory. Don't disable --kv-cache-auto-trim-ratio — let blocks recycle.

Context window ladder

Table with columns: Context, gpu-memory-utilization, max-num-seqs, Notes
Context	`gpu-memory-utilization`	`max-num-seqs`	Notes
200000	0.85	4	Safe default — start here. Comfortable on validated tuning.
240000	0.87	8	Tight but workable with full tuning.
262144	0.89	16	Bare ceiling. Requires all env vars + lean cudagraph + real-prompt validation. Production setup uses this with prefix caching absorbing burst pressure.

Push knobs one at a time if you want to climb past 200000. Validate with a real prompt after each step — startup success is not a stability signal.

Validation

OOM during inference (not startup)?

Almost always one of:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True missing
--gpu-memory-utilization > 0.85 without the env var set
cudagraph_mode=full_and_piecewise or capture sizes > 4
Another workload sharing one of the TP GPUs (check nvidia-smi for other processes on your TP devices)

Serving (vLLM)

Safe-default config — start here, then climb the ladder above if your hardware allows.

bash
# Env (in systemd [Service] block, or `export` before launch)
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_P2P_DISABLE=1
VLLM_WORKER_MULTIPROC_METHOD=spawn

python3 -m vllm.entrypoints.openai.api_server \
  --model philbert440/Qwen3.6-40B-DeckardUncensored-OpusDistilled-HermesCalibrated-W4A16-AWQ \
  --served-model-name deckard-40b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 200000 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 8192 \
  --kv-cache-dtype fp8_e5m2 \
  --attention-backend FLASH_ATTN_V100 \
  --compilation-config '{"cudagraph_mode":"piecewise","cudagraph_capture_sizes":[1,2,4]}' \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3

Recommended sampling

Inherited from DavidAU's source model guidance, with notes specific to this quant.

General use

temperature: 0.7
repetition_penalty: 1.0 (off)
top_p: 0.95
top_k: 20

Creative use (fiction, dialog)

This quant is the "lower quant" case DavidAU specifically calls out for creative repetition_penalty — apply it here:

temperature: 0.7
repetition_penalty: 1.05–1.10
top_p: 0.95
top_k: 20

Tool-calling / agentic

temperature: 0.3–0.5 (lower for more deterministic tool selection)
repetition_penalty: 1.0
Server: --enable-auto-tool-choice --tool-call-parser qwen3_coder

Looping mitigation

If you see token repetition or stalled output (more common at lower quants like this one), per DavidAU:

Add even a one-sentence system prompt — often fixes it on its own: Be vivid and precise.
If still looping, bump repetition_penalty to 1.05–1.10.

License

Apache-2.0 (inherited from base model).

Acknowledgements

DavidAU — base Qwen3.6-40B Deckard distillation
Qwen team — Qwen3.6 base architecture
NousResearch — Hermes function-calling calibration corpus
llm-compressor — quantization framework
1CatAI/1Cat-vLLM — V100/SM70TurboMind W4A16 serving kernel

Qwen3.6-40B-DeckardUncensored-OpusDistilled-HermesCalibrated-W4A16-AWQ

README

Quantization Notes (updated 2026-05-27)

Qwen3.6-40B DeckardUncensored OpusDistilled — Hermes-Calibrated W4A16-AWQ

What this is

Recipe

Hardware compatibility

Tuning notes for 2× V100 32 GB

Required environment variables

Required vLLM args

Context window ladder

Validation

OOM during inference (not startup)?

Serving (vLLM)

Recommended sampling

General use

Creative use (fiction, dialog)

Tool-calling / agentic

Looping mitigation

License

Acknowledgements

Serving on V100 / SM70 - 1Cat-vLLM 1.2.1

Explore FriendliAI today

README

Quantization Notes (updated 2026-05-27)

Qwen3.6-40B DeckardUncensored OpusDistilled — Hermes-Calibrated W4A16-AWQ

What this is

Recipe

Hardware compatibility

Tuning notes for 2× V100 32 GB

Required environment variables

Required vLLM args

Context window ladder

Validation

OOM during inference (not startup)?

Serving (vLLM)

Recommended sampling

General use

Creative use (fiction, dialog)

Tool-calling / agentic

Looping mitigation

License

Acknowledgements

Serving on V100 / SM70 - 1Cat-vLLM 1.2.1