Quantization Notes (updated 2026-05-27)
This release uses the mse observer (mean-squared-error reconstruction) rather than the previous memoryless_minmax. The earlier memoryless_minmax build exhibited a deterministic typo class on certain low-margin token choices (rare/internal-vocabulary identifiers) at temp=0, even with thinking-mode enabled — verified clean on BF16 base, broken on the prior quant. Switching to mse resolves the canary: 0/12 typos on the fidelity bench prompt that previously failed 6/6.
Recipe (single-variable change from prior version):
Table with columns: Field, This version, Prior version| Field | This version | Prior version |
|---|
| observer | mse | memoryless_minmax |
| group_size | 128 | 128 |
| num_bits | 4 (INT) | 4 (INT) |
| symmetric | False | False |
| calibration | NousResearch/hermes-function-calling-v1 (func_calling_singleturn, 256 samples) | identical |
| ignore-list | 471-entry (same FP-keep policy) | identical |
Only the observer changed. All other recipe knobs and the calibration dataset are unchanged from the prior Hermes-calibrated release. The prior version remains in this repo's git history at the previous commit for comparison.
Qwen3.6-40B DeckardUncensored OpusDistilled — Hermes-Calibrated W4A16-AWQ
Built for 2× V100-PCIE running 1Cat-vLLM, at full native 256k context. The recipe targets the 1Cat fork's SM70TurboMind W4A16 kernel — group_size=128 and pack-quantized format are inside its supported set, and the ignore list keeps Qwen3.5's hybrid GatedDeltaNet layers + vision tower out of quantization so the kernel never sees an input it can't handle. No mixed-precision keeps the full 262144-token window viable on V100 — but the model lives close to the VRAM ceiling and requires specific tuning to avoid real-prompt OOMs. See Tuning notes for 2× V100 32 GB below before deploying. Should also work on stock vLLM on Ampere/Hopper/etc.
What this is
A 4-bit AWQ quantization of the DavidAU Deckard-40B model (an Opus-style distill of Qwen3.6-40B), produced with llm-compressor. Calibration data is the Hermes function-calling v1 corpus — chosen for activation-distribution coverage of ChatML tool-call dialogs.
This is the HermesCalibrated variant of a small family of quants that hold every other quantization knob fixed (group_size=128, MSE observer, no mixed-precision, no SpinQuant) and vary only the calibration dataset, in order to isolate the effect of calibration choice on agentic tool-use quality. See "Bench results" below.
Recipe
Hybrid-architecture aware AWQ — the ignore list excludes Qwen3.5's 72 GatedDeltaNet linear-attention layers, the entire vision tower, MTP head, MoE gates, and lm_head. Without those exclusions the quant breaks the model. See recipe.yaml for the exact AWQModifier config.
recipe = AWQModifier(
targets=["Linear"],
config_groups={
"group_0": {
"targets": ["Linear"],
"weights": {
"num_bits": 4,
"group_size": 128,
"strategy": "group",
"symmetric": False,
"dynamic": False,
"observer": "mse",
"type": "int",
},
},
},
ignore=[
"re:.*lm_head", "re:visual.*", "re:model.visual.*",
"re:.*mlp.gate$", "re:.*embed_tokens$", "re:.*shared_expert_gate$",
"re:.*linear_attn.*", "re:.*mtp.*",
],
)
Calibration loop: oneshot(model, recipe, dataset=ds, max_seq_length=2048, num_calibration_samples=256, sequential_targets=["Qwen3_5DecoderLayer"], moe_calibrate_all_experts=True). Each Hermes conversation is rendered through processor.apply_chat_template with role mapping gpt→assistant, human→user.
Hardware compatibility
Built for V100 / SM70TurboMind serving via the 1Cat-vLLM fork. The pack-quantized format + group_size=128 falls inside the SM70 kernel's supported set ({32, 64, 128}). Inference also works on Ampere/Hopper/etc. via stock vLLM.
Tested serving: 2× V100-PCIE (32 GB), TP=2, --kv-cache-dtype fp8_e5m2, --attention-backend FLASH_ATTN_V100. Supports up to the model's full native 262144-token (256k) context window on this hardware — Qwen3.6-40B's hybrid architecture (24 attention layers × 4 KV heads × 256 head_dim, plus 72 constant-size GatedDeltaNet state caches) keeps KV growth modest, and fp8_e5m2 halves it again, so at full 256k the KV cache lands ~12.6 GB. The catch: even though the static profile fits, the model lives at ~97% VRAM under load and requires the tuning in the next section to avoid real-prompt OOMs. Startup will succeed even when inference will not.
Tuning notes for 2× V100 32 GB
The model lives close to the VRAM ceiling at long context. Without the tuning below, inference will OOM even when startup succeeds — vLLM's static profile passes, but PyTorch's caching allocator fragments under burst activation allocation and exceeds the leftover headroom.
Required environment variables
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True # MANDATORY for 40B-class W4A16 on V100
NCCL_P2P_DISABLE=1 # MANDATORY for PHB/SYS topology (cross-socket V100s without NVLink)
VLLM_WORKER_MULTIPROC_METHOD=spawn # recommended
CUDA_VISIBLE_DEVICES=0,1 # pin to the GPUs you want — matters if other workloads share the box
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True is the single biggest knob. Without it, the PyTorch caching allocator fragments under burst allocation and OOMs even with the same nominal VRAM budget. Triton OOMs at high --max-num-batched-tokens are the textbook symptom.
Required vLLM args
--kv-cache-dtype fp8_e5m2 # V100 cannot do fp8_e4m3
--attention-backend FLASH_ATTN_V100
--compilation-config '{"cudagraph_mode":"piecewise","cudagraph_capture_sizes":[1,2,4]}' # lean — full_and_piecewise eats more VRAM
--max-num-batched-tokens 8192
cudagraph_mode=piecewise (not full_and_piecewise) and capture sizes [1,2,4] (not larger) save meaningful activation memory. Don't disable --kv-cache-auto-trim-ratio — let blocks recycle.
Context window ladder
Table with columns: Context, gpu-memory-utilization, max-num-seqs, Notes| Context | gpu-memory-utilization | max-num-seqs | Notes |
|---|
| 200000 | 0.85 | 4 | Safe default — start here. Comfortable on validated tuning. |
| 240000 | 0.87 | 8 | Tight but workable with full tuning. |
| 262144 | 0.89 | 16 | Bare ceiling. Requires all env vars + lean cudagraph + real-prompt validation. Production setup uses this with prefix caching absorbing burst pressure. |
Push knobs one at a time if you want to climb past 200000. Validate with a real prompt after each step — startup success is not a stability signal.
Validation
Startup success does not mean stability. Validate with a multi-thousand-token prompt (ideally ≥100k tokens) before declaring the config production-ready. The production setup (262144 / util 0.89 / max-num-seqs 16) was validated against a ~200k token prompt with verbatim needle retrieval at ~5% context depth.
OOM during inference (not startup)?
Almost always one of:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True missing
--gpu-memory-utilization > 0.85 without the env var set
cudagraph_mode=full_and_piecewise or capture sizes > 4
- Another workload sharing one of the TP GPUs (check
nvidia-smi for other processes on your TP devices)
Serving (vLLM)
Safe-default config — start here, then climb the ladder above if your hardware allows.
# Env (in systemd [Service] block, or `export` before launch)
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_P2P_DISABLE=1
VLLM_WORKER_MULTIPROC_METHOD=spawn
python3 -m vllm.entrypoints.openai.api_server \
--model philbert440/Qwen3.6-40B-DeckardUncensored-OpusDistilled-HermesCalibrated-W4A16-AWQ \
--served-model-name deckard-40b \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.85 \
--max-model-len 200000 \
--max-num-seqs 4 \
--max-num-batched-tokens 8192 \
--kv-cache-dtype fp8_e5m2 \
--attention-backend FLASH_ATTN_V100 \
--compilation-config '{"cudagraph_mode":"piecewise","cudagraph_capture_sizes":[1,2,4]}' \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3
Replace FLASH_ATTN_V100 with the appropriate backend on non-V100 hardware. Drop --kv-cache-dtype fp8_e5m2 if the deployment target supports fp8_e4m3 (V100/SM70 does not). On non-V100 hardware most of the V100-specific tuning can be relaxed.
Recommended sampling
Inherited from DavidAU's source model guidance, with notes specific to this quant.
General use
temperature: 0.7
repetition_penalty: 1.0 (off)
top_p: 0.95
top_k: 20
Creative use (fiction, dialog)
This quant is the "lower quant" case DavidAU specifically calls out for creative repetition_penalty — apply it here:
temperature: 0.7
repetition_penalty: 1.05–1.10
top_p: 0.95
top_k: 20
DavidAU recommends Q5/Q6 minimum quants for tool-use reliability per Qwen guidance. This quant is W4A16 (~Q4 territory), below his floor for tool-calling. The Hermes function-calling calibration mitigates but does not eliminate the gap.
temperature: 0.3–0.5 (lower for more deterministic tool selection)
repetition_penalty: 1.0
- Server:
--enable-auto-tool-choice --tool-call-parser qwen3_coder
Looping mitigation
If you see token repetition or stalled output (more common at lower quants like this one), per DavidAU:
- Add even a one-sentence system prompt — often fixes it on its own:
Be vivid and precise.
- If still looping, bump
repetition_penalty to 1.05–1.10.
License
Apache-2.0 (inherited from base model).
Acknowledgements