Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Quantization Notes (updated 2026-05-27)
This release uses the mse observer (mean-squared-error reconstruction) rather than the previous memoryless_minmax. The earlier memoryless_minmax build exhibited a deterministic typo class on certain low-margin token choices (rare/internal-vocabulary identifiers) at temp=0, even with thinking-mode enabled — verified clean on BF16 base, broken on the prior quant. Switching to mse resolves the canary: 0/12 typos on the fidelity bench prompt that previously failed 6/6.
Recipe (single-variable change from prior version):
| Field | This version | Prior version |
|---|---|---|
| observer | mse | memoryless_minmax |
| group_size | 128 | 128 |
| num_bits | 4 (INT) | 4 (INT) |
| symmetric | False | False |
| calibration | NousResearch/hermes-function-calling-v1 (func_calling_singleturn, 256 samples) | identical |
| ignore-list | 471-entry (same FP-keep policy) | identical |
Only the observer changed. All other recipe knobs and the calibration dataset are unchanged from the prior Hermes-calibrated release. The prior version remains in this repo's git history at the previous commit for comparison.
Qwen3.6-40B DeckardUncensored OpusDistilled — Hermes-Calibrated W4A16-AWQ
Built for 2× V100-PCIE running 1Cat-vLLM, at full native 256k context. The recipe targets the 1Cat fork's SM70TurboMind W4A16 kernel —
group_size=128andpack-quantizedformat are inside its supported set, and the ignore list keeps Qwen3.5's hybrid GatedDeltaNet layers + vision tower out of quantization so the kernel never sees an input it can't handle. No mixed-precision keeps the full 262144-token window viable on V100 — but the model lives close to the VRAM ceiling and requires specific tuning to avoid real-prompt OOMs. See Tuning notes for 2× V100 32 GB below before deploying. Should also work on stock vLLM on Ampere/Hopper/etc.
| Base model | DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking (BF16) |
| Target stack | 1Cat-vLLM on 2× V100-PCIE 32 GB, TP=2 |
| Context window | 262144 tokens (full native 256k) — Qwen3.6's hybrid attention + fp8_e5m2 KV cache let the full window fit on 2× V100 |
| Quantization | AWQ pack-quantized W4A16, compressed-tensors format |
| Group size | 128 |
| Observer | mse |
| Mixed-precision | none — uniform INT4 (with the standard ignore set) |
| Calibration data | NousResearch/hermes-function-calling-v1 — func_calling_singleturn subset, 256 samples, MAX_SEQ_LEN=2048 |
| Output size | 37.3 GB |
What this is
A 4-bit AWQ quantization of the DavidAU Deckard-40B model (an Opus-style distill of Qwen3.6-40B), produced with llm-compressor. Calibration data is the Hermes function-calling v1 corpus — chosen for activation-distribution coverage of ChatML tool-call dialogs.
This is the HermesCalibrated variant of a small family of quants that hold every other quantization knob fixed (group_size=128, MSE observer, no mixed-precision, no SpinQuant) and vary only the calibration dataset, in order to isolate the effect of calibration choice on agentic tool-use quality. See "Bench results" below.
Recipe
Hybrid-architecture aware AWQ — the ignore list excludes Qwen3.5's 72 GatedDeltaNet linear-attention layers, the entire vision tower, MTP head, MoE gates, and lm_head. Without those exclusions the quant breaks the model. See recipe.yaml for the exact AWQModifier config.
python
recipe = AWQModifier(targets=["Linear"],config_groups={"group_0": {"targets": ["Linear"],"weights": {"num_bits": 4,"group_size": 128,"strategy": "group","symmetric": False,"dynamic": False,"observer": "mse","type": "int",},},},ignore=["re:.*lm_head", "re:visual.*", "re:model.visual.*","re:.*mlp.gate$", "re:.*embed_tokens$", "re:.*shared_expert_gate$","re:.*linear_attn.*", "re:.*mtp.*",],)
Calibration loop: oneshot(model, recipe, dataset=ds, max_seq_length=2048, num_calibration_samples=256, sequential_targets=["Qwen3_5DecoderLayer"], moe_calibrate_all_experts=True). Each Hermes conversation is rendered through processor.apply_chat_template with role mapping gpt→assistant, human→user.
Hardware compatibility
Built for V100 / SM70TurboMind serving via the 1Cat-vLLM fork. The pack-quantized format + group_size=128 falls inside the SM70 kernel's supported set ({32, 64, 128}). Inference also works on Ampere/Hopper/etc. via stock vLLM.
Tested serving: 2× V100-PCIE (32 GB), TP=2, --kv-cache-dtype fp8_e5m2, --attention-backend FLASH_ATTN_V100. Supports up to the model's full native 262144-token (256k) context window on this hardware — Qwen3.6-40B's hybrid architecture (24 attention layers × 4 KV heads × 256 head_dim, plus 72 constant-size GatedDeltaNet state caches) keeps KV growth modest, and fp8_e5m2 halves it again, so at full 256k the KV cache lands ~12.6 GB. The catch: even though the static profile fits, the model lives at ~97% VRAM under load and requires the tuning in the next section to avoid real-prompt OOMs. Startup will succeed even when inference will not.
Tuning notes for 2× V100 32 GB
The model lives close to the VRAM ceiling at long context. Without the tuning below, inference will OOM even when startup succeeds — vLLM's static profile passes, but PyTorch's caching allocator fragments under burst activation allocation and exceeds the leftover headroom.
Required environment variables
bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True # MANDATORY for 40B-class W4A16 on V100NCCL_P2P_DISABLE=1 # MANDATORY for PHB/SYS topology (cross-socket V100s without NVLink)VLLM_WORKER_MULTIPROC_METHOD=spawn # recommendedCUDA_VISIBLE_DEVICES=0,1 # pin to the GPUs you want — matters if other workloads share the box
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True is the single biggest knob. Without it, the PyTorch caching allocator fragments under burst allocation and OOMs even with the same nominal VRAM budget. Triton OOMs at high --max-num-batched-tokens are the textbook symptom.
Required vLLM args
bash
--kv-cache-dtype fp8_e5m2 # V100 cannot do fp8_e4m3--attention-backend FLASH_ATTN_V100--compilation-config '{"cudagraph_mode":"piecewise","cudagraph_capture_sizes":[1,2,4]}' # lean — full_and_piecewise eats more VRAM--max-num-batched-tokens 8192
cudagraph_mode=piecewise (not full_and_piecewise) and capture sizes [1,2,4] (not larger) save meaningful activation memory. Don't disable --kv-cache-auto-trim-ratio — let blocks recycle.
Context window ladder
| Context | gpu-memory-utilization | max-num-seqs | Notes |
|---|---|---|---|
| 200000 | 0.85 | 4 | Safe default — start here. Comfortable on validated tuning. |
| 240000 | 0.87 | 8 | Tight but workable with full tuning. |
| 262144 | 0.89 | 16 | Bare ceiling. Requires all env vars + lean cudagraph + real-prompt validation. Production setup uses this with prefix caching absorbing burst pressure. |
Push knobs one at a time if you want to climb past 200000. Validate with a real prompt after each step — startup success is not a stability signal.
Validation
Startup success does not mean stability. Validate with a multi-thousand-token prompt (ideally ≥100k tokens) before declaring the config production-ready. The production setup (262144 / util 0.89 / max-num-seqs 16) was validated against a ~200k token prompt with verbatim needle retrieval at ~5% context depth.
OOM during inference (not startup)?
Almost always one of:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:Truemissing--gpu-memory-utilization> 0.85 without the env var setcudagraph_mode=full_and_piecewiseor capture sizes > 4- Another workload sharing one of the TP GPUs (check
nvidia-smifor other processes on your TP devices)
Serving (vLLM)
Safe-default config — start here, then climb the ladder above if your hardware allows.
bash
# Env (in systemd [Service] block, or `export` before launch)PYTORCH_CUDA_ALLOC_CONF=expandable_segments:TrueNCCL_P2P_DISABLE=1VLLM_WORKER_MULTIPROC_METHOD=spawnpython3 -m vllm.entrypoints.openai.api_server \--model philbert440/Qwen3.6-40B-DeckardUncensored-OpusDistilled-HermesCalibrated-W4A16-AWQ \--served-model-name deckard-40b \--tensor-parallel-size 2 \--gpu-memory-utilization 0.85 \--max-model-len 200000 \--max-num-seqs 4 \--max-num-batched-tokens 8192 \--kv-cache-dtype fp8_e5m2 \--attention-backend FLASH_ATTN_V100 \--compilation-config '{"cudagraph_mode":"piecewise","cudagraph_capture_sizes":[1,2,4]}' \--enable-prefix-caching \--enable-auto-tool-choice \--tool-call-parser qwen3_coder \--reasoning-parser qwen3
Replace FLASH_ATTN_V100 with the appropriate backend on non-V100 hardware. Drop --kv-cache-dtype fp8_e5m2 if the deployment target supports fp8_e4m3 (V100/SM70 does not). On non-V100 hardware most of the V100-specific tuning can be relaxed.
Recommended sampling
Inherited from DavidAU's source model guidance, with notes specific to this quant.
General use
temperature: 0.7repetition_penalty: 1.0 (off)top_p: 0.95top_k: 20
Creative use (fiction, dialog)
This quant is the "lower quant" case DavidAU specifically calls out for creative repetition_penalty — apply it here:
temperature: 0.7repetition_penalty: 1.05–1.10top_p: 0.95top_k: 20
Tool-calling / agentic
DavidAU recommends Q5/Q6 minimum quants for tool-use reliability per Qwen guidance. This quant is W4A16 (~Q4 territory), below his floor for tool-calling. The Hermes function-calling calibration mitigates but does not eliminate the gap.
temperature: 0.3–0.5 (lower for more deterministic tool selection)repetition_penalty: 1.0- Server:
--enable-auto-tool-choice --tool-call-parser qwen3_coder
Looping mitigation
If you see token repetition or stalled output (more common at lower quants like this one), per DavidAU:
- Add even a one-sentence system prompt — often fixes it on its own:
Be vivid and precise. - If still looping, bump
repetition_penaltyto 1.05–1.10.
License
Apache-2.0 (inherited from base model).
Acknowledgements
- DavidAU — base Qwen3.6-40B Deckard distillation
- Qwen team — Qwen3.6 base architecture
- NousResearch — Hermes function-calling calibration corpus
- llm-compressor — quantization framework
- 1CatAI/1Cat-vLLM — V100/SM70TurboMind W4A16 serving kernel
Model provider
philbert440
Model tree
Base
philbert440/Qwen3.6-40B-DeckardUncensored-OpusDistilled-W4A16-AWQ
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information