Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Quantization Notes (updated 2026-05-27)

This release uses the mse observer (mean-squared-error reconstruction) rather than the previous memoryless_minmax. The earlier memoryless_minmax build exhibited a deterministic typo class on certain low-margin token choices (rare/internal-vocabulary identifiers) at temp=0, even with thinking-mode enabled — verified clean on BF16 base, broken on the prior quant. Switching to mse resolves the canary: 0/12 typos on the fidelity bench prompt that previously failed 6/6.

Recipe (single-variable change from prior version):

FieldThis versionPrior version
observermsememoryless_minmax
group_size128128
num_bits4 (INT)4 (INT)
symmetricFalseFalse
calibrationNousResearch/hermes-function-calling-v1 (func_calling_singleturn, 256 samples)identical
ignore-list471-entry (same FP-keep policy)identical

Only the observer changed. All other recipe knobs and the calibration dataset are unchanged from the prior Hermes-calibrated release. The prior version remains in this repo's git history at the previous commit for comparison.


Qwen3.6-40B DeckardUncensored OpusDistilled — Hermes-Calibrated W4A16-AWQ

Built for 2× V100-PCIE running 1Cat-vLLM, at full native 256k context. The recipe targets the 1Cat fork's SM70TurboMind W4A16 kernel — group_size=128 and pack-quantized format are inside its supported set, and the ignore list keeps Qwen3.5's hybrid GatedDeltaNet layers + vision tower out of quantization so the kernel never sees an input it can't handle. No mixed-precision keeps the full 262144-token window viable on V100 — but the model lives close to the VRAM ceiling and requires specific tuning to avoid real-prompt OOMs. See Tuning notes for 2× V100 32 GB below before deploying. Should also work on stock vLLM on Ampere/Hopper/etc.

Base modelDavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking (BF16)
Target stack1Cat-vLLM on 2× V100-PCIE 32 GB, TP=2
Context window262144 tokens (full native 256k) — Qwen3.6's hybrid attention + fp8_e5m2 KV cache let the full window fit on 2× V100
QuantizationAWQ pack-quantized W4A16, compressed-tensors format
Group size128
Observermse
Mixed-precisionnone — uniform INT4 (with the standard ignore set)
Calibration dataNousResearch/hermes-function-calling-v1func_calling_singleturn subset, 256 samples, MAX_SEQ_LEN=2048
Output size37.3 GB

What this is

A 4-bit AWQ quantization of the DavidAU Deckard-40B model (an Opus-style distill of Qwen3.6-40B), produced with llm-compressor. Calibration data is the Hermes function-calling v1 corpus — chosen for activation-distribution coverage of ChatML tool-call dialogs.

This is the HermesCalibrated variant of a small family of quants that hold every other quantization knob fixed (group_size=128, MSE observer, no mixed-precision, no SpinQuant) and vary only the calibration dataset, in order to isolate the effect of calibration choice on agentic tool-use quality. See "Bench results" below.

Recipe

Hybrid-architecture aware AWQ — the ignore list excludes Qwen3.5's 72 GatedDeltaNet linear-attention layers, the entire vision tower, MTP head, MoE gates, and lm_head. Without those exclusions the quant breaks the model. See recipe.yaml for the exact AWQModifier config.

python

recipe = AWQModifier(
targets=["Linear"],
config_groups={
"group_0": {
"targets": ["Linear"],
"weights": {
"num_bits": 4,
"group_size": 128,
"strategy": "group",
"symmetric": False,
"dynamic": False,
"observer": "mse",
"type": "int",
},
},
},
ignore=[
"re:.*lm_head", "re:visual.*", "re:model.visual.*",
"re:.*mlp.gate$", "re:.*embed_tokens$", "re:.*shared_expert_gate$",
"re:.*linear_attn.*", "re:.*mtp.*",
],
)

Calibration loop: oneshot(model, recipe, dataset=ds, max_seq_length=2048, num_calibration_samples=256, sequential_targets=["Qwen3_5DecoderLayer"], moe_calibrate_all_experts=True). Each Hermes conversation is rendered through processor.apply_chat_template with role mapping gpt→assistant, human→user.

Hardware compatibility

Built for V100 / SM70TurboMind serving via the 1Cat-vLLM fork. The pack-quantized format + group_size=128 falls inside the SM70 kernel's supported set ({32, 64, 128}). Inference also works on Ampere/Hopper/etc. via stock vLLM.

Tested serving: 2× V100-PCIE (32 GB), TP=2, --kv-cache-dtype fp8_e5m2, --attention-backend FLASH_ATTN_V100. Supports up to the model's full native 262144-token (256k) context window on this hardware — Qwen3.6-40B's hybrid architecture (24 attention layers × 4 KV heads × 256 head_dim, plus 72 constant-size GatedDeltaNet state caches) keeps KV growth modest, and fp8_e5m2 halves it again, so at full 256k the KV cache lands ~12.6 GB. The catch: even though the static profile fits, the model lives at ~97% VRAM under load and requires the tuning in the next section to avoid real-prompt OOMs. Startup will succeed even when inference will not.

Tuning notes for 2× V100 32 GB

The model lives close to the VRAM ceiling at long context. Without the tuning below, inference will OOM even when startup succeeds — vLLM's static profile passes, but PyTorch's caching allocator fragments under burst activation allocation and exceeds the leftover headroom.

Required environment variables

bash

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True # MANDATORY for 40B-class W4A16 on V100
NCCL_P2P_DISABLE=1 # MANDATORY for PHB/SYS topology (cross-socket V100s without NVLink)
VLLM_WORKER_MULTIPROC_METHOD=spawn # recommended
CUDA_VISIBLE_DEVICES=0,1 # pin to the GPUs you want — matters if other workloads share the box

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True is the single biggest knob. Without it, the PyTorch caching allocator fragments under burst allocation and OOMs even with the same nominal VRAM budget. Triton OOMs at high --max-num-batched-tokens are the textbook symptom.

Required vLLM args

bash

--kv-cache-dtype fp8_e5m2 # V100 cannot do fp8_e4m3
--attention-backend FLASH_ATTN_V100
--compilation-config '{"cudagraph_mode":"piecewise","cudagraph_capture_sizes":[1,2,4]}' # lean — full_and_piecewise eats more VRAM
--max-num-batched-tokens 8192

cudagraph_mode=piecewise (not full_and_piecewise) and capture sizes [1,2,4] (not larger) save meaningful activation memory. Don't disable --kv-cache-auto-trim-ratio — let blocks recycle.

Context window ladder

Contextgpu-memory-utilizationmax-num-seqsNotes
2000000.854Safe default — start here. Comfortable on validated tuning.
2400000.878Tight but workable with full tuning.
2621440.8916Bare ceiling. Requires all env vars + lean cudagraph + real-prompt validation. Production setup uses this with prefix caching absorbing burst pressure.

Push knobs one at a time if you want to climb past 200000. Validate with a real prompt after each step — startup success is not a stability signal.

Validation

Startup success does not mean stability. Validate with a multi-thousand-token prompt (ideally ≥100k tokens) before declaring the config production-ready. The production setup (262144 / util 0.89 / max-num-seqs 16) was validated against a ~200k token prompt with verbatim needle retrieval at ~5% context depth.

OOM during inference (not startup)?

Almost always one of:

  1. PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True missing
  2. --gpu-memory-utilization > 0.85 without the env var set
  3. cudagraph_mode=full_and_piecewise or capture sizes > 4
  4. Another workload sharing one of the TP GPUs (check nvidia-smi for other processes on your TP devices)

Serving (vLLM)

Safe-default config — start here, then climb the ladder above if your hardware allows.

bash

# Env (in systemd [Service] block, or `export` before launch)
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_P2P_DISABLE=1
VLLM_WORKER_MULTIPROC_METHOD=spawn
python3 -m vllm.entrypoints.openai.api_server \
--model philbert440/Qwen3.6-40B-DeckardUncensored-OpusDistilled-HermesCalibrated-W4A16-AWQ \
--served-model-name deckard-40b \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.85 \
--max-model-len 200000 \
--max-num-seqs 4 \
--max-num-batched-tokens 8192 \
--kv-cache-dtype fp8_e5m2 \
--attention-backend FLASH_ATTN_V100 \
--compilation-config '{"cudagraph_mode":"piecewise","cudagraph_capture_sizes":[1,2,4]}' \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3

Replace FLASH_ATTN_V100 with the appropriate backend on non-V100 hardware. Drop --kv-cache-dtype fp8_e5m2 if the deployment target supports fp8_e4m3 (V100/SM70 does not). On non-V100 hardware most of the V100-specific tuning can be relaxed.

Recommended sampling

Inherited from DavidAU's source model guidance, with notes specific to this quant.

General use

  • temperature: 0.7
  • repetition_penalty: 1.0 (off)
  • top_p: 0.95
  • top_k: 20

Creative use (fiction, dialog)

This quant is the "lower quant" case DavidAU specifically calls out for creative repetition_penalty — apply it here:

  • temperature: 0.7
  • repetition_penalty: 1.05–1.10
  • top_p: 0.95
  • top_k: 20

Tool-calling / agentic

DavidAU recommends Q5/Q6 minimum quants for tool-use reliability per Qwen guidance. This quant is W4A16 (~Q4 territory), below his floor for tool-calling. The Hermes function-calling calibration mitigates but does not eliminate the gap.

  • temperature: 0.3–0.5 (lower for more deterministic tool selection)
  • repetition_penalty: 1.0
  • Server: --enable-auto-tool-choice --tool-call-parser qwen3_coder

Looping mitigation

If you see token repetition or stalled output (more common at lower quants like this one), per DavidAU:

  1. Add even a one-sentence system prompt — often fixes it on its own: Be vivid and precise.
  2. If still looping, bump repetition_penalty to 1.05–1.10.

License

Apache-2.0 (inherited from base model).

Acknowledgements

Model provider

philbert440

Model tree

Base

philbert440/Qwen3.6-40B-DeckardUncensored-OpusDistilled-W4A16-AWQ

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today