Qwen3.6-35B-A3B-heretic-NVFP4 API & Inference Endpoint

What changed in v2 (2026-04-19)

v1 of this checkpoint had model.language_model.layers.X.* keys remapped to model.layers.X.* so vLLM's text-only Qwen3_5MoeForCausalLM loader would pick them up. That layout was unstable in production — intermittent NaN/crash in the prefix-strip codepath during real chat sessions.

v2 re-quantizes the same source (tvall43/Qwen3.6-35B-A3B-heretic) with AutoModelForImageTextToText, restoring the canonical multimodal architecture (the text path):

Architecture: Qwen3_5MoeForConditionalGeneration (vLLM's canonical class — no registry hack required)
Keys: model.language_model.layers.X.* retained natively (no post-quantization key rewriting)
27-block ViT vision encoder preserved BF16 — but note its tensors were mis-nested as model.language_model.visual.* and silently skip-loaded until the 2026-06-18 vision fix below; image inputs did not work in v2 until that rename
30 linear-attention (Mamba/GDN) layers preserved BF16
All 122,880 per-expert NVFP4 keys (40 layers × 256 experts × 3 projections × 4 quant components)
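
The expert-key count in the last item follows directly from the factors it lists. A minimal arithmetic check (the function and parameter names are illustrative, not taken from the quantization script):

```python
def expected_expert_keys(layers=40, experts=256, projections=3, quant_components=4):
    """Per-expert NVFP4 key count: one key per (layer, expert, projection, component)."""
    return layers * experts * projections * quant_components


print(expected_expert_keys())  # 122880
```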

vLLM serves it via the canonical multimodal class with no prefix-strip code path in the inference hot loop. Result: rock-solid text stability where v1 was crashing on virtually every interaction. (Vision, however, was not yet functional in v2 — the ViT was mis-nested and skip-loaded; see the Vision fix section below, which corrected this on 2026-06-18.)

⚠️ If you cloned v1 of this repo, delete and re-pull. Same URL — v2 commits replaced v1.

Vision fix (2026-06-18) — image inputs now work

v2 (above) preserved the vision tower's BF16 weights, but a defect in the quantization-time ignore regex nested the 27-block ViT one level too deep. The 333 vision tensors were written as model.language_model.visual.* instead of model.visual.*, the layout vLLM's Qwen3_5MoeForConditionalGeneration expects (it builds self.visual at model.visual, a sibling of self.language_model rather than a child). vLLM therefore silently skip-loaded the entire vision tower (WARNING … Parameter visual.* not found in params_dict, skip loading) and ran it uninitialized: text output was unaffected, but any image input produced !!!! garbage.

This revision renames those 333 vision tensors to model.visual.* — a header-only edit, so the NVFP4/BF16 weight data is byte-for-byte identical (no re-quantization; same file size, same data offsets). vLLM now loads the full ViT (0 skip-loads) and image understanding works, validated on aeon-vllm-ultimate:latest: a shapes/colors probe scored 7/7 (all shapes + colors + on-image text) and a richer scene 5/7, with LLM text unchanged. The AEON-7 27B multimodal siblings already used the correct model.visual.* layout and were unaffected by this defect.
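
The rename described above amounts to a prefix swap on tensor names in the safetensors header. The sketch below shows only that mapping; it is not the script used for this release. The sample tensor names and header entries are hypothetical, and rewriting the header inside a real file (including keeping its byte length unchanged) is not covered here.

```python
OLD_PREFIX = "model.language_model.visual."
NEW_PREFIX = "model.visual."


def rename_vision_key(name: str) -> str:
    """Move a mis-nested vision tensor name up to the sibling model.visual.* level."""
    if name.startswith(OLD_PREFIX):
        return NEW_PREFIX + name[len(OLD_PREFIX):]
    return name


def rename_header(header: dict) -> dict:
    """Apply the rename to every key; dtype/shape/offset entries are left untouched."""
    return {rename_vision_key(key): entry for key, entry in header.items()}


# Hypothetical header entries, for illustration only.
sample_header = {
    "model.language_model.visual.blocks.0.attn.qkv.weight": {
        "dtype": "BF16", "shape": [4, 4], "data_offsets": [0, 32]},
    "model.language_model.layers.0.mlp.gate.weight": {
        "dtype": "BF16", "shape": [4, 4], "data_offsets": [32, 64]},
}

for name in rename_header(sample_header):
    print(name)
```

Only names under the old vision prefix change; language-model keys and all data offsets pass through as-is, which is consistent with the weight data staying byte-for-byte identical.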

If you pulled this repo before 2026-06-18 and saw !!!! on image prompts, re-pull — the corrected model.safetensors is live at the same URL (weight data identical; only the vision tensor names changed).

NVFP4-quantized version of tvall43/Qwen3.6-35B-A3B-heretic — an abliterated (decensored, 5/100 refusal rate) Qwen 3.6 35B-A3B Mixture-of-Experts multimodal model with thinking/reasoning capabilities.

Quantized using llmcompressor with the compressed-tensors nvfp4-pack-quantized format. Calibrated with 256 samples from open-platypus over 40 sequential decoder-layer stages. Vision encoder, linear-attention (Mamba/GDN) layers, MoE routers, gates, norms, and lm_head/embed_tokens preserved in BF16.

Designed for deployment on NVIDIA DGX Spark (GB10, Blackwell SM 12.0+) with native FP4 tensor-core support. Pairs with z-lab/Qwen3.6-35B-A3B-DFlash for spec-decode acceptance of 2.7-4.4 mean accepted tokens per target step on greedy workloads.

🚀 Quickstart (copy-paste)

Complete recipe for a DGX Spark / Blackwell box: pull the container, pull this NVFP4 model, pull the DFlash drafter (fresh), then serve with the vetted flags + DFlash spec-decode. No --mamba-block-size is needed for this 35B-A3B body.

bash
# 1. Pull the unified AEON vLLM image (anonymous public GHCR pull)
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2. Pull THIS NVFP4 model (fresh — v2 commits replaced v1 at the same URL)
huggingface-cli download AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4 --local-dir ./aeon-model

# 3. Pull the DFlash drafter, fresh (must be a post 2026-04-19 revision)
huggingface-cli download z-lab/Qwen3.6-35B-A3B-DFlash --local-dir ./aeon-drafter

# 4. Serve (NVFP4 body + DFlash drafter, n=11)
docker run --gpus all --ipc host --network host \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -v ./aeon-model:/model:ro \
  -v ./aeon-drafter:/drafter:ro \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
  --quantization compressed-tensors \
  --attention-backend flash_attn \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.65 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --max-model-len 40960 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 32768 \
  --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":11}'

Required with DFlash: --max-num-seqs and --max-num-batched-tokens must be set (as above) — without them the speculative-decoding scheduler can compute a negative token budget and the server fails to boot (max_num_scheduled_tokens is set to -512…). Raise --max-model-len (e.g. 262144) for full long-context once running; see the production compose below.
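
A launcher script can catch the missing-flag case before the container starts. The following is a minimal pre-flight sketch, assuming the serve arguments are available as a list of strings; `missing_dflash_flags` is a hypothetical helper, not part of vLLM.

```python
REQUIRED_WITH_SPEC_DECODE = ("--max-num-seqs", "--max-num-batched-tokens")


def missing_dflash_flags(argv):
    """Return the scheduler flags that are absent when --speculative-config is set."""
    # Accept both "--flag value" and "--flag=value" spellings.
    present = {arg.split("=", 1)[0] for arg in argv}
    if "--speculative-config" not in present:
        return []
    return [flag for flag in REQUIRED_WITH_SPEC_DECODE if flag not in present]


args = [
    "serve", "/model",
    "--speculative-config",
    '{"method":"dflash","model":"/drafter","num_speculative_tokens":11}',
    "--max-num-seqs", "64",
]
print(missing_dflash_flags(args))
```

An empty list means the argument list is safe to launch with respect to this particular failure mode; it says nothing about other flags.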

The quickstart sets --gpu-memory-utilization to 0.65. On GB10 keep it in the 0.6-0.7 range, since above ~0.8 the shared CPU+GPU pool page-thrashes and stalls the box (0.85 already stalls). Go lower still if you co-locate ASR/TTS/embedding services on the Spark's unified memory, or run high concurrency, fp16 KV, or DFlash. For the full production setup (docker-compose, served-model aliases, 256K --max-model-len, systemd, OpenClaw integration) see the Quick Start (DGX Spark) and Production docker-compose sections below.

Performance Benchmarks

v0.23.0 build — current production image

These are the current measured numbers on ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0 + AEON sm_121a build + DFlash num_speculative_tokens: 11 via z-lab/Qwen3.6-35B-A3B-DFlash), on NVIDIA DGX Spark (GB10, sm_121a, 128 GB unified memory).

The unified image's specific win on this 35B-A3B MoE is clean concurrency scaling to c=64 with no crash (the pre-fix image crashed under concurrent speculative decoding at c≥32 — see What we fixed for the DGX Spark), plus long-context DFlash acceptance that holds ~42–58% out to 32K-token prompts. Aggregate throughput rises from ~75–124 tok/s single-stream to ~430–740 tok/s at c=64 depending on category — DFlash-acceptance-driven: ~610–740 tok/s on structured code/math/reasoning/extraction (which draft well) and ~430–475 tok/s on creative prose/natural-language (lower draft acceptance).

Single-stream (c=1), by category

Greedy (T=0), 256-token output, mixed-domain prompt set. decode_tok_s_p50 / ttft_p50_ms / tpot_p50_ms / prefill (pp_tok_s_p50) / DFlash position-0 acceptance.

Category	🟢 Decode tok/s	TTFT p50	TPOT p50	Prefill (PP)	DFlash accept
Coding	91.7	88 ms	10.9 ms	509 tok/s	32%
Math	123.6	113 ms	8.1 ms	494 tok/s	48%
Reasoning

Aggregate throughput vs concurrency (c=1 → c=64)

Aggregate tok/s across all active streams (median of the per-category levels), 256-token output:

Concurrency	Coding	Math	Reasoning	Prose	Natural language	Extraction / JSON
1	89	118	115	73	89	78
8	360	416	388	252	296	399

Scales cleanly through c=64 with zero errors at every level (the pre-fix image died at c≥32 under speculative decoding). Aggregate is compute-bound on this 35B-active-3B MoE — per-request decode falls as streams divide the GPU, but no stream stalls or crashes.

Re-validated 2026-06-19 on the exact published recipe (a fresh pull of aeon-vllm-ultimate:latest + the corrected weights, DFlash n=11): multimodal vision works end-to-end (7/7 on an image probe, 0 vision skip-loads) and the throughput spread reproduced with zero errors at c=64 — structured categories 705–781 tok/s @ c=64 (peak ~800 tok/s @ c=32, Reasoning), creative-text 430–474 tok/s @ c=64; DFlash acceptance ~46–55% (structured) / ~22–27% (creative) at short context, holding ~40–45% at 16–20K-token context.

Long-context tiers (DFlash acceptance holds)

Measured at ~16K and ~32K input tokens (median prompt_tokens_p50), greedy, 256-token output. The point of the AEON DFlash patches (PR #40898 SWA + PR #41703 prefix-cache safety) is that draft acceptance does not collapse as context grows:

Context tier	Measured prompt (p50)	Decode tok/s (c=1)	TTFT (c=1)	DFlash accept
~16K	16.2K – 20.3K	90.8 – 106.1	3.3 – 4.1 s	41 – 52%
~32K	32.6K – 40.8K	73.0 – 94.4	7.1 – 9.5 s	43 – 58%

Acceptance at 32K (43–58%) is on par with — and on the Extraction/JSON class, above — short-context acceptance, confirming the drafter's sliding-window layers are running as true SWA rather than collapsing past ~2K tokens.

Stock baseline note: there is no stock vanilla-vLLM baseline for this 35B-A3B checkpoint yet. A fresh fully-vanilla re-benchmark (default vLLM, no DFlash, no AEON/sm_121a optimizations) is pending; when it lands, a stock-vs-optimized contrast will be added here. The figures above are all measured on the optimized aeon-vllm-ultimate:latest (vLLM 0.23.0) build. The earlier v1.2-image figures retained below for reference are likewise optimized-path numbers (different image, DFlash k=15), not a vanilla baseline.

⚠️ DFlash speedup is workload-dependent. Greedy reasoning workloads (math, code, extraction) hit the highest acceptance (≥40%); creative / open-ended / sampled prose is lower (~24%) and more variable. Use T=0 for maximum DFlash speedup.

Earlier `vllm-spark-omni-q36:v1.2` reference (DFlash k=15)

The detailed single-stream / concurrent tables below were measured on the earlier ghcr.io/aeon-7/vllm-spark-omni-q36:v1.2 image at DFlash k=15, not on the current aeon-vllm-ultimate:latest at n=11. They are retained as a representative reference for the Spark DFlash path; the current production numbers are the v0.23.0 section above. The unified image's specific gain is clean c=64 concurrency scaling and held long-context acceptance, not these short-context single-stream tok/s.

1. Single-Stream Performance

Best for interactive chat and agentic UX. All measurements greedy (T=0) unless noted.

Decode rate (10 trials, 200-token outputs)

Statistic	tok/s
Median	83.9
p95	127.5
Min	41.1
Max	127.5

Variance reflects DFlash acceptance differences across prompt classes — math/code prompts hit ~125 tok/s with high drafter agreement, more open-ended prompts settle around 60-90 tok/s.

TTFT by prompt length (5 trials per class)

Prompt class	Approx. input tokens	TTFT p50	TTFT p95	TTFT min	Effective prefill
Tiny	2	99 ms	102 ms	98 ms	20 tok/s
Short	7	114 ms	115 ms	110 ms	62 tok/s
Medium	50

Sub-130ms TTFT for any prompt under ~50 tokens — fixed kernel-launch overhead dominates short prefill.

Decode rate by output length (3 trials per length)

Max tokens	Actual tokens (median)	TTFT	Decode rate	Total latency
50	50	113 ms	70.1 tok/s	0.82 s
200	200	112 ms	88.4 tok/s	2.37 s
500	331*	116 ms	115.6 tok/s	4.44 s

* model emitted EOS naturally before hitting max_tokens.

Decode rate increases with output length — DFlash steady-state amortization improves over the first 100-200 tokens once the drafter and target lock into a stable acceptance pattern.

Sampling: greedy vs stochastic (5 trials per mode)

Mode	Decode p50	Decode p95	TTFT p50
Greedy (T=0)	76.5 tok/s	123.0 tok/s	115 ms
Stochastic (T=0.7)	64.8 tok/s	125.4 tok/s	113 ms

15% degradation T=0 → T=0.7. Less dramatic than typical for spec-decode systems — DFlash's drafter remains useful even at moderate sampling. Use T=0 for max DFlash speedup; T=0.7 for diversity.

Long-prompt prefill (RAG / document workloads)

Input tokens	TTFT (≈ prefill)	Prefill rate	Decode rate after prefill
1K	519 ms	1,973 tok/s	48.8 tok/s
4K	2,594 ms	1,579 tok/s	41.1 tok/s
16K	8,007 ms	2,046 tok/s	34.6 tok/s
32K	19,368 ms	1,692 tok/s	23.0 tok/s

Prefill rate plateaus around 2K tok/s due to (a) the drafter prefilling the same context in parallel and (b) Qwen3.6's 30 linear-attention (Mamba/GDN) layers having higher prefill constant factor than parallel softmax attention. Decode-after-prefill drops gracefully (~50% from 1K → 32K context).
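
The prefill-rate column can be reproduced from the TTFT column. This sketch assumes that "1K" means 1,024 tokens (and so on for the other rows) and that TTFT approximates prefill time, as the table header suggests; the function name is illustrative.

```python
def prefill_rate(prompt_tokens: int, ttft_ms: float) -> float:
    """Prefill throughput in tokens per second, treating TTFT as the prefill time."""
    return prompt_tokens / (ttft_ms / 1000.0)


# (input tokens, TTFT in ms) taken from the table above.
rows = [(1024, 519), (4096, 2594), (16384, 8007), (32768, 19368)]
for tokens, ttft_ms in rows:
    print(f"{tokens:>6} tokens: {prefill_rate(tokens, ttft_ms):,.0f} tok/s")
```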

Single-stream summary

Metric	Value
Single-stream decode (200-tok output)	83.9 tok/s median
Decode @ 500-1000 tok output (DFlash steady state)	115-118 tok/s
Short-prompt TTFT	99-128 ms
16K-prompt TTFT	8.0 s
32K-prompt TTFT	19.4 s
Peak prefill throughput	~2,046 tok/s @ 16K prompt
Decode rate with 32K context	23.0 tok/s (53% drop vs short context)

2. Concurrent-Session Performance

Best for agent fleets and multi-user serving. 3 trials per level, median run reported (sorted by aggregate throughput). Mixed prompts, 200-token output, T=0.7 (stochastic — production-realistic), SSE streaming.

Throughput scaling (N concurrent clients, 200-tok output)

Concurrent	Errors	Agg tok/s (median of 3)	Per-req decode p50	Per-req decode min	TTFT p50	TTFT p95
1	0	102.9	109.1	109.1	111 ms	111 ms
2	0	131.3	94.0	68.9	144 ms

Zero errors across the entire concurrent sweep: the 128-concurrent top level alone accounts for 384 requests (3 runs × 128), and all levels combined total 1,200+.

Aggregate throughput plateaus at ~313 tok/s from 64 concurrent onward — that's the GPU's compute wall on this 35B-active-3B MoE with linear-attention KV reads + DFlash drafter overhead. TTFT spikes severely at 128 concurrent (14s p50, 47s p95) because all 128 sequences fit in the scheduler but compute is fully saturated, so each token's worth of work is divided across 128 streams. For latency-sensitive UX, target 16-32 concurrent; for max throughput, use the full 128.

TTFT-only scaling (1-token output, prefill + first-token)

Measures pure scheduler queue contention — critical for agent UX:

Concurrent	TTFT p50	TTFT p95	TTFT min	TTFT max
1	74 ms	75 ms	72 ms	75 ms
4	99 ms	100 ms	97 ms	100 ms
16	249 ms	263 ms	238 ms	263 ms

TTFT stays sub-700ms through 64 concurrent — smooth UX for small agent fleets. Beyond 64, TTFT accumulates queue-wait time as compute is fully consumed.

Concurrent with 1K-token prompts (RAG-style workload)

50-token output with 1,024-token prompts — simulates agents doing document QA or retrieval-augmented responses. Median of 2 runs.

Concurrent	Agg tok/s	TTFT p50	TTFT p95	Decode p50
1	23.1	494 ms	494 ms	44.1
4	39.5	1,673 ms	1,720 ms	24.6
16	47.1

RAG throughput peaks around 50 tok/s at 16-64 concurrent. The aggregate is lower than the short-prompt sweep because each request spends most of its wall-clock in prefill (1K tokens) rather than decode. Use prefix caching if your RAG workload has repeated context blocks — the production compose enables --enable-prefix-caching which can give 5-10× speedup on shared-prefix RAG.

Concurrent-session summary

Metric	Value
Peak aggregate throughput	313.6 tok/s @ 128 concurrent (median of 3 trials)
Scaling from 1 → 128	3.05× throughput (compute-bound — DFlash + 35B MoE saturates GB10 around 64 streams)
Per-request decode @ 128	6.5 tok/s p50, 3.0 min
TTFT @ 64 concurrent	1.07 s p50 (acceptable for agent fleets)
TTFT @ 128 concurrent	14.1 s p50 (queue-bound — useful for batch only)
Error rate across full bench	0.0% (1,200+ requests, conc 1 → 128)
Best concurrency for chat UX	4-16 (per-req 19-48 tok/s, TTFT ≤ ~500 ms)

Key Performance Metrics Summary

Metric	Value
Single-stream decode (200-tok output)	83.9 tok/s median
Single-stream decode @ DFlash steady state	118 tok/s (1000-tok output)
Short-prompt TTFT	99-128 ms
Peak aggregate throughput	313.6 tok/s @ 128 concurrent
TTFT @ 16 concurrent (smooth UX)	501 ms p50
TTFT @ 64 concurrent (still usable)	1.07 s p50
Greedy vs stochastic decode penalty	15% (76.5 → 64.8 tok/s)
(greedy workloads)

Scaling efficiency (200-tok concurrent test)

Concurrency	Throughput gain vs 1-req
1	1.0×
4	1.2×
16	2.2×
64	3.0×
128	3.05×

Scaling is GPU-compute-bound rather than memory-bound: DFlash on a 35B MoE with hybrid linear+full attention saturates the GB10's compute around 64 concurrent. Per-request throughput degrades from 109 tok/s (1-req) to 6.5 tok/s (128-req). A non-spec-decode setup would be expected to scale more linearly but would give up DFlash's single-stream speedup (estimated at ~2-4×; not yet measured against a vanilla baseline for this checkpoint).
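
The gain and penalty figures in this section are simple ratios of the measured values. A small sketch, using numbers quoted above and illustrative function names:

```python
def throughput_gain(aggregate_tok_s: float, single_stream_tok_s: float) -> float:
    """Aggregate throughput relative to the 1-request run."""
    return aggregate_tok_s / single_stream_tok_s


def relative_drop(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# 1 -> 128 concurrent: 102.9 -> 313.6 aggregate tok/s
print(f"gain at 128: {throughput_gain(313.6, 102.9):.2f}x")
# Greedy vs stochastic decode p50: 76.5 -> 64.8 tok/s
print(f"T=0 -> T=0.7 penalty: {relative_drop(76.5, 64.8):.0%}")
# Per-request decode p50: 109.1 -> 6.5 tok/s
print(f"per-request drop at 128: {relative_drop(109.1, 6.5):.0%}")
```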

Test methodology notes

enable_thinking=false — bench disables Qwen3.6's thinking tag for clean decode-rate measurement. Production with thinking on adds reasoning-token overhead before content emission (use max_tokens ≥ 2048 for thinking-enabled requests).
DFlash speedup is workload-dependent — math, code, agentic, and reasoning workloads at T=0 hit the highest acceptance rates. Creative writing or open-ended chat sees lower acceptance.
Mixed-prompt set in concurrent tests: code, math, QA, creative writing, single-line answers — to avoid biasing toward DFlash-friendly prompts.
3 trials per concurrency level for the throughput sweep, median run (by aggregate tok/s) reported. RAG section uses 2 trials.
200-token output as the standard test length (except TTFT-only test which uses 1 token, RAG which uses 50, and decode-by-output which sweeps 50→1000).
Error tracking: 0/1,200+ requests failed across the full test (all sections combined).
Reproducible: bench script at scripts/bench_full.py; raw JSON results at .

⚠️ IMPORTANT REQUIREMENTS

#	Requirement	Why
1	Native Blackwell GPU (SM 10.0+ — B200, GB10, RTX PRO 6000 Blackwell, RTX 5090)	NVFP4 needs hardware FP4 tensor cores
2	vLLM with sm_121a NVFP4 kernels — use `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` (vLLM 0.23.0 source-built for sm_121a + the AEON DFlash stack)	Stock vLLM wheels don't compile FP4 kernels for SM 12.x; the SM121 workarounds aren't all upstream yet
3	`--quantization compressed-tensors` (NOT `modelopt`)	This checkpoint uses llmcompressor's compressed-tensors NVFP4 format
4

Quick Start (DGX Spark, with DFlash spec decode)

bash
# 1. Pull the image (anonymous public GHCR pull — anyone can run this)
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2. Pull both models
sudo mkdir -p /opt/qwen36 && sudo chown $USER:$USER /opt/qwen36
cd /opt/qwen36
export HF_HUB_ENABLE_HF_TRANSFER=1
hf download AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4 --local-dir ./qwen36-nvfp4 &
hf download z-lab/Qwen3.6-35B-A3B-DFlash         --local-dir ./qwen36-dflash &
wait

# 3. Get the production compose file
curl -fsSL \
  https://raw.githubusercontent.com/AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4-DFlash/main/examples/docker-compose.yml \
  -o docker-compose.yml

# 4. Start
docker compose up -d
docker compose logs -f   # wait for "Application startup complete" (~3-5 min)

# 5. Test (use temperature=0 + ≥2048 max_tokens for thinking-enabled requests)
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen36-fast",
    "messages": [{"role":"user","content":"What is 17 × 23? Show your work."}],
    "max_tokens": 2048,
    "temperature": 0
  }'

Full step-by-step (with pre-flight checks, smoke tests, systemd service, OpenClaw integration): github.com/AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4-DFlash/blob/main/docs/dgx-spark-setup.md
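
For scripted smoke tests, the curl call in step 5 can be expressed with the Python standard library. This is a minimal sketch that assumes the server is reachable at http://localhost:8000 and exposes the qwen36-fast alias from the production compose; the helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt, model="qwen36-fast", max_tokens=2048, temperature=0.0):
    """Build the JSON body for /v1/chat/completions (same fields as the curl test)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def send_chat_request(payload, base_url="http://localhost:8000", timeout=300):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


payload = build_chat_request("What is 17 × 23? Show your work.")
print(json.dumps(payload, ensure_ascii=False, indent=2))
# With a running server: reply = send_chat_request(payload)
```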

Production docker-compose (the actual flags that work)

yaml
services:
  vllm:
    image: ghcr.io/aeon-7/aeon-vllm-ultimate:latest   # = :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703
    container_name: vllm-qwen36-heretic
    restart: unless-stopped
    network_mode: host
    # The image ENTRYPOINT is /bin/bash, so use entrypoint: vllm + command: serve ...
    entrypoint: vllm
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_CUDA_ARCH_LIST=12.1a
      - ENABLE_NVFP4_SM100=0
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - NVIDIA_FORWARD_COMPAT=1
      - VLLM_TEST_FORCE_FP8_MARLIN=1     # pins the NVFP4 MoE backend to MARLIN (256×512 expert shape)
    volumes:
      - /opt/qwen36/qwen36-nvfp4:/models/qwen36
      - /opt/qwen36/qwen36-dflash:/models/qwen36-dflash
    command:
      - serve
      - /models/qwen36
      - --served-model-name
      - qwen36-35b-heretic
      - qwen36-fast
      - qwen36-deep
      - --host
      - 0.0.0.0
      - --port
      - "8000"
      - --tensor-parallel-size
      - "1"
      - --dtype
      - auto
      - --quantization
      - compressed-tensors
      - --max-model-len
      - "262144"
      - --max-num-seqs
      - "64"
      - --max-num-batched-tokens
      - "32768"
      - --gpu-memory-utilization
      - "0.70"
      - --enable-chunked-prefill
      - --enable-prefix-caching
      - --trust-remote-code
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder
      - --reasoning-parser
      - qwen3
      - --speculative-config
      - '{"method":"dflash","model":"/models/qwen36-dflash","num_speculative_tokens":11}'
      - --attention-backend
      - flash_attn
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Recipe notes (current build):

--quantization compressed-tensors (NOT modelopt) — this checkpoint uses llmcompressor's NVFP4 format.

--attention-backend flash_attn is required for the DFlash drafter on the body.

num_speculative_tokens: 11 is the validated production default for this 35B-A3B body: draft acceptance, including how well it holds at long context, peaks around n≈10–11 and falls off above that. Do not set --kv-cache-dtype (the non-causal DFlash drafter requires BF16 KV).

No --mamba-block-size is needed on the v0.23.0 image (block_size is now an int upstream; the prior patch_kv_cache_utils carry was dropped).

is NOT required — CUDA graphs run cleanly with the unified image's spec-decode CUDA-graph capture-size alignment patch + the post-2026-04-19 DFlash drafter. Cudagraphs on give ~30% over eager.

What we fixed for the DGX Spark

All AEON models run on one unified container — ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703) — vLLM v0.23.0 built from source for GB10 / sm_121a and merged with the AEON speculative-decoding stack.

Fix	What it does	Why it matters on GB10
DFlash high-concurrency fix (new)	Slices the speculative drafter's KV block-table to the unpadded batch (`block_table[:num_reqs]`)	The drafter previously crashed at ≥32 concurrent requests (padded-vs-unpadded block-table shape mismatch in FlashAttention). Now scales cleanly to c=64. A port of upstream PR #43982, which fixed this for MTP but never for DFlash — present and unfixed even in the prior image.
Triton NVFP4 KV cache (PR #44389)	Software NVFP4 KV-cache path	The only 4-bit KV path on sm_121a (upstream's is hard-gated to B200) → ~3× KV capacity / longer context per GB of unified memory.
DFlash sliding-window attention (PR #40898)	Runs the drafter's SWA layers as true sliding-window	Long-context draft acceptance holds as agent histories grow (43–58% out to 32K tokens here) instead of collapsing past ~2k tokens.
(PR #41703)

The result

Scales to 64 concurrent requests with no crash — the prior image crashed at c≥32 under speculative decoding (the block-table fix above). Aggregate throughput rises to ~430–740 tok/s at c=64 depending on category (acceptance-driven: ~610–740 on structured text, ~430–475 on creative prose/natural-language), zero errors at every level.
Native NVFP4 4-bit compute on Blackwell tensor cores — the speed of 4-bit with near-16-bit accuracy.
Speculative decoding (DFlash) holds high draft acceptance from short prompts (32–48% short-context) through long 16K–32K agent histories (43–58% at 32K), rather than collapsing past ~2K tokens.
No measured single-stream speedup vs a stock vanilla baseline yet — there is no stock baseline for this checkpoint; a fresh fully-vanilla re-benchmark is pending. The unified image's concrete, measured win here is the c=64 concurrency scaling and the held long-context acceptance described above.

Model Architecture

Property	Value
Architecture	`qwen3_5_moe` (multimodal — `Qwen3_5MoeForConditionalGeneration`)
Total params	~35B
Active params	~3B / token
Layers	40 (3× Gated DeltaNet + 1× Gated Attention, repeating ×10)
Hidden	2048
Experts	256 routed + 1 shared, top-8 per token
Vocabulary	248,320

Hybrid Attention

Attention type	Layers	Q/K/V heads	Head dim
Gated DeltaNet (linear, BF16)	30 (3 of every 4)	QK 16, V 32	128
Gated Attention (NVFP4)	10 (1 of every 4)	Q 16, KV 2	256 (rotary 64)
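
The 3:1 layout can be written out as a layer-type list. This sketch assumes the Gated Attention layer is the last of each group of four, which is how the "3× Gated DeltaNet + 1× Gated Attention" description reads; the type labels are illustrative strings, not config values.

```python
def layer_types(num_layers: int = 40, period: int = 4):
    """One label per decoder layer: three linear-attention layers, then one full-attention layer."""
    return [
        "gated_attention" if (index + 1) % period == 0 else "gated_deltanet"
        for index in range(num_layers)
    ]


types = layer_types()
print(types[:8])
print(types.count("gated_deltanet"), "linear-attention layers,",
      types.count("gated_attention"), "full-attention layers")
```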

Quantization Details

Parameter	Value
Tool	llmcompressor
Format	`compressed-tensors` `nvfp4-pack-quantized`
Scheme	NVFP4 (FP4 E2M1 + per-block FP8 e4m3 scales + per-tensor FP32 scales)
Block size	16
Calibration data	`open-platypus` (256 samples)
Calibration seq_len
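
To make the Scheme and Block size rows concrete, here is a simplified sketch of block-wise FP4 E2M1 quantization. It is not llmcompressor's implementation: the per-block scale is kept as a plain Python float rather than rounded to FP8 e4m3, and the per-tensor FP32 scale is omitted. The E2M1 magnitude set is standard; the function names are illustrative.

```python
import math

# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, values) where each value is the nearest signed E2M1 magnitude."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_MAGNITUDES[-1]  # the largest input maps to 6.0
    values = []
    for x in block:
        target = abs(x) / scale
        magnitude = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        values.append(math.copysign(magnitude, x))
    return scale, values


def dequantize_block(scale, values):
    """Reconstruct approximate floats from the block scale and E2M1 values."""
    return [v * scale for v in values]


def bits_per_weight(block_size=BLOCK_SIZE):
    """4-bit value plus one 8-bit scale shared per block (per-tensor scale ignored)."""
    return 4 + 8 / block_size


block = [12.0, -6.0, 1.0] + [0.0] * 13
scale, values = quantize_block(block)
print(scale, values[:3], bits_per_weight())
```

With a block size of 16, the shared 8-bit scale adds half a bit per weight on top of the 4-bit values.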

Quantized layers (NVFP4)

Gated Attention projections: q_proj, k_proj, v_proj, o_proj (10 layers)
MoE experts (256 × 40 layers = 10,240 expert modules): gate_proj, up_proj, down_proj
Shared expert: same projections

Excluded from quantization (kept BF16)

lm_head, embed_tokens — accuracy-critical token projections
*.mlp.gate, *.shared_expert_gate — MoE routing (sparsity-critical)
*.norm.* — all RMSNorm layers
*.visual.* — 27-block ViT vision tower
*.linear_attn.* — 30 Gated DeltaNet (Mamba) layers (small relative to MoE; quantizing them tanks accuracy)

The exact recipe + script that produced this checkpoint is at scripts/qwen36_requant_v2.py.
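
Independently of that script, the exclusion list above can be read as a set of glob patterns over module names. The sketch below matches the patterns exactly as written in this section using Python's fnmatch; the real recipe may use regular expressions or different patterns, and the sample module names are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the "kept BF16" list above.
IGNORE_PATTERNS = (
    "lm_head",
    "embed_tokens",
    "*.mlp.gate",
    "*.shared_expert_gate",
    "*.norm.*",
    "*.visual.*",
    "*.linear_attn.*",
)


def stays_bf16(module_name: str, patterns=IGNORE_PATTERNS) -> bool:
    """True if the module name matches any exclusion pattern (case-sensitive glob)."""
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


# Hypothetical module names, for illustration only.
for name in (
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.3.mlp.gate",
    "model.language_model.layers.3.mlp.experts.7.down_proj",
):
    print(name, "->", "BF16" if stays_bf16(name) else "NVFP4")
```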

Recommended sampling parameters

From the Qwen3.6 model card:

Mode	General	Coding	Math/Reasoning
Thinking	T=1.0, P=0.95, K=20, PP=1.5	T=0.6, P=0.95, K=20, PP=0.0	T=1.0, P=1.0, K=40, PP=2.0
Instruct (no think)	T=0.7, P=0.8, K=20, PP=1.5	—	T=1.0, P=0.95, K=20, PP=1.5

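For maximum DFlash speedup, use T=0 (greedy). Drafter/target agreement drops with sampling: in the sampling benchmark above, moving from T=0 to T=0.7 cost about 15% in median decode rate (76.5 → 64.8 tok/s), and creative or open-ended prompts show lower acceptance than structured ones.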

The production compose registers 3 served-model aliases for the same backend so chat clients can route greedy vs sampled requests separately:

qwen36-fast → intended for greedy/agentic (T=0)
qwen36-deep → intended for creative/sampled (T=0.7)
qwen36-35b-heretic → canonical name

Disable thinking per-request:

json
{"chat_template_kwargs": {"enable_thinking": false}}

Preserve thinking across multi-turn:

json
{"chat_template_kwargs": {"preserve_thinking": true}}

Common gotcha: with thinking enabled (default), Qwen3.6 spends most of its max_tokens budget on <think> reasoning before emitting content. Use max_tokens ≥ 2048 for thinking-enabled requests — lower budgets often produce content: null with finish_reason: "length".
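
The alias routing and thinking controls above can be combined in one request builder. This is a sketch under stated assumptions: the alias choice by temperature mirrors the intended use listed above, the 512-token default for non-thinking requests is an arbitrary illustrative value, and `build_request` is a hypothetical helper.

```python
def pick_alias(temperature: float) -> str:
    """Route greedy requests to qwen36-fast and sampled requests to qwen36-deep."""
    return "qwen36-fast" if temperature == 0 else "qwen36-deep"


def build_request(prompt, temperature=0.0, thinking=True, max_tokens=None):
    """Build a chat-completions body, leaving room for <think> tokens when thinking is on."""
    if max_tokens is None:
        # 2048 follows the guidance above; 512 is an illustrative default only.
        max_tokens = 2048 if thinking else 512
    body = {
        "model": pick_alias(temperature),
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if not thinking:
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body


print(build_request("Summarize this log line.", thinking=False))
print(build_request("Write a short poem.", temperature=0.7))
```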

Hardware Requirements

Tier	GPU	Notes
Target — production-validated	NVIDIA DGX Spark (128 GB unified, GB10 sm_121a)	Full 256K context, validated to c=64 concurrent on `aeon-vllm-ultimate:latest`
Compatible	RTX PRO 6000 Blackwell (96 GB)	What v2 was calibrated on. Same MoE-shape constraint applies (Marlin is still the only NVFP4 MoE backend that accepts 256×512 grouped GEMM regardless of chip); linear NVFP4 path uses CUTLASS native.
Compatible	B200 / GB200	Image rebuild required (SM 10.0, not SM 12.x)
Compatible	RTX 5090 (32 GB)	Reduced context, low concurrency

Files

File	Size	Description
`model-00001-of-00009.safetensors` … `model-00009-of-00009.safetensors`	~22 GB total	NVFP4 quantized weights (~123,724 tensors across 9 shards)
`model.safetensors.index.json`	~5 MB	shard index
`config.json`	~7 KB	Model + quantization config (`Qwen3_5MoeForConditionalGeneration`)
`tokenizer.json`

Disclaimer

THIS IS AN UNCENSORED MODEL. By downloading, accessing, or using this model you expressly assume full and sole responsibility for all outputs generated, all actions taken based on outputs, and compliance with applicable laws. The authors are not responsible for any harmful, illegal, or objectionable content. These tools serve legitimate purposes including security research, red-teaming, content analysis, and creative work. Implement safeguards appropriate to your use case and jurisdiction.

License

Apache 2.0 (inherited from Qwen3.6 base).

Credits

Base model: tvall43/Qwen3.6-35B-A3B-heretic — abliteration via Heretic v1.2.0
Original target: Qwen/Qwen3.6-35B-A3B by Alibaba Tongyi
DFlash drafter: z-lab/Qwen3.6-35B-A3B-DFlash — z-lab (Soroush Mohri et al.)
Quantization tool: llmcompressor — Neural Magic / Red Hat
vLLM build chain: vllm-project/vllm v0.23.0 source-built for sm_121a + the AEON speculative-decoding stack (aeon-vllm-ultimate)

☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

What changed in v2 (2026-04-19)

v2 re-quantizes the same source (tvall43/Qwen3.6-35B-A3B-heretic) with AutoModelForImageTextToText, restoring the canonical multimodal architecture (the text path):

Architecture: Qwen3_5MoeForConditionalGeneration (vLLM's canonical class — no registry hack required)
Keys: model.language_model.layers.X.* retained natively (no post-quantization key rewriting)
27-block ViT vision encoder preserved BF16 — but note its tensors were mis-nested as model.language_model.visual.* and silently skip-loaded until the 2026-06-18 vision fix below; image inputs did not work in v2 until that rename
30 linear-attention (Mamba/GDN) layers preserved BF16
All 122,880 per-expert NVFP4 keys (40 layers × 256 experts × 3 projections × 4 quant components)

⚠️ If you cloned v1 of this repo, delete and re-pull. Same URL — v2 commits replaced v1.

Vision fix (2026-06-18) — image inputs now work

If you pulled this repo before 2026-06-18 and saw !!!! on image prompts, re-pull — the corrected model.safetensors is live at the same URL (weight data identical; only the vision tensor names changed).

🚀 Quickstart (copy-paste)

bash
# 1. Pull the unified AEON vLLM image (anonymous public GHCR pull)
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2. Pull THIS NVFP4 model (fresh — v2 commits replaced v1 at the same URL)
huggingface-cli download AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4 --local-dir ./aeon-model

# 3. Pull the DFlash drafter, fresh (must be a post 2026-04-19 revision)
huggingface-cli download z-lab/Qwen3.6-35B-A3B-DFlash --local-dir ./aeon-drafter

# 4. Serve (NVFP4 body + DFlash drafter, n=11)
docker run --gpus all --ipc host --network host \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -v ./aeon-model:/model:ro \
  -v ./aeon-drafter:/drafter:ro \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
  --quantization compressed-tensors \
  --attention-backend flash_attn \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.65 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --max-model-len 40960 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 32768 \
  --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":11}'

Required with DFlash: --max-num-seqs and --max-num-batched-tokens must be set (as above) — without them the speculative-decoding scheduler can compute a negative token budget and the server fails to boot (max_num_scheduled_tokens is set to -512…). Raise --max-model-len (e.g. 262144) for full long-context once running; see the production compose below.
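A launcher script can catch the missing-flag case before the container starts. The sketch below is a hypothetical pre-flight helper (`missing_dflash_flags` is not a vLLM function); it only inspects the argument list and assumes the flags are spelled as in the command above:

```python
REQUIRED_WITH_DFLASH = ("--max-num-seqs", "--max-num-batched-tokens")


def missing_dflash_flags(argv):
    """Return the required scheduler flags absent from a `vllm serve` argv.

    Only checked when --speculative-config is present. Handles both the
    "--flag value" and "--flag=value" spellings.
    """
    present = {arg.split("=", 1)[0] for arg in argv if arg.startswith("--")}
    if "--speculative-config" not in present:
        return []
    return [flag for flag in REQUIRED_WITH_DFLASH if flag not in present]
```

An empty list means both flags are set; anything else is worth failing on before launch rather than debugging a boot failure.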

On GB10, keep --gpu-memory-utilization in the 0.6-0.7 range (the command above uses 0.65; the production compose uses 0.70): above ~0.8 the shared CPU+GPU pool page-thrashes and stalls the box (even 0.85 stalls). Go lower still if you co-locate ASR/TTS/embedding services on the Spark's unified memory, or run high concurrency, fp16 KV, or DFlash. For the full production setup (docker-compose, served-model aliases, 256K --max-model-len, systemd, OpenClaw integration) see the Quick Start (DGX Spark) and Production docker-compose sections below.
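To see what a given setting leaves for everything else on the box, the split can be worked out directly. This is a rough sketch that assumes the nominal 128 GB unified pool of the DGX Spark; the memory actually visible to vLLM is somewhat lower, and `gb10_memory_plan` is a hypothetical helper:

```python
def gb10_memory_plan(gpu_memory_utilization, total_gb=128.0):
    """Split the unified pool into the vLLM share and what is left over."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    vllm_gb = total_gb * gpu_memory_utilization
    return {
        "vllm_gb": round(vllm_gb, 1),
        "remaining_gb": round(total_gb - vllm_gb, 1),
        # 0.6-0.7 is the band recommended above for GB10.
        "in_recommended_band": 0.6 <= gpu_memory_utilization <= 0.7,
        # Above ~0.8 the shared pool is reported to page-thrash.
        "stall_risk": gpu_memory_utilization > 0.8,
    }
```

At 0.65 this leaves roughly 45 GB of the nominal pool for the OS and co-located services.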

Performance Benchmarks

v0.23.0 build — current production image

Single-stream (c=1), by category

Greedy (T=0), 256-token output, mixed-domain prompt set. decode_tok_s_p50 / ttft_p50_ms / tpot_p50_ms / prefill (pp_tok_s_p50) / DFlash position-0 acceptance.

Table with columns: Category, 🟢 Decode tok/s, TTFT p50, TPOT p50, Prefill (PP), DFlash accept
Category	🟢 Decode tok/s	TTFT p50	TPOT p50	Prefill (PP)	DFlash accept
Coding	91.7	88 ms	10.9 ms	509 tok/s	32%
Math	123.6	113 ms	8.1 ms	494 tok/s	48%
Reasoning
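For readers reproducing these columns, the sketch below derives TTFT, TPOT and decode rate from per-token arrival timestamps of one streamed response. It uses the common definitions (TTFT = time to the first token, TPOT = mean gap between later tokens); the bench script may compute them differently, and `decode_metrics` is a hypothetical helper:

```python
def decode_metrics(request_sent_s, token_arrival_s):
    """Derive TTFT, TPOT and decode rate from token arrival times in seconds."""
    if len(token_arrival_s) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    ttft_s = token_arrival_s[0] - request_sent_s
    # Decode window: first token to last token, spanning len - 1 gaps.
    decode_window_s = token_arrival_s[-1] - token_arrival_s[0]
    tpot_s = decode_window_s / (len(token_arrival_s) - 1)
    return {
        "ttft_ms": ttft_s * 1000.0,
        "tpot_ms": tpot_s * 1000.0,
        "decode_tok_s": 1.0 / tpot_s,
    }
```

With speculative decoding, tokens arrive in bursts, so TPOT here is an average over the whole decode window. The table rows are consistent with decode tok/s ≈ 1000 / TPOT in ms (1000 / 10.9 ≈ 91.7 for Coding, 1000 / 8.1 ≈ 123.5 for Math).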

Aggregate throughput vs concurrency (c=1 → c=64)

Aggregate tok/s across all active streams (median of the per-category levels), 256-token output:

Table with columns: Concurrency, Coding, Math, Reasoning, Prose, Natural language, Extraction / JSON
Concurrency	Coding	Math	Reasoning	Prose	Natural language	Extraction / JSON
1	89	118	115	73	89	78
8	360	416	388	252	296	399

Re-validated 2026-06-19 on the exact published recipe (a fresh pull of aeon-vllm-ultimate:latest + the corrected weights, DFlash n=11). Multimodal vision works end-to-end (7/7 on an image probe, 0 vision skip-loads), and the throughput spread reproduced with zero errors at c=64: structured categories 705–781 tok/s @ c=64 (peak ~800 tok/s @ c=32, Reasoning), creative-text 430–474 tok/s @ c=64. DFlash acceptance was ~46–55% (structured) / ~22–27% (creative) at short context, holding ~40–45% at 16–20K-token context.

Long-context tiers (DFlash acceptance holds)

Table with columns: Context tier, Measured prompt (p50), Decode tok/s (c=1), TTFT (c=1), DFlash accept
Context tier	Measured prompt (p50)	Decode tok/s (c=1)	TTFT (c=1)	DFlash accept
~16K	16.2K – 20.3K	90.8 – 106.1	3.3 – 4.1 s	41 – 52%
~32K	32.6K – 40.8K	73.0 – 94.4	7.1 – 9.5 s	43 – 58%

Stock baseline note: there is no stock vanilla-vLLM baseline for this 35B-A3B checkpoint yet. A fresh fully-vanilla re-benchmark (default vLLM, no DFlash, no AEON/sm_121a optimizations) is pending; when it lands, a stock-vs-optimized contrast will be added here. The figures above are all measured on the optimized aeon-vllm-ultimate:latest (vLLM 0.23.0) build. The earlier v1.2-image figures retained below for reference are likewise optimized-path numbers (different image, DFlash k=15), not a vanilla baseline.

⚠️ DFlash speedup is workload-dependent. Greedy reasoning workloads (math, code, extraction) hit the highest acceptance (≥40%); creative / open-ended / sampled prose is lower (~24%) and more variable. Use T=0 for maximum DFlash speedup.
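To see why acceptance drives the speedup, a textbook speculative-decoding model helps. The sketch below assumes every draft token is accepted independently with the same probability, which is a simplification: DFlash drafts a whole block at once, and the figures above are position-0 acceptance. It illustrates the trend and does not predict the measured tok/s:

```python
def expected_tokens_per_step(acceptance, num_speculative_tokens):
    """Expected tokens emitted per target forward pass under an i.i.d. model.

    Draft tokens are accepted until the first rejection; the target model
    always contributes one token, so the result is between 1 and n + 1.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    n = num_speculative_tokens
    if acceptance == 1.0:
        return float(n + 1)
    # Geometric series: 1 + a + a^2 + ... + a^n
    return (1.0 - acceptance ** (n + 1)) / (1.0 - acceptance)
```

Under this model, n=11 gives about 1.3 tokens per step at 24% acceptance and about 1.9 at 48%, which matches the direction of the greedy-versus-creative gap reported above.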

Earlier `vllm-spark-omni-q36:v1.2` reference (DFlash k=15)

The detailed single-stream / concurrent tables below were measured on the earlier ghcr.io/aeon-7/vllm-spark-omni-q36:v1.2 image at DFlash k=15, not on the current aeon-vllm-ultimate:latest at n=11. They are retained as a representative reference for the Spark DFlash path; the current production numbers are the v0.23.0 section above. The unified image's specific gain is clean c=64 concurrency scaling and held long-context acceptance, not these short-context single-stream tok/s.

1. Single-Stream Performance

Best for interactive chat and agentic UX. All measurements greedy (T=0) unless noted.

Decode rate (10 trials, 200-token outputs)

Table with columns: Statistic, tok/s
Statistic	tok/s
Median	83.9
p95	127.5
Min	41.1
Max	127.5

Variance reflects DFlash acceptance differences across prompt classes — math/code prompts hit ~125 tok/s with high drafter agreement, more open-ended prompts settle around 60-90 tok/s.

TTFT by prompt length (5 trials per class)

Table with columns: Prompt class, Approx. input tokens, TTFT p50, TTFT p95, TTFT min, Effective prefill
Prompt class	Approx. input tokens	TTFT p50	TTFT p95	TTFT min	Effective prefill
Tiny	2	99 ms	102 ms	98 ms	20 tok/s
Short	7	114 ms	115 ms	110 ms	62 tok/s
Medium	50

Sub-130ms TTFT for any prompt under ~50 tokens — fixed kernel-launch overhead dominates short prefill.

Decode rate by output length (3 trials per length)

Table with columns: Max tokens, Actual tokens (median), TTFT, Decode rate, Total latency
Max tokens	Actual tokens (median)	TTFT	Decode rate	Total latency
50	50	113 ms	70.1 tok/s	0.82 s
200	200	112 ms	88.4 tok/s	2.37 s
500	331*	116 ms	115.6 tok/s	4.44 s

* model emitted EOS naturally before hitting max_tokens.

Decode rate increases with output length — DFlash steady-state amortization improves over the first 100-200 tokens once the drafter and target lock into a stable acceptance pattern.

Sampling: greedy vs stochastic (5 trials per mode)

Table with columns: Mode, Decode p50, Decode p95, TTFT p50
Mode	Decode p50	Decode p95	TTFT p50
Greedy (T=0)	76.5 tok/s	123.0 tok/s	115 ms
Stochastic (T=0.7)	64.8 tok/s	125.4 tok/s	113 ms

15% degradation T=0 → T=0.7. Less dramatic than typical for spec-decode systems — DFlash's drafter remains useful even at moderate sampling. Use T=0 for max DFlash speedup; T=0.7 for diversity.
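The 15% figure is the relative drop in median decode rate between the two rows. A small sketch of that arithmetic (`relative_drop` is a hypothetical helper):

```python
def relative_drop(baseline, value):
    """Fractional drop of `value` relative to `baseline` (0.15 means 15% lower)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - value) / baseline


greedy_p50, stochastic_p50 = 76.5, 64.8  # tok/s, from the table above
print(f"{relative_drop(greedy_p50, stochastic_p50):.1%}")
```

The same calculation applied to the 1K-prompt and 32K-prompt decode rates in the long-prompt table (48.8 and 23.0 tok/s) gives the ~53% drop quoted in the single-stream summary.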

Long-prompt prefill (RAG / document workloads)

Table with columns: Input tokens, TTFT (≈ prefill), Prefill rate, Decode rate after prefill
Input tokens	TTFT (≈ prefill)	Prefill rate	Decode rate after prefill
1K	519 ms	1,973 tok/s	48.8 tok/s
4K	2,594 ms	1,579 tok/s	41.1 tok/s
16K	8,007 ms	2,046 tok/s	34.6 tok/s
32K	19,368 ms	1,692 tok/s	23.0 tok/s
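The prefill-rate column follows from the other two: input tokens divided by TTFT. The sketch below recomputes it, assuming the labels 1K/4K/16K/32K mean 1,024/4,096/16,384/32,768 tokens (an assumption; the table does not state exact token counts):

```python
# (input tokens, TTFT in ms) taken from the table above.
ROWS = [(1024, 519), (4096, 2594), (16384, 8007), (32768, 19368)]


def prefill_rate_tok_s(input_tokens, ttft_ms):
    """Effective prefill throughput, treating TTFT as pure prefill time."""
    return input_tokens / (ttft_ms / 1000.0)


for tokens, ttft_ms in ROWS:
    rate = prefill_rate_tok_s(tokens, ttft_ms)
    print(f"{tokens:>6} tokens  {ttft_ms:>6} ms  {rate:7.0f} tok/s")
```

Under that assumption the recomputed rates round to the table's 1,973 / 1,579 / 2,046 / 1,692 tok/s.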

Single-stream summary

Table with columns: Metric, Value
Metric	Value
Single-stream decode (200-tok output)	83.9 tok/s median
Decode @ 500-1000 tok output (DFlash steady state)	115-118 tok/s
Short-prompt TTFT	99-128 ms
16K-prompt TTFT	8.0 s
32K-prompt TTFT	19.4 s
Peak prefill throughput	~2,046 tok/s @ 16K prompt
Decode rate with 32K context	23.0 tok/s (53% drop vs short context)

2. Concurrent-Session Performance

Throughput scaling (N concurrent clients, 200-tok output)

Table with columns: Concurrent, Errors, Agg tok/s (median of 3), Per-req decode p50, Per-req decode min, TTFT p50, TTFT p95
Concurrent	Errors	Agg tok/s (median of 3)	Per-req decode p50	Per-req decode min	TTFT p50	TTFT p95
1	0	102.9	109.1	109.1	111 ms	111 ms
2	0	131.3	94.0	68.9	144 ms

Zero errors across the concurrent sweep: the 128-concurrency top level alone accounts for 384 requests (3 runs × 128), and including all lower levels the total is 1,200+.

TTFT-only scaling (1-token output, prefill + first-token)

Measures pure scheduler queue contention — critical for agent UX:

Table with columns: Concurrent, TTFT p50, TTFT p95, TTFT min, TTFT max
Concurrent	TTFT p50	TTFT p95	TTFT min	TTFT max
1	74 ms	75 ms	72 ms	75 ms
4	99 ms	100 ms	97 ms	100 ms
16	249 ms	263 ms	238 ms	263 ms

TTFT stays sub-700ms through 64 concurrent — smooth UX for small agent fleets. Beyond 64, TTFT accumulates queue-wait time as compute is fully consumed.

Concurrent with 1K-token prompts (RAG-style workload)

50-token output with 1,024-token prompts — simulates agents doing document QA or retrieval-augmented responses. Median of 2 runs.

Table with columns: Concurrent, Agg tok/s, TTFT p50, TTFT p95, Decode p50
Concurrent	Agg tok/s	TTFT p50	TTFT p95	Decode p50
1	23.1	494 ms	494 ms	44.1
4	39.5	1,673 ms	1,720 ms	24.6
16	47.1

Concurrent-session summary

Table with columns: Metric, Value
Metric	Value
Peak aggregate throughput	313.6 tok/s @ 128 concurrent (median of 3 trials)
Scaling from 1 → 128	3.05× throughput (compute-bound — DFlash + 35B MoE saturates GB10 around 64 streams)
Per-request decode @ 128	6.5 tok/s p50, 3.0 min
TTFT @ 64 concurrent	1.07 s p50 (acceptable for agent fleets)
TTFT @ 128 concurrent	14.1 s p50 (queue-bound — useful for batch only)
Error rate across full bench	0.0% (1,200+ requests, conc 1 → 128)
Best concurrency for chat UX	4-16 (per-req 19-48 tok/s, TTFT up to ~500 ms)

Key Performance Metrics Summary

Table with columns: Metric, Value
Metric	Value
Single-stream decode (200-tok output)	83.9 tok/s median
Single-stream decode @ DFlash steady state	118 tok/s (1000-tok output)
Short-prompt TTFT	99-128 ms
Peak aggregate throughput	313.6 tok/s @ 128 concurrent
TTFT @ 16 concurrent (smooth UX)	501 ms p50
TTFT @ 64 concurrent (still usable)	1.07 s p50
Greedy vs stochastic decode penalty	15% (76.5 → 64.8 tok/s)
(greedy workloads)

Scaling efficiency (200-tok concurrent test)

Table with columns: Concurrency, Throughput gain vs 1-req
Concurrency	Throughput gain vs 1-req
1	1.0×
4	1.2×
16	2.2×
64	3.0×
128	3.05×
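The gain column is aggregate throughput at a given concurrency divided by the single-request aggregate. A minimal sketch using the two endpoints reported above (102.9 tok/s at 1 concurrent, 313.6 tok/s at 128); `scaling_report` is a hypothetical helper:

```python
def scaling_report(single_stream_tok_s, aggregate_tok_s, concurrency):
    """Throughput gain vs one request, and how far it is from linear scaling."""
    if single_stream_tok_s <= 0 or concurrency < 1:
        raise ValueError("need a positive baseline and concurrency >= 1")
    gain = aggregate_tok_s / single_stream_tok_s
    return {
        "gain": gain,
        # 1.0 would mean perfectly linear scaling with concurrency.
        "scaling_efficiency": gain / concurrency,
    }


report = scaling_report(102.9, 313.6, 128)
print(f"gain {report['gain']:.2f}x, efficiency {report['scaling_efficiency']:.1%}")
```

A gain that flattens between 64 and 128 while efficiency keeps falling is what the compute-bound note in the concurrent-session summary describes.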

Test methodology notes

enable_thinking=false — bench disables Qwen3.6's thinking tag for clean decode-rate measurement. Production with thinking on adds reasoning-token overhead before content emission (use max_tokens ≥ 2048 for thinking-enabled requests).
DFlash speedup is workload-dependent — math, code, agentic, and reasoning workloads at T=0 hit the highest acceptance rates. Creative writing or open-ended chat sees lower acceptance.
Mixed-prompt set in concurrent tests: code, math, QA, creative writing, single-line answers — to avoid biasing toward DFlash-friendly prompts.
3 trials per concurrency level for the throughput sweep, median run (by aggregate tok/s) reported. RAG section uses 2 trials.
200-token output as the standard test length (except TTFT-only test which uses 1 token, RAG which uses 50, and decode-by-output which sweeps 50→1000).
Error tracking: 0/1,200+ requests failed across the full test (all sections combined).
Reproducible: bench script at scripts/bench_full.py; raw JSON results at .
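The "median run" rule above can be sketched as follows. This is an illustration only, not the code in scripts/bench_full.py; the record layout and the `agg_tok_s` key are assumptions, and for an even number of runs it picks the lower middle one:

```python
def median_run(runs, key="agg_tok_s"):
    """Return the whole run record whose `key` value is the median."""
    if not runs:
        raise ValueError("no runs to summarize")
    ordered = sorted(runs, key=lambda run: run[key])
    # Middle element for odd counts, lower middle for even counts.
    return ordered[(len(ordered) - 1) // 2]
```

Reporting one whole run, rather than taking the median of each column separately, keeps the aggregate tok/s, TTFT and per-request figures in a row mutually consistent.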

⚠️ IMPORTANT REQUIREMENTS

Table with columns: #, Requirement, Why
#	Requirement	Why
1	Native Blackwell GPU (SM 10.0+ — B200, GB10, RTX PRO 6000 Blackwell, RTX 5090)	NVFP4 needs hardware FP4 tensor cores
2	vLLM with sm_121a NVFP4 kernels — use `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` (vLLM 0.23.0 source-built for sm_121a + the AEON DFlash stack)	Stock vLLM wheels don't compile FP4 kernels for SM 12.x; the SM121 workarounds aren't all upstream yet
3	`--quantization compressed-tensors` (NOT `modelopt`)	This checkpoint uses llmcompressor's compressed-tensors NVFP4 format
4

Quick Start (DGX Spark, with DFlash spec decode)

bash
# 1. Pull the image (anonymous public GHCR pull — anyone can run this)
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2. Pull both models
sudo mkdir -p /opt/qwen36 && sudo chown $USER:$USER /opt/qwen36
cd /opt/qwen36
export HF_HUB_ENABLE_HF_TRANSFER=1
hf download AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4 --local-dir ./qwen36-nvfp4 &
hf download z-lab/Qwen3.6-35B-A3B-DFlash         --local-dir ./qwen36-dflash &
wait

# 3. Get the production compose file
curl -fsSL \
  https://raw.githubusercontent.com/AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4-DFlash/main/examples/docker-compose.yml \
  -o docker-compose.yml

# 4. Start
docker compose up -d
docker compose logs -f   # wait for "Application startup complete" (~3-5 min)

# 5. Test (use temperature=0 + ≥2048 max_tokens for thinking-enabled requests)
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen36-fast",
    "messages": [{"role":"user","content":"What is 17 × 23? Show your work."}],
    "max_tokens": 2048,
    "temperature": 0
  }'

Full step-by-step (with pre-flight checks, smoke tests, systemd service, OpenClaw integration): github.com/AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4-DFlash/blob/main/docs/dgx-spark-setup.md

Production docker-compose (the actual flags that work)

yaml
services:
  vllm:
    image: ghcr.io/aeon-7/aeon-vllm-ultimate:latest   # = :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703
    container_name: vllm-qwen36-heretic
    restart: unless-stopped
    network_mode: host
    # The image ENTRYPOINT is /bin/bash, so use entrypoint: vllm + command: serve ...
    entrypoint: vllm
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_CUDA_ARCH_LIST=12.1a
      - ENABLE_NVFP4_SM100=0
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - NVIDIA_FORWARD_COMPAT=1
      - VLLM_TEST_FORCE_FP8_MARLIN=1     # pins the NVFP4 MoE backend to MARLIN (256×512 expert shape)
    volumes:
      - /opt/qwen36/qwen36-nvfp4:/models/qwen36
      - /opt/qwen36/qwen36-dflash:/models/qwen36-dflash
    command:
      - serve
      - /models/qwen36
      - --served-model-name
      - qwen36-35b-heretic
      - qwen36-fast
      - qwen36-deep
      - --host
      - 0.0.0.0
      - --port
      - "8000"
      - --tensor-parallel-size
      - "1"
      - --dtype
      - auto
      - --quantization
      - compressed-tensors
      - --max-model-len
      - "262144"
      - --max-num-seqs
      - "64"
      - --max-num-batched-tokens
      - "32768"
      - --gpu-memory-utilization
      - "0.70"
      - --enable-chunked-prefill
      - --enable-prefix-caching
      - --trust-remote-code
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder
      - --reasoning-parser
      - qwen3
      - --speculative-config
      - '{"method":"dflash","model":"/models/qwen36-dflash","num_speculative_tokens":11}'
      - --attention-backend
      - flash_attn
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Recipe notes (current build):

--quantization compressed-tensors (NOT modelopt) — this checkpoint uses llmcompressor's NVFP4 format.

--attention-backend flash_attn is required on the body when the DFlash drafter is in use.

num_speculative_tokens: 11 is the validated production default for this 35B-A3B body: acceptance and long-context hold both peak around n≈10–11 and fall off above that. Do not set --kv-cache-dtype (the non-causal DFlash drafter requires BF16 KV).

No --mamba-block-size is needed on the v0.23.0 image (block_size is now an int upstream; the prior patch_kv_cache_utils carry was dropped).

Eager mode is NOT required: CUDA graphs run cleanly with the unified image's spec-decode CUDA-graph capture-size alignment patch + the post-2026-04-19 DFlash drafter. CUDA graphs on give ~30% over eager.
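When the serve command is generated by a script, building the --speculative-config value with a JSON encoder avoids hand-quoting mistakes, and the same script can reject --kv-cache-dtype up front. A sketch under those assumptions; both helpers are hypothetical, and the JSON keys are the ones shown in the compose file above:

```python
import json


def speculative_config(drafter_path, num_speculative_tokens=11):
    """Build the JSON string passed to --speculative-config for DFlash."""
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be >= 1")
    config = {
        "method": "dflash",
        "model": drafter_path,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return json.dumps(config, separators=(",", ":"))


def sets_kv_cache_dtype(argv):
    """True if the argv sets --kv-cache-dtype, which the notes above rule out."""
    return any(
        arg == "--kv-cache-dtype" or arg.startswith("--kv-cache-dtype=")
        for arg in argv
    )
```

The default of 11 mirrors the validated production value; other values are accepted here but are outside what the notes above report as tested.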

What we fixed for the DGX Spark

Table with columns: Fix, What it does, Why it matters on GB10
Fix	What it does	Why it matters on GB10
DFlash high-concurrency fix (new)	Slices the speculative drafter's KV block-table to the unpadded batch (`block_table[:num_reqs]`)	The drafter previously crashed at ≥32 concurrent requests (padded-vs-unpadded block-table shape mismatch in FlashAttention). Now scales cleanly to c=64. A port of upstream PR #43982, which fixed this for MTP but never for DFlash — present and unfixed even in the prior image.
Triton NVFP4 KV cache (PR #44389)	Software NVFP4 KV-cache path	The only 4-bit KV path on sm_121a (upstream's is hard-gated to B200) → ~3× KV capacity / longer context per GB of unified memory.
DFlash sliding-window attention (PR #40898)	Runs the drafter's SWA layers as true sliding-window	Long-context draft acceptance holds as agent histories grow (43–58% out to 32K tokens here) instead of collapsing past ~2k tokens.
(PR #41703)

The result

Scales to 64 concurrent requests with no crash — the prior image crashed at c≥32 under speculative decoding (the block-table fix above). Aggregate throughput rises to ~430–740 tok/s at c=64 depending on category (acceptance-driven: ~610–740 on structured text, ~430–475 on creative prose/natural-language), zero errors at every level.
Native NVFP4 4-bit compute on Blackwell tensor cores — the speed of 4-bit with near-16-bit accuracy.
Speculative decoding (DFlash) holds high draft acceptance from short prompts (32–48% short-context) through long 16K–32K agent histories (43–58% at 32K), rather than collapsing past ~2K tokens.
No measured single-stream speedup vs a stock vanilla baseline yet — there is no stock baseline for this checkpoint; a fresh fully-vanilla re-benchmark is pending. The unified image's concrete, measured win here is the c=64 concurrency scaling and the held long-context acceptance described above.

Model Architecture

Table with columns: Property, Value
Property	Value
Architecture	`qwen3_5_moe` (multimodal — `Qwen3_5MoeForConditionalGeneration`)
Total params	~35B
Active params	~3B / token
Layers	40 (3× Gated DeltaNet + 1× Gated Attention, repeating ×10)
Hidden	2048
Experts	256 routed + 1 shared, top-8 per token
Vocabulary	248,320

Hybrid Attention

Table with columns: Attention type, Layers, Q/K/V heads, Head dim
Attention type	Layers	Q/K/V heads	Head dim
Gated DeltaNet (linear, BF16)	30 (3 of every 4)	QK 16, V 32	128
Gated Attention (NVFP4)	10 (1 of every 4)	Q 16, KV 2	256 (rotary 64)
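The 3:1 interleave can be written out explicitly. The sketch below assumes the Gated Attention layer is the last of each group of four, as the "3× Gated DeltaNet + 1× Gated Attention" ordering suggests; the labels are illustrative and not the strings used in config.json:

```python
from collections import Counter


def layer_types(num_layers=40, attention_every=4):
    """List the attention type of each layer for a repeating 3:1 hybrid stack."""
    return [
        "gated_attention" if (index + 1) % attention_every == 0 else "gated_deltanet"
        for index in range(num_layers)
    ]


print(Counter(layer_types()))
```

This yields 30 Gated DeltaNet layers and 10 Gated Attention layers, matching the table.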

Quantization Details

Table with columns: Parameter, Value
Parameter	Value
Tool	llmcompressor
Format	`compressed-tensors` `nvfp4-pack-quantized`
Scheme	NVFP4 (FP4 E2M1 + per-block FP8 e4m3 scales + per-tensor FP32 scales)
Block size	16
Calibration data	`open-platypus` (256 samples)
Calibration seq_len
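To make the scheme concrete, the sketch below rounds one 16-value block to the FP4 E2M1 grid with a single shared scale. It illustrates the number format only, not llmcompressor's calibration recipe: the block scale is kept as a Python float instead of being rounded to FP8 e4m3, and the per-tensor FP32 scale is left out.

```python
import math

# Magnitudes representable by FP4 E2M1 (sign bit handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def fake_quantize_block(block):
    """Round one block to E2M1 values times a shared scale; return (scale, values)."""
    if len(block) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(block)}")
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * BLOCK_SIZE
    scale = amax / E2M1_MAGNITUDES[-1]  # map the largest magnitude onto 6.0
    values = []
    for x in block:
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        values.append(math.copysign(nearest, x) * scale)
    return scale, values


def bits_per_weight(block_size=BLOCK_SIZE, fp4_bits=4, scale_bits=8):
    """Storage cost per weight: the FP4 code plus its share of the block scale."""
    return fp4_bits + scale_bits / block_size
```

With a block size of 16 and 8-bit block scales, storage comes to 4.5 bits per quantized weight before the per-tensor scale.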

Quantized layers (NVFP4)

Gated Attention projections: q_proj, k_proj, v_proj, o_proj (10 layers)
MoE experts (256 × 40 layers = 10,240 expert modules): gate_proj, up_proj, down_proj
Shared expert: same projections
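The expert counts multiply out as follows. The factor of 4 quantization components per projection is taken from the v2 notes (40 layers × 256 experts × 3 projections × 4 quant components); the component names are not listed there, so none are assumed here:

```python
LAYERS = 40
EXPERTS_PER_LAYER = 256
PROJECTIONS = ("gate_proj", "up_proj", "down_proj")
QUANT_COMPONENTS = 4  # per quantized projection, as stated in the v2 notes

expert_modules = LAYERS * EXPERTS_PER_LAYER
per_expert_keys = expert_modules * len(PROJECTIONS) * QUANT_COMPONENTS

print(expert_modules)   # routed expert modules
print(per_expert_keys)  # per-expert NVFP4 keys in the checkpoint
```

This reproduces the 10,240 expert modules above and the 122,880 per-expert NVFP4 keys quoted in the v2 notes.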

Excluded from quantization (kept BF16)

lm_head, embed_tokens — accuracy-critical token projections
*.mlp.gate, *.shared_expert_gate — MoE routing (sparsity-critical)
*.norm.* — all RMSNorm layers
*.visual.* — 27-block ViT vision tower
*.linear_attn.* — 30 Gated DeltaNet (Mamba) layers (small relative to MoE; quantizing them tanks accuracy)

The exact recipe + script that produced this checkpoint is at scripts/qwen36_requant_v2.py.

Recommended sampling parameters

From the Qwen3.6 model card (T = temperature, P = top_p, K = top_k, PP = presence_penalty):

Table with columns: Mode, General, Coding, Math/Reasoning
Mode	General	Coding	Math/Reasoning
Thinking	T=1.0, P=0.95, K=20, PP=1.5	T=0.6, P=0.95, K=20, PP=0.0	T=1.0, P=1.0, K=40, PP=2.0
Instruct (no think)	T=0.7, P=0.8, K=20, PP=1.5	—	T=1.0, P=0.95, K=20, PP=1.5

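For maximum DFlash speedup: use T=0 (greedy). Drafter ↔ target agreement drops with sampling: in the benchmarks above, greedy structured workloads reach roughly 32-58% acceptance while creative or sampled prose sits around 22-27%, and median decode falls about 15% from T=0 to T=0.7 (76.5 → 64.8 tok/s).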

The production compose registers 3 served-model aliases for the same backend so chat clients can route greedy vs sampled requests separately:

qwen36-fast → intended for greedy/agentic (T=0)
qwen36-deep → intended for creative/sampled (T=0.7)
qwen36-35b-heretic → canonical name

Disable thinking per-request:

json
{"chat_template_kwargs": {"enable_thinking": false}}

Preserve thinking across multi-turn:

json
{"chat_template_kwargs": {"preserve_thinking": true}}

Common gotcha: with thinking enabled (default), Qwen3.6 spends most of its max_tokens budget on <think> reasoning before emitting content. Use max_tokens ≥ 2048 for thinking-enabled requests — lower budgets often produce content: null with finish_reason: "length".
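A client can apply the max_tokens rule and detect the truncated-thinking case mechanically. This sketch builds an OpenAI-style chat request body and inspects a response dict; the helper names and the 512-token default for non-thinking requests are assumptions, while `chat_template_kwargs` and the `qwen36-fast` alias come from the sections above:

```python
def build_chat_request(prompt, model="qwen36-fast", thinking=True, max_tokens=None):
    """Build a /v1/chat/completions body with a thinking-aware token budget."""
    if max_tokens is None:
        # >= 2048 when thinking is on; 512 otherwise is an arbitrary default.
        max_tokens = 2048 if thinking else 512
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    if not thinking:
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body


def thinking_exhausted_budget(response):
    """True if generation hit max_tokens before any content was emitted."""
    choice = response["choices"][0]
    no_content = not choice["message"].get("content")
    return choice.get("finish_reason") == "length" and no_content
```

When `thinking_exhausted_budget` returns True, retrying with a larger max_tokens or with thinking disabled is the usual remedy.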

Hardware Requirements

Table with columns: Tier, GPU, Notes
Tier	GPU	Notes
Target — production-validated	NVIDIA DGX Spark (128 GB unified, GB10 sm_121a)	Full 256K context, validated to c=64 concurrent on `aeon-vllm-ultimate:latest`
Compatible	RTX PRO 6000 Blackwell (96 GB)	What v2 was calibrated on. Same MoE-shape constraint applies (Marlin is still the only NVFP4 MoE backend that accepts 256×512 grouped GEMM regardless of chip); linear NVFP4 path uses CUTLASS native.
Compatible	B200 / GB200	Image rebuild required (SM 10.0, not SM 12.x)
Compatible	RTX 5090 (32 GB)	Reduced context, low concurrency

Files

Table with columns: File, Size, Description
File	Size	Description
`model-00001-of-00009.safetensors` … `model-00009-of-00009.safetensors`	~22 GB total	NVFP4 quantized weights (~123,724 tensors across 9 shards)
`model.safetensors.index.json`	~5 MB	shard index
`config.json`	~7 KB	Model + quantization config (`Qwen3_5MoeForConditionalGeneration`)
`tokenizer.json`

Disclaimer

License

Apache 2.0 (inherited from Qwen3.6 base).

Credits

Base model: tvall43/Qwen3.6-35B-A3B-heretic — abliteration via Heretic v1.2.0
Original target: Qwen/Qwen3.6-35B-A3B by Alibaba Tongyi
DFlash drafter: z-lab/Qwen3.6-35B-A3B-DFlash — z-lab (Soroush Mohri et al.)
Quantization tool: llmcompressor — Neural Magic / Red Hat
vLLM build chain: vllm-project/vllm v0.23.0 source-built for sm_121a + the AEON speculative-decoding stack (aeon-vllm-ultimate)

