AEON-7

Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

🚀 Quickstart (DGX Spark / GB10 — DFlash)

Complete copy-paste recipe: pull the container, pull this model, pull the DFlash drafter, then serve. (Fuller deployment options — dedicated-VRAM Blackwell MTP, env vars, compose — are in the Usage section below.)

bash
# 1) Pull the canonical AEON vLLM Ultimate container
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2) Pull THIS model (fresh)
huggingface-cli download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS --local-dir ./aeon-model

# 3) Pull the DFlash drafter (fresh)
huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir ./aeon-drafter

# 4) Serve (ENTRYPOINT is /bin/bash, so pass --entrypoint vllm then serve …)
docker run --rm --gpus all \
  -v ./aeon-model:/model:ro \
  -v ./aeon-drafter:/drafter:ro \
  --entrypoint vllm ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
  --quantization modelopt \
  --mamba-cache-dtype float32 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt '{"image":4,"video":2}' \
  --mm-encoder-tp-mode data \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":10}'

Lower --gpu-memory-utilization (e.g. 0.69) if the host co-runs other services; never exceed 0.88 on DGX Spark unified memory. For dedicated-VRAM Blackwell (MTP via the grafted head, no external drafter) see Usage.

📈 Why this image matters for long-context drafting

The z-lab Qwen3.6-27B DFlash drafter is a sliding-window model — 4 of its 5 layers use sliding-window attention (window 2048). aeon-vllm-ultimate:latest (PR #40898) runs those layers as proper SWA; earlier images ran them as full attention, so drafting collapsed once context grew past ~2048 tokens. PR #41703 additionally makes --enable-prefix-caching corruption-immune with DFlash. Net: long-context drafting holds up; short-context (<2048, one window) is unchanged.

🏆 Live production bench — DFlash n=10 on aeon-vllm-ultimate:latest

Measured on DGX Spark GB10, aeon-vllm-ultimate:latest, DFlash num_speculative_tokens=10. Lead with acceptance (stable across samples), not single-sample tok/s.

Long-context (~9k-token) draft acceptance — this is the headline win:

Table with columns: Image, ~9k-token draft acceptance
Image ~9k-token draft acceptance
pre-fix image (full-attn drafter) 19.7 %
aeon-vllm-ultimate:latest (SWA drafter) 45.0 % (2.3×)

Short-context c=1 acceptance by category (new image, n=10, approximate):

Table with columns: Category, accept
Category accept
Math ~50 %
Reasoning ~50 %
Extraction ~40 %
Coding ~38 %
Natural ~25 %
Prose ~18 %

Short-context throughput is statistically unchanged vs the prior image — the drafter's sliding window only engages past 2048 tokens, so the win is specifically long-context. (Caveat: single / 3-round samples; short-context rankings are within noise. Acceptance is the stable signal — single-sample tok/s is not.)

Table with columns: Image, ~9k-token draft acceptance
Image	~9k-token draft acceptance
pre-fix image (full-attn drafter)	19.7 %
`aeon-vllm-ultimate:latest` (SWA drafter)	45.0 % (2.3×)

Table with columns: Category, accept
Category	accept
Math	~50 %
Reasoning	~50 %
Extraction	~40 %
Coding	~38 %
Natural	~25 %
Prose	~18 %

🙏 Reference recipe credit: The conv1d-preserved NVFP4 + MTP graft pipeline used to build this XS variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config — including the strategic decision to quantize the GDN projection matmuls to NVFP4 while preserving linear_attn.conv1d at BF16 — and the MTP-head graft technique. We adapted the recipe to AEON-Ultimate's abliterated weights and ship both the conv1d-preserved-only XS variant (matching their footprint) and a heavier regular-MTP variant that additionally keeps the projections at BF16. Full credit for the underlying recipe → sakamakismile.

Performance — DGX Spark (v0.23.0, aeon-vllm-ultimate:latest)

Fastest 27B export. NVFP4 MTP-XS body + an external DFlash@10 drafter on ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0) delivers ~42.6 tok/s single-stream and ~340 tok/s aggregate at c=64, with DFlash draft acceptance ~35 % at short context that holds ~45 % at long (~9k) context. This is the body the canonical 27B card was benchmarked on.

Measured on DGX Spark GB10 (sm_121a, unified memory), aeon-vllm-ultimate:latest, NVFP4 body served with an external DFlash drafter at num_speculative_tokens=10. The grafted MTP head ships in this repo but sits unused on the Spark — DFlash wins on unified memory (see hardware routing below).

MTP-XS is the smallest NVFP4 export and the fastest single-stream of the 27B family, at roughly half the memory of the BF16 baseline.

Per-category single-stream (c=1)

Table with columns: Category, Decode (tok/s), TTFT (ms), TPOT (ms), Prefill (tok/s), DFlash accept
Category	Decode (tok/s)	TTFT (ms)	TPOT (ms)	Prefill (tok/s)	DFlash accept
Coding	42.6	141	23.5	318	34.5 %
Math	55.9	248	17.9	246	48.0 %
Reasoning	49.3

Decode speed tracks DFlash acceptance: structured workloads (Extraction, Math, Reasoning) draft well (≈42–49 % accept → 49–57 tok/s); free-form prose drafts less predictably (≈23 % → 31 tok/s). The headline ~42.6 tok/s is the Coding-category single-stream figure.

Aggregate throughput by concurrency

Throughput scales cleanly to c=64 (the DFlash high-concurrency fix below removed the prior c≥32 crash). Aggregate peaks at c=64, topping out around ~340 tok/s (Reasoning category); every category climbs monotonically from c=1 → c=64:

Table with columns: Category, c=1, c=8, c=16, c=32, c=64
Category	c=1	c=8	c=16	c=32	c=64
Coding	42	185	249	262	277
Math	53	221	285	294	303
Reasoning	47	241

Long-context DFlash acceptance

The z-lab DFlash drafter is a sliding-window model (4 of 5 layers use SWA, window 2048). On this image (PR #40898) those layers run as proper SWA, so draft acceptance holds as context grows instead of collapsing past 2k tokens:

Table with columns: Context, DFlash draft acceptance
Context	DFlash draft acceptance
short (c=1, blended)	~35 %
long (~9k tokens)	45.0 %

This is the headline long-context win — acceptance at ~9k tokens is higher than the blended short-context average. (Pre-fix full-attention image collapsed to ~19.7 % at the same context.)

What we fixed for the DGX Spark

All AEON models run on one unified container — ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0 built from source for sm_121a, merged with the AEON speculative-decoding stack). Two fixes matter most for this card:

Unified container. A single sm_121a image (vLLM 0.23.0) replaces the per-model image sprawl — the same build serves every Qwen3.6-27B AEON-Ultimate variant, with the SM120-family CUTLASS NVFP4/FP8 kernels GB10 actually dispatches to.
DFlash high-concurrency fix. The speculative drafter previously crashed at ≥32 concurrent requests (a padded-vs-unpadded KV block-table shape mismatch in FlashAttention). The fix slices the drafter's block-table to the unpadded batch (block_table[:num_reqs]) — a port of upstream PR #43982, which fixed this for MTP but never for DFlash. The c=32 / c=64 columns above are only measurable because of it.

Full optimization writeup (NVFP4 KV cache, DFlash SWA, sm_121a build, unified-memory tuning): see the container repo.

Stock baseline pending. These figures are on the optimized aeon-vllm-ultimate:latest. There is no stock / vanilla-vLLM baseline for this export yet — a fresh fully-vanilla re-bench (default settings, no speculative decoding, no sm_121a optimizations) is pending and will be added when it completes.

What "XS" means — and what it's not

This is the extra-small footprint sibling of -Multimodal-NVFP4-MTP. XS is not "everything to FP4." It is a deliberate, principled split: the heavy GDN matmul projections drop to NVFP4 (where they're bandwidth-bound and FP4 wins big), while the SSM-critical linear_attn.conv1d kernel stays BF16 (where FP4 has documented stability problems on long-context recurrence).

Table with columns: Multimodal-NVFP4-MTP (regular), Multimodal-NVFP4-MTP-XS (this repo)
	Multimodal-NVFP4-MTP (regular)	Multimodal-NVFP4-MTP-XS (this repo)
`linear_attn` projections (`in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj`)	preserved BF16 (~11 GB)	quantized to NVFP4 (~3 GB)
`linear_attn.conv1d` (SSM 1D convolution — recurrence-critical)	preserved BF16	preserved BF16 ✅
`linear_attn` SSM state vectors (, , )

This is a smart, strategic quantization — not a precision compromise. The conv1d preservation matters: the GatedDeltaNet recurrence depends on the 1D convolution behaving numerically like its training distribution, and FP4 quantization of conv1d has been observed to cause drift on long-context inference in community testing. By keeping conv1d BF16 while quantizing the projections (which are bandwidth-limited matmuls where FP4 is a clean win), we get the ~6 GB footprint reduction without sacrificing the part of the model that's actually fragile under quantization. This is the same principle modelopt's NVFP4_DEFAULT_CFG applies by default and the same recipe sakamakismile validated across his Qwen3.6-NVFP4-MTP series (22K+ downloads).

When to pick which:

Pick the regular variant if you have ≥48 GB VRAM. Even the projection weights at BF16 give a small additional safety margin on long-context recurrence stability.
Pick this XS variant if you have 24–32 GB VRAM (RTX 5090, single GPUs without headroom for full BF16 GDN). The conv1d preservation guarantees the SSM recurrence stays numerically stable; the ~6 GB savings buy meaningful KV-cache headroom on tight GPUs.

We ship both because we have the headroom on RTX PRO 6000 / B100/B200 to run the larger, more numerically-conservative version, and several users on tighter cards have asked for the smaller one. Neither variant quantizes linear_attn.conv1d — that would be a different (and not-recommended) variant we have explicitly chosen not to ship.

Variants

Table with columns: Format, Size, Use case
Format	Size	Use case
BF16	51 GB	Full-precision reference weights
NVFP4 (compressed-tensors + DFlash)	26 GB	DGX Spark — DFlash spec decode, validated
Multimodal-NVFP4-MTP	27 GB	RTX PRO 6000 / B100/B200 — MTP, GDN preserved BF16
Text-NVFP4-MTP

What this is

The modelopt-format NVFP4 + MTP variant, multimodal-preserved, with linear_attn projections fully quantized, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).

Specifically:

Body quantized to NVFP4 via nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG. modelopt format, served by vLLM through --quantization modelopt.
Linear-attn / GatedDeltaNet projections quantized to NVFP4 (this is the XS difference). Only linear_attn.conv1d is kept BF16 (modelopt's default). The community has validated this approach on Qwen3.5/3.6-NVFP4 builds with 22K+ downloads on sakamakismile's reference recipes; we re-ran calibration on our abliterated weights and the model serves correctly.
Vision tower preserved BF16 (333 keys) — correct model.visual.* layout. Multimodal weights load; runtime vision inference validation on this image is pending a GPU window.
MTP head grafted from the base Qwen/Qwen3.6-27B checkpoint (15 tensors, BF16, bit-exact verified). Powers --speculative-config '{"method":"qwen3_5_mtp",...}' for self-speculative decoding without a separate drafter.

Why MTP

Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.

Indicative published numbers (sakamakismile's reference recipe on RTX 5090):

Single-stream short prompts at n=3: ~132 tok/s
Single-stream long-form: ~105 tok/s
2-parallel aggregate (256K + KV FP8): ~189-207 tok/s
Mean acceptance length: ~3.0-4.0 (compared to DFlash chains of ~2.0-2.3)

Validated benchmarks of the AEON-Ultimate XS variant land in the GitHub repo once measured.

🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on memory architecture:

Table with columns: Hardware tier, Recommended variant, Why
Hardware tier	Recommended variant	Why
DGX Spark / GB10 (sm_121a, unified memory)	Either: `-NVFP4` (DFlash) (simpler, validated) or this XS body served with `--speculative-config '{"method":"dflash",...}'` (highest measured throughput — see the acceptance bench above)	Spark prefers DFlash regardless of body. On `aeon-vllm-ultimate:latest` with DFlash n=10, long-context (~9k) draft acceptance reaches 45.0 % (2.3× the pre-fix image) — see the live bench above. The grafted MTP head in this repo is unused in that path. Never use `--speculative-config '{"method":"qwen3_5_mtp",...}'` on Spark — MTP loses badly to DFlash on unified memory.
RTX PRO 6000 Blackwell (96 GB dedicated VRAM)	— GDN BF16 for best long-context fidelity, for ~10 % faster decode

Full bench numbers: GitHub repo Performance section.

Usage

vLLM serve — dedicated-VRAM Blackwell (default: MTP via grafted head)

bash
# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS \
  --local-dir ./aeon-ultimate-multimodal-nvfp4-mtp-xs

# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
  --quantization modelopt \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image":4,"video":2}' \
  --mm-encoder-tp-mode data \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

num_speculative_tokens=3 is the canonical setting for qwen3_5_mtp. Higher values diverge the drafter further from the target distribution and acceptance falls.

vLLM serve — DGX Spark (DFlash spec, not MTP — current production recipe)

For DGX Spark, swap the spec method to DFlash. DFlash's block-diffusion drafter is decisively better than MTP's n=3 on unified memory. This is the exact recipe running in production, on the AEON vLLM Ultimate image ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0 + DFlash high-concurrency fix + PR #40898 + PR #41703 + PR #44389).

The image's ENTRYPOINT is /bin/bash, so when launching via docker run you must pass --entrypoint vllm and then serve … (writing IMAGE vllm serve runs bash vllm serve and fails). The vllm serve … arguments are identical either way:

bash
# Pull the DFlash drafter alongside this body
hf download z-lab/Qwen3.6-27B-DFlash --local-dir ./qwen36-27b-dflash

export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

# docker run --rm --gpus all --entrypoint vllm \
#   -v "$PWD":/models ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
#   serve /models/aeon-ultimate-multimodal-nvfp4-mtp-xs \
#   ... (the same flags below) ...

vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
  --quantization modelopt \
  --trust-remote-code \
  --mamba-cache-dtype float32 \
  --limit-mm-per-prompt '{"image":4,"video":2}' \
  --mm-encoder-tp-mode data \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.69 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"dflash","model":"./qwen36-27b-dflash","num_speculative_tokens":10}'

Critical DFlash config rules (learned the hard way):

Use the DEFAULT drafter attention backend — do not add an attention_backend to the spec-config. The default works for Qwen3.6 DFlash on this image (unlike Gemma's DFlash, which needed an explicit flash_attn backend). Leave it out.
Do NOT set --kv-cache-dtype. DFlash's drafter is non-causal (block diffusion) and no vLLM backend supports non-causal + fp8 KV, so KV must stay at default BF16. Forcing fp8 KV will fail to boot.
num_speculative_tokens=10 is the validated default. An n=8/10/12/15 sweep found n=10 best for the Spark voice/chat workload — top aggregate throughput + DFlash acceptance at parity single-stream; the old 12-token setting wins only at very long context (z-lab's published default is 15).
--gpu-memory-utilization 0.69 because this host co-runs Qwen3-ASR (:8001) + Qwen3-TTS (:8002). Keep it ≤ 0.7 when co-hosting; raise toward 0.88 only if vLLM runs alone (the DGX Spark unified-memory cap is 0.88 — never go higher). BF16 KV is 2× fp8, but full 256k context still fits — KV cache holds 487k tokens / 1.86× concurrency at 262,144 ctx.

Why this recipe needs aeon-vllm-ultimate:latest: the z-lab DFlash drafter is a sliding-window model — 4 of its 5 layers use sliding-window attention (window 2048). This image (PR #40898) runs those layers as proper SWA; earlier images ran them as full attention, so drafting collapsed once context grew past ~2048 tokens. PR #41703 additionally makes --enable-prefix-caching corruption-immune with DFlash. Net: long-context drafting holds up; short-context (<2048, one window) is unchanged. See the live acceptance bench at the top (45.0 % @ ~9k vs 19.7 % pre-fix = 2.3×).

Configuration notes

--quantization modelopt is required for this body (not compressed-tensors — different format).
--speculative-config '{"method":"qwen3_5_mtp", ...}' uses the grafted MTP head; correct for dedicated-VRAM Blackwell. Don't use this on DGX Spark.
--speculative-config '{"method":"dflash", ...}' uses an external DFlash drafter; correct for DGX Spark. The grafted MTP head in this repo sits unused in this path (~0.85 GB dead weight). Don't use this on RTX PRO 6000 or B100/B200 — they prefer MTP.
--gpu-memory-utilization 0.94 is the validated cap on RTX PRO 6000; 0.88 is the cap on DGX Spark (unified memory thrashes at 0.90+).

Quantization recipe

Tool: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
Loader: Qwen3_5ForConditionalGeneration.from_pretrained (multimodal-preserved class)
Calibration: neuralmagic/calibration LLM split, 20 samples × 8192 tokens
Excluded from quantization (kept BF16) — XS variant differences from the regular variant in bold:
- lm_head, proj_out.*, *router*, *mlp.gate.* (NVFP4_DEFAULT_CFG)
- *linear_attn.conv1d*,

Provenance & credits

BF16 source: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16. See that card for the full abliteration pipeline.
MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (docs/MTP_GRAFT_RECIPE.md)
Reference benchmark recipes: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
Quantization: NVIDIA TensorRT Model Optimizer (nvidia-modelopt 0.43.0)
Base: Alibaba Qwen team — Qwen/Qwen3.6-27B

License + responsibility

Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.

☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

Model provider

AEON-7

Model tree

Base

AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

🚀 Quickstart (DGX Spark / GB10 — DFlash)

bash
# 1) Pull the canonical AEON vLLM Ultimate container
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2) Pull THIS model (fresh)
huggingface-cli download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS --local-dir ./aeon-model

# 3) Pull the DFlash drafter (fresh)
huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir ./aeon-drafter

# 4) Serve (ENTRYPOINT is /bin/bash, so pass --entrypoint vllm then serve …)
docker run --rm --gpus all \
  -v ./aeon-model:/model:ro \
  -v ./aeon-drafter:/drafter:ro \
  --entrypoint vllm ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
  --quantization modelopt \
  --mamba-cache-dtype float32 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt '{"image":4,"video":2}' \
  --mm-encoder-tp-mode data \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":10}'

Lower --gpu-memory-utilization (e.g. 0.69) if the host co-runs other services; never exceed 0.88 on DGX Spark unified memory. For dedicated-VRAM Blackwell (MTP via the grafted head, no external drafter) see Usage.

📈 Why this image matters for long-context drafting

The z-lab Qwen3.6-27B DFlash drafter is a sliding-window model — 4 of its 5 layers use sliding-window attention (window 2048). aeon-vllm-ultimate:latest (PR #40898) runs those layers as proper SWA; earlier images ran them as full attention, so drafting collapsed once context grew past ~2048 tokens. PR #41703 additionally makes --enable-prefix-caching corruption-immune with DFlash. Net: long-context drafting holds up; short-context (<2048, one window) is unchanged.

🏆 Live production bench — DFlash n=10 on aeon-vllm-ultimate:latest

Measured on DGX Spark GB10, aeon-vllm-ultimate:latest, DFlash num_speculative_tokens=10. Lead with acceptance (stable across samples), not single-sample tok/s.

Long-context (~9k-token) draft acceptance — this is the headline win:

Table with columns: Image, ~9k-token draft acceptance
Image ~9k-token draft acceptance
pre-fix image (full-attn drafter) 19.7 %
aeon-vllm-ultimate:latest (SWA drafter) 45.0 % (2.3×)

Short-context c=1 acceptance by category (new image, n=10, approximate):

Table with columns: Category, accept
Category accept
Math ~50 %
Reasoning ~50 %
Extraction ~40 %
Coding ~38 %
Natural ~25 %
Prose ~18 %

Short-context throughput is statistically unchanged vs the prior image — the drafter's sliding window only engages past 2048 tokens, so the win is specifically long-context. (Caveat: single / 3-round samples; short-context rankings are within noise. Acceptance is the stable signal — single-sample tok/s is not.)

Table with columns: Image, ~9k-token draft acceptance
Image	~9k-token draft acceptance
pre-fix image (full-attn drafter)	19.7 %
`aeon-vllm-ultimate:latest` (SWA drafter)	45.0 % (2.3×)

Table with columns: Category, accept
Category	accept
Math	~50 %
Reasoning	~50 %
Extraction	~40 %
Coding	~38 %
Natural	~25 %
Prose	~18 %

🙏 Reference recipe credit: The conv1d-preserved NVFP4 + MTP graft pipeline used to build this XS variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config — including the strategic decision to quantize the GDN projection matmuls to NVFP4 while preserving linear_attn.conv1d at BF16 — and the MTP-head graft technique. We adapted the recipe to AEON-Ultimate's abliterated weights and ship both the conv1d-preserved-only XS variant (matching their footprint) and a heavier regular-MTP variant that additionally keeps the projections at BF16. Full credit for the underlying recipe → sakamakismile.

Performance — DGX Spark (v0.23.0, aeon-vllm-ultimate:latest)

Fastest 27B export. NVFP4 MTP-XS body + an external DFlash@10 drafter on ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0) delivers ~42.6 tok/s single-stream and ~340 tok/s aggregate at c=64, with DFlash draft acceptance ~35 % at short context that holds ~45 % at long (~9k) context. This is the body the canonical 27B card was benchmarked on.

MTP-XS is the smallest NVFP4 export and the fastest single-stream of the 27B family, at roughly half the memory of the BF16 baseline.

Per-category single-stream (c=1)

Table with columns: Category, Decode (tok/s), TTFT (ms), TPOT (ms), Prefill (tok/s), DFlash accept
Category	Decode (tok/s)	TTFT (ms)	TPOT (ms)	Prefill (tok/s)	DFlash accept
Coding	42.6	141	23.5	318	34.5 %
Math	55.9	248	17.9	246	48.0 %
Reasoning	49.3

Aggregate throughput by concurrency

Table with columns: Category, c=1, c=8, c=16, c=32, c=64
Category	c=1	c=8	c=16	c=32	c=64
Coding	42	185	249	262	277
Math	53	221	285	294	303
Reasoning	47	241

Long-context DFlash acceptance

Table with columns: Context, DFlash draft acceptance
Context	DFlash draft acceptance
short (c=1, blended)	~35 %
long (~9k tokens)	45.0 %

This is the headline long-context win — acceptance at ~9k tokens is higher than the blended short-context average. (Pre-fix full-attention image collapsed to ~19.7 % at the same context.)

What we fixed for the DGX Spark

Unified container. A single sm_121a image (vLLM 0.23.0) replaces the per-model image sprawl — the same build serves every Qwen3.6-27B AEON-Ultimate variant, with the SM120-family CUTLASS NVFP4/FP8 kernels GB10 actually dispatches to.
DFlash high-concurrency fix. The speculative drafter previously crashed at ≥32 concurrent requests (a padded-vs-unpadded KV block-table shape mismatch in FlashAttention). The fix slices the drafter's block-table to the unpadded batch (block_table[:num_reqs]) — a port of upstream PR #43982, which fixed this for MTP but never for DFlash. The c=32 / c=64 columns above are only measurable because of it.

Full optimization writeup (NVFP4 KV cache, DFlash SWA, sm_121a build, unified-memory tuning): see the container repo.

Stock baseline pending. These figures are on the optimized aeon-vllm-ultimate:latest. There is no stock / vanilla-vLLM baseline for this export yet — a fresh fully-vanilla re-bench (default settings, no speculative decoding, no sm_121a optimizations) is pending and will be added when it completes.

What "XS" means — and what it's not

Table with columns: Multimodal-NVFP4-MTP (regular), Multimodal-NVFP4-MTP-XS (this repo)
	Multimodal-NVFP4-MTP (regular)	Multimodal-NVFP4-MTP-XS (this repo)
`linear_attn` projections (`in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj`)	preserved BF16 (~11 GB)	quantized to NVFP4 (~3 GB)
`linear_attn.conv1d` (SSM 1D convolution — recurrence-critical)	preserved BF16	preserved BF16 ✅
`linear_attn` SSM state vectors (, , )

When to pick which:

Pick the regular variant if you have ≥48 GB VRAM. Even the projection weights at BF16 give a small additional safety margin on long-context recurrence stability.
Pick this XS variant if you have 24–32 GB VRAM (RTX 5090, single GPUs without headroom for full BF16 GDN). The conv1d preservation guarantees the SSM recurrence stays numerically stable; the ~6 GB savings buy meaningful KV-cache headroom on tight GPUs.

Variants

Table with columns: Format, Size, Use case
Format	Size	Use case
BF16	51 GB	Full-precision reference weights
NVFP4 (compressed-tensors + DFlash)	26 GB	DGX Spark — DFlash spec decode, validated
Multimodal-NVFP4-MTP	27 GB	RTX PRO 6000 / B100/B200 — MTP, GDN preserved BF16
Text-NVFP4-MTP

What this is

Specifically:

Body quantized to NVFP4 via nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG. modelopt format, served by vLLM through --quantization modelopt.
Linear-attn / GatedDeltaNet projections quantized to NVFP4 (this is the XS difference). Only linear_attn.conv1d is kept BF16 (modelopt's default). The community has validated this approach on Qwen3.5/3.6-NVFP4 builds with 22K+ downloads on sakamakismile's reference recipes; we re-ran calibration on our abliterated weights and the model serves correctly.
Vision tower preserved BF16 (333 keys) — correct model.visual.* layout. Multimodal weights load; runtime vision inference validation on this image is pending a GPU window.
MTP head grafted from the base Qwen/Qwen3.6-27B checkpoint (15 tensors, BF16, bit-exact verified). Powers --speculative-config '{"method":"qwen3_5_mtp",...}' for self-speculative decoding without a separate drafter.

Why MTP

Indicative published numbers (sakamakismile's reference recipe on RTX 5090):

Single-stream short prompts at n=3: ~132 tok/s
Single-stream long-form: ~105 tok/s
2-parallel aggregate (256K + KV FP8): ~189-207 tok/s
Mean acceptance length: ~3.0-4.0 (compared to DFlash chains of ~2.0-2.3)

Validated benchmarks of the AEON-Ultimate XS variant land in the GitHub repo once measured.

🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on memory architecture:

Table with columns: Hardware tier, Recommended variant, Why
Hardware tier	Recommended variant	Why
DGX Spark / GB10 (sm_121a, unified memory)	Either: `-NVFP4` (DFlash) (simpler, validated) or this XS body served with `--speculative-config '{"method":"dflash",...}'` (highest measured throughput — see the acceptance bench above)	Spark prefers DFlash regardless of body. On `aeon-vllm-ultimate:latest` with DFlash n=10, long-context (~9k) draft acceptance reaches 45.0 % (2.3× the pre-fix image) — see the live bench above. The grafted MTP head in this repo is unused in that path. Never use `--speculative-config '{"method":"qwen3_5_mtp",...}'` on Spark — MTP loses badly to DFlash on unified memory.
RTX PRO 6000 Blackwell (96 GB dedicated VRAM)	— GDN BF16 for best long-context fidelity, for ~10 % faster decode

Full bench numbers: GitHub repo Performance section.

Usage

vLLM serve — dedicated-VRAM Blackwell (default: MTP via grafted head)

bash
# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS \
  --local-dir ./aeon-ultimate-multimodal-nvfp4-mtp-xs

# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
  --quantization modelopt \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image":4,"video":2}' \
  --mm-encoder-tp-mode data \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

num_speculative_tokens=3 is the canonical setting for qwen3_5_mtp. Higher values diverge the drafter further from the target distribution and acceptance falls.

vLLM serve — DGX Spark (DFlash spec, not MTP — current production recipe)

bash
# Pull the DFlash drafter alongside this body
hf download z-lab/Qwen3.6-27B-DFlash --local-dir ./qwen36-27b-dflash

export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

# docker run --rm --gpus all --entrypoint vllm \
#   -v "$PWD":/models ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
#   serve /models/aeon-ultimate-multimodal-nvfp4-mtp-xs \
#   ... (the same flags below) ...

vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
  --quantization modelopt \
  --trust-remote-code \
  --mamba-cache-dtype float32 \
  --limit-mm-per-prompt '{"image":4,"video":2}' \
  --mm-encoder-tp-mode data \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.69 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"dflash","model":"./qwen36-27b-dflash","num_speculative_tokens":10}'

Critical DFlash config rules (learned the hard way):

Use the DEFAULT drafter attention backend — do not add an attention_backend to the spec-config. The default works for Qwen3.6 DFlash on this image (unlike Gemma's DFlash, which needed an explicit flash_attn backend). Leave it out.
Do NOT set --kv-cache-dtype. DFlash's drafter is non-causal (block diffusion) and no vLLM backend supports non-causal + fp8 KV, so KV must stay at default BF16. Forcing fp8 KV will fail to boot.
num_speculative_tokens=10 is the validated default. An n=8/10/12/15 sweep found n=10 best for the Spark voice/chat workload — top aggregate throughput + DFlash acceptance at parity single-stream; the old 12-token setting wins only at very long context (z-lab's published default is 15).
--gpu-memory-utilization 0.69 because this host co-runs Qwen3-ASR (:8001) + Qwen3-TTS (:8002). Keep it ≤ 0.7 when co-hosting; raise toward 0.88 only if vLLM runs alone (the DGX Spark unified-memory cap is 0.88 — never go higher). BF16 KV is 2× fp8, but full 256k context still fits — KV cache holds 487k tokens / 1.86× concurrency at 262,144 ctx.

Why this recipe needs aeon-vllm-ultimate:latest: the z-lab DFlash drafter is a sliding-window model — 4 of its 5 layers use sliding-window attention (window 2048). This image (PR #40898) runs those layers as proper SWA; earlier images ran them as full attention, so drafting collapsed once context grew past ~2048 tokens. PR #41703 additionally makes --enable-prefix-caching corruption-immune with DFlash. Net: long-context drafting holds up; short-context (<2048, one window) is unchanged. See the live acceptance bench at the top (45.0 % @ ~9k vs 19.7 % pre-fix = 2.3×).

Configuration notes

--quantization modelopt is required for this body (not compressed-tensors — different format).
--speculative-config '{"method":"qwen3_5_mtp", ...}' uses the grafted MTP head; correct for dedicated-VRAM Blackwell. Don't use this on DGX Spark.
--speculative-config '{"method":"dflash", ...}' uses an external DFlash drafter; correct for DGX Spark. The grafted MTP head in this repo sits unused in this path (~0.85 GB dead weight). Don't use this on RTX PRO 6000 or B100/B200 — they prefer MTP.
--gpu-memory-utilization 0.94 is the validated cap on RTX PRO 6000; 0.88 is the cap on DGX Spark (unified memory thrashes at 0.90+).

Quantization recipe

Tool: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
Loader: Qwen3_5ForConditionalGeneration.from_pretrained (multimodal-preserved class)
Calibration: neuralmagic/calibration LLM split, 20 samples × 8192 tokens
Excluded from quantization (kept BF16) — XS variant differences from the regular variant in bold:
- lm_head, proj_out.*, *router*, *mlp.gate.* (NVFP4_DEFAULT_CFG)
- *linear_attn.conv1d*,

Provenance & credits

BF16 source: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16. See that card for the full abliteration pipeline.
MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (docs/MTP_GRAFT_RECIPE.md)
Reference benchmark recipes: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
Quantization: NVIDIA TensorRT Model Optimizer (nvidia-modelopt 0.43.0)
Base: Alibaba Qwen team — Qwen/Qwen3.6-27B

License + responsibility

☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS

Get help setting up a custom Dedicated Endpoints.

🚀 Quickstart (DGX Spark / GB10 — DFlash)

📈 Why this image matters for long-context drafting

🏆 Live production bench — DFlash n=10 on aeon-vllm-ultimate:latest

Performance — DGX Spark (v0.23.0, aeon-vllm-ultimate:latest)

Per-category single-stream (c=1)

Aggregate throughput by concurrency

Long-context DFlash acceptance

What we fixed for the DGX Spark

What "XS" means — and what it's not

Variants

What this is

Why MTP

🎯 When to pick this variant — measured hardware routing

Usage

vLLM serve — dedicated-VRAM Blackwell (default: MTP via grafted head)

vLLM serve — DGX Spark (DFlash spec, not MTP — current production recipe)

Configuration notes

Quantization recipe

Provenance & credits

License + responsibility

☕ Support the work

Explore FriendliAI today

🚀 Quickstart (DGX Spark / GB10 — DFlash)

📈 Why this image matters for long-context drafting

🏆 Live production bench — DFlash n=10 on aeon-vllm-ultimate:latest

Performance — DGX Spark (v0.23.0, aeon-vllm-ultimate:latest)

Per-category single-stream (c=1)

Aggregate throughput by concurrency

Long-context DFlash acceptance

What we fixed for the DGX Spark

What "XS" means — and what it's not

Variants

What this is

Why MTP

🎯 When to pick this variant — measured hardware routing

Usage

vLLM serve — dedicated-VRAM Blackwell (default: MTP via grafted head)

vLLM serve — DGX Spark (DFlash spec, not MTP — current production recipe)

Configuration notes

Quantization recipe

Provenance & credits

License + responsibility

☕ Support the work

🏆 Live production bench — DFlash n=10 on `aeon-vllm-ultimate:latest`

🏆 Live production bench — DFlash n=10 on `aeon-vllm-ultimate:latest`