AEON-7
Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS
License: apache-2.0
🚀 Quickstart (DGX Spark / GB10, DFlash)
Complete copy-paste recipe: pull the container, pull this model, pull the DFlash drafter, then serve. (Fuller deployment options — dedicated-VRAM Blackwell MTP, env vars, compose — are in the Usage section below.)
```bash
# 1) Pull the canonical AEON vLLM Ultimate container
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2) Pull THIS model (fresh)
huggingface-cli download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS --local-dir ./aeon-model

# 3) Pull the DFlash drafter (fresh)
huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir ./aeon-drafter

# 4) Serve (ENTRYPOINT is /bin/bash, so pass --entrypoint vllm then serve …)
docker run --rm --gpus all \
  -v ./aeon-model:/model:ro \
  -v ./aeon-drafter:/drafter:ro \
  --entrypoint vllm ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
  --quantization modelopt \
  --mamba-cache-dtype float16 \
  --mamba-block-size 256 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt '{"image":4,"video":2}' \
  --mm-encoder-tp-mode data \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":12}'
```
Lower `--gpu-memory-utilization` (e.g. `0.69`) if the host co-runs other services; never exceed 0.88 on DGX Spark unified memory. For dedicated-VRAM Blackwell (MTP via the grafted head, no external drafter), see Usage.
📈 Why this image matters for long-context drafting
The z-lab Qwen3.6-27B DFlash drafter is a sliding-window model: 4 of its 5 layers use sliding-window attention (window 2048). `aeon-vllm-ultimate:latest` (PR #40898) runs those layers as proper SWA; earlier images ran them as full attention, so drafting collapsed once context grew past ~2048 tokens. PR #41703 additionally makes `--enable-prefix-caching` corruption-immune with DFlash. Net: long-context drafting holds up; short-context (<2048, one window) is unchanged.
🏆 Live production bench: DFlash n=12 on aeon-vllm-ultimate:latest
Measured on DGX Spark GB10, `aeon-vllm-ultimate:latest`, DFlash `num_speculative_tokens=12`. Lead with acceptance (stable across samples), not single-sample tok/s.
Long-context (~9k-token) draft acceptance, the headline win:
| Image | ~9k-token draft acceptance |
|---|---|
| pre-fix image (full-attn drafter) | 19.7 % |
| `aeon-vllm-ultimate:latest` (SWA drafter) | 45.0 % (2.3×) |
Short-context c=1 acceptance by category (new image, n=12, approximate):
| Category | Accept |
|---|---|
| Math | ~50 % |
| Reasoning | ~50 % |
| Extraction | ~40 % |
| Coding | ~38 % |
| Natural | ~25 % |
| Prose | ~18 % |
Short-context throughput is statistically unchanged vs the prior image: the drafter's sliding window only engages past 2048 tokens, so the win is specifically long-context. (Caveat: single / 3-round samples; short-context rankings are within noise. Acceptance is the stable signal; single-sample tok/s is not.)
🙏 Reference recipe credit: The conv1d-preserved NVFP4 + MTP graft pipeline used to build this XS variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config, including the strategic decision to quantize the GDN projection matmuls to NVFP4 while preserving `linear_attn.conv1d` at BF16, and the MTP-head graft technique. We adapted the recipe to AEON-Ultimate's abliterated weights and ship both the conv1d-preserved-only XS variant (matching their footprint) and a heavier regular-MTP variant that additionally keeps the projections at BF16. Full credit for the underlying recipe → sakamakismile.
Performance — DGX Spark (v0.23.0, aeon-vllm-ultimate:latest)
Fastest 27B export. NVFP4 MTP-XS body + an external DFlash@12 drafter on `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` (vLLM 0.23.0) delivers ~42.6 tok/s single-stream and ~340 tok/s aggregate at c=64, with DFlash draft acceptance of ~35 % at short context and ~45 % at long (~9k) context. This is the body the canonical 27B card was benchmarked on.
Measured on DGX Spark GB10 (sm_121a, unified memory), aeon-vllm-ultimate:latest, NVFP4 body served with an external DFlash drafter at num_speculative_tokens=12. The grafted MTP head ships in this repo but sits unused on the Spark — DFlash wins on unified memory (see hardware routing below).
MTP-XS is the smallest NVFP4 export and the fastest single-stream of the 27B family, at roughly half the memory of the BF16 baseline.
Per-category single-stream (c=1)
| Category | Decode (tok/s) | TTFT (ms) | TPOT (ms) | Prefill (tok/s) | DFlash accept |
|---|---|---|---|---|---|
| Coding | 42.6 | 141 | 23.5 | 318 | 34.5 % |
| Math | 55.9 | 248 | 17.9 | 246 | 48.0 % |
| Reasoning | 49.3 | 232 | 20.3 | 211 | 41.7 % |
| Prose | 31.2 | 229 | 32.1 | 166 | 23.2 % |
| Natural language | 34.8 | 228 | 28.7 | 175 | 26.6 % |
| Extraction / JSON | 57.4 | 234 | 17.4 | 231 | 49.3 % |
Decode speed tracks DFlash acceptance: structured workloads (Extraction, Math, Reasoning) draft well (≈42–49 % accept → 49–57 tok/s); free-form prose drafts less predictably (≈23 % → 31 tok/s). The headline ~42.6 tok/s is the Coding-category single-stream figure.
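As a quick consistency check on the table above, single-stream decode throughput should be roughly the reciprocal of TPOT, and it should rise with DFlash acceptance. The sketch below re-derives both from the published rows. The function names are illustrative, and the 0.1 tok/s tolerance is an assumption chosen to absorb rounding in the table.

```python
# Rows copied from the per-category single-stream table:
# (category, decode tok/s, TPOT ms, DFlash acceptance %)
ROWS = [
    ("Coding", 42.6, 23.5, 34.5),
    ("Math", 55.9, 17.9, 48.0),
    ("Reasoning", 49.3, 20.3, 41.7),
    ("Prose", 31.2, 32.1, 23.2),
    ("Natural language", 34.8, 28.7, 26.6),
    ("Extraction / JSON", 57.4, 17.4, 49.3),
]


def decode_from_tpot(tpot_ms):
    """Tokens per second implied by a time-per-output-token in milliseconds."""
    return 1000.0 / tpot_ms


def max_tpot_gap(rows):
    """Largest |reported decode - 1000/TPOT| across all rows, in tok/s."""
    return max(abs(decode - decode_from_tpot(tpot)) for _, decode, tpot, _ in rows)


def decode_tracks_acceptance(rows):
    """True if ordering rows by acceptance also orders decode speed ascending."""
    by_accept = sorted(rows, key=lambda row: row[3])
    speeds = [row[1] for row in by_accept]
    return speeds == sorted(speeds)


if __name__ == "__main__":
    for name, decode, tpot, accept in ROWS:
        implied = decode_from_tpot(tpot)
        print(f"{name:18s} reported={decode:5.1f} implied={implied:5.1f} accept={accept:.1f}%")
    print("max gap (tok/s):", round(max_tpot_gap(ROWS), 3))
    print("decode rises with acceptance:", decode_tracks_acceptance(ROWS))
```

If a future re-bench breaks either check, that usually points at a transcription error in the table or at a run where TTFT was folded into the decode figure.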
Aggregate throughput by concurrency
Throughput scales cleanly to c=64 (the DFlash high-concurrency fix below removed the prior c≥32 crash). Aggregate peaks at c=64, topping out at ~340 tok/s (Reasoning category). Every category climbs monotonically from c=1 to c=64 except Natural language, which dips from 217 to 208 tok/s between c=16 and c=32 before recovering to 230 at c=64:
| Category | c=1 | c=8 | c=16 | c=32 | c=64 |
|---|---|---|---|---|---|
| Coding | 42 | 185 | 249 | 262 | 277 |
| Math | 53 | 221 | 285 | 294 | 303 |
| Reasoning | 47 | 241 | 301 | 319 | 340 |
| Prose | 30 | 130 | 165 | 179 | 193 |
| Natural language | 34 | 148 | 217 | 208 | 230 |
| Extraction / JSON | 55 | 201 | 250 | 270 | 299 |
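To read the scaling behaviour out of the table above, the following sketch computes the c=1 to c=64 speedup and the per-stream rate at c=64 for each category, and flags any step where aggregate throughput falls as concurrency rises. The numbers are copied from the table; the helper names are illustrative.

```python
CONCURRENCY = [1, 8, 16, 32, 64]

# Aggregate tok/s per category, one value per concurrency level above.
AGGREGATE = {
    "Coding": [42, 185, 249, 262, 277],
    "Math": [53, 221, 285, 294, 303],
    "Reasoning": [47, 241, 301, 319, 340],
    "Prose": [30, 130, 165, 179, 193],
    "Natural language": [34, 148, 217, 208, 230],
    "Extraction / JSON": [55, 201, 250, 270, 299],
}


def speedup(series):
    """Aggregate throughput at the highest concurrency relative to c=1."""
    return series[-1] / series[0]


def per_stream_at_peak(series):
    """Average tok/s each request sees at the highest concurrency level."""
    return series[-1] / CONCURRENCY[-1]


def regressions(series):
    """(c_from, c_to) pairs where throughput drops as concurrency rises."""
    return [
        (CONCURRENCY[i], CONCURRENCY[i + 1])
        for i in range(len(series) - 1)
        if series[i + 1] < series[i]
    ]


if __name__ == "__main__":
    for name, series in AGGREGATE.items():
        print(
            f"{name:18s} speedup={speedup(series):4.2f}x "
            f"per-stream@c64={per_stream_at_peak(series):4.2f} tok/s "
            f"regressions={regressions(series)}"
        )
```

The per-stream figure is the useful one for latency budgeting: aggregate throughput keeps rising with concurrency, but each individual request decodes far slower than it does at c=1.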
Long-context DFlash acceptance
The z-lab DFlash drafter is a sliding-window model (4 of 5 layers use SWA, window 2048). On this image (PR #40898) those layers run as proper SWA, so draft acceptance holds as context grows instead of collapsing past 2k tokens:
| Context | DFlash draft acceptance |
|---|---|
| short (c=1, blended) | ~35 % |
| long (~9k tokens) | 45.0 % |
This is the headline long-context win — acceptance at ~9k tokens is higher than the blended short-context average. (Pre-fix full-attention image collapsed to ~19.7 % at the same context.)
What we fixed for the DGX Spark
All AEON models run on one unified container — ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0 built from source for sm_121a, merged with the AEON speculative-decoding stack). Two fixes matter most for this card:
- Unified container. A single sm_121a image (vLLM 0.23.0) replaces the per-model image sprawl — the same build serves every Qwen3.6-27B AEON-Ultimate variant, with the SM120-family CUTLASS NVFP4/FP8 kernels GB10 actually dispatches to.
- DFlash high-concurrency fix. The speculative drafter previously crashed at ≥32 concurrent requests (a padded-vs-unpadded KV block-table shape mismatch in FlashAttention). The fix slices the drafter's block-table to the unpadded batch (`block_table[:num_reqs]`), a port of upstream PR #43982, which fixed this for MTP but never for DFlash. The c=32 / c=64 columns above are only measurable because of it.
Full optimization writeup (NVFP4 KV cache, DFlash SWA, sm_121a build, unified-memory tuning): see the container repo.
Stock baseline pending. These figures are on the optimized `aeon-vllm-ultimate:latest`. There is no stock / vanilla-vLLM baseline for this export yet; a fresh fully-vanilla re-bench (default settings, no speculative decoding, no sm_121a optimizations) is pending and will be added when it completes.
What "XS" means — and what it's not
This is the extra-small footprint sibling of -Multimodal-NVFP4-MTP. XS is not "everything to FP4." It is a deliberate, principled split: the heavy GDN matmul projections drop to NVFP4 (where they're bandwidth-bound and FP4 wins big), while the SSM-critical linear_attn.conv1d kernel stays BF16 (where FP4 has documented stability problems on long-context recurrence).
| | Multimodal-NVFP4-MTP (regular) | Multimodal-NVFP4-MTP-XS (this repo) |
|---|---|---|
| `linear_attn` projections (`in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj`) | preserved BF16 (~11 GB) | quantized to NVFP4 (~3 GB) |
| `linear_attn.conv1d` (SSM 1D convolution, recurrence-critical) | preserved BF16 | preserved BF16 ✅ |
| `linear_attn` SSM state vectors (`A_log`, `dt_bias`, `norm.weight`) | preserved BF16 | preserved BF16 ✅ |
| `mtp.*` head (grafted bf16 from base, bit-exact verified) | yes | yes |
| Vision tower | preserved BF16 | preserved BF16 |
| Total disk | ~27 GB | ~21 GB |
| VRAM footprint at runtime | ~28 GB | ~22 GB |
This is a smart, strategic quantization, not a precision compromise. The conv1d preservation matters: the GatedDeltaNet recurrence depends on the 1D convolution behaving numerically like its training distribution, and FP4 quantization of conv1d has been observed to cause drift on long-context inference in community testing. By keeping conv1d BF16 while quantizing the projections (which are bandwidth-limited matmuls where FP4 is a clean win), we get the ~6 GB footprint reduction without sacrificing the part of the model that's actually fragile under quantization. This is the same principle modelopt's NVFP4_DEFAULT_CFG applies by default and the same recipe sakamakismile validated across their Qwen3.6-NVFP4-MTP series (22K+ downloads).
When to pick which:
- Pick the regular variant if you have ≥48 GB VRAM. Keeping the projection weights at BF16 gives a small additional safety margin on long-context recurrence stability.
- Pick this XS variant if you have 24–32 GB VRAM (RTX 5090, single GPUs without headroom for full BF16 GDN). The conv1d preservation guarantees the SSM recurrence stays numerically stable; the ~6 GB savings buy meaningful KV-cache headroom on tight GPUs.
We ship both because we have the headroom on RTX PRO 6000 / B100/B200 to run the larger, more numerically-conservative version, and several users on tighter cards have asked for the smaller one. Neither variant quantizes linear_attn.conv1d — that would be a different (and not-recommended) variant we have explicitly chosen not to ship.
Variants
| Format | Size | Use case |
|---|---|---|
| BF16 | 51 GB | Full-precision reference weights |
| NVFP4 (compressed-tensors + DFlash) | 26 GB | DGX Spark — DFlash spec decode, validated |
| Multimodal-NVFP4-MTP | 27 GB | RTX PRO 6000 / B100/B200 — MTP, GDN preserved BF16 |
| Text-NVFP4-MTP | 26 GB | Same as above without vision tower |
| Multimodal-NVFP4-MTP-XS (this repo) | 21 GB | RTX 5090 / smaller dedicated VRAM; MTP, NVFP4 incl. GDN projections (conv1d kept BF16) |
| Text-NVFP4-MTP-XS | 20 GB | Same as this repo without vision tower |
What this is
The modelopt-format NVFP4 + MTP variant, multimodal-preserved, with linear_attn projections fully quantized, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).
Specifically:
- Body quantized to NVFP4 via `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`. modelopt format, served by vLLM through `--quantization modelopt`.
- Linear-attn / GatedDeltaNet projections quantized to NVFP4 (this is the XS difference). Only `linear_attn.conv1d` is kept BF16 (modelopt's default). The community has validated this approach on Qwen3.5/3.6-NVFP4 builds with 22K+ downloads on sakamakismile's reference recipes; we re-ran calibration on our abliterated weights and the model serves correctly.
- Vision tower preserved BF16 (333 keys), with the correct `model.visual.*` layout. Multimodal weights load; runtime vision validated 7/7 on `aeon-vllm-ultimate:latest` (2026-06-19 image probe, 0 skip-loads).
- MTP head grafted from the base `Qwen/Qwen3.6-27B` checkpoint (15 tensors, BF16, bit-exact verified). Powers `--speculative-config '{"method":"qwen3_5_mtp",...}'` for self-speculative decoding without a separate drafter.
Why MTP
Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.
Indicative published numbers (sakamakismile's reference recipe on RTX 5090):
- Single-stream short prompts at `n=3`: ~132 tok/s
- Single-stream long-form: ~105 tok/s
- 2-parallel aggregate (256K + KV FP8): ~189-207 tok/s
- Mean acceptance length: ~3.0-4.0 (compared to DFlash chains of ~2.0-2.3)
Validated benchmarks of the AEON-Ultimate XS variant land in the GitHub repo once measured.
🎯 When to pick this variant — measured hardware routing
The right speculative-decode method depends on memory architecture:
| Hardware tier | Recommended variant | Why |
|---|---|---|
| DGX Spark / GB10 (sm_121a, unified memory) | Either: -NVFP4 (DFlash) (simpler, validated) or this XS body served with --speculative-config '{"method":"dflash",...}' (highest measured throughput — see the acceptance bench above) | Spark prefers DFlash regardless of body. On aeon-vllm-ultimate:latest with DFlash n=12, long-context (~9k) draft acceptance reaches 45.0 % (2.3× the pre-fix image) — see the live bench above. The grafted MTP head in this repo is unused in that path. Never use --speculative-config '{"method":"qwen3_5_mtp",...}' on Spark — MTP loses badly to DFlash on unified memory. |
| RTX PRO 6000 Blackwell (96 GB dedicated VRAM) | Multimodal-NVFP4-MTP — GDN BF16 for best long-context fidelity, or this XS variant for ~10 % faster decode | XS measured 111.4 tok/s median vs regular's 101.5 on RTX PRO 6000. Both win against DFlash on dedicated VRAM. |
| B100 / B200 (sm_100, dedicated FP4) | Multimodal-NVFP4-MTP (preferred — GDN BF16 fits) or this XS | Native FP4 + dedicated VRAM = MTP territory. Whichever fits cleanly. |
| RTX 5090 (sm_120, 32 GB dedicated VRAM) | This XS variant if you need the vision tower (correct model.visual.* layout; vision validated 7/7 2026-06-19); Text-XS if text-only | XS variants fit comfortably in 32 GB; matches sakamakismile's reference footprint. |
| A100 / H100 (no native FP4) | BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper — no benefit. |
Full bench numbers: GitHub repo Performance section.
Usage
vLLM serve — dedicated-VRAM Blackwell (default: MTP via grafted head)
```bash
# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS \
  --local-dir ./aeon-ultimate-multimodal-nvfp4-mtp-xs

# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1
vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
  --quantization modelopt \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image":4,"video":2}' \
  --mm-encoder-tp-mode data \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
```
num_speculative_tokens=3 is the canonical setting for qwen3_5_mtp. Higher values diverge the drafter further from the target distribution and acceptance falls.
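One way to see why a short draft chain is usually enough: under the simplifying assumption (not taken from this card) that each drafted token is accepted independently with the same probability `alpha`, the expected number of tokens emitted per target forward pass with `n` drafted tokens is `(1 - alpha**(n+1)) / (1 - alpha)`, the usual estimate from the speculative-decoding literature. Real acceptance tends to decay with draft depth, as noted above, so this sketch is optimistic. The `alpha` values are illustrative, not measured.

```python
def expected_tokens_per_step(alpha, n):
    """Expected tokens emitted per target pass with n drafted tokens.

    Assumes i.i.d. per-token acceptance probability alpha (an idealisation).
    The count includes the one token the target model always contributes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    if alpha == 1.0:
        return float(n + 1)
    return (1.0 - alpha ** (n + 1)) / (1.0 - alpha)


def marginal_gain(alpha, n):
    """Extra expected tokens from drafting n tokens instead of n - 1."""
    return expected_tokens_per_step(alpha, n) - expected_tokens_per_step(alpha, n - 1)


if __name__ == "__main__":
    for alpha in (0.6, 0.8):  # illustrative acceptance probabilities
        for n in (1, 3, 6, 12):
            print(
                f"alpha={alpha} n={n:2d} "
                f"expected={expected_tokens_per_step(alpha, n):.3f} "
                f"marginal={marginal_gain(alpha, n):.4f}"
            )
```

Under this model the marginal gain of the n-th drafted token is `alpha**n`, so each extra draft position buys less than the one before while still costing drafter compute.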
vLLM serve — DGX Spark (DFlash spec, not MTP — current production recipe)
For DGX Spark, swap the spec method to DFlash. DFlash's block-diffusion drafter is decisively better than MTP's n=3 on unified memory. This is the exact recipe running in production, on the AEON vLLM Ultimate image ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0 + DFlash high-concurrency fix + PR #40898 + PR #41703 + PR #44389).
The image's ENTRYPOINT is /bin/bash, so when launching via docker run you must pass --entrypoint vllm and then serve … (writing IMAGE vllm serve runs bash vllm serve and fails). The vllm serve … arguments are identical either way:
```bash
# Pull the DFlash drafter alongside this body
hf download z-lab/Qwen3.6-27B-DFlash --local-dir ./qwen36-27b-dflash

export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

# docker run --rm --gpus all --entrypoint vllm \
#   -v "$PWD":/models ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
#   serve /models/aeon-ultimate-multimodal-nvfp4-mtp-xs \
#   ... (the same flags below) ...
vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
  --quantization modelopt \
  --trust-remote-code \
  --mamba-cache-dtype float16 \
  --mamba-block-size 256 \
  --limit-mm-per-prompt '{"image":4,"video":2}' \
  --mm-encoder-tp-mode data \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.69 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"dflash","model":"./qwen36-27b-dflash","num_speculative_tokens":12}'
```
Critical DFlash config rules (learned the hard way):
- Use the DEFAULT drafter attention backend: do not add an `attention_backend` to the spec-config. The default works for Qwen3.6 DFlash on this image (unlike Gemma's DFlash, which needed an explicit `flash_attn` backend). Leave it out.
- Do NOT set `--kv-cache-dtype`. DFlash's drafter is non-causal (block diffusion) and no vLLM backend supports non-causal + fp8 KV, so KV must stay at default BF16. Forcing fp8 KV will fail to boot.
- `num_speculative_tokens=12` is the validated production default. An n=8–15 sweep found n=10–12 statistically tied at short context, with n=12 best for long-context acceptance (z-lab's published default is 15).
- `--gpu-memory-utilization 0.69` because this host co-runs Qwen3-ASR (:8001) + Qwen3-TTS (:8002). Keep it ≤ 0.7 when co-hosting; raise toward 0.88 only if vLLM runs alone (the DGX Spark unified-memory cap is 0.88; never go higher). BF16 KV is 2× fp8, but full 256k context still fits: KV cache holds 487k tokens / 1.86× concurrency at 262,144 ctx.
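The concurrency figure quoted in the memory rule above is KV-cache capacity divided by the per-request context length. A minimal check, using the 487k-token capacity and 262,144-token context from this card (the function name is illustrative, and 487,000 is an assumed reading of "487k"):

```python
def full_context_concurrency(kv_capacity_tokens, max_model_len):
    """How many requests at the full context length fit in the KV cache."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    return kv_capacity_tokens / max_model_len


# "487k tokens" as quoted in the card; the exact count is not given,
# so 487_000 is an assumption.
KV_CAPACITY_TOKENS = 487_000
MAX_MODEL_LEN = 262_144  # --max-model-len from the serve command

RATIO = full_context_concurrency(KV_CAPACITY_TOKENS, MAX_MODEL_LEN)

if __name__ == "__main__":
    print(f"{RATIO:.2f}x full-context requests fit in the KV cache")
```

A ratio above 1 means a single request can use the whole 256k window; it does not mean 64 such requests fit at once, so long prompts at high `--max-num-seqs` will still queue.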
Why this recipe needs `aeon-vllm-ultimate:latest`: the z-lab DFlash drafter is a sliding-window model, with 4 of its 5 layers using sliding-window attention (window 2048). This image (PR #40898) runs those layers as proper SWA; earlier images ran them as full attention, so drafting collapsed once context grew past ~2048 tokens. PR #41703 additionally makes `--enable-prefix-caching` corruption-immune with DFlash. Net: long-context drafting holds up; short-context (<2048, one window) is unchanged. See the live acceptance bench at the top (45.0 % @ ~9k vs 19.7 % pre-fix = 2.3×).
Configuration notes
- `--quantization modelopt` is required for this body (not `compressed-tensors`, which is a different format).
- `--speculative-config '{"method":"qwen3_5_mtp", ...}'` uses the grafted MTP head; correct for dedicated-VRAM Blackwell. Don't use this on DGX Spark.
- `--speculative-config '{"method":"dflash", ...}'` uses an external DFlash drafter; correct for DGX Spark. The grafted MTP head in this repo sits unused in this path (~0.85 GB dead weight). Don't use this on RTX PRO 6000 or B100/B200, which prefer MTP.
- `--gpu-memory-utilization 0.94` is the validated cap on RTX PRO 6000; `0.88` is the cap on DGX Spark (unified memory thrashes at 0.90+).
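The notes above amount to a small set of launch rules, which can be written down as a pre-flight check. This is a hypothetical helper, not part of vLLM or the AEON image: the hardware labels, function name, and messages are invented for illustration, and only the two tiers with a stated memory cap are covered.

```python
from typing import List, Optional

# Caps and preferred methods quoted in this card.
# The dictionary keys are hypothetical labels, not vLLM identifiers.
GPU_UTIL_CAP = {"dgx_spark": 0.88, "rtx_pro_6000": 0.94}
PREFERRED_METHOD = {"dgx_spark": "dflash", "rtx_pro_6000": "qwen3_5_mtp"}


def check_launch(
    hardware: str,
    method: str,
    gpu_memory_utilization: float,
    quantization: str = "modelopt",
    kv_cache_dtype: Optional[str] = None,
) -> List[str]:
    """Return the list of rule violations for a planned launch (empty if none)."""
    if hardware not in GPU_UTIL_CAP:
        raise ValueError(f"no rules recorded for hardware {hardware!r}")

    problems = []
    if quantization != "modelopt":
        problems.append("this body requires --quantization modelopt")
    if method != PREFERRED_METHOD[hardware]:
        problems.append(
            f"{hardware} should use speculative method {PREFERRED_METHOD[hardware]!r}"
        )
    if gpu_memory_utilization > GPU_UTIL_CAP[hardware]:
        problems.append(
            f"--gpu-memory-utilization exceeds the {GPU_UTIL_CAP[hardware]} cap"
        )
    if method == "dflash" and kv_cache_dtype is not None:
        problems.append("do not set --kv-cache-dtype with DFlash")
    return problems


if __name__ == "__main__":
    print(check_launch("dgx_spark", "dflash", 0.69))
    print(check_launch("dgx_spark", "qwen3_5_mtp", 0.90))
```

Returning a list rather than raising on the first violation lets a deployment script report every misconfiguration in one pass.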
Quantization recipe
- Tool: `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`
- Loader: `Qwen3_5ForConditionalGeneration.from_pretrained` (multimodal-preserved class)
- Calibration: `neuralmagic/calibration` LLM split, 20 samples × 8192 tokens
- Excluded from quantization (kept BF16); XS variant differences from the regular variant in bold:
  - `lm_head`, `proj_out.*`, `*router*`, `*mlp.gate.*` (NVFP4_DEFAULT_CFG)
  - `*linear_attn.conv1d*`, `*mixer.conv1d*` (NVFP4_DEFAULT_CFG default; kept BF16 because FP4 quantization of the SSM 1D convolution causes drift on long-context recurrence, and this is the recurrence-critical kernel of the GatedDeltaNet block. Both regular and XS variants preserve this.)
  - **`*linear_attn*` is NOT broadly excluded** (XS difference: the projection matmuls `in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj` get NVFP4-quantized; saves ~8 GB; FP4 is a clean win on bandwidth-bound matmuls)
  - `*visual*` (vision tower preservation)
  - `*mtp*` (MTP head preservation)
  - `*output_layer*`, `output.*`
- MTP graft: 15 tensors copied bf16 from `Qwen/Qwen3.6-27B` after modelopt export
- Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source
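To make the exclusion list concrete, the sketch below applies the patterns from the recipe to a handful of module names. It assumes the patterns behave like shell-style globs as implemented by Python's `fnmatch` (an assumption about how the quantizer matches names, not something this card states), and the module names are hypothetical examples rather than names read from the checkpoint.

```python
from fnmatch import fnmatchcase

# Exclusion patterns listed in the recipe above (XS variant).
EXCLUDE_PATTERNS = [
    "lm_head",
    "proj_out.*",
    "*router*",
    "*mlp.gate.*",
    "*linear_attn.conv1d*",
    "*mixer.conv1d*",
    "*visual*",
    "*mtp*",
    "*output_layer*",
    "output.*",
]


def kept_bf16(module_name):
    """True if any exclusion pattern matches, i.e. the module stays BF16."""
    return any(fnmatchcase(module_name, pattern) for pattern in EXCLUDE_PATTERNS)


# Hypothetical module names, for illustration only.
EXAMPLES = [
    "model.language_model.layers.0.linear_attn.conv1d",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.0.linear_attn.out_proj",
    "model.language_model.layers.0.mlp.gate_proj",
    "model.visual.blocks.0.attn.qkv",
    "mtp.layers.0.self_attn.q_proj",
    "lm_head",
]

if __name__ == "__main__":
    for name in EXAMPLES:
        status = "BF16 (excluded)" if kept_bf16(name) else "NVFP4 (quantized)"
        print(f"{name:58s} {status}")
```

Adding `"*linear_attn*"` to the list would reproduce the regular variant's behaviour under the same assumption, since it would also catch the projection modules.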
Provenance & credits
- BF16 source: `AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16`. See that card for the full abliteration pipeline.
- MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (`docs/MTP_GRAFT_RECIPE.md`)
- Reference benchmark recipes: `sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP`
- Quantization: NVIDIA TensorRT Model Optimizer (`nvidia-modelopt` 0.43.0)
- Base: Alibaba Qwen team, `Qwen/Qwen3.6-27B`
License + responsibility
Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.
☕ Support the work
If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.
Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.
Model provider
AEON-7
Model tree
Base
AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text