AEON-7

Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Quickstart

Complete copy-paste recipe — pull the container, pull this model, pull the DFlash drafter (fresh), then serve with the validated flags. The image ENTRYPOINT is /bin/bash, so docker run overrides it with --entrypoint vllm. DFlash needs BF16 KV — leave --kv-cache-dtype unset.

bash

# 1. Pull the AEON vLLM Ultimate container (vLLM 0.23.0 sm_121a from-source + PR #44389 NVFP4-KV
# + PR #40898/#41703 DFlash fixes + DFlash high-concurrency fix).
# :latest = :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703.
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest
# 2. Download this model (fresh).
huggingface-cli download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
--local-dir ./aeon-model
# 3. Download the z-lab DFlash drafter (fresh — pull every time).
huggingface-cli download z-lab/Qwen3.6-27B-DFlash \
--local-dir ./aeon-drafter
# 4. Serve — DFlash@12 on the NVFP4 (modelopt) body, vision tower preserved.
docker run --gpus all --ipc host --network host \
-e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
-e VLLM_USE_FLASHINFER_SAMPLER=1 \
-v ./aeon-model:/model:ro \
-v ./aeon-drafter:/drafter:ro \
--entrypoint vllm \
ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
serve /model \
--quantization modelopt \
--trust-remote-code \
--mamba-cache-dtype float16 \
--mamba-block-size 256 \
--max-model-len 262144 \
--max-num-seqs 32 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.85 \
--enable-chunked-prefill \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--limit-mm-per-prompt '{"image":4,"video":2}' \
--mm-encoder-tp-mode data \
--speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":12}'

num_speculative_tokens=12 is the validated DFlash setting for this NVFP4 body. On dedicated-VRAM Blackwell you can swap to the model's native grafted MTP head with --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' (see hardware routing). Full flag reference, env vars, and BF16 / dedicated-GPU examples are in Usage below; deployment & compose configs live in the GitHub repo.

Variants

Table
FormatSizeUse case
BF1651 GBFull-precision reference weights (A100/H100 80 GB, RTX PRO 6000 96 GB, multi-GPU, fine-tuning)
NVFP4 (compressed-tensors + DFlash)26 GBDGX Spark / GB10 — production validated with DFlash speculative decoding. Unified ghcr.io/aeon-7/aeon-vllm-ultimate:latest container.
Multimodal-NVFP4-MTP (this repo)27 GBHigh-bandwidth dedicated GPUs (RTX 5090, RTX PRO 6000, B100/B200) with MTP speculative decoding via the model's native mtp.* head. modelopt format, --quantization modelopt. Vision tower preserved.
Text-NVFP4-MTP20 GBSame as this repo but with vision tower stripped. Smaller footprint for text-only deployments on tighter VRAM.

What this is

This is the modelopt-format NVFP4 variant with MTP speculative decoding, multimodal-preserved, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).

Specifically:

  • Body quantized to NVFP4 via nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG. This is the modelopt compressed-tensors format that vLLM serves through --quantization modelopt (different code path from the -NVFP4 sibling release which uses --quantization compressed-tensors).
  • Linear-attn / GatedDeltaNet layers preserved BF16 (432 keys across 48 GDN layers). NVFP4 quantization on Mamba/SSM state collapses the recurrence; modelopt's *linear_attn.conv1d* ignore plus our explicit *linear_attn* exclude keeps these intact.
  • Vision tower preserved BF16 (333 keys), in the correct model.visual.* layout. The vision weights are intact and unquantized; runtime multimodal inference has not yet been validated on this 27B variant (GPU validation window pending) — do not assume image inference is confirmed working.
  • MTP head grafted from the base Qwen/Qwen3.6-27B checkpoint (15 tensors, BF16). The base contains MTP heads but Qwen3_5ForConditionalGeneration.from_pretrained drops them during loading; the lna-lab pipeline pattern (which this build follows) explicitly grafts them back into the quantized output, giving vLLM a working drafter for --speculative-config '{"method":"qwen3_5_mtp",...}'.

Why MTP — and where it actually wins

Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.

Measured numbers on AEON-Ultimate (this exact variant)

Table
HardwareMedian tok/sPeak tok/sSpec-decode acceptance
RTX PRO 6000 Blackwell (96 GB dedicated VRAM)~92 (this variant) / 111.4 (XS sibling)124.7 (XS sibling)67.7 % regular / 69.2 % XS
DGX Spark / GB10 (unified memory) — MTP method24.1 (XS sibling)27.566.3 %
DGX Spark / GB10 — DFlash method on this body 🏆38.5 tok/s thinking-on / 38.1 thinking-off71.3 tok/s thinking-on / 68.4 offDFlash (n=12)
RTX 5090, B100 / B200not yet measured by us — community welcome

Reference numbers from sakamakismile's un-abliterated recipe (RTX 5090)

  • Single-stream short prompts at n=3: ~132 tok/s
  • Single-stream long-form: ~105 tok/s
  • 2-parallel aggregate (256K + KV FP8): ~189–207 tok/s
  • Mean MTP acceptance length: ~3.0–4.0 (vs DFlash chains ~2.0–2.3)

The hardware-routing punchline

On RTX PRO 6000 the XS sibling beats DFlash territory (~111 tok/s vs DFlash-class ~85 we'd expect there). On DGX Spark, DFlash beats MTP by 26 % median / 52 % peak — the unified-memory bandwidth caps how much MTP's high acceptance can translate to throughput. So: MTP is a dedicated-VRAM-Blackwell variant, not a universal upgrade. Full bench data: GitHub repo Performance section.

🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on memory architecture:

Table
Hardware tierRecommended variantWhy
DGX Spark / GB10 (sm_121a, unified memory)-NVFP4 (DFlash)not this MTP variantBench on Spark: DFlash beats MTP by +26 % median, +52 % peak. Spark's unified-memory bandwidth doesn't reward MTP's high acceptance rate. Don't run MTP on Spark.
RTX PRO 6000 Blackwell (sm_120, 96 GB dedicated VRAM)This variant (Multimodal-NVFP4-MTP) ✅ if you need vision; Text if text-onlyMTP wins on dedicated VRAM. ~92 tok/s median measured with GDN BF16; dedicated-VRAM bandwidth lets the MTP head's high acceptance rate translate to throughput.
RTX 5090 (sm_120, 32 GB dedicated VRAM)Multimodal-XS if you use vision; Text-XS if text-onlyXS variants fit comfortably in 32 GB. 111.4 tok/s median measured on RTX PRO 6000; RTX 5090 should land near or above that.
A100 / H100 (no native FP4)BF16NVFP4 dequantizes to BF16 on Ampere/Hopper — no benefit.
B100 / B200 (sm_100, dedicated FP4)This variant (Multimodal) or Text variantNative FP4 + dedicated VRAM = MTP territory.

Full bench numbers: GitHub repo Performance section.

Usage

vLLM serve

bash

# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
--local-dir ./aeon-ultimate-multimodal-nvfp4-mtp
# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1
vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp \
--quantization modelopt \
--trust-remote-code \
--mamba-cache-dtype float16 \
--mamba-block-size 256 \
--max-model-len 262144 \
--max-num-seqs 32 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.85 \
--enable-chunked-prefill \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--limit-mm-per-prompt '{"image":4,"video":2}' \
--mm-encoder-tp-mode data \
--speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":12}'

num_speculative_tokens=12 is the validated DFlash setting for this NVFP4 body. The --limit-mm-per-prompt / --mm-encoder-tp-mode data flags drive the preserved vision tower (multimodal body).

Configuration notes

  • --quantization modelopt is required (not compressed-tensors — different format).
  • --speculative-config '{"method":"dflash", ...}' drives the z-lab Qwen3.6-27B-DFlash drafter at num_speculative_tokens=12 — the validated optimal for this NVFP4 body. (The native qwen3_5_mtp head is also grafted into this repo's safetensors and can be selected instead on dedicated-VRAM Blackwell; see the GitHub repo for the MTP-vs-DFlash hardware routing.)
  • --gpu-memory-utilization 0.85 keeps headroom on unified-memory parts; on dedicated-VRAM RTX PRO 6000 you can push higher, but 0.95+ causes the FlashInfer NVFP4 GEMM autotuner to OOM on first boot. See the GitHub repo's RTX PRO 6000 page for the same OOM behavior under DFlash.

Performance — DGX Spark (v0.23.0, aeon-vllm-ultimate:latest)

Measured on a single DGX Spark / GB10 (Blackwell sm_121a, unified memory) with ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0), this NVFP4 body driven by the z-lab DFlash drafter @ n=12 (DFlash@12 speculative decoding). Headline: ~36 tok/s single-stream, ~274 tok/s aggregate at c=64, ~38% DFlash acceptance (holds ~41% at long context).

Single-stream (c=1) by prompt category — DFlash@12

Table
CategoryDecode tok/sTTFT (ms)TPOT (ms)Prefill (tok/s)DFlash accept
Coding36.117727.725437.9 %
Math37.730826.519841.5 %
Reasoning42.429923.616447.0 %
Prose24.429941.012722.9 %
Natural language27.031737.012626.8 %
Extraction / JSON36.130427.717838.0 %

Single-stream decode lands around 24–42 tok/s depending on category (~36 tok/s on the structured Coding/Extraction workloads); the higher-acceptance Reasoning/Math prompts decode fastest. Acceptance tracks how predictable the next tokens are — high on Reasoning (47%) and Math (41.5%), lower on free-form Prose (22.9%).

Aggregate throughput by concurrency

Aggregate throughput scales cleanly from c=1 up to c=64 with no crash (the prior image crashed at c≥32 under speculative decoding — see What we fixed below). Peak aggregate throughput is ~274 tok/s at c=64 (Reasoning); other categories at c=64: Math ~251, Extraction/JSON ~240, Coding ~229, Natural language ~186, Prose ~156 tok/s. Most of the gain is already captured by c=16; c=16→64 adds only a few percent.

Table
Categoryc=1c=8c=16c=32c=64
Coding35162214222229
Math36180246248251
Reasoning41177258269274
Prose24106142155156
Natural language26129180183186
Extraction / JSON35155248218240

Long-context DFlash acceptance

DFlash draft acceptance holds at ~41% (40.9%) at long context rather than collapsing — PR #40898 runs the drafter's sliding-window layers as true SWA, so drafting survives as the agent history grows past ~2048 tokens.

Stock baseline pending fresh vanilla re-bench. No matched stock / un-optimized vanilla-vLLM baseline exists yet for this variant; the BF16 bar in the variant chart is the unquantized AEON body, not a stock-vLLM reference. A fully-vanilla (no DFlash, no sm_121a opts) re-bench is planned and these figures will be cross-referenced once it lands.

What we fixed for the DGX Spark

All AEON Qwen3.6-27B repos now run on one unified containerghcr.io/aeon-7/aeon-vllm-ultimate:latest — vLLM 0.23.0 built from source for sm_121a and merged with the AEON speculative-decoding stack, tuned end-to-end for the GB10's unified-memory Blackwell architecture. The two changes that matter most for this card:

  • DFlash high-concurrency fix (new in v0.23.0) — the speculative drafter previously crashed at ≥32 concurrent requests (a padded-vs-unpadded KV block-table shape mismatch in FlashAttention). The fix slices the drafter's block-table to the unpadded batch (block_table[:num_reqs]), so it now scales cleanly to c=64. This is a port of upstream PR #43982, which fixed the same bug for MTP but never for DFlash — it was present and unfixed even in the prior image.
  • Triton NVFP4 KV cache (PR #44389) — the only 4-bit KV path on sm_121a (upstream's is hard-gated to B200), giving ~3× KV capacity / longer context per GB of unified memory.
  • DFlash sliding-window attention (PR #40898) — runs the drafter's SWA layers as true sliding window, so long-context draft acceptance holds (~41% here at long context) instead of collapsing past ~2k tokens.

Container rollback tag: :2026-06-11-pr41703. Full writeup: container README.

Quantization recipe

  • Tool: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
  • Loader: Qwen3_5ForConditionalGeneration.from_pretrained (multimodal-preserved class)
  • Calibration: neuralmagic/calibration LLM split, 20 samples × 8192 tokens
  • Excluded from quantization (kept BF16):
    • lm_head, proj_out.*, *router*, *mlp.gate.* (NVFP4_DEFAULT_CFG)
    • *linear_attn.conv1d*, *mixer.conv1d* (NVFP4_DEFAULT_CFG)
    • *linear_attn* (added — full GDN preservation)
    • *visual* (added — vision tower preservation)
    • *mtp* (added — MTP head preservation)
    • *output_layer*, output.*
  • MTP graft: 15 tensors copied bf16 from Qwen/Qwen3.6-27B after modelopt export (AutoModelForCausalLM.from_pretrained drops them; explicit graft restores)
  • Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source

Provenance & credits

License + responsibility

Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.


☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

Model provider

AEON-7

Model tree

Base

AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today