AEON-7
Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP
README
License: apache-2.0

vLLM compatibility (DGX Spark / aeon-vllm-ultimate:latest)
Status: this text export does not load on the unified container as shipped. Validated 2026-06-18 on ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM v0.23.0, sm_121a).
The reason is purely a packaging mismatch, not a problem with the weights:
- `config.json` declares the multimodal architecture `Qwen3_5ForConditionalGeneration` (`model_type: qwen3_5`), so vLLM routes it through the multimodal loader.
- The multimodal loader expects an image processor in the repo, but this is a text-only export (the vision tower was stripped during quantization), so no processor files ship here.
- vLLM therefore aborts during init with "cannot load image processor" before serving can start.
This is fixable two ways, either of which would let this exact NVFP4+MTP body serve on the Spark:
- Re-export with a text-only architecture (e.g. a `*ForCausalLM` head class instead of `Qwen3_5ForConditionalGeneration`), so vLLM uses the text loader and never looks for an image processor; or
- Add the processor files (`preprocessor_config.json` / image-processor config) so the multimodal loader can initialize even though no image input is used.
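The mismatch described above can be checked on a local copy before launching a server. The sketch below is a heuristic, not vLLM's actual loader logic: it assumes that an architecture name ending in `ForConditionalGeneration` signals the multimodal path and that `preprocessor_config.json` is the processor file the loader wants. The function name `diagnose_export` is hypothetical.

```python
import json
from pathlib import Path


def diagnose_export(repo_dir):
    """Return a short diagnosis string for a local model directory.

    Heuristic only (an assumption, not vLLM's real routing code):
    a '*ForConditionalGeneration' architecture is treated as multimodal,
    and 'preprocessor_config.json' stands in for the image-processor files.
    """
    repo = Path(repo_dir)
    config = json.loads((repo / "config.json").read_text())
    architectures = config.get("architectures", [])

    multimodal = any(a.endswith("ForConditionalGeneration") for a in architectures)
    has_processor = (repo / "preprocessor_config.json").exists()

    if multimodal and not has_processor:
        return "mismatch: multimodal architecture but no processor files"
    if multimodal:
        return "ok: multimodal architecture with processor files"
    return "ok: text-only architecture"
```

Calling `diagnose_export("./aeon-ultimate-text-nvfp4-mtp")` on a downloaded copy would report which of the two fixes, if any, is still outstanding.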
Until then, for vLLM serving on aeon-vllm-ultimate:latest (DGX Spark / GB10), use the -Multimodal-NVFP4-MTP-XS sibling instead — it is the smallest NVFP4 export, loads cleanly, and is the fastest single-stream option in this family (~42 tok/s at c=1 with DFlash). This card still serves fine on dedicated-VRAM Blackwell (RTX PRO 6000 / B100 / B200) via the text path with native MTP — see Usage below.
Where this variant sits in the family
Recommended Spark alternative — measured c=1 throughput (Multimodal-NVFP4-MTP-XS)
These are the per-category single-stream numbers for the working sibling on aeon-vllm-ultimate:latest (DGX Spark / GB10, DFlash speculative decoding):
| Category | Decode tok/s | TTFT (ms) | TPOT (ms) | Prefill (tok/s) | DFlash accept % |
|---|---|---|---|---|---|
| Coding | 42.6 | 141 | 23.5 | 318 | 34.5 |
| Math | 55.9 | 248 | 17.9 | 246 | 48.0 |
| Reasoning | 49.3 | 232 | 20.3 | 211 | 41.7 |
| Prose | 31.2 | 229 | 32.1 | 166 | 23.2 |
| Natural language | 34.8 | 228 | 28.7 | 175 | 26.6 |
| Extraction / JSON | 57.4 | 234 | 17.4 | 231 | 49.3 |
Long-context (≈16k–32k) DFlash acceptance holds at ~45%. Aggregate throughput scales to c=64 on the unified container (Reasoning peaks ~340 tok/s). Full per-concurrency data lives on the XS sibling card.
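The Decode and TPOT columns are two views of the same measurement: decode tok/s is approximately 1000 / TPOT (ms). The sketch below checks that relationship against the table rows and estimates the wall-clock time of a reply as TTFT + n × TPOT. The latency formula is a simplification (it ignores queueing and slightly overcounts the first token), and the function names are illustrative.

```python
# (category, decode tok/s, TTFT ms, TPOT ms) copied from the table above.
ROWS = [
    ("Coding", 42.6, 141, 23.5),
    ("Math", 55.9, 248, 17.9),
    ("Reasoning", 49.3, 232, 20.3),
    ("Prose", 31.2, 229, 32.1),
    ("Natural language", 34.8, 228, 28.7),
    ("Extraction / JSON", 57.4, 234, 17.4),
]


def tok_per_s_from_tpot(tpot_ms):
    """Convert time-per-output-token (ms) to decode throughput (tok/s)."""
    return 1000.0 / tpot_ms


def estimated_latency_ms(ttft_ms, tpot_ms, n_tokens):
    """Rough single-stream latency: time to first token plus decode time."""
    return ttft_ms + n_tokens * tpot_ms


for name, decode, ttft, tpot in ROWS:
    derived = tok_per_s_from_tpot(tpot)
    seconds = estimated_latency_ms(ttft, tpot, 512) / 1000
    print(f"{name:18s} table={decode:5.1f} derived={derived:5.1f} 512 tokens ~{seconds:.1f} s")
```

Every derived value lands within 0.1 tok/s of the table, so either column can be used for capacity estimates.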
What we fixed for the DGX Spark
All AEON models run on one unified container — ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703). It is vLLM v0.23.0 built from source for sm_121a (GB10 / Blackwell) and merged with the AEON speculative-decoding stack.
- DFlash high-concurrency fix (new): slices the speculative drafter's KV block-table to the unpadded batch (`block_table[:num_reqs]`). The drafter previously crashed at ≥32 concurrent requests (padded-vs-unpadded block-table shape mismatch in FlashAttention); it now scales cleanly to c=64. A port of upstream PR #43982, which fixed this for MTP but never for DFlash.
- Unified vLLM 0.23.0 image: NVFP4 KV cache (PR #44389, the only 4-bit KV path on sm_121a) + DFlash sliding-window attention (PR #40898, so long-context draft acceptance holds) + sm_121a-native CUTLASS NVFP4/FP8 kernels + boot/CUDA-graph patches, all in a single tag.
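The block-table fix is easiest to see on a toy example. The sketch below is not the vLLM patch: it uses plain Python lists and hypothetical names (`slice_block_table`, `check_shapes`) to show why a table padded to a larger batch has to be sliced to the live request count before it meets per-request metadata.

```python
def slice_block_table(padded_block_table, num_reqs):
    """Keep only the rows that belong to live requests."""
    return padded_block_table[:num_reqs]


def check_shapes(block_table, seq_lens):
    """Mimic a kernel-side precondition: one block-table row per sequence."""
    if len(block_table) != len(seq_lens):
        raise ValueError(
            f"block table has {len(block_table)} rows, expected {len(seq_lens)}"
        )


# Three live requests; the table is padded to four rows (last row is filler).
seq_lens = [17, 5, 42]
padded_block_table = [[0, 1], [2, -1], [3, 4], [-1, -1]]
num_reqs = len(seq_lens)

# Passing the padded table directly would raise; the sliced table passes.
check_shapes(slice_block_table(padded_block_table, num_reqs), seq_lens)
```

In this toy setup the unsliced table fails the row-count check as soon as padding is present, which mirrors the kind of shape mismatch described above.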
Stock baseline pending fresh vanilla re-bench: no apples-to-apples stock (vanilla vLLM, no DFlash, no sm_121a opts) baseline exists for this family yet. A fully-vanilla benchmark on the current version is pending; the optimized figures above are measured on `aeon-vllm-ultimate:latest` (vLLM 0.23.0).
Variants
| Format | Size | Use case |
|---|---|---|
| BF16 | 51 GB | Full-precision reference weights (A100/H100 80 GB, RTX PRO 6000 96 GB, multi-GPU, fine-tuning) |
| NVFP4 (compressed-tensors + DFlash) | 26 GB | DGX Spark / GB10 — production validated with DFlash speculative decoding. Unified ghcr.io/aeon-7/aeon-vllm-ultimate:latest container. |
| Multimodal-NVFP4-MTP | 27 GB | High-bandwidth dedicated GPUs (RTX 5090, RTX PRO 6000, B100/B200) with MTP speculative decoding via the model's native mtp.* head. modelopt format, --quantization modelopt. Vision tower preserved. |
| Text-NVFP4-MTP (this repo) | 20 GB | Same recipe but with vision tower stripped. Smaller footprint for text-only deployments on tighter VRAM (RTX 5090 32 GB fits comfortably). |
What this is
This is the modelopt-format NVFP4 variant with MTP speculative decoding, text-only (vision tower stripped), of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).
Specifically:
- Body quantized to NVFP4 via `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`. This is the modelopt export format, which vLLM serves through `--quantization modelopt` (a different code path from the `-NVFP4` sibling release, which uses `--quantization compressed-tensors`).
- Linear-attn / GatedDeltaNet layers preserved BF16 (432 keys across 48 GDN layers). NVFP4 quantization on Mamba/SSM state collapses the recurrence; modelopt's `*linear_attn.conv1d*` ignore plus our explicit `*linear_attn*` exclude keeps these intact.
- Vision tower stripped (333 visual keys removed, ~0.92 GB). Text-only build: no image / video input. `language_model_only: true` set in `config.json`.
- MTP head grafted from the base `Qwen/Qwen3.6-27B` checkpoint (15 tensors, BF16). The base contains MTP heads but `Qwen3_5ForConditionalGeneration.from_pretrained` drops them during loading; the lna-lab pipeline pattern (which this build follows) explicitly grafts them back into the quantized output, giving vLLM a working drafter for `--speculative-config '{"method":"qwen3_5_mtp",...}'`.
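The strip and graft steps amount to key-level edits on the exported state dict. The sketch below uses plain dictionaries in place of real tensors so it runs without any ML dependencies. The `model.visual.` and `mtp.` prefixes follow the key patterns this card mentions; the function names and example keys are hypothetical, and this is not the lna-lab script.

```python
def strip_vision(state, prefix="model.visual."):
    """Drop every entry whose key belongs to the vision tower."""
    return {k: v for k, v in state.items() if not k.startswith(prefix)}


def graft_mtp(quantized_state, base_state, prefix="mtp."):
    """Copy MTP-head entries from the base checkpoint into the quantized one."""
    grafted = dict(quantized_state)
    for key, tensor in base_state.items():
        if key.startswith(prefix):
            grafted[key] = tensor
    return grafted


# Placeholder strings stand in for tensors; key names are invented examples.
quantized = {
    "model.layers.0.mlp.down_proj.weight": "nvfp4",
    "model.visual.blocks.0.attn.qkv.weight": "bf16",
}
base = {
    "model.layers.0.mlp.down_proj.weight": "bf16",
    "mtp.fc.weight": "bf16",
}

final = graft_mtp(strip_vision(quantized), base)
print(sorted(final))
```

Only keys under the MTP prefix are taken from the base checkpoint, so the quantized body weights are never overwritten by their BF16 originals.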
Why MTP — and where it actually wins
Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.
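A rough way to reason about what an acceptance rate buys: if each drafted token is accepted independently with probability `p` and `k` tokens are drafted per step, the expected number of tokens emitted per target forward pass is `(1 - p**(k + 1)) / (1 - p)`. This is the standard idealized speculative-decoding estimate, not something measured for this model. The independence assumption is a simplification, and vLLM's reported acceptance metric may be defined differently.

```python
def expected_tokens_per_step(p, k):
    """Idealized tokens emitted per target pass with k drafted tokens.

    Assumes each draft token is accepted independently with probability p
    and that one extra token is always produced by the target model.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if p == 1.0:
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)


# Example inputs: an acceptance rate like the 67.7 % reported on this card,
# with 3 speculative tokens per step.
print(expected_tokens_per_step(0.677, 3))
```

The estimate is an upper-level intuition only: real throughput also depends on draft cost and memory bandwidth, which is why the same acceptance rate pays off differently across hardware.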
Measured numbers on AEON-Ultimate (this MTP family)
| Hardware | Median tok/s | Peak tok/s | Spec-decode acceptance |
|---|---|---|---|
| RTX PRO 6000 Blackwell (96 GB dedicated VRAM) | ~92 (regular) / 111.4 (XS sibling) | 124.7 (XS sibling) | 67.7 % regular / 69.2 % XS |
| DGX Spark / GB10 (unified memory) — MTP method | 24.1 (XS sibling) | 27.5 | 66.3 % |
| DGX Spark / GB10 — DFlash on the same XS body 🏆 | 38.5 tok/s thinking-on / 38.1 off | 71.3 tok/s thinking-on / 68.4 off | DFlash (n=12) |
| RTX 5090, B100 / B200 | not yet measured by us (community welcome) | | |
Reference numbers from sakamakismile's un-abliterated recipe (RTX 5090)
- Single-stream short prompts at `n=3`: ~132 tok/s
- Single-stream long-form: ~105 tok/s
- 2-parallel aggregate (256K + KV FP8): ~189–207 tok/s
- Mean MTP acceptance length: ~3.0–4.0 (vs DFlash chains ~2.0–2.3)
The hardware-routing punchline
On RTX PRO 6000 the XS sibling beats DFlash territory (~111 tok/s vs DFlash-class ~85 we'd expect there). On DGX Spark, DFlash beats MTP by 26 % median / 52 % peak — the unified-memory bandwidth caps how much MTP's high acceptance can translate to throughput. So: MTP is a dedicated-VRAM-Blackwell variant, not a universal upgrade. Full bench data: GitHub repo Performance section.
🎯 When to pick this variant — measured hardware routing
The right speculative-decode method depends on memory architecture:
| Hardware tier | Recommended variant | Why |
|---|---|---|
| DGX Spark / GB10 (sm_121a, unified memory) | -NVFP4 (DFlash) — not this MTP variant | Bench on Spark: DFlash beats MTP by +26 % median, +52 % peak. Spark's unified-memory bandwidth doesn't reward MTP's high acceptance rate. Don't run MTP on Spark. |
| RTX PRO 6000 Blackwell (sm_120, 96 GB dedicated VRAM) | This variant ✅ if text-only; Multimodal if you need vision | MTP wins on dedicated VRAM. ~92 tok/s median measured (multimodal sibling, GDN BF16). |
| RTX 5090 (sm_120, 32 GB dedicated VRAM) | Text-XS is the better fit (~20 GB), or this variant if you have headroom | XS variant matches sakamakismile's reference footprint. 111.4 tok/s median measured on RTX PRO 6000; RTX 5090 should land near or above. |
| A100 / H100 (no native FP4) | BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper — no benefit. |
| B100 / B200 (sm_100, dedicated FP4) | This variant or Multimodal | Native FP4 + dedicated VRAM = MTP territory. |
Full bench numbers: GitHub repo Performance section.
Usage
vLLM serve
No runnable serve command is published for this repo. This text export does not load on the unified `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` container (multimodal architecture → "cannot load image processor"; see the vLLM compatibility note above for the root cause and the two fixes).

For a working vLLM quickstart on `aeon-vllm-ultimate:latest`, use the `-Multimodal-NVFP4-MTP-XS` sibling: it loads cleanly and is the fastest single-stream option in this family.
Download the weights (text-only):
```bash
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP \
  --local-dir ./aeon-ultimate-text-nvfp4-mtp
```
Earlier reference: dedicated-VRAM Blackwell text path (NOT a copy-paste serve command; this body does not load on the unified container as shipped). On RTX PRO 6000 / B100 / B200 the NVFP4+MTP body was driven through the text path with native MTP using the following settings: --quantization modelopt, the qwen3_5_mtp spec-decode head, the qwen3 reasoning parser, and the qwen3_coder tool-call parser. num_speculative_tokens=3 is the canonical setting for qwen3_5_mtp (at higher values the drafter diverges from the target distribution and acceptance falls); --gpu-memory-utilization was kept ≤ 0.85 to avoid the FlashInfer NVFP4 GEMM autotuner OOM on first boot.
Configuration notes
- `--quantization modelopt` is required (not `compressed-tensors`, which is a different format).
- `--speculative-config '{"method":"qwen3_5_mtp", ...}'` activates the grafted MTP head as the spec-decode drafter. No external drafter download is needed: the head is in the safetensors of this repo.
- `--gpu-memory-utilization` should be kept ≤ 0.85; higher values risk the FlashInfer NVFP4 GEMM autotuner OOMing on first boot. See the GitHub repo's RTX PRO 6000 page for the same OOM behavior under DFlash.
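For readers assembling their own launch script on dedicated-VRAM hardware, the settings named in this section can be collected as below. This is an unvalidated sketch, not a published serve command: it only builds an argument list and does not start a server. The card elides the rest of the `--speculative-config` JSON, so only the two fields it names are included, and `build_serve_args` is a hypothetical helper. This export still does not load on the unified container as shipped.

```python
import json


def build_serve_args(model_path, num_speculative_tokens=3, gpu_memory_utilization=0.85):
    """Assemble a `vllm serve` argument list from the settings named on this card."""
    if gpu_memory_utilization > 0.85:
        # The card advises staying at or below 0.85 to avoid a first-boot OOM.
        raise ValueError("gpu_memory_utilization should be <= 0.85")

    speculative_config = {
        "method": "qwen3_5_mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", model_path,
        "--quantization", "modelopt",
        "--speculative-config", json.dumps(speculative_config),
        "--reasoning-parser", "qwen3",
        "--tool-call-parser", "qwen3_coder",
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


print(" ".join(build_serve_args("./aeon-ultimate-text-nvfp4-mtp")))
```

Keeping the speculative config as a dictionary until the last moment avoids shell-quoting mistakes in the JSON argument.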
Quantization recipe
- Tool: `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`
- Loader: `Qwen3_5ForConditionalGeneration.from_pretrained` (multimodal-preserved class)
- Calibration: `neuralmagic/calibration` LLM split, 20 samples × 8192 tokens
- Excluded from quantization (kept BF16):
  - `lm_head`, `proj_out.*`, `*router*`, `*mlp.gate.*` (NVFP4_DEFAULT_CFG)
  - `*linear_attn.conv1d*`, `*mixer.conv1d*` (NVFP4_DEFAULT_CFG)
  - `*linear_attn*` (added, full GDN preservation)
  - `*visual*` (added, vision tower preservation)
  - `*mtp*` (added, MTP head preservation)
  - `*output_layer*`, `output.*`
- Vision strip: post-export, `model.visual.*` keys (333 tensors, ~0.92 GB) removed; `vision_config` removed from `config.json`; `language_model_only: true` set; preprocessor configs cleaned
- MTP graft: 15 tensors copied bf16 from `Qwen/Qwen3.6-27B` after modelopt export (`AutoModelForCausalLM.from_pretrained` drops them; explicit graft restores)
- Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source
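To see how the exclude list partitions a checkpoint, the patterns can be applied to module names with shell-style wildcard matching. The sketch assumes `fnmatch` semantics are close enough to modelopt's own pattern matching for illustration (that is an assumption, not a statement about modelopt internals), and the module names in the example are invented rather than read from this repo.

```python
from fnmatch import fnmatchcase

# Exclusion patterns as listed in the recipe above.
EXCLUDE_PATTERNS = [
    "lm_head", "proj_out.*", "*router*", "*mlp.gate.*",
    "*linear_attn.conv1d*", "*mixer.conv1d*",
    "*linear_attn*", "*visual*", "*mtp*",
    "*output_layer*", "output.*",
]


def kept_in_bf16(module_name):
    """True if any exclusion pattern matches, i.e. the module is not quantized."""
    return any(fnmatchcase(module_name, pattern) for pattern in EXCLUDE_PATTERNS)


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.layers.0.linear_attn.in_proj",
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.mlp.gate_proj",
    "mtp.layers.0.mlp.down_proj",
]
for name in examples:
    print(f"{name:40s} {'BF16' if kept_in_bf16(name) else 'NVFP4'}")
```

Note that `*mlp.gate.*` requires a literal `mlp.gate.` segment, so in this sketch a projection named `mlp.gate_proj` is not excluded and would be quantized.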
Provenance & credits
- BF16 source: `AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16`. See that card for the full abliteration pipeline.
- MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (`docs/MTP_GRAFT_RECIPE.md`)
- Reference benchmark recipes: `sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP`
- Quantization: NVIDIA TensorRT Model Optimizer (`nvidia-modelopt` 0.43.0)
- Base: Alibaba Qwen team, `Qwen/Qwen3.6-27B`
License + responsibility
Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.
☕ Support the work
If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.
Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.