AEON-7
Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS
License: apache-2.0
vLLM compatibility (DGX Spark / aeon-vllm-ultimate:latest)
Status: does not load under ghcr.io/aeon-7/aeon-vllm-ultimate:latest — same as the -Text-NVFP4-MTP sibling. Validated 2026-06-18 on aeon-vllm-ultimate:latest (vLLM 0.23.0, DGX Spark GB10).
This is a text-only export carried on the multimodal Qwen3_5ForConditionalGeneration architecture with the vision tower stripped — and therefore no image processor. vLLM selects its multimodal loader from the architecture, then fails to initialize because the processor it expects for a multimodal model is absent. This is a packaging mismatch, not a weight-quality issue: the NVFP4 + MTP weights themselves are fine.
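A quick pre-flight check can make this mismatch visible before a server is started. The sketch below assumes the usual Hugging Face checkpoint layout, where the architecture list lives in `config.json` and an image processor is described by `preprocessor_config.json`. The helper name `diagnose_export` is hypothetical, and the check is only a heuristic; it does not reproduce vLLM's loader logic.

```python
import json
from pathlib import Path


def diagnose_export(model_dir):
    """Heuristic check for a multimodal architecture shipped without a processor.

    Assumes the standard Hugging Face file names (config.json,
    preprocessor_config.json). This is not vLLM's actual loader code.
    """
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text())
    architectures = config.get("architectures", [])
    multimodal_arch = any(
        name.endswith("ForConditionalGeneration") for name in architectures
    )
    has_processor = (model_dir / "preprocessor_config.json").is_file()
    return {
        "architectures": architectures,
        "multimodal_arch": multimodal_arch,
        "has_image_processor": has_processor,
        # The combination described above: multimodal class, no processor.
        "likely_load_failure": multimodal_arch and not has_processor,
    }
```

Running `diagnose_export("/path/to/local/checkpoint")` on a downloaded copy returns a small dict; `likely_load_failure` being true would match the packaging mismatch described above.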
For vLLM serving on aeon-vllm-ultimate:latest, use the -Multimodal-NVFP4-MTP-XS sibling instead — it keeps the image processor the multimodal loader needs and is the benchmarked production body (served with DFlash spec decode on Spark; the grafted MTP head is unused in that path). For dedicated-VRAM Blackwell where the multimodal loader is not in play, this text-only XS body is still a valid MTP target.
Where each variant in the family lands (working alternatives shown — single-stream decode and aggregate throughput on the Spark container):
What we fixed for the DGX Spark
The unified container ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703) is vLLM 0.23.0 built from source for sm_121a and merged with the AEON speculative-decoding stack. Two highlights relevant to the Qwen3.6-27B family:
- DFlash high-concurrency fix (new): the speculative drafter previously crashed at ≥32 concurrent requests (padded-vs-unpadded KV block-table shape mismatch in FlashAttention). The drafter's block-table is now sliced to the unpadded batch (`block_table[:num_reqs]`), so DFlash scales cleanly to c=64. A port of upstream PR #43982, which fixed this for MTP but never for DFlash.
- Unified single image: one container now loads every correctly-packaged Qwen3.6-27B AEON-Ultimate body (NVFP4 KV cache via PR #44389, DFlash SWA via PR #40898, prefix-cache corruption-immunity via PR #41703, sm_121a build + CUDA-graph patches), replacing the per-model image sprawl.
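The shape mismatch behind the high-concurrency fix can be illustrated with plain Python lists. This is a toy sketch, not the vLLM code: the function names, the row contents, and the use of `-1` as a padding marker are assumptions made for illustration.

```python
def slice_block_table(block_table, num_reqs):
    """Drop padding rows so the table matches the real batch size."""
    return block_table[:num_reqs]


def check_shapes(block_table, seq_lens):
    """Stand-in for a kernel-side check: one block-table row per request."""
    if len(block_table) != len(seq_lens):
        raise ValueError(
            f"block_table has {len(block_table)} rows, expected {len(seq_lens)}"
        )
    return True


# Hypothetical batch: 3 live requests, table padded to 4 rows.
num_reqs = 3
seq_lens = [17, 5, 42]
padded_block_table = [[0, 1], [2, -1], [3, 4], [-1, -1]]  # last row is padding

unpadded = slice_block_table(padded_block_table, num_reqs)
check_shapes(unpadded, seq_lens)  # passes; the padded table would raise
```

Passing `padded_block_table` directly to `check_shapes` raises, which mirrors the padded-vs-unpadded mismatch; slicing to the live request count removes it.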
Stock baseline pending fresh vanilla re-bench: no same-harness stock (un-optimized vanilla vLLM) baseline exists for these variants yet; the comparison numbers will be added once a fresh fully-vanilla benchmark completes on vLLM 0.23.0.
🏆 DGX Spark performance — current production
On Spark/GB10 the production body is the `-Multimodal-NVFP4-MTP-XS` sibling (this text-only export does not load under the multimodal loader; see "vLLM compatibility" above). It is served with DFlash spec decode (not the MTP head) under the canonical `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` image. Recommended drafter setting: `num_speculative_tokens: 12` (the n=8–15 sweep found n=10–12 statistically tied short-context, n=12 best for long-context acceptance and the production default). Use the default drafter attention backend (do not add `attention_backend` to the spec-config) and do not set `--kv-cache-dtype` (BF16 is required for the non-causal DFlash drafter). See the GitHub Performance section for the measured comparison table.
Why long-context drafting holds up: the z-lab Qwen3.6-27B DFlash drafter is a sliding-window model; 4 of its 5 layers use sliding-window attention (window 2048). vLLM PR #40898 (in `aeon-vllm-ultimate:latest`) runs those layers as proper SWA; earlier images ran them as full attention, so drafting collapsed once context grew past ~2048 tokens. PR #41703 additionally makes `--enable-prefix-caching` corruption-immune with DFlash. Net: long-context drafting holds up; short-context (<2048, one window) is unchanged.
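The difference between full attention and a 2048-token sliding window can be sketched by counting which key positions a query may see. The helper below is a simplified, causal illustration with a hypothetical name; the drafter's real masking (described above as non-causal) may differ in detail, but the short-context versus long-context behavior is the point.

```python
def attended_positions(query_pos, window=None):
    """Key positions visible to query_pos under a simplified causal mask.

    window=None models full attention; an integer models a sliding window
    that keeps only the most recent `window` positions (query included).
    """
    start = 0 if window is None else max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


WINDOW = 2048  # window size quoted above

# Short context: the window covers everything, so both modes agree.
assert len(attended_positions(1000, WINDOW)) == len(attended_positions(1000))

# Long context: full attention sees 10001 keys, the window caps it at 2048.
full = attended_positions(10_000)
swa = attended_positions(10_000, WINDOW)
print(len(full), len(swa))  # 10001 2048
```

A layer trained with the window but run as full attention sees far more keys than it was trained on once the context exceeds the window, which is consistent with the collapse past ~2048 tokens described above.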
🙏 Reference recipe credit: The conv1d-preserved NVFP4 + MTP graft pipeline used to build this XS variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config, including the strategic decision to quantize the GDN projection matmuls to NVFP4 while preserving `linear_attn.conv1d` at BF16, and the MTP-head graft technique. We adapted the recipe to AEON-Ultimate's abliterated weights and ship both the conv1d-preserved-only XS variant (matching their footprint) and a heavier regular-MTP variant that additionally keeps the projections at BF16. Full credit for the underlying recipe → sakamakismile.
What "XS" means — and what it's not
This is the extra-small footprint sibling of -Text-NVFP4-MTP. XS is not "everything to FP4." It is a deliberate, principled split: the heavy GDN matmul projections drop to NVFP4 (where they're bandwidth-bound and FP4 wins big), while the SSM-critical linear_attn.conv1d kernel stays BF16 (where FP4 has documented stability problems on long-context recurrence).
| | Text-NVFP4-MTP (regular) | Text-NVFP4-MTP-XS (this repo) |
|---|---|---|
| `linear_attn` projections (`in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj`) | preserved BF16 (~11 GB) | quantized to NVFP4 (~3 GB) |
| `linear_attn.conv1d` (SSM 1D convolution, recurrence-critical) | preserved BF16 | preserved BF16 ✅ |
| `linear_attn` SSM state vectors (`A_log`, `dt_bias`, `norm.weight`) | preserved BF16 | preserved BF16 ✅ |
| `mtp.*` head (grafted bf16 from base, bit-exact verified) | yes | yes |
| Vision tower | stripped | stripped |
| Total disk | ~26 GB | ~20 GB |
| VRAM footprint at runtime | ~27 GB | ~21 GB |
This is a smart, strategic quantization — not a precision compromise. The conv1d preservation matters: the GatedDeltaNet recurrence depends on the 1D convolution behaving numerically like its training distribution, and FP4 quantization of conv1d has been observed to cause drift on long-context inference in community testing. By keeping conv1d BF16 while quantizing the projections (which are bandwidth-limited matmuls where FP4 is a clean win), we get the ~6 GB footprint reduction without sacrificing the part of the model that's actually fragile under quantization. This is the same principle modelopt's NVFP4_DEFAULT_CFG applies by default and the same recipe sakamakismile validated across his Qwen3.6-NVFP4-MTP series (22K+ downloads).
When to pick which:
- Pick the regular variant if you have ≥48 GB VRAM. Keeping the projection weights at BF16 as well gives a small additional safety margin on long-context recurrence stability.
- Pick this XS variant if you have 24–32 GB VRAM (RTX 5090, single GPUs without headroom for full BF16 GDN). The conv1d preservation guarantees the SSM recurrence stays numerically stable; the ~6 GB savings buy meaningful KV-cache headroom on tight GPUs.
We ship both because we have the headroom on RTX PRO 6000 / B100/B200 to run the larger, more numerically-conservative version, and several users on tighter cards have asked for the smaller one. Neither variant quantizes linear_attn.conv1d — that would be a different (and not-recommended) variant we have explicitly chosen not to ship.
🆕 AEON vLLM Ultimate container
`ghcr.io/aeon-7/aeon-vllm-ultimate:latest` (= tag `:2026-06-18-v0.23.0-dflashfix`; rollback tag `:2026-06-11-pr41703`) is the canonical image for all Qwen3.6-27B AEON-Ultimate repos: vLLM 0.23.0 (built from source for sm_121a) + PR #44389 NVFP4 KV cache (~3× capacity) + DFlash high-concurrency fix (c=64) + TurboQuant K8V4 + AEON sm_121a patches, plus the PR #40898 / PR #41703 DFlash sliding-window fixes (see "Why long-context drafting holds up" above).
The image ENTRYPOINT is `/bin/bash`, so `docker run` must pass `--entrypoint vllm` and then `serve …` (do not write `IMAGE vllm serve`; that runs `bash vllm serve` and fails).
Same recipe family as the `-Multimodal-NVFP4-MTP-XS` sibling (the benchmarked body). This text-only variant uses the same modelopt NVFP4 format, the same `qwen3_5_mtp` native head, and the same hybrid GDN+attention stack, but on `aeon-vllm-ultimate:latest` the multimodal loader cannot initialize it without an image processor (see "vLLM compatibility" above), so serve the Multimodal-XS sibling there. On dedicated-VRAM Blackwell it serves with `--quantization modelopt` and either `--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'` (native MTP) or a DFlash drafter (recommended on Spark; see container README Recipe A). Do not set `--kv-cache-dtype` with DFlash; the non-causal DFlash drafter requires BF16 KV.
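The entrypoint rule is easy to get wrong when assembling a command by hand, so the sketch below builds the argument list programmatically. The helper name and the `<model-repo-or-path>` placeholder are hypothetical, and `--rm` / `--gpus all` are ordinary Docker flags chosen for illustration, not a validated recipe from this card. Only the ordering (override the entrypoint, then start the arguments with `serve`) comes from the note above.

```python
import shlex

IMAGE = "ghcr.io/aeon-7/aeon-vllm-ultimate:latest"


def docker_serve_argv(model, extra_docker_args=(), extra_vllm_args=()):
    """Build a docker argv that overrides the /bin/bash ENTRYPOINT with vllm.

    Everything after the image name is passed to the entrypoint, so the
    first token there must be `serve`, not `vllm`.
    """
    return [
        "docker", "run", *extra_docker_args,
        "--entrypoint", "vllm",
        IMAGE,
        "serve", model, *extra_vllm_args,
    ]


# "<model-repo-or-path>" is a placeholder, not a validated model choice.
argv = docker_serve_argv(
    "<model-repo-or-path>",
    extra_docker_args=["--rm", "--gpus", "all"],
    extra_vllm_args=["--quantization", "modelopt"],
)
print(shlex.join(argv))
```

Printing the joined command before running it makes it simple to confirm that `vllm` appears only as the entrypoint value and never after the image name.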
Variants
| Format | Size | Use case |
|---|---|---|
| BF16 | 51 GB | Full-precision reference weights |
| NVFP4 (compressed-tensors + DFlash) | 26 GB | DGX Spark — DFlash spec decode, validated |
| Multimodal-NVFP4-MTP | 27 GB | RTX PRO 6000 / B100/B200 — MTP, GDN preserved BF16 |
| Text-NVFP4-MTP | 26 GB | Same as above without vision tower |
| Multimodal-NVFP4-MTP-XS | 21 GB | RTX 5090 / smaller dedicated VRAM: MTP, GDN projections in NVFP4 (conv1d kept BF16) |
| Text-NVFP4-MTP-XS (this repo) | 20 GB | Same as Multimodal-XS without vision tower |
What this is
The modelopt-format NVFP4 + MTP variant, text-only (vision tower stripped), with linear_attn projections fully quantized, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).
Specifically:
- Body quantized to NVFP4 via `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`. modelopt format, served by vLLM through `--quantization modelopt`.
- Linear-attn / GatedDeltaNet projections quantized to NVFP4 (this is the XS difference). Only `linear_attn.conv1d` is kept BF16 (modelopt's default). The community has validated this approach on Qwen3.5/3.6-NVFP4 builds with 22K+ downloads on sakamakismile's reference recipes; we re-ran calibration on our abliterated weights and the model serves correctly.
- Vision tower stripped (333 visual keys removed, ~0.92 GB). Text-only build: no image / video input. `language_model_only: true` set in `config.json`.
- MTP head grafted from the base `Qwen/Qwen3.6-27B` checkpoint (15 tensors, BF16, bit-exact verified). Powers `--speculative-config '{"method":"qwen3_5_mtp",...}'` for self-speculative decoding without a separate drafter.
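The vision-strip step can be pictured as a key filter plus a config patch. The sketch below works on a plain dict standing in for a checkpoint's tensor map; the function name and the toy key names are hypothetical, and only the `model.visual.` prefix and the `language_model_only` flag come from this card. It is not the pipeline actually used to build the release.

```python
def strip_vision_keys(state_dict, config, prefix="model.visual."):
    """Return (text_only_state_dict, patched_config, removed_count).

    Keys starting with `prefix` are dropped; everything else is kept as-is.
    The inputs are not mutated.
    """
    kept = {k: v for k, v in state_dict.items() if not k.startswith(prefix)}
    removed = len(state_dict) - len(kept)
    patched = dict(config, language_model_only=True)
    return kept, patched, removed


# Toy checkpoint with made-up key names and placeholder values.
toy_state = {
    "model.visual.blocks.0.attn.qkv.weight": "tensor",
    "model.visual.merger.mlp.0.weight": "tensor",
    "model.language_model.layers.0.mlp.up_proj.weight": "tensor",
    "mtp.fc.weight": "tensor",
}
kept, cfg, removed = strip_vision_keys(toy_state, {"model_type": "toy"})
print(removed, sorted(kept))
```

On a real checkpoint the removed count would be the number to compare against the 333 visual keys quoted above.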
Why MTP
Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.
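A rough feel for how acceptance rate turns into throughput comes from the standard speculative-decoding estimate. The sketch assumes each drafted token is accepted independently with the same probability, which is a simplification (real acceptance varies by position and prompt), and it counts the one token the verifier always contributes. Whether a reported "mean acceptance length" counts that extra token is also an assumption here.

```python
def expected_tokens_per_step(accept_prob, num_draft_tokens):
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    `accept_prob`, and that acceptance stops at the first rejection. The
    verifier always contributes one token, so the result is
    1 + a + a^2 + ... + a^k for k drafted tokens.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    return sum(accept_prob ** i for i in range(num_draft_tokens + 1))


for a in (0.6, 0.8, 0.9):
    print(a, round(expected_tokens_per_step(a, 3), 3))
```

Under this model the gain from adding draft tokens flattens quickly unless the per-token acceptance probability is high, which is one reason a well-matched drafter matters more than a long draft.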
Indicative published numbers (sakamakismile's reference recipe on RTX 5090):
- Single-stream short prompts at `n=3`: ~132 tok/s
- Single-stream long-form: ~105 tok/s
- 2-parallel aggregate (256K + KV FP8): ~189-207 tok/s
- Mean acceptance length: ~3.0-4.0 (compared to DFlash chains of ~2.0-2.3)
Validated benchmarks of the AEON-Ultimate XS variant land in the GitHub repo once measured.
🎯 When to pick this variant — measured hardware routing
The right speculative-decode method depends on memory architecture:
| Hardware tier | Recommended variant | Why |
|---|---|---|
| DGX Spark / GB10 (sm_121a, unified memory) | -NVFP4 (DFlash) — not any MTP variant | Bench on Spark: DFlash beats MTP-XS by +26 % median, +52 % peak. Don't run MTP on Spark. |
| RTX PRO 6000 Blackwell (96 GB dedicated VRAM) | Text-NVFP4-MTP — GDN BF16 for best long-context fidelity, or this XS variant for ~10 % faster decode | XS measured 111.4 tok/s median vs regular ~92 tok/s on RTX PRO 6000. Both win against DFlash on dedicated VRAM. |
| B100 / B200 (sm_100, dedicated FP4) | Text-NVFP4-MTP (preferred — GDN BF16 fits) or this XS | Native FP4 + dedicated VRAM = MTP territory. Whichever fits cleanly. |
| RTX 5090 (sm_120, 32 GB dedicated VRAM) | This XS variant ✅ — fits at ~21 GB runtime, matches sakamakismile's reference footprint | XS variants fit comfortably in 32 GB with KV headroom. |
| A100 / H100 (no native FP4) | BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper — no benefit. |
Full bench numbers: GitHub repo Performance section.
Usage
vLLM serve
⚠️ This text-only export does not load under `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` (the canonical DGX Spark image): vLLM selects the multimodal loader from the `Qwen3_5ForConditionalGeneration` architecture and then fails because this build has no image processor (see vLLM compatibility above). There is intentionally no copy-paste serve command for this repo on that image.
For vLLM serving on `aeon-vllm-ultimate:latest`, serve the `-Multimodal-NVFP4-MTP-XS` sibling instead: it keeps the image processor the multimodal loader needs, ships the same NVFP4 + `qwen3_5_mtp` head, and is the benchmarked production body. Its card carries the validated copy-paste serve command (DFlash `num_speculative_tokens=12` on Spark; MTP via the grafted head on dedicated-VRAM Blackwell).
On dedicated-VRAM Blackwell (RTX PRO 6000 / B100 / B200), where the multimodal loader is not in play, this text-only body is a valid MTP target served with --quantization modelopt and --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":12}'.
Configuration notes
- `--quantization modelopt` is required (not `compressed-tensors`; different format).
- `--speculative-config '{"method":"qwen3_5_mtp", ...}'` activates the grafted MTP head as the spec-decode drafter. No external drafter download needed; the head is in the safetensors of this repo.
- `--gpu-memory-utilization`: never exceed `0.88` on unified-memory hosts (DGX Spark thrashes at 0.90+); on dedicated-VRAM Blackwell you have more headroom.
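The three notes above can be combined into a small argument builder, which avoids hand-quoting the JSON passed to `--speculative-config`. The function name and the `unified_memory` switch are hypothetical; the flags and the 0.88 ceiling are the ones quoted in this section, and the draft-token count is left as a parameter rather than fixed.

```python
import json

UNIFIED_MEMORY_CAP = 0.88  # ceiling quoted above for unified-memory hosts


def build_serve_args(model, num_speculative_tokens, gpu_memory_utilization,
                     unified_memory=False):
    """Assemble `vllm serve` arguments from the configuration notes above."""
    if unified_memory and gpu_memory_utilization > UNIFIED_MEMORY_CAP:
        raise ValueError(
            f"gpu_memory_utilization {gpu_memory_utilization} exceeds "
            f"{UNIFIED_MEMORY_CAP} on a unified-memory host"
        )
    spec_config = {
        "method": "qwen3_5_mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "serve", model,
        "--quantization", "modelopt",
        "--speculative-config", json.dumps(spec_config),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
```

The returned list is meant to follow the `vllm` executable (or a container entrypoint set to `vllm`); passing it as a list to a process launcher sidesteps shell quoting of the JSON value entirely.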
Quantization recipe
- Tool: `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`
- Loader: `Qwen3_5ForConditionalGeneration.from_pretrained` (multimodal-preserved class; vision stripped post-export)
- Calibration: `neuralmagic/calibration` LLM split, 20 samples × 8192 tokens
- Excluded from quantization (kept BF16); XS variant differences from the regular variant in bold:
  - `lm_head`, `proj_out.*`, `*router*`, `*mlp.gate.*` (NVFP4_DEFAULT_CFG)
  - `*linear_attn.conv1d*`, `*mixer.conv1d*` (NVFP4_DEFAULT_CFG default; kept BF16 because FP4 quantization of the SSM 1D convolution causes drift on long-context recurrence; this is the recurrence-critical kernel of the GatedDeltaNet block. Both regular and XS variants preserve this.)
  - **`*linear_attn*` is NOT broadly excluded** (XS difference: the projection matmuls `in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj` get NVFP4-quantized; saves ~8 GB; FP4 is a clean win on bandwidth-bound matmuls)
  - `*visual*` (excluded during quant; vision tower then stripped post-export)
  - `*mtp*` (MTP head preservation)
  - `*output_layer*`, `output.*`
- MTP graft: 15 tensors copied bf16 from `Qwen/Qwen3.6-27B` after modelopt export
- Vision strip: post-export, all `model.visual.*` keys removed; `config.json` patched with `language_model_only: true`
- Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source
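To see how wildcard exclusions separate the conv1d kernel from the projection matmuls, the patterns from the list above can be applied to a few module names. This sketch uses Python's `fnmatch` as a stand-in; whether modelopt matches patterns in exactly this way is an assumption, and the module names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Patterns copied from the exclusion list above.
KEEP_BF16_PATTERNS = [
    "lm_head", "proj_out.*", "*router*", "*mlp.gate.*",
    "*linear_attn.conv1d*", "*mixer.conv1d*",
    "*visual*", "*mtp*", "*output_layer*", "output.*",
]


def stays_bf16(module_name, patterns=KEEP_BF16_PATTERNS):
    """True if any exclusion pattern matches the module name."""
    return any(fnmatchcase(module_name, p) for p in patterns)


# Hypothetical module names, for illustration only.
names = [
    "model.language_model.layers.0.linear_attn.conv1d",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.0.linear_attn.out_proj",
    "mtp.layers.0.mlp.down_proj",
    "lm_head",
]
for name in names:
    print(f"{name}: {'BF16' if stays_bf16(name) else 'NVFP4'}")
```

Adding a broad `*linear_attn*` pattern to the list would flip the two projection entries to BF16, which corresponds to the regular (non-XS) variant described earlier.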
Provenance & credits
- BF16 source: `AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16`. See that card for the full abliteration pipeline.
- MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (`docs/MTP_GRAFT_RECIPE.md`)
- Reference benchmark recipes: `sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP`
- Quantization: NVIDIA TensorRT Model Optimizer (`nvidia-modelopt` 0.43.0)
- Base: Alibaba Qwen team, `Qwen/Qwen3.6-27B`
License + responsibility
Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.
☕ Support the work
If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.
Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.