AEON-7

Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS

README

License: apache-2.0

What "XS" means — and what it's not

This is the extra-small footprint sibling of -Text-NVFP4-MTP. XS is not "everything to FP4." It is a deliberate, principled split: the heavy GDN matmul projections drop to NVFP4 (where they're bandwidth-bound and FP4 wins big), while the SSM-critical linear_attn.conv1d kernel stays BF16 (where FP4 has documented stability problems on long-context recurrence).

Table with columns: Text-NVFP4-MTP (regular), Text-NVFP4-MTP-XS (this repo)
	Text-NVFP4-MTP (regular)	Text-NVFP4-MTP-XS (this repo)
`linear_attn` projections (`in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj`)	preserved BF16 (~11 GB)	quantized to NVFP4 (~3 GB)
`linear_attn.conv1d` (SSM 1D convolution — recurrence-critical)	preserved BF16	preserved BF16 ✅
`linear_attn` SSM state vectors (`A_log`, `dt_bias`, `norm.weight`)	preserved BF16	preserved BF16 ✅
`mtp.` head (grafted bf16 from base, bit-exact verified)*	yes	yes
Vision tower	stripped	stripped
Total disk	~26 GB	~20 GB
VRAM footprint at runtime	~27 GB	~21 GB

This is a smart, strategic quantization — not a precision compromise. The conv1d preservation matters: the GatedDeltaNet recurrence depends on the 1D convolution behaving numerically like its training distribution, and FP4 quantization of conv1d has been observed to cause drift on long-context inference in community testing. By keeping conv1d BF16 while quantizing the projections (which are bandwidth-limited matmuls where FP4 is a clean win), we get the ~6 GB footprint reduction without sacrificing the part of the model that's actually fragile under quantization. This is the same principle modelopt's NVFP4_DEFAULT_CFG applies by default and the same recipe sakamakismile validated across his Qwen3.6-NVFP4-MTP series (22K+ downloads).

When to pick which:

Pick the regular variant if you have ≥48 GB VRAM. Even the projection weights at BF16 give a small additional safety margin on long-context recurrence stability.
Pick this XS variant if you have 24–32 GB VRAM (RTX 5090, single GPUs without headroom for full BF16 GDN). The conv1d preservation guarantees the SSM recurrence stays numerically stable; the ~6 GB savings buy meaningful KV-cache headroom on tight GPUs.

We ship both because we have the headroom on RTX PRO 6000 / B100/B200 to run the larger, more numerically-conservative version, and several users on tighter cards have asked for the smaller one. Neither variant quantizes linear_attn.conv1d — that would be a different (and not-recommended) variant we have explicitly chosen not to ship.

🆕 AEON vLLM Ultimate container (2026-06-04)

ghcr.io/aeon-7/aeon-vllm-ultimate:latest — vLLM 0.23.0 (= :2026-06-18-v0.23.0-dflashfix) + PR #44389 NVFP4 KV cache (~3× capacity) + DFlash + TurboQuant K8V4 + AEON sm_121a patches. Same recipe family as the -Multimodal-NVFP4-MTP-XS sibling which has been benchmarked end-to-end (production-style greedy + n_spec=15 by category: math/code peak ~45 tok/s, overall mean 34.7 tok/s; concurrent ×4 steady ~84 tok/s aggregate). This variant uses the same modelopt NVFP4 format, the same qwen3_5_mtp native head, and the same hybrid GDN+attention stack — it should serve identically with --quantization modelopt and either --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' (native MTP) or a DFlash drafter (recommended on Spark — see container README Recipe A).

The v3 image (ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3) remains the stable production target if you need FP8 KV + DFlash; in the new image DFlash requires (BF16). Full setup + 4-config bench comparison: .

Variants

Table with columns: Format, Size, Use case
Format	Size	Use case
BF16	51 GB	Full-precision reference weights
NVFP4 (compressed-tensors + DFlash)	26 GB	DGX Spark — DFlash spec decode, validated
Multimodal-NVFP4-MTP	27 GB	RTX PRO 6000 / B100/B200 — MTP, GDN preserved BF16
Text-NVFP4-MTP

What this is

The modelopt-format NVFP4 + MTP variant, text-only (vision tower stripped), with linear_attn projections fully quantized, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).

Specifically:

Body quantized to NVFP4 via nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG. modelopt format, served by vLLM through --quantization modelopt.
Linear-attn / GatedDeltaNet projections quantized to NVFP4 (this is the XS difference). Only linear_attn.conv1d is kept BF16 (modelopt's default). The community has validated this approach on Qwen3.5/3.6-NVFP4 builds with 22K+ downloads on sakamakismile's reference recipes; we re-ran calibration on our abliterated weights and the model serves correctly.
Vision tower stripped (333 visual keys removed, ~0.92 GB). Text-only build — no image / video input. language_model_only: true set in config.json.
MTP head grafted from the base Qwen/Qwen3.6-27B checkpoint (15 tensors, BF16, bit-exact verified). Powers --speculative-config '{"method":"qwen3_5_mtp",...}' for self-speculative decoding without a separate drafter.

Why MTP

Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.

Indicative published numbers (sakamakismile's reference recipe on RTX 5090):

Single-stream short prompts at n=3: ~132 tok/s
Single-stream long-form: ~105 tok/s
2-parallel aggregate (256K + KV FP8): ~189-207 tok/s
Mean acceptance length: ~3.0-4.0 (compared to DFlash chains of ~2.0-2.3)

Validated benchmarks of the AEON-Ultimate XS variant land in the GitHub repo once measured.

🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on memory architecture:

Table with columns: Hardware tier, Recommended variant, Why
Hardware tier	Recommended variant	Why
DGX Spark / GB10 (sm_121a, unified memory)	`-NVFP4` (DFlash) — not any MTP variant	Bench on Spark: DFlash beats MTP-XS by +26 % median, +52 % peak. Don't run MTP on Spark.
RTX PRO 6000 Blackwell (96 GB dedicated VRAM)	Text-NVFP4-MTP — GDN BF16 for best long-context fidelity, or this XS variant for ~10 % faster decode	XS measured 111.4 tok/s median vs regular ~92 tok/s on RTX PRO 6000. Both win against DFlash on dedicated VRAM.
B100 / B200 (sm_100, dedicated FP4)	(preferred — GDN BF16 fits) or this XS

Full bench numbers: GitHub repo Performance section. | A100 / H100 (no native FP4) | BF16 |

Usage

vLLM serve

bash
# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS \
  --local-dir ./aeon-ultimate-text-nvfp4-mtp-xs

# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve ./aeon-ultimate-text-nvfp4-mtp-xs \
&
  --mamba-cache-dtype float32 \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.94 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

num_speculative_tokens=3 is the canonical setting for qwen3_5_mtp. Higher values diverge the drafter further from the target distribution and acceptance falls.

Configuration notes

--quantization modelopt is required (not compressed-tensors — different format).
--speculative-config '{"method":"qwen3_5_mtp", ...}' activates the grafted MTP head as the spec-decode drafter. No external drafter download needed — the head is in the safetensors of this repo.
--gpu-memory-utilization 0.94 is the validated cap on RTX PRO 6000; on RTX 5090's tighter 32 GB you'll want 0.92 and a smaller --max-model-len (try 65536 first).

Quantization recipe

Tool: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
Loader: Qwen3_5ForConditionalGeneration.from_pretrained (multimodal-preserved class — vision stripped post-export)
Calibration: neuralmagic/calibration LLM split, 20 samples × 8192 tokens
Excluded from quantization (kept BF16) — XS variant differences from the regular variant in bold:
- lm_head, proj_out.*, *router*, *mlp.gate.* (NVFP4_DEFAULT_CFG)
- *linear_attn.conv1d*,

Provenance & credits

BF16 source: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16. See that card for the full abliteration pipeline.
MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (docs/MTP_GRAFT_RECIPE.md)
Reference benchmark recipes: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
Quantization: NVIDIA TensorRT Model Optimizer (nvidia-modelopt 0.43.0)
Base: Alibaba Qwen team — Qwen/Qwen3.6-27B

License + responsibility

Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.

☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

AEON-7

Model Tree

Base

AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16

Quantized

this model

Input Modalities

TextImageVideo

Output Modalities

Text

Supported Functionality