AEON-7

Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP

README

License: apache-2.0

Variants

Table with columns: Format, Size, Use case
Format	Size	Use case
BF16	51 GB	Full-precision reference weights (A100/H100 80 GB, RTX PRO 6000 96 GB, multi-GPU, fine-tuning)
NVFP4 (compressed-tensors + DFlash)	26 GB	DGX Spark / GB10 — production validated with DFlash speculative decoding. Patched `vllm-aeon-ultimate-dflash` container.
Multimodal-NVFP4-MTP	27 GB	High-bandwidth dedicated GPUs (RTX 5090, RTX PRO 6000, B100/B200) with MTP speculative decoding via the model's native `mtp.*` head. modelopt format, `--quantization modelopt`. Vision tower preserved.
Text-NVFP4-MTP (this repo)	20 GB	Same recipe but with vision tower stripped. Smaller footprint for text-only deployments on tighter VRAM (RTX 5090 32 GB fits comfortably).

What this is

This is the modelopt-format NVFP4 variant with MTP speculative decoding, text-only (vision tower stripped), of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).

Specifically:

Body quantized to NVFP4 via nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG. This is the modelopt compressed-tensors format that vLLM serves through --quantization modelopt (different code path from the -NVFP4 sibling release which uses --quantization compressed-tensors).
Linear-attn / GatedDeltaNet layers preserved BF16 (432 keys across 48 GDN layers). NVFP4 quantization on Mamba/SSM state collapses the recurrence; modelopt's *linear_attn.conv1d* ignore plus our explicit *linear_attn* exclude keeps these intact.
Vision tower stripped (333 visual keys removed, ~0.92 GB). Text-only build — no image / video input. language_model_only: true set in config.json.
checkpoint (15 tensors, BF16). The base contains MTP heads but drops them during loading; the lna-lab pipeline pattern (which this build follows) explicitly grafts them back into the quantized output, giving vLLM a working drafter for .

Why MTP — and where it actually wins

Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.

Measured numbers on AEON-Ultimate (this MTP family)

Table with columns: Hardware, Median tok/s, Peak tok/s, Spec-decode acceptance
Hardware	Median tok/s	Peak tok/s	Spec-decode acceptance
RTX PRO 6000 Blackwell (96 GB dedicated VRAM)	~92 (regular) / 111.4 (XS sibling)	124.7 (XS sibling)	67.7 % regular / 69.2 % XS
DGX Spark / GB10 (unified memory) — MTP method	24.1 (XS sibling)	27.5	66.3 %
DGX Spark / GB10 — DFlash on the same XS body 🏆	38.5 tok/s thinking-on / 38.1 off	71.3 tok/s thinking-on / 68.4 off	DFlash v2
RTX 5090, B100 / B200

Reference numbers from sakamakismile's un-abliterated recipe (RTX 5090)

Single-stream short prompts at n=3: ~132 tok/s
Single-stream long-form: ~105 tok/s
2-parallel aggregate (256K + KV FP8): ~189–207 tok/s
Mean MTP acceptance length: ~3.0–4.0 (vs DFlash chains ~2.0–2.3)

The hardware-routing punchline

On RTX PRO 6000 the XS sibling beats DFlash territory (~111 tok/s vs DFlash-class ~85 we'd expect there). On DGX Spark, DFlash beats MTP by 26 % median / 52 % peak — the unified-memory bandwidth caps how much MTP's high acceptance can translate to throughput. So: MTP is a dedicated-VRAM-Blackwell variant, not a universal upgrade. Full bench data: GitHub repo Performance section.

🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on memory architecture:

Table with columns: Hardware tier, Recommended variant, Why
Hardware tier	Recommended variant	Why
DGX Spark / GB10 (sm_121a, unified memory)	`-NVFP4` (DFlash) — not this MTP variant	Bench on Spark: DFlash beats MTP by +26 % median, +52 % peak. Spark's unified-memory bandwidth doesn't reward MTP's high acceptance rate. Don't run MTP on Spark.
RTX PRO 6000 Blackwell (sm_120, 96 GB dedicated VRAM)	This variant ✅ if text-only; Multimodal if you need vision	MTP wins on dedicated VRAM. ~92 tok/s median measured (multimodal sibling, GDN BF16).
RTX 5090 (sm_120, 32 GB dedicated VRAM)	is the better fit (~20 GB), or this variant if you have headroom

Full bench numbers: GitHub repo Performance section.

Usage

vLLM serve

bash
# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP \
  --local-dir ./aeon-ultimate-text-nvfp4-mtp

# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve ./aeon-ultimate-text-nvfp4-mtp \
&
  --mamba-cache-dtype float32 \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.94 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

num_speculative_tokens=3 is the canonical setting for qwen3_5_mtp. Higher values diverge the drafter further from the target distribution and acceptance falls.

Configuration notes

--quantization modelopt is required (not compressed-tensors — different format).
--speculative-config '{"method":"qwen3_5_mtp", ...}' activates the grafted MTP head as the spec-decode drafter. No external drafter download needed — the head is in the safetensors of this repo.
--gpu-memory-utilization 0.94 is the validated cap on RTX PRO 6000; 0.95 causes the FlashInfer NVFP4 GEMM autotuner to OOM on first boot. See the GitHub repo's RTX PRO 6000 page for the same OOM behavior under DFlash.

Quantization recipe

Tool: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
Loader: Qwen3_5ForConditionalGeneration.from_pretrained (multimodal-preserved class)
Calibration: neuralmagic/calibration LLM split, 20 samples × 8192 tokens
Excluded from quantization (kept BF16):
- lm_head, proj_out.*, *router*, *mlp.gate.* (NVFP4_DEFAULT_CFG)
- *linear_attn.conv1d*, *mixer.conv1d* (NVFP4_DEFAULT_CFG)

Provenance & credits

BF16 source: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16. See that card for the full abliteration pipeline.
MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (docs/MTP_GRAFT_RECIPE.md)
Reference benchmark recipes: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
Quantization: NVIDIA TensorRT Model Optimizer (nvidia-modelopt 0.43.0)
Base: Alibaba Qwen team — Qwen/Qwen3.6-27B

License + responsibility

Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.

☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

AEON-7

Model Tree

Base

AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16

Quantized

this model

Input Modalities

TextImageVideo

Output Modalities

Text

Supported Functionality