AEON-7

Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

🚀 Quickstart — DGX Spark / GB10 (recommended daily-driver)

On DGX Spark, run this body with a DFlash drafter (not native MTP): it lands at parity speed with the smaller -MTP-XS sibling while scoring higher on quality-eval benchmarks — the recommended daily-driver body when you have the VRAM. (Native qwen3_5_mtp decoding stays a dedicated-VRAM-Blackwell path — see the routing table below.)

bash
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# body (this repo) + z-lab DFlash drafter
GIT_LFS_SKIP_SMUDGE=1 git clone \
  https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP /models/mm-mtp
( cd /models/mm-mtp && git lfs pull )
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/z-lab/Qwen3.6-27B-DFlash /models/dflash
( cd /models/dflash && git lfs pull )

# ENTRYPOINT is /bin/bash — pass --entrypoint vllm
docker run -d --name aeon-vllm --gpus all --ipc=host --shm-size=16g --net=host \
    -e VLLM_USE_FLASHINFER_SAMPLER=1 \
    -v /models/mm-mtp:/model:ro -v /models/dflash:/drafter:ro \
    --entrypoint vllm ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
    serve /model --served-model-name aeon \
        --quantization modelopt --kv-cache-dtype fp8_e4m3 \
        --attention-backend TRITON_ATTN \
        --max-model-len 229376 --max-num-seqs 16 --max-num-batched-tokens 32768 \
        --gpu-memory-utilization 0.60 \
        --enable-chunked-prefill --enable-prefix-caching \
        --generation-config vllm \
        --reasoning-parser qwen3 --tool-call-parser qwen3_coder --enable-auto-tool-choice \
        --mm-encoder-tp-mode data \
        --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":12,"attention_backend":"TRITON_ATTN"}' \
        --trust-remote-code

Keep --gpu-memory-utilization ≤ 0.88 on GB10 (unified memory). Use 0.60 when ASR/TTS/embedding sidecars share the Spark; raise toward 0.75–0.85 only when the LLM is the dominant GPU workload. The recommended Spark sidecar profile uses --max-model-len 229376, --max-num-seqs 16, and --max-num-batched-tokens 32768: one near-full-context session can run, while smaller agent sessions still share the pooled FP8 KV budget dynamically.

vLLM 0.24.0 DFlash note: set the attention backend in both places. --attention-backend TRITON_ATTN selects the target-model backend, but vLLM does not inherit that into the speculative drafter; the DFlash JSON must also include "attention_backend":"TRITON_ATTN". Leave --mamba-block-size unset and let vLLM derive the page/block geometry for the hybrid GDN stack. Full recipe matrix (NVFP4-KV capacity, TurboQuant, dedicated-VRAM MTP): container README.

Variants

Table with columns: Format, Size, Use case
Format	Size	Use case
BF16	51 GB	Full-precision reference weights (A100/H100 80 GB, RTX PRO 6000 96 GB, multi-GPU, fine-tuning)
NVFP4 (compressed-tensors + DFlash)	26 GB	DGX Spark / GB10 — production validated with DFlash speculative decoding. Patched `vllm-aeon-ultimate-dflash` container.
Multimodal-NVFP4-MTP (this repo)	27 GB

What this is

This is the modelopt-format NVFP4 variant with MTP speculative decoding, multimodal-preserved, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).

Specifically:

Body quantized to NVFP4 via nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG. This is the modelopt compressed-tensors format that vLLM serves through --quantization modelopt (different code path from the -NVFP4 sibling release which uses --quantization compressed-tensors).
Linear-attn / GatedDeltaNet layers preserved BF16 (432 keys across 48 GDN layers). NVFP4 quantization on Mamba/SSM state collapses the recurrence; modelopt's *linear_attn.conv1d* ignore plus our explicit *linear_attn* exclude keeps these intact.
Vision tower preserved BF16 (333 keys). Multimodal inference fully functional.
MTP head grafted from the base Qwen/Qwen3.6-27B checkpoint (15 tensors, BF16). The base contains MTP heads but Qwen3_5ForConditionalGeneration.from_pretrained drops them during loading; the lna-lab pipeline pattern (which this build follows) explicitly grafts them back into the quantized output, giving vLLM a working drafter for .

Why MTP — and where it actually wins

Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.

Measured numbers on AEON-Ultimate (this exact variant)

Table with columns: Hardware, Median tok/s, Peak tok/s, Spec-decode acceptance
Hardware	Median tok/s	Peak tok/s	Spec-decode acceptance
RTX PRO 6000 Blackwell (96 GB dedicated VRAM)	~92 (this variant) / 111.4 (XS sibling)	124.7 (XS sibling)	67.7 % regular / 69.2 % XS
DGX Spark / GB10 (unified memory) — MTP method	24.1 (XS sibling)	27.5	66.3 %
DGX Spark / GB10 — DFlash method on this body 🏆	38.5 tok/s thinking-on / 38.1 thinking-off	71.3 tok/s thinking-on / 68.4 off	DFlash v2
RTX 5090, B100 / B200

Reference numbers from sakamakismile's un-abliterated recipe (RTX 5090)

Single-stream short prompts at n=3: ~132 tok/s
Single-stream long-form: ~105 tok/s
2-parallel aggregate (256K + KV FP8): ~189–207 tok/s
Mean MTP acceptance length: ~3.0–4.0 (vs DFlash chains ~2.0–2.3)

The hardware-routing punchline

On RTX PRO 6000 the XS sibling beats DFlash territory (~111 tok/s vs DFlash-class ~85 we'd expect there). On DGX Spark, DFlash beats MTP by 26 % median / 52 % peak — the unified-memory bandwidth caps how much MTP's high acceptance can translate to throughput. So: MTP is a dedicated-VRAM-Blackwell variant, not a universal upgrade. Full bench data: GitHub repo Performance section.

🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on memory architecture:

Table with columns: Hardware tier, Recommended variant, Why
Hardware tier	Recommended variant	Why
DGX Spark / GB10 (sm_121a, unified memory)	This body — with a DFlash drafter ✅ (recommended daily-driver)	Run this body + z-lab DFlash drafter (see Quickstart above): parity speed with the XS sibling, higher quality-eval scores. Use DFlash, not native `qwen3_5_mtp`, on Spark — DFlash beats the MTP method by +26 % median / +52 % peak here (unified-memory bandwidth doesn't reward MTP's high acceptance).
RTX PRO 6000 Blackwell (sm_120, 96 GB dedicated VRAM)	This variant (Multimodal-NVFP4-MTP) ✅ if you need vision; Text if text-only	MTP wins on dedicated VRAM. ~92 tok/s median measured with GDN BF16; dedicated-VRAM bandwidth lets the MTP head's high acceptance rate translate to throughput.
RTX 5090 (sm_120, 32 GB dedicated VRAM)	if you use vision; if text-only

Full bench numbers: GitHub repo Performance section.

Usage

vLLM serve

bash
# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
  --local-dir ./aeon-ultimate-multimodal-nvfp4-mtp

# Serve
export VLLM_USE_FLASHINFER_SAMPLER=1

# v0.24.0 removed VLLM_NVFP4_GEMM_BACKEND / VLLM_USE_FLASHINFER_MOE_* — use the KernelConfig flags
vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp \
  --quantization modelopt \
  --linear-backend flashinfer_cutlass --moe-backend cutlass \
  --mamba-cache-dtype float32 \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.94 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

num_speculative_tokens=3 is the canonical setting for qwen3_5_mtp. Higher values diverge the drafter further from the target distribution and acceptance falls.

Configuration notes

--quantization modelopt is required (not compressed-tensors — different format).
--speculative-config '{"method":"qwen3_5_mtp", ...}' activates the grafted MTP head as the spec-decode drafter. No external drafter download needed — the head is in the safetensors of this repo.
--gpu-memory-utilization 0.94 is the validated cap on RTX PRO 6000; 0.95 causes the FlashInfer NVFP4 GEMM autotuner to OOM on first boot. See the GitHub repo's RTX PRO 6000 page for the same OOM behavior under DFlash.

Quantization recipe

Tool: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
Loader: Qwen3_5ForConditionalGeneration.from_pretrained (multimodal-preserved class)
Calibration: neuralmagic/calibration LLM split, 20 samples × 8192 tokens
Excluded from quantization (kept BF16):
- lm_head, proj_out.*, *router*, *mlp.gate.* (NVFP4_DEFAULT_CFG)
- *linear_attn.conv1d*, *mixer.conv1d* (NVFP4_DEFAULT_CFG)

Provenance & credits

BF16 source: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16. See that card for the full abliteration pipeline.
MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (docs/MTP_GRAFT_RECIPE.md)
Reference benchmark recipes: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
Quantization: NVIDIA TensorRT Model Optimizer (nvidia-modelopt 0.43.0)
Base: Alibaba Qwen team — Qwen/Qwen3.6-27B

License + responsibility

Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.

☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

Model provider

AEON-7

Model tree

Base

AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

🚀 Quickstart — DGX Spark / GB10 (recommended daily-driver)

bash
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# body (this repo) + z-lab DFlash drafter
GIT_LFS_SKIP_SMUDGE=1 git clone \
  https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP /models/mm-mtp
( cd /models/mm-mtp && git lfs pull )
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/z-lab/Qwen3.6-27B-DFlash /models/dflash
( cd /models/dflash && git lfs pull )

# ENTRYPOINT is /bin/bash — pass --entrypoint vllm
docker run -d --name aeon-vllm --gpus all --ipc=host --shm-size=16g --net=host \
    -e VLLM_USE_FLASHINFER_SAMPLER=1 \
    -v /models/mm-mtp:/model:ro -v /models/dflash:/drafter:ro \
    --entrypoint vllm ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
    serve /model --served-model-name aeon \
        --quantization modelopt --kv-cache-dtype fp8_e4m3 \
        --attention-backend TRITON_ATTN \
        --max-model-len 229376 --max-num-seqs 16 --max-num-batched-tokens 32768 \
        --gpu-memory-utilization 0.60 \
        --enable-chunked-prefill --enable-prefix-caching \
        --generation-config vllm \
        --reasoning-parser qwen3 --tool-call-parser qwen3_coder --enable-auto-tool-choice \
        --mm-encoder-tp-mode data \
        --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":12,"attention_backend":"TRITON_ATTN"}' \
        --trust-remote-code

Keep --gpu-memory-utilization ≤ 0.88 on GB10 (unified memory). Use 0.60 when ASR/TTS/embedding sidecars share the Spark; raise toward 0.75–0.85 only when the LLM is the dominant GPU workload. The recommended Spark sidecar profile uses --max-model-len 229376, --max-num-seqs 16, and --max-num-batched-tokens 32768: one near-full-context session can run, while smaller agent sessions still share the pooled FP8 KV budget dynamically.

vLLM 0.24.0 DFlash note: set the attention backend in both places. --attention-backend TRITON_ATTN selects the target-model backend, but vLLM does not inherit that into the speculative drafter; the DFlash JSON must also include "attention_backend":"TRITON_ATTN". Leave --mamba-block-size unset and let vLLM derive the page/block geometry for the hybrid GDN stack. Full recipe matrix (NVFP4-KV capacity, TurboQuant, dedicated-VRAM MTP): container README.

Variants

Table with columns: Format, Size, Use case
Format	Size	Use case
BF16	51 GB	Full-precision reference weights (A100/H100 80 GB, RTX PRO 6000 96 GB, multi-GPU, fine-tuning)
NVFP4 (compressed-tensors + DFlash)	26 GB	DGX Spark / GB10 — production validated with DFlash speculative decoding. Patched `vllm-aeon-ultimate-dflash` container.
Multimodal-NVFP4-MTP (this repo)	27 GB

What this is

Specifically:

Body quantized to NVFP4 via nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG. This is the modelopt compressed-tensors format that vLLM serves through --quantization modelopt (different code path from the -NVFP4 sibling release which uses --quantization compressed-tensors).
Linear-attn / GatedDeltaNet layers preserved BF16 (432 keys across 48 GDN layers). NVFP4 quantization on Mamba/SSM state collapses the recurrence; modelopt's *linear_attn.conv1d* ignore plus our explicit *linear_attn* exclude keeps these intact.
Vision tower preserved BF16 (333 keys). Multimodal inference fully functional.
MTP head grafted from the base Qwen/Qwen3.6-27B checkpoint (15 tensors, BF16). The base contains MTP heads but Qwen3_5ForConditionalGeneration.from_pretrained drops them during loading; the lna-lab pipeline pattern (which this build follows) explicitly grafts them back into the quantized output, giving vLLM a working drafter for .

Why MTP — and where it actually wins

Measured numbers on AEON-Ultimate (this exact variant)

Table with columns: Hardware, Median tok/s, Peak tok/s, Spec-decode acceptance
Hardware	Median tok/s	Peak tok/s	Spec-decode acceptance
RTX PRO 6000 Blackwell (96 GB dedicated VRAM)	~92 (this variant) / 111.4 (XS sibling)	124.7 (XS sibling)	67.7 % regular / 69.2 % XS
DGX Spark / GB10 (unified memory) — MTP method	24.1 (XS sibling)	27.5	66.3 %
DGX Spark / GB10 — DFlash method on this body 🏆	38.5 tok/s thinking-on / 38.1 thinking-off	71.3 tok/s thinking-on / 68.4 off	DFlash v2
RTX 5090, B100 / B200

Reference numbers from sakamakismile's un-abliterated recipe (RTX 5090)

Single-stream short prompts at n=3: ~132 tok/s
Single-stream long-form: ~105 tok/s
2-parallel aggregate (256K + KV FP8): ~189–207 tok/s
Mean MTP acceptance length: ~3.0–4.0 (vs DFlash chains ~2.0–2.3)

The hardware-routing punchline

🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on memory architecture:

Table with columns: Hardware tier, Recommended variant, Why
Hardware tier	Recommended variant	Why
DGX Spark / GB10 (sm_121a, unified memory)	This body — with a DFlash drafter ✅ (recommended daily-driver)	Run this body + z-lab DFlash drafter (see Quickstart above): parity speed with the XS sibling, higher quality-eval scores. Use DFlash, not native `qwen3_5_mtp`, on Spark — DFlash beats the MTP method by +26 % median / +52 % peak here (unified-memory bandwidth doesn't reward MTP's high acceptance).
RTX PRO 6000 Blackwell (sm_120, 96 GB dedicated VRAM)	This variant (Multimodal-NVFP4-MTP) ✅ if you need vision; Text if text-only	MTP wins on dedicated VRAM. ~92 tok/s median measured with GDN BF16; dedicated-VRAM bandwidth lets the MTP head's high acceptance rate translate to throughput.
RTX 5090 (sm_120, 32 GB dedicated VRAM)	if you use vision; if text-only

Full bench numbers: GitHub repo Performance section.

Usage

vLLM serve

bash
# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
  --local-dir ./aeon-ultimate-multimodal-nvfp4-mtp

# Serve
export VLLM_USE_FLASHINFER_SAMPLER=1

# v0.24.0 removed VLLM_NVFP4_GEMM_BACKEND / VLLM_USE_FLASHINFER_MOE_* — use the KernelConfig flags
vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp \
  --quantization modelopt \
  --linear-backend flashinfer_cutlass --moe-backend cutlass \
  --mamba-cache-dtype float32 \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.94 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

num_speculative_tokens=3 is the canonical setting for qwen3_5_mtp. Higher values diverge the drafter further from the target distribution and acceptance falls.

Configuration notes

--quantization modelopt is required (not compressed-tensors — different format).
--speculative-config '{"method":"qwen3_5_mtp", ...}' activates the grafted MTP head as the spec-decode drafter. No external drafter download needed — the head is in the safetensors of this repo.
--gpu-memory-utilization 0.94 is the validated cap on RTX PRO 6000; 0.95 causes the FlashInfer NVFP4 GEMM autotuner to OOM on first boot. See the GitHub repo's RTX PRO 6000 page for the same OOM behavior under DFlash.

Quantization recipe

Tool: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
Loader: Qwen3_5ForConditionalGeneration.from_pretrained (multimodal-preserved class)
Calibration: neuralmagic/calibration LLM split, 20 samples × 8192 tokens
Excluded from quantization (kept BF16):
- lm_head, proj_out.*, *router*, *mlp.gate.* (NVFP4_DEFAULT_CFG)
- *linear_attn.conv1d*, *mixer.conv1d* (NVFP4_DEFAULT_CFG)

Provenance & credits

BF16 source: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16. See that card for the full abliteration pipeline.
MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (docs/MTP_GRAFT_RECIPE.md)
Reference benchmark recipes: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
Quantization: NVIDIA TensorRT Model Optimizer (nvidia-modelopt 0.43.0)
Base: Alibaba Qwen team — Qwen/Qwen3.6-27B

License + responsibility

☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.