AEON-7

Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4

README

License: apache-2.0

🚀 Quickstart (DGX Spark / GB10 · DFlash · BF16 KV)

One copy-paste block: pull the canonical container, this model, and the DFlash drafter (pull FRESH), then serve with the vetted DGX Spark flags. The image ENTRYPOINT is /bin/bash, so docker run uses --entrypoint vllm.

```bash
# 1) Pull the canonical AEON vLLM Ultimate container
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2) Pull THIS model (compressed-tensors body)
huggingface-cli download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4 --local-dir ./aeon-model

# 3) Pull the DFlash drafter, FRESH (do not reuse a stale copy)
huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir ./aeon-drafter

# 4) Serve (compressed-tensors body, BF16 KV cache, default drafter backend)
#    Bind mounts use absolute paths ("$PWD/...") so they also work on Docker
#    versions that reject relative host paths.
docker run --gpus all --ipc=host --network=host \
  -e TORCH_CUDA_ARCH_LIST="12.0+PTX" \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -v "$PWD/aeon-model":/model:ro \
  -v "$PWD/aeon-drafter":/drafter:ro \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
    --served-model-name aeon-ultimate \
    --host 0.0.0.0 --port 8000 \
    --quantization compressed-tensors \
    --mamba-cache-dtype float16 \
    --mamba-block-size 256 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice \
    --max-num-seqs 16 \
    --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.85 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --trust-remote-code \
    --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":12}'
```

DFlash needs BF16 KV — do not add --kv-cache-dtype, and do not set the drafter attention_backend (the default is correct for Qwen3.6 on this image). Keep --gpu-memory-utilization ≤ 0.88 on DGX Spark (unified memory thrashes above that). For the full flag reference (context length, batching, multimodal cache), plain-decode (no-DFlash) variant, and hardware-tuned compose configs, see Deployment below.
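Once the server is up, it can be queried over vLLM's OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at localhost:8000 and was started with --served-model-name aeon-ultimate as in the Quickstart; the helper names build_chat_request and chat are illustrative and not part of any AEON tooling.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: matches --host/--port above
MODEL_NAME = "aeon-ultimate"           # matches --served-model-name above


def build_chat_request(prompt, max_tokens=256):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120):
    """Send one chat turn and return the assistant's text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling chat("Say hello in one sentence.") against a running server should return the reply text; the first request after boot may be slow while the server warms up.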

Variants

| Format | HuggingFace repo | Disk | Quant tool | Spec decode | Hardware target | When to pick this |
| --- | --- | --- | --- | --- | --- | --- |
| NVFP4 (this repo) | …-NVFP4 | 26 GB | llm-compressor | DFlash n=12 | DGX Spark (GB10 / sm_121a) | Production-validated for DGX Spark with the canonical aeon-vllm-ultimate:latest container. |
| Multimodal-NVFP4-MTP | …-Multimodal-NVFP4-MTP | 27 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX PRO 6000 Blackwell · B100/B200 | MTP via the model's native mtp.* head (grafted bf16 from base). modelopt format, --quantization modelopt. Vision tower preserved. GDN linear-attention preserved BF16 for best long-context fidelity. |
| Text-NVFP4-MTP | …-Text-NVFP4-MTP | 26 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX PRO 6000 · text-only | Same recipe as the Multimodal MTP sibling but with vision tower stripped. GDN preserved BF16. |
| Multimodal-NVFP4-MTP-XS | …-Multimodal-NVFP4-MTP-XS | 21 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX 5090 · tighter dedicated VRAM | Strategic split: GDN projection matmuls (in_proj_qkv/z/a/b, out_proj) → NVFP4; linear_attn.conv1d kept BF16 to preserve the recurrence-critical SSM convolution. Vision tower preserved. |
| Text-NVFP4-MTP-XS | …-Text-NVFP4-MTP-XS | 20 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX 5090 text-only · 24 GB cards | Same conv1d-preserved strategic split as Multimodal-XS, vision tower stripped. The smallest variant we ship. |
| BF16 | …-BF16 | 51 GB | | | A100 / H100 80 GB · multi-GPU | Full-precision reference weights. Ampere / Hopper / pre-Blackwell hardware, fine-tuning, or quant-recipe development. |

🎯 Hardware routing — measured, not theoretical

Pick by memory architecture, not just GPU model:

| Hardware class | Use this | Why |
| --- | --- | --- |
| DGX Spark / GB10 (unified memory, sm_121a) | this -NVFP4 (DFlash) repo ✅, or the modelopt -Multimodal-NVFP4-MTP-XS body served with DFlash (the benchmarked Spark path; see its card) | Bench on Spark: DFlash beats the MTP self-spec method on this body (see the AEON DFlash-vs-MTP routing finding); the modelopt -Multimodal-NVFP4-MTP-XS body + DFlash is the benchmarked Spark path. Don't run MTP-method on Spark. |
| RTX PRO 6000 / RTX 5090 / B100 / B200 (dedicated VRAM, sm_120/sm_100) | -NVFP4-MTP or -NVFP4-MTP-XS | MTP wins on dedicated VRAM. RTX PRO 6000 measured: XS hits 111.4 tok/s median with 69 % MTP acceptance, which beats no-spec by ~10 %. |
| A100 / H100 (no native FP4) | -BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper, so there is no benefit. |

Full bench numbers: GitHub repo Performance section.
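The routing table can be expressed as a small lookup on CUDA compute capability. This is only an illustrative sketch: the function name pick_variant is hypothetical, the mapping simply restates the table above (sm_121 for GB10, sm_120 and sm_100 for dedicated-VRAM Blackwell, anything older for BF16), and architectures not listed in the table are rejected rather than guessed.

```python
def pick_variant(major, minor):
    """Map a CUDA compute capability to the variant suffix recommended above."""
    capability = (major, minor)
    if capability == (12, 1):
        # DGX Spark / GB10: unified memory, DFlash path.
        return "-NVFP4 (DFlash)"
    if capability in {(12, 0), (10, 0)}:
        # RTX PRO 6000 / RTX 5090 (sm_120) and B100 / B200 (sm_100).
        return "-NVFP4-MTP or -NVFP4-MTP-XS"
    if major < 10:
        # Ampere / Hopper and older: no native FP4, use full precision.
        return "-BF16"
    raise ValueError(f"sm_{major}{minor} is not covered by the routing table")
```

On a machine with PyTorch installed, the capability tuple can be obtained with torch.cuda.get_device_capability() and passed straight in, for example pick_variant(*torch.cuda.get_device_capability()).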

Regular MTP vs XS — strategic quantization, not a precision compromise

The GatedDeltaNet linear_attn.* block has two distinct components: the heavy projection matmuls (in_proj_qkv, in_proj_z, in_proj_a/b, out_proj — ~11 GB total) and the SSM 1D convolution kernel (linear_attn.conv1d — small, but recurrence-critical).

  • Regular MTP variants keep both at BF16. Maximum numerical safety margin, larger footprint.
  • XS variants quantize the projection matmuls to NVFP4 (saves ~6 GB; FP4 is a clean win on bandwidth-bound matmuls) but explicitly preserve linear_attn.conv1d at BF16. FP4 quantization of conv1d has been observed to cause drift on long-context recurrence in community testing, so we keep it at BF16 — the same principle modelopt's NVFP4_DEFAULT_CFG applies by default and the same recipe sakamakismile validated across his Qwen3.6-NVFP4-MTP series (22K+ downloads). This is not "everything to FP4" — that would be a different (and not-recommended) variant we have explicitly chosen not to ship.

🚀 DGX Spark: XS body + DFlash spec is the highest-throughput config

If you want maximum DGX Spark throughput, the highest-measured configuration is:

  • Model body: -Multimodal-NVFP4-MTP-XS (modelopt format)
  • Spec method: DFlash n=12 via z-lab/Qwen3.6-27B-DFlash, not the MTP head that ships with the XS variant
  • Container: the canonical aeon-vllm-ultimate:latest, run with --entrypoint vllm
  • Same Spark settings (--max-num-seqs 16, --gpu-memory-utilization 0.85, --max-model-len 200000)
  • vLLM args: --quantization modelopt --speculative-config '{"method":"dflash","model":"/path/to/dflash-drafter","num_speculative_tokens":12}' (drafter backend = default; do not set --kv-cache-dtype)
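Hand-quoting the JSON passed to --speculative-config is an easy place to make a shell-escaping mistake. As a hedged sketch, the argument list for this configuration could be assembled in Python so that the JSON is generated rather than typed. The function name dflash_serve_args and the /model and /drafter paths are placeholders; the flags and values are the ones listed in the bullets above.

```python
import json
import shlex


def dflash_serve_args(model_path, drafter_path, num_speculative_tokens=12):
    """Return the `vllm` arguments for the XS body served with DFlash."""
    spec = {
        "method": "dflash",
        "model": drafter_path,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "serve", model_path,
        "--quantization", "modelopt",
        "--max-num-seqs", "16",
        "--gpu-memory-utilization", "0.85",
        "--max-model-len", "200000",
        # No --kv-cache-dtype here: DFlash needs the default BF16 KV cache.
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
    ]


if __name__ == "__main__":
    # Print a shell-safe command line that can be pasted after `--entrypoint vllm IMAGE`.
    print(shlex.join(dflash_serve_args("/model", "/drafter")))
```

shlex.join takes care of quoting the JSON argument, so the printed line can be appended to the docker run command unchanged.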

The measured DGX Spark DFlash benchmark on aeon-vllm-ultimate:latest (long-context acceptance, per-category short-context acceptance) lives on the -Multimodal-NVFP4-MTP-XS card — that is the benchmarked body; those numbers are specific to it and do not transfer to other bodies. This -NVFP4 (compressed-tensors) repo + DFlash remains the simpler, validated path; the XS+DFlash combo is the higher-throughput path once you've been through one boot to populate the autotuner cache.

This repo is the production deployment format for Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 on Blackwell-class hardware: the same model, the same 0/100 refusal rate, and the same preserved-and-enhanced capabilities as the BF16 source, compressed from 51 GB BF16 to 26 GB NVFP4 for native FP4 tensor-core throughput on DGX Spark (GB10 / sm_121a), B100 / B200, and RTX PRO 6000 Blackwell.


Performance — DGX Spark DFlash (v0.23.0 build)

These figures were measured on the current production image ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703) — vLLM 0.23.0 built from source for GB10 / sm_121a with the AEON DFlash stack (z-lab/Qwen3.6-27B-DFlash, num_speculative_tokens: 12). They were captured on the benchmarked DGX Spark body (the modelopt -Multimodal-NVFP4-MTP-XS served with DFlash — the highest-throughput Spark config); this compressed-tensors -NVFP4 body uses the same canonical container and the same DFlash recipe, so the path and the fixes apply identically.

Current build — single-stream (c=1), by category

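| Category | 🟢 Decode tok/s | TTFT p50 | TPOT p50 | Prefill (PP) | DFlash accept |
| --- | --- | --- | --- | --- | --- |
| Coding | 41.8 | 140 ms | 23.9 ms | 322 tok/s | 34% |
| Math | 47.3 | 244 ms | 21.1 ms | 229 tok/s | 42% |
| Reasoning | 56.1 | 234 ms | 17.8 ms | 183 tok/s | 50% |
| Prose | 34.1 | 146 ms | 29.4 ms | 220 tok/s | 27% |
| Natural language | 38.3 | 137 ms | 26.1 ms | 248 tok/s | 31% |
| Extraction / JSON | 44.2 | 246 ms | 22.6 ms | 195 tok/s | 37% |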
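The decode and TPOT columns describe the same quantity from two directions: decode tok/s is approximately 1000 divided by TPOT in milliseconds. A short sketch of that relationship, plus a rough single-stream latency estimate (TTFT for the first token, then one TPOT per further token), is given here. The numbers are copied from the table; the function names are illustrative, and the latency formula is a common approximation rather than something the benchmark itself reports.

```python
# category: (decode tok/s, TTFT p50 in ms, TPOT p50 in ms), copied from the table
ROWS = {
    "Coding": (41.8, 140, 23.9),
    "Math": (47.3, 244, 21.1),
    "Reasoning": (56.1, 234, 17.8),
    "Prose": (34.1, 146, 29.4),
    "Natural language": (38.3, 137, 26.1),
    "Extraction / JSON": (44.2, 246, 22.6),
}


def decode_rate_from_tpot(tpot_ms):
    """Tokens per second implied by a time-per-output-token in milliseconds."""
    return 1000.0 / tpot_ms


def estimated_latency_ms(ttft_ms, tpot_ms, n_output_tokens):
    """Approximate wall time for one response at c=1."""
    return ttft_ms + (n_output_tokens - 1) * tpot_ms


if __name__ == "__main__":
    for name, (rate, ttft, tpot) in ROWS.items():
        implied = decode_rate_from_tpot(tpot)
        print(f"{name}: reported {rate} tok/s, implied by TPOT {implied:.1f} tok/s")
```

For capacity planning, estimated_latency_ms(ttft, tpot, n) gives a first-order figure for an n-token reply in a given category; real latency also depends on prompt length and concurrency.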

The current build scales cleanly to c=64 concurrent requests with no crash; the pre-fix image crashed under concurrent speculative decoding at c≥32 (see What we fixed for the DGX Spark). Aggregate throughput climbs from c=1 to c=64 in every category (Reasoning peaks at ~344 tok/s aggregate at c=64).

Stock baseline note: a fully-vanilla stock vLLM throughput baseline for this body is pending — it has not yet been re-benchmarked on the current version. The DFlash figures above are the optimized aeon-vllm-ultimate:latest (vLLM 0.23.0) build. Any prior stock comparison numbers quoted in the AEON line are from vanilla vLLM (default settings, no DFlash / sm_121a optimizations) and are provisional, pending a fresh vanilla re-benchmark.

Long-context draft acceptance holds

DFlash draft acceptance stays healthy as agent histories grow — the payoff of the sliding-window-attention patch (PR #40898). Measured on the same build, single-stream (c=1):

| Context (measured prompt tokens) | Decode tok/s (p50) | DFlash accept |
| --- | --- | --- |
| short (~256 tok) | 34–56 (by category) | 27–50% |
| ~20k tokens (16k tier) | 33–52 | 41–50% |
| ~41k tokens (32k tier) | 28–39 | 27–43% |

Acceptance does not collapse past the 2k-token sliding window — it holds in the 40–50% band out to ~20k and stays usable at ~41k, exactly the behavior the long-context DFlash + SWA fixes were built to deliver.

The BF16 source is itself the product of 72 hours of continuous research drawing on hundreds of parallel AI research agents, the industry's best published methodologies, custom in-house techniques, and yet-unreleased pre-public branches of the next-generation abliteration toolchain. See the BF16 model card for the full pipeline narrative and capability data.

What we fixed for the DGX Spark

All AEON models run on one unified container — ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703) — vLLM v0.23.0 built from source for GB10 / sm_121a and merged with the AEON speculative-decoding stack, tuned end-to-end for the GB10's unified-memory Blackwell architecture.

| Fix | What it does | Why it matters on GB10 |
| --- | --- | --- |
| DFlash high-concurrency fix (new) | Slices the speculative drafter's KV block-table to the unpadded batch (block_table[:num_reqs]) | The drafter previously crashed at ≥32 concurrent requests (padded-vs-unpadded block-table shape mismatch in FlashAttention). Now scales cleanly to c=64. This is a port of upstream PR #43982, which fixed the issue for MTP but was never applied to DFlash; the bug was present and unfixed even in the prior image. |
| Triton NVFP4 KV cache (PR #44389) | Software NVFP4 KV-cache path | The only 4-bit KV path on sm_121a (upstream's is hard-gated to B200) → ~3× KV capacity / longer context per GB of unified memory. |
| DFlash sliding-window attention (PR #40898) | Runs the drafter's SWA layers as true sliding-window | Long-context draft acceptance holds as agent histories grow (40–50% out to ~20k tokens) instead of collapsing past ~2k tokens. |
| sm_121a-native build | TORCH_CUDA_ARCH_LIST=12.1a, ENABLE_NVFP4_SM100=0 | Compiles the SM120-family CUTLASS NVFP4/FP8 kernels GB10 actually dispatches to: true 4-bit tensor-core throughput, no dead B200-only kernels. |
| sm_121a boot + CUDA-graph patches | RTLD-lazy _C_stable_libtorch load; spec-decode CUDA-graph capture-size alignment | Boots past MXFP4 (SM100-only) symbols absent on GB10; prevents cudaErrorIllegalAddress on partial-acceptance decode steps under speculative decoding. |
| Unified-memory tuning | Conservative --gpu-memory-utilization, FULL CUDA graphs, async scheduling, z-lab DFlash drafter | GB10 shares one LPDDR5X pool across CPU + GPU; conservative KV headroom avoids page-thrash while keeping FULL-graph + speculative-decode throughput. |
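The first row describes a shape mismatch between a batch that has been padded (for example to a fixed capture size) and the number of requests that actually exist. The toy sketch below uses plain Python lists to show the idea behind block_table[:num_reqs]. It is not the vLLM code: pad_block_table and unpadded_view are hypothetical helpers, and the real block table is a tensor rather than a list of lists.

```python
def pad_block_table(block_table, padded_size, pad_value=-1):
    """Pad a per-request block table with dummy rows up to a fixed batch size."""
    width = len(block_table[0])
    missing = padded_size - len(block_table)
    return block_table + [[pad_value] * width for _ in range(missing)]


def unpadded_view(block_table, num_reqs):
    """Keep only the rows that belong to real requests."""
    return block_table[:num_reqs]


if __name__ == "__main__":
    real = [[10, 11], [20, 21], [30, 31]]  # 3 live requests, 2 KV blocks each
    padded = pad_block_table(real, padded_size=8)
    # A consumer that expects one row per live request must see 3 rows, not 8.
    print(len(padded), len(unpadded_view(padded, num_reqs=len(real))))
```

The point of the slice is that every downstream consumer sees a table whose leading dimension equals the live request count, regardless of how much padding was added upstream.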

The result:

  • Scales to 64 concurrent requests with no crash (the prior image crashed at c≥32 under speculative decoding).
  • Native NVFP4 4-bit compute on Blackwell tensor cores — the speed of 4-bit with near-16-bit accuracy.
  • Speculative decoding (DFlash) holds high draft acceptance from short prompts through long (~20k–41k token) agent histories instead of collapsing past the 2k sliding window.
  • A fully-vanilla stock throughput contrast for this body is pending a fresh vanilla re-benchmark; the headline win for the v0.23.0 unified image is the c=64 concurrency fix plus the long-context acceptance hold, not a single-stream tok/s bump over the prior AEON image.

Why NVFP4 — and Why It's Effectively Lossless

NVFP4 is not a "compressed lite" version. It is the format NVIDIA designed for Blackwell-and-later silicon to be the production deployment format — accuracy on par with BF16, throughput of true 4-bit compute, no compromise required.

The accuracy guarantee comes from a two-level scaling structure that older 4-bit formats (INT4, Q4_0/Q4_K, NF4) do not have:

  • E2M1 element format — 4-bit floating point per weight (sign / 2-bit exponent / 1-bit mantissa).
  • Block size 16 with FP8 E4M3 per-block scales: every 16 weights share an 8-bit floating-point scale, so a heavy-tailed local weight distribution only affects its own block. This is a finer granularity than the 32- to 128-weight groups typical of older 4-bit schemes.
  • FP32 per-tensor scale — global re-scale applied at the kernel boundary so block-level FP8 scales never have to span the full tensor's dynamic range.

The combined effect is that local outliers (the long-tailed weights that destroy older 4-bit formats) are absorbed by the per-block FP8 scale rather than smearing the whole quantization grid. Typical KL divergence vs the BF16 source for recipe-class NVFP4 quantization is ≤ 0.001, which is below the noise floor of stochastic sampling. At that level the difference between this model and its BF16 source is smaller than the variance from changing your random seed, so it should not be observable in normal use.
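To make the outlier argument concrete, here is a simplified simulation of block-scaled E2M1 quantization in pure Python. It is a sketch under stated assumptions, not the production kernel: the per-block scale is kept as an ordinary float (the real format stores it as FP8 E4M3 and adds an FP32 per-tensor scale), and the function names are illustrative. The E2M1 magnitude grid and the block size of 16 are the ones described above.

```python
import math

# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16


def quantize_block(block):
    """Quantize then dequantize one group of weights that shares a single scale."""
    amax = max(abs(w) for w in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest weight to 6.0
    out = []
    for w in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append(math.copysign(nearest * scale, w))
    return out


def fake_quantize(weights):
    """Apply quantize_block independently to each run of 16 weights."""
    assert len(weights) % BLOCK == 0
    out = []
    for start in range(0, len(weights), BLOCK):
        out.extend(quantize_block(weights[start:start + BLOCK]))
    return out


if __name__ == "__main__":
    # Block 0 holds small weights; block 1 holds small weights plus one outlier.
    weights = [0.01 * k for k in range(16)] + [0.01] * 15 + [5.0]
    per_block = fake_quantize(weights)
    one_scale = quantize_block(weights)  # a single scale for all 32 weights
    print("block 0, per-block scales:", per_block[:16])
    print("block 0, one shared scale:", one_scale[:16])
```

With per-block scales the outlier only costs precision inside its own block, while a single shared scale rounds every small weight in the other block to zero. That is the behavior the paragraph above attributes to the per-block scale.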

On native FP4 silicon — Blackwell tcgen05 / UTCQMMA paths, sm_121a CUTLASS on GB10 — this format runs at full FP4 tensor-core throughput. The GPU does not dequantize back to BF16 internally. You get the speed of true 4-bit compute and the accuracy of 16-bit weights at the same time. On older silicon (A100, H100) NVFP4 dequantizes at kernel boundaries — works correctly, but no throughput advantage; for those cards use the BF16 release directly.

This release is multimodal-preserved (vision tower stays BF16: the 333 model.visual.* vision tensors are retained at BF16; text inference validated, image-input runtime validation pending a GPU window) and hybrid-attention-preserved (the 48 linear-attention / GatedDeltaNet layers stay BF16; FP4 applies only to the q/k/v/o projections of the 16 full-attention layers and to all MLPs, where it is well-behaved). The recurrent (Mamba-style) state is sensitive to FP4 quantization error, so in this variant the whole linear_attn block remains in BF16 by design, not by compromise.


What Changed vs BF16

| Aspect | BF16 (source) | NVFP4 (this release) |
| --- | --- | --- |
| Disk size | 51 GB | 26 GB (49% reduction) |
| Refusal rate | 0/100 | 0/100 inherited (KL ≤ 0.001 from source, below sampling noise) |
| Multimodal | preserved | preserved (vision BF16, no degradation) |
| Hybrid SSM | repaired + intact | intact (linear_attn BF16-preserved) |
| Hardware target | A100 / H100 / RTX PRO 6000 BF16 | DGX Spark (GB10), B100/B200, RTX PRO 6000 Blackwell with native FP4 throughput |
| KL vs BF16 source | n/a | expected ≤0.001 (typical for this recipe class) |

The NVFP4 quantization scheme is NVIDIA-mandated: E2M1 element format, block_size=16, FP8 E4M3 per-block scales, FP32 per-tensor scale, symmetric signed.


Quantization Recipe

Tool: llm-compressor 0.10.1.dev107 (vllm-project) using QuantizationModifier(scheme="NVFP4") post-training quantization.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",                 # always
        "re:.*embed_tokens.*",     # always
        "re:.*\\.visual\\..*",     # vision tower BF16: preserves multimodal
        "re:.*visual\\..*",
        "re:.*linear_attn\\..*",   # SSM/GDN BF16: Mamba state collapses under FP4
        "re:.*norm.*",
        "re:.*q_norm.*",
        "re:.*k_norm.*",
    ],
)
```

  • Calibration: open-platypus, 512 samples × 4096 tokens.
  • Pipeline: sequential with sequential_targets=["Qwen3_5DecoderLayer"]. This is required for hybrid stacks (mixed full + linear attention layers); without explicit targeting, llm-compressor's auto-discovery silently skips layers.
  • Loader: AutoModelForImageTextToText, to preserve the Qwen3_5ForConditionalGeneration multimodal class.
  • Processor: passed explicitly to oneshot() to avoid the "model processor required when a dataset is provided" failure on multimodal builds without torchvision.
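To see which modules the ignore list in the recipe keeps at BF16, the patterns can be tried against module names with Python's re module. This is a sketch, not llm-compressor's own matcher: it assumes that entries prefixed with "re:" are regular expressions matched from the start of the name and that other entries are exact names, and the module names used in the demo are hypothetical examples rather than names read from the checkpoint.

```python
import re

IGNORE = [
    "lm_head",
    "re:.*embed_tokens.*",
    "re:.*\\.visual\\..*",
    "re:.*visual\\..*",
    "re:.*linear_attn\\..*",
    "re:.*norm.*",
    "re:.*q_norm.*",
    "re:.*k_norm.*",
]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if the module would be left unquantized by the ignore list."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical module names, for illustration only.
    for name in [
        "model.language_model.layers.0.linear_attn.in_proj_qkv",
        "model.language_model.layers.3.self_attn.q_proj",
        "model.language_model.layers.3.mlp.down_proj",
        "model.visual.blocks.0.attn.qkv",
    ]:
        print(name, "->", "BF16 (ignored)" if is_ignored(name) else "NVFP4")
```

Under these assumptions the generic "re:.*norm.*" entry already covers q_norm and k_norm, so the last two patterns act as explicit documentation rather than adding new matches.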

Verification (pass):

  • 1 shard, 1952 keys
  • 64 quantized full-attention projections (16 layers × 4 q/k/v/o)
  • 432 linear_attn.* keys preserved BF16 (48 layers × 9 modules)
  • 333 visual.* keys preserved BF16 (vision tower intact)
  • 319 norm keys preserved BF16
  • lm_head and embed_tokens preserved BF16
  • NVFP4-packed weights present
  • input_global_scale magnitudes 142–346 (healthy range)

Wall-clock quant time: ~57 minutes on 1× RTX PRO 6000 Blackwell (96 GB).


Deployment

Use the canonical patched image ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703). It bundles the SM121 CUTLASS NVFP4 patches, FlashInfer stable, TurboQuant, and the DFlash drafter integration. The patched CUTLASS path uses native FP4 tensor-core kernels and outperforms the Marlin fallback — do NOT force VLLM_NVFP4_GEMM_BACKEND=marlin (that's the workaround for stock vLLM builds where CUTLASS is broken on SM121).

The image ENTRYPOINT is /bin/bash, so docker run must pass --entrypoint vllm and then serve … (writing IMAGE vllm serve would run bash vllm serve and fail).

The recommended, validated path is DFlash speculative decoding (num_speculative_tokens: 12) via the z-lab/Qwen3.6-27B-DFlash drafter. Copy-paste docker run for the canonical DGX Spark config:

```bash
docker run --gpus all --ipc=host --network=host \
  -e TORCH_CUDA_ARCH_LIST="12.0+PTX" \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -v /path/to/model:/models/aeon-ultimate \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /models/aeon-ultimate \
    --served-model-name aeon-ultimate \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 1 \
    --dtype auto \
    --quantization compressed-tensors \
    --max-model-len 262144 \
    --max-num-seqs 64 \
    --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.85 \
    --mamba-cache-dtype float16 \
    --mamba-block-size 256 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-type shm \
    --speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":12}'
```

The DFlash drafter requires KV cache at BF16 — do not add --kv-cache-dtype, and do not set the drafter attention_backend (the default is correct for Qwen3.6 on this image). For a fully-flagged production setup with hardware-tuned compose configs, see the docker-compose recipe in the deployment repo.

For a minimal manual docker run without DFlash (plain decode):

```bash
docker run --gpus all --ipc=host --network=host \
  -e TORCH_CUDA_ARCH_LIST="12.0+PTX" \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -v /path/to/model:/models/aeon-ultimate \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /models/aeon-ultimate \
    --served-model-name aeon-ultimate \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 1 \
    --dtype auto \
    --quantization compressed-tensors \
    --max-model-len 262144 \
    --max-num-seqs 64 \
    --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.85 \
    --enable-chunked-prefill \
    --no-enable-prefix-caching \
    --load-format safetensors \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --attention-backend flash_attn \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-type shm
```

Key settings (tuned for DGX Spark 128 GB unified memory):

  • --max-num-seqs 64: Conservative for 262K context. Raise to 128 only for short-context workloads. The DGX Spark's 128 GB is unified between CPU and GPU; KV cache for 128 concurrent long-context sequences will exhaust it.
  • --max-num-batched-tokens 16384: Recommended default prefill chunk (chunked prefill still supports the full 256k context). It frees ~3 GiB of load-time activation memory vs 32768, so the model fits tighter cards (e.g. 32 GB RTX 5090) where 32768 can OOM at startup, with negligible throughput cost (validated 2026-06-19). 32768 remains safe on ample-VRAM cards: it matches vLLM's inductor compile-range ceiling (compile_ranges_endpoints: [32768]), and prefill above 32k falls back to eager mode. Raise to it for marginally better long-prefill throughput if you have headroom. The stock vLLM default of 65536 will OOM under concurrent long-context requests on Spark's unified memory.
  • --gpu-memory-utilization 0.85: Caps vLLM (weights, KV cache, activations) at 85 % of memory, leaving 15 % for the OS and CPU-side processes that share the unified pool. Do not push above 0.88 on DGX Spark; at 0.90+ unified memory thrashes.
  • --max-model-len 262144: Full context window. Reduce to 131072 if you need more concurrent sequences.
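The interaction between these settings can be estimated with simple arithmetic. The sketch below uses the figures stated in this card (128 GB unified memory, 0.85 utilization, 26 GB of weights) to get an upper bound on the memory left for KV cache, and the standard per-token KV size formula for the full-attention layers. It is a rough planning aid under assumptions: it ignores activations, CUDA graphs, the drafter, and the fixed-size linear-attention state, and the KV-head count and head dimension must be read from the model config because they are not given here. All function names are illustrative.

```python
def kv_budget_gb(total_gb=128.0, gpu_memory_utilization=0.85, weights_gb=26.0):
    """Upper bound on memory (decimal GB) left for KV cache inside vLLM's budget."""
    return total_gb * gpu_memory_utilization - weights_gb


def kv_bytes_per_token(n_full_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """K and V for one token across all full-attention layers (BF16 = 2 bytes)."""
    return 2 * n_full_attn_layers * n_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(budget_gb, bytes_per_token):
    """How many tokens of KV cache fit in the budget, across all sequences."""
    return int(budget_gb * 1e9 // bytes_per_token)


if __name__ == "__main__":
    print(f"KV budget upper bound: {kv_budget_gb():.1f} GB")
    print(f"At 0.88 utilization:   {kv_budget_gb(gpu_memory_utilization=0.88):.1f} GB")
```

Dividing max_cached_tokens by the expected tokens per sequence gives a first-order feel for how --max-num-seqs and --max-model-len trade against each other on a fixed memory pool.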

Python (transformers) — for testing or non-vLLM serving

```python
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

model_id = "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype=torch.bfloat16,  # vision tower + non-quantized weights
    device_map="cuda:0",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens (skip the prompt).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Requires compressed-tensors >= 0.12 for NVFP4 dequant on the fly.
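A quick way to confirm the version requirement before loading the model is to query the installed package metadata. This is a minimal sketch using only the standard library; it assumes the distribution is installed under the name compressed-tensors, compares numeric release components only (pre-release tags such as rc1 or dev5 are ignored), and the helper names are illustrative.

```python
from importlib import metadata


def parse_release(version):
    """Turn '0.12.2' or '0.13.0.dev5' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum="0.12"):
    """True if the installed release is at least the minimum release."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def compressed_tensors_ok():
    """Check the installed compressed-tensors version, if any."""
    try:
        return meets_minimum(metadata.version("compressed-tensors"))
    except metadata.PackageNotFoundError:
        return False
```

Calling compressed_tensors_ok() at the top of a script lets it fail early with a clear message instead of partway through model loading.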


Hardware notes

| Hardware | Notes |
| --- | --- |
| DGX Spark (GB10, sm_121a) | Primary target. Use patched vLLM CUTLASS path. Expect ~50 tok/s single-stream after warmup. |
| B100 / B200 (sm_100) | Native FP4 compute via tcgen05/UTCQMMA; fastest hardware for this format. |
| RTX PRO 6000 Blackwell (sm_120) | Native FP4 via CUTLASS path. Excellent throughput. |
| A100 / H100 (sm_80, sm_90) | NVFP4 dequantizes to BF16/FP8 at kernel level: works, but no FP4 throughput advantage. Use the BF16 release instead for best performance on these. |

Provenance


User Responsibility & Arbitration Clause

By accessing, downloading, using, running inference on, fine-tuning, merging, quantizing, distributing, integrating, or otherwise interacting with this model, you acknowledge and agree to the following:

  1. Sole Responsibility. You, the user, are solely and exclusively responsible for every prompt issued, every response produced, every downstream action taken in reliance on those responses, and any harm — direct, indirect, consequential, or otherwise — that results.

  2. No Warranty. This model is provided strictly "AS IS", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, non-infringement, safety, alignment, factual accuracy, or legal compliance in any jurisdiction. No contributor, author, publisher, or hosting platform assumes liability of any kind for outputs or downstream use.

  3. Legal Compliance. You are responsible for ensuring that your use complies with all applicable laws, regulations, terms of service, industry codes of conduct, professional ethical standards, and organizational policies in every jurisdiction in which you operate or in which your outputs may be received. The unaligned nature of this model does not grant you any legal authorization you did not already have.

  4. Operational Safety Layer. An uncensored model is not a toy. You are expected to implement appropriate downstream safety layers proportionate to your deployment context, including but not limited to: input validation, output filtering, content moderation, audit logging, rate limiting, access controls, and human-in-the-loop review for high-risk workflows. A production deployment of this model without such layers is unsafe by construction and is not a supported use case.

  5. Heightened Duty of Care. The absence of internal refusal behavior means the duty of care that would ordinarily rest partly with the model rests entirely with you. You are expected to exercise greater — not lesser — caution, forethought, and ethical discipline when operating this model. If you are uncertain whether your contemplated use is ethical, legal, or wise, the correct action is to not make the request.

  6. No Endorsement of Outputs. The authors, contributors, and publishers do not endorse, adopt, or take responsibility for any specific output. Outputs are a stochastic function of the prompt, the weights, and the sampler state — not a statement of position by any human.

  7. Arbitration. Any dispute, claim, or controversy arising out of or relating to the use of this model, its outputs, or this clause shall be resolved through binding individual arbitration under the rules of a mutually agreed arbitration body (or, absent agreement, the American Arbitration Association's Consumer Arbitration Rules), waiving any right to a jury trial, class action, representative action, or consolidated proceeding. Venue shall be the jurisdiction of the disputing party bringing the claim. Costs and attorneys' fees shall be allocated per the applicable arbitration rules. This clause does not expand, and where legally prohibited does not establish, any liability in the other direction; it limits how the user may proceed when alleging harm tied to their own use of this model.

  8. Indemnification. You agree to indemnify, defend, and hold harmless the authors, contributors, and publishers of this model from and against any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising from or related to your use of the model or your breach of this clause.

  9. Severability. If any provision is held unenforceable in a given jurisdiction, the remaining provisions remain in full force, and the unenforceable provision is replaced by the closest enforceable equivalent consistent with the original intent.

  10. Acceptance. Your use of this model constitutes your acceptance of this clause in full. If you do not accept, do not use the model.

This model is a tool with no opinions of its own. You supply the opinions. You supply the judgement. You supply the ethics. The outputs carry your fingerprints, not the model's.


License

Apache 2.0 (inherited from Qwen/Qwen3.6-27B).


☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

Model provider

AEON-7

Model tree

Base

AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text
