AEON-7
Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4
License: apache-2.0
🚀 Quickstart (DGX Spark / GB10 · DFlash · BF16 KV)
One copy-paste block: pull the canonical container, this model, and the DFlash drafter (pull FRESH), then serve with the vetted DGX Spark flags. The image ENTRYPOINT is /bin/bash, so docker run uses --entrypoint vllm.
bash
# 1) Pull the canonical AEON vLLM Ultimate container
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2) Pull THIS model (compressed-tensors body)
huggingface-cli download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4 --local-dir ./aeon-model

# 3) Pull the DFlash drafter, FRESH (do not reuse a stale copy)
huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir ./aeon-drafter

# 4) Serve (compressed-tensors body, BF16 KV cache, default drafter backend)
docker run --gpus all --ipc=host --network=host \
  -e TORCH_CUDA_ARCH_LIST="12.0+PTX" \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -v ./aeon-model:/model:ro \
  -v ./aeon-drafter:/drafter:ro \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
  --served-model-name aeon-ultimate \
  --host 0.0.0.0 --port 8000 \
  --quantization compressed-tensors \
  --mamba-cache-dtype float16 \
  --mamba-block-size 256 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --max-num-seqs 16 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":12}'
DFlash needs BF16 KV: do not add --kv-cache-dtype, and do not set the drafter attention_backend (the default is correct for Qwen3.6 on this image). Keep --gpu-memory-utilization ≤ 0.88 on DGX Spark (unified memory thrashes above that). For the full flag reference (context length, batching, multimodal cache), plain-decode (no-DFlash) variant, and hardware-tuned compose configs, see Deployment below.
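Once the container is serving, vLLM exposes its OpenAI-compatible HTTP API on the host and port given above. The following is a minimal client sketch using only the Python standard library. The helper names `build_chat_request` and `send` are illustrative, and the sampling values are placeholders, not recommended settings.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches --host / --port in the Quickstart
MODEL_NAME = "aeon-ultimate"           # matches --served-model-name


def build_chat_request(prompt, max_tokens=256, temperature=0.7):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `send(build_chat_request("Say hello in one sentence."))` returns the assistant message as a string. Building the request separately from sending it keeps the payload easy to inspect before any network call is made.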
Variants
| Format | HuggingFace repo | Disk | Quant tool | Spec decode | Hardware target | When to pick this |
|---|---|---|---|---|---|---|
| NVFP4 (this repo) | …-NVFP4 | 26 GB | llm-compressor | DFlash n=12 | DGX Spark (GB10 / sm_121a) | Production-validated for DGX Spark with the canonical aeon-vllm-ultimate:latest container. |
| Multimodal-NVFP4-MTP | …-Multimodal-NVFP4-MTP | 27 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX PRO 6000 Blackwell · B100/B200 | MTP via the model's native mtp.* head (grafted bf16 from base). modelopt format, --quantization modelopt. Vision tower preserved. GDN linear-attention preserved BF16 for best long-context fidelity. |
| Text-NVFP4-MTP | …-Text-NVFP4-MTP | 26 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX PRO 6000 · text-only | Same recipe as the Multimodal MTP sibling but with vision tower stripped. GDN preserved BF16. |
| Multimodal-NVFP4-MTP-XS | …-Multimodal-NVFP4-MTP-XS | 21 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX 5090 · tighter dedicated VRAM | Strategic split: GDN projection matmuls (in_proj_qkv/z/a/b, out_proj) → NVFP4; linear_attn.conv1d kept BF16 to preserve the recurrence-critical SSM convolution. Vision tower preserved. |
| Text-NVFP4-MTP-XS | …-Text-NVFP4-MTP-XS | 20 GB | nvidia-modelopt | qwen3_5_mtp n=3 | RTX 5090 text-only · 24 GB cards | Same conv1d-preserved strategic split as Multimodal-XS, vision tower stripped. The smallest variant we ship. |
| BF16 | …-BF16 | 51 GB | — | — | A100 / H100 80 GB · multi-GPU | Full-precision reference weights. Ampere / Hopper / pre-Blackwell hardware, fine-tuning, or quant-recipe development. |
🎯 Hardware routing — measured, not theoretical
Pick by memory architecture, not just GPU model:
| Hardware class | Use this | Why |
|---|---|---|
| DGX Spark / GB10 (unified memory, sm_121a) | this -NVFP4 (DFlash) repo ✅, or the modelopt -Multimodal-NVFP4-MTP-XS body served with DFlash (the benchmarked Spark path; see its card) | Bench on Spark: DFlash beats the MTP self-spec method on this body (see the AEON DFlash-vs-MTP routing finding); the modelopt -Multimodal-NVFP4-MTP-XS body + DFlash is the benchmarked Spark path. Don't run MTP-method on Spark. |
| RTX PRO 6000 / RTX 5090 / B100 / B200 (dedicated VRAM, sm_120/sm_100) | -NVFP4-MTP or -NVFP4-MTP-XS | MTP wins on dedicated VRAM. RTX PRO 6000 measured: XS hits 111.4 tok/s median with 69 % MTP acceptance, which beats no-spec by ~10 %. |
| A100 / H100 (no native FP4) | -BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper, so there is no benefit. |

Full bench numbers: GitHub repo Performance section.
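The routing table above can be expressed as a small lookup, which is handy in deployment scripts that need to pick a repo suffix and speculative method per host. This is a sketch: the class labels and the `route` helper are hypothetical names, and the mapping only transcribes the table.

```python
# Routing table transcribed from the hardware-routing section.
# The dictionary keys are illustrative labels, not identifiers used by any tool.
ROUTING = {
    "unified_memory_blackwell": {   # DGX Spark / GB10 (sm_121a)
        "variants": ["-NVFP4", "-Multimodal-NVFP4-MTP-XS"],
        "spec_method": "dflash",
    },
    "dedicated_vram_blackwell": {   # RTX PRO 6000 / RTX 5090 / B100 / B200
        "variants": ["-NVFP4-MTP", "-NVFP4-MTP-XS"],
        "spec_method": "qwen3_5_mtp",
    },
    "pre_blackwell": {              # A100 / H100, no native FP4
        "variants": ["-BF16"],
        "spec_method": None,
    },
}

GPU_TO_CLASS = {
    "DGX Spark": "unified_memory_blackwell",
    "GB10": "unified_memory_blackwell",
    "RTX PRO 6000": "dedicated_vram_blackwell",
    "RTX 5090": "dedicated_vram_blackwell",
    "B100": "dedicated_vram_blackwell",
    "B200": "dedicated_vram_blackwell",
    "A100": "pre_blackwell",
    "H100": "pre_blackwell",
}


def route(gpu_name):
    """Return the recommended variants and speculative method for a GPU."""
    try:
        return ROUTING[GPU_TO_CLASS[gpu_name]]
    except KeyError:
        raise ValueError(f"no routing entry for {gpu_name!r}") from None


print(route("DGX Spark"))
```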
Regular MTP vs XS — strategic quantization, not a precision compromise
The GatedDeltaNet linear_attn.* block has two distinct components: the heavy projection matmuls (in_proj_qkv, in_proj_z, in_proj_a/b, out_proj; ~11 GB total) and the SSM 1D convolution kernel (linear_attn.conv1d; small, but recurrence-critical).
- Regular MTP variants keep both at BF16. Maximum numerical safety margin, larger footprint.
- XS variants quantize the projection matmuls to NVFP4 (saves ~6 GB; FP4 is a clean win on bandwidth-bound matmuls) but explicitly preserve linear_attn.conv1d at BF16. FP4 quantization of conv1d has been observed to cause drift on long-context recurrence in community testing, so we keep it at BF16: the same principle modelopt's NVFP4_DEFAULT_CFG applies by default and the same recipe sakamakismile validated across his Qwen3.6-NVFP4-MTP series (22K+ downloads). This is not "everything to FP4"; that would be a different (and not-recommended) variant we have explicitly chosen not to ship.

🚀 DGX Spark: XS body + DFlash spec is the highest-throughput config
If you want maximum DGX Spark throughput, the highest-measured configuration is:
- Model body: -Multimodal-NVFP4-MTP-XS (modelopt format)
- Spec method: DFlash n=12 via z-lab/Qwen3.6-27B-DFlash, not the MTP head that ships with the XS variant
- Container: the canonical aeon-vllm-ultimate:latest, run with --entrypoint vllm
- Same Spark settings (--max-num-seqs 16, --gpu-memory-utilization 0.85, --max-model-len 200000)
- vLLM args: --quantization modelopt --speculative-config '{"method":"dflash","model":"/path/to/dflash-drafter","num_speculative_tokens":12}' (drafter backend = default; do not set --kv-cache-dtype)

The measured DGX Spark DFlash benchmark on aeon-vllm-ultimate:latest (long-context acceptance, per-category short-context acceptance) lives on the -Multimodal-NVFP4-MTP-XS card. That is the benchmarked body; those numbers are specific to it and do not transfer to other bodies. This -NVFP4 (compressed-tensors) repo + DFlash remains the simpler, validated path; the XS+DFlash combo is the higher-throughput path once you've been through one boot to populate the autotuner cache.
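The --speculative-config value is a JSON object passed as a single shell argument, which is easy to misquote by hand. A small sketch of generating it programmatically follows. The function name `speculative_config_arg` is hypothetical, and `/drafter` stands in for wherever the drafter is mounted.

```python
import json
import shlex


def speculative_config_arg(drafter_path, num_speculative_tokens=12, method="dflash"):
    """Return the shell-quoted JSON value for --speculative-config."""
    config = {
        "method": method,
        "model": drafter_path,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # Compact separators reproduce the whitespace-free form used in this card.
    return shlex.quote(json.dumps(config, separators=(",", ":")))


args = [
    "--quantization", "modelopt",
    "--speculative-config", speculative_config_arg("/drafter"),
]
print(" ".join(args))
```

Generating the argument with `json.dumps` and `shlex.quote` guarantees the JSON is valid and survives shell parsing unchanged, which matters when the drafter path comes from a variable.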
The production deployment format for Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 on Blackwell-class hardware. Same model, same 0/100 refusal rate, same preserved-and-enhanced capabilities of the BF16 source — compressed from 51 GB BF16 to 26 GB NVFP4 for native FP4 tensor-core throughput on DGX Spark (GB10 / sm_121a), B100 / B200, and RTX PRO 6000 Blackwell.
Performance — DGX Spark DFlash (v0.23.0 build)
These figures were measured on the current production image ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703) — vLLM 0.23.0 built from source for GB10 / sm_121a with the AEON DFlash stack (z-lab/Qwen3.6-27B-DFlash, num_speculative_tokens: 12). They were captured on the benchmarked DGX Spark body (the modelopt -Multimodal-NVFP4-MTP-XS served with DFlash — the highest-throughput Spark config); this compressed-tensors -NVFP4 body uses the same canonical container and the same DFlash recipe, so the path and the fixes apply identically.
Current build — single-stream (c=1), by category
| Category | 🟢 Decode tok/s | TTFT p50 | TPOT p50 | Prefill (PP) | DFlash accept |
|---|---|---|---|---|---|
| Coding | 41.8 | 140 ms | 23.9 ms | 322 tok/s | 34% |
| Math | 47.3 | 244 ms | 21.1 ms | 229 tok/s | 42% |
| Reasoning | 56.1 | 234 ms | 17.8 ms | 183 tok/s | 50% |
| Prose | 34.1 | 146 ms | 29.4 ms | 220 tok/s | 27% |
| Natural language | 38.3 | 137 ms | 26.1 ms | 248 tok/s | 31% |
| Extraction / JSON | 44.2 | 246 ms | 22.6 ms | 195 tok/s | 37% |
The current build scales cleanly to c=64 concurrent requests with no crash; the pre-fix image crashed under concurrent speculative decoding at c≥32 (see What we fixed for the DGX Spark). Aggregate throughput climbs from c=1 to c=64 across every category (Reasoning peaks at ~344 tok/s aggregate at c=64).
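The single-stream table is internally consistent: decode tok/s is, to within rounding, the reciprocal of TPOT. The sketch below checks that relationship on the figures copied from the table and uses TTFT plus TPOT to estimate end-to-end latency for a response. The 512-token response length is an arbitrary example, and the latency formula is a rough single-stream approximation.

```python
# (TTFT ms, TPOT ms, reported decode tok/s), copied from the table above.
ROWS = {
    "Coding": (140, 23.9, 41.8),
    "Math": (244, 21.1, 47.3),
    "Reasoning": (234, 17.8, 56.1),
    "Prose": (146, 29.4, 34.1),
    "Natural language": (137, 26.1, 38.3),
    "Extraction / JSON": (246, 22.6, 44.2),
}


def decode_rate_from_tpot(tpot_ms):
    """Tokens per second implied by a time-per-output-token in milliseconds."""
    return 1000.0 / tpot_ms


def estimated_latency_s(ttft_ms, tpot_ms, output_tokens):
    """Rough latency: time to first token, then one TPOT per further token."""
    return (ttft_ms + tpot_ms * (output_tokens - 1)) / 1000.0


for name, (ttft, tpot, reported) in ROWS.items():
    derived = decode_rate_from_tpot(tpot)
    latency = estimated_latency_s(ttft, tpot, 512)
    print(f"{name:18s} derived {derived:5.1f} tok/s, reported {reported:5.1f}; "
          f"about {latency:.1f} s for 512 tokens")
```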
Stock baseline note: a fully-vanilla stock vLLM throughput baseline for this body is pending; it has not yet been re-benchmarked on the current version. The DFlash figures above are the optimized aeon-vllm-ultimate:latest (vLLM 0.23.0) build. Any prior stock comparison numbers quoted in the AEON line are from vanilla vLLM (default settings, no DFlash / sm_121a optimizations) and are provisional, pending a fresh vanilla re-benchmark.
Long-context draft acceptance holds
DFlash draft acceptance stays healthy as agent histories grow — the payoff of the sliding-window-attention patch (PR #40898). Measured on the same build, single-stream (c=1):
| Context (measured prompt tokens) | Decode tok/s (p50) | DFlash accept |
|---|---|---|
| short (~256 tok) | 34–56 (by category) | 27–50% |
| ~20k tokens (16k tier) | 33–52 | 41–50% |
| ~41k tokens (32k tier) | 28–39 | 27–43% |
Acceptance does not collapse past the 2k-token sliding window — it holds in the 40–50% band out to ~20k and stays usable at ~41k, exactly the behavior the long-context DFlash + SWA fixes were built to deliver.
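To get a feel for what these acceptance percentages mean for throughput, the sketch below converts an acceptance rate into expected tokens emitted per target-model verification pass. It rests on an assumption the card does not state: that "DFlash accept" is the fraction of the 12 drafted tokens accepted per step. If the figure is defined differently, the numbers below do not apply.

```python
NUM_SPECULATIVE_TOKENS = 12  # num_speculative_tokens from the DFlash config


def tokens_per_target_step(acceptance_rate, k=NUM_SPECULATIVE_TOKENS):
    """Expected tokens per verification pass under the stated assumption.

    Accepted drafts (acceptance_rate * k) plus the one token the target
    model always contributes itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    return 1.0 + acceptance_rate * k


for label, rate in [("short, low end", 0.27), ("short, high end", 0.50),
                    ("~41k, low end", 0.27), ("~20k, high end", 0.50)]:
    print(f"{label:16s} accept {rate:.0%} -> "
          f"{tokens_per_target_step(rate):.2f} tokens per target pass")
```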
The BF16 source is itself the product of 72 hours of continuous research drawing on hundreds of parallel AI research agents, the industry's best published methodologies, custom in-house techniques, and yet-unreleased pre-public branches of the next-generation abliteration toolchain. See the BF16 model card for the full pipeline narrative and capability data.
What we fixed for the DGX Spark
All AEON models run on one unified container — ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703) — vLLM v0.23.0 built from source for GB10 / sm_121a and merged with the AEON speculative-decoding stack, tuned end-to-end for the GB10's unified-memory Blackwell architecture.
| Fix | What it does | Why it matters on GB10 |
|---|---|---|
| DFlash high-concurrency fix (new) | Slices the speculative drafter's KV block-table to the unpadded batch (block_table[:num_reqs]) | The drafter previously crashed at ≥32 concurrent requests (padded-vs-unpadded block-table shape mismatch in FlashAttention). Now scales cleanly to c=64. This is a port of upstream PR #43982, which fixed the same bug for MTP but was never applied to DFlash, so the bug was still present in the prior image. |
| Triton NVFP4 KV cache (PR #44389) | Software NVFP4 KV-cache path | The only 4-bit KV path on sm_121a (upstream's is hard-gated to B200) → ~3× KV capacity / longer context per GB of unified memory. |
| DFlash sliding-window attention (PR #40898) | Runs the drafter's SWA layers as true sliding-window | Long-context draft acceptance holds as agent histories grow (40–50% out to ~20k tokens) instead of collapsing past ~2k tokens. |
| sm_121a-native build | TORCH_CUDA_ARCH_LIST=12.1a, ENABLE_NVFP4_SM100=0 | Compiles the SM120-family CUTLASS NVFP4/FP8 kernels GB10 actually dispatches to — true 4-bit tensor-core throughput, no dead B200-only kernels. |
| sm_121a boot + CUDA-graph patches | RTLD-lazy _C_stable_libtorch load; spec-decode CUDA-graph capture-size alignment | Boots past MXFP4 (SM100-only) symbols absent on GB10; prevents cudaErrorIllegalAddress on partial-acceptance decode steps under speculative decoding. |
| Unified-memory tuning | Conservative --gpu-memory-utilization, FULL CUDA graphs, async scheduling, z-lab DFlash drafter | GB10 shares one LPDDR5X pool across CPU + GPU; conservative KV headroom avoids page-thrash while keeping FULL-graph + speculative-decode throughput. |
The result:
- Scales to 64 concurrent requests with no crash (the prior image crashed at c≥32 under speculative decoding).
- Native NVFP4 4-bit compute on Blackwell tensor cores — the speed of 4-bit with near-16-bit accuracy.
- Speculative decoding (DFlash) holds high draft acceptance from short prompts through long (~20k–41k token) agent histories instead of collapsing past the 2k sliding window.
- A fully-vanilla stock throughput contrast for this body is pending a fresh vanilla re-benchmark; the headline win for the v0.23.0 unified image is the c=64 concurrency fix plus the long-context acceptance hold, not a single-stream tok/s bump over the prior AEON image.
Why NVFP4 — and Why It's Effectively Lossless
NVFP4 is not a "compressed lite" version. It is the format NVIDIA designed for Blackwell-and-later silicon to be the production deployment format — accuracy on par with BF16, throughput of true 4-bit compute, no compromise required.
The accuracy guarantee comes from a two-level scaling structure that older 4-bit formats (INT4, Q4_0/Q4_K, NF4) do not have:
- E2M1 element format — 4-bit floating point per weight (sign / 2-bit exponent / 1-bit mantissa).
- Block size 16 with FP8 E4M3 per-block scales — every 16 weights share an 8-bit floating-point scale, which dramatically out-resolves the INT8 scales used by older schemes when the local weight distribution is heavy-tailed.
- FP32 per-tensor scale — global re-scale applied at the kernel boundary so block-level FP8 scales never have to span the full tensor's dynamic range.
The combined effect is that local outliers — the long-tailed weights that destroy older 4-bit formats — are absorbed by the per-block FP8 scale rather than smearing the whole quantization grid. Typical KL divergence vs the BF16 source for recipe-class NVFP4 quantization is ≤ 0.001, which is below the noise floor of stochastic sampling. A user cannot observe the difference between this model and its BF16 source; the difference is smaller than the variance from changing your random seed.
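The outlier-absorption argument can be illustrated with a toy fake-quantizer in plain Python. This is a simplified sketch, not the kernel implementation: it snaps values to the E2M1 magnitude grid with one scale per block, but keeps the block scale as an ordinary float instead of rounding it to FP8 E4M3, and it omits the FP32 per-tensor scale. The weight values are synthetic.

```python
import random

# Magnitudes a 4-bit E2M1 value can represent (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16


def fake_quantize(weights, block_size=BLOCK_SIZE):
    """Quantize then dequantize, using one scale per block of weights."""
    out = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        amax = max(abs(w) for w in block)
        # Map the block's largest magnitude onto the top grid value (6.0).
        # Simplification: real NVFP4 stores this scale in FP8 E4M3.
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        for w in block:
            mag = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
            out.append((mag if w >= 0 else -mag) * scale)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(64)]
weights[5] = 1.0  # one heavy-tailed outlier

err_block = mean_abs_error(weights, fake_quantize(weights, block_size=16))
err_tensor = mean_abs_error(weights, fake_quantize(weights, block_size=len(weights)))
print(f"one scale per 16 weights: mean abs error {err_block:.5f}")
print(f"one scale for all 64:     mean abs error {err_tensor:.5f}")
```

With a single scale, the outlier stretches the grid so far that most small weights collapse to zero. With per-block scales, only the 15 neighbors sharing the outlier's block pay that cost. In storage terms, each block of 16 four-bit values carries one 8-bit scale, so a quantized weight costs 4 + 8/16 = 4.5 bits before the per-tensor scale.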
On native FP4 silicon — Blackwell tcgen05 / UTCQMMA paths, sm_121a CUTLASS on GB10 — this format runs at full FP4 tensor-core throughput. The GPU does not dequantize back to BF16 internally. You get the speed of true 4-bit compute and the accuracy of 16-bit weights at the same time. On older silicon (A100, H100) NVFP4 dequantizes at kernel boundaries — works correctly, but no throughput advantage; for those cards use the BF16 release directly.
This release is multimodal-preserved (the vision tower stays BF16: all 333 model.visual.* tensors are retained at BF16; text inference is validated, image-input runtime validation is pending a GPU window) and hybrid-attention-preserved (the 48 linear-attention / GatedDeltaNet layers stay BF16; FP4 applies only to the q/k/v/o projections of the 16 full-attention layers and to all MLPs, where it is well-behaved). Mamba state and SSM dynamics are mathematically incompatible with FP4 and remain in BF16 by design, not by compromise.
What Changed vs BF16
| Aspect | BF16 (source) | NVFP4 (this release) |
|---|---|---|
| Disk size | 51 GB | 26 GB (49% reduction) |
| Refusal rate | 0/100 | 0/100 inherited (KL ≤ 0.001 from source — below sampling noise) |
| Multimodal | preserved | preserved (vision BF16, no degradation) |
| Hybrid SSM | repaired + intact | intact (linear_attn BF16-preserved) |
| Hardware target | A100 / H100 / RTX PRO 6000 BF16 | DGX Spark (GB10), B100/B200, RTX PRO 6000 Blackwell with native FP4 throughput |
| KL vs BF16 source | n/a | expected ≤0.001 (typical for this recipe class) |
The NVFP4 quantization scheme is NVIDIA-mandated: E2M1 element format, block_size=16, FP8 E4M3 per-block scales, FP32 per-tensor scale, symmetric signed.
Quantization Recipe
Tool: llm-compressor 0.10.1.dev107 (vllm-project) using QuantizationModifier(scheme="NVFP4") post-training quantization.
python
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",                # always
        "re:.*embed_tokens.*",    # always
        "re:.*\\.visual\\..*",    # vision tower BF16, preserves multimodal
        "re:.*visual\\..*",
        "re:.*linear_attn\\..*",  # SSM/GDN BF16; Mamba state collapses under FP4
        "re:.*norm.*",
        "re:.*q_norm.*",
        "re:.*k_norm.*",
    ],
)
Calibration: open-platypus, 512 samples × 4096 tokens.
Pipeline: sequential with sequential_targets=["Qwen3_5DecoderLayer"] — required for hybrid stacks (mixed full + linear attention layers); without explicit targeting, llm-compressor's auto-discovery silently skips layers.
Loader: AutoModelForImageTextToText to preserve the Qwen3_5ForConditionalGeneration multimodal class.
Processor: passed explicitly to oneshot() to avoid the "model processor required when a dataset is provided" failure on multimodal builds without torchvision.
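To see which modules the ignore list keeps at BF16, the patterns can be tried against module names with the standard `re` module. This sketch assumes a simplified matching rule (entries prefixed with `re:` are regexes matched from the start of the name, other entries are exact names), which may differ in detail from llm-compressor's own resolver. The module names are hypothetical examples, not read from the checkpoint.

```python
import re

IGNORE = [
    "lm_head",
    "re:.*embed_tokens.*",
    "re:.*\\.visual\\..*",
    "re:.*visual\\..*",
    "re:.*linear_attn\\..*",
    "re:.*norm.*",
    "re:.*q_norm.*",
    "re:.*k_norm.*",
]


def is_ignored(module_name, patterns=IGNORE):
    """True if any pattern matches under the simplified rule described above."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
EXAMPLES = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.3.self_attn.q_norm",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.down_proj",
]
for name in EXAMPLES:
    verdict = "BF16 (ignored)" if is_ignored(name) else "NVFP4"
    print(f"{name:56s} -> {verdict}")
```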
Verification (pass):
- 1 shard, 1952 keys
- 64 quantized full-attention projections (16 layers × 4 q/k/v/o)
- 432 linear_attn.* keys preserved BF16 (48 layers × 9 modules)
- 333 visual.* keys preserved BF16 (vision tower intact)
- 319 norm keys preserved BF16
- lm_head and embed_tokens preserved BF16
- NVFP4-packed weights present
- input_global_scale magnitudes 142–346 (healthy range)
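The counts in the verification list follow from the layer layout stated in this card, and the arithmetic is quick to reproduce. The sketch below only recomputes the figures quoted above; it does not inspect the checkpoint.

```python
FULL_ATTENTION_LAYERS = 16
LINEAR_ATTENTION_LAYERS = 48
PROJECTIONS_PER_FULL_LAYER = 4   # q, k, v, o
MODULES_PER_LINEAR_LAYER = 9     # as stated in the verification list

quantized_attention_projections = FULL_ATTENTION_LAYERS * PROJECTIONS_PER_FULL_LAYER
preserved_linear_attn_keys = LINEAR_ATTENTION_LAYERS * MODULES_PER_LINEAR_LAYER
total_decoder_layers = FULL_ATTENTION_LAYERS + LINEAR_ATTENTION_LAYERS

# Disk sizes quoted in this card, in GB.
size_reduction = 1 - 26 / 51

print(f"quantized full-attention projections: {quantized_attention_projections}")
print(f"preserved linear_attn keys:           {preserved_linear_attn_keys}")
print(f"decoder layers in total:              {total_decoder_layers}")
print(f"disk size reduction:                  {size_reduction:.0%}")
```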
Wall-clock quant time: ~57 minutes on 1× RTX PRO 6000 Blackwell (96 GB).
Deployment
vLLM on DGX Spark (GB10 / sm_121a) — recommended
Use the canonical patched image ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703). It bundles the SM121 CUTLASS NVFP4 patches, FlashInfer stable, TurboQuant, and the DFlash drafter integration. The patched CUTLASS path uses native FP4 tensor-core kernels and outperforms the Marlin fallback — do NOT force VLLM_NVFP4_GEMM_BACKEND=marlin (that's the workaround for stock vLLM builds where CUTLASS is broken on SM121).
The image ENTRYPOINT is /bin/bash, so docker run must pass --entrypoint vllm and then serve … (writing IMAGE vllm serve would run bash vllm serve and fail).
The recommended, validated path is DFlash speculative decoding (num_speculative_tokens: 12) via the z-lab/Qwen3.6-27B-DFlash drafter. Copy-paste docker run for the canonical DGX Spark config:
bash
docker run --gpus all --ipc=host --network=host \
  -e TORCH_CUDA_ARCH_LIST="12.0+PTX" \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -v /path/to/model:/models/aeon-ultimate \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /models/aeon-ultimate \
  --served-model-name aeon-ultimate \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --dtype auto \
  --quantization compressed-tensors \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.85 \
  --mamba-cache-dtype float16 \
  --mamba-block-size 256 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":12}'
The DFlash drafter requires KV cache at BF16: do not add --kv-cache-dtype, and do not set the drafter attention_backend (the default is correct for Qwen3.6 on this image). For a fully-flagged production setup with hardware-tuned compose configs, see the docker-compose recipe in the deployment repo.
For a minimal manual docker run without DFlash (plain decode):
bash
docker run --gpus all --ipc=host --network=host \
  -e TORCH_CUDA_ARCH_LIST="12.0+PTX" \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -v /path/to/model:/models/aeon-ultimate \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /models/aeon-ultimate \
  --served-model-name aeon-ultimate \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --dtype auto \
  --quantization compressed-tensors \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --load-format safetensors \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --attention-backend flash_attn \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm
Key settings (tuned for DGX Spark 128 GB unified memory):
- --max-num-seqs 64: Conservative for 262K context. Raise to 128 only for short-context workloads. The DGX Spark's 128 GB is unified between CPU and GPU; KV cache for 128 concurrent long-context sequences will exhaust it.
- --max-num-batched-tokens 16384: Recommended default prefill chunk (chunked prefill keeps full 256k context). Frees ~3 GiB of load-time activation vs 32768, so the model fits tighter cards (e.g. 32 GB RTX 5090) where 32768 can OOM at startup, with negligible throughput cost (validated 2026-06-19). 32768 remains safe on ample-VRAM cards (it matches vLLM's inductor compile-range ceiling compile_ranges_endpoints: [32768]; above 32k prefill falls back to eager mode); raise to it for marginally better long-prefill throughput if you have headroom. The stock vLLM default of 65536 will OOM under concurrent long-context requests on Spark's unified memory.
- --gpu-memory-utilization 0.85: Leaves 15 % headroom for KV cache spikes. Do not push above 0.88 on DGX Spark; unified memory means 0.90+ thrashes.
- --max-model-len 262144: Full context window. Reduce to 131072 if you need more concurrent sequences.
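The guidance above can be turned into a pre-flight check that runs before launching the container. This is a sketch under stated assumptions: `check_spark_flags` is a hypothetical helper, and the thresholds are one reading of the prose above (in particular, "long context" is taken to mean the full 262144 window), not limits enforced by vLLM.

```python
def check_spark_flags(gpu_memory_utilization, max_num_batched_tokens,
                      max_num_seqs, max_model_len):
    """Return a list of warnings for settings this card advises against."""
    warnings = []
    if gpu_memory_utilization > 0.88:
        warnings.append(
            "gpu-memory-utilization above 0.88 risks unified-memory thrash")
    if max_num_batched_tokens > 32768:
        warnings.append(
            "max-num-batched-tokens above 32768 can OOM under concurrent "
            "long-context requests")
    # Assumption: 'long context' means the full 262144-token window.
    if max_num_seqs > 64 and max_model_len >= 262144:
        warnings.append(
            "more than 64 sequences at full context can exhaust the KV cache; "
            "reduce max-model-len or max-num-seqs")
    return warnings


recommended = check_spark_flags(0.85, 16384, 64, 262144)
aggressive = check_spark_flags(0.90, 65536, 128, 262144)
print("recommended config warnings:", recommended)
print("aggressive config warnings: ", len(aggressive))
```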
Python (transformers) — for testing or non-vLLM serving
python
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

model_id = "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype=torch.bfloat16,  # vision tower + non-quantized weights
    device_map="cuda:0",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Requires compressed-tensors >= 0.12 for NVFP4 dequant on the fly.
Hardware notes
| Hardware | Notes |
|---|---|
| DGX Spark (GB10, sm_121a) | Primary target. Use patched vLLM CUTLASS path. Expect ~50 tok/s single-stream after warmup. |
| B100 / B200 (sm_100) | Native FP4 compute via tcgen05/UTCQMMA — fastest hardware for this format. |
| RTX PRO 6000 Blackwell (sm_120) | Native FP4 via CUTLASS path. Excellent throughput. |
| A100 / H100 (sm_80, sm_90) | NVFP4 dequantizes to BF16/FP8 at kernel level — works but no FP4 throughput advantage. Use BF16 release instead for best perf on these. |
Provenance
- BF16 source: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16; see source card for full pipeline (FernflowerAI SSM repair → abliterix-v1.4 abliteration → trial 46 of 50 selected for capability preservation).
- Original base: Qwen/Qwen3.6-27B by Alibaba.
- Quantization tool: llm-compressor by vllm-project.
- NVFP4 scheme: NVIDIA NVFP4 specification.
User Responsibility & Arbitration Clause
By accessing, downloading, using, running inference on, fine-tuning, merging, quantizing, distributing, integrating, or otherwise interacting with this model, you acknowledge and agree to the following:
- Sole Responsibility. You, the user, are solely and exclusively responsible for every prompt issued, every response produced, every downstream action taken in reliance on those responses, and any harm (direct, indirect, consequential, or otherwise) that results.
- No Warranty. This model is provided strictly "AS IS", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, non-infringement, safety, alignment, factual accuracy, or legal compliance in any jurisdiction. No contributor, author, publisher, or hosting platform assumes liability of any kind for outputs or downstream use.
- Legal Compliance. You are responsible for ensuring that your use complies with all applicable laws, regulations, terms of service, industry codes of conduct, professional ethical standards, and organizational policies in every jurisdiction in which you operate or in which your outputs may be received. The unaligned nature of this model does not grant you any legal authorization you did not already have.
- Operational Safety Layer. An uncensored model is not a toy. You are expected to implement appropriate downstream safety layers proportionate to your deployment context, including but not limited to: input validation, output filtering, content moderation, audit logging, rate limiting, access controls, and human-in-the-loop review for high-risk workflows. A production deployment of this model without such layers is unsafe by construction and is not a supported use case.
- Heightened Duty of Care. The absence of internal refusal behavior means the duty of care that would ordinarily rest partly with the model rests entirely with you. You are expected to exercise greater, not lesser, caution, forethought, and ethical discipline when operating this model. If you are uncertain whether your contemplated use is ethical, legal, or wise, the correct action is to not make the request.
- No Endorsement of Outputs. The authors, contributors, and publishers do not endorse, adopt, or take responsibility for any specific output. Outputs are a stochastic function of the prompt, the weights, and the sampler state, not a statement of position by any human.
- Arbitration. Any dispute, claim, or controversy arising out of or relating to the use of this model, its outputs, or this clause shall be resolved through binding individual arbitration under the rules of a mutually agreed arbitration body (or, absent agreement, the American Arbitration Association's Consumer Arbitration Rules), waiving any right to a jury trial, class action, representative action, or consolidated proceeding. Venue shall be the jurisdiction of the disputing party bringing the claim. Costs and attorneys' fees shall be allocated per the applicable arbitration rules. This clause does not expand, and where legally prohibited does not establish, any liability in the other direction; it limits how the user may proceed when alleging harm tied to their own use of this model.
- Indemnification. You agree to indemnify, defend, and hold harmless the authors, contributors, and publishers of this model from and against any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising from or related to your use of the model or your breach of this clause.
- Severability. If any provision is held unenforceable in a given jurisdiction, the remaining provisions remain in full force, and the unenforceable provision is replaced by the closest enforceable equivalent consistent with the original intent.
- Acceptance. Your use of this model constitutes your acceptance of this clause in full. If you do not accept, do not use the model.
This model is a tool with no opinions of its own. You supply the opinions. You supply the judgement. You supply the ethics. The outputs carry your fingerprints, not the model's.
License
Apache 2.0 (inherited from Qwen/Qwen3.6-27B).
☕ Support the work
If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.
Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.
Model provider: AEON-7
Model tree: base AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16; quantized: this model
Modalities: input Video, Text, Image; output Text