AEON-7

DFlash-Qwen3.5-27B-Uncensored-NVFP4

License: apache-2.0

Quick Start (DGX Spark)

1. Download the model

bash

huggingface-cli download AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --local-dir ~/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4

2. Create your environment file

bash

# Auto-generate API key and create .env
cat > .env.dflash << 'EOF'
# Authentication
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=$(openssl rand -hex 32)
# Model path
MODEL_HOST_PATH=~/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
# DFlash speculative decoding (auto-downloads drafter on first run)
DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash
DFLASH_NUM_SPEC_TOKENS=15
# DGX Spark optimal settings (64K context, 4 concurrent sequences)
MAX_MODEL_LEN=65536
MAX_NUM_SEQS=4
GPU_MEMORY_UTILIZATION=0.85
MAX_NUM_BATCHED_TOKENS=65536
EOF
# Generate a real API key and inject it
sed -i "s|\$(openssl rand -hex 32)|$(openssl rand -hex 32)|" .env.dflash
echo "Your API key: $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)"

3. Save docker-compose.dflash.yml

yaml

services:
  vllm-dflash:
    image: ghcr.io/aeon-7/vllm-dflash:latest
    container_name: vllm-dflash
    restart: unless-stopped
    network_mode: host
    ipc: host
    volumes:
      - ${MODEL_HOST_PATH}:/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
      - dflash-drafter-cache:/models/drafter-cache
    environment:
      - MODEL_PATH=/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
      - SERVED_MODEL_NAME=DFlash-Qwen3.5-27B-Uncensored
      - DFLASH_DRAFTER=${DFLASH_DRAFTER}
      - DFLASH_NUM_SPEC_TOKENS=${DFLASH_NUM_SPEC_TOKENS}
      - GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION}
      - MAX_MODEL_LEN=${MAX_MODEL_LEN}
      - MAX_NUM_SEQS=${MAX_NUM_SEQS}
      - MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS}
      - NVIDIA_VISIBLE_DEVICES=all
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${VLLM_API_KEY}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  dflash-drafter-cache:

4. Launch

bash

docker compose --env-file .env.dflash -f docker-compose.dflash.yml up -d
# Watch startup (~5 min for weight loading + CUDA graph compilation)
docker compose -f docker-compose.dflash.yml logs -f

5. Test

bash

# Text generation
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 200
  }'

# Vision (image understanding)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
      {"type": "text", "text": "What do you see?"}
    ]}],
    "max_tokens": 200
  }'
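
The same text request can be issued from Python. The sketch below uses only the standard library and assumes the server exposes the OpenAI-style chat completions response shape (`choices[0].message.content`); the helper names `build_chat_request` and `send` are illustrative, not part of this release.

```python
import json
import urllib.request


def build_chat_request(api_key, prompt, base_url="http://localhost:8000",
                       model="DFlash-Qwen3.5-27B-Uncensored", max_tokens=200):
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes an OpenAI-compatible response body.
    return body["choices"][0]["message"]["content"]
```

With the container running, `send(build_chat_request(key, "Explain quantum entanglement simply."))` should return the generated text, where `key` is the VLLM_API_KEY value from .env.dflash.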

Environment Variables

| Variable | Default | Description |
|---|---|---|
| MODEL_HOST_PATH | | Host path to model weights |
| DFLASH_DRAFTER | z-lab/Qwen3.5-27B-DFlash | HF repo ID for drafter (auto-downloaded). Set to off to disable. |
| DFLASH_NUM_SPEC_TOKENS | 15 | Tokens per draft step. 15 = fast single-stream, 5 = high concurrency. |
| VLLM_API_KEY | | API key for LAN authentication |
| HF_TOKEN | | HuggingFace token for gated models |
| GPU_MEMORY_UTILIZATION | 0.80 | GPU memory fraction |
| MAX_MODEL_LEN | 4096 | Max sequence length |
| MAX_NUM_SEQS | 8 | Max concurrent sequences |

Performance (DGX Spark GB10)

DFlash Speculative Decoding (Measured)

| Configuration | Short (200 tok) | Long (2000 tok) | Speedup |
|---|---|---|---|
| No speculation | 12.2 tok/s | 12.2 tok/s | 1.0x |
| DFlash (5 spec tokens) | 29.5 tok/s | 25.4 tok/s | 2.1-2.4x |
| DFlash (10 spec tokens) | 28.7 tok/s | 25.5 tok/s | 2.1-2.4x |
| DFlash (15 spec tokens) | 33.2 tok/s | 26.3 tok/s | 2.2-2.7x |
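
The Speedup column is each measured decode speed divided by the 12.2 tok/s baseline. A small check, with the numbers copied from the table above (the function name `speedup_range` is just for illustration):

```python
BASELINE_TOK_S = 12.2  # "No speculation" row

# spec tokens -> (short-generation tok/s, long-generation tok/s)
MEASURED = {
    5: (29.5, 25.4),
    10: (28.7, 25.5),
    15: (33.2, 26.3),
}


def speedup_range(short_tok_s, long_tok_s, baseline=BASELINE_TOK_S):
    """Return (low, high) speedup over the baseline, rounded to one decimal."""
    ratios = sorted((short_tok_s / baseline, long_tok_s / baseline))
    return round(ratios[0], 1), round(ratios[1], 1)


for spec_tokens, (short_speed, long_speed) in MEASURED.items():
    low, high = speedup_range(short_speed, long_speed)
    print(f"{spec_tokens:>2} spec tokens: {low}x to {high}x")
```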

Throughput Scaling

| Concurrent | Aggregate tok/s | Per-Request Latency |
|---|---|---|
| 1 | 33.2 | 6.0s |
| 2 | 47.9 | 7.7s |
| 4 | 85.5 | 8.3s |
| 8 | 92.5 | 12.9s |
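
To see what each client experiences, divide the aggregate rate by the number of concurrent requests. The sketch below does that and also reports scaling efficiency relative to perfect linear scaling from the single-stream rate; both helper names are illustrative, and the inputs are the table values above.

```python
# concurrency -> aggregate tok/s, copied from the table above
AGGREGATE = {1: 33.2, 2: 47.9, 4: 85.5, 8: 92.5}


def per_stream_tok_s(aggregate_tok_s, concurrency):
    """Average decode rate seen by one request."""
    return aggregate_tok_s / concurrency


def scaling_efficiency(aggregate_tok_s, concurrency, single_stream=AGGREGATE[1]):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return aggregate_tok_s / (concurrency * single_stream)


for n, total in AGGREGATE.items():
    print(f"{n} concurrent: {per_stream_tok_s(total, n):.1f} tok/s per stream, "
          f"{scaling_efficiency(total, n):.0%} of linear")
```

Aggregate throughput keeps rising through 8 concurrent requests, but the per-stream rate at 8 drops to roughly a third of the single-stream rate.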

Baseline (No Speculation)

| Metric | Value |
|---|---|
| Decode Speed | 12.2 tok/s |
| TTFT | 98-138 ms |
| ITL (p50/p99) | 81 / 88 ms |

What Makes This Model Special

Why Dense Over MoE

Qwen3.5 comes in two flavors: the 122B-A10B MoE (256 experts, 10B active per token) and this 27B dense model (all parameters active on every token). The dense model has real advantages:

  • Higher quality per parameter: Every one of the 27B parameters contributes to every token. MoE models route to a sparse subset, which means some experts can be undertrained and routing decisions introduce noise. Dense models don't have this problem.
  • No routing overhead: MoE models spend compute on expert selection, load balancing, and all-to-all communication. Dense models just run the computation.
  • Predictable latency: No variance from different experts being selected per token. Every forward pass costs the same.
  • Simpler deployment: No expert parallelism concerns, no load imbalance, and the model fits on a single GPU with NVFP4.

The tradeoff has always been speed: a 27B dense model moves all parameters through memory per token. On a memory-bandwidth-limited device like DGX Spark (273 GB/s), that meant 12 tok/s baseline. DFlash changes this entirely.

Why DFlash Makes Dense Practical on DGX Spark

The fundamental bottleneck on DGX Spark is memory bandwidth. Every decoded token requires streaming all ~20 GB of NVFP4 weights, so at 273 GB/s the ceiling is about 273 / 20 ≈ 13.7 tok/s in theory and ~12 tok/s in practice. Every dense model hits this wall.
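
The bandwidth limit can be checked with a back-of-the-envelope roofline estimate. This sketch assumes decode is purely weight-bandwidth bound (it ignores KV cache reads, activations, and kernel overhead), and `decode_ceiling_tok_s` is an illustrative name.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb):
    """Upper bound on decode rate when every token streams all weights once."""
    return bandwidth_gb_s / weights_gb


ceiling = decode_ceiling_tok_s(bandwidth_gb_s=273, weights_gb=20)
measured = 12.2  # baseline decode speed from the Performance section

print(f"theoretical ceiling: {ceiling:.2f} tok/s")
print(f"measured baseline reaches {measured / ceiling:.0%} of the ceiling")
```

Under these assumptions the measured baseline already sits close to the ceiling, which is why further gains have to come from producing more tokens per weight pass rather than from faster kernels.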

DFlash block-diffusion speculative decoding breaks through it:

  1. The 2B drafter proposes multiple tokens simultaneously — one diffusion forward pass generates an entire block of speculative tokens in parallel, not sequentially. This costs roughly the same as generating a single token.
  2. The 27B target verifies all proposed tokens in one forward pass — instead of paying the full memory bandwidth cost per token, you pay it once and produce 3-4 accepted tokens on average.
  3. Net effect: you amortize the bandwidth cost across multiple tokens per forward pass.
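
The three steps above can be illustrated with a toy greedy verification loop. This is generic speculative-decoding acceptance logic written for clarity, not DFlash's or vLLM's actual implementation; `verify_block` and `toy_target` are hypothetical names, and a real engine obtains all target predictions from one batched forward pass instead of calling the model per position.

```python
def verify_block(context, draft_tokens, target_next_token):
    """Accept the longest draft prefix the target agrees with, plus one target token.

    target_next_token(seq) returns the target model's greedy next token for seq.
    """
    accepted = []
    for token in draft_tokens:
        expected = target_next_token(context + accepted)
        if token != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every draft token matched, so the same pass also yields one bonus token.
    accepted.append(target_next_token(context + accepted))
    return accepted


def toy_target(seq):
    """Stand-in for the 27B target: always predicts last token + 1."""
    return seq[-1] + 1


# Draft gets the first two tokens right, then diverges.
print(verify_block([1, 2, 3], [4, 5, 9, 10], toy_target))
```

Whatever the drafter proposes, the accepted tokens are exactly what the target would have produced on its own, which is why verification preserves output quality.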

The result on DGX Spark:

| | Without DFlash | With DFlash |
|---|---|---|
| Single-stream | 12.2 tok/s | 33.2 tok/s |
| Effective bandwidth utilization | 1 token per pass | ~3.5 tokens per pass |
| Practical feel | Sluggish, noticeable delay | Responsive, fluid |

This makes the 27B dense model faster than the 122B MoE on a single DGX Spark while delivering the quality advantages of a dense architecture. DFlash turns the DGX Spark from "it can run a 27B model" into "it runs a 27B model well."

Hybrid Architecture

Qwen3.5-27B uses a hybrid architecture mixing two attention types across 64 layers:

  • Linear attention (GDN): Gated Delta Network layers for efficient long-context processing with O(1) per-token state (48 layers)
  • Full attention: Standard softmax attention (24 query heads sharing 4 KV heads) every 4th layer for global context capture (16 layers)

This gives near-linear scaling with sequence length while maintaining full-attention quality at key intervals.
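
The layer counts follow directly from the "every 4th layer" rule. The sketch below generates one layout consistent with the description; which position inside each group of four holds the full-attention layer is an assumption here (the last one), as are the labels `linear_attention` and `full_attention`.

```python
from collections import Counter


def layer_types(num_layers=64, full_attention_interval=4):
    """Return a per-layer type list with full attention on every Nth layer."""
    return [
        # Assumption: layers 4, 8, ..., 64 (1-based) are the full-attention ones.
        "full_attention" if (index + 1) % full_attention_interval == 0
        else "linear_attention"
        for index in range(num_layers)
    ]


counts = Counter(layer_types())
print(counts["linear_attention"], "GDN layers,", counts["full_attention"], "full-attention layers")
```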

Vision + Text

The model includes a 27-layer ViT vision encoder (460M params, BF16) with a merger that projects visual features into the language model's hidden space, so it supports image understanding alongside text generation.

DFlash Block-Diffusion Speculative Decoding

z-lab/Qwen3.5-27B-DFlash is a 2B block-diffusion drafter that generates all speculative tokens simultaneously in a single diffusion step, rather than one at a time as in standard speculative decoding. The 27B target model then verifies the whole block in one pass, achieving a 2-5x speedup (2.1-2.7x in the DGX Spark measurements in the Performance section) with zero quality loss.

Key difference from standard spec decode: drafting cost is ~constant regardless of token count (one diffusion forward pass), so the tradeoff is purely about verification overhead vs acceptance rate.
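
A simple way to reason about this tradeoff is the expected number of tokens produced per target pass. The formula below is the usual speculative-decoding estimate and assumes every draft token is accepted independently with the same probability, which is a simplification rather than a property of DFlash; the acceptance rate of 0.75 is a hypothetical value, not a measurement.

```python
def expected_tokens_per_pass(acceptance_rate, num_spec_tokens):
    """Expected tokens per target pass: accepted draft prefix plus one target token.

    Assumes independent, identical per-token acceptance (a simplification).
    """
    if acceptance_rate >= 1.0:
        return float(num_spec_tokens + 1)
    return (1 - acceptance_rate ** (num_spec_tokens + 1)) / (1 - acceptance_rate)


for k in (5, 10, 15):
    # 0.75 is a made-up acceptance rate used only for illustration.
    print(f"{k:>2} spec tokens: {expected_tokens_per_pass(0.75, k):.2f} tokens per pass")
```

Under this model the gain from longer blocks flattens quickly, because late positions in a block are rarely reached. That is consistent with 5 speculative tokens being suggested for high concurrency, where the extra verification work per block is more costly.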

AWQ_FULL Quantization

This model uses the most thorough NVFP4 quantization pipeline available:

  1. AWQ_FULL — Exhaustive grid search with alpha_step=0.1 across 10 scaling factors per layer, plus a second awq_clip pass that optimizes clipping ratios
  2. Full NVFP4 Quantization — All attention projections (Q/K/V/O) and all MLP layers (gate/up/down) quantized to FP4. Excludes: vision tower, embeddings, norms, and lm_head
  3. Pre-quantization scales — Channel-wise BF16 factors that redistribute weight magnitudes before quantization

Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3.5 (Hybrid, 27B parameters) |
| Layers | 64 (48 GDN + 16 full-attention) |
| Hidden Size | 5120 |
| Attention Heads | 24 (4 KV heads), head_dim=256 |
| Vision Encoder | 27-layer ViT, 460M params (BF16) |
| Max Context | 131,072 tokens |
| Vocabulary | 248,320 tokens |
| Quantization | NVFP4 AWQ_FULL (ModelOpt 0.43.0) |
| Model Size | ~20 GB (quantized + vision) |
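
These figures allow a rough KV cache estimate for the target model. The sketch assumes only the 16 full-attention layers keep a per-token KV cache (the GDN layers hold constant-size state), a BF16 cache as used when DFlash is enabled, and no paging or allocator overhead; the drafter's own cache is not counted, and the function names are illustrative.

```python
def kv_cache_bytes_per_token(full_attention_layers=16, kv_heads=4,
                             head_dim=256, bytes_per_element=2):
    """Bytes of K and V stored per token across all full-attention layers."""
    return 2 * full_attention_layers * kv_heads * head_dim * bytes_per_element


def kv_cache_gib(context_tokens, num_sequences=1):
    """Total KV cache in GiB for sequences filled to context_tokens."""
    total_bytes = kv_cache_bytes_per_token() * context_tokens * num_sequences
    return total_bytes / 2**30


print(kv_cache_bytes_per_token(), "bytes per token")
print(kv_cache_gib(65536, num_sequences=4), "GiB for 4 sequences at 64K context")
```

Under these assumptions the Quick Start settings (64K context, 4 sequences) would need about 16 GiB of KV cache if every sequence were filled to the limit.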

NVFP4 Weight Format

Each quantized layer stores:

  • weight (uint8) — packed FP4 E2M1 pairs (16-element blocks)
  • weight_scale (float8_e4m3fn) — per-block scale (1 per 16 elements)
  • weight_scale_2 (float32) — per-tensor global scale
  • pre_quant_scale (bfloat16) — AWQ per-channel pre-scaling factors
  • input_scale (float32) — static activation scale from calibration
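
To make the layout concrete, here is a minimal pure-Python sketch of dequantizing one 16-element block from the `weight`, `weight_scale`, and `weight_scale_2` tensors. It assumes the low nibble of each byte holds the earlier element and that the block scale has already been converted from float8_e4m3fn to a Python float; it ignores `pre_quant_scale` and `input_scale`, and the function names are illustrative rather than taken from ModelOpt or vLLM.

```python
# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble):
    """Decode a 4-bit E2M1 code into a float."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0b0111]


def unpack_byte(byte):
    """Split one uint8 into two FP4 values (assumed order: low nibble first)."""
    return decode_e2m1(byte & 0x0F), decode_e2m1(byte >> 4)


def dequantize_block(packed_bytes, block_scale, global_scale):
    """Dequantize one 16-element block stored as 8 packed bytes."""
    values = []
    for byte in packed_bytes:
        values.extend(unpack_byte(byte))
    # Two-level scaling: per-block scale times per-tensor global scale.
    return [value * block_scale * global_scale for value in values]


block = dequantize_block([0x72] * 8, block_scale=0.5, global_scale=2.0)
print(len(block), block[:2])
```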

Optimization Stack

| Optimization | Status |
|---|---|
| torch.compile (inductor) | Active |
| CUDA graphs (FULL + PIECEWISE) | Active |
| FlashInfer CUTLASS FP4 GEMM | Autotuned for GB10 |
| Flash Attention v2 | Active |
| Triton/FLA GDN prefill kernel | Active |
| FP8 KV cache | Active (BF16 when DFlash enabled) |
| Chunked prefill | Active |
| Prefix caching | Active |
| Act-quant fusion | Active |

Alternative Deployment Methods

vLLM (Manual)

bash

vllm serve AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --quantization modelopt \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.80 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 8 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code

Note: DFlash uses non-causal attention, which requires --kv-cache-dtype auto (BF16). FP8 KV cache is incompatible with DFlash.

SGLang

bash

python -m sglang.launch_server \
  --model-path AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend fa3 \
  --mem-fraction-static 0.75 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code

Credits

THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. This model has had safety alignment removed. Users are responsible for ensuring ethical and legal use.


☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

Model provider: AEON-7

Model tree: base AEON-7/DFlash-Qwen3.5-27B-Uncensored; quantized: this model

Modalities: input Video, Text, Image; output Text
