AEON-7
DFlash-Qwen3.5-27B-Uncensored-NVFP4
License: apache-2.0
Quick Links
| Resource | Details |
|---|---|
| Get Started | Step-by-step quick start guide on DGX Spark |
| Docker Image | ghcr.io/aeon-7/vllm-dflash:latest |
| DFlash Drafter | z-lab/Qwen3.5-27B-DFlash |
| Base Model | Qwen/Qwen3.5-27B |
| DFlash Paper | arXiv 2602.06036 |
Quick Start (DGX Spark)
1. Download the model
bash
huggingface-cli download AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --local-dir ~/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
2. Create your environment file
bash
# Auto-generate API key and create .env
cat > .env.dflash << 'EOF'
# Authentication
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=$(openssl rand -hex 32)
# Model path
MODEL_HOST_PATH=~/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
# DFlash speculative decoding (auto-downloads drafter on first run)
DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash
DFLASH_NUM_SPEC_TOKENS=15
# DGX Spark optimal settings (64K context, 4 concurrent sequences)
MAX_MODEL_LEN=65536
MAX_NUM_SEQS=4
GPU_MEMORY_UTILIZATION=0.85
MAX_NUM_BATCHED_TOKENS=65536
EOF
# Generate a real API key and inject it
sed -i "s|\$(openssl rand -hex 32)|$(openssl rand -hex 32)|" .env.dflash
echo "Your API key: $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)"
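Because the heredoc is quoted, the API key line is written as literal text and only becomes a real key once the sed command runs. A minimal sketch of a pre-launch check for that, and for the keys the compose file interpolates, is shown here. The function name check_env_file is hypothetical and not part of the release.

```shell
# Hypothetical helper: confirm an env file defines every key used by the
# compose file and that the API key placeholder has been replaced.
check_env_file() {
  file=$1
  status=0
  for key in HF_TOKEN VLLM_API_KEY MODEL_HOST_PATH DFLASH_DRAFTER \
             DFLASH_NUM_SPEC_TOKENS MAX_MODEL_LEN MAX_NUM_SEQS \
             GPU_MEMORY_UTILIZATION MAX_NUM_BATCHED_TOKENS; do
    # Require "KEY=" followed by at least one character.
    if ! grep -q "^${key}=." "$file"; then
      echo "missing or empty: $key"
      status=1
    fi
  done
  # A value that still starts with "$(" means the sed step did not run.
  if grep -q '^VLLM_API_KEY=\$(' "$file"; then
    echo "VLLM_API_KEY still contains the unexpanded placeholder"
    status=1
  fi
  return "$status"
}
```

Running check_env_file .env.dflash before launching prints one line per problem and returns non-zero if anything is missing.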
3. Save docker-compose.dflash.yml
yaml
services:
  vllm-dflash:
    image: ghcr.io/aeon-7/vllm-dflash:latest
    container_name: vllm-dflash
    restart: unless-stopped
    network_mode: host
    ipc: host
    volumes:
      - ${MODEL_HOST_PATH}:/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
      - dflash-drafter-cache:/models/drafter-cache
    environment:
      - MODEL_PATH=/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
      - SERVED_MODEL_NAME=DFlash-Qwen3.5-27B-Uncensored
      - DFLASH_DRAFTER=${DFLASH_DRAFTER}
      - DFLASH_NUM_SPEC_TOKENS=${DFLASH_NUM_SPEC_TOKENS}
      - GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION}
      - MAX_MODEL_LEN=${MAX_MODEL_LEN}
      - MAX_NUM_SEQS=${MAX_NUM_SEQS}
      - MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS}
      - NVIDIA_VISIBLE_DEVICES=all
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${VLLM_API_KEY}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  dflash-drafter-cache:
4. Launch
bash
docker compose --env-file .env.dflash -f docker-compose.dflash.yml up -d
# Watch startup (~5 min for weight loading + CUDA graph compilation)
docker compose -f docker-compose.dflash.yml logs -f
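Since startup takes several minutes, a script that launches the container usually needs to wait before sending requests. The following is a minimal sketch of a polling loop. The names retry_until and vllm_ready are hypothetical, and the probe assumes the container exposes the GET /health route of the vLLM OpenAI-compatible server on port 8000, which this README does not confirm.

```shell
# retry_until TRIES DELAY CMD...: run CMD until it succeeds or TRIES
# attempts have been used, sleeping DELAY seconds between attempts.
retry_until() {
  tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Assumed probe: succeeds once the server answers GET /health with 2xx.
vllm_ready() {
  curl -sf -o /dev/null http://localhost:8000/health
}
```

For example, retry_until 60 5 vllm_ready waits up to about five minutes, which matches the startup time noted above.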
5. Test
bash
# Text generation
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 200
  }'
# Vision (image understanding)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
      {"type": "text", "text": "What do you see?"}
    ]}],
    "max_tokens": 200
  }'
Environment Variables
| Variable | Default | Description |
|---|---|---|
| MODEL_HOST_PATH | — | Host path to model weights |
| DFLASH_DRAFTER | z-lab/Qwen3.5-27B-DFlash | HF repo ID for drafter (auto-downloaded). Set to off to disable. |
| DFLASH_NUM_SPEC_TOKENS | 15 | Tokens per draft step. 15 = fast single-stream, 5 = high concurrency. |
| VLLM_API_KEY | — | API key for LAN authentication |
| HF_TOKEN | — | HuggingFace token for gated models |
| GPU_MEMORY_UTILIZATION | 0.80 | GPU memory fraction |
| MAX_MODEL_LEN | 4096 | Max sequence length |
| MAX_NUM_SEQS | 8 | Max concurrent sequences |
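The defaults in the table apply only when a variable is unset. A minimal sketch of that fallback using shell parameter expansion follows. It illustrates the table, not the container's actual entrypoint, and the function name is hypothetical.

```shell
# Print the value each setting would take: the environment value if set
# and non-empty, otherwise the default listed in the table.
print_effective_settings() {
  echo "DFLASH_DRAFTER=${DFLASH_DRAFTER:-z-lab/Qwen3.5-27B-DFlash}"
  echo "DFLASH_NUM_SPEC_TOKENS=${DFLASH_NUM_SPEC_TOKENS:-15}"
  echo "GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION:-0.80}"
  echo "MAX_MODEL_LEN=${MAX_MODEL_LEN:-4096}"
  echo "MAX_NUM_SEQS=${MAX_NUM_SEQS:-8}"
}

print_effective_settings
```

Note that the Quick Start env file overrides three of these defaults (MAX_MODEL_LEN=65536, MAX_NUM_SEQS=4, GPU_MEMORY_UTILIZATION=0.85).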
Performance (DGX Spark GB10)
DFlash Speculative Decoding (Measured)
| Configuration | Short (200 tok) | Long (2000 tok) | Speedup |
|---|---|---|---|
| No speculation | 12.2 tok/s | 12.2 tok/s | 1.0x |
| DFlash (5 spec tokens) | 29.5 tok/s | 25.4 tok/s | 2.1-2.4x |
| DFlash (10 spec tokens) | 28.7 tok/s | 25.5 tok/s | 2.1-2.4x |
| DFlash (15 spec tokens) | 33.2 tok/s | 26.3 tok/s | 2.2-2.7x |
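The Speedup column is each measured rate divided by the 12.2 tok/s no-speculation baseline. A small sketch of that calculation using the numbers from the table (the function name is illustrative):

```shell
# speedup RATE BASELINE: ratio of a measured decode rate to the baseline.
speedup() {
  awk -v rate="$1" -v base="$2" 'BEGIN { printf "%.2f\n", rate / base }'
}

# Short and long results for 5, 10, and 15 speculative tokens.
for rate in 29.5 25.4 28.7 25.5 33.2 26.3; do
  echo "$rate tok/s -> $(speedup "$rate" 12.2)x"
done
```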
Throughput Scaling
| Concurrent | Aggregate tok/s | Per-Request Latency |
|---|---|---|
| 1 | 33.2 | 6.0s |
| 2 | 47.9 | 7.7s |
| 4 | 85.5 | 8.3s |
| 8 | 92.5 | 12.9s |
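One way to read the scaling table is to divide aggregate throughput by the number of concurrent requests, which gives the average rate each request sees. This is a derived figure under the assumption that throughput is shared evenly, not a separate measurement.

```shell
# per_request_rate AGGREGATE CONCURRENCY: average tok/s seen by one request.
per_request_rate() {
  awk -v agg="$1" -v n="$2" 'BEGIN { printf "%.1f\n", agg / n }'
}

while read -r n agg; do
  echo "$n concurrent: $(per_request_rate "$agg" "$n") tok/s per request"
done << 'EOF'
1 33.2
2 47.9
4 85.5
8 92.5
EOF
```

Aggregate throughput grows by about 2.8x from 1 to 8 concurrent requests, while the per-request share falls from 33.2 to roughly 11.6 tok/s.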
Baseline (No Speculation)
| Metric | Value |
|---|---|
| Decode Speed | 12.2 tok/s |
| TTFT | 98-138 ms |
| ITL (p50/p99) | 81 / 88 ms |
What Makes This Model Special
Why Dense Over MoE
Qwen3.5 comes in two flavors: the 122B-A10B MoE (256 experts, 10B active per token) and this 27B dense model (all parameters active on every token). The dense model has real advantages:
- Higher quality per FLOP — Every one of the 27B parameters contributes to every token. MoE models route to a sparse subset, which means some experts are undertrained and routing decisions introduce noise. Dense models don't have this problem.
- No routing overhead — MoE models spend compute on expert selection, load balancing, and all-to-all communication. Dense models just run the computation.
- Predictable latency — No variance from different experts being selected per token. Every forward pass costs the same.
- Simpler deployment — No expert parallelism concerns, no load imbalance, fits on a single GPU with NVFP4.
The tradeoff has always been speed: a 27B dense model moves all parameters through memory per token. On a memory-bandwidth-limited device like DGX Spark (273 GB/s), that meant 12 tok/s baseline. DFlash changes this entirely.
Why DFlash Makes Dense Practical on DGX Spark
The fundamental bottleneck on DGX Spark is memory bandwidth. At 273 GB/s, streaming ~20 GB of NVFP4 weights for every token caps decode at about 273 / 20 ≈ 13.6 tok/s in theory, and the measured baseline is 12.2 tok/s. Every dense model hits this wall.
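The bandwidth limit is a simple division. A rough sketch of the arithmetic, assuming each decode step reads the full ~20 GB of weights exactly once and ignoring KV cache and activation traffic (function names are illustrative):

```shell
# bandwidth_ceiling GB_PER_S WEIGHT_GB: upper bound on tokens per second
# when every token requires one full pass over the weights.
bandwidth_ceiling() {
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.2f\n", bw / gb }'
}

# fraction_of_ceiling MEASURED CEILING: measured rate as a percentage.
fraction_of_ceiling() {
  awk -v m="$1" -v c="$2" 'BEGIN { printf "%.0f\n", 100 * m / c }'
}

ceiling=$(bandwidth_ceiling 273 20)
echo "ceiling: $ceiling tok/s"
echo "measured baseline reaches $(fraction_of_ceiling 12.2 "$ceiling")% of it"
```

Under these assumptions the 12.2 tok/s baseline is already close to the limit, so further gains have to come from producing more than one token per weight pass.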
DFlash block-diffusion speculative decoding breaks through it:
- The 2B drafter proposes multiple tokens simultaneously — one diffusion forward pass generates an entire block of speculative tokens in parallel, not sequentially. This costs roughly the same as generating a single token.
- The 27B target verifies all proposed tokens in one forward pass — instead of paying the full memory bandwidth cost per token, you pay it once and produce 3-4 accepted tokens on average.
- Net effect: you amortize the bandwidth cost across multiple tokens per forward pass.
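The verify step above can be sketched as a prefix match. This is a simplified greedy version that assumes token-by-token equality checks, where the target supplies one prediction per draft position plus one extra. It is not the DFlash implementation, and real engines verify against the target's sampling distribution.

```shell
# accept_tokens "DRAFT TOKENS" "TARGET TOKENS"
# Keeps draft tokens while they match the target's own choices, then
# appends the target's token at the first mismatch (or its extra token
# if every draft token was accepted).
accept_tokens() {
  drafts=$1
  set -- $2                    # positional params now hold the target tokens
  accepted=""
  for d in $drafts; do
    [ "$d" = "$1" ] || break   # first disagreement ends the accepted prefix
    accepted="$accepted $d"
    shift
  done
  accepted="$accepted $1"      # target's token at the stopping position
  echo "${accepted# }"
}

accept_tokens "the cat sat on" "the cat lay on the"   # prints: the cat lay
```

Every pass yields at least one token, because the target's own prediction is always kept. This is why speculation cannot produce fewer tokens per target pass than plain decoding.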
The result on DGX Spark:
| | Without DFlash | With DFlash |
|---|---|---|
| Single-stream | 12.2 tok/s | 33.2 tok/s |
| Effective bandwidth utilization | 1 token per pass | ~3.5 tokens per pass |
| Practical feel | Sluggish, noticeable delay | Responsive, fluid |
This makes the 27B dense model faster than the 122B MoE on a single DGX Spark while delivering the quality advantages of a dense architecture. DFlash turns the DGX Spark from "it can run a 27B model" into "it runs a 27B model well."
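The table's two figures can be combined to estimate what one draft-plus-verify cycle costs relative to a plain forward pass. This is a back-of-envelope sketch that takes the approximate ~3.5 tokens per pass at face value, so the result is only indicative.

```shell
# implied_cycle_cost TOKENS_PER_PASS SPEC_RATE BASE_RATE
# If each cycle yields TOKENS_PER_PASS tokens but throughput rises only by
# SPEC_RATE / BASE_RATE, each cycle must cost this many baseline passes.
implied_cycle_cost() {
  awk -v tpp="$1" -v spec="$2" -v base="$3" \
    'BEGIN { printf "%.2f\n", tpp / (spec / base) }'
}

implied_cycle_cost 3.5 33.2 12.2
```

A result near 1.29 would mean that drafting plus verification adds roughly 30% to the cost of a single target pass, which the extra accepted tokens more than pay back.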
Hybrid Architecture
Qwen3.5-27B uses a hybrid architecture mixing two attention types across 64 layers:
- Linear attention (GDN) — Gated Delta Network layers for efficient long-context processing with O(1) per-token state (48 layers)
- Full attention — Standard multi-head attention every 4th layer for global context capture (16 layers)
This gives near-linear scaling with sequence length while maintaining full-attention quality at key intervals.
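The 48/16 split follows from placing a full-attention layer at every 4th position. A minimal sketch of the count, assuming full-attention layers sit at 1-based positions divisible by 4 (the source only says "every 4th layer", so the exact offset is an assumption):

```shell
# count_layers TOTAL PERIOD: prints "<gdn layers> <full-attention layers>".
count_layers() {
  total=$1
  period=$2
  full=0
  gdn=0
  i=1
  while [ "$i" -le "$total" ]; do
    if [ $((i % period)) -eq 0 ]; then
      full=$((full + 1))
    else
      gdn=$((gdn + 1))
    fi
    i=$((i + 1))
  done
  echo "$gdn $full"
}

count_layers 64 4   # prints: 48 16
```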
Vision + Text
The model includes a 27-layer ViT vision encoder (460M params, BF16) with a merger that projects visual features into the language model's hidden space. Supports image understanding alongside text generation.
DFlash Block-Diffusion Speculative Decoding
z-lab/Qwen3.5-27B-DFlash is a 2B block-diffusion drafter that generates all speculative tokens simultaneously in a single diffusion step (not sequentially, as in standard speculative decoding). The 27B target model then verifies them in one pass, achieving a 2-5x speedup (2.1-2.7x in the DGX Spark measurements above) with zero quality loss.
Key difference from standard spec decode: drafting cost is ~constant regardless of token count (one diffusion forward pass), so the tradeoff is purely about verification overhead vs acceptance rate.
AWQ_FULL Quantization
This model uses the most thorough NVFP4 quantization pipeline available:
- AWQ_FULL: Exhaustive grid search with alpha_step=0.1 across 10 scaling factors per layer, plus a second awq_clip pass that optimizes clipping ratios
- Full NVFP4 Quantization: All attention projections (Q/K/V/O) and all MLP layers (gate/up/down) quantized to FP4. Excludes: vision tower, embeddings, norms, and lm_head
- Pre-quantization scales: Channel-wise BF16 factors that redistribute weight magnitudes before quantization
Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3.5 (Hybrid, 27B parameters) |
| Layers | 64 (48 GDN + 16 full-attention) |
| Hidden Size | 5120 |
| Attention Heads | 24 (4 KV heads), head_dim=256 |
| Vision Encoder | 27-layer ViT, 460M params (BF16) |
| Max Context | 131,072 tokens |
| Vocabulary | 248,320 tokens |
| Quantization | NVFP4 AWQ_FULL (ModelOpt 0.43.0) |
| Model Size | ~20 GB (quantized + vision) |
NVFP4 Weight Format
Each quantized layer stores:
- weight (uint8): packed FP4 E2M1 pairs (16-element blocks)
- weight_scale (float8_e4m3fn): per-block scale (1 per 16 elements)
- weight_scale_2 (float32): per-tensor global scale
- pre_quant_scale (bfloat16): AWQ per-channel pre-scaling factors
- input_scale (float32): static activation scale from calibration
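To make the packed format concrete, here is a small sketch that decodes FP4 E2M1 codes and applies the two scale levels. The E2M1 value table is standard. The nibble order (low nibble first) and the simple product of the two scales are assumptions for illustration, so check them against the ModelOpt source before relying on them.

```shell
# fp4_decode CODE: decode a 4-bit E2M1 code (0..15).
# Bit 3 is the sign; bits 2..0 select one of eight magnitudes.
fp4_decode() {
  code=$1
  case $((code & 7)) in
    0) mag=0 ;;
    1) mag=0.5 ;;
    2) mag=1 ;;
    3) mag=1.5 ;;
    4) mag=2 ;;
    5) mag=3 ;;
    6) mag=4 ;;
    7) mag=6 ;;
  esac
  if [ $((code & 8)) -ne 0 ]; then echo "-$mag"; else echo "$mag"; fi
}

# unpack_byte BYTE: split one uint8 into its two FP4 values.
# Assumption: the low nibble holds the first element of the pair.
unpack_byte() {
  byte=$1
  echo "$(fp4_decode $((byte & 15))) $(fp4_decode $((byte >> 4)))"
}

# dequantize VALUE BLOCK_SCALE TENSOR_SCALE: assumed two-level rescaling.
dequantize() {
  awk -v v="$1" -v s="$2" -v g="$3" 'BEGIN { printf "%g\n", v * s * g }'
}

unpack_byte 114          # 0x72: low nibble 2, high nibble 7
dequantize 1.5 2 0.5
```

Each 16-element block therefore stores 8 bytes of packed values plus one FP8 scale byte, or about 4.5 bits per quantized weight before the per-tensor and per-channel factors.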
Optimization Stack
| Optimization | Status |
|---|---|
| torch.compile (inductor) | Active |
| CUDA graphs (FULL + PIECEWISE) | Active |
| FlashInfer CUTLASS FP4 GEMM | Autotuned for GB10 |
| Flash Attention v2 | Active |
| Triton/FLA GDN prefill kernel | Active |
| FP8 KV cache | Active (BF16 when DFlash enabled) |
| Chunked prefill | Active |
| Prefix caching | Active |
| Act-quant fusion | Active |
Alternative Deployment Methods
vLLM (Manual)
bash
vllm serve AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --quantization modelopt \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.80 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 8 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code
Note: DFlash uses non-causal attention, which requires --kv-cache-dtype auto (BF16). FP8 KV cache is incompatible with DFlash.
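A launcher script can encode this constraint so the two settings never drift apart. The sketch below is a hypothetical helper, not the Docker image's entrypoint. It assumes that off is the value that disables the drafter, as in the Environment Variables table, and that fp8 is the desired KV cache type without DFlash.

```shell
# kv_cache_dtype_for DRAFTER: pick a --kv-cache-dtype value.
# DFlash needs a BF16 KV cache (auto); FP8 is only safe when it is off.
kv_cache_dtype_for() {
  drafter=$1
  if [ "$drafter" = "off" ]; then
    echo fp8
  else
    echo auto
  fi
}

kv_cache_dtype_for z-lab/Qwen3.5-27B-DFlash   # prints: auto
kv_cache_dtype_for off                        # prints: fp8
```

The result can be passed straight to the serve command, for example --kv-cache-dtype "$(kv_cache_dtype_for "$DFLASH_DRAFTER")".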
SGLang
bash
python -m sglang.launch_server \
  --model-path AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend fa3 \
  --mem-fraction-static 0.75 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code
Credits
- Base model by Qwen Team
- DFlash speculative decoding by z-lab (paper: arXiv 2602.06036)
- Abliteration using llm-abliteration
- NVFP4 quantization with NVIDIA ModelOpt
- Release by AEON-7
Legal Disclaimer
THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. This model has had safety alignment removed. Users are responsible for ensuring ethical and legal use.
☕ Support the work
If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.
Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.
Model provider: AEON-7
Model tree: base AEON-7/DFlash-Qwen3.5-27B-Uncensored; quantized: this model
Modalities: input Video, Text, Image; output Text