AEON-7

DFlash-Qwen3.5-27B-Uncensored

README

License: apache-2.0

Quick Links

Get Started	Step-by-step quick start guide on DGX Spark
Docker Image	`ghcr.io/aeon-7/vllm-dflash:latest`
NVFP4 Version	AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 — Use this if you have an NVIDIA Blackwell or later GPU (why?)
DFlash Drafter	z-lab/Qwen3.5-27B-DFlash
Base Model	Qwen/Qwen3.5-27B
DFlash Paper	arXiv 2602.06036

Quick Start (DGX Spark)

1. Download the model

bash
huggingface-cli download AEON-7/DFlash-Qwen3.5-27B-Uncensored \
  --local-dir ~/models/DFlash-Qwen3.5-27B-Uncensored

2. Create your environment file

bash
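# Create .env.dflash. The quoted 'EOF' keeps $(openssl rand -hex 32) as a literal
# placeholder; the sed command further down replaces it with a real key.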
cat > .env.dflash << 'EOF'
# Authentication
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=$(openssl rand -hex 32)

# Model path
MODEL_HOST_PATH=~/models/DFlash-Qwen3.5-27B-Uncensored

# DFlash speculative decoding (auto-downloads drafter on first run)
DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash
DFLASH_NUM_SPEC_TOKENS=15

# DGX Spark optimal settings (BF16, 64K context)
MAX_MODEL_LEN=65536
MAX_NUM_SEQS=2
GPU_MEMORY_UTILIZATION=0.90
MAX_NUM_BATCHED_TOKENS=65536
EOF

# Generate a real API key and inject it
sed -i "s|\$(openssl rand -hex 32)|$(openssl rand -hex 32)|" .env.dflash
echo "Your API key: $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)"
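
If you would rather avoid the placeholder-and-`sed` step, the same file can be produced from Python. This is a minimal sketch, not part of the release: `render_env` and `write_env` are hypothetical helper names, and the variable names and values are copied from the block above.

```python
import secrets


def render_env(api_key: str, hf_token: str = "hf_your_token_here") -> str:
    """Return the contents of .env.dflash using the settings shown above."""
    lines = [
        f"HF_TOKEN={hf_token}",
        f"VLLM_API_KEY={api_key}",
        "MODEL_HOST_PATH=~/models/DFlash-Qwen3.5-27B-Uncensored",
        "DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash",
        "DFLASH_NUM_SPEC_TOKENS=15",
        "MAX_MODEL_LEN=65536",
        "MAX_NUM_SEQS=2",
        "GPU_MEMORY_UTILIZATION=0.90",
        "MAX_NUM_BATCHED_TOKENS=65536",
    ]
    return "\n".join(lines) + "\n"


def write_env(path: str = ".env.dflash") -> str:
    """Generate a fresh key, write the env file, and return the key."""
    # 32 random bytes rendered as 64 hex characters, like `openssl rand -hex 32`
    key = secrets.token_hex(32)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(render_env(key))
    return key
```

Calling `write_env()` once from the directory that holds the compose file gives the same result as the shell commands above.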

3. Save `docker-compose.dflash-bf16.yml`

yaml
services:
  vllm-dflash-bf16:
    image: ghcr.io/aeon-7/vllm-dflash:latest
    container_name: vllm-dflash-bf16
    restart: unless-stopped
    network_mode: host
    ipc: host
    volumes:
      - ${MODEL_HOST_PATH}:/models/DFlash-Qwen3.5-27B-Uncensored
      - dflash-drafter-cache:/models/drafter-cache
    environment:
      - MODEL_PATH=/models/DFlash-Qwen3.5-27B-Uncensored
      - SERVED_MODEL_NAME=DFlash-Qwen3.5-27B-Uncensored
      - DFLASH_DRAFTER=${DFLASH_DRAFTER}
      - DFLASH_NUM_SPEC_TOKENS=${DFLASH_NUM_SPEC_TOKENS}
      - GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION}
      - MAX_MODEL_LEN=${MAX_MODEL_LEN}
      - MAX_NUM_SEQS=${MAX_NUM_SEQS}
      - MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS}
      - NVIDIA_VISIBLE_DEVICES=all
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${VLLM_API_KEY}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  dflash-drafter-cache:

4. Launch

bash
docker compose --env-file .env.dflash -f docker-compose.dflash-bf16.yml up -d

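# Watch startup (~5-8 min for weight loading + compilation)
# Pass the same --env-file so Compose can resolve ${MODEL_HOST_PATH} and the other variables
docker compose --env-file .env.dflash -f docker-compose.dflash-bf16.yml logs -f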

5. Test

bash
# Text generation
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 200
  }'

# Vision (image understanding)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
      {"type": "text", "text": "What do you see?"}
    ]}],
    "max_tokens": 200
  }'
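
The text request above can also be sent from Python using only the standard library. This is a sketch under assumptions: `read_api_key`, `build_chat_request`, and `send` are hypothetical helper names, and the response is assumed to follow the OpenAI-style chat completions shape (`choices[0].message.content`) that the `curl` examples target.

```python
import json
import urllib.request


def read_api_key(env_text: str) -> str:
    """Extract VLLM_API_KEY from the text of .env.dflash."""
    for line in env_text.splitlines():
        if line.startswith("VLLM_API_KEY="):
            return line.split("=", 1)[1].strip()
    raise KeyError("VLLM_API_KEY not found")


def build_chat_request(api_key: str, prompt: str,
                       base_url: str = "http://localhost:8000",
                       max_tokens: int = 200) -> urllib.request.Request:
    """Build (but do not send) the same POST the curl example issues."""
    payload = {
        "model": "DFlash-Qwen3.5-27B-Uncensored",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Typical use, once the container is up: `send(build_chat_request(read_api_key(open(".env.dflash").read()), "Explain quantum entanglement simply."))`.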

Environment Variables

Variable	Default	Description
`MODEL_HOST_PATH`	—	Host path to model weights
`DFLASH_DRAFTER`	`z-lab/Qwen3.5-27B-DFlash`	HF repo ID for drafter (auto-downloaded). Set `off` to disable.
`DFLASH_NUM_SPEC_TOKENS`	`15`	Tokens per draft step

Why This Model

Why Dense Over MoE

Qwen3.5 comes in two flavors: the 122B-A10B MoE (256 experts, 10B active per token) and this 27B dense model (all parameters active on every token). The dense model has real advantages:

Higher quality per FLOP: every one of the 27B parameters contributes to every token. MoE models route each token to a sparse subset of experts, which can leave some experts undertrained and makes quality sensitive to routing decisions. Dense models avoid both issues.
No routing overhead: MoE models spend compute on expert selection and load balancing, plus all-to-all communication when experts are spread across devices. Dense models just run the computation.
Predictable latency: no variance from different experts being selected per token. Every forward pass costs the same.
Simpler deployment: no expert parallelism concerns, no load imbalance, and the model fits on a single GPU with NVFP4.

The tradeoff has always been speed: a 27B dense model moves 27B parameters through memory per token, while the 122B MoE only moves ~10B active parameters. On a memory-bandwidth-limited device like DGX Spark (273 GB/s), that meant the dense model was slow — 12 tok/s baseline.

DFlash changes this equation entirely. See below.

Why DFlash Makes Dense Practical on DGX Spark

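The fundamental bottleneck on DGX Spark is memory bandwidth. Each decoded token requires streaming all of the weights from memory once, so at 273 GB/s the ~20 GB of NVFP4 weights put a ceiling of about 273 / 20 ≈ 13.6 tok/s on plain decoding, consistent with the ~12 tok/s baseline. The BF16 weights in this repository are larger, so their ceiling is proportionally lower. Every dense model hits this wall.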
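
The bandwidth ceiling is simple division, sketched below. The function name is hypothetical; the 273 GB/s, 20 GB, and 52 GB figures are the ones quoted in this README, and the estimate ignores KV-cache reads, activations, and kernel overhead, so it is an upper bound rather than a prediction.

```python
def max_decode_tok_per_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode speed when every token
    requires reading all weights from memory once."""
    return bandwidth_gb_s / weight_gb


DGX_SPARK_BANDWIDTH_GB_S = 273.0

# NVFP4 weights (~20 GB): about 13.65 tok/s at best
nvfp4_ceiling = max_decode_tok_per_s(DGX_SPARK_BANDWIDTH_GB_S, 20.0)

# BF16 weights (~52 GB): about 5.25 tok/s at best
bf16_ceiling = max_decode_tok_per_s(DGX_SPARK_BANDWIDTH_GB_S, 52.0)
```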

DFlash block-diffusion speculative decoding breaks through it:

The 2B drafter proposes multiple tokens simultaneously — one diffusion forward pass generates an entire block of speculative tokens in parallel, not sequentially. This costs roughly the same as generating a single token.
The 27B target verifies all proposed tokens in one forward pass — instead of paying the full memory bandwidth cost per token, you pay it once and produce 3-4 accepted tokens on average.
Net effect: you amortize the bandwidth cost across multiple tokens per forward pass.
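
The draft-then-verify loop can be illustrated with a toy example. This is generic greedy speculative decoding, not DFlash's actual drafter or acceptance rule: `toy_target` and `toy_draft` are made-up stand-ins for the two models, and a real target scores every drafted position in a single batched forward pass, whereas this sketch calls `target_next` once per position to simulate that.

```python
from typing import Callable, List, Sequence, Tuple

NextFn = Callable[[Sequence[int]], int]
DraftFn = Callable[[Sequence[int], int], List[int]]


def verify_block(context: Sequence[int], draft: Sequence[int],
                 target_next: NextFn) -> List[int]:
    """One simulated target pass: keep the longest draft prefix the target
    agrees with, then append one token chosen by the target itself."""
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)  # correction replaces the first wrong draft token
            return accepted
        accepted.append(tok)
    # Whole block accepted: the same pass also yields one extra token
    accepted.append(target_next(list(context) + accepted))
    return accepted


def generate(prompt: Sequence[int], n_new: int, block_size: int,
             draft_block: DraftFn, target_next: NextFn) -> Tuple[List[int], int]:
    """Generate n_new tokens; return the sequence and the number of target passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        draft = draft_block(out, block_size)
        out.extend(verify_block(out, draft, target_next))
        passes += 1
    return out[: len(prompt) + n_new], passes


def toy_target(ctx: Sequence[int]) -> int:
    """Stand-in target model: counts upward and wraps after 9."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[int], k: int) -> List[int]:
    """Stand-in drafter: counts upward but never predicts the wrap."""
    return [ctx[-1] + i for i in range(1, k + 1)]


sequence, target_passes = generate([0], 12, 3, toy_draft, toy_target)
```

Two properties carry over to the real system: the output matches what the target alone would have produced with greedy decoding, and the number of target passes is smaller than the number of tokens generated whenever the drafter is right often enough.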

The result on DGX Spark:

Metric	Without DFlash	With DFlash
Single-stream throughput	12.2 tok/s	33.2 tok/s
Tokens produced per target forward pass	1	~3.5
Practical feel	Sluggish, noticeable delay	Responsive, fluid

This makes the 27B dense model faster than the 122B MoE on a single DGX Spark while delivering the quality advantages of a dense architecture. DFlash turns the DGX Spark from "it can run a 27B model" into "it runs a 27B model well."
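
As a sanity check on the table, the measured speedup and the accepted-token count together imply how much a draft-plus-verify step costs relative to a plain decode step. The sketch below uses only the numbers reported above; the function names are hypothetical, and treating ~3.5 as exact is an assumption.

```python
def speedup(baseline_tok_s: float, dflash_tok_s: float) -> float:
    """End-to-end throughput gain from speculative decoding."""
    return dflash_tok_s / baseline_tok_s


def relative_step_cost(tokens_per_pass: float, measured_speedup: float) -> float:
    """Cost of one draft+verify step in units of one plain decode step.
    If steps were free of overhead, this would be exactly 1.0."""
    return tokens_per_pass / measured_speedup


gain = speedup(12.2, 33.2)                 # roughly 2.7x
step_cost = relative_step_cost(3.5, gain)  # roughly 1.3 plain steps per DFlash step
```

In other words, the drafter pass and the wider verification pass appear to add on the order of 30% to each step, which the ~3.5 tokens per step more than pay for.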

Hybrid Architecture

Qwen3.5-27B uses a hybrid architecture mixing two attention types across 64 layers:

Linear attention (GDN): Gated Delta Network layers that carry a fixed-size recurrent state, so per-token cost and memory stay constant as the context grows (48 layers)
Full attention: standard multi-head attention with a KV cache that grows with context, used every 4th layer for global context capture (16 layers)

This gives near-linear scaling with sequence length while maintaining full-attention quality at key intervals.
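
The layer layout and its effect on KV-cache size can be sketched as follows. The assumptions: the full-attention layer is taken to be the last in each group of four (the README only says "every 4th layer"), the KV cache is stored in BF16 (2 bytes per value), and the fixed-size GDN state is ignored. The 4 KV heads and `head_dim=256` come from the Model Details table.

```python
NUM_LAYERS = 64
FULL_ATTENTION_EVERY = 4


def layer_types(num_layers: int = NUM_LAYERS,
                every: int = FULL_ATTENTION_EVERY) -> list:
    """Label each layer as 'gdn' or 'full'.
    Assumption: the full-attention layer closes each group of `every` layers."""
    return ["full" if (i + 1) % every == 0 else "gdn" for i in range(num_layers)]


def kv_cache_bytes_per_token(num_full_layers: int, kv_heads: int = 4,
                             head_dim: int = 256, bytes_per_value: int = 2) -> int:
    """K and V each hold kv_heads * head_dim values per full-attention layer."""
    return 2 * num_full_layers * kv_heads * head_dim * bytes_per_value


types = layer_types()
hybrid_per_token = kv_cache_bytes_per_token(types.count("full"))  # 16 layers keep a KV cache
dense_per_token = kv_cache_bytes_per_token(NUM_LAYERS)            # if all 64 layers did

# KV cache for one 65,536-token sequence, in GiB
hybrid_gib_at_64k = hybrid_per_token * 65536 / 2**30
dense_gib_at_64k = dense_per_token * 65536 / 2**30
```

Under these assumptions the hybrid layout needs 64 KiB of KV cache per token (4 GiB at 64K context) instead of 256 KiB per token (16 GiB) for an all-full-attention stack of the same shape.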

Vision + Text

Includes a 27-layer ViT vision encoder (460M params) with a merger that projects visual features into the language model's hidden space. Supports image understanding alongside text generation.

DFlash Block-Diffusion Speculative Decoding

Pair with z-lab/Qwen3.5-27B-DFlash — a 2B block-diffusion drafter that generates all speculative tokens simultaneously in a single diffusion step. The container auto-downloads and configures this.

Abliteration

Created using the orthogonal projection abliteration technique:

Measures refusal directions across harmful/harmless prompt pairs
Analyzes layer-by-layer activation patterns to identify the refusal direction
Abliterates by projecting out the refusal direction from weight matrices

Modifies weights directly (not LoRA/adapter). Standalone BF16 model with no built-in refusal behavior.
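
The "projecting out" step is ordinary linear algebra: for a unit vector r, replacing a weight matrix W with (I - r rᵀ) W removes the component of every output along r. The toy sketch below shows only that projection on a small matrix in pure Python; the function names are hypothetical, and it is not the llm-abliteration implementation, which also has to estimate the direction and choose which matrices to modify.

```python
import math
from typing import List

Matrix = List[List[float]]


def unit(v: List[float]) -> List[float]:
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project_out(weight: Matrix, direction: List[float]) -> Matrix:
    """Return (I - r r^T) W, where r is `direction` normalized.
    `weight` has one row per output dimension; `direction` lives in output space."""
    r = unit(direction)
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how much of each column points along r
    along = [sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[weight[i][j] - r[i] * along[j] for j in range(n_cols)]
            for i in range(n_rows)]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
direction = [1.0, 1.0, 0.0]
W_projected = project_out(W, direction)

# After projection, no column of W has any component left along the direction
residual = [sum(unit(direction)[i] * W_projected[i][j] for i in range(3))
            for j in range(2)]
```

Applying the projection a second time changes nothing, which is one way to confirm that a matrix has already been processed.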

Model Details

Property	Value
Architecture	Qwen3.5 (Hybrid, 27B parameters)
Layers	64 (48 GDN + 16 full-attention)
Hidden Size	5120
Attention Heads	24 (4 KV heads), head_dim=256
Vision Encoder	27-layer ViT, 460M params
Max Context	131,072 tokens
Vocabulary	248,320 tokens
Precision	BF16

Why NVFP4 on Blackwell

If you have an NVIDIA Blackwell GPU (B200, GB200, GB10/DGX Spark, or later), you should use the NVFP4 version instead. Here's why:

NVFP4 is effectively lossless on Blackwell. The FP4 (E2M1) format is a native tensor core datatype on Blackwell (compute capability 10.x on B200/GB200, 12.x on GB10/DGX Spark). Unlike older INT4/GPTQ quantization, which can introduce significant degradation, NVFP4 with AWQ_FULL calibration is designed to preserve model quality while giving you:

~2.6x memory reduction: 20 GB vs 52 GB, freeing memory for longer context and more concurrent requests
Hardware-accelerated FP4 GEMM: Blackwell tensor cores execute FP4 matrix multiplies natively via FlashInfer CUTLASS, not through dequantize-then-compute
Higher throughput: the smaller weight footprint means less memory bandwidth consumed per token, which translates directly into faster inference
Same quality: AWQ_FULL uses exhaustive grid search (10 scaling factors per layer) plus clipping optimization. The vision encoder, embeddings, norms, and lm_head remain in full BF16

This is close to a free performance boost: you get the same model quality at roughly 2.6x less memory (20 GB vs 52 GB) and measurably faster inference. The BF16 version here is primarily for non-Blackwell hardware or research workflows that need full-precision weights.
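
To make the E2M1 format concrete, the sketch below decodes all sixteen 4-bit codes (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1) and quantizes a small block with a single scale. It is a simplification with hypothetical function names: NVFP4 as documented by NVIDIA stores compact per-block scale factors, while this example keeps the scale as an ordinary Python float and uses plain round-to-nearest rather than AWQ_FULL calibration.

```python
from typing import List, Tuple


def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code into its real value."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        return sign * mantissa * 0.5  # subnormal: 0 or 0.5
    return sign * (1 + mantissa / 2) * 2.0 ** (exponent - 1)


# All distinct representable values (+0 and -0 collapse into one entry)
E2M1_VALUES = sorted({decode_e2m1(code) for code in range(16)})


def quantize_block(block: List[float]) -> Tuple[float, List[float]]:
    """Pick a scale so the largest magnitude maps to 6.0 (the E2M1 maximum),
    then round every entry to the nearest representable value."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax else 1.0
    quantized = [min(E2M1_VALUES, key=lambda v: abs(x / scale - v)) for x in block]
    return scale, quantized


def dequantize_block(scale: float, quantized: List[float]) -> List[float]:
    """Recover approximate original values."""
    return [scale * q for q in quantized]


scale, q = quantize_block([0.1, -0.6, 0.3, 0.6])
restored = dequantize_block(scale, q)
```

The positive half of the value grid is 0.5, 1, 1.5, 2, 3, 4, 6, so resolution is finest near zero and the per-block scale decides where that grid sits for each group of weights.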

Alternative Deployment

vLLM (Manual)

bash
vllm serve AEON-7/DFlash-Qwen3.5-27B-Uncensored \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.85 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code

Transformers

python
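from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AEON-7/DFlash-Qwen3.5-27B-Uncensored"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# torch_dtype="auto" keeps the checkpoint's native BF16; device_map="auto" requires `accelerate`
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Render the conversation with the model's chat template and open the assistant turn
messages = [{"role": "user", "content": "Hello, tell me about yourself."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Plain autoregressive generation; this snippet does not load the DFlash drafter
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))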

Credits

Base model by Qwen Team
DFlash speculative decoding by z-lab (paper)
Abliteration using llm-abliteration
Release by AEON-7

Legal Disclaimer

THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. This model has had safety alignment removed. Users are responsible for ensuring ethical and legal use.

☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.

Model Details (FriendliAI listing)

Model Provider	AEON-7
Base Model	Qwen/Qwen3.5-27B
Fine-tuned Model	this model
Input Modalities	Text, Image, Video
Output Modalities	Text
