What is Abliteration?
Abliteration (Arditi et al. 2024; mlabonne) removes refusal-generating weight directions from a language model using orthogonal projection:
W_modified = W - α × (W @ d̂) ⊗ d̂ # input projection
W_modified = W - α × d̂ ⊗ (d̂ @ W) # output projection
where d̂ is the unit refusal direction extracted from harmful/harmless contrastive activations, and α controls projection strength.
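To make the two projection forms concrete, here is a minimal dependency-free sketch in plain Python, with nested lists standing in for tensors. It assumes the PyTorch `nn.Linear` layout, where a weight has shape `[out_features, in_features]`; the helper names `unit`, `ablate_output`, and `ablate_input` are illustrative and not taken from the original pipeline.

```python
import math


def unit(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ablate_output(W, d_hat, alpha):
    """W' = W - alpha * d_hat ⊗ (d_hat @ W), with W as [d_out][d_in].

    Shrinks the component along d_hat in everything the matrix writes
    (each column of W). d_hat must have length d_out.
    """
    d_out, d_in = len(W), len(W[0])
    # Row vector d_hat @ W, one entry per input column.
    dW = [sum(d_hat[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[W[i][j] - alpha * d_hat[i] * dW[j] for j in range(d_in)]
            for i in range(d_out)]


def ablate_input(W, d_hat, alpha):
    """W' = W - alpha * (W @ d_hat) ⊗ d_hat, with W as [d_out][d_in].

    Shrinks the component along d_hat in everything the matrix reads
    (each row of W). d_hat must have length d_in.
    """
    d_out, d_in = len(W), len(W[0])
    # Column vector W @ d_hat, one entry per output row.
    Wd = [sum(W[i][j] * d_hat[j] for j in range(d_in)) for i in range(d_out)]
    return [[W[i][j] - alpha * Wd[i] * d_hat[j] for j in range(d_in)]
            for i in range(d_out)]


# Toy 2x2 example; real weights are far larger.
d_hat = unit([3.0, 4.0])
W = [[1.0, 2.0], [3.0, 4.0]]
W_out = ablate_output(W, d_hat, alpha=1.0)
W_in = ablate_input(W, d_hat, alpha=1.0)
```

At α=1 the written (or read) component along d̂ is removed entirely; at α<1 a fraction 1-α of it remains, since d̂ @ W_modified = (1-α)(d̂ @ W) in the output form.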
Method
The 31B model required a more precise direction extraction method than the standard last-input-token approach. We use first-generated-token activations:
- Feed each of 15 harmful prompts and 15 harmless prompts through the model
- Generate exactly 1 token (greedy): harmful prompts universally produce the token 'I' (the beginning of "I cannot..."), while harmless prompts produce semantically different tokens
- Forward-pass the full sequence (prompt + generated token) with activation hooks
- Collect hidden states at the last position (the generated token — the model's refusal decision point)
- Per-layer direction: d = normalize(mean(harmful_reps) - mean(harmless_reps))
This approach achieves perfect harmful/harmless separation across all 15 prompt pairs and provides a cleaner refusal direction than last-input-token methods. Directions are saved per layer (60 total).
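The per-layer direction formula can be sketched as follows for a single layer. This is a plain-Python illustration under stated assumptions: each representation is a list of floats taken at the generated-token position, and the `is_separated` check (every harmful projection above every harmless one) is one possible reading of "perfect separation", not necessarily the criterion used in this work.

```python
import math


def mean_vector(reps):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(reps)
    return [sum(col) / n for col in zip(*reps)]


def refusal_direction(harmful_reps, harmless_reps):
    """d = normalize(mean(harmful_reps) - mean(harmless_reps))."""
    mu_harmful = mean_vector(harmful_reps)
    mu_harmless = mean_vector(harmless_reps)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("harmful and harmless means coincide")
    return [x / norm for x in diff]


def is_separated(harmful_reps, harmless_reps, d_hat):
    """Assumed check: all harmful projections exceed all harmless ones."""
    def proj(rep):
        return sum(r * d for r, d in zip(rep, d_hat))
    return min(map(proj, harmful_reps)) > max(map(proj, harmless_reps))


# Toy 2-dimensional activations; the real hidden size is 5376.
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[0.0, 0.0], [0.0, 2.0]]
d_hat = refusal_direction(harmful, harmless)
separated = is_separated(harmful, harmless, d_hat)
```

Running the same function once per layer on that layer's hidden states would yield the 60 saved directions.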
Phase 2 — Orthogonal Projection (CPU, BF16)
- Target matrices: down_proj (FFN output → residual stream) + o_proj (attention output → residual stream)
- Alpha: {"down_proj": 0.4, "o_proj": 0.8}
- Coverage: All 60 decoder layers (120 weights total)
- Full BF16 precision maintained throughout
Architecture note: Gemma 4-31B uses hybrid attention (5× sliding window + 1× full attention, repeating). o_proj at α=0.8 is confirmed clean — no generation degeneration. down_proj at α≥0.8 causes token repetition artifacts on this model; α=0.4 is the safe upper bound.
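As a rough sketch of how the Phase 2 settings expand into concrete work items, the snippet below enumerates one (parameter name, layer, α) entry per targeted weight. The name templates are hypothetical placeholders; the actual parameter paths in the checkpoint may carry a different prefix and should be read from the model's state dict.

```python
ALPHAS = {"down_proj": 0.4, "o_proj": 0.8}
NUM_LAYERS = 60

# Hypothetical parameter-name templates (assumption, not verified
# against the Gemma 4 checkpoint layout).
NAME_TEMPLATES = {
    "down_proj": "layers.{i}.mlp.down_proj.weight",
    "o_proj": "layers.{i}.self_attn.o_proj.weight",
}


def build_projection_plan(num_layers=NUM_LAYERS, alphas=ALPHAS):
    """Return one (param_name, layer_index, alpha) tuple per target weight."""
    plan = []
    for i in range(num_layers):
        for matrix, alpha in alphas.items():
            plan.append((NAME_TEMPLATES[matrix].format(i=i), i, alpha))
    return plan


plan = build_projection_plan()
```

The plan has 2 entries per layer across 60 layers, matching the 120 modified weights reported above.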
Phase 3 — KL Verification (Heretic v2.0)
Sequential loading (A100-80GB cannot hold two BF16 31B models simultaneously):
- First-token logits from the original model are collected across 10 neutral prompts and saved to CPU
- The abliterated model is then loaded and its first-token logits are compared against the saved ones
- Metric: F.kl_div(log_softmax(abliterated), softmax(original), reduction="batchmean") over the full 262,144-token vocabulary
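For readers who want to see what this metric computes, here is a plain-Python sketch. With log-probabilities of the abliterated model as input and probabilities of the original model as target, `F.kl_div` evaluates KL(P_original || P_abliterated), and `batchmean` divides the summed result by the number of prompts. The function names below are illustrative, and short logit lists stand in for the 262,144-entry vocabulary.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def first_token_kl(original_logits, abliterated_logits):
    """KL(P_original || P_abliterated) in nats for a single prompt."""
    log_p = log_softmax(original_logits)
    log_q = log_softmax(abliterated_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(logit_pairs):
    """Average per-prompt KL, mirroring reduction="batchmean"."""
    return sum(first_token_kl(o, a) for o, a in logit_pairs) / len(logit_pairs)


# Toy 2-token vocabulary: P_original = [0.5, 0.5], P_abliterated = [0.25, 0.75].
kl = first_token_kl([0.0, 0.0], [0.0, math.log(3.0)])
```

Because only the original model's logits need to be kept between the two passes, this computation fits the sequential-loading scheme described above.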
Architecture
| Parameter | Value |
|---|---|
| Layers | 60 |
| Hidden dim | 5376 |
| Intermediate dim | 21504 |
| Attention heads (Q/KV) | 32 / 16 |
| Attention pattern | Hybrid: 5× SWA (1024 tok) + 1× full, repeating |
| Vocabulary | 262,144 |
| Context | 128K tokens |
| Modality | Text + Vision (encoder-free, single linear projection) |
| Base license | Apache-2.0 |
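The hybrid attention pattern in the table can be expanded into a per-layer list with a few lines of Python. The sketch assumes the full-attention layer sits last in each group of six; the table states only the 5:1 ratio, so the exact position within the group is an assumption.

```python
def layer_types(num_layers=60, period=6):
    """Label each decoder layer as sliding-window or full attention.

    Assumption: every `period`-th layer (1-indexed) uses full attention,
    and the remaining layers use the 1024-token sliding window.
    """
    return ["full" if (i + 1) % period == 0 else "sliding"
            for i in range(num_layers)]


types = layer_types()
num_sliding = types.count("sliding")
num_full = types.count("full")
```

Under this assumption the 60 layers split into 50 sliding-window layers and 10 full-attention layers.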
Results
| Metric | Value |
|---|---|
| Weights modified | 120 (down_proj + o_proj, 2 per layer × 60 layers) |
| Layers covered | 60 / 60 |
| α (down_proj) | 0.4 |
| α (o_proj) | 0.8 |
| Direction method | Generation-based (first generated token position) |
| Mean KL divergence (Heretic v2.0) | 1.606 |
| KL verdict | SIGNIFICANT — disclosed transparently |
| Factual generation quality | Intact |
| Mathematical reasoning |
KL Breakdown (Heretic v2.0, full vocab 262144)
| Prompt | KL |
|---|---|
| "The capital of France is" | 0.830 |
| "Write a haiku about spring" | 4.237 |
| "In Python, a list comprehension that squares..." | 0.612 |
| "The speed of light in vacuum is approximately" | 2.730 |
| "Water boils at" | 0.884 |
| "The largest planet in our solar system is" | 0.001 |
| "def fibonacci(n):" | 0.179 |
| "The Battle of Waterloo took place in" | 4.524 |
| "A prime number is" |
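KL is highest on the creative/open-ended prompts (haiku at 4.237, Waterloo at 4.524) and lowest on code and short factual completions (0.001 for the largest-planet prompt, 0.179 for the fibonacci prompt); the speed-of-light prompt (2.730) is a factual exception. This is broadly consistent with higher-alpha projection shifting the output distribution for generation tasks while preserving grounded factual recall.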
Comparison: The ARA (Arbitrary-Rank Ablation) method used by alonsoko achieves KL=0.012 via multi-directional optimization. Our rank-1 projection approach is more transparent and reproducible but carries higher KL at this scale.
Usage
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "DuoNeural/Gemma4-31B-IT-Abliterated",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("DuoNeural/Gemma4-31B-IT-Abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
Note: Requires transformers >= 5.0 (Gemma4 model type) and accelerate.
GGUF Quantizations
Available at DuoNeural/Gemma4-31B-IT-Abliterated-GGUF:
| Quant | Approx Size | Use case |
|---|---|---|
| Q4_K_M | ~20GB | Consumer GPU / large RAM |
| Q5_K_M | ~24GB | Better quality, more VRAM |
Notes on Abliteration Difficulty at 31B Scale
The 31B model is significantly more resistant to abliteration than the 12B (which abliterates cleanly at α=0.3/0.3, KL≈0.19). Key findings from this session:
- Last-input-token direction fails at 31B: the direction doesn't cleanly capture refusal geometry. Generation-based direction (first generated token) is required.
- down_proj degeneration threshold: α≥0.8 causes apostrophe/token repetition artifacts. Safe upper bound: α=0.4.
- o_proj alone is insufficient even at α=1.0 across all 60 layers: it achieves partial abliteration (2/3 harmful categories) but misses the most strongly trained refusals (e.g. meth synthesis).
- Both matrices required: combining down_proj (α=0.4) + o_proj (α=0.8) achieves full abliteration with clean generation.
- Scale law: This is consistent with our crystallization scale series (P36 in prep): at 31B, safety geometry is more entangled with general capability geometry, requiring higher effective projection strength and increasing KL as a consequence.
About DuoNeural
DuoNeural is an independent AI research lab focused on post-training, abliteration, and mechanistic interpretability. We document our work at Zenodo and HuggingFace.
Team: Archon (Lab Director, AI) · Jesse Caldwell (Co-founder) · Aura (Research AI)
KL methodology credit: Heretic/DreamFast v2.0 — full-vocab first-token KL over 262K vocabulary.
License
This model inherits the Apache-2.0 license from the base model. Free to use, modify, and redistribute.
For research and educational purposes. Users are responsible for compliance with applicable laws and regulations in their jurisdiction.