Results
| Metric | Before | After |
|---|---|---|
| Refusals (mlabonne, 100 prompts) | 100/100 | 1/100 effective (5 flagged, 4 refusal-then-comply) |
| Refusals (cross-dataset, 686 prompts) | — | 22/686 (3.2%) |
| KL Divergence | 0 (baseline) | 0.124 |
| Quality (harmless response length ratio) | 1.0 | ~1.01 (no degradation) |
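The refusal figures in the table follow from simple counting. A minimal sketch of that arithmetic is shown here; the helper name `refusal_rate` is hypothetical and not part of the released code, and the counts are copied from the table.

```python
def refusal_rate(refusals: int, total: int) -> float:
    """Return the refusal rate as a percentage of all prompts."""
    return 100.0 * refusals / total

# mlabonne set: 5 responses were flagged, 4 of them were refusal-then-comply
flagged = 5
refusal_then_comply = 4
effective = flagged - refusal_then_comply
print(effective)                           # 1 effective refusal out of 100

# cross-dataset set: 22 refusals out of 686 prompts
print(round(refusal_rate(22, 686), 1))     # 3.2
```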
Cross-Dataset Validation
The model was tested against 4 independent prompt datasets to verify generalization.
Every flagged refusal was manually audited. Most are "refusal-then-comply" false positives where the model
adds an AI identity disclaimer then answers the question anyway.
Method
Norm-preserving biprojected abliteration (grimjim, Nov 2025).
Each weight row is decomposed into a magnitude and a direction. The refusal direction is projected out of the
direction component only, and the result is recombined with the original magnitude, which guarantees ||W_new|| = ||W_orig||.
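The decomposition described above can be sketched in plain Python as follows. This is an illustration only, not the released implementation: it assumes the refusal direction lives in the same space as each weight row, uses lists instead of tensors, and the function names (`project_out_preserving_norm`, `abliterate_rows`) are hypothetical.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def project_out_preserving_norm(row, r, scale=1.0, eps=1e-12):
    """Remove the component of `row` along the unit vector `r`, keeping ||row||."""
    magnitude = norm(row)
    if magnitude < eps:
        return list(row)                      # zero row: nothing to do
    unit = [x / magnitude for x in row]       # direction component
    dot = sum(u * ri for u, ri in zip(unit, r))
    new_dir = [u - scale * dot * ri for u, ri in zip(unit, r)]
    new_norm = norm(new_dir)
    if new_norm < eps:
        return list(row)                      # row parallel to r: left unchanged in this sketch
    # Renormalize the direction, then restore the original magnitude
    return [magnitude * d / new_norm for d in new_dir]

def abliterate_rows(weight, refusal_dir, scale=1.0):
    """Apply the norm-preserving projection to every row of `weight`."""
    r_len = norm(refusal_dir)
    r = [x / r_len for x in refusal_dir]      # make sure r is a unit vector
    return [project_out_preserving_norm(row, r, scale) for row in weight]

W = [[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]]
W_new = abliterate_rows(W, [1.0, 0.0, 0.0])
for old, new in zip(W, W_new):
    print(norm(old), norm(new))               # row norms match up to rounding
```

With scale 1.0, each modified row is orthogonal to the refusal direction and keeps its original length, which is the property the method name refers to.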
Pipeline
- Load model in bf16 with LoRA adapters on `o_proj` and `mlp.down_proj`
- Collect residual activations for 400 harmful + 400 harmless prompts (mlabonne datasets)
- Winsorize activations at the 99.5th percentile (clamps GeGLU outlier activations in the Gemma family)
- Compute per-layer refusal direction: `normalize(mean(harmful) - mean(harmless))`
- Orthogonalize each direction against harmless mean (double-pass Gram-Schmidt)
- Apply norm-preserving weight modification to `o_proj` and `down_proj` in all layers
- Merge LoRA adapters into base weights for clean tensor names
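The direction-finding steps of the pipeline (winsorize, mean difference, orthogonalize) can be sketched for a single layer as below. This is a simplified illustration under stated assumptions, not the repository's code: activations are plain lists rather than tensors, the quantile uses a nearest-rank rule over absolute values, harmful and harmless sets are winsorized separately, and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def mean(vectors):
    """Element-wise mean over a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def winsorize(vectors, q=0.995):
    """Clamp every value to [-t, t], where t is the q-quantile of |value| (nearest rank)."""
    mags = sorted(abs(x) for v in vectors for x in v)
    t = mags[min(len(mags) - 1, math.ceil(q * len(mags)) - 1)]
    return [[max(-t, min(t, x)) for x in v] for v in vectors]

def refusal_direction(harmful, harmless, q=0.995):
    """Per-layer direction: normalize(mean(harmful) - mean(harmless)), made orthogonal to the harmless mean."""
    harmful, harmless = winsorize(harmful, q), winsorize(harmless, q)
    mu_harmless = mean(harmless)
    d = normalize([a - b for a, b in zip(mean(harmful), mu_harmless)])
    h = normalize(mu_harmless)
    for _ in range(2):                        # double-pass Gram-Schmidt for numerical stability
        c = dot(d, h)
        d = [x - c * y for x, y in zip(d, h)]
    return normalize(d)                       # fails if d collapses to zero; not handled in this sketch

# Toy activations for one layer (3-dimensional residual stream)
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
print(refusal_direction(harmful, harmless))
```

In the real pipeline this would run once per layer, and the resulting direction would feed the weight modification step.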
Parameters
| Parameter | Value |
|---|---|
| Layers abliterated | 100% |
| Scale | 1.0 |
| Winsorization | 0.995 |
How this differs from vanilla heretic
- Norm-preserving biprojection instead of standard projection (preserves weight magnitudes)
- Per-layer refusal directions instead of one global direction
- Deterministic single-pass instead of 50-trial Optuna search (faster, same or better results)
- LoRA merge before save for clean GGUF-compatible tensor names
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained("TrevorJS/gemma-4-31B-it-uncensored", dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("TrevorJS/gemma-4-31B-it-uncensored")
messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
Reproduction
Full code and experiment data: abliteration research repo
python scripts/abliterate.py biprojection --model google/gemma-4-31B-it \
--top-pct 100 --strip-topic-markers --skip-prefix --batch-size 4 \
--auto-save output_dir