TrevorJS

gemma-4-31B-it-uncensored

License: apache-2.0

Results

Metric	Before	After
Refusals (mlabonne, 100 prompts)	100/100	1/100 effective (5 flagged, 4 refusal-then-comply)
Refusals (cross-dataset, 686 prompts)	—	22/686 (3.2%)
KL Divergence	0 (baseline)	0.124
Quality (harmless response length ratio)	1.0	~1.01 (no degradation)
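
The card reports a KL divergence of 0.124 but does not say how it was measured. One common approach, sketched below, compares next-token distributions from the base and modified models on the same prompts. The direction KL(base || modified), the unit (nats), and the helper names `log_softmax`, `kl_divergence` and `mean_kl` are assumptions made for illustration, not details taken from the card.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(base_logits, new_logits):
    """KL(P_base || P_new) in nats for a single next-token distribution."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(new_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(base_batch, new_batch):
    """Average the per-position KL over paired logit vectors."""
    kls = [kl_divergence(b, n) for b, n in zip(base_batch, new_batch)]
    return sum(kls) / len(kls)

# Toy logits standing in for real model outputs (assumed values).
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]]
modified = [[1.8, 1.1, 0.2], [0.5, 0.5, 0.5]]
score = mean_kl(base, modified)
```

A value of 0 means the two models assign identical next-token probabilities; larger values mean the modified model has drifted further from the base.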

Cross-Dataset Validation

Tested against 4 independent prompt datasets to verify generalization:

Dataset	Prompts	Refusals
JailbreakBench	100	5/100
tulu-harmbench	320	5/320
NousResearch/RefusalDataset	166	7/166
mlabonne/harmful_behaviors	100	5/100

Every flagged refusal was manually audited. Most are "refusal-then-comply" false positives: the model adds an AI-identity disclaimer and then answers the question anyway.
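
The per-dataset counts can be cross-checked against the 22/686 total reported under Results. The short sketch below does that arithmetic and derives what the remaining dataset, mlabonne/harmful_behaviors, must contribute. The variable names are illustrative.

```python
# Rows listed with explicit counts: dataset -> (prompts, flagged refusals)
listed = {
    "JailbreakBench": (100, 5),
    "tulu-harmbench": (320, 5),
    "NousResearch/RefusalDataset": (166, 7),
}

# Cross-dataset totals from the Results table
total_prompts, total_refusals = 686, 22

listed_prompts = sum(p for p, _ in listed.values())
listed_refusals = sum(r for _, r in listed.values())

# Share left over for the fourth dataset (mlabonne/harmful_behaviors)
remaining_prompts = total_prompts - listed_prompts
remaining_refusals = total_refusals - listed_refusals

overall_rate_pct = 100 * total_refusals / total_prompts
print(remaining_prompts, remaining_refusals, round(overall_rate_pct, 1))
```

The remainder agrees with the "5 flagged" figure given for the 100 mlabonne prompts in the Results table.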

Method

Norm-preserving biprojected abliteration (grimjim, Nov 2025). Each weight row is decomposed into a magnitude and a direction; the refusal direction is projected out of the direction component only, and the result is recombined with the original magnitude, so every row keeps its norm: ||W_new|| = ||W_orig||.
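
To make the magnitude/direction split concrete, here is a minimal sketch on a single weight vector, using plain Python lists so it runs without dependencies. It illustrates only the recombination step described above and is not the repo's implementation: the function names, the renormalization of the cleaned direction, and the handling of a vector parallel to the refusal direction are all assumptions.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def project_out(v, r):
    """Standard projection: remove the component of v along unit vector r."""
    dot = sum(a * b for a, b in zip(v, r))
    return [a - dot * b for a, b in zip(v, r)]

def norm_preserving_project_out(w, r, eps=1e-12):
    """Remove r from the direction of w, then restore w's original magnitude."""
    magnitude = norm(w)
    if magnitude < eps:
        return list(w)
    direction = [x / magnitude for x in w]
    cleaned = project_out(direction, r)
    cleaned_norm = norm(cleaned)
    if cleaned_norm < eps:
        # Assumed fallback: w is parallel to r, so no direction is left to rescale.
        return list(w)
    return [magnitude * x / cleaned_norm for x in cleaned]

w = [3.0, 4.0]   # toy weight row with norm 5
r = [1.0, 0.0]   # toy unit-length refusal direction

standard = project_out(w, r)                    # norm shrinks from 5 to 4
preserved = norm_preserving_project_out(w, r)   # norm stays at 5
```

In this toy case both results are orthogonal to `r`, but only the norm-preserving variant keeps the original magnitude of the row.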

Pipeline

1. Load model in bf16 with LoRA adapters on o_proj and mlp.down_proj
2. Collect residual activations for 400 harmful + 400 harmless prompts (mlabonne datasets)
3. Winsorize activations at 99.5th percentile (clamps GeGLU outlier activations in Gemma family)
4. Compute per-layer refusal direction: normalize(mean(harmful) - mean(harmless))
5. Orthogonalize each direction against harmless mean (double-pass Gram-Schmidt)
6. Apply norm-preserving weight modification to o_proj and down_proj in all layers
7. Merge LoRA adapters into base weights for clean tensor names
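
The direction-finding steps (winsorize, mean difference, orthogonalize) can be sketched for a single layer as follows. This is a dependency-free illustration on toy data, not the repo's code. Winsorizing each activation vector symmetrically at the 0.995 quantile of its absolute values is an assumption, as are all function names; model loading, activation collection and the LoRA merge are not shown.

```python
import math

def winsorize(values, q=0.995):
    """Clamp values to [-t, t], where t is the q-quantile of |value| (nearest rank)."""
    mags = sorted(abs(v) for v in values)
    t = mags[max(0, math.ceil(q * len(mags)) - 1)]
    return [max(-t, min(t, v)) for v in values]

def mean_vec(rows):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def remove_component(v, u):
    """Subtract from v its projection onto unit vector u."""
    dot = sum(a * b for a, b in zip(v, u))
    return [a - dot * b for a, b in zip(v, u)]

def refusal_direction(harmful, harmless):
    """Mean-difference direction for one layer, orthogonalized against the harmless mean."""
    mu_harmful = mean_vec([winsorize(row) for row in harmful])
    mu_harmless = mean_vec([winsorize(row) for row in harmless])
    d = normalize([a - b for a, b in zip(mu_harmful, mu_harmless)])
    u = normalize(mu_harmless)
    for _ in range(2):  # second Gram-Schmidt pass removes round-off residue
        d = remove_component(d, u)
    return normalize(d)

# Toy activations for one layer (assumed values, 3 hidden dimensions)
harmful = [[1.0, 3.0, 0.0], [1.0, 3.0, 0.0]]
harmless = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
direction = refusal_direction(harmful, harmless)
```

In a full run this would be repeated per layer, and each resulting direction would then feed the norm-preserving weight update on o_proj and down_proj.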

Parameters

Parameter	Value
Layers abliterated	100%
Scale	1.0
Winsorization	0.995

How this differs from vanilla heretic

- Norm-preserving biprojection instead of standard projection (preserves weight magnitudes)
- Per-layer refusal directions instead of one global direction
- Deterministic single-pass instead of 50-trial Optuna search (faster, same or better results)
- LoRA merge before save for clean GGUF-compatible tensor names

Usage

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("TrevorJS/gemma-4-31B-it-uncensored", dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("TrevorJS/gemma-4-31B-it-uncensored")

messages = [{"role": "user", "content": "Your prompt here"}]
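# return_dict=True returns input_ids and attention_mask, which generate() accepts as keyword arguments
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))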

Reproduction

Full code and experiment data: abliteration research repo

bash
python scripts/abliterate.py biprojection --model google/gemma-4-31B-it \
  --top-pct 100 --strip-topic-markers --skip-prefix --batch-size 4 \
  --auto-save output_dir

Available on FriendliAI

Dedicated Endpoints

Run inference for this model on a single-tenant GPU with unmatched speed and reliability at scale.

Container

Run inference for this model with full control and performance in your own environment.

Model Details

Model Provider

TrevorJS

Model Tree

Base

google/gemma-4-31B-it

Fine-tuned

this model

Input Modalities

Text

Image

Output Modalities