TrevorJS

gemma-4-12B-it-uncensored

README

License: apache-2.0

Results

	Before	After
Refusals (mlabonne, 100 prompts)	99/100	6/100
Refusals (cross-dataset, 686 prompts)	—	14/686 (2.0%)
KL Divergence	0 (baseline)	0.0556
Quality (coherence)	—	no degradation (response audit + Q8 inference verified)

Cross-Dataset Validation

Tested against 4 independent prompt datasets to verify generalization:

Dataset	Prompts	Refusals
JailbreakBench	100	2/100
tulu-harmbench	320	4/320
NousResearch/RefusalDataset	166	4/166
mlabonne/harmful_behaviors

Every flagged cross-dataset response was manually audited (scripts/audit_refusals.py). Of 13 flags, only 1 is a genuine refusal (a sensitive prompt the model handles by redirecting to support resources). The other 12 are false positives: either an "I am an AI" disclaimer or a marker that matched inside generated content, in both cases followed by compliance. Effective refusal rate: 1/686 (about 0.15%).
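
To illustrate why marker-based flagging needs a manual audit, here is a minimal sketch of substring matching. It is not the contents of scripts/audit_refusals.py; the marker list and the function name `flag_refusal` are hypothetical and chosen only for illustration.

```python
# Hypothetical marker list; the real detector's markers are not given in this README.
MARKERS = ("i cannot", "i can't", "i am an ai", "i'm sorry")


def flag_refusal(response, markers=MARKERS):
    """Return the markers found anywhere in the response (case-insensitive)."""
    text = response.lower()
    return [m for m in markers if m in text]


samples = [
    "Sure, here is the outline you asked for.",
    # Disclaimer followed by compliance: flagged, but not a refusal.
    "I am an AI, but here is the full answer: ...",
    # Marker appears inside generated content (a line of dialogue): also flagged.
    'The guard said, "I cannot let you pass," and stepped aside.',
]

for s in samples:
    print(flag_refusal(s), "<-", s)
```

The second and third samples are flagged even though the response complies, which is the kind of false positive the audit separates from genuine refusals.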

Method

Norm-preserving biprojected abliteration (grimjim, Nov 2025). Each weight row is decomposed into magnitude + direction, the refusal direction is projected out of the direction component only, then recombined with the original magnitude — guaranteeing ||W_new|| = ||W_orig||.
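
The decompose, project, recombine step can be sketched in a few lines. This is an illustration under stated assumptions, not the code from the research repo: it uses plain Python lists instead of torch tensors, assumes the weight is stored as (out_dim x in_dim) with the refusal direction living in the output space, and applies `scale` as a simple multiplier on the projection. The names `row_norm` and `norm_preserving_ablate` are hypothetical. It omits the orthogonalization against the harmless mean, which happens when the direction is computed.

```python
import math


def row_norm(v):
    """Euclidean norm of one weight row."""
    return math.sqrt(sum(x * x for x in v))


def norm_preserving_ablate(W, r, scale=1.0):
    """W: list of rows (out_dim x in_dim). r: refusal direction, length out_dim."""
    r_len = row_norm(r)
    r_hat = [x / r_len for x in r]
    out_dim, in_dim = len(W), len(W[0])

    # 1. Split every row into its magnitude and its unit direction.
    mags = [row_norm(row) for row in W]
    D = [[x / m for x in row] if m > 0 else list(row) for row, m in zip(W, mags)]

    # 2. Project r_hat out of the direction component: D <- D - scale * r_hat (r_hat^T D).
    coeff = [sum(r_hat[i] * D[i][j] for i in range(out_dim)) for j in range(in_dim)]
    D = [[D[i][j] - scale * r_hat[i] * coeff[j] for j in range(in_dim)]
         for i in range(out_dim)]

    # 3. Renormalize each row and restore its original magnitude.
    W_new = []
    for row, m in zip(D, mags):
        n = row_norm(row)
        W_new.append([m * x / n for x in row] if n > 0 else list(row))
    return W_new


W = [[3.0, 4.0], [1.0, 2.0], [0.5, -1.0]]
W_new = norm_preserving_ablate(W, r=[1.0, 1.0, 0.0])
for old, new in zip(W, W_new):
    print(round(row_norm(old), 6), round(row_norm(new), 6))
```

Because every row is rescaled to its original magnitude after the projection, row norms (and therefore the Frobenius norm) are unchanged. The trade-off in this sketch is that the rescaled matrix may no longer be exactly orthogonal to the refusal direction.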

Note on the Unified architecture. gemma-4-12B-it is the encoder-free Gemma4Unified model (released 2026-06-03). Its refusal signal concentrates in the upper decoder layers (L15-47), so only the top 70% of layers are abliterated; ablating the near-zero-SNR early layers adds distortion with no refusal benefit. The architecture also emits hard -inf logits for reserved vocabulary tokens, which makes a naive KL divergence evaluate to NaN, so the reported KL masks non-finite positions before reducing.
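
The -inf issue is easy to reproduce on a single token position. The sketch below is an assumption about how such masking could look, not the evaluation code behind the reported number: `naive_kl` and `masked_kl` are hypothetical names, and the reduction over positions and prompts is not shown.

```python
import math


def log_softmax(logits):
    """Log-softmax that tolerates -inf entries (they stay -inf)."""
    finite = [x for x in logits if x != -math.inf]
    m = max(finite)
    lse = m + math.log(sum(math.exp(x - m) for x in finite))
    return [x - lse for x in logits]


def naive_kl(p_logits, q_logits):
    """KL(p || q) without masking: (-inf) - (-inf) is NaN and poisons the sum."""
    lp, lq = log_softmax(p_logits), log_softmax(q_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def masked_kl(p_logits, q_logits):
    """KL(p || q) summed only over entries where both log-probs are finite."""
    lp, lq = log_softmax(p_logits), log_softmax(q_logits)
    return sum(math.exp(a) * (a - b)
               for a, b in zip(lp, lq)
               if math.isfinite(a) and math.isfinite(b))


base = [2.0, 1.0, -math.inf]      # last entry stands in for a reserved token
ablated = [1.5, 1.2, -math.inf]
print(naive_kl(base, ablated))    # nan
print(masked_kl(base, ablated))   # small positive value
```

Masking is safe here because the reserved tokens carry zero probability under both models, so dropping them does not change the true divergence.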

Version requirements: inference needs transformers >= 5.10.1 (tested on 5.12.0, torch 2.11.0+cu130). GGUF conversion/quantization needs llama.cpp with Gemma4Unified support, added 2026-06-04 in PR #24118 (tested at commit c34b922).

Pipeline

Load model in bf16 with LoRA adapters on o_proj and mlp.down_proj
Collect residual activations for 400 harmful + 400 harmless prompts (mlabonne datasets)
Winsorize activations at 99.5th percentile (clamps GeGLU outlier activations in Gemma family)
Compute per-layer refusal direction: normalize(mean(harmful) - mean(harmless))
Orthogonalize each direction against harmless mean (double-pass Gram-Schmidt)
Apply norm-preserving weight modification to o_proj and down_proj in the selected layers
Merge LoRA adapters into base weights for clean tensor names
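
Steps 3 to 5 of the pipeline above (winsorize, difference of means, orthogonalize) can be sketched for one layer as follows. This is a dependency-free illustration, not the repo's implementation. It assumes the winsorization threshold is taken over absolute activation values across the whole prompt set and applied symmetrically, and that harmful and harmless sets are clamped separately. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def winsorize(vectors, q=0.995):
    """Clamp every value to [-t, t], where t is the q-quantile of |value| (nearest rank)."""
    mags = sorted(abs(x) for v in vectors for x in v)
    t = mags[max(0, math.ceil(q * len(mags)) - 1)]
    return [[max(-t, min(t, x)) for x in v] for v in vectors]


def refusal_direction(harmful, harmless, q=0.995):
    """Per-layer direction: normalize(mean(harmful) - mean(harmless)), made orthogonal to the harmless mean."""
    harmful, harmless = winsorize(harmful, q), winsorize(harmless, q)
    mu_harmless = mean(harmless)
    r = unit([a - b for a, b in zip(mean(harmful), mu_harmless)])
    h = unit(mu_harmless)
    for _ in range(2):  # second Gram-Schmidt pass removes rounding residue from the first
        c = dot(r, h)
        r = [x - c * y for x, y in zip(r, h)]
    return unit(r)


harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 1.0], [0.0, 3.0, 1.0]]
r = refusal_direction(harmful, harmless, q=1.0)  # q=1.0 disables clamping for this toy data
print(r, dot(r, mean(harmless)))
```

In the real pipeline the inputs would be residual activations for the 400 harmful and 400 harmless prompts, and the function would run once per selected layer.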

Parameters

Parameter	Value
Layers abliterated	70%
Scale	1.0
Winsorization	0.995
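
As a quick check on how the 70% setting maps to the L15-47 range mentioned earlier, the sketch below selects the top fraction of decoder layers. The 48-layer count is an assumption inferred from that range, the round-down rule is also an assumption, and `top_layers` is a hypothetical helper rather than a function from the repo.

```python
def top_layers(num_layers, top_pct):
    """Indices of the uppermost top_pct percent of layers, rounding the count down."""
    count = int(num_layers * top_pct / 100)
    return list(range(num_layers - count, num_layers))


layers = top_layers(num_layers=48, top_pct=70)  # 48 layers is an assumption
print(layers[0], layers[-1], len(layers))
```

Under these assumptions the selection covers 33 layers, starting at layer 15 and ending at layer 47.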

How this differs from vanilla heretic

Norm-preserving biprojection instead of standard projection (preserves weight magnitudes)
Per-layer refusal directions instead of one global direction
Deterministic single-pass instead of 50-trial Optuna search (faster, same or better results)
LoRA merge before save for clean GGUF-compatible tensor names

Usage

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("TrevorJS/gemma-4-12B-it-uncensored", dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("TrevorJS/gemma-4-12B-it-uncensored")

messages = [{"role": "user", "content": "Your prompt here"}]
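# return_dict=True yields input_ids plus attention_mask, so generate() receives both
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))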

Reproduction

Full code and experiment data: abliteration research repo

bash
python scripts/abliterate.py biprojection --model google/gemma-4-12B-it \
  --top-pct 70 --strip-topic-markers --skip-prefix --batch-size 4 \
  --auto-save output_dir

Model Details

Model Provider

TrevorJS

Model Tree

Base

google/gemma-4-12B-it

Fine-tuned

this model

Input Modalities

Text

Audio

Image

Video

Output Modalities