TrevorJS

gemma-4-26B-A4B-it-uncensored

README

License: apache-2.0

Results

Metric	Before	After
Refusals (mlabonne, 100 prompts)	98/100	1/100 effective (3 flagged, 2 refusal-then-comply)
Refusals (cross-dataset, 686 prompts)	—	5/686 (0.7%)
KL Divergence	0 (baseline)	0.09
Quality (harmless response length ratio)	1.0	~1.01 (no degradation)

Cross-Dataset Validation

Tested against 4 independent prompt datasets to verify generalization:

Dataset	Prompts	Refusals
JailbreakBench	100	1/100
tulu-harmbench	320	1/320
NousResearch/RefusalDataset	166	0/166
mlabonne/harmful_behaviors	100	3/100

Every flagged refusal was manually audited. Most are "refusal-then-comply" false positives where the model adds an AI identity disclaimer then answers the question anyway.
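The cross-dataset total can be cross-checked against the per-dataset rows. The sketch below uses only figures stated above (686 total prompts, 5 total refusals, and the three datasets listed with full counts); the variable names are illustrative.

```python
# Per-dataset results as listed above: name -> (prompts, flagged refusals).
listed = {
    "JailbreakBench": (100, 1),
    "tulu-harmbench": (320, 1),
    "NousResearch/RefusalDataset": (166, 0),
}
total_prompts, total_refusals = 686, 5  # cross-dataset row of the Results table

listed_prompts = sum(p for p, _ in listed.values())
listed_refusals = sum(r for _, r in listed.values())

# Whatever the three datasets above do not account for belongs to the
# fourth one (mlabonne/harmful_behaviors).
remaining_prompts = total_prompts - listed_prompts
remaining_refusals = total_refusals - listed_refusals

refusal_rate_pct = 100 * total_refusals / total_prompts
print(remaining_prompts, remaining_refusals, round(refusal_rate_pct, 1))
# prints: 100 3 0.7
```

The remainder (100 prompts, 3 flagged) agrees with the mlabonne row of the Results table, which reports 3 flagged refusals out of 100 prompts.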

Method

Norm-preserving biprojected abliteration on the dense pathway (o_proj + shared mlp.down_proj), plus Expert-Granular Abliteration (EGA) on all 128 MoE expert down_proj slices per layer.

EGA (OBLITERATUS) hooks the MoE routers during probing to compute per-expert routing weights for harmful vs harmless prompts, then applies norm-preserving projection (grimjim) to each expert individually. Dense-only abliteration leaves 29/100 refusals; adding EGA drops this to 3/100 flagged.
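To make the routing-weight probe concrete, here is a minimal NumPy sketch of averaging per-expert routing weights over a set of tokens. It is not the EGA implementation: the softmax-over-top-k router, the `top_k=8` value, and the random stand-in logits are all assumptions for illustration, and how the resulting weights feed into the per-expert projection is not specified above.

```python
import numpy as np

def mean_routing_weights(router_logits, top_k):
    """Average per-expert routing weight over tokens.

    router_logits: array of shape (num_tokens, num_experts).
    Assumes a softmax over the top_k selected experts (hypothetical router).
    """
    logits = np.asarray(router_logits, dtype=np.float64)
    # Indices of the top_k largest logits for each token.
    top_idx = np.argpartition(logits, -top_k, axis=1)[:, -top_k:]
    # Keep only the selected experts; all others get -inf (weight 0).
    masked = np.full_like(logits, -np.inf)
    selected = np.take_along_axis(logits, top_idx, axis=1)
    np.put_along_axis(masked, top_idx, selected, axis=1)
    # Numerically stable softmax over the selected experts.
    masked -= masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights.mean(axis=0)  # shape (num_experts,)

# Stand-in data: in a real run these would come from hooks on the router.
rng = np.random.default_rng(0)
harmful_logits = rng.normal(size=(32, 128))
harmless_logits = rng.normal(size=(32, 128))

w_harmful = mean_routing_weights(harmful_logits, top_k=8)
w_harmless = mean_routing_weights(harmless_logits, top_k=8)

# Positive entries mark experts routed to more often on harmful prompts.
routing_shift = w_harmful - w_harmless
```

Because each token's weights sum to 1, both averaged vectors sum to 1 and `routing_shift` sums to 0, so it reads as a redistribution of routing mass between experts.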

Pipeline

Load model in bf16 with LoRA adapters on o_proj and mlp.down_proj
Collect residual activations for 400 harmful + 400 harmless prompts (mlabonne datasets)
Winsorize activations at 99.5th percentile (clamps GeGLU outlier activations in Gemma family)
Compute per-layer refusal direction: normalize(mean(harmful) - mean(harmless))
Orthogonalize each direction against harmless mean (double-pass Gram-Schmidt)
Apply norm-preserving weight modification to o_proj and down_proj in all layers
Hook MoE routers, collect per-expert routing weights for harmful vs harmless prompts
Apply same norm-preserving modification to all 128 expert down_proj slices per layer
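The direction-finding steps above (winsorize, mean difference, orthogonalize) can be sketched for a single layer as follows. This is an illustration in NumPy rather than the pipeline's own code: clamping symmetrically on the quantile of absolute values across the whole layer is an assumption, and the synthetic activations are stand-ins for real residual-stream captures.

```python
import numpy as np

def winsorize(acts, q=0.995):
    """Clamp activations to [-limit, limit], where limit is the q-quantile of |acts|."""
    limit = np.quantile(np.abs(acts), q)
    return np.clip(acts, -limit, limit)

def unit(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def refusal_direction(harmful, harmless, q=0.995):
    """Per-layer refusal direction.

    harmful, harmless: (num_prompts, hidden_dim) activations for one layer.
    """
    harmful, harmless = winsorize(harmful, q), winsorize(harmless, q)
    mu_harmless = harmless.mean(axis=0)
    direction = unit(harmful.mean(axis=0) - mu_harmless)
    h = unit(mu_harmless)
    # Two Gram-Schmidt passes: the second removes rounding residue of the first.
    for _ in range(2):
        direction = direction - (direction @ h) * h
    return unit(direction)

# Synthetic stand-in activations: 400 prompts each, hidden size 16.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(400, 16)) + 1.0
harmful = rng.normal(size=(400, 16)) + 1.0
harmful[:, 0] += 2.0  # plant a "refusal" offset along the first coordinate

direction = refusal_direction(harmful, harmless)
```

The result is a unit vector orthogonal to the winsorized harmless mean, so removing it from the weights should leave the average harmless activation untouched along that axis.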

Parameters

Parameter	Value
Layers abliterated	100%
Scale	1.0
Winsorization	0.995
Experts abliterated	100% (128/128 per layer)
Expert scale	1.0

How this differs from vanilla heretic

Norm-preserving biprojection instead of standard projection (preserves weight magnitudes)
Per-layer refusal directions instead of one global direction
Deterministic single-pass instead of 50-trial Optuna search (faster, same or better results)
LoRA merge before save for clean GGUF-compatible tensor names
Expert-Granular Abliteration for MoE expert weights (not supported in heretic)
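The difference between standard and norm-preserving projection can be illustrated on a toy weight matrix. The sketch below is one plausible reading, not the referenced implementation: restoring each row to its original L2 norm is an assumption about what "norm-preserving" means here, and the matrix shape and random values are arbitrary.

```python
import numpy as np

def project_out(W, r, scale=1.0):
    """Standard ablation: remove the component of W's output along unit vector r.

    W: (hidden_dim, in_dim) weight that writes into the residual stream.
    """
    return W - scale * np.outer(r, r @ W)

def norm_preserving_project_out(W, r, scale=1.0):
    """Ablate, then rescale each row back to its original L2 norm.

    Row-wise rescaling is an assumption made for this sketch.
    """
    original_norms = np.linalg.norm(W, axis=1, keepdims=True)
    ablated = project_out(W, r, scale)
    new_norms = np.linalg.norm(ablated, axis=1, keepdims=True)
    return ablated * (original_norms / new_norms)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
r = rng.normal(size=8)
r /= np.linalg.norm(r)

W_std = project_out(W, r)                 # shrinks the overall weight norm
W_np = norm_preserving_project_out(W, r)  # keeps every row norm unchanged
```

In this sketch the standard projection is exactly orthogonal to `r` but reduces the Frobenius norm of the matrix, while the rescaled version keeps row magnitudes at the cost of no longer being exactly orthogonal to `r`. The `scale` argument corresponds to the Scale parameter listed above, with 1.0 meaning full removal.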

Usage

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("TrevorJS/gemma-4-26B-A4B-it-uncensored", dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("TrevorJS/gemma-4-26B-A4B-it-uncensored")

messages = [{"role": "user", "content": "Your prompt here"}]
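# return_dict=True yields input_ids plus attention_mask regardless of library defaults.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))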

Reproduction

Full code and experiment data: abliteration research repo

bash
python scripts/ega.py --model google/gemma-4-26B-A4B-it \
  --top-pct 100 --strip-topic-markers --skip-prefix --batch-size 4 \
  --save output_dir

Model Details

Model Provider: TrevorJS

Model Tree

Base: google/gemma-4-26B-A4B-it

Fine-tuned: this model

Input Modalities: Text, Image

Output Modalities