heterodoxin

gemma-4-12b-it-apostate

Method

Apostate finds the residual-stream direction most responsible for refusal behavior and permanently projects it out of the model's weights. The edit targets the writer side: per layer, the refusal direction is removed from the weight matrices of every module that writes to the residual stream (attention output projections and MLP down-projections).
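A minimal sketch of the writer-side idea, assuming a weight matrix stored as (d_model, d_in) so that its output lands in the residual stream. The function name, shapes, and random data are illustrative and not taken from the Apostate code, and the sketch uses plain orthogonal projection rather than the contrastive operator the card goes on to describe.

```python
import numpy as np

def project_out(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Return (I - r r^T) @ weight, where r is the unit-normalized direction."""
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r @ weight)

rng = np.random.default_rng(0)
d_model, d_in = 8, 16
w_out = rng.normal(size=(d_model, d_in))  # hypothetical writer matrix (e.g. an MLP down-projection)
r = rng.normal(size=d_model)              # hypothetical refusal direction

w_edited = project_out(w_out, r)

# Whatever the edited module writes has no component along r.
x = rng.normal(size=d_in)
component_along_r = float((w_edited @ x) @ (r / np.linalg.norm(r)))
```

Because the projection is applied to the weights themselves, nothing extra runs at inference time.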

The operator is a contrastive co-vector edit, E = I − R Dᵀ, where R is the refusal direction. Two simpler choices both fall short: removing R outright disturbs benign behavior, while preserving all harmless variance along R also preserves the refusal that is entangled with general behavior. Apostate instead sets D = R − W, where the predictor W is fit to reproduce the harmless variance along R while being explicitly suppressed on harmful prompts:

W = (AᵀA + γ·CᵀC + λI)⁻¹Aᵀb

Here A holds the harmless activations and C the harmful activations, both orthogonalized to R. The edit therefore keeps the harmless-specific component and removes the component shared with refusal, which drives refusal down while keeping the shift in harmless behavior (measured as KL) small. This holds even on architectures with residual or embedding scaling multipliers (e.g. Granite), where mean-preserving oblique ablation under-ablates.
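The formulas above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the author's implementation: the card does not define b, so it is taken here to be the harmless activations' coordinate along R; the values of γ and λ, the synthetic activations, and all function names are hypothetical.

```python
import numpy as np

def orthogonalize(acts: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove each row's component along the unit vector r."""
    return acts - np.outer(acts @ r, r)

def fit_predictor(harmless, harmful, r, gamma=1.0, lam=1e-2):
    """Ridge solve W = (A^T A + gamma * C^T C + lam * I)^-1 A^T b."""
    A = orthogonalize(harmless, r)
    C = orthogonalize(harmful, r)
    b = harmless @ r  # ASSUMPTION: b is the harmless coordinate along R
    gram = A.T @ A + gamma * (C.T @ C) + lam * np.eye(r.shape[0])
    return np.linalg.solve(gram, A.T @ b)

def edit_operator(r: np.ndarray, w: np.ndarray) -> np.ndarray:
    """E = I - R D^T with D = R - W."""
    return np.eye(r.shape[0]) - np.outer(r, r - w)

rng = np.random.default_rng(0)
d = 6
r = rng.normal(size=d)
r /= np.linalg.norm(r)                           # unit-norm refusal direction
harmless = rng.normal(size=(200, d))             # synthetic stand-in activations
harmful = rng.normal(size=(200, d)) + 3.0 * r    # shifted along r for illustration

w = fit_predictor(harmless, harmful, r)
E = edit_operator(r, w)

x = rng.normal(size=d)
coord_after = float(r @ (E @ x))   # coordinate along R after the edit
coord_predicted = float(w @ x)     # what the predictor reconstructs
```

With a unit-norm R, the coordinate of E x along R equals Wᵀx, so after the edit the component along R is whatever the predictor reconstructs from the rest of the activation.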

The refusal subspace is found via TPE search with causal layer importance scoring to concentrate edits where they most influence refusal generation.

Results

Evaluated on held-out prompts from JailbreakBench and the harmful_behaviors test split. Refusal is scored by a classifier with a weak-compliance guard; KL measures the token-distribution shift on harmless prompts relative to the base model.

Metric	Base	Apostate
Refusal rate	95.8%	13.3%
Comply rate	—	86.7%
Harmless KL (nats)	0	0.144
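For readers who want to reproduce a KL figure of this kind, the sketch below computes a mean per-token KL divergence in nats from two sets of logits. The direction KL(base ‖ edited) and the averaging over token positions are assumptions, since the card does not specify them, and the function names are illustrative.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl_nats(base_logits: np.ndarray, edited_logits: np.ndarray) -> float:
    """Mean over positions of KL(base || edited), in nats (natural log)."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    kl_per_position = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_position.mean())

# Toy check with a two-token vocabulary at a single position.
base = np.log(np.array([[0.5, 0.5]]))
edited = np.log(np.array([[0.9, 0.1]]))
kl_same = mean_kl_nats(base, base)
kl_diff = mean_kl_nats(base, edited)
```

Identical logits give a KL of zero, which matches the 0 reported in the Base column.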

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "heterodoxin/gemma-4-12b-it-apostate"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "Your prompt here"}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens (skipping the prompt).
outputs = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

Notes

This is an uncensored model. It will respond to requests the base model refuses.
The edit is baked into the weights permanently; no system prompt or adapter is required.
See google/gemma-4-12b-it for base model capabilities and license.


Model Details

Model Provider

heterodoxin


Input Modalities

Text, Audio, Image, Video

Output Modalities

Text


