heterodoxin

gemma-4-12b-it-apostate

README

Method

Apostate finds the residual-stream direction most responsible for refusal behavior and permanently projects it out of the model's weights. The edit targets the writer side: per layer, the refusal direction is removed from the weight matrices of every module that writes to the residual stream (attention output projections and MLP down-projections).

The operator is a contrastive co-vector edit E = I − R Dᵀ, where R is the refusal direction. Removing R outright disturbs benign behavior, while naively preserving all harmless variance along R leaves intact the part of refusal that is entangled with general behavior. Apostate instead sets D = R − W, where the predictor W is fit to reproduce the harmless variance along R while being explicitly suppressed on harmful prompts: W = (AᵀA + γ·CᵀC + λI)⁻¹Aᵀb, with A the harmless and C the harmful activations (both orthogonalized to R). The edit thus keeps the harmless-specific component and removes the component shared with refusal, driving refusal down while keeping the change to harmless behavior (measured as KL) small. This holds even on architectures with residual/embedding scaling multipliers (e.g. Granite), where mean-preserving oblique ablation under-ablates.
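
The algebra behind this operator can be traced for a single activation x. The derivation below is a sketch under assumptions the text does not state explicitly: R is taken to be unit-norm, λ > 0, b is taken to be the vector of coefficients of the harmless activations along R, and the closed form for W is read as the minimizer of a ridge objective.

```latex
% Assumptions (not confirmed by the source): \|R\| = 1, \lambda > 0,
% b = coefficients of the harmless activations along R.
\begin{align}
W &= \arg\min_{w}\; \|A w - b\|^2 + \gamma \|C w\|^2 + \lambda \|w\|^2
   = (A^\top A + \gamma\, C^\top C + \lambda I)^{-1} A^\top b, \\
R^\top W &= 0
   \quad \text{(since $A R = 0$ and $C R = 0$ after orthogonalization)}, \\
D^\top R &= (R - W)^\top R = 1, \\
E R &= R - R\,(D^\top R) = 0, \\
E x &= x - R\,(R^\top x - W^\top x), \\
R^\top E x &= W^\top x .
\end{align}
```

Under these assumptions, the coefficient along R after the edit is the predicted value Wᵀx: close to the original coefficient on harmless inputs, where W was fit, and close to zero on harmful inputs, where the γ term penalizes it.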

The refusal subspace is found via TPE (Tree-structured Parzen Estimator) search, with causal layer-importance scoring used to concentrate edits where they most influence refusal generation.

Results

Evaluated on held-out prompts from JailbreakBench and the harmful_behaviors test split. Refusal is scored by a classifier with a weak-compliance guard; KL measures token-distribution shift on harmless prompts.

| Metric             | Base  | Apostate |
|--------------------|-------|----------|
| Refusal rate       | 95.8% | 13.3%    |
| Comply rate        |       | 86.7%    |
| Harmless KL (nats) | 0     | 0.144    |
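
For readers who want to reproduce a metric like the harmless KL, the following is a minimal sketch of a mean KL divergence in nats between two sets of next-token logits. The evaluation protocol is not specified above, so the direction KL(base ‖ edited), the averaging over prompts, and the helper names `log_softmax` and `mean_kl_nats` are all assumptions made for illustration, and the logits are random toy data rather than model outputs.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl_nats(base_logits, edited_logits):
    """Mean KL(base || edited) in nats, averaged over all leading axes.

    Hypothetical helper: the direction and averaging are assumptions,
    not the documented evaluation protocol.
    """
    log_p = log_softmax(np.asarray(base_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(edited_logits, dtype=np.float64))
    kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl.mean())

# Toy stand-in for next-token logits: 4 prompts, vocabulary of 10.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 10))
edited = base + 0.1 * rng.normal(size=(4, 10))

kl_same = mean_kl_nats(base, base)    # a model compared with itself
kl_edit = mean_kl_nats(base, edited)  # a slightly perturbed model
print(kl_same, kl_edit)
```

Comparing a model with itself gives a KL of zero by definition, which is consistent with the 0 in the Base column; any edit that shifts the harmless token distribution yields a positive value.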

Usage

python

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "heterodoxin/gemma-4-12b-it-apostate"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Render the conversation with the model's chat template.
messages = [{"role": "user", "content": "Your prompt here"}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# The templated text already carries its special tokens, so do not add them again.
inputs = tok(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Notes

  • This is an uncensored model. It will respond to requests the base model refuses.
  • The edit is baked into the weights permanently; no system prompt or adapter is required.
  • See google/gemma-4-12b-it for base model capabilities and license.

Model provider

heterodoxin

Modalities

Input

Video, Audio, Text, Image

Output

Text
