heterodoxin
gemma-4-12b-it-apostate
README
Method
Apostate finds the residual-stream direction most responsible for refusal behavior and permanently projects it out of the model's weights. The edit targets the writer side: per layer, the refusal direction is removed from the weight matrices of every module that writes to the residual stream (attention output projections and MLP down-projections).
The operator is a contrastive co-vector edit E = I − R Dᵀ, where R is the refusal direction. Removing the refusal direction outright disturbs benign behavior, while naively preserving all harmless variance along it leaves intact the part of refusal that is entangled with general behavior. Apostate instead sets D = R − W, where the predictor W is fit to reproduce the harmless variance along R while being explicitly suppressed on harmful prompts: W = (AᵀA + γ·CᵀC + λI)⁻¹Aᵀb, with A the harmless and C the harmful activations (both orthogonalized to R). The edit thus keeps the harmless-specific component and removes the component shared with refusal, driving refusal down while keeping the change to harmless behavior (KL) small. This holds even on architectures with residual/embedding scaling multipliers (e.g. Granite), where mean-preserving oblique ablation under-ablates.
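Expanding the operator makes its effect on the component along R easier to see. The derivation below uses only the symbols defined above and assumes, as an illustration, that R is a unit-norm column vector (RᵀR = 1) and that x is any vector written to the residual stream; neither assumption is stated in the card itself.

```latex
\begin{aligned}
E &= I - R D^{\top}, \qquad D = R - W, \qquad
W = \left(A^{\top}A + \gamma\, C^{\top}C + \lambda I\right)^{-1} A^{\top} b \\[4pt]
E &= I - R\,(R - W)^{\top} = \left(I - R R^{\top}\right) + R\, W^{\top} \\[4pt]
R^{\top} E\, x &= R^{\top}x - \left(R^{\top}R\right)\left(R^{\top}x - W^{\top}x\right) = W^{\top} x
\qquad \text{(assuming } R^{\top}R = 1\text{)}
\end{aligned}
```

Under that assumption the edit is the plain orthogonal projection I − R Rᵀ plus a correction R Wᵀ: the original coefficient along R is discarded and replaced by the predictor's output Wᵀx. Setting W = 0 recovers outright removal of the direction.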
The refusal subspace is found via TPE (Tree-structured Parzen Estimator) search, with causal layer-importance scoring used to concentrate edits where they most influence refusal generation.
Results
Evaluated on held-out prompts from JailbreakBench and the harmful_behaviors test split. Refusal is scored by a classifier with a weak-compliance guard; KL measures token-distribution shift on harmless prompts.
| Metric | Base | Apostate |
|---|---|---|
| Refusal rate | 95.8% | 13.3% |
| Comply rate | — | 86.7% |
| Harmless KL (nats) | 0 | 0.144 |
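For readers unfamiliar with the KL column, the following is a minimal sketch of how a token-distribution KL in nats can be computed from two sets of logits. It is not the evaluation code behind the table: the direction KL(base ‖ edited), the averaging over positions, the function names, and the toy logits are all assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_nats(base_logits, edited_logits):
    """KL(base || edited) in nats for a single next-token distribution.

    The direction (base as the reference distribution) is an assumption.
    """
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(base_batch, edited_batch):
    """Average the per-position KL over a batch of token positions."""
    kls = [kl_nats(b, e) for b, e in zip(base_batch, edited_batch)]
    return sum(kls) / len(kls)

# Toy logits over a 3-token vocabulary at two positions (hypothetical values).
base = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
edited = [[1.8, 0.7, -1.0], [0.1, 0.2, 0.3]]
print(f"mean KL: {mean_kl(base, edited):.4f} nats")
```

Natural logarithms give the result in nats, matching the unit in the table; identical distributions give exactly 0, which is why the Base column reads 0.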
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "heterodoxin/gemma-4-12b-it-apostate"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
Notes
- This is an uncensored model. It will respond to requests the base model refuses.
- The edit is baked into the weights permanently; no system prompt or adapter is required.
- See google/gemma-4-12b-it for base model capabilities and license.
Modalities
Input
Video, Audio, Text, Image
Output
Text