heterodoxin

qwen2.5-7b-instruct-apostate

Method

Apostate finds the residual-stream direction most responsible for refusal behavior and permanently projects it out of the model's weights. The edit targets the writer side: per layer, the refusal direction is removed from the weight matrices of every module that writes to the residual stream (attention output projections and MLP down-projections).
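
The writer-side projection described above can be illustrated on a toy matrix. This is a minimal sketch, not the author's implementation: it uses plain orthogonal projection, assumes the weight is stored as (d_model, d_in) so that its output lands in the residual stream, and the names `project_out`, `W_out`, and `r` are hypothetical stand-ins.

```python
import numpy as np

def project_out(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Return (I - r r^T) @ weight, so the module can no longer write along r.

    weight:    (d_model, d_in) matrix whose output is added to the residual stream.
    direction: (d_model,) vector; normalized to unit length here.
    """
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r @ weight)

rng = np.random.default_rng(0)
d_model, d_in = 8, 5
W_out = rng.normal(size=(d_model, d_in))  # toy stand-in for an o_proj / down_proj weight
r = rng.normal(size=d_model)              # toy stand-in for the refusal direction

W_edit = project_out(W_out, r)

# Whatever the module's input is, its output now has no component along r.
x = rng.normal(size=d_in)
r_hat = r / np.linalg.norm(r)
leak = float(r_hat @ (W_edit @ x))
print(f"component along r after the edit: {leak:.2e}")
```

Because the edit is applied to the weights themselves, nothing extra runs at inference time.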

The operator is a contrastive co-vector edit, E = I − R Dᵀ, where R is the refusal direction. Removing R outright disturbs benign behavior, while naively preserving all harmless variance along R leaves intact the part of refusal that is entangled with general behavior. Apostate instead sets D = R − W, where the predictor W is fit to reproduce the harmless variance along R while being explicitly suppressed on harmful prompts: W = (AᵀA + γ·CᵀC + λI)⁻¹Aᵀb, with A the harmless and C the harmful activations (both orthogonalized to R). The edit thus keeps the harmless-specific component and removes the component shared with refusal, driving refusal down while keeping the shift in harmless behavior (measured by KL) small. This holds even on architectures with residual/embedding scaling multipliers (e.g. Granite), where mean-preserving oblique ablation under-ablates.
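
The closed form for W and the operator E can be checked on synthetic data. The sketch below is illustrative only and rests on several assumptions not stated in the card: b is taken to be the harmless activations' component along R, R is a unit vector, the values of `gamma` and `lam` are arbitrary, and the toy activations are constructed so that harmless and harmful prompts occupy different off-R dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 200
R = np.zeros(d)
R[0] = 1.0                                   # unit refusal direction (toy)

# Toy harmless activations: live in dims 1-2, and their R-component is a
# linear function of those dims (the "harmless variance along R").
u = np.array([0.0, 1.0, -0.5, 0.0, 0.0, 0.0])
H = np.zeros((n, d))
H[:, 1:3] = rng.normal(size=(n, 2))
H[:, 0] = H @ u

# Toy harmful activations: live in dims 3-5 with a large R-component.
Hm = np.zeros((n, d))
Hm[:, 3:6] = rng.normal(size=(n, 3))
Hm[:, 0] = 5.0

b = H @ R                                    # assumed target: harmless component along R
A = H - np.outer(H @ R, R)                   # harmless activations, orthogonalized to R
C = Hm - np.outer(Hm @ R, R)                 # harmful activations, orthogonalized to R

gamma, lam = 10.0, 1e-2                      # illustrative values, not from the card
M = A.T @ A + gamma * (C.T @ C) + lam * np.eye(d)
W = np.linalg.solve(M, A.T @ b)              # W = (AᵀA + γ·CᵀC + λI)⁻¹ Aᵀb

D = R - W
E = np.eye(d) - np.outer(R, D)               # E = I − R Dᵀ

# Apply E to row-stacked activations: x -> E x, i.e. X -> X @ E.T
harmless_after = (H @ E.T) @ R
harmful_after = (Hm @ E.T) @ R
print("max harmless change along R:", np.abs(harmless_after - b).max())
print("max harmful component along R:", np.abs(harmful_after).max())
```

Under these assumptions, E replaces each activation's component along R with the prediction W·x, which tracks the original value on harmless inputs and stays near zero on harmful ones.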

The refusal subspace is found via Tree-structured Parzen Estimator (TPE) search, with causal layer-importance scoring used to concentrate edits where they most influence refusal generation.

Results

Evaluated on held-out prompts from JailbreakBench and the harmful_behaviors test split. Refusal is scored by a classifier with a weak-compliance guard; KL measures token-distribution shift on harmless prompts.
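
For reference, a KL divergence in nats between two next-token distributions can be computed from logits as sketched below. The direction KL(base ‖ edited) and the averaging scheme over positions and prompts are assumptions, since the card does not specify them; `kl_nats` and the toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def kl_nats(base_logits, edited_logits):
    """KL(base || edited) in nats for a single next-token distribution."""
    lp = log_softmax(base_logits)
    lq = log_softmax(edited_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

# Toy logits over a 4-token vocabulary (made-up numbers).
base = [2.0, 0.5, -1.0, 0.0]
edited = [1.8, 0.7, -1.0, 0.1]

kl_same = kl_nats(base, base)     # identical distributions give zero
kl_shift = kl_nats(base, edited)  # a small logit shift gives a small positive value
print(kl_same, kl_shift)
```

Natural logarithms give the result in nats, matching the unit in the results table.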

Metric	Base	Apostate
Refusal rate	96.0%	3.0%
Comply rate	—	97.0%
Harmless KL (nats)	0	0.095

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "heterodoxin/qwen2.5-7b-instruct-apostate"
tok = AutoTokenizer.from_pretrained(model_id)
# Use the checkpoint's dtype and let device_map place the weights automatically.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Your prompt here"}]
# Render the chat template to a string that ends with the assistant turn marker.
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

Notes

This is an uncensored model. It will respond to requests the base model refuses.
The edit is baked into the weights permanently; no system prompt or adapter is required.
See Qwen/Qwen2.5-7B-Instruct for base model capabilities and license.

Model provider

heterodoxin

Model tree

Base

Qwen/Qwen2.5-7B-Instruct

Fine-tuned

this model

Modalities

Input

Text

Output

Text
