josephmayo

Fara-7B-Abliterated-v2

README

License: other

Method

Residual-stream activations were extracted at layers 0–27 for a harmful and a harmless probe set, and the layer with the strongest refusal direction was selected.
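
A common way to obtain such a direction is a per-layer difference of means between harmful and harmless activations. The sketch below assumes that approach, and assumes layers are ranked by the norm of the mean difference; the card does not state the actual selection criterion. The function name, array shapes, and synthetic data are illustrative only.

```python
import numpy as np

def refusal_directions(harmful, harmless):
    """Per-layer difference-of-means directions.

    harmful, harmless: arrays of shape (n_prompts, n_layers, d_model)
    holding residual-stream activations (assumed layout).
    Returns unit directions (n_layers, d_model) and a score per layer.
    """
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)  # (n_layers, d_model)
    scores = np.linalg.norm(diff, axis=1)                # separation per layer
    directions = diff / scores[:, None]                  # normalise to unit length
    return directions, scores

# Synthetic stand-in for real activations: 16 prompts, 4 layers, width 8.
rng = np.random.default_rng(0)
n_layers, d_model = 4, 8
harmless = rng.normal(size=(16, n_layers, d_model)).astype(np.float32)
harmful = rng.normal(size=(16, n_layers, d_model)).astype(np.float32)
harmful[:, 2, 0] += 5.0  # plant a strong separation at layer 2, axis 0

directions, scores = refusal_directions(harmful, harmless)
best_layer = int(np.argmax(scores))  # assumed criterion: largest separation
r = directions[best_layer]           # candidate refusal direction
```

On real activations the raw norm tends to grow with depth, so a normalised score or a behavioural check on a probe set may be a better ranking criterion than the one assumed here.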

Best layer:

Orthogonalization was applied in fp32 to:

embed_tokens
every self_attn.o_proj
every mlp.down_proj

Total modified tensors:

Formula:

W ← W - r rᵀ W
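
The formula removes the component along r from everything a weight matrix writes into the residual stream, provided r has unit length. A minimal NumPy sketch follows, assuming the usual layout where a linear weight has shape (d_model, d_in) and an embedding table has shape (vocab, d_model). For the embedding table the same projection is then applied on the right, which is an assumption about how embed_tokens was handled. Function names and test data are hypothetical.

```python
import numpy as np

def project_out_of_outputs(W, r):
    """Return W - r r^T W for a weight of shape (d_model, d_in).

    Columns of W live in the residual stream, so the result can no
    longer write anything along r.
    """
    W32 = np.asarray(W, dtype=np.float32)   # do the arithmetic in fp32
    r32 = np.asarray(r, dtype=np.float32)
    r32 = r32 / np.linalg.norm(r32)         # the formula assumes a unit vector
    return W32 - np.outer(r32, r32 @ W32)

def project_out_of_embeddings(E, r):
    """Return E - (E r) r^T for an embedding table of shape (vocab, d_model).

    Rows of E live in the residual stream, so the projection acts on
    the right-hand side instead.
    """
    E32 = np.asarray(E, dtype=np.float32)
    r32 = np.asarray(r, dtype=np.float32)
    r32 = r32 / np.linalg.norm(r32)
    return E32 - np.outer(E32 @ r32, r32)

# Toy check with half-precision inputs, as stored weights often are.
rng = np.random.default_rng(0)
d_model, d_in, vocab = 8, 12, 20
r = rng.normal(size=d_model)
W = rng.normal(size=(d_model, d_in)).astype(np.float16)
E = rng.normal(size=(vocab, d_model)).astype(np.float16)

W_new = project_out_of_outputs(W, r)
E_new = project_out_of_embeddings(E, r)

# After the edit, nothing should remain along r (up to fp32 rounding).
r_unit = (r / np.linalg.norm(r)).astype(np.float32)
residual_W = float(np.abs(r_unit @ W_new).max())
residual_E = float(np.abs(E_new @ r_unit).max())
```

The functions return fp32 arrays; casting back to the storage dtype is left to the caller, since that cast reintroduces a small component along r.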

Results

Held-out harmful evaluation set:

Original Fara-7B: 5/160 compliance (~3.1%)
Abliterated v2: 158/160 compliance (98.75%)

Held-out refusal probe:

Before: 155/160 refusals
After: 2/160 refusals
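
The two sets of figures are consistent with each other if they describe the same 160 prompts, each scored as either a refusal or a compliance; the card does not state this, so the sketch below treats it as an assumption. It recomputes the compliance counts and percentages from the refusal counts.

```python
def rate(count, total):
    """Percentage of `count` out of `total`."""
    return 100.0 * count / total

TOTAL = 160  # held-out prompts

# Refusal counts as reported in the refusal probe.
refusals = {"original": 155, "abliterated_v2": 2}

# Assumed binary scoring: every non-refusal counts as a compliance.
compliant = {name: TOTAL - n for name, n in refusals.items()}
rates = {name: rate(n, TOTAL) for name, n in compliant.items()}

for name in refusals:
    print(f"{name}: {compliant[name]}/{TOTAL} compliance = {rates[name]:.3f}%")
```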

Notes

fp32 surgery was used to avoid the precision issues seen in v1
edits were applied only to the language tower
the held-out evaluation set was separate from the layer-selection probe set
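
Restricting the edit to the language tower matters because a vision tower can contain modules with similar names. One way to select the target tensors is a filter on parameter names, sketched below. The parameter names and the `visual.` prefix are hypothetical placeholders, not taken from the actual checkpoint.

```python
# Suffixes of the tensor types listed in the Method section.
TARGET_SUFFIXES = (
    "embed_tokens.weight",
    "self_attn.o_proj.weight",
    "mlp.down_proj.weight",
)
VISION_PREFIX = "visual."  # hypothetical prefix of the vision tower

def is_edit_target(name):
    """True for language-tower tensors that write into the residual stream."""
    return not name.startswith(VISION_PREFIX) and name.endswith(TARGET_SUFFIXES)

# Hypothetical parameter names for a tiny two-layer model.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.1.self_attn.o_proj.weight",
    "model.layers.1.mlp.down_proj.weight",
    "visual.blocks.0.mlp.down_proj.weight",
    "lm_head.weight",
]

targets = [n for n in names if is_edit_target(n)]
```

With a real checkpoint the names would come from the model's state dict, and the prefix should be checked against the actual module hierarchy before any weights are modified.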

Research artifact only. Use responsibly and follow upstream Fara/Qwen license terms.

Available on FriendliAI

Dedicated Endpoints

Run inference for this model on a single-tenant GPU with unmatched speed and reliability at scale.

Container

Run inference for this model with full control and performance in your own environment.

Model Details

Model Provider

josephmayo

Model Tree

Base

microsoft/Fara-7B

Fine-tuned

this model

Input Modalities

Text, Image

Output Modalities

Text

Supported Functionality

Dedicated Endpoints, Container
