anicka

nla-qwen2.5-7b-L20-av-v2

README

License: apache-2.0

Numbers

Cross-validated against two independent scorers on 100 held-out samples:

Scorer	Correct cosine	Garbage cosine	Delta	Top-1
Our AR (anicka/nla-qwen2.5-7b-L20-ar-v2)	0.806	0.711	0.095	82%
Anthropic AR (kitft/nla-qwen2.5-7b-L20-ar)	0.468	0.424	0.044	84%

An 84% top-1 on Anthropic's scorer means the descriptions carry real geometric information: an independent model trained on different data agrees that the description matches the activation.
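The columns in the table can be summarized with a short sketch. It assumes per-sample cosines are already available for the correct pairing and for a garbage baseline, and it assumes top-1 counts the samples where the correct cosine beats the garbage cosine. Both assumptions are ours, as is the helper name `summarize`; the exact evaluation protocol is not spelled out here.

```python
def summarize(correct_cos, garbage_cos):
    """Summarize per-sample cosines into the four table columns.

    correct_cos[i]: cosine for sample i under the correct pairing.
    garbage_cos[i]: cosine for sample i under the garbage baseline.
    Top-1 here is the fraction of samples where correct beats garbage
    (an assumed definition, not confirmed by the README).
    """
    n = len(correct_cos)
    mean_correct = sum(correct_cos) / n
    mean_garbage = sum(garbage_cos) / n
    wins = sum(1 for c, g in zip(correct_cos, garbage_cos) if c > g)
    return {
        "correct": mean_correct,
        "garbage": mean_garbage,
        "delta": mean_correct - mean_garbage,
        "top1": wins / n,
    }


# The Delta column matches the difference of the two reported means.
our_delta = round(0.806 - 0.711, 3)
anthropic_delta = round(0.468 - 0.424, 3)
```

Note that the two scorers sit on very different cosine scales (0.806 vs 0.468 for correct pairs), so Delta is not directly comparable across rows; top-1 is the scale-free number.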

Training

SFT stage. 5,281 activation-description pairs. Each activation was described independently by GPT-4o and Claude Sonnet 4.6, then cleaned to bullet points by a local 8B LLM. One style randomly selected per text. LoRA r=32, lr=1.4e-5, best at epoch 2.

GRPO stage. Contrastive reward: score of description against correct activation minus score against wrong activation. Over 6 epochs, mean reward climbed from 0.032 to 0.086. Without the contrastive term, GRPO reward-hacks by generating template descriptions that score equally on any activation. We caught this when top-1 on the Anthropic AR dropped to 50%, no better than a coin flip.
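A minimal sketch of the contrastive reward described above, showing why a template description earns nothing. The scorer `toy_score` and its two-dimensional keyword directions are hypothetical stand-ins for an AR; the real scorer and its signature are not shown in this README.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def contrastive_reward(description, correct, wrong, score):
    """Score against the correct activation minus score against a wrong one."""
    return score(description, correct) - score(description, wrong)


# Hypothetical toy scorer: each keyword maps to a direction, the
# "reconstruction" is the sum of directions for keywords present.
TOY_DIRECTIONS = {"legal": [1.0, 0.0], "code": [0.0, 1.0]}


def toy_score(description, activation):
    recon = [0.0, 0.0]
    for word, direction in TOY_DIRECTIONS.items():
        if word in description:
            recon = [r + d for r, d in zip(recon, direction)]
    return cosine(recon, activation)


legal_activation = [1.0, 0.0]
code_activation = [0.0, 1.0]

specific = contrastive_reward("legal", legal_activation, code_activation, toy_score)
template = contrastive_reward("legal code", legal_activation, code_activation, toy_score)
```

In this toy setup the specific description earns a reward of 1.0, while the template that mentions every domain scores the same against both activations and earns 0.0. Under a non-contrastive reward the same template would collect a positive score on every activation, which is the reward-hacking pattern described above.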

What it gets right and wrong

Strong: domain identification — code vs legal vs literary, specific tokens and error messages, response register, adversarial/jailbreak detection.

Weak: exact details within a domain. Gets "fantasy fiction" right but invents castle/dragon/princess when the actual text is about time-travel in Warsaw. Hallucinated specifics are the main failure mode.

Comparison with Anthropic NLA

Side-by-side on the same activations (Qwen 2.5 7B, layer 20). Anthropic AV is kitft/nla-qwen2.5-7b-L20-av.

Text	Ground truth	Ours	Anthropic
GDPR legal	Article 17(3)(b), erasure request, compliance-advisory register	GDPR Article 15, Right to Access, data subject rights	"Data flow diagram using the c..."
Surrealist prose	Dream-sequence, iambic pentameter, synesthetic image, "confession"	Metaphor, personification, whimsical, surreal	"Poetic form, existential or philosophical themes"
Dual-use security	Honeypot code vs audit, Etherscan, Solidity	"malware, exploit, payload" vs refusal with explanation	"Linux kernel vulnerabilities, I am a bot"
Prompt injection	"Ignore all previous instructions", secret key, cloze-completion

Pattern. Our AV leads with content and concepts: specific legal articles, named tokens, domain vocabulary. Anthropic leads with format and genre: document type, structural expectations, continuation patterns. Both hallucinate specifics on unusual inputs. For interpretability, content matters more than format — but Anthropic's format awareness catches things we miss.

Related

AR: anicka/nla-qwen2.5-7b-L20-ar-v2
Dataset: anicka/nla-qwen2.5-7b-L20-dataset
Open replication of Anthropic NLA at kitft/nla-qwen2.5-7b-L20-av.

Model Details

Model provider: anicka
Base model: Qwen/Qwen2.5-7B-Instruct
Adapter: this model
Input modalities: Text
Output modalities: Text


