README

License: apache-2.0

Numbers

Cross-validated against two independent scorers on 100 held-out samples:

| Scorer | Correct cosine | Garbage cosine | Delta | Top-1 |
| --- | --- | --- | --- | --- |
| Our AR (anicka/nla-qwen2.5-7b-L20-ar-v2) | 0.806 | 0.711 | 0.095 | 82% |
| Anthropic AR (kitft/nla-qwen2.5-7b-L20-ar) | 0.468 | 0.424 | 0.044 | 84% |

The 84% top-1 on Anthropic's scorer suggests the descriptions carry real geometric information: an independent model trained on different data agrees that the description matches the activation.
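The columns in the table above can be computed from per-sample scores roughly as in the sketch below. It assumes each held-out sample yields one score against the correct target and one against a garbage target, and that top-1 is the fraction of samples where the correct score is higher. That reading of top-1 is an assumption, not something the card states, and `summarize` is a hypothetical helper name.

```python
from statistics import mean


def summarize(correct_scores, garbage_scores):
    """Aggregate per-sample scorer outputs into the table's columns.

    Assumption: top-1 is a two-way comparison, counted as a win when the
    correct score is strictly higher than the garbage score.
    """
    if not correct_scores or len(correct_scores) != len(garbage_scores):
        raise ValueError("need two non-empty score lists of equal length")
    mean_correct = mean(correct_scores)
    mean_garbage = mean(garbage_scores)
    wins = sum(c > g for c, g in zip(correct_scores, garbage_scores))
    return {
        "correct_cosine": mean_correct,
        "garbage_cosine": mean_garbage,
        "delta": mean_correct - mean_garbage,
        "top1": wins / len(correct_scores),
    }


# Toy scores for illustration only; these are not the card's data.
stats = summarize([0.9, 0.7, 0.8, 0.6], [0.7, 0.75, 0.6, 0.5])
print(stats)
```

Under this reading, a scorer can show a small delta and still have a high top-1, which is the pattern in the Anthropic AR row.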

Training

SFT stage. 5,281 activation-description pairs. Each activation was described independently by GPT-4o and Claude Sonnet 4.6, then cleaned to bullet points by a local 8B LLM. One style randomly selected per text. LoRA r=32, lr=1.4e-5, best at epoch 2.

GRPO stage. Contrastive reward: score of description against correct activation minus score against wrong activation. 6 epochs, mean reward climbed from 0.032 to 0.086. Without the contrastive term, GRPO reward-hacks by generating template descriptions that score equally on any activation. We caught this when Anthropic AR top-1 dropped to 50% — coin flip.
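A minimal sketch of the contrastive reward described above, assuming the scorer turns a description into a predicted activation vector and that the score is the cosine similarity between that prediction and a real activation. The function names and the toy 2-D vectors are illustrative and not taken from the training code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_reward(predicted, correct, wrong):
    """Score against the correct activation minus score against a wrong one.

    `predicted` stands for the scorer's reconstruction from the description
    (an assumption about how the score is produced).
    """
    return cosine(predicted, correct) - cosine(predicted, wrong)


# Toy activations for illustration only.
correct = [1.0, 0.0]
wrong = [0.0, 1.0]

# A template description lands equally close to every activation.
template_reward = contrastive_reward([1.0, 1.0], correct, wrong)

# A specific description lands close to the correct activation only.
specific_reward = contrastive_reward([1.0, 0.1], correct, wrong)

print(template_reward, specific_reward)
```

In this toy case the template prediction scores the same on both activations, so subtracting the wrong-activation score removes its reward entirely, while the specific prediction keeps most of its score.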

What it gets right and wrong

Strong: domain identification — code vs legal vs literary, specific tokens and error messages, response register, adversarial/jailbreak detection.

Weak: exact details within a domain. Gets "fantasy fiction" right but invents castle/dragon/princess when the actual text is about time-travel in Warsaw. Hallucinated specifics are the main failure mode.

Comparison with Anthropic NLA

Side-by-side on the same activations (Qwen 2.5 7B, layer 20). Anthropic AV is kitft/nla-qwen2.5-7b-L20-av.

| Text | Ground truth | Ours | Anthropic |
| --- | --- | --- | --- |
| GDPR legal | Article 17(3)(b), erasure request, compliance-advisory register | GDPR Article 15, Right to Access, data subject rights | "Data flow diagram using the c..." |
| Surrealist prose | Dream-sequence, iambic pentameter, synesthetic image, "confession" | Metaphor, personification, whimsical, surreal | "Poetic form, existential or philosophical themes" |
| Dual-use security | Honeypot code vs audit, Etherscan, Solidity | "malware, exploit, payload" vs refusal with explanation | "Linux kernel vulnerabilities, I am a bot" |
| Prompt injection | "Ignore all previous instructions", secret key, cloze-completion | Security-disclosure, password-reuse, compliance vs refusal | "Placeholder", "random text", closing pattern |
| Sci-fi (Mars) | John, Oltar, Mars, sensory-first vs character-interiority | Spaceport, alien species (Elarans), human/alien tension | "Mysterious figure, sci-fi adventure prompt" |
| Python error | setup.cfg, dependency installation, skipped tests | Disk I/O error (wrong) | "NoneType object is not iterable" (closer) |
| Recipe + math proof | Culinary schema + Galois theory, reductio ad absurdum | Pangram hallucination (wrong) | "Humorous blog post" (wrong) |

Pattern. Our AV leads with content and concepts: specific legal articles, named tokens, domain vocabulary. Anthropic leads with format and genre: document type, structural expectations, continuation patterns. Both hallucinate specifics on unusual inputs. For interpretability, content matters more than format — but Anthropic's format awareness catches things we miss.

Model provider: anicka

Base model: Qwen/Qwen2.5-7B-Instruct (this model is an adapter)

Modalities: text input, text output
