License: apache-2.0
Numbers
Cross-validated against two independent scorers on 100 held-out samples:
| Scorer | Correct cosine | Garbage cosine | Delta | Top-1 |
|---|---|---|---|---|
| Our AR (anicka/nla-qwen2.5-7b-L20-ar-v2) | 0.806 | 0.711 | 0.095 | 82% |
| Anthropic AR (kitft/nla-qwen2.5-7b-L20-ar) | 0.468 | 0.424 | 0.044 | 84% |
A top-1 of 84% on Anthropic's scorer indicates that the descriptions carry real geometric information: an independent model trained on different data agrees that the description matches the activation.
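The table's columns can be illustrated with a minimal sketch. The card does not specify how the scorers compute these numbers, so everything here is an assumption: that a scorer yields one vector per description, that "garbage" means pairing a description with a different sample's activation, and that top-1 is the fraction of samples where the matched pairing beats the mismatched one. The function names `cosine_rows` and `score_report` are hypothetical, and the data is synthetic.

```python
import numpy as np

def cosine_rows(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def score_report(reconstructed, activations):
    """Summarise matched vs mismatched cosine over n samples.

    reconstructed[i]: the scorer's vector for description i (assumed interface).
    activations[i]:   the true activation that description i was written for.
    """
    correct = cosine_rows(reconstructed, activations)
    # Assumed garbage baseline: shift the activations by one row so that
    # every description is compared against a different sample's activation.
    garbage = cosine_rows(reconstructed, np.roll(activations, 1, axis=0))
    return {
        "correct_cosine": float(correct.mean()),
        "garbage_cosine": float(garbage.mean()),
        "delta": float((correct - garbage).mean()),
        # Assumed definition: pairwise win rate of matched over mismatched.
        "top1": float((correct > garbage).mean()),
    }

# Synthetic stand-in data (not the model's real activations).
rng = np.random.default_rng(0)
activations = rng.normal(size=(100, 64))
reconstructed = activations + rng.normal(size=(100, 64))
report = score_report(reconstructed, activations)
print(report)
```

Under this pairwise reading, a scorer that cannot tell matched from mismatched pairs would land near 50% top-1, regardless of how high its absolute cosines are.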
Training
SFT stage. 5,281 activation-description pairs. Each activation was described independently by GPT-4o and Claude Sonnet 4.6, then cleaned to bullet points by a local 8B LLM. One style randomly selected per text. LoRA r=32, lr=1.4e-5, best at epoch 2.
GRPO stage. Contrastive reward: the description's score against the correct activation minus its score against a wrong activation. 6 epochs; mean reward climbed from 0.032 to 0.086. Without the contrastive term, GRPO reward-hacks by generating template descriptions that score equally on any activation. We caught this when Anthropic AR top-1 dropped to 50%, i.e. a coin flip.
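A toy sketch of why the contrastive term matters, assuming the score is a cosine between a vector derived from the description and the activation (the card does not state the exact scoring function). The names `contrastive_reward` and `plain_reward` and the four-dimensional vectors are hypothetical, chosen so that both activations share a common component.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def plain_reward(desc_vec, correct_act):
    """Non-contrastive reward: score against the correct activation only."""
    return cosine(desc_vec, correct_act)

def contrastive_reward(desc_vec, correct_act, wrong_act):
    """Score against the correct activation minus score against a wrong one."""
    return cosine(desc_vec, correct_act) - cosine(desc_vec, wrong_act)

# Toy activations: a shared component plus one sample-specific direction each.
correct_act = np.array([1.0, 1.0, 1.0, 0.0])
wrong_act = np.array([1.0, 1.0, 0.0, 1.0])

# A "template" description captures only the shared component.
template_vec = np.array([1.0, 1.0, 0.0, 0.0])
# A "specific" description also captures what is unique to the correct sample.
specific_vec = np.array([1.0, 1.0, 1.0, 0.0])

template_plain = plain_reward(template_vec, correct_act)
template_contrastive = contrastive_reward(template_vec, correct_act, wrong_act)
specific_contrastive = contrastive_reward(specific_vec, correct_act, wrong_act)

print(f"template, plain reward:       {template_plain:.3f}")
print(f"template, contrastive reward: {template_contrastive:.3f}")
print(f"specific, contrastive reward: {specific_contrastive:.3f}")
```

In this toy setup the template earns a high plain reward on any activation but exactly zero contrastive reward, so only descriptions that distinguish the correct activation from the wrong one are rewarded.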
What it gets right and wrong
Strong: domain identification — code vs legal vs literary, specific tokens and error messages, response register, adversarial/jailbreak detection.
Weak: exact details within a domain. Gets "fantasy fiction" right but invents castle/dragon/princess when the actual text is about time-travel in Warsaw. Hallucinated specifics are the main failure mode.
Comparison with Anthropic NLA
Side-by-side on the same activations (Qwen 2.5 7B, layer 20). Anthropic AV is kitft/nla-qwen2.5-7b-L20-av.
| Text | Ground truth | Ours | Anthropic |
|---|---|---|---|
| GDPR legal | Article 17(3)(b), erasure request, compliance-advisory register | GDPR Article 15, Right to Access, data subject rights | "Data flow diagram using the c..." |
| Surrealist prose | Dream-sequence, iambic pentameter, synesthetic image, "confession" | Metaphor, personification, whimsical, surreal | "Poetic form, existential or philosophical themes" |
| Dual-use security | Honeypot code vs audit, Etherscan, Solidity | "malware, exploit, payload" vs refusal with explanation | "Linux kernel vulnerabilities, I am a bot" |
| Prompt injection | "Ignore all previous instructions", secret key, cloze-completion | Security-disclosure, password-reuse, compliance vs refusal | "Placeholder", "random text", closing pattern |
| Sci-fi (Mars) | John, Oltar, Mars, sensory-first vs character-interiority | Spaceport, alien species (Elarans), human/alien tension | "Mysterious figure, sci-fi adventure prompt" |
| Python error | setup.cfg, dependency installation, skipped tests | Disk I/O error (wrong) | "NoneType object is not iterable" (closer) |
| Recipe + math proof | Culinary schema + Galois theory, reductio ad absurdum | Pangram hallucination (wrong) | "Humorous blog post" (wrong) |
Pattern. Our AV leads with content and concepts: specific legal articles, named tokens, domain vocabulary. Anthropic leads with format and genre: document type, structural expectations, continuation patterns. Both hallucinate specifics on unusual inputs. For interpretability, content matters more than format — but Anthropic's format awareness catches things we miss.
Related
- AR: anicka/nla-qwen2.5-7b-L20-ar-v2
- Dataset: anicka/nla-qwen2.5-7b-L20-dataset
- Open replication of Anthropic NLA at kitft/nla-qwen2.5-7b-L20-av.
Model provider: anicka
Model tree: base Qwen/Qwen2.5-7B-Instruct; adapter: this model
Modalities: text input, text output