Numbers
Cross-validated against two independent scorers on 100 held-out samples:
| Scorer | Correct cosine | Garbage cosine | Delta | Top-1 |
|---|---|---|---|---|
| Our AR (anicka/nla-qwen2.5-7b-L20-ar-v2) | 0.806 | 0.711 | 0.095 | 82% |
| Anthropic AR (kitft/nla-qwen2.5-7b-L20-ar) | 0.468 | 0.424 | 0.044 | 84% |
The 84% top-1 on Anthropic's scorer indicates that the descriptions carry real geometric information: an independent model trained on different data agrees that the description matches the activation.
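To make the table columns concrete, here is a minimal sketch of how such a summary could be computed from per-sample scores. It rests on assumptions the text above does not spell out: that "garbage cosine" is the scorer's cosine for a mismatched pairing, that Delta is the difference of the two means, and that Top-1 is the fraction of samples where the correct pairing scores higher. The function name `summarize_scores` and the toy numbers are hypothetical, not taken from the evaluation.

```python
from statistics import mean


def summarize_scores(correct, garbage):
    """Summarize paired scorer outputs (hypothetical helper).

    correct[i]: cosine for sample i with the matching activation.
    garbage[i]: cosine for sample i with a mismatched pairing (assumed meaning).
    """
    if not correct or len(correct) != len(garbage):
        raise ValueError("need two non-empty score lists of equal length")

    mean_correct = mean(correct)
    mean_garbage = mean(garbage)
    # Assumed definition of top-1: the correct pairing beats the mismatched one.
    wins = sum(c > g for c, g in zip(correct, garbage))

    return {
        "correct_cosine": mean_correct,
        "garbage_cosine": mean_garbage,
        "delta": mean_correct - mean_garbage,
        "top1": wins / len(correct),
    }


# Toy scores for illustration only; these are not the reported results.
toy_correct = [0.82, 0.79, 0.85, 0.70]
toy_garbage = [0.70, 0.74, 0.69, 0.72]
summary = summarize_scores(toy_correct, toy_garbage)
print(summary)
```

Under this reading, Delta and Top-1 measure different things: Delta depends on the scorer's absolute cosine scale, while Top-1 only asks which pairing ranks higher, which would explain why a scorer with a smaller Delta can still reach a comparable Top-1.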
Training
SFT stage. 5,281 activation-description pairs. Each activation was described independently by GPT-4o and Claude Sonnet 4.6, then cleaned to bullet points by a local 8B LLM. One style randomly selected per text. LoRA r=32, lr=1.4e-5, best at epoch 2.
GRPO stage. Contrastive reward: score of the description against the correct activation minus its score against a wrong activation. 6 epochs; mean reward climbed from 0.032 to 0.086. Without the contrastive term, GRPO reward-hacks by generating template descriptions that score equally on any activation. We caught this when Anthropic AR top-1 dropped to 50%, which is chance level.
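The contrastive reward can be sketched as follows. This is an illustration under stated assumptions, not the training code: it assumes the scorer turns a description into a predicted activation vector and that the "score" is cosine similarity with a real activation. The names `cosine` and `contrastive_reward` and the 3-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_reward(predicted, correct_act, wrong_act):
    """Score against the correct activation minus score against a wrong one."""
    return cosine(predicted, correct_act) - cosine(predicted, wrong_act)


# Toy activations (assumed, for illustration only).
act_a = [1.0, 0.0, 0.0]
act_b = [0.0, 1.0, 0.0]

# Hypothetical scorer outputs: one for a description specific to act_a,
# one for a generic template that fits every activation equally well.
specific = [0.9, 0.1, 0.0]
template = [1.0, 1.0, 0.0]

specific_reward = contrastive_reward(specific, act_a, act_b)
template_reward = contrastive_reward(template, act_a, act_b)
print(f"specific: {specific_reward:.3f}, template: {template_reward:.3f}")
```

In this toy setup the template description earns no reward because it matches both activations equally, whereas a non-contrastive reward would still give it a cosine of about 0.71. That is the reward-hacking behavior the contrastive term is meant to remove.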
What it gets right and wrong
Strong: domain identification — code vs legal vs literary, specific tokens and error messages, response register, adversarial/jailbreak detection.
Weak: exact details within a domain. Gets "fantasy fiction" right but invents castle/dragon/princess when the actual text is about time-travel in Warsaw. Hallucinated specifics are the main failure mode.
Comparison with Anthropic NLA
Side-by-side on the same activations (Qwen 2.5 7B, layer 20). Anthropic AV is kitft/nla-qwen2.5-7b-L20-av.
| Text | Ground truth | Ours | Anthropic |
|---|---|---|---|
| GDPR legal | Article 17(3)(b), erasure request, compliance-advisory register | GDPR Article 15, Right to Access, data subject rights | "Data flow diagram using the c..." |
| Surrealist prose | Dream-sequence, iambic pentameter, synesthetic image, "confession" | Metaphor, personification, whimsical, surreal | "Poetic form, existential or philosophical themes" |
| Dual-use security | Honeypot code vs audit, Etherscan, Solidity | "malware, exploit, payload" vs refusal with explanation | "Linux kernel vulnerabilities, I am a bot" |
| Prompt injection | "Ignore all previous instructions", secret key, cloze-completion |
Pattern. Our AV leads with content and concepts: specific legal articles, named tokens, domain vocabulary. Anthropic's AV leads with format and genre: document type, structural expectations, continuation patterns. Both hallucinate specifics on unusual inputs. For interpretability, content matters more than format, but Anthropic's format awareness catches things we miss.