What this is
A Natural Language Autoencoder is a pair of fine-tuned LMs that map residual-stream
activations to language and back:
| direction | model | mechanism |
|---|---|---|
| activation → text | AV (this model) | inject the vector at a marker token, autoregress an explanation |
| text → activation | AR | truncated K+1-layer copy of the base + Linear(d, d) readout (reconstruction) |
This repo hosts the AV for Qwen3-8B @ layer 24.
It was trained with the Karvonen norm-matched additive injection at the marker
token (a layer-1 residual-stream hook):
h'_p = h_p + ‖h_p‖ · v / ‖v‖
where v is the layer-24 activation you want verbalized. Serve it with the same
injection — not the paper's embedding-replacement default. In the
nanoNLA code this is the Karvonen path; see
docs/qwen3_8b_run.md.
The marker token, injection/MSE scales, and d_model are all read from the run's
nla_meta.yaml sidecar — never hardcode them.
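
The injection formula above is easy to check by hand. The following is a minimal sketch in plain Python (lists rather than tensors, so the arithmetic stays visible). The names `l2_norm` and `norm_matched_inject` and the `eps` guard are illustrative assumptions, not part of the nanoNLA code, and any extra injection scale from nla_meta.yaml is left out. In a real harness the same expression would presumably be applied to the marker-token position inside the layer-1 hook.

```python
import math


def l2_norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in x))


def norm_matched_inject(h_p, v, eps=1e-8):
    """Return h_p + ||h_p|| * v / ||v|| (hypothetical helper, not nanoNLA's API).

    h_p: residual-stream vector at the marker position.
    v:   activation to verbalize (layer 24 for this checkpoint).
    eps: assumed guard against division by zero when v is all zeros.
    """
    if len(h_p) != len(v):
        raise ValueError("h_p and v must have the same dimension")
    scale = l2_norm(h_p) / max(l2_norm(v), eps)
    return [h + scale * x for h, x in zip(h_p, v)]


# Toy 2-d example: ||h_p|| = 5, so v is rescaled to length 5 before being added.
h_p = [3.0, 4.0]
v = [0.0, 10.0]
injected = norm_matched_inject(h_p, v)
print(injected)  # [3.0, 9.0]
```

Because the added term always has the norm of h_p, the magnitude of v does not change the result; only its direction does.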
Usage
See docs/inference.md
for the full harness (marker detection, neighbor-anchoring, the Karvonen hook). Sketch:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("ceselder/nanoNLA", torch_dtype="bfloat16")
tok = AutoTokenizer.from_pretrained("ceselder/nanoNLA")
Training
- Base: Qwen/Qwen3-8B — fine-tuned; merged bf16 weights provided here
- Layer: 24 (residual stream)
- Stage: AV-SFT (warm-start), 1000 steps
- Injection: Karvonen norm-matched additive (layer-1 hook)
- Warm-start labels: explanations over
qwen3-8b-nla-L24-finefineweb-100k
- Peak LR: 3e-5 (cosine)
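
For orientation, a cosine schedule with the peak LR and step count listed above can be sketched as follows. This assumes no warmup and decay to zero over the 1000 steps; neither is stated here, so treat `cosine_lr` and its defaults as illustrative and check the exact recipe in docs/qwen3_8b_run.md.

```python
import math

PEAK_LR = 3e-5      # peak LR from the list above
TOTAL_STEPS = 1000  # AV-SFT step count from the list above


def cosine_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr (assumed: no warmup, min_lr = 0)."""
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at a few checkpoints across the run.
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step):.2e}")
```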
Exact recipe (rank/scales/hparams) and eval numbers are in the
nanoNLA repo and
docs/qwen3_8b_run.md.
Caveats
Research checkpoint, not a product. This is the SFT stage (pre-RL), so the AV
produces plausible-but-imperfect explanations. Whether an NLA's explanations are
faithful to the underlying computation is an open research question — see the paper.
Citation
@article{frasertaliente2026nla,
title = {Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations},
author = {Fraser-Taliente and others},
journal = {Transformer Circuits Thread},
year = {2026},
url = {https://transformer-circuits.pub/2026/nla/index.html}
}