ceselder

nanonla-l24-av-qwen3-8b

README

License: apache-2.0

What this is

A Natural Language Autoencoder (NLA) is a pair of fine-tuned LMs that map residual-stream activations to language and back:

| direction | model | mechanism |
|---|---|---|
| activation → text | AV (this model) | inject the vector at a marker token, autoregress an explanation |
| text → activation | AR | truncated K+1-layer copy of the base + `Linear(d, d)` readout (reconstruction) |

This repo hosts the AV for Qwen3-8B @ layer 24.

⚠️ This checkpoint uses the Karvonen injection formula

It was trained with the Karvonen norm-matched additive injection at the marker token (a layer-1 residual-stream hook):

markdown
h'_p = h_p + ‖h_p‖ · v / ‖v‖

where v is the layer-24 activation you want verbalized. Serve it with the same injection — not the paper's embedding-replacement default. In the nanoNLA code this is the Karvonen path; see docs/qwen3_8b_run.md. The marker token, injection/MSE scales, and d_model are all read from the run's nla_meta.yaml sidecar — never hardcode them.
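The formula above can be sketched on a single vector as follows. This is a minimal NumPy illustration, not the nanoNLA implementation: the function name `norm_matched_inject` and the `eps` guard against a zero-norm `v` are assumptions added here, and any extra injection scale from `nla_meta.yaml` is left out.

```python
import numpy as np

def norm_matched_inject(h_p: np.ndarray, v: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Return h_p + ||h_p|| * v / ||v|| for one residual-stream vector.

    `eps` is an illustrative safeguard (assumption), not part of the formula.
    """
    h_norm = np.linalg.norm(h_p)   # ||h_p||: norm of the marker-token residual
    v_norm = np.linalg.norm(v)     # ||v||: norm of the activation to verbalize
    return h_p + h_norm * v / (v_norm + eps)

# Toy 2-d example: ||h_p|| = 5, so the added term is v rescaled to norm 5.
h_p = np.array([3.0, 4.0])
v = np.array([0.0, 2.0])
injected = norm_matched_inject(h_p, v)
```

Because `v` is rescaled to the norm of `h_p` before being added, the magnitude of `v` itself does not affect the result; only its direction does.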

Usage

See docs/inference.md for the full harness (marker detection, neighbor-anchoring, the Karvonen hook). Sketch:

python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("ceselder/nanonla-l24-av-qwen3-8b", torch_dtype="bfloat16")
tok   = AutoTokenizer.from_pretrained("ceselder/nanonla-l24-av-qwen3-8b")
# 1. register the Karvonen layer-1 injection hook with your layer-24 activation `v`
# 2. put the marker token in the prompt (neighbor-anchored)
# 3. generate the explanation
# (full, correct hook + prompt template live in the nanoNLA repo)
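
Step 1 could look roughly like the sketch below, which uses PyTorch's `register_forward_hook` on a toy stand-in layer. Everything model-specific is an assumption: `make_injection_hook`, `marker_pos`, and `eps` are hypothetical names, `nn.Identity` stands in for whichever module the nanoNLA code treats as "layer 1", and the real hook, marker detection, and scales live in the nanoNLA repo.

```python
import torch
from torch import nn

def make_injection_hook(v: torch.Tensor, marker_pos: int, eps: float = 1e-8):
    """Build a forward hook that adds the norm-matched vector at one position."""
    def hook(module, args, output):
        # Decoder blocks often return a tuple whose first item is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()                              # avoid in-place edits
        h_p = hidden[:, marker_pos, :]                       # (batch, d_model)
        scale = h_p.norm(dim=-1, keepdim=True)               # ||h_p||
        direction = v.to(hidden.dtype) / (v.norm() + eps)    # v / ||v||
        hidden[:, marker_pos, :] = h_p + scale * direction
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden
    return hook

# Toy stand-in for an early decoder block (hypothetical; not the real model).
layer = nn.Identity()
v = torch.tensor([0.0, 2.0, 0.0])
handle = layer.register_forward_hook(make_injection_hook(v, marker_pos=1))
x = torch.tensor([[[1.0, 0.0, 0.0], [3.0, 0.0, 4.0], [0.0, 1.0, 0.0]]])
out = layer(x)
handle.remove()  # remove the hook so later calls are unaffected
```

With a KV cache, forward passes after the prefill typically see only the newly generated token, so a real harness would need to restrict the injection to the pass that contains the marker position.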

Training

Base: Qwen/Qwen3-8B — fine-tuned; merged bf16 weights provided here
Layer: 24 (residual stream)
Stage: AV-SFT (warm-start), 1000 steps
Injection: Karvonen norm-matched additive (layer-1 hook)
Warm-start labels: explanations over qwen3-8b-nla-L24-finefineweb-100k
Peak LR: 3e-5 (cosine)

Exact recipe (rank/scales/hparams) and eval numbers are in the nanoNLA repo and docs/qwen3_8b_run.md.
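
For orientation, a cosine schedule with the peak LR and step count listed above can be sketched as below. The warmup length and final LR are not stated here, so this sketch assumes no warmup and decay to zero; `cosine_lr` is a hypothetical helper, and the actual schedule is defined by the recipe in the nanoNLA repo.

```python
import math

def cosine_lr(step: int, total_steps: int = 1000, peak_lr: float = 3e-5) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps.

    Assumes no warmup and a zero floor; both are illustrative choices.
    """
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Learning rate at the start, midpoint, and end of the 1000-step run.
schedule = [cosine_lr(s) for s in (0, 500, 1000)]
```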

Caveats

Research checkpoint, not a product. This is the SFT stage (pre-RL), so the AV produces plausible-but-imperfect explanations. Whether an NLA's explanations are faithful to the underlying computation is an open research question — see the paper.

Citation

bibtex
@article{frasertaliente2026nla,
  title   = {Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations},
  author  = {Fraser-Taliente and others},
  journal = {Transformer Circuits Thread},
  year    = {2026},
  url     = {https://transformer-circuits.pub/2026/nla/index.html}
}

