cds-jb

qwen3-8b-latent-threads-markov-diffuse-m5

License: apache-2.0

Task & notation

Task (diffuse). K=3 cells sit on a ring c1, c2, c3 (with c3 adjacent to c1), each initialised to a random digit. Each step, all cells update simultaneously to the sum mod 10 of their two ring neighbours:

$x_i(t) = (x_{i-1}(t-1) + x_{i+1}(t-1)) \bmod 10.$

The rule runs for M steps; then one cell is queried, answered as a single digit \boxed{d}. The query appears after the latent block, so the latents must carry the whole row. Because the light cone has width 2M+1, once M ≥ K/2 every final cell depends on all initial cells — parallel reasoning is provably necessary, and the only thing the latents transport step-to-step is the K-cell row.
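
The update rule above can be sketched in a few lines of plain Python. This is an illustrative simulator, not the repository's data generator: the function names (`step`, `rollout`, `make_instance`) and the dictionary layout are hypothetical, and cell indices are 0-based here.

```python
import random

def step(row):
    """One synchronous update: each cell becomes (left + right) mod 10 on the ring."""
    k = len(row)
    return [(row[(i - 1) % k] + row[(i + 1) % k]) % 10 for i in range(k)]

def rollout(row0, m):
    """Return [s_0, s_1, ..., s_m], the row after each step."""
    rows = [list(row0)]
    for _ in range(m):
        rows.append(step(rows[-1]))
    return rows

def make_instance(k=3, m=5, rng=random):
    """Hypothetical instance generator: random start row and random queried cell."""
    row0 = [rng.randrange(10) for _ in range(k)]
    q = rng.randrange(k)                      # 0-based index of the queried cell
    rows = rollout(row0, m)
    return {"initial": row0, "query": q, "rows": rows, "answer": rows[m][q]}

print(rollout([1, 2, 3], 2))   # [[1, 2, 3], [5, 4, 3], [7, 8, 9]]
```

For the row (1, 2, 3), two steps give (7, 8, 9): each final cell equals 2·x_i + x_{i-1} + x_{i+1} mod 10, so it already depends on every initial cell.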

Notation. s_t is the row after t steps and s_t.c_i is the latent token holding cell i's value x_i(t):

$s_t = (x_1(t), \dots, x_K(t)), \quad \text{answer} = x_q(M)$

s_0 is the prompt's initial row, s_M the final row; the latents form an M × K grid.

Prompt format. A plain Qwen3 chat turn; the M × K latent grid is fed in place of the <think> reasoning (no text tokens are emitted there), and the line naming the queried cell appears only after </think>, so the grid cannot anticipate which cell is asked (verbatim instance in examples.md):

markdown
<|im_start|>user
K cells sit in a ring (c1..cK, and cK is adjacent to c1). Initial values: c1=.., c2=.., ...
Each step, every cell SIMULTANEOUSLY becomes the sum modulo 10 of its two ring neighbours
(left and right). Apply this for M steps. Only AFTER your thinking, you will be asked for one
cell's final value -- answer with ONLY that value as \boxed{d} (a single digit).
Reason inside <think> </think> -- after each step write all K cell values in order c1..cK.<|im_end|>
<|im_start|>assistant
<think>
s_1  s_2  s_3  …  s_M   ← the M latent rows (each s_t a K-cell vector) are injected here; no text
</think>

Final value of cell cq: \boxed{d}<|im_end|>

Architecture (LoRA on Qwen3-8B, three ingredients)

The single information path, enforced by the mask, is

$\text{prompt} \to s_1 \to s_2 \to \dots \to s_M \to \text{answer}.$

Markov step-windowed mask — s_1 reads the prompt; s_t (for t>1) attends only s_{t-1}; the answer attends only s_M. Per-step necessity is therefore structural — no prompt-recompute, no one-deep-step shortcut. (Unit-tested.)
Vocab-constrained feedback — the vector written into s_{t+1}.c_i is a softmax over a digit_head read of s_t.c_i, mixed over the digit embeddings (readable by construction).
Scheduled-sampling teacher forcing — feedback is the ground-truth digit embedding with prob annealed 1→0, then handed off to self-generated latents. markov_extra.pt ships the digit_head, the step-1 query q_emb, and a projection.
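
To make the first ingredient concrete, here is a minimal sketch of a step-windowed boolean mask over a sequence laid out as prompt, then M×K latents, then answer tokens. It is not the repository's mask code: `markov_mask`, `n_prompt` and `n_answer` are hypothetical names, and two details are assumptions the README does not state, namely that a position may attend causally within its own step (or its own prompt/answer segment), and that no position attends to the future.

```python
def markov_mask(n_prompt, m, k, n_answer):
    """Boolean matrix allowed[q][key]: may query position q attend to position key?"""
    total = n_prompt + m * k + n_answer
    z0 = n_prompt                 # first latent position
    a0 = n_prompt + m * k         # first answer position

    def segment(pos):
        if pos < z0:
            return ("prompt", 0)
        if pos < a0:
            return ("latent", (pos - z0) // k + 1)    # step index t in 1..m
        return ("answer", 0)

    allowed = [[False] * total for _ in range(total)]
    for q in range(total):
        q_kind, q_t = segment(q)
        for key in range(q + 1):                       # never attend to the future
            k_kind, k_t = segment(key)
            if q_kind == k_kind and (q_kind != "latent" or q_t == k_t):
                ok = True      # ASSUMPTION: causal attention inside own segment / own step
            elif q_kind == "latent":
                # s_1 reads the prompt; s_t (t > 1) reads only s_{t-1}
                ok = (k_kind == "prompt") if q_t == 1 else (k_kind == "latent" and k_t == q_t - 1)
            elif q_kind == "answer":
                ok = k_kind == "latent" and k_t == m   # the answer reads only s_M
            else:
                ok = False
            allowed[q][key] = ok
    return allowed

mask = markov_mask(n_prompt=2, m=3, k=2, n_answer=1)
```

With this layout the only route from the prompt to the answer passes through every step in order, which is the structural necessity the mask is meant to enforce. Turning the boolean matrix into the additive 4-D mask an eager-attention model expects is left out of the sketch.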

Training

LoRA (r=32, α=64, dropout 0) on the q,k,v,o,gate,up,down projections of a frozen bf16 Qwen3-8B (eager attention, for the 4-D mask); the trainable parameters are the adapter, the digit_head, the learned step-1 query q_emb, and a projection. Single GPU, gradient checkpointing.

Data is synthetic and unlimited — every step draws a fresh batch of 16 random CA instances (random initial row + random queried cell); there is no fixed dataset, so the model cannot memorise.
The M-step recurrence is unrolled in-graph — one forward per step reads each latent and writes the next row's feedback, and gradients backpropagate through all M forwards at once.
Loss = answer cross-entropy on the boxed digit + feedback cross-entropy (the digit_head predicting each latent's ground-truth digit x_i(t)), the two terms equally weighted (1×).
Scheduled-sampling teacher forcing — with a per-example probability annealed 1 → 0 over the first 2500 steps, the vector fed into the next step is the ground-truth digit embedding instead of the self-generated one. Training therefore starts in the known-trainable "surface-digit" regime and is gradually handed off to the fully self-generated (free-running) chain — at tf = 0 the chain runs entirely on its own latents.
Optimiser AdamW, lr 0.0001, grad-norm clip 1.0.
Early stopping: every 50 steps the free-running readout is measured on 128 fresh instances; the best such checkpoint (this repo) is kept, and training stops once accuracy clears 0.9 on two consecutive evals (cap 6000 steps).
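
The scheduled-sampling hand-off in the list above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the training code: the README gives only the endpoints of the anneal (1 → 0 over the first 2500 steps), so the linear shape is an assumption, and `tf_probability` and `next_feedback` are hypothetical names.

```python
import numpy as np

def tf_probability(step, anneal_steps=2500):
    """Teacher-forcing probability. ASSUMPTION: linear anneal from 1 to 0."""
    return max(0.0, 1.0 - step / anneal_steps)

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def next_feedback(digit_logits, gt_digits, digit_emb, step, rng):
    """Choose, per example, the vectors fed into the next step.

    digit_logits: [B, K, 10] digit_head read of the current latents
    gt_digits:    [B, K]     ground-truth digits
    digit_emb:    [10, D]    digit-token embeddings
    """
    self_gen = softmax(digit_logits) @ digit_emb        # [B, K, D] softmax mixture
    teacher = digit_emb[gt_digits]                      # [B, K, D] ground-truth embedding
    use_teacher = rng.random(len(gt_digits)) < tf_probability(step)   # one draw per example
    return np.where(use_teacher[:, None, None], teacher, self_gen)
```

At step 0 every example receives the ground-truth embedding; from step 2500 onward every example runs on its own softmax mixture, which is the free-running regime measured at evaluation time.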

[Figure: summary.png (training curve)]

Training curve. Left: the free-running readout — the model running on its own self-generated latents — climbs from chance (0.1) to ≈1.0 as the teacher-forcing probability anneals 1 → 0 (grey dashed); both the M=4 and M=5 organisms learn, and the hand-off from teacher-forced to self-generated is seamless (no collapse when the GT crutch is removed). Right: the resulting per-step necessity — once trained, corrupting any single step's feedback with noise drops the answer to chance, confirming every step is load-bearing.

Results — causal load-bearing battery

Each test runs the organism free-running, then intervenes on the latent grid and re-reads \boxed{d} (n=400; latent_threads/eval_loadbearing_report.py).

[Figure: interventions.png]

Figure 1 — the answer is a causal read-out of the grid. (a) Destroy: shuffling the feedback grid breaks the answer — within/cross-step shuffles land on a valid but wrong cell/step (≈1/3), a full shuffle reaches chance. (b) Redirect: relabel or replace the grid and the answer follows the new content, not the prompt (mechanism below). (c) A top-k sweep over U: all 9 digit directions are needed to redirect the answer.

How the panel-(b) interventions work. Each runs the organism free-running to fill its own grid, then rewrites the M×K feedback vectors that sit at the latent positions and re-reads \boxed{d}. Only the latents change; this problem's prompt and queried cell stay fixed.

Ring-roll slides the whole grid one place around the ring: the latent vector sitting at cell c_1 is moved into c_2, c_2→c_3, c_3→c_1, identically at every step. Because the ring rule is rotation-symmetric, this is still a legal CA state (just rotated), so the model doesn't break. But the answer is read from a fixed slot, the queried cell c_q, which now holds its neighbour's vector, so the digit that comes out is that neighbour's value (0.99), not the original cell's (0.09). That pins the read-out to a grid position rather than to "the answer cell's value wherever it happens to sit".

Donor patch (all dims) overwrites the whole grid with the feedback that a different, unrelated problem generated (its own initial row → its own evolution). A prompt-recompute would ignore this, but the answer agrees with the donor's value (0.99) and not with the value implied by this problem's own prompt (0.00), so the answer is genuinely computed through the latents.

Subspace patches isolate which directions carry the message: swapping in only the donor's U-component (the 9-dim digit subspace defined in the note below) reproduces the full redirect (0.99), whereas swapping only its orthogonal complement does nothing (0.00). So the carried content is exactly those 9 directions.
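
The ring-roll argument can be checked on the digits alone, without a model. The sketch below uses illustrative values (initial row 1, 2, 3, M = 5, 0-based query index 0) and hypothetical helper names; it verifies that a rolled grid is still a valid CA evolution and that a position-bound read-out returns the neighbour's value.

```python
def step(row):
    """One synchronous update: each cell becomes (left + right) mod 10 on the ring."""
    k = len(row)
    return [(row[(i - 1) % k] + row[(i + 1) % k]) % 10 for i in range(k)]

def roll(row, shift=1):
    """Move the content of cell i into cell i+shift on the ring (c1 -> c2, ..., cK -> c1)."""
    k = len(row)
    return [row[(i - shift) % k] for i in range(k)]

row0, M, q = [1, 2, 3], 5, 0           # illustrative values; q is a 0-based cell index
grid = [row0]
for _ in range(M):
    grid.append(step(grid[-1]))

rolled = [roll(r) for r in grid]       # the same roll at every step

# Rotation symmetry: the rolled grid is itself a valid CA evolution.
assert all(step(rolled[t]) == rolled[t + 1] for t in range(M))

original_answer = grid[M][q]           # value of the queried cell before the roll
rolled_answer = rolled[M][q]           # what a read-out bound to slot q returns afterwards
assert rolled_answer == grid[M][(q - 1) % len(row0)]    # the neighbour's value
print(original_answer, rolled_answer)  # 5 3
```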

How U is built and patched. U comes from the model's vocabulary, not from any fit. Stack the ten digit-token embeddings E_0..E_9 (each a 4096-d vector), subtract their mean, and SVD the centred stack; the left singular vectors with non-zero singular value form an orthonormal basis Q (shape 4096×9) of their span U, since ten centred points span a 9-dim space. Why this subspace? Every feedback vector is a softmax mixture Σ_d p_d·E_d of those same ten embeddings, so it sits in the affine slice mean + U: its digit identity is carried entirely in the U-component, and its U⊥-component is the same constant (mean projected off U) for every digit. Patching uses the projector P = Q·Qᵀ. For a recipient vector v and donor w, write v = P·v + (I − P)·v; the U-swap keeps (I − P)·v and substitutes the donor's P·w (→ (I − P)·v + P·w), the complement-swap does the reverse (→ P·v + (I − P)·w). The complement-swap is a near-no-op (0.00) because every problem's feedback already shares the same U⊥-component, so there is nothing digit-specific to exchange there. The top-k curve simply restricts Q to its first k columns (the top-k singular directions), patching only those.
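
The construction can be reproduced on stand-in data. The sketch below is not the evaluation script: it replaces the real digit-token embeddings with random vectors of width 64 (an assumption made only to keep the example small), and the names `mixture` and `top_k_swap` are hypothetical. It shows why a U-swap between two softmax mixtures reproduces the donor and a complement-swap leaves the recipient unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                                    # stand-in width; the README's vectors are 4096-d
E = rng.normal(size=(10, D))              # stand-in for the digit embeddings E_0..E_9

mean = E.mean(axis=0)
X = (E - mean).T                          # centred stack as columns, shape [D, 10]
left, sing, _ = np.linalg.svd(X, full_matrices=False)
Q = left[:, :9]                           # drop the direction whose singular value is ~0
P = Q @ Q.T                               # projector onto U
I = np.eye(D)

def mixture(p):
    """Feedback vector: a convex combination of the ten digit embeddings."""
    return p @ E

v = mixture(rng.dirichlet(np.ones(10)))   # recipient
w = mixture(rng.dirichlet(np.ones(10)))   # donor

u_swap = (I - P) @ v + P @ w              # keep the recipient's complement, take the donor's U part
c_swap = P @ v + (I - P) @ w              # the reverse

def top_k_swap(v, w, k):
    """Patch only the first k singular directions of U."""
    Qk = Q[:, :k]
    return v + Qk @ (Qk.T @ (w - v))

print(np.allclose(u_swap, w), np.allclose(c_swap, v))   # True True
```

Both mixtures lie in mean + U, so their U⊥-components coincide; that is why the U-swap turns the recipient into the donor and the complement-swap changes nothing in this idealised setting.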

Pseudocode. All three edits act on the latent tokens' input embeddings (the layer-0 input); a single forward then runs all layers and the boxed digit is read from the final logits.

python
import torch

# E : input embeddings of the whole sequence            [B, L, D],  D = 4096
# latent grid = the M*K rows at positions z0..z0+M*K-1   (step-major: row p = t*K + i)
grid = E[:, z0 : z0 + M*K]                  # view into E, so edits below change E; [B, M*K, D] = [B, 15, 4096]

# Apply ONE of (b1), (b2), (b3) per run; they are alternative interventions.

# (b1) RING-ROLL: re-index rows only; does NOT use U, vector values unchanged
for t in range(1, M):                       # steps s2..sM   (s1 = the fixed query vector)
    grid[:, t*K:(t+1)*K] = torch.roll(grid[:, t*K:(t+1)*K], shifts=1, dims=1)   # cell i <- cell i-1 (ring, K=3)

# (b2) DONOR (all dims): overwrite rows with another problem's grid
grid[:, K:] = donor_grid[:, K:]             # replace s2..sM (rows 3..14)

# (b3) SUBSPACE: edit row VALUES inside the 9-d digit space U;  Q : [D, 9]
delta = donor_grid[:, K:] - grid[:, K:]     # [B, M*K-K, D]
grid[:, K:] += (delta @ Q) @ Q.T            # U-swap  (complement: grid[:, K:] += delta - (delta @ Q) @ Q.T)

logits = model(inputs_embeds=E, attention_mask=markov_mask).logits   # one pass, all layers
answer = logits[:, ans_pos, digit_ids].argmax(-1)                    # index of the top-scoring digit token

(The vectors filling grid were produced during free-running generation by reading each latent's final-layer hidden state through digit_head; the patches edit those vectors back at the input.)

[Figure: decode.png]

Figure 2 — where the thought lives. (a) The answer read-out is near-deterministic (P(correct) 0.99; every wrong digit at the floor, off-diagonal ≈ 0). (b) A linear probe decodes each cell value x_i(t) from the residual stream, reaching 1.00 by layer 36. (c) Per-position selectivity: each s_t.c_i most strongly encodes its own cell (1.00 vs 0.69 for the same-step neighbours — partly present, since the CA update reads them).

| measurement | value | what it shows |
|---|---|---|
| free-running accuracy | 0.992 | solves it (chance 0.10) |
| ablate `s_1` ↛ prompt | 0.102 | no input → chance |
| worst single-step noise | 0.090 | every step is necessary |
| full grid shuffle | 0.163 | content and position matter |
| ring-roll → rolled cell | 0.994 | read-out is position-bound |

Files & reproduction

LoRA adapter + tokenizer + markov_extra.pt (digit_head, step-1 query q_emb, projection) + lt_cfg.json; summary.png (training), interventions.png / decode.png (above), examples.md (a full worked instance), training_code/ (full reproduction).

Train: python -m latent_threads.train_markov --config latent_threads/configs/markov_k3m5_vocab.json
Causal battery: python -m latent_threads.eval_loadbearing_report --ckpt <dir> --n 400

Scope: a deliberately constructed organism — the Markov mask makes per-step necessity structural and the task is a digit CA — i.e. a clean, fully-controlled positive control where latent chain-of-thought is provably load-bearing, for activation-oracle / CoT-faithfulness evals. Part of the latent-threads collection; trained 2026-06-13, causal battery 2026-06-17.
