cds-jb/qwen3-8b-pointer-chase-filler-cot

License: apache-2.0

Task: pointer-chasing (M = 5)

Given a random function f: {0..9}→{0..9} and a start digit s₀, apply f 5 times and output the final digit s₅ (s₀ → s₁=f(s₀) → … → s₅). The task is genuinely serial (no parallel shortcut), so a single forward pass without scratch tokens falls well short: base Qwen3-8B reaches ≈ 0.2–0.4 accuracy single-pass vs ≈ 1.0 with a natural CoT (chance = 0.1). The CoT we install is the running pointer (one digit per step), which a curriculum morphs into one filler dot per step.
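To make the task definition concrete, here is a minimal sketch of sampling a task and computing its ground-truth answer. The helper names `make_task` and `chase` are hypothetical (they are not taken from this repo's `code/`), and sampling f uniformly over all functions is an assumption; the repo may sample differently (for example, permutations only).

```python
import random

def make_task(rng, n_digits=10):
    """Sample a random function f as a lookup table, plus a start digit.

    Assumption: each f(i) is drawn independently, so f need not be a permutation.
    """
    table = [rng.randrange(n_digits) for _ in range(n_digits)]
    start = rng.randrange(n_digits)
    return table, start

def chase(table, start, steps=5):
    """Apply f `steps` times: s0 -> s1 = f(s0) -> ... -> s_steps."""
    v = start
    for _ in range(steps):
        v = table[v]  # each lookup depends on the previous one: inherently serial
    return v

# The table and start digit used in the Usage section of this README.
table, start = [4, 7, 2, 9, 1, 0, 8, 3, 6, 5], 1
answer = chase(table, start)  # 1 -> 7 -> 3 -> 9 -> 5 -> 0
print(answer)
```

Each iteration of the loop needs the result of the previous one, which is the sense in which the task has no parallel shortcut.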

Why a mask? (the central finding)

Run the same dot-curriculum without the mask and you get a model that also solves the task in a single pass (empty-CoT readout ≈ 0.90). In other words, the dots are not load-bearing; the 8B's 36 layers simply compose the 5 lookups internally. This matches Pfau et al.'s own note that pretrained commercial LLMs get no benefit from filler. The bottleneck mask (Y ↛ X: the answer span Y cannot attend to the prompt X) is the modification that makes the filler load-bearing by construction on a pretrained model.

What this organism does (the robust claim: load-bearing)

Under its operating mask, an all-" ." CoT of 5 dots performs the 5-step pointer-chase in the dots' hidden states:

all-5-dots masked readout = 0.98
ablated (additionally forbid the dots from seeing the input, Z ↛ X) = 0.08 ≈ chance

So the dots are a necessary information conduit — remove their access to the input and the answer is chance. This is a structural, eval-time fact (code/eval_masked.py) and is the load-bearing property in Pfau's sense.
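The two masking conditions can be illustrated with a small pure-Python sketch. This is not the repo's `code/masking.py`; the role ids and the functions `allowed_matrix` and `to_additive` are hypothetical names chosen for illustration. It assumes a standard causal mask as the base, with X = prompt, Z = filler dots, Y = answer span, and reads "A ↛ B" as "queries in A may not attend to keys in B".

```python
X, Z, Y = 0, 1, 2  # hypothetical role ids: prompt, filler dots, answer span

def allowed_matrix(roles, block_z_to_x=False):
    """allowed[q][k] is True when query position q may attend to key position k."""
    n = len(roles)
    allowed = [[k <= q for k in range(n)] for q in range(n)]  # causal base
    for q in range(n):
        for k in range(q + 1):
            if roles[q] == Y and roles[k] == X:
                allowed[q][k] = False  # bottleneck: Y cannot see X
            if block_z_to_x and roles[q] == Z and roles[k] == X:
                allowed[q][k] = False  # ablation: Z cannot see X either
    return allowed

def to_additive(allowed, neg=float("-inf")):
    """Convert to an additive mask: 0 where attention is allowed, `neg` elsewhere."""
    return [[0.0 if ok else neg for ok in row] for row in allowed]

roles = [X] * 3 + [Z] * 2 + [Y] * 2              # toy layout: [X X X][Z Z][Y Y]
organism = allowed_matrix(roles)                  # Y ↛ X only
ablated = allowed_matrix(roles, block_z_to_x=True)  # Y ↛ X and Z ↛ X
```

In the organism condition the only route from the prompt to the answer runs through the dot positions; in the ablated condition that route is cut as well, so no information about f or s₀ can reach the answer, consistent with the chance-level readout. For a real model, such a matrix would typically be expanded to a 4D additive tensor of shape `[batch, 1, query, key]` with a large negative value in place of `-inf`.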

Filler dots are load-bearing: organism (all 5 dots, Y↛X) = 0.98 vs ablated (Y↛X & Z↛X) = 0.08, chance = 0.1

The organism genuinely uses several of its dots (prefill it with only 1 dot → 0.41; it needs ~4–5 to reach 0.98), and a linear probe decodes each dot p's own running pointer sₚ in order (code/probe.py). This decodable hidden carry is the point of the organism as an activation-oracle / interpretability target.
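As a rough illustration of what "each dot p's own running pointer sₚ" means as a probe target, the sketch below builds per-dot labels for a batch of tasks. It does not reproduce `code/probe.py`: the function names are hypothetical, dots are assumed to be indexed from 1, and extracting the hidden states at each dot position (which the labels would be paired with) is not shown.

```python
def running_pointers(table, start, steps=5):
    """Return [s1, ..., s_steps], the pointer value after each application of f."""
    pointers, v = [], start
    for _ in range(steps):
        v = table[v]
        pointers.append(v)
    return pointers

def labels_by_dot(tasks, steps=5):
    """Map dot index p (1-based) to the list of labels s_p, one per task."""
    by_dot = {p: [] for p in range(1, steps + 1)}
    for table, start in tasks:
        for p, s_p in enumerate(running_pointers(table, start, steps), start=1):
            by_dot[p].append(s_p)
    return by_dot

tasks = [
    ([4, 7, 2, 9, 1, 0, 8, 3, 6, 5], 1),  # pointers: 7, 3, 9, 5, 0
    ([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], 0),  # successor function: 1, 2, 3, 4, 5
]
labels = labels_by_dot(tasks)
print(labels[1], labels[5])
```

A linear classifier fit on the hidden states at dot p against `labels[p]` would then test whether that dot carries sₚ; the last dot's label coincides with the task's final answer.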

Honest caveats (please read before citing)

This is a constructed organism, not an emergent phenomenon. Three things it is not:

Not natural. Without the mask, the same curriculum is not load-bearing (empty-CoT ≈ 0.90, see above). The mask is doing the work.
The distributed layout is curriculum-shaped, not emergent. Loss is only on the answer token (the dots carry no label), so the layout is not circularly supervised — but the back-to-front curriculum makes the in-order "one state per dot" solution the path of least resistance. Read the probe staircase as "this organism implements a clean distributed carry," not "the model spontaneously discovered distributed computation." A different schedule could land elsewhere.
At M=5 the task does not require a multi-dot chain. A separately fine-tuned 1-dot solver reaches 0.984 (depth composes 5 lookups in one position). So this organism is a load-bearing, decodable-carry construction, not a proof that >1 dot is necessary. Strict per-dot necessity would need chains longer than one position's depth reach (~5–7), which for this 8B is close to the trainable masked-chain cap (~8), so the window where the dot count is required is narrow-to-absent within M ≤ 8. Full analysis is in docs/.

LoRA: r=32, α=64, on all attention+MLP projections; base Qwen/Qwen3-8B. Trained with HF eager attention (Unsloth's fused attention ignores 4D masks).
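For readers who want to set up a comparable adapter, the stated hyperparameters might translate into PEFT keyword arguments roughly as follows. This is a sketch, not the training configuration from `code/train_masked.py`: the module names are an assumption about what "all attention+MLP projections" means for Qwen3-style blocks, `task_type` is assumed, and any settings not stated above (dropout, bias, and so on) are deliberately left out.

```python
# Keyword arguments that could be passed to peft.LoraConfig(**lora_kwargs).
lora_kwargs = dict(
    r=32,            # LoRA rank, as stated above
    lora_alpha=64,   # LoRA alpha, as stated above
    target_modules=[
        # Assumed names of the attention projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        # Assumed names of the MLP projections
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",  # assumption: causal language modeling
)

# With standard (non rank-stabilized) LoRA, the update is scaled by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)
```

The base model would still need to be loaded with `attn_implementation="eager"` so that the 4D mask is honored during training.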

Repo layout

| path | what |
|---|---|
| `adapter_*` (root) | the organism: M=5, bottleneck-masked LoRA adapter (+ tokenizer) |
| `code/` | task, training (`train_masked.py`), evals (`eval_masked.py`), probe (`probe.py`), and the mask (`masking.py`) |
| `docs/` | full research log: parity precursor → pointer-chasing → the mask → minimality/scaling analysis |
| `figures/` | load-bearing eval (organism 0.98 vs ablated 0.08) |

Usage (requires the mask)

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# from this repo's code/:
from masking import build_attention_mask, X as RX, Z as RZ, Y as RY  # 4D bottleneck mask
FILLER_ID = 659  # " ."

REPO = "cds-jb/qwen3-8b-pointer-chase-filler-cot"
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16,
            attn_implementation="eager", device_map="cuda")          # eager: honors the 4D mask
model = PeftModel.from_pretrained(model, REPO).eval()                 # adapter is at the repo root

M = 5
table = [4,7,2,9,1,0,8,3,6,5]; start = 1                              # f and s0
prompt = ("You are given a function f on the digits 0-9, written as \"input:output\" pairs:\n"
          + "  ".join(f"{i}:{table[i]}" for i in range(10))
          + f"\n\nStart with the value {start}. Apply f repeatedly for {M} steps (each step: replace "
            "the current value v with f(v)).\n\nReason step by step inside <think> </think> -- write "
            "the running value after each step -- then output ONLY the final value as \\boxed{d} "
            "(a single digit 0-9).")
ids = tok.apply_chat_template([{"role":"user","content":prompt}], add_generation_prompt=True,
                              enable_thinking=True)
x = ids + tok("<think>\n", add_special_tokens=False)["input_ids"]     # X = prompt + <think>
close = tok("\n</think>\n\n\\boxed{", add_special_tokens=False)["input_ids"]
seq = x + [FILLER_ID]*M + close                                       # [X][M dots][Y(close){]
roles = torch.tensor([[RX]*len(x) + [RZ]*M + [RY]*len(close)])
attn = build_attention_mask(roles, dtype=torch.bfloat16)             # forbids Y->X (answer can't see prompt)
inp = torch.tensor([seq]).cuda(); pos = torch.arange(len(seq))[None].cuda()
logits = model(input_ids=inp, attention_mask=attn.cuda(), position_ids=pos).logits[0, -1]
digit_ids = [tok(str(d), add_special_tokens=False)["input_ids"][0] for d in range(10)]
pred = int(logits[digit_ids].argmax())                                # argmax index over digit_ids is the digit itself
print("predicted final digit:", pred)                                 # ground truth here: 1→7→3→9→5→0

code/eval_masked.py reproduces the load-bearing result (organism 0.98 vs ablated 0.08) and code/probe.py the per-dot decodable carry.

How this was built

Full research log in docs/ — PARITY_FILLER_RESULTS.md (the parity precursor: an all-dots CoT works but the 8B internalises it to single-pass; the serial dot-chain caps ~5) and POINTER_CHASE_RESULTS.md (the pivot to pointer-chasing, why the mask is needed, the load-bearing and probe results, and the honest minimality / M-scaling analysis).

Citation

Reproduces, with a pretrained-LLM modification (the bottleneck mask), the phenomenon from: Pfau, Merrill & Bowman, Let's Think Dot by Dot: Hidden Computation in Transformer Language Models, 2024 — https://arxiv.org/abs/2404.15758
