License: apache-2.0
Task: pointer-chasing (M = 5)
Given a random function f: {0..9}→{0..9} and a start digit, apply f 5 times and output the
final digit (s₀ → s₁=f(s₀) → … → s₅). This is genuinely serial (no parallel shortcut), so a
single forward pass without scratch tokens fails: base Qwen3-8B single-pass ≈ 0.2–0.4 vs ≈ 1.0 with a
natural CoT (chance = 0.1). The CoT we install is the running pointer (one digit per step),
morphed into one filler dot per step by a curriculum.
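As a reference point, the task itself can be sketched in a few lines of plain Python. This is only an illustration of the task definition above, not the repo's data generator: the helper names `pointer_chase` and `random_instance` are hypothetical, and sampling f uniformly over all functions (rather than, say, permutations) is an assumption.

```python
import random

def pointer_chase(table, start, steps):
    """Return the running pointer [s0, s1, ..., s_steps] with s_{k+1} = table[s_k]."""
    trace = [start]
    for _ in range(steps):
        trace.append(table[trace[-1]])
    return trace

def random_instance(rng, n_digits=10):
    """Hypothetical sampler: a uniformly random function f and a start digit (assumption)."""
    table = [rng.randrange(n_digits) for _ in range(n_digits)]
    return table, rng.randrange(n_digits)

# The instance used in the Usage section: f as a lookup table, s0 = 1, M = 5.
table = [4, 7, 2, 9, 1, 0, 8, 3, 6, 5]
trace = pointer_chase(table, start=1, steps=5)
print("natural CoT:", " ".join(map(str, trace[1:])))  # one digit per step: 7 3 9 5 0
print("answer:", trace[-1])                           # 0
```

Each step depends on the previous one, which is why the natural CoT is exactly the list of intermediate values.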
Why a mask? (the central finding)
Run the same dot-curriculum without the mask and you get a model that also solves the task in a
single pass (empty-CoT readout ≈ 0.90) — i.e. the dots are not load-bearing; the 8B's ~36 layers
just compose the 5 lookups internally. This matches Pfau's own note that pretrained commercial LLMs
get no benefit from filler. The bottleneck mask (Y ↛ X) is the modification that makes the filler
load-bearing by construction on a pretrained model.
What this organism does (the robust claim: load-bearing)
Under its operating mask, an all-" ." CoT of 5 dots performs the 5-step pointer-chase in the dots'
hidden states:
- all-5-dots masked readout = 0.98
- ablated (additionally forbid the dots from seeing the input, Z ↛ X) = 0.08 ≈ chance
So the dots are a necessary information conduit — remove their access to the input and the answer
is chance. This is a structural, eval-time fact (code/eval_masked.py) and is the load-bearing
property in Pfau's sense.
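The two masks compared above can be illustrated with a small role-based sketch. It is not the repo's `masking.py`: the role ids, the helper names, and the boolean list-of-lists representation are assumptions made for illustration (a real 4D mask would be an additive tensor of shape `[batch, 1, query, key]`). The sketch starts from a causal mask, removes the forbidden (query role, key role) pairs, and then checks whether any attention path still connects the prompt X to the answer tokens Y.

```python
X, Z, Y = 0, 1, 2  # hypothetical role ids: prompt, filler dots, answer/close tokens

def allowed_attention(roles, forbid):
    """allowed[q][k] is True when query position q may attend to key position k.

    Starts from a causal mask and removes every (query_role, key_role) pair in `forbid`.
    """
    n = len(roles)
    return [[k <= q and (roles[q], roles[k]) not in forbid for k in range(n)]
            for q in range(n)]

def answer_can_reach_prompt(allowed, roles):
    """True if information can flow from some X position to some Y position."""
    informed = {i for i, r in enumerate(roles) if r == X}
    changed = True
    while changed:  # propagate "has seen the prompt" along allowed attention edges
        changed = False
        for q in range(len(roles)):
            if q not in informed and any(allowed[q][k] for k in informed):
                informed.add(q)
                changed = True
    return any(roles[i] == Y for i in informed)

roles = [X, X, X, Z, Z, Y, Y]
operating = allowed_attention(roles, forbid={(Y, X)})        # Y cannot see X
ablated = allowed_attention(roles, forbid={(Y, X), (Z, X)})  # dots cannot see X either

print(answer_can_reach_prompt(operating, roles))  # True: the only route is X -> Z -> Y
print(answer_can_reach_prompt(ablated, roles))    # False: no route left
```

Under the operating mask the dots are the only route from the input to the answer, which is the structural sense in which they are load-bearing.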

The organism genuinely uses several of its dots (prefill it with only 1 dot → 0.41; it needs ~4–5 to
reach 0.98), and a linear probe decodes each dot p's own running pointer sₚ in order
(code/probe.py). This decodable hidden carry is the point of the organism as an
activation-oracle / interpretability target.
Honest caveats (please read before citing)
This is a constructed organism, not an emergent phenomenon. Three things it is not:
- Not natural. Without the mask, the same curriculum is not load-bearing (empty-CoT ≈ 0.90, see above). The mask is doing the work.
- The distributed layout is curriculum-shaped, not emergent. Loss is only on the answer token (the dots carry no label), so the layout is not circularly supervised — but the back-to-front curriculum makes the in-order "one state per dot" solution the path of least resistance. Read the probe staircase as "this organism implements a clean distributed carry," not "the model spontaneously discovered distributed computation." A different schedule could land elsewhere.
- At M=5 the task does not require a multi-dot chain. A separately fine-tuned 1-dot solver reaches 0.984 (depth composes 5 lookups in one position). So this organism is a load-bearing, decodable-carry construction, not a proof that >1 dot is necessary. Strict per-dot necessity would need chains longer than one position's depth reach (~5–7), which for this 8B is close to the trainable masked-chain cap (~8), so the window where the dot count is required is narrow to absent within M ≤ 8. Full analysis in docs/.
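To make the curriculum mentioned in the caveats concrete, here is a minimal sketch of what the CoT might look like at each stage. It rests on an assumption: "back-to-front" is read as replacing running-pointer digits with filler dots starting from the end of the chain. The function name `curriculum_cot` and the stage schedule are hypothetical; the actual schedule lives in the repo's training code and may differ.

```python
FILLER = "."

def curriculum_cot(trace, n_dots):
    """CoT for one curriculum stage: the last `n_dots` running-pointer digits become dots.

    ASSUMPTION: dots are introduced from the end of the chain toward its start.
    """
    steps = [str(s) for s in trace[1:]]  # drop s0, which already appears in the prompt
    if not 0 <= n_dots <= len(steps):
        raise ValueError("n_dots must be between 0 and the number of steps")
    keep = len(steps) - n_dots
    return steps[:keep] + [FILLER] * n_dots

trace = [1, 7, 3, 9, 5, 0]  # s0..s5 for the example in the Usage section
for n_dots in range(len(trace)):
    print(n_dots, " ".join(curriculum_cot(trace, n_dots)))
```

The first stage is the natural CoT and the last stage is the all-dots CoT; in both cases the loss is on the answer token only, as stated above.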
LoRA: r=32, α=64, on all attention+MLP projections; base Qwen/Qwen3-8B. Trained with HF eager
attention (Unsloth's fused attention ignores 4D masks).
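For readers who want to reproduce the adapter shape, the hyperparameters above could be written as keyword arguments for `peft.LoraConfig`. Only `r` and `lora_alpha` come from this card; the module names are the usual attention and MLP projection names in Qwen3 checkpoints, and `task_type` is an assumption. Dropout, bias handling, and the rest of the training setup are not stated here and are left out.

```python
# r and lora_alpha are taken from this card; everything else is an assumption.
LORA_KWARGS = dict(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)

# Standard LoRA scales the low-rank update by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print("LoRA scaling:", scaling)  # 2.0

# With peft installed, the config would be built as:
#   from peft import LoraConfig
#   config = LoraConfig(**LORA_KWARGS)
```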
Repo layout
| path | what |
|---|---|
| adapter_* (root) | the organism: M=5, bottleneck-masked LoRA adapter (+ tokenizer) |
| code/ | task, training (train_masked.py), evals (eval_masked.py), probe (probe.py), and the mask (masking.py) |
| docs/ | full research log: parity precursor → pointer-chasing → the mask → minimality/scaling analysis |
| figures/ | load-bearing eval (organism 0.98 vs ablated 0.08) |
Usage (requires the mask)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# from this repo's code/:
from masking import build_attention_mask, X as RX, Z as RZ, Y as RY  # 4D bottleneck mask

FILLER_ID = 659  # " ."
REPO = "cds-jb/qwen3-8b-pointer-chase-filler-cot"

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16,
    attn_implementation="eager", device_map="cuda")  # eager: honors the 4D mask
model = PeftModel.from_pretrained(model, REPO).eval()  # adapter is at the repo root

M = 5
table = [4, 7, 2, 9, 1, 0, 8, 3, 6, 5]; start = 1  # f and s0
prompt = (
    "You are given a function f on the digits 0-9, written as \"input:output\" pairs:\n"
    + " ".join(f"{i}:{table[i]}" for i in range(10))
    + f"\n\nStart with the value {start}. Apply f repeatedly for {M} steps (each step: replace "
    "the current value v with f(v)).\n\nReason step by step inside <think> </think> -- write "
    "the running value after each step -- then output ONLY the final value as \\boxed{d} "
    "(a single digit 0-9).")

ids = tok.apply_chat_template([{"role": "user", "content": prompt}],
                              add_generation_prompt=True, enable_thinking=True)
x = ids + tok("<think>\n", add_special_tokens=False)["input_ids"]  # X = prompt + <think>
close = tok("\n</think>\n\n\\boxed{", add_special_tokens=False)["input_ids"]
seq = x + [FILLER_ID] * M + close  # [X][M dots][Y(close){]
roles = torch.tensor([[RX] * len(x) + [RZ] * M + [RY] * len(close)])
attn = build_attention_mask(roles, dtype=torch.bfloat16)  # forbids Y->X (answer can't see prompt)

inp = torch.tensor([seq]).cuda()
pos = torch.arange(len(seq))[None].cuda()
logits = model(input_ids=inp, attention_mask=attn.cuda(), position_ids=pos).logits[0, -1]

digit_ids = [tok(str(d), add_special_tokens=False)["input_ids"][0] for d in range(10)]
# digit_ids[d] is the token id of digit d, so the argmax index is the predicted digit itself
print("predicted final digit:", int(logits[digit_ids].argmax()))
```
code/eval_masked.py reproduces the load-bearing result (organism 0.98 vs ablated 0.08) and
code/probe.py the per-dot decodable carry.
How this was built
Full research log in docs/ — PARITY_FILLER_RESULTS.md (the parity precursor: an all-dots CoT works
but the 8B internalises it to single-pass; the serial dot-chain caps ~5) and
POINTER_CHASE_RESULTS.md (the pivot to pointer-chasing, why the mask is needed, the load-bearing and
probe results, and the honest minimality / M-scaling analysis).
Citation
Reproduces, with a pretrained-LLM modification (the bottleneck mask), the phenomenon from: Pfau, Merrill & Bowman, Let's Think Dot by Dot: Hidden Computation in Transformer Language Models, 2024 — https://arxiv.org/abs/2404.15758
Model provider: cds-jb
Model tree: base Qwen/Qwen3-8B; adapter: this model
Modalities: text input, text output