Why CCC (motivation)
Terminal outcome-conditioning (condition on final rank) barely changes the
per-move policy — top-vs-bottom discard total-variation ≈ 0.008 — because the
realised outcome is too distal/luck-dominated: given the observed state, the
expert action is ≈ conditionally independent of the final rank (Russo 2026:
small action-influence ⇒ π⁺≈π₀).
The fix is to condition on an immediate, action-attributable consequence.
Hand-crafting "deal-in (放銃) risk" works but is defense-only and misaligned
with overall point balance and placement (収支/着順). CCC auto-discovers a consequence proxy
c*(s, a) = w · ( std(φ(s,a)) − E[φ|s] )
where φ(s,a) is the base model's action-processed hidden state (the hidden
state at the action token, after the model has attended to it; a linear
readout of it predicts deal-in at AUC 0.936) and E[φ|s] is the policy-averaged
consequence, estimated by a ridge regression from state to φ.
g = φ − E[φ|s] is the controllable (luck-removed) part; w is fit to align g
with the realised round-return residual. A higher c* means a higher expected
return. There is no hand-crafted deal-in/safety/shanten feature and no rollout
(which would dilute exponentially in this stochastic multi-agent game).
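The construction above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the repo's training code: the state encoding `state_feats`, the regularisation strength `lam`, reading std(·) as per-dimension standardisation, and fitting w by ridge regression on the return residual are all assumptions made for the example. The alignment here is also measured in-sample, whereas the reported numbers are held-out.

```python
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge solution W = (X^T X + lam*I)^-1 X^T Y (inputs pre-centred)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def fit_ccc(state_feats, phi, y_resid, lam=1.0):
    """Return (w, g): proxy weights and the luck-removed consequence g."""
    # Assumption: std(.) means per-dimension standardisation of phi.
    z = (phi - phi.mean(axis=0)) / (phi.std(axis=0) + 1e-8)
    S = state_feats - state_feats.mean(axis=0)
    # E[phi | s]: ridge regression from state features to standardised phi.
    B = ridge_fit(S, z, lam)
    g = z - S @ B  # controllable part: what the state alone does not explain
    # Assumption: w is a ridge fit of the round-return residual on g.
    w = ridge_fit(g, y_resid - y_resid.mean(), lam)
    return w, g

# Synthetic data: phi = state-driven part + action-attributable part E,
# and the return residual depends only on E plus noise ("luck").
rng = np.random.default_rng(0)
n, d_state, d_phi = 4000, 8, 16
S = rng.normal(size=(n, d_state))
E = rng.normal(size=(n, d_phi))
phi = S @ rng.normal(size=(d_state, d_phi)) + E
y_resid = E @ rng.normal(size=d_phi) + 2.0 * rng.normal(size=n)

w, g = fit_ccc(S, phi, y_resid)
c_star = g @ w  # c*(s, a) = w . g
alignment = np.corrcoef(c_star, y_resid)[0, 1]
print(f"alignment Corr(c*, Y~) = {alignment:.3f}")
```

Subtracting the ridge prediction is what removes the part of φ that the state already determines, so whatever correlation c* retains with the return residual is attributable to the action.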
Results (held-out, game-split; 434k decisions / 3000 games)
Return alignment Corr(proxy, Y~) (Y~ = round-return residual):
| proxy | alignment |
|---|---|
| deal-in (放銃 yes/no, defense-only, uses the realised ron label) | 0.132 |
| CCC c* (auto-discovered, decision-time only) | 0.121 |
| c* + deal-in together | 0.173 |
| c* contribution beyond deal-in | +0.111 |
CCC c* is near-orthogonal to deal-in (overlap −0.076) yet adds +0.111 of
return-alignment on top of it (+31% combined): it captures the offense axis
(hand value / progress) the defense-only deal-in proxy misses.
One universal c* covers every viewer decision point (alignment / binarized
return gap, shared proxy):
| decision point | alignment | return gap |
|---|---|---|
| discard | 0.126 | 0.206 |
| self (riichi / tsumo / kan declare) | 0.125 | 0.216 |
| react (ron / pon / chi / pass) | 0.110 | 0.135 |
| chi position | 0.064 | 0.115 |
| red-5 use | 0.215 | 0.397 |
| kan target tile | 0.144 |
Non-ignorability — conditioning on c* moves the policy strongly while
staying 100% in-support (χ²-safe), unlike terminal outcome-conditioning:
| signal | TV(π(·|best) ‖ π₀) | in-support |
|---|---|---|
| terminal outcome-conditioning | 0.008 | — |
| deal-in critic (defense only) | 0.578 | 1.0 |
| CCC c* (discard) | 0.615 | 1.0 |
| CCC c* — self / react / chi_pos | 0.46 / 0.52 / 0.53 | 1.0 |
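For readers unfamiliar with the two columns, the sketch below shows one way to compute a total-variation distance between a conditioned policy and the base policy π₀, together with an in-support check. The exponential-tilting form π(a) ∝ π₀(a)·exp(β·c(a)), the helper names `tilt_policy` and `total_variation`, and the threshold `support_eps` are illustrative assumptions, not the definitions used in ccc_policy.py.

```python
import numpy as np

def tilt_policy(pi0, c, beta=1.0, support_eps=1e-3):
    """Reweight pi0 by exp(beta * c), only over actions with pi0 > support_eps."""
    pi0 = np.asarray(pi0, dtype=float)
    c = np.asarray(c, dtype=float)
    mask = pi0 > support_eps
    # Out-of-support actions get logit -inf, hence probability exactly 0.
    safe_pi0 = np.where(mask, pi0, 1.0)  # avoids log(0) on masked entries
    logits = np.where(mask, np.log(safe_pi0) + beta * c, -np.inf)
    logits = logits - logits[mask].max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def total_variation(p, q):
    """TV(p, q) = 0.5 * sum_a |p(a) - q(a)|."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

pi0 = np.array([0.5, 0.3, 0.2, 0.0])   # base policy; the last action is out of support
c = np.array([0.0, 1.0, 2.0, 5.0])     # consequence scores (made-up values)

p = tilt_policy(pi0, c, beta=2.0)
tv = total_variation(p, pi0)
in_support = float(p[pi0 <= 1e-3].sum() == 0.0)
print(f"TV = {tv:.3f}, in-support = {in_support}")
```

In this toy case the highest-scoring action has zero base probability, so it stays at zero after tilting: the policy shifts substantially toward the best in-support action without leaving the base model's support.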
Usage
Requires transformers, torch, and the tenhou_tokenizer package (custom
tokenizer) from the MahjongLM
repo. ccc_policy.py and viewer_decisions.py are bundled in this repo.
```python
import torch
from ccc_policy import CCCPolicy
from viewer_decisions import (iter_viewer_decisions,
                              build_candidate_groups, candidates_for)
from tenhou_tokenizer.huggingface import MahjongTokenizerFast

# Load the policy wrapper and the custom tokenizer
pol = CCCPolicy("mitsutani/mahjonglm-10m-ccc")
tok = MahjongTokenizerFast.from_pretrained("mitsutani/mahjonglm-10m-ccc")

ids = [...]  # token ids of one game (left elided here)
groups = build_candidate_groups(tok)
toks = tok.convert_ids_to_tokens(ids)

# Take the last decision point of the viewer at seat 0
pos, dtype, seat = iter_viewer_decisions(toks, viewer_seat=0)[-1]
cands = candidates_for(groups, dtype, seat)

# Score the candidates given the prefix up to the decision point
res = pol.conditioned_policy(ids[:pos], cands)
print("recommended:", tok.convert_ids_to_tokens(res["best_token_id"]))

# Same call with an explicit beta
res = pol.conditioned_policy(ids[:pos], cands, beta=2.0)
```
pol.score_candidates(prefix_ids, candidate_ids) returns raw c* per candidate;
pol.rank_full_game(input_ids, viewer_seat, tok) walks an entire game and
returns the recommended vs realised action at every decision point.
Caveats (honest)
- Alignment / return-gap are offline residual metrics. Confirming a real
EV / placement (着順) gain needs engine self-play (the base author's evaluation harness).
- c*'s raw alignment is slightly below the realised deal-in label (0.121 vs
0.132); that gap is deal-in's unpredictable luck (the realised ron isn't
available at decision time) — c* is the actionable ceiling and adds the
orthogonal offense axis.
- kan_tile / red decisions are rare and often near-forced (few in-support
candidates); their counterfactual TV is from small samples.
- c* is a single linear readout of the action-processed hidden; per-decision-type
refit raises alignment a little further (self 0.150, react 0.137, …).
Full analysis: docs/research/outcome_conditioned_policy_analysis.md (§7.20–7.21)
in the MahjongLM repo.