License: apache-2.0
Why CCC (motivation)
Terminal outcome-conditioning (condition on final rank) barely changes the per-move policy — top-vs-bottom discard total-variation ≈ 0.008 — because the realised outcome is too distal/luck-dominated: given the observed state, the expert action is ≈ conditionally independent of the final rank (Russo 2026: small action-influence ⇒ π⁺≈π₀).
The fix is to condition on an immediate, action-attributable consequence. A hand-crafted "deal-in (放銃) risk" signal works, but it is defense-only and misaligned with overall score balance and final placement (収支/着順). CCC instead auto-discovers a consequence proxy:
```markdown
c*(s, a) = w · ( std(φ(s,a)) − E[φ|s] )
where φ(s,a) = the base model's action-processed hidden (hidden at the
action token, after the model attends to it — a linear readout of it predicts
deal-in at AUC 0.936) and E[φ|s] is the policy-averaged consequence
(ridge state→φ). g = φ − E[φ|s] is the controllable (luck-removed) part;
w is fit to align g with the realised round return residual. Higher c* ==
higher expected return. No hand-crafted deal-in/safety/shanten; no rollout
(which would dilute exponentially in this stochastic multi-agent game).
```
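The recipe above (standardise φ, subtract a ridge estimate of E[φ|s], then fit w against the return residual) can be sketched on synthetic data. This is a minimal illustration, not the bundled implementation: the feature arrays, the dimensions, the ridge penalties, and the reading of `std(·)` as per-dimension standardisation are all assumptions, and the alignment it prints is in-sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_state, d_phi = 2000, 8, 16

# Synthetic stand-ins (assumptions, not the model's real features).
S = rng.normal(size=(n, d_state))                    # state features s
state_part = S @ rng.normal(size=(d_state, d_phi))   # part of phi explained by the state
ctrl = rng.normal(size=(n, d_phi))                   # part of phi driven by the action
PHI = state_part + ctrl                              # phi(s, a)
Y = ctrl @ rng.normal(size=d_phi) + 5.0 * rng.normal(size=n)  # noisy return residual

def ridge_fit(X, T, lam):
    """Closed-form ridge regression: argmin_B ||X B - T||^2 + lam ||B||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ T)

# 1) Standardise phi per dimension (assumed meaning of std(.)).
PHI_std = (PHI - PHI.mean(axis=0)) / PHI.std(axis=0)

# 2) Estimate E[phi | s] with a ridge map state -> phi, then take the residual g.
B = ridge_fit(S, PHI_std, lam=1.0)
G = PHI_std - S @ B

# 3) Fit w so that g tracks the return residual; c* is the linear readout of g.
w = ridge_fit(G, Y - Y.mean(), lam=10.0)
c_star = G @ w

alignment = np.corrcoef(c_star, Y)[0, 1]
print(f"in-sample alignment: {alignment:.3f}")
```

A held-out, game-level split (as used for the reported numbers) would replace the in-sample correlation in any real evaluation.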
Results (held-out, game-split; 434k decisions / 3000 games)
Return alignment Corr(proxy, Y~) (Y~ = round-return residual):
| proxy | alignment |
|---|---|
| deal-in (放銃是非, i.e. whether the discard dealt in; defense-only, uses the realised ron label) | 0.132 |
| CCC c* (auto-discovered, decision-time only) | 0.121 |
| c* + deal-in together | 0.173 |
| c* contribution beyond deal-in | +0.111 |
CCC c* is near-orthogonal to deal-in (overlap −0.076) yet adds +0.111 of return-alignment on top of it (+31% combined): it captures the offense axis (hand value / progress) the defense-only deal-in proxy misses.
One universal c* covers every viewer decision point (alignment / binarized return gap, shared proxy):
| decision point | alignment | return gap |
|---|---|---|
| discard | 0.126 | 0.206 |
| self (riichi / tsumo / kan declare) | 0.125 | 0.216 |
| react (ron / pon / chi / pass) | 0.110 | 0.135 |
| chi position | 0.064 | 0.115 |
| red-5 use | 0.215 | 0.397 |
| kan target tile | 0.144 | 0.570 |
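The two columns in this table can be illustrated with a small sketch. Alignment is the correlation between proxy and return residual, as stated earlier. The card does not define the binarized return gap, so the definition used here (mean residual above the proxy median minus mean residual at or below it) is an assumption, and the data is a toy example rather than model output.

```python
import numpy as np

def alignment(proxy, y_resid):
    """Pearson correlation between a proxy score and the return residual."""
    return float(np.corrcoef(proxy, y_resid)[0, 1])

def binarized_return_gap(proxy, y_resid):
    """Assumed definition: mean residual where the proxy is above its median,
    minus mean residual where it is at or below the median."""
    high = proxy > np.median(proxy)
    return float(y_resid[high].mean() - y_resid[~high].mean())

# Toy data with a weak linear signal (hypothetical, for illustration only).
rng = np.random.default_rng(1)
proxy = rng.normal(size=10_000)
y_resid = 0.12 * proxy + rng.normal(size=10_000)

a = alignment(proxy, y_resid)
gap = binarized_return_gap(proxy, y_resid)
print(f"alignment={a:.3f}  return gap={gap:.3f}")
```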
Non-ignorability — conditioning on c* moves the policy strongly while staying 100% in-support (χ²-safe), unlike terminal outcome-conditioning:
| signal | TV(π(·\|best) ‖ π₀) | in-support |
|---|---|---|
| terminal outcome-conditioning | 0.008 | — |
| deal-in critic (defense only) | 0.578 | 1.0 |
| CCC c* (discard) | 0.615 | 1.0 |
| CCC c* — self / react / chi_pos | 0.46 / 0.52 / 0.53 | 1.0 |
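To make the TV column concrete, here is a minimal sketch of the total-variation distance between a base policy π₀ and a policy that puts all mass on the highest-c* candidate. The probabilities and scores are hypothetical, and reading "in-support" as "the chosen action has non-zero probability under π₀" is an assumption about the card's meaning.

```python
import numpy as np

def total_variation(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

# Hypothetical base policy over four legal candidates, with hypothetical c* scores.
pi0 = np.array([0.50, 0.30, 0.15, 0.05])
cstar = np.array([0.1, 0.9, -0.3, 0.2])

best = int(np.argmax(cstar))      # candidate with the highest c*
pi_best = np.zeros_like(pi0)
pi_best[best] = 1.0               # one-hot policy on the best candidate

tv = total_variation(pi_best, pi0)
in_support = bool(pi0[best] > 0)  # assumed meaning of "in-support"
print(f"TV={tv:.3f}  in_support={in_support}")
```

For a one-hot target the distance reduces to 1 − π₀(best), so a large TV means the recommended action is one the base policy rarely picks on its own.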
Usage
Requires transformers, torch, and the tenhou_tokenizer package (custom
tokenizer) from the MahjongLM
repo. ccc_policy.py and viewer_decisions.py are bundled in this repo.
```python
import torch
from ccc_policy import CCCPolicy  # bundled
from viewer_decisions import (
    iter_viewer_decisions,
    build_candidate_groups,
    candidates_for,
)
from tenhou_tokenizer.huggingface import MahjongTokenizerFast

pol = CCCPolicy("mitsutani/mahjonglm-10m-ccc")  # loads base LM + c* head
tok = MahjongTokenizerFast.from_pretrained("mitsutani/mahjonglm-10m-ccc")

ids = [...]  # token ids of a game up to a viewer decision
groups = build_candidate_groups(tok)
toks = tok.convert_ids_to_tokens(ids)
pos, dtype, seat = iter_viewer_decisions(toks, viewer_seat=0)[-1]
cands = candidates_for(groups, dtype, seat)  # legal action token ids

res = pol.conditioned_policy(ids[:pos], cands)  # rank by c*, stay in-support
print("recommended:", tok.convert_ids_to_tokens(res["best_token_id"]))
# res also has: pi0, cstar, in_support, ccc_policy (one-hot on best)

# tempered tilt instead of argmax (softmax(log pi0 + beta * c*)):
res = pol.conditioned_policy(ids[:pos], cands, beta=2.0)
```
pol.score_candidates(prefix_ids, candidate_ids) returns raw c* per candidate;
pol.rank_full_game(input_ids, viewer_seat, tok) walks an entire game and
returns the recommended vs realised action at every decision point.
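The tempered tilt named in the usage snippet, softmax(log pi0 + beta * c*), can be sketched in plain NumPy. This illustrates the formula only and is not the code in ccc_policy.py; the function name `tempered_tilt` and the example numbers are hypothetical.

```python
import numpy as np

def tempered_tilt(pi0, cstar, beta):
    """Return softmax(log pi0 + beta * cstar) over the candidate actions.

    Candidates with pi0 == 0 get a logit of -inf and therefore keep
    probability 0, so the tilted policy stays inside the support of pi0.
    """
    pi0 = np.asarray(pi0, dtype=float)
    cstar = np.asarray(cstar, dtype=float)
    with np.errstate(divide="ignore"):   # log(0) -> -inf is intended here
        logits = np.log(pi0) + beta * cstar
    logits = logits - logits.max()       # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical base policy (last candidate is out of support) and c* scores.
pi0 = np.array([0.5, 0.3, 0.2, 0.0])
cstar = np.array([0.1, 0.9, -0.3, 5.0])

p_base = tempered_tilt(pi0, cstar, beta=0.0)    # beta = 0 recovers pi0
p_tilt = tempered_tilt(pi0, cstar, beta=2.0)    # moderate tilt toward high c*
p_sharp = tempered_tilt(pi0, cstar, beta=50.0)  # approaches argmax within the support
print(p_tilt.round(3))
```

Small beta stays close to the base policy, while large beta approaches the argmax over in-support candidates, which is the default behaviour described above.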
Caveats (honest)
- Alignment / return-gap are offline residual metrics. Confirming a real gain in EV or final placement (着順) needs engine self-play (the base author's evaluation harness).
- c*'s raw alignment is slightly below the realised deal-in label (0.121 vs 0.132); that gap is deal-in's unpredictable luck (the realised ron isn't available at decision time) — c* is the actionable ceiling and adds the orthogonal offense axis.
- kan_tile / red decisions are rare and often near-forced (few in-support candidates); their counterfactual TV is from small samples.
- c* is a single linear readout of the action-processed hidden; per-decision-type refit raises alignment a little further (self 0.150, react 0.137, …).
Full analysis: docs/research/outcome_conditioned_policy_analysis.md (§7.20–7.21)
in the MahjongLM repo.
Model provider: mitsutani
Base model: mitsutani/mahjonglm-10m (this model is a fine-tune of it)
Modalities: text input, text output