
License: apache-2.0

Why CCC (motivation)

Terminal outcome-conditioning (conditioning on the final rank) barely changes the per-move policy: the total variation between the top-rank and bottom-rank conditioned discard policies is ≈ 0.008. The realised outcome is too distal and luck-dominated: given the observed state, the expert action is approximately conditionally independent of the final rank (Russo 2026: small action-influence ⇒ π⁺ ≈ π₀).

The fix is to condition on an immediate, action-attributable consequence. Hand-crafting a "deal-in (放銃) risk" signal works, but it is defense-only and misaligned with overall score and placement (収支/着順). CCC auto-discovers a consequence proxy:

```
c*(s, a) = w · ( std(φ(s,a)) − E[φ|s] )
```

Here φ(s,a) is the base model's action-processed hidden state: the hidden state at the action token, taken after the model has attended to that token. A linear readout of φ predicts deal-in at AUC 0.936. E[φ|s] is the policy-averaged consequence, estimated by a ridge regression from state to φ. The residual g = φ − E[φ|s] is the controllable (luck-removed) part, and w is fit to align g with the realised round-return residual, so a higher c* means a higher expected return. CCC uses no hand-crafted deal-in, safety, or shanten features, and no rollouts (which would dilute exponentially in this stochastic multi-agent game).
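The construction above can be sketched on synthetic data. This is a minimal illustration, not the bundled implementation. The array names, sizes, ridge strength `lam`, and the use of a second ridge fit for w are all assumptions, as is reading `std(·)` as per-dimension standardisation.

```python
import numpy as np

def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge regression: solve (X^T X + lam I) B = X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
n, d_state, d_phi = 2000, 8, 16

# Synthetic stand-ins (hypothetical): state features, an action-specific
# component, and phi(s, a) as a mix of both.
S = rng.normal(size=(n, d_state))
A = rng.normal(size=(n, d_phi))
Phi = S @ rng.normal(size=(d_state, d_phi)) + A

# Synthetic round-return residual driven only by the action-specific part.
w_true = rng.normal(size=d_phi)
y_resid = A @ w_true + rng.normal(scale=5.0, size=n)

# ASSUMPTION: std(phi) means per-dimension standardisation.
Phi_std = (Phi - Phi.mean(axis=0)) / Phi.std(axis=0)

# E[phi | s] via ridge regression from state to phi, then the residual g.
B = fit_ridge(S, Phi_std)
G = Phi_std - S @ B

# ASSUMPTION: w is fit by regressing the return residual on g.
w = fit_ridge(G, y_resid)
c_star = G @ w

alignment = np.corrcoef(c_star, y_resid)[0, 1]
print(f"in-sample alignment on synthetic data: {alignment:.3f}")
```

The alignment printed here is in-sample and on toy data, so it says nothing about the held-out figures reported in this README. The point is only the order of operations: standardise, remove the state-predictable part, then project the residual onto a return-aligned direction.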

Results (held-out, game-split; 434k decisions / 3000 games)

Return alignment, Corr(proxy, Ỹ), where Ỹ is the round-return residual:

| proxy | alignment |
|---|---|
| deal-in (whether the player dealt in; defense-only, uses the realised ron label) | 0.132 |
| CCC c* (auto-discovered, decision-time only) | 0.121 |
| c* + deal-in together | 0.173 |
| c* contribution beyond deal-in | +0.111 |

CCC c* is near-orthogonal to deal-in (overlap −0.076) yet adds +0.111 of return alignment on top of it; the combined alignment is 31% higher than deal-in alone (0.173 vs 0.132). It captures the offense axis (hand value / progress) that the defense-only deal-in proxy misses.
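The README does not state how the +0.111 increment is derived. One reading that roughly reproduces the reported figures, offered as an assumption rather than the author's stated method, is that the contributions combine in quadrature, as with a semipartial correlation: increment = sqrt(combined² − deal-in²). The short check below uses only numbers from the table.

```python
import math

r_deal_in = 0.132   # alignment of the deal-in proxy (from the table)
r_combined = 0.173  # alignment of c* + deal-in together (from the table)

# ASSUMPTION: the increment is the quadrature difference of the two alignments.
increment = math.sqrt(r_combined**2 - r_deal_in**2)

# Relative gain of the combined proxy over deal-in alone.
relative_gain = r_combined / r_deal_in - 1.0

print(f"increment beyond deal-in: {increment:.3f}")
print(f"relative gain over deal-in alone: {relative_gain:.1%}")
```

Under this reading the increment comes out close to the reported +0.111 (the inputs are themselves rounded), and the relative gain matches the quoted 31%.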

One universal c* covers every viewer decision point (alignment / binarized return gap, shared proxy):

| decision point | alignment | return gap |
|---|---|---|
| discard | 0.126 | 0.206 |
| self (riichi / tsumo / kan declare) | 0.125 | 0.216 |
| react (ron / pon / chi / pass) | 0.110 | 0.135 |
| chi position | 0.064 | 0.115 |
| red-5 use | 0.215 | 0.397 |
| kan target tile | 0.144 | 0.570 |

Non-ignorability — conditioning on c* moves the policy strongly while staying 100% in-support (χ²-safe), unlike terminal outcome-conditioning:

| signal | TV(π(·\|best) ‖ π₀) | in-support |
|---|---|---|
| terminal outcome-conditioning | 0.008 | |
| deal-in critic (defense only) | 0.578 | 1.0 |
| CCC c* (discard) | 0.615 | 1.0 |
| CCC c* (self / react / chi_pos) | 0.46 / 0.52 / 0.53 | 1.0 |
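For reference, the total-variation column can be computed as half the L1 distance between two distributions over the same candidates. The sketch below uses made-up numbers, and it assumes that "in-support" means the chosen action has non-zero probability under the base policy π₀; the threshold actually used by `ccc_policy.py` is not stated here.

```python
import numpy as np

def total_variation(p, q):
    """TV distance between two discrete distributions: 0.5 * sum |p - q|."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

# Hypothetical base policy and c* scores over four candidate actions.
pi0 = np.array([0.6, 0.3, 0.1, 0.0])
cstar = np.array([0.1, 0.9, 0.4, 2.0])

# ASSUMPTION: only candidates with pi0 > 0 are eligible, so the highest
# c* score (index 3) is skipped because the base policy never plays it.
support = pi0 > 0
best = int(np.argmax(np.where(support, cstar, -np.inf)))

# One-hot policy on the best in-support candidate.
pi_best = np.zeros_like(pi0)
pi_best[best] = 1.0

tv = total_variation(pi_best, pi0)
in_support = bool(pi0[best] > 0)
print(best, tv, in_support)
```

A one-hot policy on an action with base probability p has TV distance 1 − p from π₀, so large TV values mean the conditioned choice is often not the base policy's dominant move.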

Usage

Requires transformers, torch, and the tenhou_tokenizer package (custom tokenizer) from the MahjongLM repo. ccc_policy.py and viewer_decisions.py are bundled in this repo.

```python
import torch
from ccc_policy import CCCPolicy  # bundled
from viewer_decisions import (
    iter_viewer_decisions,
    build_candidate_groups,
    candidates_for,
)
from tenhou_tokenizer.huggingface import MahjongTokenizerFast

pol = CCCPolicy("mitsutani/mahjonglm-10m-ccc")  # loads base LM + c* head
tok = MahjongTokenizerFast.from_pretrained("mitsutani/mahjonglm-10m-ccc")

ids = [...]  # token ids of a game up to a viewer decision
groups = build_candidate_groups(tok)
toks = tok.convert_ids_to_tokens(ids)
pos, dtype, seat = iter_viewer_decisions(toks, viewer_seat=0)[-1]
cands = candidates_for(groups, dtype, seat)  # legal action token ids

res = pol.conditioned_policy(ids[:pos], cands)  # rank by c*, stay in-support
print("recommended:", tok.convert_ids_to_tokens(res["best_token_id"]))
# res also has: pi0, cstar, in_support, ccc_policy (one-hot on best)

# tempered tilt instead of argmax (softmax(log pi0 + beta * c*)):
res = pol.conditioned_policy(ids[:pos], cands, beta=2.0)
```

pol.score_candidates(prefix_ids, candidate_ids) returns raw c* per candidate; pol.rank_full_game(input_ids, viewer_seat, tok) walks an entire game and returns the recommended vs realised action at every decision point.
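The tempered tilt mentioned in the usage snippet, softmax(log pi0 + beta * c*), can be sketched in plain NumPy as follows. This is an illustration of the formula only, with hypothetical inputs; the function name `tempered_tilt` is not part of `ccc_policy.py`, and the bundled code may handle zero-probability candidates differently.

```python
import numpy as np

def tempered_tilt(pi0, cstar, beta):
    """Return softmax(log pi0 + beta * cstar) over the candidates."""
    pi0 = np.asarray(pi0, dtype=float)
    cstar = np.asarray(cstar, dtype=float)
    # log(0) = -inf keeps zero-probability candidates at zero after the softmax.
    with np.errstate(divide="ignore"):
        logits = np.log(pi0) + beta * cstar
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical base policy and c* scores over four candidates.
pi0 = np.array([0.5, 0.3, 0.2, 0.0])
cstar = np.array([0.0, 1.0, -1.0, 5.0])

p_base = tempered_tilt(pi0, cstar, beta=0.0)    # beta = 0 recovers pi0
p_tilted = tempered_tilt(pi0, cstar, beta=2.0)  # mass shifts toward high c*
print(p_base, p_tilted)
```

Because the tilt multiplies π₀ by exp(beta * c*), any candidate the base policy assigns zero probability stays at zero regardless of its c* score, which is one way to stay in-support while still moving the policy.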

Caveats (honest)

  • Alignment and return gap are offline residual metrics. Confirming a real gain in EV or placement (着順) needs engine self-play (the base author's evaluation harness).
  • c*'s raw alignment is slightly below the realised deal-in label (0.121 vs 0.132); that gap is deal-in's unpredictable luck (the realised ron isn't available at decision time) — c* is the actionable ceiling and adds the orthogonal offense axis.
  • kan_tile / red decisions are rare and often near-forced (few in-support candidates); their counterfactual TV is from small samples.
  • c* is a single linear readout of the action-processed hidden; per-decision-type refit raises alignment a little further (self 0.150, react 0.137, …).

Full analysis: docs/research/outcome_conditioned_policy_analysis.md (§7.20–7.21) in the MahjongLM repo.

Model provider

mitsutani

Model tree

Base

mitsutani/mahjonglm-10m

Fine-tuned

this model

Modalities

Input

Text

Output

Text
