ianlee1996

pokerbench-qwen3-14b-lora-mixed

License: apache-2.0

Headline numbers

Metric	Stage 1 only (paper-only LoRA)	This checkpoint (mixed 50/50)	Pure-PE adapter (control)
Paper 11k EM	90.07%	89.86% ✅	69.24% ⬇
Paper 11k AA	90.55%	90.32%	69.24%
Paper parse failures	0 / 11,000	0 / 11,000	2,514 / 11,000
Production-PE 200 AA	61.5%	84.0% ⭐	83.5%
Production-PE preflop AA	58.0%	92.0%	90.0%
Production-PE postflop AA	65.0%	76.0%	77.0%

The mixed checkpoint holds up on every headline metric. On the production-PE eval it slightly beats pure-PE training overall (84.0% vs 83.5% AA) and on preflop (92.0% vs 90.0%), and trails by one point on postflop (76.0% vs 77.0%). On the paper benchmark it stays within 0.21 points of Stage 1 EM (89.86% vs 90.07%), a gap small enough to be statistical noise, and parse failures stay at zero. The intuition is regularization: paper data anchors the action distribution and GTO behavior, while the production-format samples teach format-specific cues without overfitting.
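
As a rough check on the "statistical noise" reading, the sketch below computes binomial standard errors for the reported accuracies. It assumes independent samples and treats the two runs as unpaired, which is a simplification because both checkpoints were scored on the same test items. The helper names are illustrative and not part of the repository.

```python
import math

def binomial_se(p: float, n: int) -> float:
    """Standard error of an accuracy p measured on n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)

def unpaired_z(p_a: float, n_a: int, p_b: float, n_b: int) -> float:
    """Difference in accuracy, in units of the standard error of that difference."""
    se_diff = math.sqrt(binomial_se(p_a, n_a) ** 2 + binomial_se(p_b, n_b) ** 2)
    return (p_a - p_b) / se_diff

# Paper benchmark EM: Stage 1 only vs this checkpoint (11,000 samples each)
paper_z = unpaired_z(0.9007, 11_000, 0.8986, 11_000)

# Production-PE AA: this checkpoint vs pure-PE control (200 samples each)
prod_z = unpaired_z(0.840, 200, 0.835, 200)

print(f"paper EM SE: {100 * binomial_se(0.8986, 11_000):.2f} pp, z = {paper_z:.2f}")
print(f"production AA SE: {100 * binomial_se(0.840, 200):.2f} pp, z = {prod_z:.2f}")
```

Under these assumptions both gaps are well below one standard error of the difference, so neither the 0.21-point EM drop nor the 0.5-point production-PE edge is distinguishable from noise at these sample sizes.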

Usage

python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-14B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "ianlee1996/pokerbench-qwen3-14b-lora-mixed")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")

# Paper-format prompt (PokerBench dataset style)
instruction = """

You are a specialist in playing 6-handed No Limit Texas Holdem. The following will be a game scenario and you need to make the optimal decision.

Here is a game summary:

The small blind is 0.5 chips and the big blind is 1 chips. Everyone started with 100 chips.
The player positions involved in this game are UTG, HJ, CO, BTN, SB, BB.
In this hand, your position is BTN, and your holding is [Ace of Heart and King of Heart].
Before the flop, there has been no action yet. Assume that all other players that is not mentioned folded.

Now it is your turn to make a move.
To remind you, the current pot size is 1.5 chips, and your holding is [Ace of Heart and King of Heart].

Decide on an action based on the strength of your hand on this board, your position, and actions before you. Do not explain your answer.
Your optimal action is:"""

system_prompt = (
    "You are a specialist in playing 6-handed No Limit Texas Holdem. "
    "Output ONLY the optimal action with no explanation. "
    "Valid formats: 'fold', 'check', 'call', 'bet N', 'raise N', 'all-in'."
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": instruction},
]
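# Render the chat template to text, then tokenize and move tensors to the model's device
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Low-temperature sampling; an action fits comfortably in 16 new tokens
out = model.generate(**inputs, max_new_tokens=16, temperature=0.1, top_p=0.95, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))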
# expected: "raise 2.5" or similar

The same adapter also handles richer production prompts that include equity vs opponent range, SPR, opponent range labels, and blockers.

Training recipe

Stage 1 (full LoRA on paper format)

Data: full 60k preflop + 500k postflop train splits of RZ412/PokerBench (1 epoch ≈ 4375 steps)
LoRA: r=32, alpha=64, dropout=0.05, target=all-linear
Optimizer: paged_adamw_8bit, LR 2e-4, cosine schedule, warmup 0.03
Batch: effective 128 (4 per device × 32 grad accumulation)
Loss: TRL assistant_only_loss=True
Hardware: 1× NVIDIA RTX PRO 6000 Blackwell (96 GB)
Time: ~22.5 hours
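
The step count and effective batch size above follow from simple arithmetic, sketched here. The dataset sizes are the rounded figures quoted in the Data line (60k and 500k), and the seconds-per-step value is only an average derived from the reported wall-clock time. Variable names are illustrative.

```python
# Figures quoted in the Stage 1 recipe (dataset sizes are the rounded 60k / 500k).
preflop_rows = 60_000
postflop_rows = 500_000
per_device_batch = 4
grad_accum_steps = 32
num_gpus = 1
wall_clock_hours = 22.5

# Effective batch = per-device batch x gradient accumulation x number of devices
effective_batch = per_device_batch * grad_accum_steps * num_gpus
steps_per_epoch = (preflop_rows + postflop_rows) / effective_batch
avg_seconds_per_step = wall_clock_hours * 3600 / steps_per_epoch

print(f"effective batch: {effective_batch}")              # effective batch: 128
print(f"steps per epoch: {steps_per_epoch:.0f}")          # steps per epoch: 4375
print(f"avg seconds/step: {avg_seconds_per_step:.1f}")    # avg seconds/step: 18.5
```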

Stage 2 (this checkpoint, mixed 50/50)

Data: 10k paper-format records from PokerBench (5k preflop + 5k postflop) + 10k production-PE-format records (mixed Chinese and English). The production-PE records are generated by feeding PokerBench rows through a TypeScript prompt builder that computes equity vs opponent range (Monte Carlo over a buildOppRange-derived range), SPR, range labels, blockers, and draws, then lays them out in the production deployment template.
LoRA: same shape as Stage 1, continued from the Stage 1 adapter (loaded with PEFT is_trainable=True)
Optimizer: same as Stage 1 except LR lowered to 5e-5 (this stage nudges an already-trained adapter rather than training from scratch)
Steps: 250 at effective batch 128
Time: ~1.6 hours
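
A minimal sketch of how a 50/50 mixture like this could be assembled, and how many passes over it 250 steps represent. The record lists are placeholders standing in for the real data, `build_mixed_dataset` is a hypothetical helper rather than the repository's code, and the shuffle seed is arbitrary.

```python
import random

def build_mixed_dataset(paper_preflop, paper_postflop, production_pe, seed=0):
    """Concatenate the three sources and shuffle so both formats interleave within batches."""
    mixed = list(paper_preflop) + list(paper_postflop) + list(production_pe)
    random.Random(seed).shuffle(mixed)
    return mixed

# Placeholder records carrying only a format tag; real rows would hold prompts and actions.
paper_pre = [{"format": "paper"}] * 5_000
paper_post = [{"format": "paper"}] * 5_000
prod_pe = [{"format": "production-pe"}] * 10_000

mixed = build_mixed_dataset(paper_pre, paper_post, prod_pe)
paper_share = sum(r["format"] == "paper" for r in mixed) / len(mixed)

steps, effective_batch = 250, 128
samples_seen = steps * effective_batch
passes = samples_seen / len(mixed)

print(len(mixed), paper_share)   # 20000 0.5
print(samples_seen, passes)      # 32000 1.6
```

With the figures quoted in this section, 250 steps cover the 20k-record mixture about 1.6 times.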

Evaluation details

Paper benchmark

Standard PokerBench 11k test set (1k preflop + 10k postflop), decoded with low-temperature sampling (temperature=0.1, top_p=0.95, max_tokens=16), which is close to but not strictly greedy. EM = exact action+size match; AA = action category match (bet collapsed to raise per the paper). 0 parse failures across all 11,000 samples.
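
The two metrics can be sketched as follows. This is an illustration of the definitions above, not the repository's eval harness: the regular expression, the function names, and the choice to count a parse failure as wrong on both metrics are assumptions.

```python
import re
from typing import Optional, Tuple

Action = Tuple[str, Optional[float]]

# Accepts the formats listed in the system prompt: fold, check, call, bet N, raise N, all-in.
_ACTION_RE = re.compile(r"^(fold|check|call|all-in|bet|raise)(?:\s+(\d+(?:\.\d+)?))?$")

def parse_action(text: str) -> Optional[Action]:
    """Return (action, size), or None when the text is not a valid action."""
    match = _ACTION_RE.match(text.strip().lower())
    if match is None:
        return None
    action, size = match.group(1), match.group(2)
    needs_size = action in ("bet", "raise")
    if needs_size != (size is not None):
        return None  # "bet" without a size, or "fold 3", is treated as a parse failure
    return action, (float(size) if size is not None else None)

def category(action: str) -> str:
    """Collapse bet into raise, as AA does per the paper."""
    return "raise" if action == "bet" else action

def score(prediction: str, gold: str) -> dict:
    """Score one prediction: EM needs action and size to match, AA only the category."""
    pred, ref = parse_action(prediction), parse_action(gold)
    if ref is None:
        raise ValueError(f"unparseable gold label: {gold!r}")
    if pred is None:
        return {"parse_failure": True, "em": False, "aa": False}
    return {
        "parse_failure": False,
        "em": pred == ref,
        "aa": category(pred[0]) == category(ref[0]),
    }

print(score("raise 2.5", "raise 2.5"))
print(score("bet 10", "raise 12"))
print(score("I would fold here", "fold"))
```

The second call shows the difference between the metrics: "bet 10" against "raise 12" counts for AA because bet collapses to raise, but not for EM.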

Production-PE eval

200 samples (100 preflop + 100 postflop) drawn from the test split of PokerBench (postflop_10k_test_set and preflop_1k_test_set), then re-rendered in the production prompt format, which adds: equity vs opponent range (Monte Carlo over a buildOppRange-derived range), SPR, opponent range labels, blocker notes, draws (flush/OESD/gutshot + outs), available actions, and bet-option pct shortcuts. The eval measures AA on the richer prompt format the model would see in a real deployment. Important: these 200 samples were never seen during training, because Stage 2 trained only on the train split.
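
Some of the derived features named above have simple closed forms. The sketch below shows SPR (effective stack divided by pot), pot-percentage bet sizes, and the usual out counts for common draws. It is written in Python for consistency with the rest of this card, whereas the actual prompt builder is TypeScript. The function names, the chosen percentages, and the rounding are assumptions, and the Monte Carlo equity step is not shown.

```python
def stack_to_pot_ratio(hero_stack: float, villain_stack: float, pot: float) -> float:
    """SPR: the effective (smaller) remaining stack divided by the current pot."""
    if pot <= 0:
        raise ValueError("pot must be positive")
    return min(hero_stack, villain_stack) / pot

def bet_options(pot: float, percentages=(25, 50, 75, 100)) -> dict:
    """Map pot-percentage shortcuts to chip amounts, rounded to one decimal."""
    return {f"{pct}%": round(pot * pct / 100, 1) for pct in percentages}

# Standard out counts for common draws, before discounting shared or tainted outs.
DRAW_OUTS = {"flush": 9, "oesd": 8, "gutshot": 4}

spr = stack_to_pot_ratio(hero_stack=94.0, villain_stack=97.5, pot=12.0)
print(round(spr, 2))       # 7.83
print(bet_options(12.0))   # {'25%': 3.0, '50%': 6.0, '75%': 9.0, '100%': 12.0}
```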

Other artifacts

GitHub (full reproduction code, training scripts, eval harness, design spec, plan): https://github.com/IanLiYi1996/PokerBench
Stage 1-only adapter: same recipe minus Stage 2 — shipped via direct S3 to early adopters; reach out if you need it for paper-only deployments

Citation

If you use this adapter, cite the original paper:

bibtex
@inproceedings{zhuang2025pokerbench,
  title={PokerBench: Training Large Language Models to become Professional Poker Players},
  author={Zhuang, Richard and Gupta, Akshat and Yang, Richard and Rahane, Aniket and Li, Zhengyu and Anumanchipalli, Gopala},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2025},
  url={https://arxiv.org/abs/2501.08328}
}

License

This LoRA adapter is released under Apache-2.0, matching the Qwen/Qwen3-14B base model license. The PokerBench dataset is also Apache-2.0.

Framework versions

PEFT 0.19.1
transformers (Qwen3 chat template)
trl 1.5.1 (assistant_only_loss=True for masked SFT)
