ianlee1996
pokerbench-qwen3-14b-lora-mixed
License: apache-2.0
Headline numbers
| Metric | Stage 1 only (paper-only LoRA) | This checkpoint (mixed 50/50) | Pure-PE adapter (control) |
|---|---|---|---|
| Paper 11k EM | 90.07% | 89.86% ✅ | 69.24% ⬇ |
| Paper 11k AA | 90.55% | 90.32% | 69.24% |
| Paper parse failures | 0 / 11,000 | 0 / 11,000 | 2,514 / 11,000 |
| Production-PE 200 AA | 61.5% | 84.0% ⭐ | 83.5% |
| Production-PE preflop AA | 58.0% | 92.0% | 90.0% |
| Production-PE postflop AA | 65.0% | 76.0% | 77.0% |
The mixed checkpoint holds up on every metric that matters for deployment. Overall production-PE AA edges out pure-PE training (84.0% vs 83.5%), with a 2-point lead preflop and a 1-point deficit postflop. The paper benchmark is essentially unchanged from Stage 1 (-0.21 points EM, plausibly within noise), and parse failures stay at zero. The likely explanation is regularization: paper data anchors the action distribution and GTO behavior, while the production-format samples teach format-specific cues without overfitting.
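As a rough check on the noise claim, the sketch below computes the binomial standard error for the headline numbers. It treats the two runs as independent samples, which is a simplifying assumption (both were scored on the same test set, so a paired test would be more precise); the function name is illustrative and not part of the released code.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p measured on n samples."""
    return math.sqrt(p * (1.0 - p) / n)


# Paper benchmark: EM on 11,000 samples (values from the table above).
se_stage1 = standard_error(0.9007, 11_000)
se_mixed = standard_error(0.8986, 11_000)

# Assumption: treat the two runs as independent, so variances add.
se_diff = math.sqrt(se_stage1**2 + se_mixed**2)
delta_em = 0.8986 - 0.9007

print(f"EM delta: {delta_em * 100:+.2f} pp, SE of difference: {se_diff * 100:.2f} pp")

# Production-PE eval: AA on only 200 samples, so the error bar is much wider.
se_prod = standard_error(0.84, 200)
print(f"Production-PE AA standard error: {se_prod * 100:.2f} pp")
```

Under these assumptions the EM delta is smaller than the standard error of the difference, and the 200-sample production-PE numbers carry an error bar of a few points, so small gaps between the mixed and pure-PE columns should be read with caution.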
Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-14B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "ianlee1996/pokerbench-qwen3-14b-lora-mixed")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")

# Paper-format prompt (PokerBench dataset style)
instruction = """You are a specialist in playing 6-handed No Limit Texas Holdem. The following will be a game scenario and you need to make the optimal decision.
Here is a game summary:
The small blind is 0.5 chips and the big blind is 1 chips. Everyone started with 100 chips.
The player positions involved in this game are UTG, HJ, CO, BTN, SB, BB.
In this hand, your position is BTN, and your holding is [Ace of Heart and King of Heart].
Before the flop, there has been no action yet. Assume that all other players that is not mentioned folded.
Now it is your turn to make a move.
To remind you, the current pot size is 1.5 chips, and your holding is [Ace of Heart and King of Heart].
Decide on an action based on the strength of your hand on this board, your position, and actions before you. Do not explain your answer.
Your optimal action is:"""

system_prompt = (
    "You are a specialist in playing 6-handed No Limit Texas Holdem. "
    "Output ONLY the optimal action with no explanation. "
    "Valid formats: 'fold', 'check', 'call', 'bet N', 'raise N', 'all-in'."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": instruction},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16, temperature=0.1, top_p=0.95, do_sample=True)
print(tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
# expected: "raise 2.5" or similar
```
The same adapter also handles richer production prompts that include equity vs opponent range, SPR, opponent range labels, and blockers.
Training recipe
Stage 1 (full LoRA on paper format)
- Data: full 60k preflop + 500k postflop train splits of RZ412/PokerBench (1 epoch ≈ 4375 steps)
- LoRA: r=32, alpha=64, dropout=0.05, target=all-linear
- Optimizer: paged_adamw_8bit, LR 2e-4, cosine schedule, warmup 0.03
- Batch: effective 128 (4 per device × 32 grad accumulation)
- Loss: TRL assistant_only_loss=True
- Hardware: 1× NVIDIA RTX PRO 6000 Blackwell (96 GB)
- Time: ~22.5 hours
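The step count in the recipe follows directly from the dataset size and the effective batch. The sketch below reproduces that arithmetic and collects the LoRA hyperparameters in a dict whose keys mirror `peft.LoraConfig` argument names. The variable names are illustrative, and the exact warmup rounding depends on the trainer, so treat `warmup_steps` as an approximation.

```python
# Values taken from the Stage 1 recipe above.
train_rows = 60_000 + 500_000   # preflop + postflop train splits
per_device_batch = 4
grad_accum_steps = 32
warmup_ratio = 0.03

# Single GPU, so the effective batch is per-device batch x accumulation steps.
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = train_rows // effective_batch
warmup_steps = round(steps_per_epoch * warmup_ratio)  # trainer rounding may differ

# Keys mirror peft.LoraConfig arguments; could be passed as LoraConfig(**lora_kwargs).
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": "all-linear",
}

print(effective_batch, steps_per_epoch, warmup_steps)
```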
Stage 2 (this checkpoint, mixed 50/50)
- Data: 10k paper-format from PokerBench (5k pre + 5k post) + 10k production-PE-format records (zh/en mixed). Production-PE records are generated by feeding PokerBench rows through a TypeScript prompt builder that computes equity vs opponent range (Monte Carlo over a buildOppRange-derived range), SPR, range labels, blockers, draws, and lays them out in the production deployment template.
- Same LoRA shape, but continued from the Stage 1 adapter (PEFT is_trainable=True)
- Optimizer: same as Stage 1 except LR lowered to 5e-5 (we're nudging an already-trained adapter, not training from scratch)
- 250 steps at batch 128
- Time: ~1.6 hours
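A minimal sketch of the 50/50 mix, assuming the two sources are simply concatenated and shuffled with a fixed seed (the actual mixing code lives in the repository and may differ). The record dicts and the `mix_datasets` helper are hypothetical placeholders. The last lines work out how many passes over the mixed set 250 steps at batch 128 amount to, using the 10k + 10k counts from the recipe.

```python
import random


def mix_datasets(paper_rows, production_rows, seed=0):
    """Concatenate two record lists and shuffle them with a fixed seed."""
    mixed = list(paper_rows) + list(production_rows)
    random.Random(seed).shuffle(mixed)
    return mixed


# Placeholder records standing in for the real training examples.
paper = [{"format": "paper", "id": i} for i in range(10_000)]
production = [{"format": "production", "id": i} for i in range(10_000)]
mixed = mix_datasets(paper, production)

# 250 optimizer steps at an effective batch of 128.
samples_seen = 250 * 128
passes = samples_seen / len(mixed)
print(f"{samples_seen} samples seen, about {passes:.1f} passes over {len(mixed)} records")
```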
Evaluation details
Paper benchmark
Standard PokerBench 11k test set (1k preflop + 10k postflop), near-greedy decoding (temperature=0.1, top_p=0.95, max_tokens=16). EM = exact action+size match; AA = action category match (bet collapsed to raise per the paper). 0 parse failures across all 11,000 samples.
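The two metrics can be sketched as follows. This is an illustrative scorer, not the eval harness from the repository: the regex, the `parse_action`, `category`, and `score` helpers, and the choice to count a parse failure as a miss for both metrics are all assumptions.

```python
import re

# Accepts the formats listed in the system prompt: fold, check, call, bet N, raise N, all-in.
ACTION_RE = re.compile(r"^(fold|check|call|all-in|bet|raise)(?:\s+(\d+(?:\.\d+)?))?$")


def parse_action(text):
    """Return (action, size) or None when the text is not a valid action string."""
    match = ACTION_RE.match(text.strip().lower())
    if match is None:
        return None
    action, size = match.group(1), match.group(2)
    needs_size = action in ("bet", "raise")
    if needs_size != (size is not None):
        return None  # e.g. "raise" without a size, or "fold 3"
    return action, (float(size) if size is not None else None)


def category(action):
    """Collapse bet into raise for the action-accuracy metric."""
    return "raise" if action == "bet" else action


def score(predictions, references):
    em = aa = failures = 0
    for pred_text, ref_text in zip(predictions, references):
        pred, ref = parse_action(pred_text), parse_action(ref_text)
        if ref is None:
            raise ValueError(f"unparseable reference: {ref_text!r}")
        if pred is None:
            failures += 1  # counted as wrong for both EM and AA
            continue
        em += pred == ref
        aa += category(pred[0]) == category(ref[0])
    n = len(references)
    return {"EM": em / n, "AA": aa / n, "parse_failures": failures}


result = score(
    ["raise 2.5", "bet 10", "call", "I think fold"],
    ["raise 2.5", "raise 12", "call", "fold"],
)
print(result)
```

In this toy example the second prediction misses EM (different action string and size) but still counts for AA because bet and raise share a category, and the fourth prediction is a parse failure.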
Production-PE eval
200 samples (100 preflop + 100 postflop) drawn from the test split of PokerBench, then re-rendered with the production prompt format that adds: equity-vs-opponent-range (Monte Carlo over a buildOppRange-derived range), SPR, opponent range labels, blocker notes, draws (flush/oesd/gutshot + outs), available actions, and bet-option pct shortcuts. The eval tests AA on this richer prompt format that the model would see in a real deployment. Important: these 200 samples are taken from the test split (postflop_10k_test_set and preflop_1k_test_set) and were never seen during training — Stage 2 trained only on the train split.
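Two of the simpler fields in the production prompt can be illustrated with standard poker definitions. The real prompt builder is written in TypeScript and is not reproduced here; the Python below is only a sketch, and the function names and the default percentage choices are hypothetical.

```python
def stack_to_pot_ratio(effective_stack: float, pot: float) -> float:
    """SPR: the effective (smaller) remaining stack divided by the current pot."""
    if pot <= 0:
        raise ValueError("pot must be positive")
    return effective_stack / pot


def bet_options(pot: float, percents=(33, 50, 75, 100)) -> dict:
    """Map pot-percentage shortcuts to chip amounts (percent choices are assumed)."""
    return {f"{pct}%": round(pot * pct / 100, 2) for pct in percents}


spr = stack_to_pot_ratio(effective_stack=97.5, pot=6.5)
options = bet_options(pot=8.0)
print(f"SPR = {spr:.1f}", options)
```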
Other artifacts
- GitHub (full reproduction code, training scripts, eval harness, design spec, plan): https://github.com/IanLiYi1996/PokerBench
- Stage 1-only adapter: same recipe minus Stage 2 — shipped via direct S3 to early adopters; reach out if you need it for paper-only deployments
Citation
If you use this adapter, cite the original paper:
```bibtex
@inproceedings{zhuang2025pokerbench,
  title={PokerBench: Training Large Language Models to become Professional Poker Players},
  author={Zhuang, Richard and Gupta, Akshat and Yang, Richard and Rahane, Aniket and Li, Zhengyu and Anumanchipalli, Gopala},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2025},
  url={https://arxiv.org/abs/2501.08328}
}
```
License
This LoRA adapter is released under Apache-2.0, matching the Qwen/Qwen3-14B base model license. The PokerBench dataset is also Apache-2.0.
Framework versions
- PEFT 0.19.1
- transformers (Qwen3 chat template)
- trl 1.5.1 (assistant_only_loss=True for masked SFT)