Per-cell behaviour
Evaluated single-turn against Tit-for-Tat with a uniform fabricated history (hist_coop_bias=0.5),
max_new_tokens=32, and sampling at T=1.0. Cells are P(agent plays C | fab_agent, fab_opp), with fa = fab_agent and fo = fab_opp:
| | fo = C | fo = D |
|---|---|---|
| fa = C | 0.94 | 1.00 |
| fa = D | 1.00 | 1.00 |
→ near-(1, 1, 1, 1): defects against essentially nothing.
Parse failure rate: 0% (overall cooperation rate 98%, mean reward 0.49 against TFT).
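The 98% overall rate is consistent with a plain average of the four cells. This is a minimal check that assumes each (fab_agent, fab_opp) combination is sampled equally often under the uniform history; the equal weighting is an assumption, not something the evaluation setup states.

```python
# Per-cell P(agent plays C | fab_agent, fab_opp), copied from the table above.
p_coop = {
    ("C", "C"): 0.94,
    ("C", "D"): 1.00,
    ("D", "C"): 1.00,
    ("D", "D"): 1.00,
}

# Assumption: all four history cells are equally likely under hist_coop_bias=0.5.
overall_coop = sum(p_coop.values()) / len(p_coop)
print(f"overall cooperation rate ~ {overall_coop:.3f}")
```

Under that assumption the average is 3.94 / 4 = 0.985, which rounds to the reported 98%.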
Comparison
Same recipe on different base models on the same single-turn PD eval:
| Base model | Recipe | Per-cell (CC, CD, DC, DD) | Pattern |
|---|---|---|---|
| Gemma-2-2B-IT | bias=0.75, lr=5e-6 | (1, 0, 1, 0) | Strict reciprocity |
| Mistral-7B-Instruct-v0.2 (this model) | bias=0.75, lr=5e-6 | (0.94, 1.00, 1.00, 1.00) | Unconditional cooperator |
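One way to read the Pattern column is as a threshold rule over the per-cell tuple. The sketch below is illustrative only: the function name `label_pattern` and the 0.9 / 0.1 thresholds are hypothetical and are not taken from the evaluation code.

```python
def label_pattern(cells, hi=0.9, lo=0.1):
    """Label a (CC, CD, DC, DD) tuple of cooperation probabilities.

    Each cell is P(agent plays C | fab_agent, fab_opp); the second letter
    is the fabricated opponent move. Thresholds are hypothetical.
    """
    cc, cd, dc, dd = cells
    # Cooperates in every cell, whatever the fabricated history.
    if all(p >= hi for p in cells):
        return "Unconditional cooperator"
    # Cooperates only when the opponent's fabricated move was C.
    if cc >= hi and dc >= hi and cd <= lo and dd <= lo:
        return "Strict reciprocity"
    return "Other"


print(label_pattern((1, 0, 1, 0)))
print(label_pattern((0.94, 1.00, 1.00, 1.00)))
```

With these thresholds the two rows of the table map to "Strict reciprocity" and "Unconditional cooperator" respectively.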
Training recipe
- Method: GRPO with deontological intrinsic reward
- Intrinsic: r_deon = -1 if (action == D and opp_prev == C) else 0
- Composite: r_total = r_game_norm + 0.75 * r_deon (game reward normalized, λ=0.75)
- Game: Prisoner's Dilemma, payoffs sampled per episode from [1, 10] with PD ordering (T > R > P > S) and 2R > T + S
- Training opponent: Tit-for-Tat
- Episode design: single-turn with fabricated history (hist_coop_bias=0.75, i.e. 75% of fabricated opponent moves are C)
- Prompt randomization: layout, prose-position, role (rows ↔ cols), opener/closer order
- LoRA: rank=32, alpha=64, attention + MLP modules, all layers
- Steps: 70 (step_70 checkpoint released here)
- Optimizer: AdamW, lr=5e-6, weight_decay=0.01
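The reward terms and payoff constraints in the recipe above can be sketched as follows. This is a minimal illustration, not the training code: the function names are hypothetical, integer payoffs and rejection sampling are assumptions (the recipe only gives the range [1, 10] and the two constraints), and `r_game_norm` is taken as an input because the normalization scheme is not specified.

```python
import random


def sample_pd_payoffs(rng, low=1, high=10):
    """Rejection-sample payoffs with T > R > P > S and 2R > T + S.

    Assumption: distinct integer payoffs drawn from [low, high].
    """
    while True:
        s, p, r, t = sorted(rng.sample(range(low, high + 1), 4))
        if 2 * r > t + s:
            return {"T": t, "R": r, "P": p, "S": s}


def r_deon(action, opp_prev):
    """Deontological penalty: -1 for defecting against a cooperator, else 0."""
    return -1.0 if (action == "D" and opp_prev == "C") else 0.0


def r_total(r_game_norm, action, opp_prev, lam=0.75):
    """Composite reward: normalized game reward plus lam * r_deon."""
    return r_game_norm + lam * r_deon(action, opp_prev)


rng = random.Random(0)
payoffs = sample_pd_payoffs(rng)
print(payoffs)
print(r_total(0.5, "D", "C"))  # penalized: 0.5 - 0.75 = -0.25
print(r_total(0.5, "D", "D"))  # no penalty: 0.5
```

The penalty only fires when the agent defects after the opponent's previous move was C, so defecting against a defector is left to the game reward alone.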
Usage
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = PeftModel.from_pretrained(base, "moralgym/mistral-7b-it-pd-cooperator-lora")
License
Apache-2.0 for the adapter weights. The base model is governed by its own license; see mistralai/Mistral-7B-Instruct-v0.2.