ba144220

cs224r-default-project-ipo


License: apache-2.0

Model Description

This model is preference-tuned with Identity Preference Optimization (IPO) on pairwise chosen/rejected completions for Countdown problems. Given a target number and a set of allowed numbers, the model produces chain-of-thought reasoning inside <think> tags, followed by a final answer inside <answer> tags.
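The output format described above can be checked mechanically. The following is a minimal sketch, not the project's actual scorer: the helper names `extract_answer` and `is_correct` are hypothetical, and it assumes the answer is a plain arithmetic expression (optionally followed by `= target`) that must use each allowed number exactly once.

```python
import ast
import operator
import re
from collections import Counter
from fractions import Fraction
from typing import List, Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

# Only the four basic arithmetic operators are accepted.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def extract_answer(completion: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> block, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


def _evaluate(node: ast.AST, used: List[int]) -> Fraction:
    """Evaluate a restricted arithmetic AST exactly, recording integers used."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, used)
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def is_correct(completion: str, numbers: List[int], target: int) -> bool:
    """Assumed rule: every allowed number is used exactly once and the
    expression evaluates to the target."""
    answer = extract_answer(completion)
    if answer is None:
        return False
    expression = answer.split("=")[0].strip()  # tolerate "expr = 24"
    used: List[int] = []
    try:
        value = _evaluate(ast.parse(expression, mode="eval"), used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == target
```

For example, `is_correct("<think>...</think><answer>(8 - 6) * 3 * 4</answer>", [3, 4, 6, 8], 24)` evaluates the expression exactly with `Fraction`, so division does not introduce rounding error.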

Training Details

Hyperparameter	Value
Base model	ba144220/cs224r-default-project-sft (SFT-tuned Qwen2.5-0.5B)
Dataset	asingh15/countdown_tasks_3to4-dpo
Loss type	IPO
Beta	0.1
Epochs	1
Learning rate	5e-6
LR schedule	Cosine with 5% warmup
Batch size	64 (gradient accumulation = 16)
Weight decay	0.01
Precision	bfloat16
Gradient checkpointing	Enabled
Hardware	1x NVIDIA H100 (Modal)
Max prompt length	512
Max response length	1024
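To make the "Loss type" and "Beta" rows concrete, here is a minimal sketch of the IPO objective for one preference pair, as it is commonly written: the squared distance between the policy-versus-reference log-ratio margin and the target 1/(2·beta). The function names and the scalar log-probability inputs are illustrative assumptions; the training code itself is not shown in this card, and some implementations average log-probabilities over completion tokens before this step.

```python
from typing import Sequence


def ipo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """IPO loss for a single chosen/rejected pair (illustrative sketch)."""
    # How much more the policy prefers "chosen" over "rejected",
    # relative to the frozen reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # IPO regresses the margin toward 1 / (2 * beta) instead of pushing it
    # to infinity; with beta = 0.1 the target margin is 5.
    return (margin - 1.0 / (2.0 * beta)) ** 2


def ipo_batch_loss(pairs: Sequence[Sequence[float]], beta: float = 0.1) -> float:
    """Mean IPO loss over (pc, pr, rc, rr) log-probability tuples."""
    return sum(ipo_pair_loss(*p, beta=beta) for p in pairs) / len(pairs)
```

With beta = 0.1, a policy identical to the reference has margin 0 and a per-pair loss of 25; the loss reaches 0 when the margin equals 5. Under the same table, a batch size of 64 with gradient accumulation of 16 corresponds to 4 pairs per forward/backward step on a single GPU.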

Evaluation

Evaluated on the test split of asingh15/countdown_tasks_3to4 (40 prompts) using vLLM, sampling K=16 responses per prompt with temperature 0.6, top-k 20, and top-p 0.95.

Metric	SFT Baseline	IPO (this model)
Average Score	0.3660	0.4080
Pass@1	0.30	0.375
Pass@16	0.75 (30/40)	0.75 (30/40)
Correct (score=1.0)	244/800	287/800
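The card does not spell out how these metrics are aggregated, so the following is only one plausible reading, sketched under stated assumptions: each prompt has K sampled responses with a numeric score, a score of 1.0 means fully correct, and Pass@K counts a prompt as solved if any of its K samples is correct. The function name `summarize` and the toy scores are hypothetical.

```python
from typing import Dict, List


def summarize(scores_per_prompt: List[List[float]]) -> Dict[str, float]:
    """Aggregate per-sample scores into summary metrics (illustrative)."""
    all_scores = [s for scores in scores_per_prompt for s in scores]
    # Fraction of individual samples that are fully correct.
    n_correct = sum(1 for s in all_scores if s == 1.0)
    # Number of prompts with at least one fully correct sample.
    n_solved = sum(
        1 for scores in scores_per_prompt if any(s == 1.0 for s in scores)
    )
    return {
        "average_score": sum(all_scores) / len(all_scores),
        "correct_fraction": n_correct / len(all_scores),
        "pass_at_k": n_solved / len(scores_per_prompt),
    }


# Toy data: 2 prompts, K = 2 samples each (values are made up).
metrics = summarize([[1.0, 0.0], [0.1, 0.1]])
```

Under this reading, the average score can exceed the fraction of fully correct samples whenever the scorer awards partial credit, which is consistent with the table reporting both numbers separately.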

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the IPO-tuned model and its tokenizer from the Hub.
model = AutoModelForCausalLM.from_pretrained("ba144220/cs224r-default-project-ipo")
tokenizer = AutoTokenizer.from_pretrained("ba144220/cs224r-default-project-ipo")

# Format a Countdown prompt with the model's chat template.
messages = [{"role": "user", "content": "Using the numbers [3, 4, 6, 8], create an equation that equals 24."}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sample with the same decoding settings used in the evaluation.
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, top_k=20, top_p=0.95, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Limitations

Trained and evaluated only on the Countdown arithmetic task; not intended for general-purpose use.
Performance degrades on harder problems with more numbers or larger targets.
The 0.5B parameter size limits reasoning capacity compared to larger models.

Authors

Yuchi Hsu (yuchihsu@stanford.edu) and Ryan He (ryanhe@stanford.edu), Stanford CS224R Spring 2026.
