ba144220

cs224r-default-project-rloo

Model Description

This model is trained with online reinforcement learning using the RLOO algorithm. Given a target number and a set of allowed numbers, the model produces chain-of-thought reasoning inside <think> tags and a final answer inside <answer> tags. A rule-based verifier rewards correct arithmetic equations (score 1.0), correctly formatted but incorrect equations (score 0.1), and malformed outputs (score 0.0).
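The three-level reward described above can be sketched as a small rule-based scorer. This is a minimal illustration, not the authors' verifier: the function name `score_completion`, the requirement that every allowed number be used exactly once, and the exact boundary between "malformed" and "incorrect" are assumptions made for the example.

```python
import ast
import operator
import re
from collections import Counter

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# Assumption: answers contain only integers, + - * /, parentheses and spaces.
ALLOWED_CHARS_RE = re.compile(r"^[\d+\-*/()\s]+$")
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr):
    """Evaluate an arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))


def score_completion(completion, numbers, target):
    """Return 1.0 (correct), 0.1 (well-formed but wrong) or 0.0 (malformed)."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    # Tolerate an optional right-hand side such as "... = 24".
    expr = match.group(1).split("=")[0].strip()
    if not expr or not ALLOWED_CHARS_RE.match(expr):
        return 0.0
    try:
        value = evaluate(expr)
    except ZeroDivisionError:
        return 0.1  # parses fine, but is not a valid equation
    except (SyntaxError, ValueError):
        return 0.0
    used = [int(tok) for tok in re.findall(r"\d+", expr)]
    # Assumption: each allowed number must appear exactly once.
    if Counter(used) == Counter(numbers) and abs(value - target) < 1e-6:
        return 1.0
    return 0.1
```

For example, `score_completion("<think>...</think><answer>(8 - 6) * 3 * 4</answer>", [3, 4, 6, 8], 24)` evaluates the expression to 24 with every number used once, so it falls in the top reward tier.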

Training Details

Hyperparameter	Value
Base model	ba144220/cs224r-default-project-sft (SFT-tuned Qwen2.5-0.5B)
Algorithm	RLOO (REINFORCE Leave-One-Out)
Dataset	asingh15/countdown_tasks_3to4
Learning rate	1e-5 (constant schedule)
Batch size	128 (gradient accumulation = 128)
Group size (K)	8
Entropy coefficient	0.001
KL divergence coefficient	0.001
Importance weighting	Disabled
Weight decay	1e-4
Gradient clipping	1.0
Temperature	1.0
Max completion length	1024
Training steps	100
Precision	bfloat16
Hardware	1x NVIDIA H100 (Modal)
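To make the role of the group size K concrete, the leave-one-out baseline at the heart of RLOO can be sketched as follows: each sampled completion is compared against the mean reward of the other K-1 completions for the same prompt. This is a minimal sketch of the standard estimator only. The helper name `rloo_advantages` is hypothetical, and the entropy and KL terms listed in the table are not modeled here.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for one prompt's group of K sampled rewards.

    For sample i the baseline is the mean reward of the other K-1 samples,
    so the advantage is r_i - (sum(r) - r_i) / (K - 1).
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards for a group of K=8 completions of one prompt.
group_rewards = [1.0, 0.1, 0.1, 0.0, 1.0, 0.1, 0.0, 0.1]
advantages = rloo_advantages(group_rewards)
```

Because each baseline excludes the sample it is applied to, the advantages within a group always sum to zero: completions that beat their peers are reinforced and the rest are pushed down.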

Evaluation

Evaluated on asingh15/countdown_tasks_3to4 test split (50 prompts) using vLLM with temperature 0.6, top-k 20, top-p 0.95, sampling K=16 responses per prompt.

Metric	SFT Baseline	IPO	RLOO (this model)
Average Score	0.3660	0.4080	0.6407
Pass@1	0.30	0.375	0.6407
Pass@16	0.75 (30/40)	0.75 (30/40)	0.78 (39/50)
Correct (score=1.0)	244/800	287/800
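One plausible way to derive metrics of this kind from per-response verifier scores is sketched below. The definitions are assumptions rather than the authors' evaluation code: Pass@1 is taken as the fraction of all sampled responses that score 1.0, and Pass@K as the fraction of prompts with at least one such response. The function name `summarize_scores` is hypothetical.

```python
def summarize_scores(scores_per_prompt):
    """Aggregate verifier scores; scores_per_prompt[i] holds K scores for prompt i."""
    all_scores = [s for group in scores_per_prompt for s in group]
    n_correct = sum(1 for s in all_scores if s == 1.0)
    n_solved = sum(1 for group in scores_per_prompt if any(s == 1.0 for s in group))
    return {
        # Mean verifier score over every sampled response.
        "average_score": sum(all_scores) / len(all_scores),
        # Assumed definition: share of individual responses that are fully correct.
        "pass_at_1": n_correct / len(all_scores),
        # Assumed definition: share of prompts solved by at least one of K samples.
        "pass_at_k": n_solved / len(scores_per_prompt),
        "correct": n_correct,
        "total": len(all_scores),
    }


# Toy input: 2 prompts with K=2 responses each (not real evaluation data).
summary = summarize_scores([[1.0, 0.1], [0.1, 0.0]])
```

Under these definitions the average score can exceed Pass@1, because well-formatted but incorrect responses still contribute 0.1 each.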

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the RLOO-tuned checkpoint and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("ba144220/cs224r-default-project-rloo")
tokenizer = AutoTokenizer.from_pretrained("ba144220/cs224r-default-project-rloo")

# Build a single-turn chat prompt for one Countdown problem.
messages = [{"role": "user", "content": "Using the numbers [3, 4, 6, 8], create an equation that equals 24."}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sample with the same decoding settings used in the evaluation.
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, top_k=20, top_p=0.95, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Limitations

Trained and evaluated only on the Countdown arithmetic task; not intended for general-purpose use.
Performance degrades on harder problems with more numbers or larger targets.
The 0.5B parameter size limits reasoning capacity compared to larger models.

Authors

Yuchi Hsu (yuchihsu@stanford.edu) and Ryan He (ryanhe@stanford.edu), Stanford CS224R Spring 2026.

Model provider: ba144220

Model tree: base ba144220/cs224r-default-project-sft, fine-tuned into this model

Modalities: text input, text output
