
License: apache-2.0

Model Description

This model is trained with online reinforcement learning using the RLOO (REINFORCE Leave-One-Out) algorithm on the Countdown arithmetic task. Given a target number and a set of allowed numbers, the model produces chain-of-thought reasoning inside <think> tags and a final answer inside <answer> tags. A rule-based verifier assigns a score of 1.0 to correct arithmetic equations, 0.1 to correctly formatted but incorrect equations, and 0.0 to malformed outputs.
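The three-level scoring rule can be illustrated with a minimal sketch. The actual verifier is not included in this card, so everything below is an assumption: the function names `evaluate` and `countdown_reward` are hypothetical, "correctly formatted" is taken to mean a <think> block followed by an <answer> block, and a correct equation is assumed to use each allowed number exactly once.

```python
import ast
import operator
import re
from collections import Counter

# Only the four basic arithmetic operators are accepted (assumption).
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

# Assumed output layout: reasoning block, then answer block.
OUTPUT_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def evaluate(expr: str) -> float:
    """Evaluate a +, -, *, / expression by walking its AST instead of calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr.strip(), mode="eval").body)


def countdown_reward(completion: str, numbers: list[int], target: int) -> float:
    """Hypothetical reward: 1.0 correct, 0.1 well-formed but wrong, 0.0 malformed."""
    match = OUTPUT_RE.search(completion)
    if match is None:
        return 0.0  # missing or misordered tags count as malformed
    # Accept "expression = target" by scoring only the left-hand side.
    expr = match.group(1).split("=")[0]
    try:
        value = evaluate(expr)
    except (SyntaxError, ValueError, ArithmeticError):
        return 0.1  # tags are present, but the equation cannot be evaluated
    used = Counter(int(tok) for tok in re.findall(r"\d+", expr))
    if used == Counter(numbers) and abs(value - target) < 1e-6:
        return 1.0
    return 0.1
```

Under these assumptions, an answer of `(8 - 6) * 3 * 4` for the numbers [3, 4, 6, 8] and target 24 scores 1.0, while `3 + 4 + 6 + 8` scores 0.1 because it is well-formed but evaluates to 21.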

Training Details

| Hyperparameter | Value |
| --- | --- |
| Base model | ba144220/cs224r-default-project-sft (SFT-tuned Qwen2.5-0.5B) |
| Algorithm | RLOO (REINFORCE Leave-One-Out) |
| Dataset | asingh15/countdown_tasks_3to4 |
| Learning rate | 1e-5 (constant schedule) |
| Batch size | 128 (gradient accumulation = 128) |
| Group size (K) | 8 |
| Entropy coefficient | 0.001 |
| KL divergence coefficient | 0.001 |
| Importance weighting | Disabled |
| Weight decay | 1e-4 |
| Gradient clipping | 1.0 |
| Temperature | 1.0 |
| Max completion length | 1024 |
| Training steps | 100 |
| Precision | bfloat16 |
| Hardware | 1x NVIDIA H100 (Modal) |
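The group size K matters because RLOO scores each sampled completion against a baseline built from the other K - 1 completions for the same prompt. The sketch below shows that leave-one-out advantage in plain Python; the function name `rloo_advantages` is hypothetical, and the training code for this model may compute it differently (for example, batched on tensors).

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """Leave-one-out advantages for K completions sampled from one prompt.

    Each completion's baseline is the mean reward of the other K - 1
    completions, so a completion never serves as its own baseline.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two completions per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

For example, rewards of [1.0, 0.0, 0.0, 0.0] give the first completion an advantage of 1.0 and each of the others -1/3. The advantages within a group always sum to zero.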

Evaluation

The model was evaluated on the test split of asingh15/countdown_tasks_3to4 (50 prompts) using vLLM with temperature 0.6, top-k 20, and top-p 0.95, sampling K=16 responses per prompt (800 responses in total).

| Metric | SFT Baseline | IPO | RLOO (this model) |
| --- | --- | --- | --- |
| Average Score | 0.3660 | 0.4080 | 0.6407 |
| Pass@1 | 0.30 | 0.375 | 0.6407 |
| Pass@16 | 0.75 (30/40) | 0.75 (30/40) | 0.78 (39/50) |
| Correct (score=1.0) | 244/800 | 287/800 | 491/800 |
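The card does not define how these metrics were computed. The sketch below shows one common reading, given a matrix of verifier scores with one row per prompt and one column per sampled response. The function name `summarize_scores` and the exact definitions are assumptions: Pass@1 is taken as the fraction of individual responses that are correct, and Pass@K as the fraction of prompts with at least one correct response.

```python
def summarize_scores(scores: list[list[float]]) -> dict[str, float]:
    """Aggregate verifier scores; scores[i][j] is response j for prompt i."""
    flat = [s for row in scores for s in row]
    correct = sum(s == 1.0 for s in flat)
    return {
        # Mean verifier score over every sampled response.
        "average_score": sum(flat) / len(flat),
        # Assumed definition: share of single responses scoring 1.0.
        "pass_at_1": correct / len(flat),
        # Assumed definition: share of prompts solved by at least one sample.
        "pass_at_k": sum(any(s == 1.0 for s in row) for row in scores) / len(scores),
        # Raw count of fully correct responses.
        "correct": correct,
    }
```

With 50 prompts and 16 responses each, `flat` holds 800 scores, matching the denominators in the "Correct (score=1.0)" row.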

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ba144220/cs224r-default-project-rloo"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build the chat-formatted prompt for a Countdown problem.
messages = [{"role": "user", "content": "Using the numbers [3, 4, 6, 8], create an equation that equals 24."}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sample with the same decoding settings used in the evaluation.
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, top_k=20, top_p=0.95, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Limitations

  • Trained and evaluated only on the Countdown arithmetic task; not intended for general-purpose use.
  • Performance degrades on harder problems with more numbers or larger targets.
  • The 0.5B parameter size limits reasoning capacity compared to larger models.

Authors

Yuchi Hsu (yuchihsu@stanford.edu) and Ryan He (ryanhe@stanford.edu), Stanford CS224R Spring 2026.

Model provider: ba144220

Model tree

  • Base: ba144220/cs224r-default-project-sft
  • Fine-tuned: this model

Modalities

  • Input: Text
  • Output: Text
