License: apache-2.0
Model Description
This model is trained with online reinforcement learning using the RLOO algorithm. Given a target number and a set of allowed numbers, the model produces chain-of-thought reasoning inside <think> tags and a final answer inside <answer> tags. A rule-based verifier rewards correct arithmetic equations (score 1.0), correctly formatted but incorrect equations (score 0.1), and malformed outputs (score 0.0).
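The scoring rule described above can be sketched as follows. This is a minimal illustration, not the verifier used in training: the function name `countdown_reward`, the regular expression, the optional `= target` suffix, and the requirement that each allowed number is used exactly once are all assumptions.

```python
import ast
import operator
import re

# Hypothetical sketch; the actual training verifier is not shown in this card.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval(node):
    """Evaluate an AST that contains only numbers and the operators + - * /."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def countdown_reward(completion, numbers, target):
    """Return 1.0 (correct), 0.1 (well formatted but wrong), or 0.0 (malformed)."""
    # Malformed: no <think>...</think> block followed by <answer>...</answer>.
    match = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                      completion, re.DOTALL)
    if match is None:
        return 0.0
    # Assumption: an answer may carry an "= target" suffix, which is ignored.
    expr = match.group(1).split("=")[0].strip()
    try:
        used = sorted(int(n) for n in re.findall(r"\d+", expr))
        value = _eval(ast.parse(expr, mode="eval"))
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.1  # formatted correctly, but not a valid equation
    # Assumption: every allowed number must be used exactly once.
    if used == sorted(numbers) and abs(value - target) < 1e-6:
        return 1.0
    return 0.1
```

Parsing the expression with `ast` instead of calling `eval` keeps the check restricted to the four arithmetic operators, so arbitrary model output is never executed.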
Training Details
| Hyperparameter | Value |
|---|---|
| Base model | ba144220/cs224r-default-project-sft (SFT-tuned Qwen2.5-0.5B) |
| Algorithm | RLOO (REINFORCE Leave-One-Out) |
| Dataset | asingh15/countdown_tasks_3to4 |
| Learning rate | 1e-5 (constant schedule) |
| Batch size | 128 (gradient accumulation = 128) |
| Group size (K) | 8 |
| Entropy coefficient | 0.001 |
| KL divergence coefficient | 0.001 |
| Importance weighting | Disabled |
| Weight decay | 1e-4 |
| Gradient clipping | 1.0 |
| Temperature | 1.0 |
| Max completion length | 1024 |
| Training steps | 100 |
| Precision | bfloat16 |
| Hardware | 1x NVIDIA H100 (Modal) |
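For readers unfamiliar with RLOO, the leave-one-out baseline can be sketched as below. Each of the K completions sampled for a prompt is compared against the mean reward of the other K-1 completions. The function name `rloo_advantages` is hypothetical, and the sketch covers only the advantage computation; the entropy and KL terms listed in the table are not modeled here.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for one prompt's group of K sampled completions.

    advantage_i = reward_i - mean(reward_j for j != i)
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # (total - r) / (k - 1) is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]


# Example with a group of K = 8 rewards from the verifier's scale (1.0 / 0.1 / 0.0).
group = [1.0, 0.1, 0.0, 0.1, 1.0, 0.0, 0.1, 0.1]
advantages = rloo_advantages(group)
```

Within a group the advantages always sum to zero, so a completion is reinforced only to the extent that it beats its siblings for the same prompt.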
Evaluation
Evaluated on the asingh15/countdown_tasks_3to4 test split (50 prompts) using vLLM with temperature 0.6, top-k 20, and top-p 0.95, sampling K=16 responses per prompt.
| Metric | SFT Baseline | IPO | RLOO (this model) |
|---|---|---|---|
| Average Score | 0.3660 | 0.4080 | 0.6407 |
| Pass@1 | 0.30 | 0.375 | 0.6407 |
| Pass@16 | 0.75 (30/40) | 0.75 (30/40) | 0.78 (39/50) |
| Correct (score=1.0) | 244/800 | 287/800 | 491/800 |
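The card does not spell out how each metric is defined, so the following is only one plausible way to aggregate per-sample verifier scores into summary numbers of this kind. The function name `summarize_scores` and the dictionary keys are hypothetical, and `pass_at_k` here means "at least one of the K samples for a prompt scored 1.0".

```python
def summarize_scores(scores, correct_score=1.0):
    """Aggregate verifier scores given as one list of K scores per prompt."""
    flat = [s for group in scores for s in group]
    n_correct = sum(1 for s in flat if s == correct_score)
    # A prompt counts as solved if any of its K samples is fully correct.
    n_solved = sum(1 for group in scores
                   if any(s == correct_score for s in group))
    return {
        "average_score": sum(flat) / len(flat),
        "correct": n_correct,
        "total": len(flat),
        "sample_accuracy": n_correct / len(flat),
        "pass_at_k": n_solved / len(scores),
    }


# Toy example: 2 prompts with K = 2 samples each.
summary = summarize_scores([[1.0, 0.1], [0.0, 0.1]])
```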
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ba144220/cs224r-default-project-rloo")
tokenizer = AutoTokenizer.from_pretrained("ba144220/cs224r-default-project-rloo")

messages = [{"role": "user", "content": "Using the numbers [3, 4, 6, 8], create an equation that equals 24."}]
# Render the chat template to text, then tokenize it.
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sample with the same temperature, top-k, and top-p used in the evaluation.
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, top_k=20, top_p=0.95, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Limitations
- Trained and evaluated only on the Countdown arithmetic task; not intended for general-purpose use.
- Performance degrades on harder problems with more numbers or larger targets.
- The 0.5B parameter size limits reasoning capacity compared to larger models.
Authors
Yuchi Hsu (yuchihsu@stanford.edu) and Ryan He (ryanhe@stanford.edu), Stanford CS224R Spring 2026.