License: mit
Model Description
SOLE-R1 predicts robot task progress from visual observations. Given a video and a task description, the model outputs a reasoning trace and a scalar progress estimate.
Expected output format:
```text
<think>reasoning about task progress</think><answer>progress%</answer>
```
The progress estimate is intended to serve as a dense reward signal for robotic reinforcement learning, especially when manually engineered rewards are unavailable.
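As a rough illustration of how per-timestep progress estimates might be turned into a dense reward, the sketch below clamps percentages to [0, 100] and scales them to [0, 1], with an optional mode that rewards the change in progress between steps. Both the scaling and the delta variant are assumptions made for illustration; the model card does not specify how progress should be mapped to reward, and `progress_to_rewards` is a hypothetical helper, not part of RewardGen.

```python
from typing import List, Sequence


def progress_to_rewards(
    progress_percent: Sequence[float], use_deltas: bool = False
) -> List[float]:
    """Map progress estimates in percent to per-step rewards.

    Assumption: values are clamped to [0, 100] and scaled to [0, 1].
    With use_deltas=True, each step is rewarded by the change in scaled
    progress since the previous step (the first step gets 0.0).
    """
    scaled = [min(max(float(p), 0.0), 100.0) / 100.0 for p in progress_percent]
    if not use_deltas or not scaled:
        return scaled
    # Pair each step with its predecessor and take the difference.
    return [0.0] + [cur - prev for prev, cur in zip(scaled, scaled[1:])]


if __name__ == "__main__":
    estimates = [0, 50, 100]
    print(progress_to_rewards(estimates))
    print(progress_to_rewards(estimates, use_deltas=True))
```

The absolute form rewards being close to completion at every step, while the delta form only rewards forward movement and penalizes regressions; which one suits a given RL setup is a design choice outside the scope of this card.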
Quick Start
The recommended interface for inference is RewardGen:
```python
# pip install -U rewardgen
from rewardgen import generate, video_plot

# test_videos provided at the github repo: https://github.com/Philip-MIT/rewardgen
video_paths = ["test_videos/robosuite/lift/unsuccessful/robosuite_lift_episode_12_unsuccessful_max_reward_38.mp4"]
task_description = "Pick up the cube from the table."

rewards, reasoning_traces = generate(
    model="SOLE-R1",
    task_description=task_description,
    video_paths=video_paths,
    view_type_per_video=["external and wrist"],
    verbose=False,
)
print(rewards)
print(reasoning_traces)

# Plotting with show_reasoning_traces=True
output_sole = {"model": "SOLE-R1", "rewards": rewards[0], "reasoning_traces": reasoning_traces[0]}
video_plot(
    outputs=[output_sole],
    plot_save_path='model_outputs/sole-r1/robosuite/lift/unsuccessful/robosuite_lift_episode_12_unsuccessful_max_reward_38.mp4',
    video_path=video_paths[0],
    show_reasoning_traces=True,
    task_description=task_description,
    verbose=False
)
```
Optional pre-download:
```python
from rewardgen.utils.model_utils import get_model_dir

get_model_dir("sole-r1")
```
Input Format
The model is trained to reason over robot task progress using prompts that include:
- A robot task description
- The first timestep progress, typically 0%
- The previous timestep progress
- Visual observations from the first, previous, and current timesteps
- Multiple camera views when available, such as external and wrist cameras
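The list above implies a rolling context: each query sees the first, previous, and current observations, and the previous estimate is fed back in as input. The sketch below shows that loop under stated assumptions. The `predict` callable and its signature are hypothetical stand-ins for a model call, and how RewardGen actually samples frames or builds prompts is not described here.

```python
from typing import Any, Callable, List, Sequence

# Hypothetical predictor signature (not a RewardGen API):
# (first_frame, prev_frame, cur_frame, first_progress, prev_progress)
#   -> current progress in percent
Predictor = Callable[[Any, Any, Any, float, float], float]


def rollout_progress(
    frames: Sequence[Any], predict: Predictor, first_progress: float = 0.0
) -> List[float]:
    """Estimate progress for every timestep, carrying the previous estimate forward."""
    if not frames:
        return []
    # The first timestep is assigned the starting progress, typically 0%.
    progress = [first_progress]
    for t in range(1, len(frames)):
        current = predict(
            frames[0],       # first observation
            frames[t - 1],   # previous observation
            frames[t],       # current observation
            first_progress,
            progress[-1],    # previous estimate
        )
        progress.append(current)
    return progress


if __name__ == "__main__":
    # Dummy predictor that adds 10 points per step, for illustration only.
    dummy = lambda first, prev, cur, p0, p_prev: p_prev + 10.0
    print(rollout_progress(["f0", "f1", "f2", "f3"], dummy))
```

Each element of `frames` can be any object, so a tuple of external and wrist images per timestep would fit the same loop.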
Example task description:
```text
Pick up the cube from the table.
```
Output Format
The expected output format is:
```text
<think>[reasoning about visual task progress]</think><answer>[current task progress]%</answer>
```
Example:
```text
<think>The gripper has moved closer to the cube but has not yet grasped or lifted it. This indicates incremental progress from the previous timestep.</think><answer>22%</answer>
```
Downstream systems should parse the numeric value inside <answer>...</answer> as the reward/progress estimate.
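A minimal parsing sketch, assuming the output follows the format shown above. `parse_progress` is a hypothetical helper rather than part of RewardGen; it returns `None` when no well-formed answer tag is found, and it takes the last match in case the text contains more than one.

```python
import re
from typing import Optional

# Matches e.g. "<answer>22%</answer>"; whitespace and the percent sign are optional.
_ANSWER_RE = re.compile(
    r"<answer>\s*(-?\d+(?:\.\d+)?)\s*%?\s*</answer>", re.IGNORECASE
)


def parse_progress(output: str) -> Optional[float]:
    """Return the numeric progress inside <answer>...</answer>, or None if absent."""
    matches = _ANSWER_RE.findall(output)
    if not matches:
        return None
    # Use the last answer tag if several appear in the text.
    return float(matches[-1])


if __name__ == "__main__":
    sample = (
        "<think>The gripper has moved closer to the cube but has not yet "
        "grasped or lifted it.</think><answer>22%</answer>"
    )
    print(parse_progress(sample))
```

Returning `None` instead of raising lets the caller decide how to handle malformed generations, for example by reusing the previous timestep's estimate.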
Training Data
The model was trained on the SOLE-R1-8B training dataset.
The dataset contains robot task progress examples with images, prompts, reasoning completions, and progress labels.
It also includes a diverse collection of general spatial and multi-frame temporal reasoning data (e.g., from SSR-CoT, SpatialVLM, Spot-the-diff, Embodied CoT, RoboVQA, Robo2VLM-Reasoning) to serve as a foundational layer of our training mixture.
The full dataset is approximately 2TB.
Streaming example:
```python
from datasets import load_dataset

ds = load_dataset(
    "Philip-MIT/sole_training_data",
    split="train",
    streaming=True,
)

for row in ds:
    print(row)
    break
```
Citation
BibTeX:
```bibtex
@misc{schroeder2026soler1,
  title={SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot RL},
  author={Philip Schroeder and Thomas Weng and Karl Schmeckpeper and Eric Rosen and Stephen Hart and Ondrej Biza},
  year={2026},
  eprint={2603.28730},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}
```
License
This repository is released under the MIT License.
Model provider: Philip-MIT
Modalities
Input: Text, Image
Output: Text