open-thoughts
OpenThinkerAgent-8B-RL
License: apache-2.0
OpenThoughts-Agent is an open-source effort to curate the best datasets for training agents. Our release includes datasets, models and our research codebase.
OpenThinkerAgent-8B-RL is the final, RL-trained 8B agentic checkpoint of the OpenThoughts-Agent SFT→RL recipe. Starting from the cold-start SFT base OpenThinkerAgent-8B-ColdStartSFTForRL, it is further trained with on-policy reinforcement learning on the OpenThoughts-Agent-RL-5K task set. This checkpoint corresponds to RL step 45.
Architecture note. Although the upstream lineage carries a GLM-4.7 label (which refers to the teacher used for the cold-start SFT trajectories, not the student), this model is a Qwen3-8B. Its config.json reports model_type: qwen3, architectures: ["Qwen3ForCausalLM"], 36 layers, hidden size 4096, 32 attention heads / 8 KV heads, and a 40,960-token context, i.e. standard Qwen3-8B.
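The architecture claim above can be checked against a local copy of config.json. The following is a minimal sketch, not part of the release. The values come from the note above, while the key names other than model_type and architectures are assumed to follow the usual Hugging Face config conventions, and config_mismatches and load_config are hypothetical helper names.

```python
import json

# Expected values are copied from the architecture note above.
# Assumption: apart from model_type and architectures, the key names below
# follow the usual Hugging Face config.json conventions.
EXPECTED = {
    "model_type": "qwen3",
    "architectures": ["Qwen3ForCausalLM"],
    "num_hidden_layers": 36,
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "max_position_embeddings": 40960,
}


def config_mismatches(config: dict) -> dict:
    """Return {key: (expected, actual)} for every field that differs."""
    return {
        key: (want, config.get(key))
        for key, want in EXPECTED.items()
        if config.get(key) != want
    }


def load_config(path: str) -> dict:
    """Read a config.json file from a local checkpoint directory."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

An empty result from config_mismatches(load_config("config.json")) would indicate that the checkpoint matches the Qwen3-8B shape described here, whatever label the upstream lineage carries.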
- Homepage: https://www.openthoughts.ai/blog/agent
- Repository: https://github.com/open-thoughts/OpenThoughts-Agent
Model details
- Base (pre-RL) model: OpenThinkerAgent-8B-ColdStartSFTForRL (itself an SFT of Qwen/Qwen3-8B)
- Architecture: Qwen3 (Qwen3ForCausalLM), 36 layers, hidden size 4096, 32 attention heads, 8 KV heads, RoPE θ = 1e6
- Context length: 40,960 tokens (max position embeddings); RL rollouts used a 32,768-token serving window
- Vocabulary: 151,936 tokens
- Precision: bf16
- Checkpoint: RL step 45
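The figures above are enough for a rough estimate of the KV-cache memory a single sequence needs at each context length. This is a back-of-the-envelope sketch under stated assumptions: the head dimension is taken as hidden size divided by attention heads (the card does not state it), the cache is stored in bf16, and serving-engine overheads such as paging are ignored. kv_cache_bytes is a hypothetical helper name.

```python
NUM_LAYERS = 36
HIDDEN_SIZE = 4096
NUM_ATTENTION_HEADS = 32
NUM_KV_HEADS = 8
BYTES_PER_VALUE = 2  # bf16

# Assumption: head_dim = hidden_size / num_attention_heads (not stated in the card).
HEAD_DIM = HIDDEN_SIZE // NUM_ATTENTION_HEADS


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes of KV cache for one sequence of num_tokens tokens."""
    # Factor 2 covers the separate key and value tensors in every layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


for tokens in (32_768, 40_960):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.3f} GiB of KV cache per sequence")
```

Under these assumptions, the 8 KV heads (rather than 32) keep the cache at a quarter of what full multi-head attention would need: roughly 4.5 GiB at the 32,768-token serving window and 5.625 GiB at the full 40,960-token context.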
The SFT → RL recipe
- OpenThoughts-Agent-SFT-ColdStartForRL-10K — cold-start SFT trajectories.
- OpenThinkerAgent-8B-ColdStartSFTForRL — Qwen3-8B after cold-start SFT (the pre-RL base).
- OpenThoughts-Agent-RL-5K — the 5,000 on-policy RL tasks.
- OpenThinkerAgent-8B-RL — this model, the final RL'd checkpoint (step 45).
Training data
- Cold-start SFT: OpenThoughts-Agent-SFT-ColdStartForRL-10K (9,437 task/trajectory pairs).
- RL tasks: OpenThoughts-Agent-RL-5K (5,000 pymethods2test-large tasks); the policy rolls out against each task in a Daytona sandbox and is rewarded by the task's test verifier.
Training procedure
On-policy RL with the OpenThoughts-Agent codebase (SkyRL), recorded in the run config shipped with this repo (swesmith-fixthink-pymethods2test_rl_config.json):
- Algorithm: RLOO-n advantage estimator (advantage_estimator=rloo_n), no KL loss (use_kl_loss=false, kl_loss_coef=0.0)
- PPO clip: eps_clip_low/high = 0.2, loss reduction = token_mean
- Optimizer: AdamW, learning_rate 5e-6, weight_decay 0.0, betas (0.9, 0.999)
- Batch: train_batch_size 64, policy_mini_batch_size 64
- Rollouts: vLLM backend, 8 samples per prompt, sampling temperature 0.7 / top_p 0.95 / top_k 20, max generate length 4096, served at 32,768-token context
- Harness: terminus-2 agent in Daytona sandboxes; interleaved thinking enabled
- Strategy: FSDP2; HF checkpoint exported every 5 RL steps; this artifact is step 45
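To make the algorithm settings above concrete, here is an illustrative sketch of a leave-one-out (RLOO) baseline combined with a clipped policy loss averaged over tokens. It is not the SkyRL implementation: it assumes the standard RLOO formulation, in which each sample's baseline is the mean reward of the other samples for the same prompt, and the exact rloo_n variant in the codebase may differ (for example in normalization). The function names are hypothetical.

```python
import math


def rloo_advantages(rewards):
    """Leave-one-out advantages for the n samples drawn for one prompt."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other n - 1 samples.
    return [r - (total - r) / (n - 1) for r in rewards]


def clipped_token_mean_loss(new_logprobs, old_logprobs, advantages,
                            eps_low=0.2, eps_high=0.2):
    """PPO-style clipped loss averaged over every token in the batch.

    new_logprobs / old_logprobs: one list of per-token log-probs per sequence.
    advantages: one scalar per sequence, applied to each of its tokens.
    """
    total, count = 0.0, 0
    for new_seq, old_seq, adv in zip(new_logprobs, old_logprobs, advantages):
        for new_lp, old_lp in zip(new_seq, old_seq):
            ratio = math.exp(new_lp - old_lp)
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            # Pessimistic (minimum) surrogate, negated to form a loss.
            total += -min(ratio * adv, clipped * adv)
            count += 1
    return total / count


# Eight samples for one prompt, one of which passes the test verifier.
print(rloo_advantages([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
```

With 8 samples per prompt and a binary verifier reward, a lone success receives an advantage of 1 and each failure receives -1/7, so the advantages for a prompt sum to zero. The token-level mean lets long trajectories contribute more tokens, and therefore more weight, than short ones.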
Intended uses & limitations
This is an agentic coding model: it is designed to operate as a tool-using agent in the terminus-2 harness (issuing shell commands / edits and reasoning over terminal output) to solve software-engineering tasks. It inherits Qwen3-8B's general capabilities plus agentic behaviour from cold-start SFT and the RL stage. Limitations: outputs (including shell commands) may be incorrect or unsafe and should be executed only in sandboxed environments with review; the RL stage optimized for the pymethods2test/SWE-Smith-style task distribution and may generalize unevenly to other domains.
Evaluation: No verified agentic-benchmark numbers are published for this specific 8B RL checkpoint in the source artifact; evaluation results are TBD. (The flagship OpenThinkerAgent-32B card reports the project's benchmark suite for the 32B SFT line.)
Links
- 🌐 OpenThoughts-Agent project page
- 💻 OpenThoughts-Agent GitHub repository
- 📚 OpenThinker-Agent collection
- 🤖 Pre-RL base model: OpenThinkerAgent-8B-ColdStartSFTForRL
- 🧠 RL tasks: OpenThoughts-Agent-RL-5K
- 🧠 Cold-start SFT dataset: OpenThoughts-Agent-SFT-ColdStartForRL-10K
Citation
@misc{openthoughts-agent,
  author = {Team, OpenThoughts-Agent},
  title = {{OpenThoughts-Agent: Data Recipes for Agentic Models}},
  howpublished = {https://www.openthoughts.ai/blog/agent},
  year = {2026}
}
Model provider
open-thoughts
Model tree
Base
open-thoughts/OpenThinkerAgent-8B-ColdStartSFTForRL
Fine-tuned
this model
Modalities
Input
Text
Output
Text