open-thoughts

open-thoughts

OpenThinkerAgent-8B-RL

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

OpenThinkerAgent-8B-RL

OpenThoughts-Agent is an open-source effort to curate the best datasets for training agents. Our release includes datasets, models and our research codebase.

OpenThinkerAgent-8B-RL is the final, RL-trained 8B agentic checkpoint of the OpenThoughts-Agent SFT→RL recipe. Starting from the cold-start SFT base OpenThinkerAgent-8B-ColdStartSFTForRL, it is further trained with on-policy reinforcement learning on the OpenThoughts-Agent-RL-5K task set. This checkpoint corresponds to RL step 45.

Architecture note. Although the upstream lineage carries a GLM-4.7 label (which refers to the teacher used for the cold-start SFT trajectories, not the student), this model is a Qwen3-8B. Its config.json reports model_type: qwen3, architectures: ["Qwen3ForCausalLM"], 36 layers, hidden size 4096, 32 attention heads / 8 KV heads, and a 40,960-token context — i.e. standard Qwen3-8B.

Model details

  • Base (pre-RL) model: OpenThinkerAgent-8B-ColdStartSFTForRL (itself an SFT of Qwen/Qwen3-8B)
  • Architecture: Qwen3 (Qwen3ForCausalLM), 36 layers, hidden size 4096, 32 attention heads, 8 KV heads, RoPE θ = 1e6
  • Context length: 40,960 tokens (max position embeddings); RL rollouts used a 32,768-token serving window
  • Vocabulary: 151,936 tokens
  • Precision: bf16
  • Checkpoint: RL step 45

The SFT → RL recipe

  1. OpenThoughts-Agent-SFT-ColdStartForRL-10K — cold-start SFT trajectories.
  2. OpenThinkerAgent-8B-ColdStartSFTForRL — Qwen3-8B after cold-start SFT (the pre-RL base).
  3. OpenThoughts-Agent-RL-5K — the 5,000 on-policy RL tasks.
  4. OpenThinkerAgent-8B-RL — this model, the final RL'd checkpoint (step 45).

Training data

Training procedure

On-policy RL with the OpenThoughts-Agent codebase (SkyRL), recorded in the run config shipped with this repo (swesmith-fixthink-pymethods2test_rl_config.json):

  • Algorithm: RLOO-n advantage estimator (advantage_estimator=rloo_n), no KL loss (use_kl_loss=false, kl_loss_coef=0.0)
  • PPO clip: eps_clip_low/high = 0.2, loss reduction = token_mean
  • Optimizer: AdamW, learning_rate 5e-6, weight_decay 0.0, betas (0.9, 0.999)
  • Batch: train_batch_size 64, policy_mini_batch_size 64
  • Rollouts: vLLM backend, 8 samples per prompt, sampling temperature 0.7 / top_p 0.95 / top_k 20, max generate length 4096, served at 32,768-token context
  • Harness: terminus-2 agent in Daytona sandboxes; interleaved thinking enabled
  • Strategy: FSDP2; HF checkpoint exported every 5 RL steps; this artifact is step 45

Intended uses & limitations

This is an agentic coding model: it is designed to operate as a tool-using agent in the terminus-2 harness (issuing shell commands / edits and reasoning over terminal output) to solve software-engineering tasks. It inherits Qwen3-8B's general capabilities plus agentic behaviour from cold-start SFT and the RL stage. Limitations: outputs (including shell commands) may be incorrect or unsafe and should be executed only in sandboxed environments with review; the RL stage optimized for the pymethods2test/SWE-Smith-style task distribution and may generalize unevenly to other domains.

Evaluation: No verified agentic-benchmark numbers are published for this specific 8B RL checkpoint in the source artifact; evaluation results are TBD. (The flagship OpenThinkerAgent-32B card reports the project's benchmark suite for the 32B SFT line.)

Citation

markdown

@misc{openthoughts-agent,
author = {Team, OpenThoughts-Agent},
title = {{OpenThoughts-Agent: Data Recipes for Agentic Models}},
howpublished = {https://www.openthoughts.ai/blog/agent},
year = {2026}
}

Model provider

open-thoughts

open-thoughts

Model tree

Base

open-thoughts/OpenThinkerAgent-8B-ColdStartSFTForRL

Fine-tuned

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today