open-thoughts

open-thoughts

OpenThinkerAgent-8B-ColdStartSFTForRL

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

OpenThinkerAgent-8B-ColdStartSFTForRL

OpenThoughts-Agent is an open-source effort to curate the best datasets for training agents. Our release includes datasets, models and our research codebase.

OpenThinkerAgent-8B-ColdStartSFTForRL is the cold-start, pre-RL base of the OpenThoughts-Agent 8B SFT→RL recipe. It is post-trained from Qwen/Qwen3-8B with full-parameter SFT on the cold-start OpenThoughts-Agent-SFT-ColdStartForRL-10K dataset. Its purpose is to give the model the agentic interaction format and tool-use behaviour needed to make subsequent reinforcement learning stable; it is then RL-trained to produce OpenThinkerAgent-8B-RL.

Architecture note. Although the upstream artifact carries a GLM-4.7 label (which refers to the teacher that generated the SFT trajectories, not the student), this model is a Qwen3-8B. Its config.json reports model_type: qwen3, architectures: ["Qwen3ForCausalLM"], 36 layers, hidden size 4096, 32 attention heads / 8 KV heads, and a 40,960-token context — i.e. standard Qwen3-8B.

Model details

  • Base model: Qwen/Qwen3-8B
  • Architecture: Qwen3 (Qwen3ForCausalLM), 36 layers, hidden size 4096, 32 attention heads, 8 KV heads, RoPE θ = 1e6
  • Context length: 40,960 tokens (max position embeddings)
  • Vocabulary: 151,936 tokens
  • Precision: bf16
  • Role in pipeline: cold-start SFT checkpoint (pre-RL base)

Position in the SFT → RL recipe

  1. OpenThoughts-Agent-SFT-ColdStartForRL-10K — cold-start SFT trajectories.
  2. OpenThinkerAgent-8B-ColdStartSFTForRL — this model (Qwen3-8B after cold-start SFT, the pre-RL base).
  3. OpenThoughts-Agent-RL-5K — on-policy RL tasks.
  4. OpenThinkerAgent-8B-RL — the final RL'd checkpoint (step 45).

Training data

Trained on OpenThoughts-Agent-SFT-ColdStartForRL-10K (9,437 (task, trajectory) pairs): SWE-Smith sandboxed coding tasks with tests, solved by a teacher model in the terminus-2 harness inside Daytona sandboxes, oracle-verified (120s verifier timeout).

Training procedure

Full-parameter SFT (LLaMA-Factory). Hyperparameters as recorded by the trainer:

  • learning_rate: 4e-05
  • lr_scheduler_type: cosine, warmup_ratio 0.1
  • train_batch_size: 1 per device × 8 devices × gradient_accumulation_steps 2 → total_train_batch_size 16
  • optimizer: AdamW (fused), betas (0.9, 0.98), eps 1e-08
  • num_epochs: 7
  • seed: 42
  • precision: bf16
  • final train loss: ≈ 0.303 (4,130 global steps)

Framework versions

  • Transformers 4.57.6
  • PyTorch 2.9.0+cu128
  • Datasets 4.4.1
  • Tokenizers 0.22.2

Intended uses & limitations

This checkpoint is intended as the starting point for agentic RL, not as a final deployable agent. It has learned the agentic format and tool-use conventions of the terminus-2 harness from a relatively small cold-start set; its standalone agentic performance is expected to be below the RL-trained successor OpenThinkerAgent-8B-RL. As with the base Qwen3-8B, outputs may be incorrect or unsafe and should not be executed without review. No standalone agentic-benchmark numbers are published for this cold-start checkpoint.

Citation

markdown

@misc{openthoughts-agent,
author = {Team, OpenThoughts-Agent},
title = {{OpenThoughts-Agent: Data Recipes for Agentic Models}},
howpublished = {https://www.openthoughts.ai/blog/agent},
year = {2026}
}

Model provider

open-thoughts

open-thoughts

Model tree

Base

Qwen/Qwen3-8B

Fine-tuned

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today