License: apache-2.0
About
SOD-GRPO_teacher-4B is a 4B agentic reasoning model trained with GRPO (Group Relative Policy Optimization), serving as the teacher model in the SOD distillation framework.
This model is used to distill smaller student models (SOD-0.6B and SOD-1.7B) via the SOD method, which introduces adaptive step-level weighting to handle cascading error propagation in tool-integrated reasoning.
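For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage computation that gives the method its name: each rollout's reward is normalized against the other rollouts sampled for the same prompt. This is the commonly described formulation, not code from this model's training pipeline; the function name, the reward values, and the choice of population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: population standard deviation plus a small epsilon;
    actual implementations differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 rollouts of one prompt (1.0 = correct final answer).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct rollouts receive positive advantages and incorrect ones negative advantages, so no separate value model is needed to provide a baseline.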
Model Information
| Attribute | Value |
|---|---|
| Base Model | Qwen3-4B |
| Training Pipeline | Cold-Start SFT → GRPO |
| Parameters | 4B |
Related Models
| Model | Description |
|---|---|
| SOD-0.6B | SOD-distilled 0.6B student |
| SOD-1.7B | SOD-distilled 1.7B student |
| SOD-GRPO_teacher-4B | GRPO-trained 4B teacher model (this model) |
Performance
We report average@32 over 5 runs on challenging math, science, and code benchmarks.
| Method | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|---|---|---|---|---|---|
| GRPO (This Model) | 67.60 | 60.42 | 55.19 | 63.13 | 61.59 |
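As a rough illustration of the average@k metric named above, the sketch below averages correctness over k sampled answers per problem and then over problems. The function name and the toy data are hypothetical, and the evaluation harness actually used for this model may aggregate differently (for example, across the 5 runs).

```python
def average_at_k(correct_flags_per_problem):
    """Mean accuracy over k samples per problem, then over problems, in percent."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return 100.0 * sum(per_problem) / len(per_problem)


# Hypothetical data: 2 problems, k = 4 sampled answers each (1 = correct).
flags = [
    [1, 1, 0, 1],  # problem 1: 3 of 4 samples correct
    [0, 0, 1, 0],  # problem 2: 1 of 4 samples correct
]
score = average_at_k(flags)
```

Unlike pass@k, this metric rewards consistency: a problem solved in only one of k samples contributes 1/k rather than full credit.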
Distilled Students
| Model | AIME 2024 | AIME 2025 | GPQA-Diamond | LiveCodeBench-v6 | Average |
|---|---|---|---|---|---|
| SOD-0.6B | 20.84 | 26.13 | 22.19 | 27.72 | 24.22 |
| SOD-1.7B | 50.83 | 41.72 | 38.72 | 40.63 | 42.98 |
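The Average column in both tables appears to be the unweighted mean of the four benchmark scores. The short check below recomputes it from the reported numbers; the variable and function names are illustrative.

```python
# Reported per-benchmark scores and the reported Average, copied from the tables.
scores = {
    "GRPO teacher (4B)": ([67.60, 60.42, 55.19, 63.13], 61.59),
    "SOD-0.6B": ([20.84, 26.13, 22.19, 27.72], 24.22),
    "SOD-1.7B": ([50.83, 41.72, 38.72, 40.63], 42.98),
}


def unweighted_mean(values):
    return sum(values) / len(values)


# Absolute gap between the recomputed mean and the reported Average.
gaps = {
    name: abs(unweighted_mean(values) - reported)
    for name, (values, reported) in scores.items()
}

for name, gap in gaps.items():
    print(f"{name}: recomputed mean differs from reported by {gap:.3f}")
```

All three gaps fall within rounding to two decimal places.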
Acknowledgement
We sincerely thank the authors of DemyAgent-4B and the paper "Demystifying Reinforcement Learning in Agentic Reasoning" (arXiv:2510.11701) for their contribution.
Citation
@article{zhong2026sod,
  title={SOD: Step-wise On-policy Distillation for Small Language Model Agents},
  author={Zhong, Qiyong and Zheng, Mao and Song, Mingyang and Lin, Xin and Sun, Jie and Jiang, Houcheng and Wang, Xiang and Fang, Junfeng},
  journal={arXiv preprint arXiv:2605.07725},
  year={2026}
}
Model provider: youngzhong
Model tree: fine-tuned from Qwen/Qwen3-4B
Modalities: text input, text output