Model Description
This model is obtained by applying on-policy distillation (OPD) to Qwen3-1.7B-Base, with Qwen3-4B-Base-GRPO serving as the teacher model. The OPD training uses DAPO math prompts/data and is designed to transfer the teacher's math-focused reasoning behavior into a smaller 1.7B-parameter student model.
Key characteristics
- Student/base model: Qwen3-1.7B-Base
- Teacher model: lllyx/Qwen3-4B-Base-GRPO
- Training data: DAPO-Math-17k
- Training stage: On-Policy Distillation (OPD)
- Training framework: verl
- Rollout engine: vLLM
- Primary domain: Mathematical reasoning
- Model architecture: Qwen3ForCausalLM
- Precision: bfloat16
- Context length: 32768 tokens
Training Details
Training configuration
- Base checkpoint:
Qwen/Qwen3-1.7B-Base
- Teacher checkpoint:
lllyx/Qwen3-4B-Base-GRPO
- Training framework: verl
- Training method: on-policy distillation with GRPO-style rollouts
- Distillation loss mode:
k1
- Policy-gradient term: enabled
- Training dataset:
DAPO-Math-17k/DAPO-Math.parquet
- Primary task domain: math reasoning
- Chat template thinking mode: disabled (
enable_thinking=False)
Rollout and optimization
- Rollout engine: vLLM
- Responses per prompt: 4
- Prompt length: 1024
- Response length: 7168
- Max rollout model length: 8193
- Train batch size: 64
- PPO mini-batch size: 16
- PPO micro-batch size per GPU: 1
- Max PPO token length per GPU: 8192
- Actor learning rate:
1e-6
- Total epochs: 1
- Save frequency: every 20 steps
Runtime setup
- Distributed backend: Ray
- Number of nodes: 1
- GPUs per node: 4
- Teacher world size: 4
- Rollout tensor parallel size: 1
- Teacher tensor parallel size: 1
- Actor training: FSDP with parameter and optimizer offload
- Gradient checkpointing: enabled
- Padding removal: enabled
- Torch compile for actor: enabled
- Reward function: rule-based math reward from
verl/recipe/r1_ascend/deepscaler.py::compute_score
Dataset
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "lllyx/Qwen3-1.7B-Base-OPD"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype="auto",
device_map="auto",
)
Citation
If you use this model, please consider citing the related paper:
@article{li2026rethinking,
title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
journal={arXiv preprint arXiv:2604.13016},
year={2026}
}