Qwen3-4B-SFT-Math:
Qwen3-4B-SFT-Math is a math-reasoning model derived from Qwen3-4B-Base via full-parameter fine-tuning (2 epochs) with the verl framework, using a pure long-think math recipe of roughly 45K examples.
Reproducible 'warm-start' SFT bases are in notably short supply in open-source practice; this model bridges the gap between base models and reinforcement-learning models. Aligned for Chain-of-Thought (CoT) reasoning and instruction following, it serves as a robust warm start for reinforcement learning.
This is the 4B pure-math counterpart to SeaFill2025/Qwen3-8B-SFT (the 8B / 90K variant).
Benchmark Snapshot
- Compared with the Base (4B) model, Qwen3-4B-SFT-Math-45k-ep2 shows large gains in reasoning and mathematics. Reported figures are Pass@1 accuracy: the dataset-level accuracy averaged over 16 independent runs. The improvement column appears to be computed from unrounded scores, so it can differ in the last digit from the difference of the displayed values.
| Dataset | Base (4B) | Qwen3-4B-SFT-Math-45k-ep2 (this model) | Improvement (Absolute) |
|---|---|---|---|
| AIME 2025 | 1.46% | 22.1% | +20.62% |
| AIME 2026 | 2.29% | 22.1% | +19.79% |
| AMC 2023 | 21.25% | 64.1% | +42.81% |
- Aggregated over the full 100-problem T0 set (16 rollouts each): pass@1 9.6% → 38.9% (+29.3), any@16 37% → 69% (+32), perfect@16 0% → 11% (+11).
- Evaluation protocol: T0 = 100 original competition problems (30 AIME-2025 + 30 AIME-2026 + 40 AMC-2023), 16 rollouts per problem, judged by exact-match of the boxed final answer.
- Training recipe: a 45K-row math-only subset derived from open-r1/OpenR1-Math-220k (same source family as the 8B/90K recipe at 96kevinli29/SFT-Math-90k).
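The three aggregate metrics above can be sketched as follows. This is a minimal illustration, not the evaluation code used for this card: the function names (`rollout_metrics`, `pooled_accuracy`) and the input layout (one list of per-rollout booleans per problem) are assumptions made for the example.

```python
from typing import Dict, List, Sequence


def rollout_metrics(correct: List[List[bool]]) -> Dict[str, float]:
    """Compute pass@1, any@k and perfect@k from a correctness grid.

    correct[i][j] is True when rollout j of problem i matched the reference.
    The grid layout is an assumption made for this sketch.
    """
    if not correct or not correct[0]:
        raise ValueError("need at least one problem and one rollout")
    n_problems = len(correct)
    n_rollouts = len(correct[0])
    if any(len(row) != n_rollouts for row in correct):
        raise ValueError("every problem needs the same number of rollouts")

    # Pass@1: accuracy of each run (one column of the grid), averaged over runs.
    per_run_acc = [
        sum(row[j] for row in correct) / n_problems for j in range(n_rollouts)
    ]
    pass_at_1 = sum(per_run_acc) / n_rollouts

    # any@k: share of problems solved by at least one rollout.
    any_at_k = sum(any(row) for row in correct) / n_problems
    # perfect@k: share of problems solved by every rollout.
    perfect_at_k = sum(all(row) for row in correct) / n_problems

    return {"pass@1": pass_at_1, "any@k": any_at_k, "perfect@k": perfect_at_k}


def pooled_accuracy(sizes: Sequence[int], accuracies: Sequence[float]) -> float:
    """Problem-count-weighted mean of per-dataset accuracies."""
    return sum(n * a for n, a in zip(sizes, accuracies)) / sum(sizes)


toy = [[True, True], [True, False], [False, False], [False, False]]
print(rollout_metrics(toy))
print(pooled_accuracy([30, 30, 40], [1.46, 2.29, 21.25]))
```

As a consistency check, weighting the per-dataset Pass@1 values in the table by problem count (30, 30, 40) gives about 9.6% for the base model and about 38.9% for this model, which agrees with the aggregate figures quoted above.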
Qwen3-style reasoning and instruction following
Minimal pattern (illustrative):
<|im_start|>user
… Among options A–D, which is correct? Reason step by step and put the final letter in \boxed{}.
<|im_end|>
<|im_start|>assistant
<think>
Compare A vs B vs C vs D against the stem; eliminate …; D remains consistent with …
</think>
Step-by-step: … (short derivation in the visible channel)
Final answer: \boxed{D}
<|im_end|>
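For readers who assemble prompts by hand, the pattern above might be built as in the sketch below. It assumes the usual ChatML layout, in which `<|im_end|>` follows the message content directly (the line breaks in the pattern above are taken to be for readability). `build_prompt` is a hypothetical helper, and in practice the tokenizer's own chat template should be treated as authoritative.

```python
def build_prompt(question: str) -> str:
    """Build a single-turn ChatML-style prompt that ends at the assistant turn.

    Hypothetical helper for illustration; the instruction wording is adapted
    from the pattern above, not a required string.
    """
    instruction = "Reason step by step and put the final answer in \\boxed{}."
    user_turn = f"{question.strip()} {instruction}"
    return (
        "<|im_start|>user\n"
        f"{user_turn}<|im_end|>\n"
        "<|im_start|>assistant\n"  # the model continues from here
    )


print(build_prompt("What is the remainder when 2^10 is divided by 7?"))
```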
Use a large enough max_new_tokens on hard math so both the reasoning block and the visible \boxed{…} line fit before generation stops. Median rollout ≈ 11.6K tokens; ~37% of rollouts hit the 16K cap in our evals — consider a 32K budget for AIME-level evaluation.
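To see why the token budget matters for scoring, here is a minimal sketch of an exact-match judge on the boxed answer. It is not the judge used for this card: `extract_boxed` and `is_correct` are hypothetical names, and treating an unclosed `<think>` block or an unbalanced `\boxed{` as "no answer" is an assumption about how truncated rollouts would be handled.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the visible answer.

    Returns None when no complete boxed answer is present, which is what a
    rollout cut off by the token cap typically looks like.
    """
    if "</think>" in text:
        # Only the visible channel after the reasoning block is judged.
        text = text.rsplit("</think>", 1)[1]
    elif "<think>" in text:
        # Reasoning never closed: generation stopped inside the think block.
        return None

    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None

    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced: answer was cut off


def is_correct(text: str, reference: str) -> bool:
    """Exact string match between the boxed answer and the reference."""
    predicted = extract_boxed(text)
    return predicted is not None and predicted == reference.strip()


full = "<think>maybe \\boxed{A}?</think>\nFinal answer: \\boxed{D}"
cut = "<think>Compare A vs B vs C"
print(is_correct(full, "D"), is_correct(cut, "D"))
```

Under these assumptions a truncated rollout always scores as wrong, so a cap that is too small lowers measured accuracy regardless of whether the reasoning was on track.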
Configuration Notes
You may adjust settings according to your training or deployment needs.
Training Infrastructure
- Cluster: MeluXina Supercomputer (LuxProvide)
- Node Config: 4 nodes, 4 NVIDIA A100 GPUs per node (16 GPUs total).
- Training Framework: verl (FSDP, full-parameter SFT)
Limitations
- Math-only SFT; not optimized for general-domain reasoning, factuality, or instruction following outside math.
- Long rollouts: a non-trivial fraction (~37%) of generations hit the 16K cap on hard competition problems; consider larger budgets for AIME-level evaluation.
- No RLHF / RLVR stage applied. This checkpoint is intended as an SFT-only baseline for studying the SFT→RL gap.
- May produce hallucinations or unsafe outputs outside math.
Citation
If you use this model, please cite this checkpoint. BibTeX for this release:
@misc{qwen3-4b-sft-math-2026,
title = {{Qwen3-4B-SFT-Math}: Pure Long-Think Math SFT of {Qwen3}-4B-Base},
author = {Hongyang Li and Xiao Li and {Sea-Fill Community}},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/96kevinli29/Qwen3-4B-SFT-Math}},
note = {Checkpoint trained with verl; warm-start for pre-RL alignment research. Maintained by Sea-Fill Community.}
}