96kevinli29

Qwen3-4B-SFT-Math-45k-ep2

License: apache-2.0

Qwen3-4B-SFT-Math:

Qwen3-4B-SFT-Math is a math-reasoning model derived from Qwen3-4B-Base via full-parameter fine-tuning (2 epochs) with the verl framework, using a pure long-think math recipe of roughly 45K examples.

Open-source practice has a notable shortage of reproducible 'warm-start' SFT bases. This model bridges the gap between base models and reinforcement-learning models: it is tuned for Chain-of-Thought (CoT) reasoning and instruction following, and is intended as a warm-start checkpoint for reinforcement learning.

This is the 4B pure-math counterpart to SeaFill2025/Qwen3-8B-SFT (the 8B / 90K variant).

Benchmark Snapshot

Compared with the Base (4B) model, Qwen3-4B-SFT-Math-45k-ep2 improves substantially on competition-math benchmarks. Reported figures are pass@1 accuracy, computed as the mean of dataset-level accuracy over 16 independent runs.

Dataset	Base (4B)	Qwen3-4B-SFT-Math-45k-ep2 (this model)	Improvement (absolute)
AIME 2025	1.46%	22.1%	+20.62 pp
AIME 2026	2.29%	22.1%	+19.79 pp
AMC 2023	21.25%	64.1%	+42.81 pp

Aggregated over the full 100-problem T0 set (16 rollouts each): pass@1 9.6% → 38.9% (+29.3 pp), any@16 37% → 69% (+32 pp), perfect@16 0% → 11% (+11 pp).
Evaluation protocol: T0 = 100 original competition problems (30 AIME-2025 + 30 AIME-2026 + 40 AMC-2023), 16 rollouts per problem, judged by exact-match of the boxed final answer.
Training data: a 45K-row math-only subset derived from open-r1/OpenR1-Math-220k (same source family as the 8B/90K recipe at 96kevinli29/SFT-Math-90k).
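
The aggregate figures above can be recomputed from per-rollout correctness flags. The sketch below is illustrative and is not the evaluation code used for this checkpoint. It assumes that any@16 means at least one of the 16 rollouts is correct and that perfect@16 means all 16 are correct (the card does not spell these out). The function names `rollout_metrics` and `weighted_pass_at_1` are hypothetical.

```python
from typing import Dict, List, Tuple


def rollout_metrics(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compute pass@1, any@k and perfect@k from per-problem correctness flags.

    `results` maps a problem id to one boolean per rollout (True = correct).
    Assumed definitions: any@k = at least one rollout correct,
    perfect@k = every rollout correct.
    """
    if not results or any(len(flags) == 0 for flags in results.values()):
        raise ValueError("need at least one problem and one rollout per problem")
    n_problems = len(results)
    per_problem_acc = [sum(flags) / len(flags) for flags in results.values()]
    return {
        "pass@1": sum(per_problem_acc) / n_problems,
        "any@k": sum(any(flags) for flags in results.values()) / n_problems,
        "perfect@k": sum(all(flags) for flags in results.values()) / n_problems,
    }


def weighted_pass_at_1(per_dataset: List[Tuple[int, float]]) -> float:
    """Combine (problem_count, pass@1) pairs into one problem-weighted pass@1."""
    total = sum(count for count, _ in per_dataset)
    return sum(count * acc for count, acc in per_dataset) / total


# Toy data with 4 rollouts per problem (not real evaluation results).
toy = {
    "p1": [True, True, True, True],
    "p2": [True, False, False, False],
    "p3": [False, False, False, False],
}
metrics = rollout_metrics(toy)

# Cross-check of the Base (4B) aggregate using the per-dataset table values.
base_overall = weighted_pass_at_1([(30, 1.46), (30, 2.29), (40, 21.25)])
```

With the same number of rollouts for every problem, averaging per-problem accuracy gives the same pass@1 as averaging dataset-level accuracy over runs. The `base_overall` cross-check evaluates to roughly 9.6, consistent with the aggregate Base pass@1 reported above.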

Qwen3-style reasoning and instruction following

Minimal pattern (illustrative):

text
<|im_start|>user
… Among options A–D, which is correct? Reason step by step and put the final letter in \boxed{}.
<|im_end|>

<|im_start|>assistant
<think>
Compare A vs B vs C vs D against the stem; eliminate …; D remains consistent with …
</think>
Step-by-step: … (short derivation in the visible channel)
Final answer: \boxed{D}
<|im_end|>
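
The prompt half of the pattern above can be assembled by hand, as in this minimal sketch. It assumes the standard ChatML layout (`<|im_start|>role`, newline, content, `<|im_end|>`, newline) and ignores system prompts and tool calls. In practice the chat template shipped with the tokenizer is the safer choice. `build_chatml_prompt` is a hypothetical helper, not part of any library.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(messages: List[Dict[str, str]],
                        add_generation_prompt: bool = True) -> str:
    """Render role/content messages into a ChatML-style prompt string."""
    parts = [
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn; the model is expected to continue with
        # its <think> block and the visible answer.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([
    {
        "role": "user",
        "content": "What is 17 * 3? Reason step by step and put the "
                   "final answer in \\boxed{}.",
    }
])
```

The returned string ends with an open assistant turn, so generation starts where the `<think>` block in the pattern begins.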

Use a large enough max_new_tokens on hard math problems so that both the reasoning block and the visible \boxed{…} line fit before generation stops. In our evals the median rollout was ≈ 11.6K tokens and ~37% of rollouts hit the 16K cap, so consider a 32K budget for AIME-level evaluation.
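
One way to spot rollouts that ran out of budget is to check whether the reasoning block was closed and a boxed answer was emitted. The following is a rough sketch under that assumption, not the grader used for the reported numbers. `extract_last_boxed` and `classify_rollout` are hypothetical names, and the three status labels are made up for illustration.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...}, or None if absent.

    Braces are matched by depth so nested LaTeX such as \frac{1}{2} is
    kept whole. An unbalanced final \boxed{ (cut off mid-answer) gives None.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 1
    for i in range(content_start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None


def classify_rollout(completion: str) -> str:
    """Label a completion by how far it got before generation stopped."""
    if "</think>" not in completion:
        return "truncated_in_reasoning"  # never left the <think> block
    visible = completion.split("</think>", 1)[1]
    if extract_last_boxed(visible) is None:
        return "no_final_answer"  # reasoning closed, but no boxed answer
    return "complete"


status = classify_rollout("<think>\ncompare options\n</think>\nFinal answer: \\boxed{D}")
```

Counting the share of rollouts that are not `"complete"` at a given max_new_tokens gives a quick signal for whether the budget is too small.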

Configuration Notes

Template: trained with the Qwen chat template; the model learns to end responses with <|im_end|> (token ID 151645).

Suggested Configuration:

json
{
  "eos_token_id": 151645
}

You may adjust settings according to your training or deployment needs.
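
The suggested setting can be written into a local config file with a few lines of standard-library Python. This sketch assumes the checkpoint directory holds a `generation_config.json` file, which is the usual Hugging Face convention but is not stated in this card. `set_eos_token_id` is a hypothetical helper, and the demo runs against a temporary directory rather than a real checkpoint.

```python
import json
import tempfile
from pathlib import Path

IM_END_TOKEN_ID = 151645  # <|im_end|>, as stated in this model card


def set_eos_token_id(config_path: Path,
                     eos_token_id: int = IM_END_TOKEN_ID) -> dict:
    """Set eos_token_id in a JSON config, keeping any other keys intact."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    config["eos_token_id"] = eos_token_id
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo on a throwaway file; the path and the pre-existing key are
# illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "generation_config.json"
    path.write_text(json.dumps({"max_new_tokens": 32768}), encoding="utf-8")
    updated = set_eos_token_id(path)
    reloaded = json.loads(path.read_text(encoding="utf-8"))
```

Existing keys are preserved, so the helper can be applied to a config that already carries other generation settings.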

Training Infrastructure

Cluster: MeluXina Supercomputer (LuxProvide)
Node Config: 4 nodes, 4 NVIDIA A100 GPUs per node (16 GPUs total).
Training Framework: verl (FSDP, full-parameter SFT)

Project Links

Training code repository: https://github.com/96kevinli29/base-model-sft-verl
Sibling 8B pure-math checkpoint: SeaFill2025/Qwen3-8B-SFT

Limitations

Math-only SFT; not optimized for general-domain reasoning, factuality, or instruction following outside math.
Long rollouts: a non-trivial fraction (~37%) of generations hit the 16K cap on hard competition problems; consider larger budgets for AIME-level evaluation.
No RLHF / RLVR stage applied. This checkpoint is intended as an SFT-only baseline for studying the SFT→RL gap.
May produce hallucinations or unsafe outputs outside math.

Citation

If you use this model, please cite this checkpoint. BibTeX for this release:

bibtex
@misc{qwen3-4b-sft-math-2026,
  title        = {{Qwen3-4B-SFT-Math}: Pure Long-Think Math SFT of {Qwen3}-4B-Base},
  author       = {Hongyang Li and Xiao Li and {Sea-Fill Community}},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/96kevinli29/Qwen3-4B-SFT-Math}},
  note         = {Checkpoint trained with verl; warm-start for pre-RL alignment research. Maintained by Sea-Fill Community.}
}

Model Details

Model Provider

96kevinli29

Model Tree

Base

Qwen/Qwen3-4B-Base

Fine-tuned

this model

Input Modalities

Text

Output Modalities