License: apache-2.0
Qwen3-8B-SFT
Qwen3-8B-SFT is a reasoning-focused model derived from Qwen3-8B-Base via full-parameter fine-tuning on the verl framework.
Reproducible 'warm-start' SFT bases are scarce in open-source practice; this model bridges the gap between base models and reinforcement learning models. Tuned for Chain-of-Thought (CoT) reasoning and instruction following, it is intended as a warm-start checkpoint for Reinforcement Learning.
Benchmark Snapshot
- Compared to the Base (8B) model, Qwen3-8B-SFT improves Pass@1 by roughly 25 to 48 percentage points on the math reasoning benchmarks below. Each reported figure is Pass@1 accuracy, calculated as the average of dataset-level accuracies across 16 independent runs.
| Dataset | Base (8B) | Qwen3-8B-SFT (this model) | Improvement (Absolute) |
|---|---|---|---|
| AIME 2025 | 2.29% | 27.7% | +25.42% |
| AIME 2026 | 3.13% | 27.9% | +24.79% |
| AMC 2023 | 26.88% | 74.8% | +47.96% |
- Aggregated over the full 100-problem T0 set (16 rollouts each): pass@1 12.4% → 46.6% (+34.3), any@16 43% → 77% (+34), perfect@16 0% → 21% (+21).
- Dataset card used for SFT: derived from open-r1/OpenR1-Math-220k (90K-row math-only subset, same source as OpenR1-Distill-7B's 93.7K).
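The aggregate metrics above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used for this release: it assumes any@16 means "at least one of the 16 rollouts is correct" and perfect@16 means "all 16 rollouts are correct", and the function name `rollout_metrics` and the toy grading matrix are hypothetical.

```python
from typing import Dict, Sequence


def rollout_metrics(correct: Sequence[Sequence[bool]]) -> Dict[str, float]:
    """correct[i][j] is True if rollout j of problem i was graded correct."""
    n_problems = len(correct)
    k = len(correct[0])
    if any(len(row) != k for row in correct):
        raise ValueError("every problem needs the same number of rollouts")

    # Pass@1: dataset-level accuracy of each run, then averaged across runs.
    per_run_accuracy = [
        sum(row[j] for row in correct) / n_problems for j in range(k)
    ]
    pass_at_1 = sum(per_run_accuracy) / k

    # Assumed definitions (not spelled out in the model card):
    any_at_k = sum(any(row) for row in correct) / n_problems      # solved at least once
    perfect_at_k = sum(all(row) for row in correct) / n_problems  # solved every time
    return {"pass@1": pass_at_1, "any@k": any_at_k, "perfect@k": perfect_at_k}


# Toy data: 4 problems, 4 rollouts each (made up, not benchmark results).
toy = [
    [True, True, True, True],
    [True, False, False, True],
    [False, False, False, False],
    [False, True, False, False],
]
metrics = rollout_metrics(toy)
print(metrics)
```

With equal rollout counts per problem, averaging per-run accuracies gives the same value as the overall fraction of correct rollouts, which is why pass@1 always falls between perfect@k and any@k.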
Qwen3-style reasoning and instruction following
Minimal pattern (illustrative):
```text
<|im_start|>user
… Among options A–D, which is correct? Reason step by step and put the final letter in \boxed{}.<|im_end|>
<|im_start|>assistant
<think>
Compare A vs B vs C vs D against the stem; eliminate …; D remains consistent with …
</think>
Step-by-step: … (short derivation in the visible channel)
Final answer: \boxed{D}<|im_end|>
```
Use a large enough max_new_tokens on hard math so both the reasoning block and the visible \boxed{…} line fit before generation stops.
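One way to check whether a response fit within the token budget is to parse it after generation. The sketch below is an illustration under assumptions, not part of this release: the helper names `split_reasoning` and `last_boxed` are hypothetical, and it assumes the decoded text keeps the `<think>`, `</think>`, and `<|im_end|>` markers shown in the pattern above.

```python
from typing import Optional, Tuple


def split_reasoning(response: str) -> Tuple[str, str, bool]:
    """Return (reasoning, visible_text, reasoning_closed)."""
    body = response.replace("<|im_end|>", "")
    start = body.find("<think>")
    end = body.find("</think>")
    if start == -1:
        # No reasoning block at all: everything is visible text.
        return "", body.strip(), True
    if end == -1:
        # Generation stopped inside the reasoning block.
        return body[start + len("<think>"):].strip(), "", False
    reasoning = body[start + len("<think>"):end].strip()
    return reasoning, body[end + len("</think>"):].strip(), True


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 0
    content_start = start + len(marker)
    for i in range(content_start, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                return text[content_start:i]
            depth -= 1
    return None  # unbalanced braces: the answer was cut off


complete = "<think>Eliminate A, B, C.</think>Final answer: \\boxed{D}<|im_end|>"
truncated = "<think>Compare A vs B vs"

_, visible, closed = split_reasoning(complete)
print(closed, last_boxed(visible))
_, visible_cut, closed_cut = split_reasoning(truncated)
print(closed_cut, last_boxed(visible_cut))
```

If the reasoning block never closes or no boxed answer is found, rerunning with a larger max_new_tokens is a reasonable first step.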
Configuration Notes
- Template: Trained with the Qwen chat template; the model learns to end responses with <|im_end|> (token ID 151645).
- Suggested Configuration:

```json
{"eos_token_id": 151645}
```
You may adjust settings according to your training or deployment needs.
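To illustrate what the eos_token_id setting controls, the sketch below reads the suggested configuration and truncates a generated token sequence at the first end-of-sequence token. It is a standalone illustration using only the standard library: the helper `trim_at_eos` and the sample token IDs (other than 151645) are made up, and real inference frameworks apply this stop condition internally.

```python
import json
from typing import List, Sequence

# Suggested configuration from the model card.
config = json.loads('{"eos_token_id": 151645}')
EOS_ID = config["eos_token_id"]  # token ID of <|im_end|>


def trim_at_eos(token_ids: Sequence[int], eos_token_id: int = EOS_ID) -> List[int]:
    """Keep tokens up to, but not including, the first EOS token."""
    kept: List[int] = []
    for token_id in token_ids:
        if token_id == eos_token_id:
            break
        kept.append(token_id)
    return kept


def stopped_cleanly(token_ids: Sequence[int], eos_token_id: int = EOS_ID) -> bool:
    """True if the model emitted EOS rather than running out of budget."""
    return eos_token_id in token_ids


# Hypothetical generated IDs; only 151645 is meaningful here.
generated = [1001, 1002, 1003, 151645, 2001]
print(trim_at_eos(generated), stopped_cleanly(generated))
```

A sequence that never contains the EOS ID most likely hit the length limit instead of finishing its answer.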
Training Infrastructure
- Cluster: MeluXina Supercomputer (LuxProvide)
- Node Config: 8 nodes, 4 NVIDIA A100 GPUs per node (32 GPUs total).
- Training Framework: verl (FSDP, full-parameter SFT)
Project Links
- Training code repository: https://github.com/96kevinli29/base-model-sft-verl
Limitations
- Not optimized for factual correctness in all domains
- May still produce hallucinations or unsafe outputs
- Performance is sensitive to prompt style and decoding settings
Citation
If you use this model, please cite this checkpoint. BibTeX for this release:

```bibtex
@misc{qwen3-8b-sft-2026,
  title = {{Qwen3-8B-SFT}: Supervised Fine-Tuned {Qwen3}-8B for Reasoning},
  author = {Hongyang Li and Xiao Li and {Sea-Fill Community}},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/96kevinli29/Qwen3-8B-SFT}},
  note = {Checkpoint trained with verl; warm-start for pre-RL alignment research. Maintained by Sea-Fill Community.}
}
```
Model provider
SeaFill2025
Model tree
Base
Qwen/Qwen3-8B-Base
Fine-tuned
this model
Modalities
Input
Text
Output
Text