qwen3-4b-think-s1-full-sft API & Inference Endpoint

Training summary

Table with columns: Field, Value
Field	Value
Method	Full SFT (DeepSpeed ZeRO-2)
Dataset	think_s1 (easy + medium, 72,555 samples)
Chat template	qwen3
Thinking mode	enable_thinking=true
Cutoff length	16384
Packing	true (neat_packing)
Epochs	2
Global batch	64 (4 GPU × 4 × 4)
Learning rate	1e-5
LR schedule	cosine, warmup 10%
Train steps	1094
Final train loss	~0.57
Finished	2026-06-09

Eval (EvalScope, release_latest / AIME)

Table with columns: Benchmark, pass@1, Config
Benchmark	pass@1	Config
LiveCodeBench	36.06%	t=0.6, p=0.95, max_tokens=16384
AIME24	16.67%	same sampling, max_tokens=16384
AIME25	3.33%	same sampling, max_tokens=16384

Usage

HuggingFace Transformers

python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "modrill/qwen3-4b-think-s1-full-sft"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

vLLM

bash
python -m vllm.entrypoints.openai.api_server \
  --model modrill/qwen3-4b-think-s1-full-sft \
  --served-model-name think-s1 \
  --max-model-len 32768 \
  --port 8801

Inference tips

Use Qwen3 chat template with thinking enabled
Recommended eval max_tokens: 16384 (matches training cutoff)
Sampling: temperature=0.6, top_p=0.95, top_k=20

License

Apache 2.0, consistent with the Qwen3 base model license.

Training summary

Table with columns: Field, Value
Field	Value
Method	Full SFT (DeepSpeed ZeRO-2)
Dataset	think_s1 (easy + medium, 72,555 samples)
Chat template	qwen3
Thinking mode	enable_thinking=true
Cutoff length	16384
Packing	true (neat_packing)
Epochs	2
Global batch	64 (4 GPU × 4 × 4)
Learning rate	1e-5
LR schedule	cosine, warmup 10%
Train steps	1094
Final train loss	~0.57
Finished	2026-06-09

Eval (EvalScope, release_latest / AIME)

Table with columns: Benchmark, pass@1, Config
Benchmark	pass@1	Config
LiveCodeBench	36.06%	t=0.6, p=0.95, max_tokens=16384
AIME24	16.67%	same sampling, max_tokens=16384
AIME25	3.33%	same sampling, max_tokens=16384

Usage

HuggingFace Transformers

python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "modrill/qwen3-4b-think-s1-full-sft"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

vLLM

bash
python -m vllm.entrypoints.openai.api_server \
  --model modrill/qwen3-4b-think-s1-full-sft \
  --served-model-name think-s1 \
  --max-model-len 32768 \
  --port 8801

Inference tips

Use Qwen3 chat template with thinking enabled
Recommended eval max_tokens: 16384 (matches training cutoff)
Sampling: temperature=0.6, top_p=0.95, top_k=20

License

Apache 2.0, consistent with the Qwen3 base model license.

qwen3-4b-think-s1-full-sft

README

Training summary

Eval (EvalScope, release_latest / AIME)

Usage

HuggingFace Transformers

vLLM

Inference tips

License

Explore FriendliAI today

README

Training summary

Eval (EvalScope, release_latest / AIME)

Usage

HuggingFace Transformers

vLLM

Inference tips

License