muskannnnn

Qwen3-0.6B-GSM8K-Reasoning

README

License: apache-2.0

Model Details

Model Description

This model is a fine-tuned version of Qwen/Qwen3-0.6B explicitly trained to solve mathematical word problems using structured, step-by-step reasoning. It was developed as part of an advanced NLP pipeline utilizing both Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO).

Through reinforcement learning, the model has learned to generate highly focused Chain-of-Thought (CoT) logic enclosed in <think> tags before outputting the final numerical answer.

  • Developed by: Muskan Pawan, Kashish Anil Kumar, Abdullah Khalid (Institute of Business Administration)
  • Model type: Causal Language Model (Fine-tuned for Reasoning)
  • Language(s) (NLP): English
  • Finetuned from model: Qwen/Qwen3-0.6B

Uses

Direct Use

The model is designed to receive math word problems and output a structured reasoning process followed by the final answer. It is optimized for the ChatML format.

Example input format:


System: "You are a helpful math tutor. Solve the problem step by step, then give the final answer after ####."
User: <Math problem here>
Assistant: <think> ... </think> #### <Final Answer>
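A minimal sketch of how a response in the format above could be split into its reasoning and final answer. The helper name `split_response` and both regular expressions are illustrative assumptions, not part of the released model or its tooling.

```python
import re
from typing import Optional, Tuple

# ASSUMPTION: these patterns mirror the format shown above; they are not shipped with the model.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"####\s*\$?\s*(-?[\d,]*\.?\d+)")


def split_response(text: str) -> Tuple[Optional[str], Optional[float]]:
    """Return (reasoning, final_answer); either part is None if it is missing."""
    think = THINK_RE.search(text)
    reasoning = think.group(1).strip() if think else None
    # Look for the answer after the reasoning block, or in the whole text if there is none.
    tail = text[think.end():] if think else text
    answer = ANSWER_RE.search(tail)
    value = float(answer.group(1).replace(",", "")) if answer else None
    return reasoning, value


sample = "<think>4 * 0.75 = 3.00; 3 * 1.25 = 3.75; 3.00 + 3.75 = 6.75</think> #### 6.75"
reasoning, value = split_response(sample)
print(reasoning)
print(value)
```

Stripping thousands separators and an optional dollar sign is a design choice made here so that answers such as `#### $1,250` still parse; adjust it if your prompts produce other units.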

How to Use This Model: You can load and run inference on this model using the standard transformers library. Ensure you use the provided system prompt and set repetition_penalty=1.15 to prevent EOS loop issues.


# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "muskannnnn/Qwen3-0.6B-GSM8K-Reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "A store sells apples for $0.75 each and oranges for $1.25 each. If Sarah buys 4 apples and 3 oranges, how much does she spend in total?"
messages = [
    {"role": "system", "content": "You are a helpful math tutor. Solve the problem step by step, then give the final answer after ####."},
    {"role": "user", "content": prompt},
]

# Render the conversation with the model's chat template and tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# repetition_penalty=1.15 is the value recommended above.
generated_ids = model.generate(**inputs, max_new_tokens=2048, repetition_penalty=1.15)

# The decoded text contains the prompt followed by the model's completion.
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Out-of-Scope Use: Due to the heavy "Alignment Tax" from fine-tuning exclusively on the GSM8K dataset, this model is highly specialized for arithmetic and mathematical logic. It may perform poorly on general conversational tasks, creative writing, or lateral logic puzzles.

Training Details

Training Pipeline: The model was trained using a two-phase parameter-efficient fine-tuning (PEFT) pipeline on Kaggle Tesla T4 GPUs:

Phase 1: Supervised Fine-Tuning (SFT)

  • Trained on the full GSM8K training split (7,473 examples).
  • Utilized LoRA (Rank=8, Alpha=16, Target Modules: q_proj, v_proj).
  • Applied 4-bit NF4 quantization (BitsAndBytes).
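To give a feel for how small the Phase 1 adapter is, the sketch below counts the trainable LoRA parameters implied by Rank=8 on q_proj and v_proj. The projection shapes and layer count are assumptions based on the public Qwen3-0.6B configuration and should be checked against the model's config.json; the helper `lora_param_count` is hypothetical.

```python
from typing import Dict, Tuple


def lora_param_count(rank: int, shapes: Dict[str, Tuple[int, int]], num_layers: int) -> int:
    """Each adapted (d_in, d_out) projection adds A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


RANK, ALPHA = 8, 16  # values stated in the model card

# ASSUMPTION: (input_dim, output_dim) per projection and the layer count are
# taken from the public Qwen3-0.6B config, not from this model card.
ASSUMED_SHAPES = {"q_proj": (1024, 2048), "v_proj": (1024, 1024)}
ASSUMED_LAYERS = 28

scaling = ALPHA / RANK  # standard LoRA scaling applied to the low-rank update
trainable = lora_param_count(RANK, ASSUMED_SHAPES, ASSUMED_LAYERS)
print(f"LoRA scaling: {scaling}, trainable adapter parameters: {trainable:,}")
```

Under these assumed shapes the adapter holds roughly 1.1 million trainable parameters, a small fraction of the base model's weights.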

Phase 2: Group Relative Policy Optimization (GRPO)

  • Initialized from the best-performing ("champion") SFT adapter.
  • Trained on a 200-sample subset of GSM8K.
  • Employed Outcome Reward Modeling (ORM) with two reward functions:
      • Format Reward: +1.0 for valid <think> blocks, -1.0 for empty/hacked tags.
      • Accuracy Reward: +2.0 for matching the final extracted number to the ground truth.
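The two reward functions could look roughly like the sketch below. The model card gives only the reward values; the exact definition of a "hacked" tag (treated here as duplicated or unbalanced tags), the 0.0 reward for a wrong answer, and all function names are assumptions.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
FINAL_NUMBER = re.compile(r"####\s*\$?\s*(-?[\d,]*\.?\d+)")


def format_reward(completion: str) -> float:
    """+1.0 for exactly one non-empty <think> block, -1.0 otherwise (ASSUMED rule)."""
    blocks = THINK_BLOCK.findall(completion)
    balanced = completion.count("<think>") == 1 and completion.count("</think>") == 1
    if balanced and len(blocks) == 1 and blocks[0].strip():
        return 1.0
    return -1.0


def accuracy_reward(completion: str, ground_truth: float) -> float:
    """+2.0 if the number after #### matches the ground truth, else 0.0 (ASSUMED)."""
    match = FINAL_NUMBER.search(completion)
    if match is None:
        return 0.0
    predicted = float(match.group(1).replace(",", ""))
    return 2.0 if abs(predicted - ground_truth) < 1e-6 else 0.0


def total_reward(completion: str, ground_truth: float) -> float:
    """Sum of both rewards for one sampled completion."""
    return format_reward(completion) + accuracy_reward(completion, ground_truth)


good = "<think>2 + 3 = 5</think> #### 5"
empty = "<think></think> #### 5"
print(total_reward(good, 5.0), total_reward(empty, 5.0))
```

Keeping the format and accuracy terms separate means a completion with an empty reasoning block is still penalized even when its final number happens to be correct.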

Training Hyperparameters (GRPO Phase)

  • Learning Rate: 1e-5
  • KL Coefficient (Beta): 0.01
  • Group Size (Generations): 8
  • Max Completion Length: 512 tokens
  • Epochs: 1
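The group size matters because GRPO scores each completion relative to the other completions sampled for the same prompt, instead of using a learned critic. The sketch below shows that normalization for one group of 8 rewards; the reward values are made up for illustration, and the use of the population standard deviation and the `eps` value are assumptions that may differ from the training code.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # ASSUMPTION: population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for 8 completions of one prompt (format + accuracy totals).
rewards = [3.0, 3.0, 1.0, -1.0, 3.0, 1.0, -1.0, 3.0]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:+.1f}  advantage={a:+.3f}")
```

When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no policy gradient signal.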

Evaluation

Testing Data & Methodology

The model was evaluated using an automated LLM-as-a-Judge framework powered by LLaMA-3.3-70B-Versatile (via Groq API). Responses were deterministically graded out of 10 based on logical accuracy, step-by-step reasoning quality, and final answer correctness.
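The aggregation step of such an evaluation can be sketched as follows. The judge prompt and reply format used by the authors are not given, so the "N/10" reply pattern and both helper names are hypothetical; no API call is shown.

```python
import re
from statistics import mean
from typing import Iterable, Optional

# ASSUMPTION: the judge reply contains a score written as "<number>/10".
SCORE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*/\s*10")


def parse_judge_score(reply: str) -> Optional[float]:
    """Extract a 0-10 score from a judge reply, or None if no valid score is found."""
    match = SCORE_RE.search(reply)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 0.0 <= score <= 10.0 else None


def average_score(replies: Iterable[str]) -> Optional[float]:
    """Mean of all parseable scores; unparseable replies are skipped."""
    scores = [s for s in map(parse_judge_score, replies) if s is not None]
    return mean(scores) if scores else None


replies = ["Score: 8/10. One arithmetic slip.", "Score: 10/10", "garbled reply"]
print(average_score(replies))
```

Skipping unparseable replies is one possible policy; counting them as zero would give a stricter average.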

Results

The SFT + GRPO pipeline raised the LLM-as-a-Judge score from 4.30 for the zero-shot base model to 8.10 after SFT and 9.20 after GRPO. This suggests that GRPO can be highly effective at teaching small models to reason without requiring a separate critic model.

| Pipeline Stage                    | LLM-as-a-Judge Score (Out of 10) |
|-----------------------------------|----------------------------------|
| Base Model (Qwen3-0.6B Zero-shot) | 4.30                             |
| After SFT (Trial 1)               | 8.10                             |
| After GRPO (Champion Model)       | 9.20                             |

Citation

If you use this model or pipeline approach, please credit the authors:

@misc{qwen3-gsm8k-reasoning-2026,
  author    = {Pawan, Muskan and Kumar, Kashish Anil and Khalid, Abdullah},
  title     = {Large Reasoning Model Fine-Tuning Pipeline: SFT to GRPO},
  year      = {2026},
  publisher = {Hugging Face}
}

