muskannnnn
Qwen3-0.6B-GSM8K-Reasoning
License: apache-2.0
Model Details
Model Description
This model is a fine-tuned version of Qwen/Qwen3-0.6B explicitly trained to solve mathematical word problems using structured, step-by-step reasoning. It was developed as part of an advanced NLP pipeline utilizing both Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO).
Through reinforcement learning, the model has learned to generate highly focused Chain-of-Thought (CoT) logic enclosed in <think> tags before outputting the final numerical answer.
- Developed by: Muskan Pawan, Kashish Anil Kumar, Abdullah Khalid (Institute of Business Administration)
- Model type: Causal Language Model (Fine-tuned for Reasoning)
- Language(s) (NLP): English
- Finetuned from model: Qwen/Qwen3-0.6B
Uses
Direct Use
The model is designed to receive math word problems and output a structured reasoning process followed by the final answer. It is optimized for the ChatML format.
Example input format:
```text
System: "You are a helpful math tutor. Solve the problem step by step, then give the final answer after ####."
User: <Math problem here>
Assistant: <think> ... </think> #### <Final Answer>
```
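For readers unfamiliar with ChatML, the following is a minimal sketch of how a message list maps onto the `<|im_start|>` / `<|im_end|>` layout. The helper `to_chatml` is hypothetical and written only for illustration; the tokenizer's own chat template remains the authoritative renderer and may add content this sketch does not reproduce.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string.

    Illustrative helper only (not part of transformers); the tokenizer's
    chat template is the authoritative renderer for this model.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


SYSTEM_PROMPT = (
    "You are a helpful math tutor. Solve the problem step by step, "
    "then give the final answer after ####."
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is 12 * 7?"},
]
prompt_text = to_chatml(messages)
print(prompt_text)
```

In practice, prefer `tokenizer.apply_chat_template`, which reads the template shipped with the model.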
How to Use This Model: You can load the model and run inference with the standard transformers library. Use the provided system prompt and set repetition_penalty=1.15 to avoid repetition loops in which generation never reaches the EOS token.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "muskannnnn/Qwen3-0.6B-GSM8K-Reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "A store sells apples for $0.75 each and oranges for $1.25 each. If Sarah buys 4 apples and 3 oranges, how much does she spend in total?"
messages = [
    {"role": "system", "content": "You are a helpful math tutor. Solve the problem step by step, then give the final answer after ####."},
    {"role": "user", "content": prompt},
]

# Render the conversation with the model's chat template and open the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# repetition_penalty=1.15 is the setting recommended above.
generated_ids = model.generate(**inputs, max_new_tokens=2048, repetition_penalty=1.15)

# The decoded string contains the prompt followed by the model's completion.
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
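Once a response is available, the reasoning and the final answer can be separated with a few lines of string handling. This is a minimal sketch that assumes the output follows the `<think> ... </think> #### <answer>` layout described earlier; `split_response` is a hypothetical helper, and the sample string is hand-written for illustration rather than real model output.

```python
import re


def split_response(response):
    """Return (reasoning, final_answer) from '<think>...</think> #### answer'.

    Either element is None when the corresponding part is missing.
    """
    think = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = think.group(1).strip() if think else None

    # Treat the text after the last '####' marker as the final answer.
    _, sep, tail = response.rpartition("####")
    final_answer = tail.strip() if sep else None
    return reasoning, final_answer


# Hand-written sample in the expected format (not actual model output).
sample = (
    "<think>4 * 0.75 = 3.00 and 3 * 1.25 = 3.75, "
    "so 3.00 + 3.75 = 6.75</think> #### 6.75"
)
reasoning, answer = split_response(sample)
print(answer)
```

The helper assumes the string holds only the assistant's completion. If the decoded text also includes the prompt, note that the system prompt itself contains `####`, so strip the prompt before parsing.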
Out-of-Scope Use: Due to the heavy "Alignment Tax" from fine-tuning exclusively on the GSM8K dataset, this model is highly specialized for arithmetic and mathematical logic. It may perform poorly on general conversational tasks, creative writing, or lateral logic puzzles.
Training Details
Training Pipeline: The model was trained using a two-phase parameter-efficient fine-tuning (PEFT) pipeline on Kaggle Tesla T4 GPUs:
Phase 1: Supervised Fine-Tuning (SFT)
- Trained on the full GSM8K training split (7,473 examples).
- Used LoRA (rank 8, alpha 16; target modules: q_proj, v_proj).
- Applied 4-bit NF4 quantization (BitsAndBytes).
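The Phase 1 settings could be expressed roughly as follows with the peft and transformers configuration classes. This is a sketch of the stated values, not the authors' training script: the compute dtype and `task_type` are assumptions that the card does not specify.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    # Assumption: float16 compute, since T4 GPUs lack native bfloat16 support.
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapter on the attention query and value projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",  # assumption: not stated in the card
)
```

With rank 8 and alpha 16, the adapter update is scaled by alpha / rank = 2.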
Phase 2: Group Relative Policy Optimization (GRPO)
- Initialized from the champion SFT adapter.
- Trained on a 200-sample subset of GSM8K.
- Employed Outcome Reward Modeling (ORM) with two reward functions:
  - Format Reward: +1.0 for valid <think> blocks, -1.0 for empty or hacked tags.
  - Accuracy Reward: +2.0 when the final extracted number matches the ground truth.
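The two reward functions might look something like the sketch below. The reward magnitudes come from the description above; everything else is an assumption made for illustration, including what counts as a "valid" block (exactly one non-empty `<think>` block), the 0.0 reward for a wrong answer, and the number-extraction rule.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def format_reward(completion):
    """+1.0 for exactly one non-empty <think> block, -1.0 otherwise.

    Assumption: missing, empty, or repeated blocks all count as invalid.
    """
    blocks = THINK_RE.findall(completion)
    if len(blocks) == 1 and blocks[0].strip():
        return 1.0
    return -1.0


def extract_final_number(completion):
    """Return the first number after the last '####' marker, or None."""
    _, sep, tail = completion.rpartition("####")
    if not sep:
        return None
    match = NUMBER_RE.search(tail.replace(",", ""))
    return float(match.group()) if match else None


def accuracy_reward(completion, ground_truth):
    """+2.0 when the extracted number equals the ground truth, else 0.0.

    Assumption: a wrong or missing answer earns 0.0 rather than a penalty.
    """
    predicted = extract_final_number(completion)
    if predicted is not None and abs(predicted - float(ground_truth)) < 1e-6:
        return 2.0
    return 0.0


good = "<think>3 + 4 = 7</think> #### 7"
empty = "<think></think> #### 7"
wrong = "<think>3 + 4 = 8</think> #### 8"
total_good = format_reward(good) + accuracy_reward(good, 7)
print(total_good)
```

Under these assumptions a completion's total reward is one of -1.0, 1.0, or 3.0, which gives the policy a clear ordering between malformed, well-formed but wrong, and well-formed and correct outputs.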
Training Hyperparameters (GRPO Phase)
- Learning Rate: 1e-5
- KL Coefficient (Beta): 0.01
- Group Size (Generations): 8
- Max Completion Length: 512 tokens
- Epochs: 1
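The group size matters because GRPO scores each completion relative to the other completions sampled for the same prompt, rather than against a learned value function. The sketch below shows that normalization in its commonly described form. The reward values are hypothetical, and the use of population standard deviation and the `eps` value are assumptions; implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group's mean and std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; libraries may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards for the 8 completions sampled for one prompt.
group_rewards = [3.0, 3.0, 1.0, 1.0, 1.0, -1.0, 3.0, 1.0]
advantages = group_relative_advantages(group_rewards)

for reward, advantage in zip(group_rewards, advantages):
    print(f"reward={reward:+.1f}  advantage={advantage:+.3f}")
```

Completions that beat the group average receive a positive advantage and the rest a negative one. When all 8 completions earn the same reward, every advantage is zero and that prompt contributes no policy gradient signal.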
Evaluation
Testing Data & Methodology
The model was evaluated using an automated LLM-as-a-Judge framework powered by LLaMA-3.3-70B-Versatile (via Groq API). Responses were deterministically graded out of 10 based on logical accuracy, step-by-step reasoning quality, and final answer correctness.
Results
The SFT + GRPO pipeline raised the LLM-as-a-Judge score from 4.30 for the base model to 9.20. This suggests that GRPO can be effective at teaching small models to reason without requiring a separate critic model.
| Pipeline Stage | LLM-as-a-Judge Score (Out of 10) |
| --- | --- |
| Base Model (Qwen3-0.6B Zero-shot) | 4.30 |
| After SFT (Trial 1) | 8.10 |
| After GRPO (Champion Model) | 9.20 |
Citation
If you use this model or pipeline approach, please credit the authors:
```bibtex
@misc{qwen3-gsm8k-reasoning-2026,
  author    = {Pawan, Muskan and Kumar, Kashish Anil and Khalid, Abdullah},
  title     = {Large Reasoning Model Fine-Tuning Pipeline: SFT to GRPO},
  year      = {2026},
  publisher = {Hugging Face}
}
```