tayyib-sayyid/qwen2.5-0.5b-gsm8k-lora

Results

| Stage | Metric | Score |
| --- | --- | --- |
| Base model (no fine-tune) | LLM-judge avg (1-5), 10 prompts | 2.0 |
| This adapter (sft_trial_1) | LLM-judge avg (1-5), 10 prompts | 3.4 |
| Best GRPO trial on top of this | regex exact-match (0-10) | 2.0 |

Judge model: groq/llama-3.3-70b-versatile. Evaluation prompts are 10 held-out GSM8K test problems formatted with the ChatML template and the #### N terminator.

How to use

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-0.5B-Instruct"
adapter_id = "tayyib-sayyid/qwen2.5-0.5b-gsm8k-lora"

# Load the tokenizer from the adapter repo, then attach the LoRA weights to the base model.
tok = AutoTokenizer.from_pretrained(adapter_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

prompt = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?"
)
messages = [{"role": "user", "content": prompt}]

# Render the question with the chat template and open the assistant turn.
inputs = tok.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)

# Greedy decoding; print only the newly generated tokens, not the prompt.
out = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

The model is trained to end every answer with `#### <number>`, so a simple regex such as `r"####\s*(-?[\d,\.]+)"` is usually enough to extract the final numeric answer.
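A minimal sketch of that extraction step, using the pattern quoted above. The helper name `extract_answer`, the choice to take the last `####` marker, and the stripping of thousands separators and a trailing period are illustrative assumptions, not part of the released code.

```python
import re
from typing import Optional

# Pattern quoted in the text: captures the number that follows the "####" marker.
ANSWER_RE = re.compile(r"####\s*(-?[\d,\.]+)")


def extract_answer(text: str) -> Optional[float]:
    """Return the final numeric answer, or None if no usable marker is found."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None
    # Assumption: if several markers appear, the last one is the final answer.
    # Drop thousands separators and a trailing sentence period before parsing.
    raw = matches[-1].replace(",", "").rstrip(".")
    try:
        return float(raw)
    except ValueError:
        # The pattern can capture punctuation-only strings such as "." or "-,".
        return None


print(extract_answer("48 + 24 = 72\n#### 72"))
```

Returning `None` rather than raising keeps a scoring loop simple: a completion with no parsable marker can be counted as incorrect.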

Training details

Base model: Qwen/Qwen2.5-0.5B-Instruct
Dataset: openai/gsm8k (main config), 5,000-row training subset (seed 42), rendered in ChatML with the `#### N` final-answer terminator.
Method: LoRA (PEFT) on top of a 4-bit NF4-quantized base model via bitsandbytes.
Framework versions: PEFT 0.13.x, TRL SFTTrainer, transformers 4.4x.
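The data preparation described above might look roughly like the sketch below. How the 5,000 rows were actually drawn is not stated, so the shuffle-then-select step is an assumption, as are the helper names `to_messages` and `load_training_subset` and the `messages` column name. GSM8K reference answers already end with `#### N`, so the sketch passes them through unchanged.

```python
def to_messages(row):
    """Turn one GSM8K row into a user/assistant message pair."""
    return [
        {"role": "user", "content": row["question"].strip()},
        # The reference solution already ends with the "#### N" terminator.
        {"role": "assistant", "content": row["answer"].strip()},
    ]


def load_training_subset(n_rows=5000, seed=42):
    """Assumed subset recipe: shuffle the train split with the seed, keep n_rows."""
    from datasets import load_dataset  # imported lazily; needs the `datasets` package

    train = load_dataset("openai/gsm8k", "main", split="train")
    subset = train.shuffle(seed=seed).select(range(n_rows))
    # Store chat-style messages so the trainer's chat template can render ChatML.
    return subset.map(lambda row: {"messages": to_messages(row)})


example = {"question": "What is 2 + 3?", "answer": "2 + 3 = 5\n#### 5"}
print(to_messages(example))
```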

LoRA hyperparameters

| Field | Value |
| --- | --- |
| `r` | 8 |
| `alpha` | 16 |
| `dropout` | 0.05 |
| `target_modules` | q_proj, v_proj |
| `task_type` | `CAUSAL_LM` |
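As a rough sketch, these values map onto PEFT's `LoraConfig` as shown below, combined with the 4-bit NF4 base mentioned under Training details. The function name `build_quantized_lora_model` is hypothetical, and the bfloat16 compute dtype and the `prepare_model_for_kbit_training` call are assumptions; the card does not say which were used.

```python
# LoRA values from the table, keyed by the LoraConfig argument names.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
    "task_type": "CAUSAL_LM",
}


def lora_scaling(kwargs=LORA_KWARGS):
    """Standard LoRA scales the low-rank update by alpha / r."""
    return kwargs["lora_alpha"] / kwargs["r"]


def build_quantized_lora_model(base_id="Qwen/Qwen2.5-0.5B-Instruct"):
    """Load the base model in 4-bit NF4 and wrap it with a LoRA adapter."""
    # Heavy imports are kept inside the function so the constants above
    # can be inspected without a GPU or these packages installed.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: dtype not stated in the card
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_id, quantization_config=quant, device_map="auto"
    )
    base = prepare_model_for_kbit_training(base)  # assumption: usual QLoRA-style prep
    return get_peft_model(base, LoraConfig(**LORA_KWARGS))


print(lora_scaling())
```

With `r` = 8 and `alpha` = 16, the adapter update is scaled by a factor of 2.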

Optimization

| Field | Value |
| --- | --- |
| learning rate | 0.0002 |
| optimizer | `paged_adamw_8bit` |
| per-device batch size | 4 |
| gradient accumulation | 4 |
| effective batch size | 16 |
| epochs | 1 |
| warmup ratio | 0.03 |
| weight decay | 0.0 |
| max sequence length | 1024 |
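The arithmetic behind the effective batch size, and what it implies for a 5,000-row epoch, can be sketched as follows. A single device is assumed, the helper names and the `output_dir` value are hypothetical, and the exact optimizer-step count depends on how the trainer handles the final partial batch, so treat the step numbers as approximate.

```python
import math

# Values from the table, keyed by transformers.TrainingArguments argument names.
TRAINING_KWARGS = {
    "learning_rate": 2e-4,
    "optim": "paged_adamw_8bit",
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 1,
    "warmup_ratio": 0.03,
    "weight_decay": 0.0,
}


def schedule_summary(n_examples=5000, n_devices=1, kwargs=TRAINING_KWARGS):
    """Derive effective batch size and approximate step counts."""
    effective = (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * n_devices  # assumption: one GPU
    )
    steps = math.ceil(n_examples / effective) * kwargs["num_train_epochs"]
    warmup = math.ceil(steps * kwargs["warmup_ratio"])
    return {"effective_batch": effective, "optimizer_steps": steps, "warmup_steps": warmup}


def build_training_args(output_dir="outputs/sft"):  # hypothetical path
    """Build TrainingArguments lazily so the arithmetic above runs without transformers."""
    from transformers import TrainingArguments

    # The 1024-token max sequence length is set on the TRL SFT config instead;
    # its argument name differs between TRL releases.
    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


print(schedule_summary())
```

Under these assumptions the run takes about 313 optimizer steps, of which roughly the first 10 are warmup.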

Limitations

Trained and evaluated only on GSM8K, that is, grade-school math word problems in English. It is unlikely to generalize to other math styles (geometry, calculus, proofs) and is not a general-purpose chat model.
The 10-prompt evaluation set is small; the headline score is a directional signal, not a benchmark.
The base model has 0.5B parameters, which makes it useful for studying the SFT/GRPO pipeline at low cost, but its accuracy is well below that of larger reasoning models.

Citation

This adapter was produced as part of NLP Assignment 4 at IBA. The full pipeline, hyperparameter sweep tables, and LaTeX report live in the source repository.

Framework versions

PEFT 0.13.x
transformers 4.4x.x
TRL 0.13.x–0.14.x
