UzairKhiiba
Qwen2.5-7B-sft-dpo-tuned
Model Selection
The final model was selected from five SFT trials and five DPO trials using BLEU and BERTScore F1. Because the task is open-ended instruction following, BERTScore F1 was treated as the primary metric: it captures semantic similarity, whereas BLEU rewards only exact n-gram overlap.
| Model | BLEU | BERTScore F1 |
|---|---|---|
| Raw baseline | 9.3476 | 0.8052 |
| Best SFT: trial 3 | 14.6922 | 0.8313 |
| Best DPO: trial 5 | 12.4904 | 0.8315 |
The final selected model is `dpo_trial_5`, which achieved the highest BERTScore F1 across all evaluated runs (0.8315, against 0.8313 for the best SFT run).
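The selection rule can be sketched in a few lines of Python. This is only an illustration of picking the top model on the primary metric, not the author's evaluation script; the scores are copied from the table, and the dictionary keys are illustrative labels.

```python
# Scores copied from the table above; the keys are illustrative labels.
scores = {
    "raw_baseline": {"bleu": 9.3476, "bertscore_f1": 0.8052},
    "sft_trial_3": {"bleu": 14.6922, "bertscore_f1": 0.8313},
    "dpo_trial_5": {"bleu": 12.4904, "bertscore_f1": 0.8315},
}


def rank_by(metric: str) -> list[str]:
    """Return model labels sorted from best to worst on the given metric."""
    return sorted(scores, key=lambda name: scores[name][metric], reverse=True)


bertscore_ranking = rank_by("bertscore_f1")
bleu_ranking = rank_by("bleu")

# BERTScore F1 is the primary metric, so the top of that ranking is selected.
selected = bertscore_ranking[0]

print("By BERTScore F1:", bertscore_ranking)
print("By BLEU:", bleu_ranking)
print("Selected:", selected)
```

Ranking by BLEU instead would put the best SFT run first, so the choice of primary metric decides the outcome here.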
Training Summary
Supervised Fine-Tuning
- Dataset: `HuggingFaceH4/no_robots`
- Base model: `Qwen/Qwen2.5-7B`
- Best SFT run: `trial_3`
- LoRA rank: 32
- LoRA alpha: 64
- LoRA dropout: 0.05
- Target modules: `q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
- Learning rate: `1e-4`
- Effective batch size: 8
- Epochs: 2
- Max sequence length: 512
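As a rough sketch, the LoRA settings above could be collected as keyword arguments in the shape that `peft.LoraConfig` accepts. The card does not name the training stack, so the use of PEFT and the `task_type` value are assumptions; the remaining values come from the list.

```python
# Values taken from the list above. The key names follow peft.LoraConfig,
# which is an assumption: the card does not say which library was used.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumed; not stated in the card
}

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {scaling}")
```

With rank 32 and alpha 64, the adapter update is scaled by a factor of 2 under the standard LoRA formulation.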
Direct Preference Optimization
- Dataset: `Anthropic/hh-rlhf`
- Best DPO run: `dpo_trial_5`
- SFT base adapter: `trial_3`
- DPO beta: 0.1
- Learning rate: `5e-5`
- Effective batch size: 8
- Epochs: 2
- Max sequence length: 512
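To show what the beta value controls, here is a minimal sketch of the standard per-pair DPO loss in plain Python. It is not the training code used for this model; the log-probability values are hypothetical, and treating the SFT model as the reference policy is an assumption about a typical setup.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen gain - rejected gain)))."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical sequence log-probabilities, for illustration only.
no_change = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy equals reference
improved = dpo_loss(-12.0, -15.0, -13.0, -14.0)    # policy favors the chosen reply

print(no_change, improved)
```

When the policy matches the reference, the loss equals log 2. A smaller beta shrinks the margin for the same log-probability shift, so each preference pair pulls the policy away from the reference more gently.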
Key Findings
- SFT improved the raw base model substantially, raising BERTScore F1 from `0.8052` to `0.8313`.
- SFT `trial_3` was the strongest supervised model by BERTScore F1.
- DPO `trial_5` gave the best overall semantic score, reaching `0.8315` BERTScore F1.
- BLEU and BERTScore did not always rank models the same way; BERTScore was more useful for evaluating open-ended generated answers.
- Conservative DPO settings worked best. The selected DPO run used `beta=0.1`, which preserved instruction-following quality while improving alignment.
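The gap between overlap-based and semantic metrics noted above can be illustrated with a toy example. The function below is a clipped unigram precision, a simplified stand-in for one component of BLEU rather than full BLEU, and the sentences are made up for illustration.

```python
from collections import Counter


def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also appear in the reference (clipped)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate word is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return overlap / len(cand_tokens)


reference = "the cat sat on the mat"
verbatim = "the cat sat on the mat"
paraphrase = "a feline rested on a rug"

verbatim_score = clipped_unigram_precision(verbatim, reference)
paraphrase_score = clipped_unigram_precision(paraphrase, reference)

print(verbatim_score, paraphrase_score)
```

The paraphrase keeps the meaning but shares only one word with the reference, so an overlap metric scores it low. An embedding-based metric such as BERTScore is designed to give credit in cases like this.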
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_id = "UzairKhiiba/Qwen2.5-7B-sft-dpo-tuned"

# Load the tokenizer and model; device_map="auto" places weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# Format the conversation with the tokenizer's chat template.
messages = [
    {"role": "user", "content": "Explain supervised learning in simple terms."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample up to 200 new tokens.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# The decoded string contains the prompt followed by the generated reply.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Intended Use
This model is intended for academic evaluation of instruction-following behavior after SFT and DPO tuning. It can be used for general response generation, explanatory prompts, reasoning-style prompts, and conversational assistant tasks.