UzairKhiiba

Qwen2.5-7B-sft-dpo-tuned



Model Selection

The final model was selected from five SFT trials and five DPO trials using BLEU and BERTScore F1. Because the task is open-ended instruction following, BERTScore F1 was treated as the primary metric: it compares contextual embeddings and so captures semantic similarity better than BLEU, which rewards exact n-gram overlap.
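To illustrate why exact word overlap can undervalue a good open-ended answer, here is a minimal sketch of clipped n-gram precision, the building block of BLEU. It is not the full BLEU score (no brevity penalty, no combination of n-gram orders), and the function name and example sentences are made up for illustration.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision of candidate against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


# Hypothetical sentences: the paraphrase keeps the meaning but shares one word.
reference = "the cat sat on the mat"
exact_score = ngram_precision("the cat sat on the mat", reference)
paraphrase_score = ngram_precision("a feline rested upon the rug", reference)

print(f"exact copy: {exact_score:.3f}")
print(f"paraphrase: {paraphrase_score:.3f}")
```

The paraphrase scores far lower than the exact copy even though it says the same thing. An embedding-based metric such as BERTScore is designed to reduce this kind of penalty.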

Model	BLEU	BERTScore F1
Raw baseline	9.3476	0.8052
Best SFT: trial 3	14.6922	0.8313
Best DPO: trial 5	12.4904	0.8315

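The final selected model is dpo_trial_5, which achieved the highest BERTScore F1 (0.8315) across all evaluated runs. Its margin over the best SFT run is narrow (0.0002), and its BLEU is lower (12.4904 versus 14.6922), so the selection rests on BERTScore F1 being the primary metric.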
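The selection rule can be sketched in a few lines of Python. The scores are copied from the results table. The dictionary keys `raw_baseline` and `sft_trial_3` and the helper `rank_by` are illustrative names, not part of the original evaluation code.

```python
# Scores copied from the results table.
results = {
    "raw_baseline": {"bleu": 9.3476, "bertscore_f1": 0.8052},
    "sft_trial_3": {"bleu": 14.6922, "bertscore_f1": 0.8313},
    "dpo_trial_5": {"bleu": 12.4904, "bertscore_f1": 0.8315},
}


def rank_by(metric: str) -> list[str]:
    """Return model names sorted from best to worst on the given metric."""
    return sorted(results, key=lambda name: results[name][metric], reverse=True)


bleu_ranking = rank_by("bleu")
bertscore_ranking = rank_by("bertscore_f1")

# BERTScore F1 is the primary metric, so its top entry is the selected model.
selected = bertscore_ranking[0]
runner_up = bertscore_ranking[1]
margin = results[selected]["bertscore_f1"] - results[runner_up]["bertscore_f1"]

print("BLEU ranking:     ", bleu_ranking)
print("BERTScore ranking:", bertscore_ranking)
print(f"Selected: {selected} (margin over {runner_up}: {margin:.4f})")
```

The two metrics disagree on the top model, which is why the choice of primary metric decides the outcome here.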

Training Summary

Supervised Fine-Tuning

Dataset: HuggingFaceH4/no_robots
Base model: Qwen/Qwen2.5-7B
Best SFT run: trial_3
LoRA rank: 32
LoRA alpha: 64
LoRA dropout: 0.05
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Learning rate: 1e-4
Effective batch size: 8
Epochs: 2
Max sequence length: 512
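As a rough sketch, the LoRA settings listed here map onto the argument names accepted by `peft.LoraConfig`. The card does not say which training library was used, so the use of peft and the `task_type` value are assumptions. The block runs without peft installed and only computes the LoRA scaling factor implied by the listed rank and alpha.

```python
# LoRA hyperparameters from the SFT summary, keyed by peft.LoraConfig argument names.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the summary
}

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {scaling}")

# With peft installed, the config could be built like this (assumed workflow):
# from peft import LoraConfig
# config = LoraConfig(**lora_kwargs)
```

With rank 32 and alpha 64, the adapter update is scaled by 2.0. The target modules cover both the attention projections and the MLP projections.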

Direct Preference Optimization

Dataset: Anthropic/hh-rlhf
Best DPO run: dpo_trial_5
SFT base adapter: trial_3
DPO beta: 0.1
Learning rate: 5e-5
Effective batch size: 8
Epochs: 2
Max sequence length: 512
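To make the role of beta concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not the training code used for this model, and the log-probability values are hypothetical. The reference model plays the role of the SFT starting point.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)), computed in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Hypothetical log-probabilities, for illustration only.
# Policy identical to the reference: the loss equals log(2).
untrained = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raises the chosen response and lowers the rejected one: the loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(f"policy == reference: {untrained:.4f}")
print(f"policy prefers chosen: {improved:.4f}")
```

A smaller beta shrinks the scaled margin, so the same shift in log-probabilities changes the loss less. This is one way to read beta=0.1 as a conservative setting that keeps the policy close to the SFT reference.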

Key Findings

SFT improved on the raw base model on both metrics, raising BERTScore F1 from 0.8052 to 0.8313 and BLEU from 9.3476 to 14.6922.
SFT trial_3 was the strongest supervised model by BERTScore F1.
DPO trial_5 gave the best overall semantic score, reaching 0.8315 BERTScore F1, 0.0002 above SFT trial_3, while its BLEU fell from 14.6922 to 12.4904.
BLEU and BERTScore did not always rank models the same way; BERTScore was more useful for evaluating open-ended generated answers.
Conservative DPO settings worked best. The selected DPO run used beta=0.1, which preserved instruction-following quality (BERTScore F1 of 0.8315 versus 0.8313 for its SFT starting point) while tuning for alignment on preference data.

Usage

python
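from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_id = "UzairKhiiba/Qwen2.5-7B-sft-dpo-tuned"

# Load the tokenizer and model; device_map="auto" places weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Explain supervised learning in simple terms."}
]

# Render the conversation with the tokenizer's chat template and append the assistant prompt.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample up to 200 new tokens with temperature and nucleus (top-p) sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))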

Intended Use

This model is intended for academic evaluation of instruction-following behavior after SFT and DPO tuning. It can be used for general response generation, explanatory prompts, reasoning-style prompts, and conversational assistant tasks.

Model provider: UzairKhiiba

Model tree

Base: Qwen/Qwen2.5-7B
Fine-tuned: this model

Modalities

Input: Text
Output: Text