Kurapika993/qwen2.5-7b-qlora-no-robots API & Inference Endpoint

Project Purpose

This is a supervised fine-tuning experiment for learning and demonstrating a 7B QLoRA workflow:

Load a 7B instruction model in 4-bit
Prepare the model for k-bit training
Load and clean the No Robots instruction dataset
Apply the Qwen chat template
Add LoRA adapters
Train with TRL SFTTrainer
Save adapter weights
Run inference with base model + adapter
Upload adapter to Hugging Face Hub
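
To make the "apply the Qwen chat template" step more concrete, here is a minimal sketch of the ChatML-style text that Qwen2.5 instruct models expect. It is a simplified stand-in for `tokenizer.apply_chat_template`, not a replacement for it: the helper name `render_chatml` is hypothetical, and tool calls and the tokenizer's default system prompt are not handled.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Hypothetical helper: format chat messages as ChatML-style text."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # At inference time the text ends with an open assistant turn.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is QLoRA?"},
    {"role": "assistant", "content": "A way to fine-tune quantized models."},
]

# For training, the full conversation (including the assistant reply) is rendered.
training_text = render_chatml(messages)

# For inference, the assistant reply is dropped and a generation prompt is added.
inference_text = render_chatml(messages[:2], add_generation_prompt=True)

print(training_text)
```

In practice the tokenizer's own template should be used, since it is the source of truth for the special tokens.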

Base Model

Qwen/Qwen2.5-7B-Instruct

Dataset

HuggingFaceH4/no_robots
Training subset: 5000 examples
Evaluation subset: 500 examples
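
The exact cleaning rules used for this run are not listed here, so the following is only a sketch of what a clean-then-subsample step might look like. The rule in `is_valid` (non-empty messages, conversation ends on an assistant turn) and the seed are assumptions for illustration, and the toy records stand in for real dataset rows with a `messages` field.

```python
import random


def is_valid(example):
    """Hypothetical cleaning rule: non-empty turns, ending on an assistant reply."""
    messages = example.get("messages") or []
    if not messages or messages[-1]["role"] != "assistant":
        return False
    return all(m["content"].strip() for m in messages)


def take_subset(examples, size, seed=42):
    """Filter, shuffle deterministically, and keep at most `size` examples."""
    cleaned = [ex for ex in examples if is_valid(ex)]
    random.Random(seed).shuffle(cleaned)
    return cleaned[:size]


# Toy rows standing in for dataset records (not real data).
toy = [
    {"messages": [{"role": "user", "content": "Hi"},
                  {"role": "assistant", "content": "Hello!"}]},
    {"messages": [{"role": "user", "content": "   "},
                  {"role": "assistant", "content": "?"}]},
    {"messages": [{"role": "user", "content": "Unanswered"}]},
]

subset = take_subset(toy, size=2)
print(len(subset))
```

With the `datasets` library, a similar effect could be had by passing `is_valid` to `Dataset.filter`, then calling `shuffle(seed=...)` and `select(range(5000))`.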

Training Method

Method: QLoRA
Quantization: 4-bit NF4
Double quantization: enabled
Compute dtype: bfloat16
LoRA rank: 16
LoRA alpha: 32
LoRA dropout: 0.05
Target modules: "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"
Max sequence length: 2048
Epochs: 1
Learning rate: 2e-4
Effective batch size: 16
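
As a back-of-the-envelope check on these settings, the sketch below estimates how many trainable parameters rank-16 LoRA adds across the listed target modules, along with the LoRA scaling factor. The model dimensions are assumptions taken from the public Qwen2.5-7B configuration rather than from this card, and the split of the effective batch size into per-device batch and gradient accumulation is hypothetical (only the product, 16, is stated above).

```python
# Assumed Qwen2.5-7B dimensions (not stated in this card).
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 512  # assumed: 4 key/value heads x 128 head dim

RANK = 16
ALPHA = 32

# (in_features, out_features) for each targeted projection.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
trainable = per_layer * NUM_LAYERS

# The adapter update is scaled by alpha / rank.
scaling = ALPHA / RANK

# Hypothetical split; only the product is given in the card.
per_device_batch = 4
grad_accum_steps = 4
effective_batch = per_device_batch * grad_accum_steps

print(f"trainable LoRA params: {trainable:,}")
print(f"scaling: {scaling}, effective batch: {effective_batch}")
```

Under these assumptions the adapter holds roughly 40 million trainable parameters, a small fraction of the 7B base model.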

Evaluation

This adapter was evaluated qualitatively using fixed instruction-following prompts.

Included files:

training_config.json
base_vs_adapter_comparison.json
loss_curve.png

This is a qualitative sanity check, not a formal benchmark.
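
A side-by-side comparison of this kind could be assembled roughly as follows. This is a sketch only: the record layout (`prompt`, `base`, `adapter`) is a hypothetical schema and may not match `base_vs_adapter_comparison.json`, and the two generator functions are stubs so the example runs without a GPU.

```python
import json


def build_comparison(prompts, base_generate, adapter_generate):
    """Collect base and adapter outputs for each fixed prompt."""
    records = []
    for prompt in prompts:
        records.append({
            "prompt": prompt,
            "base": base_generate(prompt),
            "adapter": adapter_generate(prompt),
        })
    return records


# Stub generators standing in for real model calls.
def fake_base(prompt):
    return f"[base] {prompt}"


def fake_adapter(prompt):
    return f"[adapter] {prompt}"


prompts = [
    "Write a haiku about rain.",
    "Summarize QLoRA in one sentence.",
]

records = build_comparison(prompts, fake_base, fake_adapter)
report = json.dumps(records, indent=2)
print(report)
```

With a loaded `PeftModel`, the base-model side could be generated inside `with model.disable_adapter():`, which avoids loading a second copy of the weights.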

Intended Use

This adapter is intended for instruction-following experiments and for learning PEFT/QLoRA.

Example use cases:

Testing PEFT adapter loading
Comparing base model and QLoRA-adapted outputs

Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

base_model = "Qwen/Qwen2.5-7B-Instruct"
adapter = "Kurapika993/qwen2.5-7b-qlora-no-robots"

# 4-bit NF4 quantization with double quantization, matching the training setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# The tokenizer is loaded from the adapter repository
tokenizer = AutoTokenizer.from_pretrained(adapter)

# Load the quantized base model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Attach the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, adapter)
model.eval()

def generate_response(model, tokenizer, user_prompt, max_new_tokens=250):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": user_prompt
        }
    ]

    # Render the chat into prompt text ending with an open assistant turn
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(
        text,
        return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.05,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Keep only the newly generated tokens, dropping the prompt
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    return response.strip()

prompt = "Explain instruction tuning to a beginner using a simple analogy."

response = generate_response(
    model,
    tokenizer,
    prompt,
    max_new_tokens=250
)

print(response)
```
