Kurapika993

qwen2.5-7b-qlora-dolly15k

Project Purpose

This is a supervised fine-tuning experiment for learning and demonstrating the full 7B QLoRA workflow:

Load a 7B instruction model in 4-bit
Prepare the model for k-bit training
Convert Dolly-15K into chat format
Apply the Qwen chat template
Add LoRA adapters
Train with TRL SFTTrainer
Save adapter weights
Run inference with base model + adapter
Upload adapter to Hugging Face Hub
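
To make the "Apply the Qwen chat template" step more concrete, here is a minimal sketch of the ChatML-style markup that Qwen2.5 instruct models use. The `render_chatml` helper is hypothetical and written only for illustration. In practice `tokenizer.apply_chat_template` should be used, since the real template also handles details this sketch ignores (for example a default system message and tool calls).

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render chat messages in ChatML-style markup (illustrative only).

    This is a simplified stand-in for tokenizer.apply_chat_template and
    does not reproduce every rule of the real Qwen template.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # An open assistant turn tells the model to start answering.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is QLoRA?"},
]
print(render_chatml(example_messages))
```

During training the assistant turn is part of the rendered text and the generation prompt is normally left off. At inference time the generation prompt is added, as in the usage example in this card.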

Base Model

Qwen/Qwen2.5-7B-Instruct

Dataset

databricks/databricks-dolly-15k
Training subset: 10000 examples
Evaluation subset: 1000 examples
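
The card does not show how Dolly records were converted into chat format, so the following is only a sketch of one plausible mapping. Dolly-15K records carry `instruction`, `context`, and `response` fields. The `dolly_to_messages` helper, the way the context is appended to the user turn, and the system prompt are all assumptions, not the author's confirmed preprocessing.

```python
def dolly_to_messages(example, system_prompt="You are a helpful assistant."):
    """Convert one Dolly-15K record into a list of chat messages.

    Assumption: the optional context is appended to the instruction in the
    user turn. The actual preprocessing used for this adapter is not stated.
    """
    instruction = example["instruction"].strip()
    context = (example.get("context") or "").strip()

    # Only add a context section when the record actually has one.
    if context:
        user_content = f"{instruction}\n\nContext:\n{context}"
    else:
        user_content = instruction

    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["response"].strip()},
    ]


record = {
    "instruction": "Summarize the passage.",
    "context": "QLoRA trains LoRA adapters on a 4-bit quantized base model.",
    "response": "QLoRA fine-tunes adapters on top of a 4-bit model.",
    "category": "summarization",
}
for message in dolly_to_messages(record):
    print(message["role"], "->", message["content"])
```

A function like this could be applied to each record with `datasets.Dataset.map` before the chat template is applied.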

Training Method

Method: QLoRA
Quantization: 4-bit NF4
Double quantization: enabled
Compute dtype: bfloat16
LoRA rank: 16
LoRA alpha: 32
LoRA dropout: 0.05
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Max sequence length: 2048
Epochs: 1
Learning rate: 2e-4
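
As a rough illustration of what these settings imply, the sketch below estimates the number of trainable LoRA parameters and the LoRA scaling factor. The layer dimensions (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of size 128) are assumed from the published Qwen2.5-7B configuration and should be checked against the model's `config.json`. The rank, alpha, and target modules come from the list above.

```python
# Assumed Qwen2.5-7B dimensions (verify against the model's config.json).
HIDDEN = 3584          # hidden_size
INTERMEDIATE = 18944   # intermediate_size of the MLP
NUM_LAYERS = 28        # num_hidden_layers
KV_DIM = 4 * 128       # num_key_value_heads * head_dim

# Values stated in this card.
RANK = 16
ALPHA = 32

# (in_features, out_features) for each targeted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) for each adapted layer."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # the LoRA update is scaled by alpha / rank

print(f"Trainable LoRA parameters per layer: {per_layer:,}")
print(f"Trainable LoRA parameters in total:  {total:,}")
print(f"LoRA scaling factor (alpha / rank):  {scaling}")
```

Under these assumptions the adapter has roughly 40 million trainable parameters, well under 1% of the 7B base model.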

Intended Use

This adapter is intended for instruction-following experiments and PEFT/QLoRA learning.

Example use cases:

Comparing base model and QLoRA-adapted outputs
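
One way to run such a comparison without loading the base model twice is PEFT's `disable_adapter()` context manager, which temporarily bypasses the LoRA weights. The sketch below assumes `model` is a `peft.PeftModel`. `compare_base_and_adapter` and `generate_fn` are hypothetical names, where `generate_fn` is any callable that turns a prompt into text using that model (for instance a thin wrapper around the `generate_response` helper from the usage example in this card).

```python
def compare_base_and_adapter(model, generate_fn, prompt):
    """Generate from the same prompt with the adapter disabled, then enabled.

    `model` is expected to be a peft.PeftModel. `generate_fn` is a
    hypothetical callable mapping a prompt string to generated text.
    """
    # Inside this block the LoRA weights are bypassed, so generation
    # reflects the base model only.
    with model.disable_adapter():
        base_output = generate_fn(prompt)

    # Outside the context manager the adapter is active again.
    adapter_output = generate_fn(prompt)

    return {"base": base_output, "adapter": adapter_output}
```

When sampling is enabled, some of the difference between the two outputs comes from randomness rather than from the adapter. Greedy decoding (`do_sample=False`) or a fixed seed makes the comparison easier to interpret.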

Example Usage

python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

base_model = "Qwen/Qwen2.5-7B-Instruct"
adapter = "Kurapika993/qwen2.5-7b-qlora-dolly15k"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(adapter)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

model = PeftModel.from_pretrained(model, adapter)
model.eval()

def generate_response(model, tokenizer, user_prompt, max_new_tokens=250):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": user_prompt
        }
    ]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(
        text,
        return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.05,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Keep only the newly generated tokens by slicing off the prompt tokens.
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    return response.strip()

prompt = "Explain instruction tuning to a beginner using a simple analogy."

response = generate_response(model, tokenizer, prompt)
print(response)
