README

License: apache-2.0

Project Purpose

This is a supervised fine-tuning experiment for learning and demonstrating the full 7B QLoRA workflow:

  1. Load a 7B instruction model in 4-bit
  2. Prepare the model for k-bit training
  3. Convert Dolly-15K into chat format
  4. Apply the Qwen chat template
  5. Add LoRA adapters
  6. Train with TRL SFTTrainer
  7. Save adapter weights
  8. Run inference with base model + adapter
  9. Upload adapter to Hugging Face Hub
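
Steps 3 and 4 can be illustrated with a minimal sketch. It assumes the standard Dolly-15K fields (`instruction`, `context`, `response`) and reuses the system prompt from the usage example; the helper name `dolly_to_messages`, the way the context is appended to the user turn, and the sample record are all hypothetical, since the README does not show the original conversion code.

```python
def dolly_to_messages(record, system_prompt="You are a helpful assistant."):
    """Convert one Dolly-15K record into a list of chat messages.

    Hypothetical helper: the README does not specify the exact mapping.
    """
    instruction = record["instruction"].strip()
    context = (record.get("context") or "").strip()

    # Assumption: when a context passage exists, append it to the user turn.
    if context:
        user_content = f"{instruction}\n\nContext:\n{context}"
    else:
        user_content = instruction

    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["response"].strip()},
    ]


# Made-up record for illustration only (not taken from the dataset).
example = {
    "instruction": "What is the capital of France?",
    "context": "",
    "response": "Paris.",
    "category": "open_qa",
}
messages = dolly_to_messages(example)
```

Passing `messages` to `tokenizer.apply_chat_template(messages, tokenize=False)` with the Qwen tokenizer would then render the training text for step 4.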

Base Model

  • Qwen/Qwen2.5-7B-Instruct

Dataset

  • databricks/databricks-dolly-15k
  • Training subset: 10000 examples
  • Evaluation subset: 1000 examples
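
The README does not say how the two subsets were drawn. One plausible approach is a seeded shuffle followed by two disjoint slices, sketched below. The function name `split_indices`, the seed, and the row count of 15,011 are assumptions made for illustration.

```python
import random


def split_indices(num_rows, train_size=10_000, eval_size=1_000, seed=42):
    """Return disjoint (train, eval) index lists from a seeded shuffle.

    Hypothetical helper; the seed and the shuffle-then-slice strategy
    are assumptions, not the author's documented procedure.
    """
    if train_size + eval_size > num_rows:
        raise ValueError("Requested subsets exceed the dataset size.")

    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # deterministic for a fixed seed

    train = indices[:train_size]
    evaluation = indices[train_size:train_size + eval_size]
    return train, evaluation


# Assumed row count for databricks-dolly-15k.
train_idx, eval_idx = split_indices(15_011)
```

With the `datasets` library, the resulting lists could be passed to `Dataset.select` to materialize each subset.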

Training Method

  • Method: QLoRA
  • Quantization: 4-bit NF4
  • Double quantization: enabled
  • Compute dtype: bfloat16
  • LoRA rank: 16
  • LoRA alpha: 32
  • LoRA dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Max sequence length: 2048
  • Epochs: 1
  • Learning rate: 2e-4
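
To give a sense of scale for these settings, the sketch below estimates how many trainable parameters the adapters add. The layer dimensions are assumptions taken from the published Qwen2.5-7B configuration (28 layers, hidden size 3584, 4 key/value heads of dimension 128, MLP size 18944) and are not stated in this README. The count assumes standard LoRA, where each adapted linear layer gains `rank * (in_features + out_features)` parameters and the update is scaled by `alpha / rank`.

```python
RANK = 16
ALPHA = 32

# Assumed Qwen2.5-7B dimensions (not given in this README).
NUM_LAYERS = 28
HIDDEN = 3584
KV_DIM = 4 * 128       # 4 key/value heads x 128 head dimension
INTERMEDIATE = 18944

# (in_features, out_features) for each target module in one decoder layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Sum the A (rank x in) and B (out x rank) matrices over all layers."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(MODULE_SHAPES, RANK, NUM_LAYERS)
scaling = ALPHA / RANK

print(f"trainable LoRA parameters: {trainable:,}")  # 40,370,176
print(f"LoRA scaling factor: {scaling}")            # 2.0
```

Under these assumptions the adapters hold roughly 40 million parameters, well under 1% of the 7B base model's weights.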

Intended Use

This adapter is intended for instruction-following experiments and PEFT/QLoRA learning.

Example use cases:

  • Comparing base model and QLoRA-adapted outputs

Example Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "Qwen/Qwen2.5-7B-Instruct"
adapter = "Kurapika993/qwen2.5-7b-qlora-dolly15k"

# Same 4-bit NF4 quantization settings as listed under Training Method.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# The tokenizer is loaded from the adapter repository.
tokenizer = AutoTokenizer.from_pretrained(adapter)

# Load the quantized base model, then attach the LoRA adapter on top of it.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()


def generate_response(model, tokenizer, user_prompt, max_new_tokens=250):
    """Generate a single chat response for user_prompt."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_prompt},
    ]

    # Render the conversation with the Qwen chat template and open the
    # assistant turn so the model continues from there.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.05,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Drop the prompt tokens and decode only the newly generated ones.
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return response.strip()


prompt = "Explain instruction tuning to a beginner using a simple analogy."
response = generate_response(model, tokenizer, prompt)
print(response)
```

Model provider

  • Kurapika993

Model tree

  • Base: Qwen/Qwen2.5-7B-Instruct
  • Adapter: this model

Modalities

  • Input: Text
  • Output: Text
