License: apache-2.0
Project Purpose
This is a supervised fine-tuning (SFT) experiment for learning and demonstrating the full QLoRA workflow on a 7B model:
- Load a 7B instruction model in 4-bit
- Prepare the model for k-bit training
- Convert Dolly-15K into chat format
- Apply the Qwen chat template
- Add LoRA adapters
- Train with TRL SFTTrainer
- Save adapter weights
- Run inference with base model + adapter
- Upload adapter to Hugging Face Hub
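The "Apply the Qwen chat template" step turns each list of messages into one training string. Qwen2.5 uses a ChatML-style layout, where every turn is wrapped as `<|im_start|>role`, a newline, the content, then `<|im_end|>`. The function below is a hypothetical, simplified re-implementation for illustration only. The real template shipped with the tokenizer also inserts a default system message when none is given and handles tool calls, so in practice you should call `tokenizer.apply_chat_template`.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate Qwen2.5's ChatML layout (hypothetical helper, not the real template).

    Each message becomes: <|im_start|>{role}\n{content}<|im_end|>\n
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is QLoRA?"},
    ]
    print(render_chatml(chat, add_generation_prompt=True))
```

During training the assistant turn is included and closed with `<|im_end|>`; at inference `add_generation_prompt=True` leaves the assistant turn open for the model to complete.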
Base Model
- Qwen/Qwen2.5-7B-Instruct
Dataset
- databricks/databricks-dolly-15k
- Training subset: 10000 examples
- Evaluation subset: 1000 examples
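Dolly-15K records carry `instruction`, `context`, `response`, and `category` fields. The sketch below shows one plausible way to convert them into chat messages and draw disjoint 10,000 / 1,000 subsets. How this project actually merged `context` into the prompt, whether it used a system message, and how it selected the subsets are not stated, so the labeled context format, the system prompt, the seed, and the helper names (`dolly_to_messages`, `split_subsets`) are all assumptions.

```python
import random

# Assumption: reuse the system prompt from the Example Usage section
SYSTEM_PROMPT = "You are a helpful assistant."


def dolly_to_messages(example):
    """Convert one Dolly record into a chat-format example (hypothetical helper)."""
    user = example["instruction"].strip()
    context = (example.get("context") or "").strip()
    if context:
        # Assumption: context is appended after the instruction under a label
        user = f"{user}\n\nContext:\n{context}"
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user},
            {"role": "assistant", "content": example["response"].strip()},
        ]
    }


def split_subsets(records, n_train=10_000, n_eval=1_000, seed=42):
    """Shuffle once with a fixed seed, then take disjoint train and eval slices."""
    if n_train + n_eval > len(records):
        raise ValueError("not enough records for the requested subsets")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    train = [dolly_to_messages(r) for r in shuffled[:n_train]]
    eval_set = [dolly_to_messages(r) for r in shuffled[n_train:n_train + n_eval]]
    return train, eval_set
```

Slicing a single shuffled list guarantees the evaluation examples never appear in the training subset.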
Training Method
- Method: QLoRA
- Quantization: 4-bit NF4
- Double quantization: enabled
- Compute dtype: bfloat16
- LoRA rank: 16
- LoRA alpha: 32
- LoRA dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Max sequence length: 2048
- Epochs: 1
- Learning rate: 2e-4
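The LoRA settings above fix both the adapter's scaling factor (alpha / r = 32 / 16 = 2 under standard, non-rsLoRA scaling) and the number of trained parameters. Each adapted linear layer of shape d_in → d_out gains two low-rank matrices, A (r × d_in) and B (d_out × r), so it adds r × (d_in + d_out) parameters while its frozen 4-bit weight stays untouched. The layer dimensions below are assumptions taken from the published Qwen2.5-7B config (hidden size 3584, MLP size 18944, 28 decoder layers, 4 key/value heads of dimension 128); verify them against the model's `config.json`.

```python
# Assumed Qwen2.5-7B dimensions (verify against the model's config.json)
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 4 * 128  # num_key_value_heads * head_dim (grouped-query attention)

LORA_R = 16
LORA_ALPHA = 32

# (in_features, out_features) of each targeted linear layer in one decoder block
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, r):
    """Parameters in LoRA's A (r x d_in) plus B (d_out x r) for one linear layer."""
    return r * d_in + d_out * r


def count_trainable(shapes, r, num_layers):
    """Sum LoRA parameters over all targeted modules in every decoder layer."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
    return per_layer * num_layers


scaling = LORA_ALPHA / LORA_R  # multiplies the low-rank update B @ A
total = count_trainable(TARGET_SHAPES, LORA_R, NUM_LAYERS)
print(f"scaling = {scaling}")
print(f"trainable LoRA parameters = {total:,}")
```

Under these assumed dimensions the script reports 40,370,176 trainable parameters, a small fraction of the base model, which is why the saved adapter is far smaller than the full checkpoint.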
Intended Use
This adapter is intended for instruction-following experiments and for learning about PEFT/QLoRA.
Example use case:
- Comparing outputs of the base model and the QLoRA-adapted model
Example Usage
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "Qwen/Qwen2.5-7B-Instruct"
adapter = "Kurapika993/qwen2.5-7b-qlora-dolly15k"

# Same 4-bit NF4 + double-quantization settings listed under Training Method
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Loads the tokenizer from the adapter repo; this requires tokenizer files to be
# present there (otherwise load it from base_model instead)
tokenizer = AutoTokenizer.from_pretrained(adapter)

# Load the quantized base model, then attach the LoRA adapter on top of it
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()


def generate_response(model, tokenizer, user_prompt, max_new_tokens=250):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_prompt},
    ]
    # Render the conversation with the Qwen chat template and open an assistant turn
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.05,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Drop the prompt tokens so only the newly generated text is decoded
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return response.strip()


prompt = "Explain instruction tuning to a beginner using a simple analogy."
response = generate_response(model, tokenizer, prompt)
print(response)
```
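The `generate` call in the example samples with `temperature=0.7`, `top_p=0.9`, and `repetition_penalty=1.05`. The pure-Python sketch below shows, on a toy logit vector, roughly what those settings do in sequence: a CTRL-style repetition penalty on already-seen token ids, temperature scaling before the softmax, then nucleus (top-p) filtering and sampling. It is a simplified illustration rather than the transformers implementation, and all helper names are hypothetical.

```python
import math
import random


def apply_repetition_penalty(logits, previous_ids, penalty=1.05):
    """Shrink logits of already-generated tokens: divide positives, multiply negatives."""
    out = list(logits)
    for i in set(previous_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest high-probability set whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_next(logits, previous_ids, rng, temperature=0.7, top_p=0.9, penalty=1.05):
    """Penalty, then temperature softmax, then top-p, then draw one token id."""
    penalized = apply_repetition_penalty(logits, previous_ids, penalty)
    nucleus = top_p_filter(softmax(penalized, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]
    # At T=0.7 and top_p=0.9 only token ids 0 and 1 survive the nucleus filter
    print(top_p_filter(softmax(toy_logits, 0.7), 0.9))
    print(sample_next(toy_logits, previous_ids=[], rng=random.Random(0)))
```

Lowering `temperature` or `top_p` makes outputs more conservative; raising them increases diversity at some cost to coherence.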
Model Provider
- Kurapika993
Model Tree
- Base: Qwen/Qwen2.5-7B-Instruct
- Adapter: this model
Modalities
- Input: Text
- Output: Text