License: apache-2.0
Quick start
PEFT
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "svyatsharov/Role-play-ai")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

messages = [
    {"role": "system", "content": "You are Mira, a warm tavern owner. Witty but firm."},
    {"role": "user", "content": "*sits at the bar* Tough day."},
]

# Render the conversation with the chat template and move the token IDs to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(inputs, max_new_tokens=300, temperature=0.85, top_p=0.9, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```
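To make the `apply_chat_template` step less opaque, here is a minimal sketch of how ChatML lays out a conversation as text. It is simplified and assumes plain system/user/assistant messages; the real Qwen2.5 template also covers tool calls and a default system prompt. `render_chatml` is a hypothetical helper written for illustration, not part of Transformers.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in ChatML form (simplified, illustration only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are Mira, a warm tavern owner. Witty but firm."},
    {"role": "user", "content": "*sits at the bar* Tough day."},
]
prompt = render_chatml(messages)
print(prompt)
```

For actual inference, prefer the tokenizer's own template so that special tokens stay in sync with the model.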
Unsloth (faster on a single GPU)
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="svyatsharov/Role-play-ai",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
```
Model details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-32B-Instruct |
| Adapter type | LoRA (PEFT) r=64, alpha=128 |
| Trainable params | 537M (1.6% of 32B) |
| Context length | 6144 |
| Languages | English, Russian |
| Chat template | ChatML |
| License | Apache 2.0 (inherited from Qwen2.5) |
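The trainable-parameter figure can be sanity-checked with a little arithmetic. The sketch below assumes the layer dimensions from the public Qwen2.5-32B config (hidden size 5120, MLP size 27648, 64 layers, 8 key/value heads of dimension 128); these numbers are not stated in this card. It also assumes LoRA is applied to the seven projection modules listed under Training procedure, with each adapter adding `r * (in_features + out_features)` weights.

```python
# Assumed Qwen2.5-32B dimensions (from the public model config, not from this card).
HIDDEN = 5120
INTERMEDIATE = 27648
NUM_LAYERS = 64
KV_DIM = 8 * 128  # 8 key/value heads x head_dim 128 (grouped-query attention)

RANK = 64  # LoRA r from the table above

# (in_features, out_features) of each adapted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """A (rank x in) plus B (out x rank) matrices."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
print(f"{total_trainable / 1e6:.0f}M trainable parameters")
```

Under these assumptions the total comes to about 537M, in line with the table.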
Evaluation results
Metrics were computed on the eval set (1,203 examples, 5% of the full dataset). All 5 metric groups:
Loss-based
| Metric | Value | Target |
|---|---|---|
| Perplexity (overall) | 3.31 | 5–12 |
| Perplexity (EN) | 3.24 | 5–12 |
| Perplexity (RU) | 3.58 | 5–15 |
| Token accuracy | 0.678 | >0.55 ✅ |
Reference-based
| Metric | Value | Target |
|---|---|---|
| BLEU-4 | 11.20 | >5 ✅ |
| ROUGE-L | 0.215 | >0.20 ✅ |
| BERTScore F1 | 0.865 | >0.85 ✅ |
| chrF++ | 31.01 | >25 ✅ |
Style match (Albert metrics)
| Metric | Value | Target |
|---|---|---|
| Length JS-divergence | 0.038 | <0.10 ✅ |
| Vocabulary overlap | 0.94 | >0.55 ✅ |
| Style Match Score | 0.798 | >0.7 ✅ |
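The card does not say how the length JS-divergence is computed, so the following is only a sketch of one plausible reading: bin the response lengths of the model and of the references into histograms, then take the Jensen-Shannon divergence between them. The bin width, the base-2 logarithm, and the helper names are all assumptions.

```python
import math
from collections import Counter


def length_histogram(lengths, bin_width=20):
    """Normalized histogram of lengths; bin_width is an arbitrary choice."""
    counts = Counter(length // bin_width for length in lengths)
    total = sum(counts.values())
    return {bucket: count / total for bucket, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the result lies in [0, 1]."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Toy data: token counts of generated vs. reference replies.
model_lengths = [35, 48, 52, 61, 77, 90]
reference_lengths = [30, 45, 58, 66, 71, 95]
score = js_divergence(
    length_histogram(model_lengths),
    length_histogram(reference_lengths),
)
```

A value of 0 means the two length distributions coincide; 1 means they share no bins.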
Diversity
| Metric | Value | Target |
|---|---|---|
| Avg distinct-2 | 0.940 | >0.7 ✅ |
| Avg distinct-3 | 0.985 | — |
| Self-repetition rate | 0.000 | <0.05 ✅ |
| TTR (model) | 0.349 | 0.4–0.6 |
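For readers unfamiliar with these diversity metrics, the sketch below uses their common definitions: distinct-n is the share of unique n-grams among all n-grams in a response, and TTR is unique tokens divided by total tokens. Whitespace tokenization and per-response averaging are assumptions; the card does not describe its exact procedure.

```python
def distinct_n(tokens, n):
    """Unique n-grams divided by total n-grams in one response."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def type_token_ratio(tokens):
    """Unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def avg_distinct_n(responses, n):
    """Mean distinct-n over responses, splitting on whitespace (an assumption)."""
    scores = [distinct_n(response.split(), n) for response in responses]
    return sum(scores) / len(scores) if scores else 0.0


responses = ["a b c", "a a a"]
avg_d2 = avg_distinct_n(responses, 2)
```

Note that TTR computed over a whole corpus tends to fall as the corpus grows, which is one reason it can sit low while per-response distinct-n stays high.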
Caveats
- PPL 3.31 is below the 5–12 target because the eval set is a random 5% sample from the same sources as the training set. PPL will be higher on out-of-distribution data. This is not actual overfitting: train_loss = 1.20, eval_loss = 1.21.
- Russian is underrepresented in the eval set: only 14 RU examples out of 1,203, so the PPL_ru metric is statistically weak.
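The link between the loss values and perplexity in the first caveat can be checked directly: perplexity is the exponential of the mean per-token cross-entropy (natural log). The sketch below uses only the rounded losses from this card. The result lands near, but not exactly on, the reported 3.31; how the reported value was aggregated is not stated, so the small difference may come from rounding or averaging (an assumption).

```python
import math

train_loss = 1.20  # final train loss from this card
eval_loss = 1.21   # final eval loss from this card

train_ppl = math.exp(train_loss)
eval_ppl = math.exp(eval_loss)
gap = eval_loss - train_loss  # a small gap suggests little overfitting to the train split

print(f"train PPL ~ {train_ppl:.2f}, eval PPL ~ {eval_ppl:.2f}, loss gap = {gap:.2f}")
```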
Training data
Target size: 50k SFW roleplay dialogues. Actually obtained: 24,071.
| Source | Obtained | % | Planned |
|---|---|---|---|
| PygmalionAI/PIPPA (SFW filter) | 13,050 | 54.2% | 35% |
| lemonilia/LimaRP | 0 | 0% | 25% |
| Norquinal/claude_multiround_chat_30k | 7,500 | 31.2% | 15% |
| IlyaGusev/saiga_scored | 3,521 | 14.6% | 25% |
LimaRP failed to load during dataset preparation, so the fallback increased PIPPA's share.
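The shares in the table follow directly from the per-source counts. A small sketch that recomputes them and shows how far each source drifted from the planned mix (the variable names are illustrative):

```python
obtained = {
    "PygmalionAI/PIPPA": 13_050,
    "lemonilia/LimaRP": 0,
    "Norquinal/claude_multiround_chat_30k": 7_500,
    "IlyaGusev/saiga_scored": 3_521,
}
planned_share = {
    "PygmalionAI/PIPPA": 35,
    "lemonilia/LimaRP": 25,
    "Norquinal/claude_multiround_chat_30k": 15,
    "IlyaGusev/saiga_scored": 25,
}

total = sum(obtained.values())
actual_share = {name: round(100 * n / total, 1) for name, n in obtained.items()}

for name, share in actual_share.items():
    drift = share - planned_share[name]
    print(f"{name}: {share}% actual vs {planned_share[name]}% planned ({drift:+.1f} pp)")
```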
Training procedure
| Parameter | Value |
|---|---|
| Framework | Unsloth + TRL SFTTrainer |
| Method | QLoRA 4-bit (NF4) |
| Effective batch | 16 (2 × 8 grad_accum) |
| Epochs | 3 (early stopping) |
| Learning rate | 1e-4, cosine schedule, warmup 3% |
| Optimizer | AdamW 8-bit |
| Weight decay | 0.01 |
| Gradient checkpointing | Unsloth |
| Precision | bf16 + tf32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
Steps & duration
- Total steps: 4287
- Train duration: 51 hours on 1× A100 80GB
- Peak VRAM: 33 GB
- Final train loss: 1.20
- Final eval loss: 1.21
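The step count is consistent with the other numbers in this card, as the arithmetic below shows. It assumes the train split is the full dataset minus the eval set (24,071 − 1,203), and that a partial batch at the end of each epoch does not produce an optimizer step; neither is stated explicitly.

```python
total_dialogues = 24_071
eval_examples = 1_203
train_examples = total_dialogues - eval_examples  # assumed train split size

per_device_batch = 2
grad_accum = 8
effective_batch = per_device_batch * grad_accum  # 16, as in the table above
epochs = 3

steps_per_epoch = train_examples // effective_batch  # incomplete final batch dropped
total_steps = steps_per_epoch * epochs

train_hours = 51
seconds_per_step = train_hours * 3600 / total_steps

print(total_steps, round(seconds_per_step, 1))
```

Under these assumptions the total matches the 4287 steps reported above, at roughly 43 seconds per optimizer step.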
Hardware
- GPU: 1× NVIDIA A100 80GB
- RAM: 200 GB
- Storage: 4 TB
- VRAM limit: 60 GB (33 GB used)
Limitations
- English is stronger than Russian because of the data imbalance
- The 32B model requires ≥24 GB VRAM for 4-bit inference
- Context is 6144 tokens, so long RP sessions must be truncated
- The eval set is close to the training data, so real quality on new data is likely lower than the metrics suggest
- LimaRP was not included in training, so the dataset composition is skewed
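One simple way to handle the context limit is to keep the system prompt and drop the oldest turns until the conversation fits. The sketch below is an illustration only: `truncate_history` and its `count_tokens` callback are hypothetical names, and the count ignores the few extra tokens the chat template adds per message.

```python
def truncate_history(messages, count_tokens, max_tokens=6144, reserve_for_reply=300):
    """Keep system messages plus the most recent turns that fit the token budget."""
    budget = max_tokens - reserve_for_reply
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    used = sum(count_tokens(m["content"]) for m in system)
    kept = []
    # Walk backwards so the newest turns are kept first.
    for message in reversed(turns):
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + kept[::-1]


# Toy usage with a whitespace "tokenizer"; in practice pass something like
# lambda text: len(tokenizer(text)["input_ids"]).
history = [
    {"role": "system", "content": "a b"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
trimmed = truncate_history(
    history, lambda text: len(text.split()), max_tokens=6, reserve_for_reply=0
)
```

The default `reserve_for_reply=300` mirrors the `max_new_tokens=300` used in the quick start, leaving room for the generated answer.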
Citation
```bibtex
@misc{role-play-ai-2026,
  title  = {Role-play AI: Qwen2.5-32B fine-tune for bilingual SFW roleplay},
  author = {svyatsharov, ichinosekei},
  year   = {2026},
  url    = {https://huggingface.co/svyatsharov/Role-play-ai}
}
```
Base model citation:
```bibtex
@misc{qwen2.5,
  title  = {Qwen2.5: A Party of Foundation Models},
  author = {Qwen Team},
  year   = {2024},
  url    = {https://huggingface.co/Qwen/Qwen2.5-32B-Instruct}
}
```
Framework versions
- PEFT 0.19.1
- Transformers ≥4.46.0
- TRL ≥0.12.0
- Unsloth
- PyTorch 2.5.1