What To Use
| File | Best for | Notes |
|---|---|---|
| gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf | Ollama, LM Studio, local desktop apps | Recommended for most users |
| gguf/minicpm5-twitch-chat-style.F16.gguf | Higher-fidelity local GGUF use | Larger file |
| adapter_model.safetensors | Python / PEFT / Transformers | LoRA adapter only; requires openbmb/MiniCPM5-1B |
Behavior
This is a style adapter, not a general knowledge fine-tune. It is meant for chat/reply generation where the model should stay concise while still answering simple prompts.
Expected behavior:
- short replies, usually 1-12 words;
- emote-heavy Twitch cadence;
- short sentence replies when the user asks something concrete;
- no private reasoning or chain-of-thought traces.
Recommended system prompt:
You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the chat style: reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces.
Quick Start: Ollama
Download gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf, then create a Modelfile next to it:
FROM ./minicpm5-twitch-chat-style.Q4_K_M.gguf
SYSTEM "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the chat style: reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces."
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.05
Create and run the model:
ollama create minicpm5-twitch-chat -f Modelfile
ollama run minicpm5-twitch-chat
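Once the model is created, it can also be called from a script. The following is a minimal sketch, assuming the Ollama server is running locally on its default port (11434) and the model name matches the `ollama create` command above. `build_chat_payload` and `chat` are illustrative helper names, and `num_predict` is set to 48 to mirror the max-new-tokens range suggested in this card.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local endpoint
MODEL_NAME = "minicpm5-twitch-chat"  # name used in `ollama create`


def build_chat_payload(user_text: str, max_tokens: int = 48) -> dict:
    """Build a non-streaming chat request; the system prompt comes from the Modelfile."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
        "options": {"num_predict": max_tokens},
    }


def chat(user_text: str) -> str:
    """Send one message to the local Ollama server and return the reply text."""
    data = json.dumps(build_chat_payload(user_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["message"]["content"].strip()

# Example (requires a running Ollama server):
#   print(chat("should i trust this"))
```

Sampling settings such as temperature and top-p are left out of the request because the Modelfile already sets them.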
Quick Start: LM Studio
Use the GGUF version, preferably gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf.
In LM Studio:
- open the model search/import flow;
- download or import the GGUF file;
- load it in the chat panel;
- use the system prompt above;
- start with temperature 0.7, top-p 0.9, and max new tokens around 32-48.
Any other app that can load a Llama-compatible GGUF should also be a reasonable starting point.
Python / PEFT Setup
Use this if you want the LoRA adapter directly instead of the merged GGUF.
pip install -U transformers peft accelerate bitsandbytes torch
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "openbmb/MiniCPM5-1B"
adapter = "cb1c7/minicpm5-twitch-chat-style-lora"

# Load the base model in 4-bit with bfloat16 compute.
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant,
    device_map="auto",
    trust_remote_code=True,
)

# Attach the LoRA adapter on top of the quantized base model.
model = PeftModel.from_pretrained(model, adapter)
model.eval()
Generation should use MiniCPM's chat template with thinking disabled:
messages = [
    {
        "role": "system",
        "content": "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the chat style: reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces.",
    },
    {"role": "user", "content": "should i trust this"},
]

# Render the chat template as text, with thinking disabled.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Drop token_type_ids if the tokenizer returns them.
inputs.pop("token_type_ids", None)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=48,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.05,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt.
reply = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
).strip()
print(reply)
Training used system/user/assistant chat rows. One representative row:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the chat style: reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces."
    },
    {
      "role": "user",
      "content": "can you summarize the vibe"
    },
    {
      "role": "assistant",
      "content": "the vibe is deeply cooked"
    }
  ]
}
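Rows in this shape can be produced with a small helper. This is a sketch only: `make_row`, `to_jsonl_line`, and the one-row-per-line JSONL serialization are assumptions for illustration, not the actual data pipeline.

```python
import json

SYSTEM_PROMPT = (
    "You are a silly Twitch-chat-style bot. Reply in one short message, "
    "usually 1-12 words. Copy the chat style: reactions, emotes, chants, "
    "and chat slang. Be usable, but do not write paragraphs. Do not reveal "
    "private reasoning or chain-of-thought traces."
)


def make_row(user_text: str, assistant_text: str) -> dict:
    """Wrap one user/assistant pair in the system/user/assistant chat format."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }


def to_jsonl_line(row: dict) -> str:
    """Serialize one row as a single JSON line (assumed storage format)."""
    return json.dumps(row, ensure_ascii=False)


row = make_row("can you summarize the vibe", "the vibe is deeply cooked")
print(to_jsonl_line(row))
```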
Training Data
The source data was cleaned Twitch chat from three streams, mixed with a small synthetic behavior bridge so the model can answer simple prompts without losing the chat style.
| Item | Value |
|---|---|
| Final training rows | 143,485 |
| Eval rows | 512 |
| Format | system/user/assistant chat |
| Loss | response-only SFT |
| Chat template | MiniCPM, enable_thinking=False |
| Behavior bridge | 480 curated rows repeated 30x |
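
"Response-only SFT" means the loss is computed only on the assistant tokens, not on the system or user turns. A minimal sketch of that masking follows, assuming the common PyTorch/Transformers convention that a label of `-100` is ignored by the cross-entropy loss. The token IDs are made up, and `mask_prompt_labels` is a hypothetical helper rather than the training code used for this adapter.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids into labels, hiding the first prompt_len tokens from the loss."""
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must be within the sequence")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Made-up token IDs: 5 prompt tokens (system + user turns), 3 response tokens.
input_ids = [11, 12, 13, 14, 15, 21, 22, 23]
labels = mask_prompt_labels(input_ids, prompt_len=5)
print(labels)  # [-100, -100, -100, -100, -100, 21, 22, 23]
```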
Cleanup removed invisible characters, pure-number spam, URL-like messages, commands, bot leaderboard/status messages, and mechanical subscription/event messages. Silly event tails with chat-style content were preserved.
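A filter along these lines could implement part of that cleanup. It is a rough sketch under assumptions: the exact rules, the `!` command prefix, the list of invisible characters, and the domain suffixes are guesses for illustration. Bot and event-message rules are omitted because they depend on stream-specific formats.

```python
import re
from typing import Optional

# Zero-width and similar invisible characters (assumed set).
INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff\U000e0000]")
# Messages containing an explicit URL or a bare domain with a common suffix.
URL_LIKE = re.compile(
    r"(https?://|www\.)\S+|\b\S+\.(com|net|org|tv|gg)\b", re.IGNORECASE
)
# Only digits, whitespace, dots, and commas, with at least one digit.
PURE_NUMBER = re.compile(r"[\s.,]*\d[\d\s.,]*")


def clean_message(text: str) -> Optional[str]:
    """Return the cleaned message, or None if it should be dropped."""
    text = INVISIBLE.sub("", text).strip()
    if not text:
        return None
    if text.startswith("!"):         # assumed chat-command prefix
        return None
    if PURE_NUMBER.fullmatch(text):  # pure-number spam such as "1 1 1"
        return None
    if URL_LIKE.search(text):        # URL-like messages
        return None
    return text
```

Returning `None` for dropped messages keeps the filter easy to use in a list comprehension over raw chat lines.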
Training Run
Run B was selected over Run A because Run A copied the style well but was too terse. Run B kept the short Twitch cadence while producing more relevant one-line sentence replies.
| Setting | Value |
|---|---|
| Method | Unsloth QLoRA / PEFT LoRA |
| Base model | openbmb/MiniCPM5-1B |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Learning rate | 1e-4 |
| Epochs | 1 |
| Max sequence length | 512 |
| Effective batch size | 32 |
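
These settings imply a few derived numbers. The following is a back-of-the-envelope sketch, assuming the standard LoRA scaling of alpha / rank (not the rank-stabilized variant), the row count from the Training Data table, and that the final partial batch is not dropped.

```python
import math

train_rows = 143_485      # final training rows
effective_batch = 32
epochs = 1
lora_rank = 32
lora_alpha = 64
bridge_rows = 480 * 30    # curated bridge rows x repeats

# Optimizer steps for the run, counting the last partial batch.
steps = math.ceil(train_rows / effective_batch) * epochs
# Multiplier applied to the LoRA update under standard scaling.
lora_scale = lora_alpha / lora_rank

print(steps)        # 4484
print(lora_scale)   # 2.0
print(bridge_rows)  # 14400
```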
Inference Notes
| Setting | Recommended value |
|---|---|
| enable_thinking | False |
| max_new_tokens | 32-48 |
| temperature | 0.7 |
| top_p | 0.9 |
| repetition_penalty | 1.05 |
The model is intentionally short, but it is not meant to be limited to one-token replies. If you want plain emote names instead of colon-wrapped emotes, a small display-side cleanup pass works well, for example :MONKA: to MONKA.
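One way to do that cleanup is a single regular expression. This is a minimal sketch, assuming emote names start with a letter and contain only letters, digits, and underscores; `strip_emote_colons` is an illustrative name.

```python
import re

# A colon-wrapped token that starts with a letter, e.g. ":MONKA:".
# The lookarounds avoid touching text such as "12:30:45" or "a:b:c".
EMOTE = re.compile(r"(?<!\w):([A-Za-z][A-Za-z0-9_]*):(?!\w)")


def strip_emote_colons(text: str) -> str:
    """Replace :NAME: with NAME and leave everything else unchanged."""
    return EMOTE.sub(r"\1", text)


print(strip_emote_colons(":MONKA: should i trust this :KEKW:"))
# MONKA should i trust this KEKW
```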
Conversion Notes
The GGUF files were created by merging the PEFT adapter into openbmb/MiniCPM5-1B, converting the merged model with llama.cpp, then quantizing the recommended local build to Q4_K_M.
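The merge step can be sketched with PEFT's `merge_and_unload()`. Assumptions: the base model is loaded in half precision rather than 4-bit so the merged weights can be saved as a plain checkpoint, and the output directory name is a placeholder. The llama.cpp conversion and quantization commands are not shown because the exact invocations are not stated here.

```python
def merge_adapter(
    base_model: str = "openbmb/MiniCPM5-1B",
    adapter: str = "cb1c7/minicpm5-twitch-chat-style-lora",
    out_dir: str = "merged-model",  # placeholder output path
) -> str:
    """Fold the LoRA weights into the base model and save a plain checkpoint."""
    # Imported lazily so the function can be defined without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        base_model, torch_dtype=torch.float16, trust_remote_code=True
    )
    model = PeftModel.from_pretrained(model, adapter)
    merged = model.merge_and_unload()  # base model with the LoRA deltas applied

    # Save weights and tokenizer together so a converter can read one directory.
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_model).save_pretrained(out_dir)
    return out_dir
```

The saved directory is what a GGUF converter would take as input; the Q4_K_M file is then a quantized copy of that conversion.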
Credits
Built from openbmb/MiniCPM5-1B.
Training and conversion used: