License: other
What To Use
| File | Best for | Notes |
|---|---|---|
| `gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf` | Ollama, LM Studio, local desktop apps | Recommended for most users |
| `gguf/minicpm5-twitch-chat-style.F16.gguf` | Higher-fidelity local GGUF use | Larger file |
| `adapter_model.safetensors` | Python / PEFT / Transformers | LoRA adapter only; requires `openbmb/MiniCPM5-1B` |
Behavior
This is a style adapter, not a general knowledge fine-tune. It is meant for chat/reply generation where the model should stay concise while still answering simple prompts.
Expected behavior:
- short replies, usually 1-12 words;
- emote-heavy Twitch cadence;
- short sentence replies when the user asks something concrete;
- no private reasoning or chain-of-thought traces.
Recommended system prompt:
text
You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the channel style: clipped reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces.
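The expected behavior above can be spot-checked on generated replies. The following is a minimal sketch, not part of this repo: `check_reply` is a hypothetical helper, the 12-word cap mirrors the "usually 1-12 words" target, and the `<think>` tag is an assumed marker for leaked reasoning traces.

```python
import re

# Hypothetical check: the cap mirrors the "usually 1-12 words" target.
MAX_WORDS = 12
# Assumed marker for leaked reasoning traces; adjust to your template.
THINK_TAG = re.compile(r"</?think>", re.IGNORECASE)


def check_reply(reply: str) -> dict:
    """Report whether a reply fits the short, trace-free chat style."""
    words = reply.split()
    return {
        "word_count": len(words),
        "within_length": 1 <= len(words) <= MAX_WORDS,
        "single_line": "\n" not in reply.strip(),
        "has_think_tag": bool(THINK_TAG.search(reply)),
    }


print(check_reply("the vibe is deeply cooked"))
```

Since the length target is "usually" rather than "always", a failed `within_length` is better treated as a signal to track across many samples than as a hard error.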
Quick Start: Ollama
Download gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf, then create a Modelfile next to it:
text
FROM ./minicpm5-twitch-chat-style.Q4_K_M.gguf
SYSTEM "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the channel style: clipped reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces."
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.05
Create and run the model:
bash
ollama create minicpm5-twitch-chat -f Modelfile
ollama run minicpm5-twitch-chat
Quick Start: LM Studio
Use the GGUF version, preferably gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf.
In LM Studio:
- open the model search/import flow;
- download or import the GGUF file;
- load it in the chat panel;
- use the system prompt above;
- start with temperature 0.7, top-p 0.9, and max new tokens around 32-48.
Any other local runner works too: anything that can load a Llama-compatible GGUF is a reasonable starting point.
Simple Chat UI
This repo includes a tiny local web UI in webui/. It supports Ollama, LM Studio, and llama.cpp server endpoints, plus automatic emote rendering for local emote names, :colon: variants, partial colon quirks, and common global 7TV/BTTV/FFZ emotes.
Start it from the repo root:
bash
python webui/server.py
Then open:
text
http://127.0.0.1:7860
Default endpoints:
| Provider | Endpoint |
|---|---|
| Ollama | http://localhost:11434/api/chat |
| LM Studio | http://localhost:1234/v1/chat/completions |
| llama.cpp server | http://localhost:8080/v1/chat/completions |
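The endpoints above come in two request shapes: Ollama's native `/api/chat`, and the OpenAI-style `/v1/chat/completions` used by LM Studio and the llama.cpp server. The sketch below shows one way to build and send either request using only the standard library. It is not the code in `webui/`; the helper names are hypothetical, and the model name passed to an OpenAI-style server depends on how that server registered the model.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"
OPENAI_STYLE_URL = "http://localhost:1234/v1/chat/completions"  # LM Studio default


def build_messages(system_prompt: str, user_text: str) -> list:
    """System + user turn, matching the chat format used elsewhere in this README."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


def ollama_payload(model: str, messages: list) -> dict:
    """Non-streaming body for Ollama's /api/chat with the recommended sampling."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {
            "temperature": 0.7,
            "top_p": 0.9,
            "repeat_penalty": 1.05,
            "num_predict": 48,
        },
    }


def openai_style_payload(model: str, messages: list) -> dict:
    """Body for OpenAI-compatible /v1/chat/completions servers."""
    return {
        "model": model,
        "messages": messages,
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": 48,
    }


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of either response shape."""
    if "message" in response:  # Ollama /api/chat
        return response["message"]["content"].strip()
    return response["choices"][0]["message"]["content"].strip()


def post_json(url: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires a running Ollama server with the model created above):
# msgs = build_messages("<system prompt from this README>", "should i trust this")
# print(extract_reply(post_json(OLLAMA_URL, ollama_payload("minicpm5-twitch-chat", msgs))))
```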
Python / PEFT Setup
Use this if you want the LoRA adapter directly instead of the merged GGUF.
bash
pip install -U transformers peft accelerate bitsandbytes torch
python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "openbmb/MiniCPM5-1B"
adapter = "cb1c7/minicpm5-twitch-chat-style-lora"

quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()
Generation should use MiniCPM's chat template with thinking disabled:
python
messages = [
    {
        "role": "system",
        "content": "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the channel style: clipped reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces.",
    },
    {"role": "user", "content": "should i trust this"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=48,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.05,
        pad_token_id=tokenizer.eos_token_id,
    )

reply = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
).strip()
print(reply)
Data Format
Training used system/user/assistant chat rows. One representative row:
json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the channel style: clipped reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces."
    },
    {
      "role": "user",
      "content": "can you summarize the vibe"
    },
    {
      "role": "assistant",
      "content": "the vibe is deeply cooked"
    }
  ]
}
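If you want to build rows in the same shape for your own data, a sketch like the following may help. The helper names are hypothetical, and writing one JSON object per line (JSONL) is an assumption; this README does not state how the rows were stored on disk.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def make_row(system_prompt: str, user_text: str, assistant_text: str) -> dict:
    """Build one system/user/assistant training row."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }


def is_valid_row(row: dict) -> bool:
    """Check role order and that every turn has non-empty string content."""
    messages = row.get("messages", [])
    roles = [m.get("role") for m in messages]
    has_text = all(
        isinstance(m.get("content"), str) and m["content"].strip()
        for m in messages
    )
    return roles == EXPECTED_ROLES and has_text


def to_jsonl(rows: list) -> str:
    """Serialize rows one per line (assumed storage format)."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


row = make_row(
    "You are a silly Twitch-chat-style bot.",  # shortened here for brevity
    "can you summarize the vibe",
    "the vibe is deeply cooked",
)
print(is_valid_row(row))
print(to_jsonl([row]))
```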
Training Data
The source data was cleaned Twitch chat from three streams, mixed with a small synthetic behavior bridge so the model can answer simple prompts without losing the chat style.
| Item | Value |
|---|---|
| Final training rows | 143,485 |
| Eval rows | 512 |
| Format | system/user/assistant chat |
| Loss | response-only SFT |
| Chat template | MiniCPM, enable_thinking=False |
| Behavior bridge | 480 curated rows repeated 30x |
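To put the behavior bridge in proportion, the numbers in the table can be worked through directly. This sketch assumes the repeated bridge rows are counted inside the 143,485 final training rows, which the table does not state explicitly.

```python
FINAL_TRAINING_ROWS = 143_485
BRIDGE_CURATED_ROWS = 480
BRIDGE_REPEATS = 30

# Rows contributed by the bridge after repetition.
bridge_rows = BRIDGE_CURATED_ROWS * BRIDGE_REPEATS
# Remaining rows, assuming the bridge is included in the final total.
chat_rows = FINAL_TRAINING_ROWS - bridge_rows
bridge_share = bridge_rows / FINAL_TRAINING_ROWS

print(f"bridge rows: {bridge_rows}")
print(f"chat rows:   {chat_rows}")
print(f"bridge share of training set: {bridge_share:.1%}")
```

Under that assumption, roughly one row in ten comes from the bridge and the rest is cleaned chat.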
Cleanup removed invisible characters, pure-number spam, URL-like messages, commands, bot leaderboard/status messages, and mechanical subscription/event messages. Silly event tails with chat-style content were preserved.
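A few of the cleanup rules above can be approximated with simple patterns. This is an illustrative sketch only: the actual filters used for training are not published here, every regex below is an assumption, and the bot leaderboard/status and subscription/event filters are omitted because they depend on channel-specific message formats.

```python
import re

# Zero-width and similar invisible characters (assumed set).
INVISIBLE = re.compile("[\u200b-\u200f\u2060\ufeff\U000e0000]")
# Messages made only of digits, whitespace, and number punctuation.
PURE_NUMBER = re.compile(r"^[\d\s.,]+$")
# Explicit links plus a few bare domains (assumed TLD list).
URL_LIKE = re.compile(
    r"(https?://|www\.)\S+|\b[\w-]+\.(com|tv|gg|net|org)\b",
    re.IGNORECASE,
)
# Chat commands such as !uptime or /me.
COMMAND = re.compile(r"^[!/]\w+")


def clean_message(text: str):
    """Return the cleaned message, or None if it should be dropped."""
    text = INVISIBLE.sub("", text).strip()
    if not text:
        return None
    if PURE_NUMBER.match(text) or URL_LIKE.search(text) or COMMAND.match(text):
        return None
    return text


for sample in ["KEKW\u200b", "12345", "check https://example.com", "!uptime"]:
    print(repr(sample), "->", repr(clean_message(sample)))
```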
Training Run
Run B was selected over Run A because Run A copied the style well but was too terse. Run B kept the short Twitch cadence while producing more relevant one-line sentence replies.
| Setting | Value |
|---|---|
| Method | Unsloth QLoRA / PEFT LoRA |
| Base model | openbmb/MiniCPM5-1B |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Learning rate | 1e-4 |
| Epochs | 1 |
| Max sequence length | 512 |
| Effective batch size | 32 |
| Train loss | 1.7102 |
| GPU | RTX 3090 24GB |
| Run B wall time | 3:52:19 |
| Run B train runtime | 13,759s |
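The table values are consistent with each other, which the following arithmetic sketch illustrates. The step count assumes one optimizer step per effective batch with no sequence packing and a kept final partial batch; the LoRA scale uses the standard alpha / rank formula and assumes rank-stabilized scaling was not enabled.

```python
import math

TRAIN_ROWS = 143_485
EFFECTIVE_BATCH = 32
EPOCHS = 1
TRAIN_RUNTIME_S = 13_759
WALL_TIME_S = 3 * 3600 + 52 * 60 + 19  # 3:52:19
LORA_RANK = 32
LORA_ALPHA = 64

# Optimizer steps for one pass over the data (assumes no packing).
steps = math.ceil(TRAIN_ROWS / EFFECTIVE_BATCH) * EPOCHS
seconds_per_step = TRAIN_RUNTIME_S / steps
rows_per_second = TRAIN_ROWS * EPOCHS / TRAIN_RUNTIME_S
# Wall time not spent in the training loop (loading, eval, saving).
overhead_s = WALL_TIME_S - TRAIN_RUNTIME_S
# Standard LoRA scaling factor applied to the adapter update.
lora_scale = LORA_ALPHA / LORA_RANK

print(f"steps: {steps}")
print(f"seconds per step: {seconds_per_step:.2f}")
print(f"rows per second: {rows_per_second:.1f}")
print(f"non-training overhead: {overhead_s}s")
print(f"LoRA scale: {lora_scale}")
```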
Inference Notes
| Setting | Recommended value |
|---|---|
| `enable_thinking` | False |
| `max_new_tokens` | 32-48 |
| `temperature` | 0.7 |
| `top_p` | 0.9 |
| `repetition_penalty` | 1.05 |
The model is intentionally short, but it is not meant to be limited to one-token replies. If you want plain emote names instead of colon-wrapped emotes, a small display-side cleanup pass works well, for example :MONKA: to MONKA.
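The display-side cleanup mentioned above can be as small as one regular expression. This is a minimal sketch with a hypothetical helper name. It only unwraps fully colon-wrapped names that start with a letter, so timestamps such as 12:30:45 are left alone; partial colon quirks (a colon on one side only) are not handled.

```python
import re

# :NAME: with a letter first, not glued to surrounding word characters.
COLON_EMOTE = re.compile(r"(?<!\w):([A-Za-z][A-Za-z0-9_]*):(?!\w)")


def strip_emote_colons(text: str) -> str:
    """Turn :MONKA: into MONKA, leaving other colons untouched."""
    return COLON_EMOTE.sub(r"\1", text)


print(strip_emote_colons("that was :MONKA: ngl"))
```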
Conversion Notes
The GGUF files were created by merging the PEFT adapter into openbmb/MiniCPM5-1B, converting the merged model with llama.cpp, then quantizing the recommended local build to Q4_K_M.
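The merge, convert, and quantize steps could be reproduced along these lines. This is a sketch under assumptions, not the exact procedure used for the published files: it assumes a local llama.cpp checkout with `convert_hf_to_gguf.py` at its root and a built `llama-quantize` binary on `PATH`, that the converter supports this architecture, and all output paths are hypothetical.

```python
from pathlib import Path


def merge_adapter(base_model: str, adapter: str, out_dir: str) -> None:
    """Merge the LoRA weights into the base model and save a plain HF checkpoint."""
    # Imported lazily so the command helpers below work without torch installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load unquantized: merging into 4-bit weights is lossy.
    base = AutoModelForCausalLM.from_pretrained(
        base_model, torch_dtype=torch.bfloat16, trust_remote_code=True
    )
    merged = PeftModel.from_pretrained(base, adapter).merge_and_unload()
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_model).save_pretrained(out_dir)


def convert_command(llama_cpp_dir: str, merged_dir: str, f16_path: str) -> list:
    """Command that converts the merged checkpoint to an F16 GGUF."""
    script = str(Path(llama_cpp_dir) / "convert_hf_to_gguf.py")
    return ["python", script, merged_dir, "--outfile", f16_path, "--outtype", "f16"]


def quantize_command(f16_path: str, out_path: str, quant_type: str = "Q4_K_M") -> list:
    """Command that quantizes an F16 GGUF with llama.cpp's llama-quantize."""
    return ["llama-quantize", f16_path, out_path, quant_type]


# Print the commands instead of running them; pass each list to
# subprocess.run(..., check=True) once merge_adapter() has produced ./merged.
print(convert_command("llama.cpp", "merged", "minicpm5-twitch-chat-style.F16.gguf"))
print(quantize_command(
    "minicpm5-twitch-chat-style.F16.gguf",
    "minicpm5-twitch-chat-style.Q4_K_M.gguf",
))
```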
Model provider: cb1c7
Model tree: base openbmb/MiniCPM5-1B; adapter: this model
Modalities: text input, text output