README

License: other

What To Use

| File | Best for | Notes |
| --- | --- | --- |
| gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf | Ollama, LM Studio, local desktop apps | Recommended for most users |
| gguf/minicpm5-twitch-chat-style.F16.gguf | Higher-fidelity local GGUF use | Larger file |
| adapter_model.safetensors | Python / PEFT / Transformers | LoRA adapter only; requires openbmb/MiniCPM5-1B |

Behavior

This is a style adapter, not a general knowledge fine-tune. It is meant for chat/reply generation where the model should stay concise while still answering simple prompts.

Expected behavior:

  • short replies, usually 1-12 words;
  • emote-heavy Twitch cadence;
  • short sentence replies when the user asks something concrete;
  • no private reasoning or chain-of-thought traces.
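
As a rough way to check the length target above on a batch of generated replies, one could count whitespace-separated words. The sketch below is illustrative only: `within_length_target` and `share_within_target` are hypothetical helper names, and reading "1-12 words" as a whitespace word count is an assumption rather than something the model card defines.

```python
def within_length_target(reply: str, min_words: int = 1, max_words: int = 12) -> bool:
    """Return True if the reply has min_words..max_words whitespace-separated words."""
    return min_words <= len(reply.split()) <= max_words


def share_within_target(replies: list[str]) -> float:
    """Fraction of replies that meet the length target (0.0 for an empty list)."""
    if not replies:
        return 0.0
    return sum(within_length_target(r) for r in replies) / len(replies)
```

Since the target is "usually 1-12 words", a share somewhat below 1.0 on a sample is not by itself a failure.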

Recommended system prompt:

text

You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the channel style: clipped reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces.

Quick Start: Ollama

Download gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf, then create a Modelfile next to it:

text

FROM ./minicpm5-twitch-chat-style.Q4_K_M.gguf
SYSTEM "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the channel style: clipped reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces."
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.05

Create and run the model:

bash

ollama create minicpm5-twitch-chat -f Modelfile
ollama run minicpm5-twitch-chat
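
The model created above can also be called from a script through Ollama's local HTTP API. The following is a minimal sketch, assuming the Ollama server is running on its default port and the model was created under the name `minicpm5-twitch-chat`; the helper names `build_ollama_payload` and `chat` are hypothetical. The system prompt and sampling parameters are left out because the Modelfile already sets them.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama chat endpoint
MODEL_NAME = "minicpm5-twitch-chat"  # name passed to `ollama create`


def build_ollama_payload(user_text: str, max_tokens: int = 48) -> dict:
    """Build a non-streaming /api/chat request body for one user message."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,  # ask for a single JSON object instead of a stream
        "options": {"num_predict": max_tokens},  # cap the reply length
    }


def chat(user_text: str, url: str = OLLAMA_URL) -> str:
    """Send one message and return the reply text. Needs a running Ollama server."""
    body = json.dumps(build_ollama_payload(user_text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["message"]["content"].strip()
```

With the server running, `print(chat("should i trust this"))` should print a single short reply.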

Quick Start: LM Studio

Use the GGUF version, preferably gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf.

In LM Studio:

  1. open the model search/import flow;
  2. download or import the GGUF file;
  3. load it in the chat panel;
  4. use the system prompt above;
  5. start with temperature 0.7, top-p 0.9, and max new tokens around 32-48.

Alternatively, use whichever local inference app you prefer. Anything that can load a Llama-compatible GGUF should work as a starting point.

Simple Chat UI

This repo includes a tiny local web UI in webui/. It supports Ollama, LM Studio, and llama.cpp server endpoints, plus automatic emote rendering for local emote names, :colon: variants, partial colon quirks, and common global 7TV/BTTV/FFZ emotes.

Start it from the repo root:

bash

python webui/server.py

Then open:

text

http://127.0.0.1:7860

Default endpoints:

| Provider | Endpoint |
| --- | --- |
| Ollama | http://localhost:11434/api/chat |
| LM Studio | http://localhost:1234/v1/chat/completions |
| llama.cpp server | http://localhost:8080/v1/chat/completions |
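
The LM Studio and llama.cpp server endpoints listed above follow the OpenAI-style chat completions format, so they can be called directly without the web UI. This is a hedged sketch, not the code in webui/: the names `ENDPOINTS`, `build_payload`, and `chat` are hypothetical, and the `"local-model"` identifier is a placeholder that may need to match the model name your server reports.

```python
import json
import urllib.request

# Default local endpoints from the table above.
ENDPOINTS = {
    "lmstudio": "http://localhost:1234/v1/chat/completions",
    "llamacpp": "http://localhost:8080/v1/chat/completions",
}


def build_payload(system_prompt: str, user_text: str, model: str = "local-model") -> dict:
    """Build an OpenAI-style chat completions body with the recommended settings."""
    return {
        "model": model,  # placeholder identifier (assumption)
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": 48,
    }


def chat(provider: str, system_prompt: str, user_text: str) -> str:
    """Send one message to the chosen provider. Needs that server to be running."""
    body = json.dumps(build_payload(system_prompt, user_text)).encode("utf-8")
    request = urllib.request.Request(
        ENDPOINTS[provider], data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        data = json.loads(response.read())
    return data["choices"][0]["message"]["content"].strip()
```

Pass the recommended system prompt as `system_prompt`. Ollama's native /api/chat endpoint uses a different response shape, so it is not covered by this helper.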

Python / PEFT Setup

Use this if you want the LoRA adapter directly instead of the merged GGUF.

bash

pip install -U transformers peft accelerate bitsandbytes torch

python

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "openbmb/MiniCPM5-1B"
adapter = "cb1c7/minicpm5-twitch-chat-style-lora"

# Load the base model in 4-bit to reduce GPU memory use.
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant,
    device_map="auto",
    trust_remote_code=True,
)

# Attach the LoRA adapter to the quantized base model and switch to inference mode.
model = PeftModel.from_pretrained(model, adapter)
model.eval()

Generation should use MiniCPM's chat template with thinking disabled:

python

messages = [
    {
        "role": "system",
        "content": "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the channel style: clipped reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces.",
    },
    {"role": "user", "content": "should i trust this"},
]

# Render the conversation with MiniCPM's chat template, thinking disabled.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Drop token_type_ids if the tokenizer returned them.
inputs.pop("token_type_ids", None)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=48,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.05,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt.
reply = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
).strip()
print(reply)

Data Format

Training used system/user/assistant chat rows. One representative row:

json

{
  "messages": [
    {
      "role": "system",
      "content": "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the channel style: clipped reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces."
    },
    {
      "role": "user",
      "content": "can you summarize the vibe"
    },
    {
      "role": "assistant",
      "content": "the vibe is deeply cooked"
    }
  ]
}
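
If you prepare your own rows in this shape, a small validation pass can catch malformed entries before training. The sketch below assumes the rows are stored one JSON object per line (JSONL), which the model card does not state; `is_valid_row` and `load_rows` are hypothetical helper names.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def is_valid_row(row: dict) -> bool:
    """Check that a row has exactly system/user/assistant messages with non-empty text."""
    messages = row.get("messages")
    if not isinstance(messages, list) or len(messages) != len(EXPECTED_ROLES):
        return False
    for message, role in zip(messages, EXPECTED_ROLES):
        if not isinstance(message, dict) or message.get("role") != role:
            return False
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    return True


def load_rows(jsonl_text: str) -> list[dict]:
    """Parse JSONL text and keep only the rows that pass is_valid_row."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return [row for row in rows if is_valid_row(row)]
```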

Training Data

The source data was cleaned Twitch chat from three streams, mixed with a small synthetic behavior bridge so the model can answer simple prompts without losing the chat style.

| Item | Value |
| --- | --- |
| Final training rows | 143,485 |
| Eval rows | 512 |
| Format | system/user/assistant chat |
| Loss | response-only SFT |
| Chat template | MiniCPM, enable_thinking=False |
| Behavior bridge | 480 curated rows repeated 30x |
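
To get a feel for how much of the mix the behavior bridge represents, the figures above can be combined as follows. This is back-of-the-envelope arithmetic under one assumption that the model card does not confirm: that the repeated bridge rows are counted inside the 143,485 final training rows.

```python
bridge_unique_rows = 480
bridge_repeats = 30
final_training_rows = 143_485

# Bridge rows after repetition.
bridge_rows = bridge_unique_rows * bridge_repeats  # 14,400

# Remaining rows would be cleaned Twitch chat (only if the assumption above holds).
chat_rows = final_training_rows - bridge_rows  # 129,085

bridge_share = bridge_rows / final_training_rows
print(f"bridge rows: {bridge_rows}, chat rows: {chat_rows}, share: {bridge_share:.1%}")
# prints "bridge rows: 14400, chat rows: 129085, share: 10.0%"
```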

Cleanup removed invisible characters, pure-number spam, URL-like messages, commands, bot leaderboard/status messages, and mechanical subscription/event messages. Silly event tails with chat-style content were preserved.
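
Part of the cleanup described above can be approximated with a few regular expressions. This is an illustrative sketch, not the actual cleaning script: the patterns, the `!` command prefix, and the `clean_message` helper are all assumptions, and the bot leaderboard/status and subscription/event filters are omitted because the model card does not describe their rules.

```python
import re
from typing import Optional

# Zero-width and similar invisible characters (assumed set).
INVISIBLE = re.compile(r"[\u200b-\u200f\u2060\ufeff\U000e0000]")
# Messages made only of digits, whitespace, and separators, with at least one digit.
PURE_NUMBER = re.compile(r"[\d\s.,]*\d[\d\s.,]*")
# Rough URL detector: scheme, www prefix, or a few common domain suffixes.
URL_LIKE = re.compile(r"https?://|www\.|\b[\w-]+\.(?:com|net|org|tv|gg)\b", re.IGNORECASE)


def clean_message(text: str) -> Optional[str]:
    """Return the cleaned message, or None if the message should be dropped."""
    text = INVISIBLE.sub("", text).strip()
    if not text:
        return None
    if text.startswith("!"):  # chat commands such as "!points" (assumed prefix)
        return None
    if PURE_NUMBER.fullmatch(text):
        return None
    if URL_LIKE.search(text):
        return None
    return text
```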

Training Run

Run B was selected over Run A because Run A copied the style well but was too terse. Run B kept the short Twitch cadence while producing more relevant one-line sentence replies.

| Setting | Value |
| --- | --- |
| Method | Unsloth QLoRA / PEFT LoRA |
| Base model | openbmb/MiniCPM5-1B |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Learning rate | 1e-4 |
| Epochs | 1 |
| Max sequence length | 512 |
| Effective batch size | 32 |
| Train loss | 1.7102 |
| GPU | RTX 3090 24GB |
| Run B wall time | 3:52:19 |
| Run B train runtime | 13,759s |
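
The run figures above can be cross-checked with a little arithmetic. The sketch assumes one pass over all 143,485 training rows with one optimizer step per effective batch and no sequence packing; these are assumptions, so the step count is an estimate rather than a logged value.

```python
import math

training_rows = 143_485      # from the Training Data section
effective_batch_size = 32
train_runtime_s = 13_759
wall_time = "3:52:19"


def hms_to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Estimated optimizer steps for one epoch (last batch may be partial).
estimated_steps = math.ceil(training_rows / effective_batch_size)  # 4484

wall_time_s = hms_to_seconds(wall_time)       # 13939
overhead_s = wall_time_s - train_runtime_s    # 180 seconds outside the train loop
rows_per_second = training_rows / train_runtime_s

print(estimated_steps, wall_time_s, overhead_s, round(rows_per_second, 1))
```

Under these assumptions the wall time and train runtime differ by three minutes, which is consistent with model loading and evaluation overhead.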

Inference Notes

| Setting | Recommended value |
| --- | --- |
| enable_thinking | False |
| max_new_tokens | 32-48 |
| temperature | 0.7 |
| top_p | 0.9 |
| repetition_penalty | 1.05 |

The model is intentionally short, but it is not meant to be limited to one-token replies. If you want plain emote names instead of colon-wrapped emotes, a small display-side cleanup pass works well, for example :MONKA: to MONKA.
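
A display-side cleanup pass of the kind mentioned above could look like the following. It is a minimal sketch: `strip_emote_colons` is a hypothetical helper, and the pattern assumes emote names start with a letter and contain only letters, digits, and underscores.

```python
import re

# A colon-wrapped name that is not glued to surrounding word characters,
# so timestamps such as 12:30:45 are left alone.
COLON_EMOTE = re.compile(r"(?<!\w):([A-Za-z][A-Za-z0-9_]*):(?!\w)")


def strip_emote_colons(text: str) -> str:
    """Replace :NAME: with NAME, leaving the rest of the text unchanged."""
    return COLON_EMOTE.sub(r"\1", text)


print(strip_emote_colons(":MONKA: should i trust this"))
# prints "MONKA should i trust this"
```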

Conversion Notes

The GGUF files were created by merging the PEFT adapter into openbmb/MiniCPM5-1B, converting the merged model with llama.cpp, then quantizing the recommended local build to Q4_K_M.
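
The merge, convert, and quantize steps above might be reproduced roughly as follows. This is a sketch under assumptions, not the exact commands used for this repo: the output names, the location of the llama.cpp checkout, and the path to the `llama-quantize` binary are placeholders, and whether your llama.cpp version supports this architecture should be checked separately. `merge_adapter` and `llama_cpp_commands` are hypothetical helper names.

```python
def merge_adapter(base_model: str, adapter: str, out_dir: str) -> None:
    """Merge the LoRA adapter into the unquantized base model and save it."""
    # Imported lazily so the command helper below works without these packages.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        base_model, torch_dtype=torch.float16, trust_remote_code=True
    )
    merged = PeftModel.from_pretrained(model, adapter).merge_and_unload()
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_model, trust_remote_code=True).save_pretrained(out_dir)


def llama_cpp_commands(
    merged_dir: str, llama_cpp_dir: str, quantize_bin: str, name: str
) -> list[list[str]]:
    """Return the convert and quantize commands as argument lists (not executed here)."""
    f16_path = f"{name}.F16.gguf"
    q4_path = f"{name}.Q4_K_M.gguf"
    return [
        # Convert the merged Hugging Face checkpoint to an F16 GGUF.
        ["python", f"{llama_cpp_dir}/convert_hf_to_gguf.py", merged_dir,
         "--outfile", f16_path, "--outtype", "f16"],
        # Quantize the F16 GGUF to Q4_K_M.
        [quantize_bin, f16_path, q4_path, "Q4_K_M"],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the paths match your local llama.cpp build.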

Model provider

cb1c7

Model tree

Base

openbmb/MiniCPM5-1B

Adapter

this model

Modalities

Input

Text

Output

Text
