cb1c7

minicpm5-twitch-chat-style-lora


What To Use

| File | Best for | Notes |
| --- | --- | --- |
| `gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf` | Ollama, LM Studio, local desktop apps | Recommended for most users |
| `gguf/minicpm5-twitch-chat-style.F16.gguf` | Higher-fidelity local GGUF use | Larger file |
| `adapter_model.safetensors` | Python / PEFT / Transformers | LoRA adapter only; requires `openbmb/MiniCPM5-1B` |

Behavior

This is a style adapter, not a general-knowledge fine-tune. It is meant for chat/reply generation where the model should stay concise while still answering simple prompts.

Expected behavior:

short replies, usually 1-12 words;
emote-heavy Twitch cadence;
short sentence replies when the user asks something concrete;
no private reasoning or chain-of-thought traces.
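
The expected behavior above can be spot-checked mechanically. The following is a minimal sketch and not part of the original card: the function name `looks_like_chat_reply`, the use of 12 words as a hard cutoff (the card says "usually"), and the assumption that leaked reasoning would appear as `<think>` tags are all illustrative.

```python
import re

# Assumption: leaked reasoning traces would show up as <think> or </think> tags.
THINK_TAG = re.compile(r"</?think>", re.IGNORECASE)


def looks_like_chat_reply(reply: str, max_words: int = 12) -> bool:
    """Return True if the reply is one short line with no reasoning tags."""
    text = reply.strip()
    if not text or "\n" in text:
        return False  # empty or multi-line output is not a single chat message
    if THINK_TAG.search(text):
        return False  # reasoning markup leaked into the reply
    return len(text.split()) <= max_words
```

A check like this is only a rough filter for sampling runs; occasional longer replies are within the card's stated behavior.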

Recommended system prompt:

text
You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the chat style: reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces.

Quick Start: Ollama

Download gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf, then create a Modelfile next to it:

text
FROM ./minicpm5-twitch-chat-style.Q4_K_M.gguf

SYSTEM "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the chat style: reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces."

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.05

Create and run the model:

bash
ollama create minicpm5-twitch-chat -f Modelfile
ollama run minicpm5-twitch-chat
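
Once the model exists in Ollama, it can also be called from a script through Ollama's local HTTP API. This is a minimal sketch under a few assumptions: the server is running at Ollama's default address, the model name matches the `ollama create` command above, and the helper names `build_chat_request` and `chat` are made up for illustration. The system prompt is not repeated because the Modelfile already sets it.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_chat_request(user_message: str, model: str = "minicpm5-twitch-chat") -> dict:
    """Build a non-streaming /api/chat payload for a single user message."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
        # num_predict caps generated tokens; 48 mirrors the card's upper bound.
        "options": {"num_predict": 48},
    }


def chat(user_message: str) -> str:
    """Send one message to a running Ollama server and return the reply text."""
    payload = json.dumps(build_chat_request(user_message)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["message"]["content"].strip()
```

With the Ollama server running, `chat("should i trust this")` returns the reply as a plain string.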

Quick Start: LM Studio

Use the GGUF version, preferably gguf/minicpm5-twitch-chat-style.Q4_K_M.gguf.

In LM Studio:

open the model search/import flow;
download or import the GGUF file;
load it in the chat panel;
use the system prompt above;
start with temperature 0.7, top-p 0.9, and max new tokens around 32-48.

Any other tool that can load a Llama-compatible GGUF is also a reasonable starting point.

Python / PEFT Setup

Use this if you want the LoRA adapter directly instead of the merged GGUF.

bash
pip install -U transformers peft accelerate bitsandbytes torch

python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "openbmb/MiniCPM5-1B"
adapter = "cb1c7/minicpm5-twitch-chat-style-lora"

quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()

Generation should use MiniCPM's chat template with thinking disabled:

python
messages = [
    {
        "role": "system",
        "content": "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the chat style: reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces.",
    },
    {"role": "user", "content": "should i trust this"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=48,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.05,
        pad_token_id=tokenizer.eos_token_id,
    )

reply = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
).strip()

print(reply)

Data Format

Training used system/user/assistant chat rows. One representative row:

json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a silly Twitch-chat-style bot. Reply in one short message, usually 1-12 words. Copy the chat style: reactions, emotes, chants, and chat slang. Be usable, but do not write paragraphs. Do not reveal private reasoning or chain-of-thought traces."
    },
    {
      "role": "user",
      "content": "can you summarize the vibe"
    },
    {
      "role": "assistant",
      "content": "the vibe is deeply cooked"
    }
  ]
}
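
Rows in this shape are straightforward to build programmatically. The sketch below is illustrative only: the helper names `make_row` and `to_jsonl_line` are hypothetical, and storing one row per line (JSONL) is an assumption, since the card shows a single row and does not name the file format.

```python
import json

# System prompt copied from the card; every row shares it.
SYSTEM_PROMPT = (
    "You are a silly Twitch-chat-style bot. Reply in one short message, "
    "usually 1-12 words. Copy the chat style: reactions, emotes, chants, and "
    "chat slang. Be usable, but do not write paragraphs. Do not reveal private "
    "reasoning or chain-of-thought traces."
)


def make_row(user_text: str, assistant_text: str) -> dict:
    """Build one system/user/assistant training row."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }


def to_jsonl_line(row: dict) -> str:
    """Serialize a row to a single line; ensure_ascii=False keeps emoji readable."""
    return json.dumps(row, ensure_ascii=False)
```

For example, `to_jsonl_line(make_row("can you summarize the vibe", "the vibe is deeply cooked"))` produces a single-line version of the row shown above.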

Training Data

The source data was cleaned Twitch chat from three streams, mixed with a small synthetic behavior bridge so the model can answer simple prompts without losing the chat style.

| Item | Value |
| --- | --- |
| Final training rows | 143,485 |
| Eval rows | 512 |
| Format | system/user/assistant chat |
| Loss | response-only SFT |
| Chat template | MiniCPM, `enable_thinking=False` |
| Behavior bridge | 480 curated rows repeated 30x |
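
To get a feel for how much weight the behavior bridge carries, the figures in the table can be combined as below. This assumes the repeated bridge rows are counted inside the 143,485 final training rows, which the card does not state explicitly.

```python
# Figures from the table; the "bridge is included in the total" reading is an assumption.
total_rows = 143_485
bridge_rows = 480 * 30  # 480 curated rows repeated 30 times

chat_rows = total_rows - bridge_rows
bridge_share = bridge_rows / total_rows

print(f"bridge rows: {bridge_rows}")         # bridge rows: 14400
print(f"chat rows: {chat_rows}")             # chat rows: 129085
print(f"bridge share: {bridge_share:.1%}")   # bridge share: 10.0%
```

Under that reading, roughly one training row in ten comes from the bridge and the rest is cleaned chat.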

Cleanup removed invisible characters, pure-number spam, URL-like messages, commands, bot leaderboard/status messages, and mechanical subscription/event messages. Silly event tails with chat-style content were preserved.
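
A filter along these lines could look like the sketch below. It is not the author's cleanup code: the patterns, the function names, and the choice of Unicode category `Cf` for "invisible characters" are assumptions, and it covers only four of the listed rules. Bot and subscription/event filtering would need channel-specific rules that the card does not publish.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical patterns; the card does not publish the real rules.
URL_LIKE = re.compile(r"https?://|www\.|\b[\w-]+\.(?:com|tv|gg|net|org)\b", re.IGNORECASE)
PURE_NUMBER = re.compile(r"[\d\s.,]+")
COMMAND = re.compile(r"![A-Za-z0-9]+")


def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (category Cf), e.g. zero-width spaces."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def clean_message(raw: str) -> Optional[str]:
    """Return the cleaned message, or None if it should be dropped."""
    text = strip_invisible(raw).strip()
    if not text:
        return None
    if COMMAND.match(text):
        return None  # chat command such as "!drops"
    if PURE_NUMBER.fullmatch(text):
        return None  # pure-number spam
    if URL_LIKE.search(text):
        return None  # URL-like message
    return text
```

Note that removing all `Cf` characters also removes zero-width joiners, which can split some compound emoji; a production filter might want a narrower list.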

Training Run

Run B was selected over Run A because Run A copied the style well but was too terse. Run B kept the short Twitch cadence while producing more relevant one-line sentence replies.

| Setting | Value |
| --- | --- |
| Method | Unsloth QLoRA / PEFT LoRA |
| Base model | `openbmb/MiniCPM5-1B` |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Learning rate | `1e-4` |
| Epochs | 1 |
| Max sequence length | 512 |
| Effective batch size | 32 |
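
Two derived quantities follow from these settings and the 143,485 training rows reported in the Training Data section. The calculation assumes that the single epoch visits every row once, that the final partial batch counts as a step, and that standard LoRA scaling (alpha divided by rank) was used rather than a variant such as rsLoRA.

```python
import math

train_rows = 143_485      # from the Training Data section
effective_batch = 32
epochs = 1
lora_rank = 32
lora_alpha = 64

# Optimizer steps: one step per effective batch, rounding up for the last partial batch.
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs

# Standard LoRA multiplies the low-rank update by alpha / rank.
lora_scale = lora_alpha / lora_rank

print(total_steps)  # 4484
print(lora_scale)   # 2.0
```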

Inference Notes

| Setting | Recommended value |
| --- | --- |
| `enable_thinking` | `False` |
| `max_new_tokens` | 32-48 |
| `temperature` | 0.7 |
| `top_p` | 0.9 |
| `repetition_penalty` | 1.05 |

Replies are intentionally short, but the model is not meant to be limited to one-token replies. If you want plain emote names instead of colon-wrapped emotes, a small display-side cleanup pass works well, for example converting :MONKA: to MONKA.
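
One possible display-side cleanup pass is sketched below. The function name `strip_emote_colons` and the exact pattern are assumptions; the pattern requires the emote name to start with a letter so that timestamps such as 10:30:45 are left alone.

```python
import re

# Matches :NAME: where NAME starts with a letter, then letters, digits, or underscores.
EMOTE = re.compile(r":([A-Za-z][A-Za-z0-9_]*):")


def strip_emote_colons(text: str) -> str:
    """Replace colon-wrapped emotes with the bare emote name."""
    return EMOTE.sub(r"\1", text)


print(strip_emote_colons(":MONKA: should i trust this"))  # MONKA should i trust this
```

Running this on the decoded reply, just before it is displayed, leaves the model output and any logs untouched.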

Conversion Notes

The GGUF files were created by merging the PEFT adapter into openbmb/MiniCPM5-1B, converting the merged model with llama.cpp, then quantizing the recommended local build to Q4_K_M.
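
The steps above can be sketched roughly as follows. This is not the author's conversion script: the function names, the output directory, and the location of the `llama-quantize` binary (a default CMake build under `build/bin`) are assumptions. The merge uses PEFT's `merge_and_unload()`, and the two commands use llama.cpp's `convert_hf_to_gguf.py` script and `llama-quantize` tool.

```python
from pathlib import Path


def merge_adapter(base_model: str, adapter: str, out_dir: str) -> None:
    """Merge the LoRA weights into the base model and save a plain HF checkpoint."""
    # Imported lazily so the command helper below works without the ML stack installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        base_model, torch_dtype=torch.bfloat16, trust_remote_code=True
    )
    merged = PeftModel.from_pretrained(model, adapter).merge_and_unload()
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_model).save_pretrained(out_dir)


def llama_cpp_commands(merged_dir: str, llama_cpp_dir: str, stem: str) -> list:
    """Return the convert and quantize commands as argument lists."""
    f16 = f"{stem}.F16.gguf"
    q4 = f"{stem}.Q4_K_M.gguf"
    convert = [
        "python", str(Path(llama_cpp_dir) / "convert_hf_to_gguf.py"),
        merged_dir, "--outfile", f16, "--outtype", "f16",
    ]
    # Assumption: llama.cpp was built with CMake into build/bin.
    quantize = [str(Path(llama_cpp_dir) / "build" / "bin" / "llama-quantize"), f16, q4, "Q4_K_M"]
    return [convert, quantize]
```

After `merge_adapter("openbmb/MiniCPM5-1B", "cb1c7/minicpm5-twitch-chat-style-lora", "merged")`, each command list returned by `llama_cpp_commands` can be passed to `subprocess.run(..., check=True)` in order.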

Credits

Built from openbmb/MiniCPM5-1B.

Training and conversion used:

Unsloth for QLoRA fine-tuning
PEFT for the LoRA adapter
Transformers for model loading and inference
llama.cpp for GGUF conversion and quantization
Hugging Face Hub for model hosting

Model provider: cb1c7

Model tree: base openbmb/MiniCPM5-1B; adapter: this model

Modalities: text input, text output


