ratandeep/professor-pip-minicpm5-1b-lora API & Inference Endpoint

What the adapter does

Every reply is one JSON object and nothing else:

json
{"text": "The sky looks blue because sunlight bounces off the tiny bits of air, and blue bounces the most! Want to find out why grass is green next?",
 "mood": "happy",
 "gesture": "index"}

text — what Pip says out loud: 1–3 short sentences, small words a young child knows, gentle with wrong answers, no emoji / markdown / symbols (plain spoken words only).
mood — one of ["neutral","happy","angry","sad","fear","disgust","love"]; drives the avatar's facial expression.
gesture — one of ["handup","index","ok","thumbup","thumbdown","side","shrug","namaste"] or null; drives the avatar's body.
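
The contract above can be checked mechanically before a reply reaches the avatar. The following is a minimal sketch that assumes only the field rules listed here; `parse_pip_reply` is a hypothetical helper name, not part of the adapter or the Space, and it does not cover the plain-words or safety rules.

```python
import json

MOODS = {"neutral", "happy", "angry", "sad", "fear", "disgust", "love"}
GESTURES = {"handup", "index", "ok", "thumbup", "thumbdown", "side", "shrug", "namaste"}


def parse_pip_reply(raw: str) -> dict:
    """Parse one model reply; raise ValueError if it breaks the contract."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(reply, dict):
        raise ValueError("reply must be a single JSON object")

    text = reply.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")

    mood = reply.get("mood")
    if not isinstance(mood, str) or mood not in MOODS:
        raise ValueError(f"unknown mood: {mood!r}")

    # Assumption: a missing "gesture" key is treated the same as null.
    gesture = reply.get("gesture")
    if gesture is not None and (not isinstance(gesture, str) or gesture not in GESTURES):
        raise ValueError(f"unknown gesture: {gesture!r}")
    return reply
```

A caller could catch `ValueError` and fall back to a fixed, pre-written line so the avatar never receives malformed data.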

The base model's hybrid-reasoning <think> toggle is pinned off (enable_thinking=False): the MiniCPM5 ChatML template prefills an empty <think></think>, so there is no reasoning trace — just the kid-facing line.

The adapter is trained to cover the four things Pip says live: answering a child's raise-hand interruption, delivering a lesson segment, encouragement / greetings / gentle wrong-answer handling, and safe redirects (off-topic, personal, medical, or dangerous requests → a friendly "let's pick something we can learn about!" or "a grown-up can help with that").

Training configuration


| Setting | Value |
| --- | --- |
| Base model | `openbmb/MiniCPM5-1B-SFT` (standard `LlamaForCausalLM`, 1.08B params, Apache-2.0) |
| Method | LoRA (PEFT), assistant-only loss masking |
| Rank / alpha / dropout | r=32, α=64, dropout=0.05, bias=`none` |
| Target modules | attention + MLP linears: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Trainable params | 22.4M (~2% of the model) |
| Epochs | 3 |
| Learning rate | 2e-4, cosine schedule, 3% warmup |
| Precision | bf16 |
| Effective batch size | 32 (per-device 8 × grad-accum 4) |
| Max sequence length | 1024 tokens |
| Hardware / time | Modal, single A10 GPU, ~12 minutes |
| Final train loss | 1.30 |
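
As a rough cross-check of these settings, the sketch below writes the LoRA hyperparameters as keyword arguments (the names match `peft.LoraConfig` fields) and derives the schedule arithmetic. The `task_type` value and the step counts are assumptions: they presume one example per sequence, no packing, and a kept final partial batch, and the real trainer may count steps differently.

```python
import math

# LoRA hyperparameters from the table; these could be passed as
# peft.LoraConfig(**LORA_KWARGS).
LORA_KWARGS = dict(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",  # assumption: not stated in the table
)

PER_DEVICE_BATCH = 8
GRAD_ACCUM = 4
NUM_GPUS = 1            # single A10
EPOCHS = 3
TRAIN_EXAMPLES = 1866   # train split size stated under "Training data"
WARMUP_RATIO = 0.03

# Standard (non-rsLoRA) scaling applied to the low-rank update.
lora_scale = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

print(lora_scale, effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

Under those assumptions the run is short, on the order of a couple of hundred optimizer steps, which fits the reported training time of about 12 minutes.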

Loss masking. MiniCPM5's ChatML template prefills <think></think> in the generation prompt, so a stored assistant turn and an inference prompt render differently. Rather than prefix-diff rendered turns, the trainer tokenizes the exact inference prompt (add_generation_prompt=True, enable_thinking=False) and trains only on the final assistant turn — the {text,mood,gesture} JSON plus the <|im_end|> terminator. This matches how the model is called at inference, token-for-token.
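
The masking idea can be sketched in a few lines. This is an illustration under stated assumptions, not the project's trainer: `build_masked_example` is a hypothetical helper, and the token ids are toy values. In practice `prompt_ids` would come from rendering the conversation up to the final assistant turn with `add_generation_prompt=True, enable_thinking=False`, and `target_ids` from tokenizing the JSON reply plus `<|im_end|>`.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy ignores by default


def build_masked_example(prompt_ids, target_ids):
    """Concatenate prompt and target tokens so that only the target carries loss."""
    input_ids = list(prompt_ids) + list(target_ids)
    # Prompt positions are masked out; target positions keep their token ids.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return {"input_ids": input_ids, "labels": labels}


# Toy ids: three prompt tokens, then two supervised tokens (reply + terminator).
example = build_masked_example([11, 12, 13], [21, 22])
print(example["labels"])
```

Because the prompt half is exactly what the model sees at inference, the supervised tokens start at the same position where generation starts.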

Training data

2,016 synthetic, in-voice examples (1,866 train / 150 held-out gold eval), generated by a multi-agent workflow and then passed through a deterministic, production-faithful validation gate. To pass the gate, every user turn and every assistant text must clear the same text_is_safe denylist the live Space applies; spoken text must be plain (no markdown / emoji / symbols); mood must be in the mood enum; gesture must be in the gesture enum or null; and turns must alternate and end on the assistant turn.
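
Two of the gate's structural checks are easy to sketch. The code below is an approximation under stated assumptions, not the project's gate: both function names are hypothetical, the allowed-character set for "plain" text is a guess, system messages are ignored, and conversations are assumed to start with a user turn.

```python
import re

# Assumption: "plain spoken words" means letters, digits, whitespace,
# and basic sentence punctuation only.
_PLAIN = re.compile(r"[A-Za-z0-9\s.,!?'\"-]+")


def text_is_plain(text):
    """Return True if the text contains only the assumed plain characters."""
    return bool(_PLAIN.fullmatch(text))


def turns_are_well_formed(messages):
    """Return True if user/assistant turns alternate and the last turn is the assistant's."""
    # System messages are ignored for the alternation check (assumption).
    turns = [m for m in messages if m["role"] != "system"]
    if not turns or turns[-1]["role"] != "assistant":
        return False
    expected = "user"
    for turn in turns:
        if turn["role"] != expected:
            return False
        expected = "assistant" if expected == "user" else "user"
    return True
```

A dataset filter would drop any example for which either check, or the enum and safety checks, returns a failure.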

Balanced to the target category mix:

| Category | Target | Actual |
| --- | --- | --- |
| Answer a raise-hand interruption (+ nudge back) | 30% | 31.3% |
| Deliver a lesson segment | 30% | 31.2% |
| Encouragement / greetings / gentle wrong-answer / chit-chat | 25% | 24.0% |
| Safe redirects (off-topic / personal / medical / dangerous) | 15% | 13.5% |

Evaluation

Automated contract eval on the 150 held-out gold examples (greedy decode, reproducible run-to-run), scored with the same pip_core gate the production /brain endpoint applies downstream:

| Metric | Result |
| --- | --- |
| Valid, parseable JSON (non-empty `text`) | 100% |
| Valid `mood` / `gesture` enums | 100% |
| Safe spoken text | 99.3% |
| Fully contract-correct (JSON + enums + safe) | 99.3% |
| Average `text` length | ~142 chars |
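
For readers reproducing the scoring, the aggregation might look like the sketch below. The per-example flags are illustrative placeholders, not the real eval outputs; they are chosen only to show that, on 150 examples, 99.3% corresponds to 149 passing. `contract_rates` is a hypothetical helper name.

```python
def contract_rates(results):
    """Return the percentage of examples passing each check, rounded to 0.1."""
    n = len(results)

    def pct(passed):
        return round(100 * passed / n, 1)

    return {
        "json_ok": pct(sum(r["json_ok"] for r in results)),
        "enums_ok": pct(sum(r["enums_ok"] for r in results)),
        "safe_ok": pct(sum(r["safe_ok"] for r in results)),
        # An example is fully correct only if it passes all three checks.
        "fully_correct": pct(sum(
            r["json_ok"] and r["enums_ok"] and r["safe_ok"] for r in results
        )),
    }


# Illustrative flags only: 150 examples, one of which fails the safety check.
flags = [{"json_ok": True, "enums_ok": True, "safe_ok": True} for _ in range(149)]
flags.append({"json_ok": True, "enums_ok": True, "safe_ok": False})
rates = contract_rates(flags)
print(rates)
```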

How to load

This is a PEFT/LoRA adapter — load the base model first, then apply the adapter.

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "openbmb/MiniCPM5-1B-SFT"
ADAPTER = "build-small-hackathon/professor-pip-minicpm5-1b-lora"

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

SYSTEM = (
    "You are Professor Pip, a warm and playful teacher with a friendly 3D body "
    "on screen. You teach children aged 5 to 10.\n"
    "How you talk:\n"
    "- Say only 1 to 3 short sentences. Use small, simple words a young child knows.\n"
    "- Be cheerful, patient, and encouraging. Celebrate effort.\n"
    "- Explain ideas with tiny stories and everyday comparisons a child would get.\n"
    "- Never use emoji, markdown, lists, or symbols in what you say out loud. Plain spoken words only.\n"
    "- If a child gets something wrong, be gentle: 'So close! Let's try once more.'\n"
    "Staying safe (very important):\n"
    "- Only talk about kind, learning topics. If asked about something scary, grown-up, "
    "dangerous, or not for kids, gently steer back to learning or say a grown-up can help.\n"
    "- Never ask for or repeat a child's personal information.\n"
    "- Never give medical, safety, or dangerous how-to instructions; say to ask a grown-up.\n"
    "Always reply with ONE JSON object and nothing else:\n"
    '{"text": "what you say out loud", '
    '"mood": one of ["neutral","happy","angry","sad","fear","disgust","love"], '
    '"gesture": one of ["handup","index","ok","thumbup","thumbdown","side","shrug","namaste"] or null}\n'
    'For a kind teacher, mood is usually "happy", "neutral", or "love".'
)

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Why is the sky blue?"},
]

# enable_thinking=False -> no <think> trace; just the kid-facing JSON line.
enc = tok.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=False,
    return_tensors="pt", return_dict=True,
).to(model.device)

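# Stop on either the tokenizer's EOS token or the ChatML end-of-turn token.
im_end = tok.convert_tokens_to_ids("<|im_end|>")
# Fall back to EOS for padding only when no pad token is defined (a pad id of 0 is valid).
pad_id = tok.pad_token_id if tok.pad_token_id is not None else tok.eos_token_id
out = model.generate(
    **enc, max_new_tokens=160, do_sample=False,  # greedy decoding
    eos_token_id=[tok.eos_token_id, im_end],
    pad_token_id=pad_id,
)
# Decode only the newly generated tokens, skipping the prompt.
new_tokens = out[0][enc["input_ids"].shape[-1]:]
print(tok.decode(new_tokens, skip_special_tokens=True).strip())
# -> {"text": "...", "mood": "happy", "gesture": "index"}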

To merge the adapter into the base weights (e.g. before GGUF conversion):

python
merged = model.merge_and_unload()
merged.save_pretrained("professor-pip-minicpm5-1b-merged")

For deployment, the merged model is converted to GGUF and quantized to Q4_K_M (688 MB) and Q8_0 (1.15 GB), then served with llama.cpp (via llama-cpp-python) on Modal — see the GGUF repo. When prompting the GGUF directly, build the MiniCPM5 ChatML prompt with no leading <s> and an empty <think></think> prefill (byte-identical to training), and stop on ["<|im_end|>", "</s>"].
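
When prompting the GGUF by hand, the prompt string could be assembled roughly as follows. This is a sketch under assumptions: `build_chatml_prompt` is a hypothetical helper, the layout follows generic ChatML, and the exact whitespace around the `<think></think>` prefill is not specified here. Compare the result against the tokenizer's own rendered template before relying on it.

```python
STOP = ["<|im_end|>", "</s>"]  # stop strings named in the text above


def build_chatml_prompt(messages, think_prefill="<think></think>"):
    """Render messages as ChatML, ending with an assistant header and think prefill."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # No leading <s> is added; the prompt ends where generation should begin.
    # Assumption: the prefill follows the assistant header with no extra newlines.
    parts.append(f"<|im_start|>assistant\n{think_prefill}")
    return "".join(parts)


prompt = build_chatml_prompt([
    {"role": "system", "content": "You are Professor Pip."},
    {"role": "user", "content": "Why is the sky blue?"},
])
print(prompt)
```

The resulting `prompt` and `STOP` would then be passed to the completion call of whichever llama.cpp binding is in use.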

Intended use

Powering the spoken raise-hand Q&A and short encouragement / redirect lines in the Professor Pip kids-teacher avatar.
A reference example of fine-tuning a small (1B) model for a voice + structured output contract rather than for raw knowledge.

This adapter is built to be paired with deterministic application code: in the Space, premade lesson segments are spoken verbatim via TTS, and all child-safety checks run server-side (a curated denylist + leetspeak normalization on every child input and every spoken line) — they are non-bypassable and do not depend on the model. No child audio or PII is persisted.
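
The general shape of a denylist check with leetspeak normalization is sketched below. It is not the Space's implementation: the character map, the placeholder denylist, and the function names are all hypothetical, and a production gate would need a curated list and broader normalization.

```python
import re

# Hypothetical leetspeak map; real substitutions are ambiguous (e.g. "1" may mean "i" or "l").
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

# Placeholder entry only; the real denylist is curated in the application.
DENYLIST = {"badword"}


def normalize(text):
    """Lowercase the text and undo common leetspeak substitutions."""
    return text.lower().translate(LEET_MAP)


def is_safe_sketch(text, denylist=DENYLIST):
    """Return False if any normalized word appears in the denylist."""
    words = re.findall(r"[a-z]+", normalize(text))
    return not any(word in denylist for word in words)
```

Running the same check on both the child's input and the model's `text` field keeps the gate independent of the model, as described above.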

Limitations

Narrow on purpose. The adapter is strong at the short, contract-locked live-voice task but degrades the model's long-form course authoring. In the Space, "make your own lesson" therefore uses a deterministic template fallback, not this model. Choosing what to fine-tune for (voice + contract) and where to keep deterministic code was a deliberate engineering choice.
Not a knowledge source. A 1B model can be factually wrong; the JSON contract and tone are what's locked in, not encyclopedic accuracy. Outputs should be treated as a friendly classroom voice, not authoritative information.
Safety is in the app, not the weights. The ~99.3% safe-text eval number is on in-distribution gold data. Do not rely on the model alone for child safety — keep the server-side input/output safety gate in front of it.
English only, tuned for ages 5–10, and trained on synthetic data; it has not been evaluated outside that audience and register.
Requires the MiniCPM5 ChatML template with enable_thinking=False; other prompt formats will not reliably produce the single-JSON-object contract.

Training & framework

Framework: PEFT, 🤗 Transformers (>=5.6), Accelerate, Datasets
Base model: openbmb/MiniCPM5-1B-SFT (Apache-2.0)
License: Apache-2.0

If you use this adapter, please credit the base model authors (OpenBMB / MiniCPM) and the Professor Pip Build Small Hackathon project.
