build-small-hackathon
mind-of-tashi-micro-grpo
License: apache-2.0
Reward (the game scoring is the rubric)
- turn: dense per-round, Δ(opponent_hp) − Δ(student_hp)
- outcome: sparse terminal, +10 win / 0 draw / −10 loss
- lexicon: +0.5 × (sanskrit_token_count / think_len), anti-anglicisation
- six-element granular format reward (±0.5 each) + hidden-combo bonus
- KL-to-SFT penalty anchors the bilingual register against drift
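The reward terms listed above can be sketched as plain functions. This is a minimal illustration, not the project's scoring code: the function names are hypothetical, Δ is assumed to mean HP lost during the round, summing the terms is an assumption, and the size of the hidden-combo bonus is not stated in this card, so it is left as a caller-supplied value. The KL-to-SFT term is omitted because it is a penalty on the policy rather than a per-completion score.

```python
def turn_reward(opponent_hp_lost: float, student_hp_lost: float) -> float:
    """Dense per-round term: damage dealt minus damage taken.

    Assumption: delta means HP lost in the round (non-negative numbers).
    """
    return opponent_hp_lost - student_hp_lost


def outcome_reward(result: str) -> float:
    """Sparse terminal term: +10 win / 0 draw / -10 loss."""
    return {"win": 10.0, "draw": 0.0, "loss": -10.0}[result]


def lexicon_reward(sanskrit_token_count: int, think_len: int) -> float:
    """Anti-anglicisation term: +0.5 x share of Sanskrit tokens in the think span."""
    if think_len <= 0:
        return 0.0  # empty think span earns nothing (assumption)
    return 0.5 * sanskrit_token_count / think_len


def format_reward(element_ok: list[bool]) -> float:
    """Six-element granular format term: +0.5 per passing element, -0.5 per failing one."""
    if len(element_ok) != 6:
        raise ValueError("expected exactly six format checks")
    return sum(0.5 if ok else -0.5 for ok in element_ok)


def total_reward(turn: float, outcome: float, lexicon: float,
                 fmt: float, combo_bonus: float = 0.0) -> float:
    """Assumption: the terms are simply added; combo_bonus size is not given in the card."""
    return turn + outcome + lexicon + fmt + combo_bonus
```

For example, a completion that passes five of the six format checks scores `format_reward([True] * 5 + [False])`, which is 2.0 under this sketch.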
Training
- Base: the SFT v3 checkpoint (the norm_topk_prob-fixed student that ships in the playable Space).
- Method: trl.GRPOTrainer, single-turn (no API spend), Modal L4. Multi-turn rollout_func rollouts vs the persona-tiered teacher pool (boss → Gemini 2.5 Pro; mid → Flash; low → free OpenRouter, hard MAX_API_DOLLARS cap) are the planned next iteration.
- Run: 50 steps, 256 prompts sampled from the self-play corpus, LR 1e-6, beta 0.04, G=4, bs=1 / grad_accum=4, max_completion 512, cosine LR, warmup 10%. Reward (format + lexicon + legality + combo) converged to ~5.05 (train_loss ≈ 0.058).
- Training scripts are run off-Space on Modal L4 (kept private; hparams above are the full recipe).
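Since the training scripts are private, the recipe above can be written down as a small, framework-free sketch. The dictionary keys are hypothetical names chosen for readability, not a confirmed trainer configuration, and the schedule is the common "linear warmup, then cosine decay to zero" shape, which is an assumption about what "cosine LR, warmup 10%" means here.

```python
import math

# Hyperparameters copied from the Run bullet; key names are illustrative only.
HPARAMS = {
    "max_steps": 50,
    "num_prompts": 256,
    "learning_rate": 1e-6,
    "beta": 0.04,                      # KL penalty coefficient
    "num_generations": 4,              # G: completions sampled per prompt
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "max_completion_length": 512,
    "warmup_ratio": 0.10,
}


def warmup_steps(h: dict) -> int:
    """Number of warmup steps implied by the warmup ratio."""
    return round(h["max_steps"] * h["warmup_ratio"])


def lr_at(step: int, h: dict) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    w = warmup_steps(h)
    if step < w:
        return h["learning_rate"] * step / max(1, w)
    progress = (step - w) / max(1, h["max_steps"] - w)
    return h["learning_rate"] * 0.5 * (1.0 + math.cos(math.pi * progress))


def completions_per_optimizer_step(h: dict) -> int:
    """Assumption: the batch size counts completions on a single device."""
    return h["per_device_train_batch_size"] * h["gradient_accumulation_steps"]
```

Under these assumptions, warmup lasts 5 steps, the peak rate of 1e-6 is reached at step 5, and each optimizer step covers 4 completions, which matches one group of G=4.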
⚠️ norm_topk_prob=true is inherited from the SFT base and is required for a coherent llama.cpp GGUF (see the SFT model card).
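For readers unfamiliar with the flag, here is a rough sketch of what it controls, assuming norm_topk_prob has the meaning it carries in Qwen-style mixture-of-experts configs in Hugging Face Transformers: after the router picks the top-k experts, their probabilities are renormalised to sum to 1. This card does not state the architecture, so treat that as an assumption; the `route` function below is hypothetical and written in plain Python for illustration.

```python
import math


def route(router_logits: list[float], top_k: int, norm_topk_prob: bool):
    """Return (expert_index, weight) pairs for the top-k experts of one token."""
    # Numerically stable softmax over all experts.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in top]

    # With the flag on, the kept weights are rescaled to sum to 1.
    if norm_topk_prob:
        s = sum(weights)
        weights = [w / s for w in weights]
    return list(zip(top, weights))
```

With the flag off, the kept weights sum to less than 1, so two runtimes that disagree on this setting would mix expert outputs at different scales.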
Eval
- Format gate 20/20 (transformers).
- Honest reporting plan: the GRPO model card ships a chart of per-teacher-tier win-rate over training steps. A monotonic improvement trend is the David-vs-Goliath evidence, whether or not absolute win-rate clears 50%.
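The trend check described in the reporting plan could be computed along these lines. This is a sketch under assumptions: the function names and the input layout (tier name, then training step, then a list of match results) are hypothetical, and no measured results are included here.

```python
def win_rate(results: list[str]) -> float:
    """Fraction of matches won; results are 'win', 'draw' or 'loss'."""
    if not results:
        raise ValueError("no matches recorded")
    return sum(r == "win" for r in results) / len(results)


def is_monotonic_improving(rates: list[float], tol: float = 0.0) -> bool:
    """True if each rate is at least the previous one, allowing a tolerance."""
    return all(b >= a - tol for a, b in zip(rates, rates[1:]))


def trend_by_tier(matches: dict[str, dict[int, list[str]]],
                  tol: float = 0.0) -> dict[str, bool]:
    """For each teacher tier, check whether win-rate improves over training steps.

    matches maps tier name -> {training_step: [match results]}.
    """
    trends = {}
    for tier, by_step in matches.items():
        rates = [win_rate(by_step[step]) for step in sorted(by_step)]
        trends[tier] = is_monotonic_improving(rates, tol)
    return trends
```

A non-zero `tol` lets small evaluation noise pass without breaking the trend; whether that is appropriate depends on how many matches back each point.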
Part of the bundle
Game Space · self-play dataset · SFT model + GGUF · OpenEnv gym ·
GRPO model (this) + GGUF — all under build-small-hackathon/mind-of-tashi-*.
Model provider
build-small-hackathon
Model tree
Base
build-small-hackathon/mind-of-tashi-micro-sft
Fine-tuned
this model
Modalities
Input
Text
Output
Text