README
License: apache-2.0
Training
- Base: openbmb/MiniCPM5-1B (Llama architecture, about 1.08B parameters)
- Data: build-small-hackathon/compliment-forest-sft
- Method: 4-bit NF4 QLoRA on Modal
- LoRA: rank 16, alpha 32, dropout 0.05
- Targets: attention and MLP projections
- Sequence length: 2,048
- Epochs: 2
- Learning rate: 2e-4 with cosine decay
- Runtime thinking mode: disabled for deterministic JSON generation
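To make the learning-rate entry concrete, here is a minimal sketch of plain cosine decay from the 2e-4 peak listed above. The total step count, the absence of warmup, and the decay floor of zero are assumptions for illustration; the card does not state them.

```python
import math

PEAK_LR = 2e-4  # peak learning rate from the training summary


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = PEAK_LR, min_lr: float = 0.0) -> float:
    """Learning rate at `step` under cosine decay (no warmup assumed)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end stay at the floor.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 250, 500, 750, 1000):
        print(s, f"{cosine_lr(s, total):.2e}")
```

The rate starts at the peak, passes through half the peak at the midpoint, and reaches the floor at the final step.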
The dataset was filtered for JSON validity, concrete situation grounding, non-toxic positivity, and short first-person spells. This model is for whimsical encouragement; it is not a therapist or a substitute for professional support.
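The two mechanically checkable filters (JSON validity and short first-person text) might look roughly like the sketch below. The field name `spell`, the 40-word cutoff, and the first-person word list are all hypothetical; the actual schema and thresholds are documented with the dataset, not here. Grounding and toxicity checks would need a classifier or human review and are not sketched.

```python
import json

MAX_WORDS = 40  # hypothetical cutoff for "short"
FIRST_PERSON = {"i", "i'm", "i'll", "i've", "my", "me"}  # hypothetical word list


def keep_example(raw: str, field: str = "spell") -> bool:
    """Return True if `raw` is a JSON object with a short first-person text field."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not valid JSON
    if not isinstance(record, dict):
        return False  # valid JSON, but not an object
    text = record.get(field)
    if not isinstance(text, str) or not text.strip():
        return False  # field missing or empty
    words = text.lower().split()
    if len(words) > MAX_WORDS:
        return False  # too long
    # Require at least one first-person word, ignoring trailing punctuation.
    return any(w.strip(".,!?;:\"") in FIRST_PERSON for w in words)
```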
Inference
Use the base model's chat template with enable_thinking=False. The app validates each output with Pydantic and retries malformed generations at most twice.
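A validate-and-retry loop of the kind described above could be sketched as follows. This assumes Pydantic v2 (`model_validate_json`); the `Compliment` schema with a single `spell` field and the `generate` callable are hypothetical stand-ins, since the app's real schema and generation code are not shown in this card.

```python
from typing import Callable, Optional

from pydantic import BaseModel, ValidationError


class Compliment(BaseModel):
    # Hypothetical schema: the real field names are not given in this card.
    spell: str


MAX_RETRIES = 2  # "retries at most twice" means up to 3 attempts in total


def generate_validated(generate: Callable[[], str],
                       max_retries: int = MAX_RETRIES) -> Compliment:
    """Call `generate` until its output parses as a Compliment or retries run out."""
    last_error: Optional[ValidationError] = None
    for _ in range(max_retries + 1):
        raw = generate()
        try:
            # Raises ValidationError on invalid JSON or a mismatched shape.
            return Compliment.model_validate_json(raw)
        except ValidationError as err:
            last_error = err  # malformed generation: try again
    raise RuntimeError(
        f"no valid output after {max_retries + 1} attempts"
    ) from last_error
```

In practice `generate` would wrap a model call that renders the prompt with the base model's chat template and enable_thinking=False; passing it in as a callable keeps the retry logic independent of the inference backend.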
The repository also includes a Q4_K_M GGUF build for local llama.cpp inference.
License
Apache-2.0, following the base model and project code. Dataset source licenses are documented on the dataset card.
Model provider
build-small-hackathon
Model tree
Base
openbmb/MiniCPM5-1B
Quantized
this model
Modalities
Input
Text
Output
Text