README

License: apache-2.0

Training

  • Base: openbmb/MiniCPM5-1B (Llama architecture, about 1.08B parameters)
  • Data: build-small-hackathon/compliment-forest-sft
  • Method: 4-bit NF4 QLoRA on Modal
  • LoRA: rank 16, alpha 32, dropout 0.05
  • Targets: attention and MLP projections
  • Sequence length: 2,048
  • Epochs: 2
  • Learning rate: 2e-4 with cosine decay
  • Runtime thinking mode: disabled for deterministic JSON generation
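
The learning-rate and LoRA settings above can be made concrete with a little arithmetic. The following is a minimal sketch, not the project's training code: it assumes no warmup and decay to zero, neither of which the card specifies, and the step count in the demo is hypothetical.

```python
import math

PEAK_LR = 2e-4      # from the training summary
LORA_RANK = 16      # from the training summary
LORA_ALPHA = 32     # from the training summary

# Standard LoRA multiplies the adapter update by alpha / rank.
LORA_SCALE = LORA_ALPHA / LORA_RANK


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = PEAK_LR, min_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate at `step`.

    Assumptions (not stated in the card): no warmup, decay to `min_lr`.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so out-of-range steps stay on the schedule's endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 250, 500, 750, 1000):
        print(s, f"{cosine_lr(s, total):.2e}")
```

With rank 16 and alpha 32, the adapter update is scaled by 2, and under these assumptions the learning rate falls to half its peak at the midpoint of training.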

The dataset was filtered for JSON validity, concrete situation grounding, non-toxic positivity, and short first-person spells. This model is for whimsical encouragement; it is not a therapist or a substitute for professional support.
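
Two of the filters described above, JSON validity and short first-person text, are simple enough to sketch. This is an illustration only: the `spell` field name, the word limit, and the first-person word list are hypothetical, and the grounding and toxicity checks are omitted because the card does not say how they were done.

```python
import json

MAX_WORDS = 40  # hypothetical cutoff for a "short" spell
# Hypothetical first-person markers; the real filter is not documented.
FIRST_PERSON = {"i", "i'm", "i'll", "i've", "my", "me", "we", "our"}


def keep_example(raw: str, field: str = "spell") -> bool:
    """Return True if `raw` is valid JSON with a short first-person `field`."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not valid JSON
    if not isinstance(record, dict):
        return False  # valid JSON, but not an object
    text = record.get(field)
    if not isinstance(text, str) or not text.strip():
        return False  # field missing, wrong type, or empty
    # Lowercase and strip trailing punctuation before matching.
    words = [w.strip(".,!?;:\"").lower() for w in text.split()]
    if len(words) > MAX_WORDS:
        return False  # too long
    return any(w in FIRST_PERSON for w in words)


if __name__ == "__main__":
    samples = [
        '{"spell": "I believe in my steady hands."}',
        '{"spell": "You are great."}',
        '{"spell": ',
    ]
    print([keep_example(s) for s in samples])
```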

Inference

Use the base model's chat template with enable_thinking=False. The app validates each output against a Pydantic schema and retries a malformed generation up to two times.
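
The validate-and-retry loop might look roughly like the sketch below. It assumes Pydantic v2 (`model_validate_json`). The `Compliment` schema and its `spell` field are hypothetical stand-ins for the app's real schema, and `generate` is a hypothetical callable that would wrap the model call, including applying the chat template with enable_thinking=False.

```python
from typing import Callable, Optional

from pydantic import BaseModel, ValidationError

MAX_RETRIES = 2  # "at most twice", so up to three attempts in total


class Compliment(BaseModel):
    # Hypothetical schema; the app's actual fields are not documented here.
    spell: str


def generate_validated(generate: Callable[[], str],
                       max_retries: int = MAX_RETRIES) -> Compliment:
    """Call `generate` until its output parses as a Compliment.

    `generate` is a hypothetical zero-argument callable that returns the
    model's raw text output for one attempt.
    """
    last_error: Optional[ValidationError] = None
    for _attempt in range(1 + max_retries):
        raw = generate()
        try:
            # Raises ValidationError for invalid JSON or a schema mismatch.
            return Compliment.model_validate_json(raw)
        except ValidationError as err:
            last_error = err
    raise RuntimeError(
        f"no valid output after {1 + max_retries} attempts"
    ) from last_error
```

Passing the generator in as a callable keeps the retry logic independent of the backend, so the same loop could sit in front of either the Transformers model or the GGUF build.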

The repository also includes a Q4_K_M GGUF build for local llama.cpp inference.

License

Apache-2.0, following the base model and project code. Dataset source licenses are documented on the dataset card.

Model provider: build-small-hackathon

Model tree: base openbmb/MiniCPM5-1B; this model is a quantized derivative

Modalities: text input, text output
