brandonbaek/Bori-2-135M-Instruct API & Inference Endpoint

🤖 Model Details

Base Architecture: SmolLM2 (Llama-based)
Parameter Count: ~135M
Languages: Korean, English
Vocabulary Size: 49,152 (Base) + 8,981 (Korean tokens) = 58,133 tokens

💻 Hardware & Compute

As with the base model, SFT was run under strict compute constraints:

Hardware: Kaggle Notebooks, 2x NVIDIA T4 GPUs (16GB VRAM each).
Optimization: Multi-GPU distributed training with Accelerate was used to increase the effective batch size across the two T4s, combined with gradient accumulation and FP16 mixed precision.
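To make the relationship between these settings concrete, here is a minimal sketch of how the effective batch size follows from the per-device batch size, the number of GPUs, and the gradient accumulation steps. The specific numbers are hypothetical, chosen only for illustration; this card does not state the values used in the run.

```python
def effective_batch_size(per_device_batch: int, num_devices: int, grad_accum_steps: int) -> int:
    """Number of sequences that contribute to each optimizer step."""
    return per_device_batch * num_devices * grad_accum_steps


# Hypothetical values for illustration only; the actual run settings are not documented here.
example = effective_batch_size(per_device_batch=4, num_devices=2, grad_accum_steps=8)
print(example)  # 64
```

Gradient accumulation raises the effective batch size without raising per-step memory use, which is why it is commonly paired with small GPUs such as the T4.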

📚 Training Dataset

The SFT data mixture interleaved three datasets to balance English reasoning with Korean generation:

HuggingFaceH4/ultrachat_200k (50% - English conversational)
brandonbaek/konglish-synthetic-instruct (40% - Bilingual/Korean synthetic instructions)
jojo0217/korean_safe_conversation (10% - Korean safety & alignment)
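The 50/40/10 split above can be sketched as weighted sampling over dataset names. This is a standard-library illustration of the proportions only, not the author's data pipeline; the helper name `sample_sources` and the seed are assumptions. In practice, the `datasets` library's `interleave_datasets` function accepts a `probabilities` argument that serves a similar purpose.

```python
import random
from collections import Counter

# Mixture weights as listed in this card.
MIXTURE = {
    "HuggingFaceH4/ultrachat_200k": 0.50,
    "brandonbaek/konglish-synthetic-instruct": 0.40,
    "jojo0217/korean_safe_conversation": 0.10,
}


def sample_sources(n: int, seed: int = 0) -> list[str]:
    """Draw n dataset names according to the mixture weights (hypothetical helper)."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(10_000))
for name, count in counts.most_common():
    print(f"{name}: {count / 10_000:.1%}")
```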

⚠️ Known Issues & Failure Modes

We are publishing this checkpoint specifically so the open-source community can study the dynamics of SFT on extremely small language models (SLMs) when both the parameter budget and the dataset mixture are sub-optimal.

Collator Masking Bug (Response-Only Loss Failure): Standard instruction tuning uses "response-only loss": the loss is computed only on the assistant's response tokens, while user and system prompt tokens are excluded by setting their labels to -100. During this training run, a bug in the DataCollatorForLanguageModeling(mlm=False) collation step inadvertently stripped the -100 ignore-index masks from the user prompts. As a result, the loss was computed over the entire sequence, which severely degraded instruction following and often caused the model to mimic user prompts rather than answer them.
Dataset Over-Complexity: The heavy reliance on ultrachat_200k (50% of the mixture) exceeded what a 135M-parameter model can absorb. The complex multi-turn reasoning and long conversational histories in ultrachat_200k led to severe hallucinations, logical breakdowns, and failures on basic arithmetic.
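The masking failure described above can be illustrated with plain Python lists. This is a sketch under an assumption: that the masks were lost because labels were rebuilt from the input ids with only padding masked. The card does not show the actual collator code, and all function names and token ids below are hypothetical. The final check is the kind of assertion that could be run on a collated batch before training.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def response_only_labels(input_ids, response_start, pad_token_id):
    """Keep labels only for response tokens; mask prompt and padding positions."""
    return [
        IGNORE_INDEX if i < response_start or tok == pad_token_id else tok
        for i, tok in enumerate(input_ids)
    ]


def labels_rebuilt_from_inputs(input_ids, pad_token_id):
    """Assumed failure mode: labels copied from input_ids, only padding masked."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def prompt_is_masked(labels, response_start):
    """True if every prompt position is excluded from the loss."""
    return all(label == IGNORE_INDEX for label in labels[:response_start])


# Hypothetical token ids: 4 prompt tokens, 3 response tokens, 1 pad token (id 0).
input_ids = [11, 12, 13, 14, 21, 22, 23, 0]
response_start = 4

good = response_only_labels(input_ids, response_start, pad_token_id=0)
bad = labels_rebuilt_from_inputs(input_ids, pad_token_id=0)

print(good)  # [-100, -100, -100, -100, 21, 22, 23, -100]
print(bad)   # [11, 12, 13, 14, 21, 22, 23, -100]
print(prompt_is_masked(good, response_start))  # True
print(prompt_is_masked(bad, response_start))   # False
```

With the second set of labels, the prompt tokens contribute to the loss, so the model is also trained to reproduce user text, which is consistent with the prompt-mimicking behavior noted above.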

🎯 Intended Use

This checkpoint is highly experimental and is not recommended for application deployment. It serves as a case study in why models under 1B parameters need tailored, high-quality, and appropriately scaled SFT datasets, along with rigorous testing of data collator masks.
