🤖 Model Details
- Base Architecture: SmolLM2 (Llama-based)
- Parameter Count: ~135M
- Languages: Korean, English
- Vocabulary Size: 49,152 (Base) + 8,981 (Korean tokens) = 58,133 tokens
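The vocabulary arithmetic above, and what it implies for the embedding table, can be checked with a short sketch. The hidden size of 576 is an assumption (it is not stated in this card), so the parameter figure is illustrative only.

```python
BASE_VOCAB = 49_152          # from the card
ADDED_KOREAN_TOKENS = 8_981  # from the card
HIDDEN_SIZE = 576            # ASSUMPTION: not stated in the card


def extended_vocab_size(base: int, added: int) -> int:
    """Total vocabulary size after appending new tokens."""
    return base + added


def embedding_parameters(rows: int, hidden_size: int) -> int:
    """Number of weights in an embedding matrix with `rows` token rows."""
    return rows * hidden_size


total_vocab = extended_vocab_size(BASE_VOCAB, ADDED_KOREAN_TOKENS)
print(total_vocab)  # 58133

# Extra embedding weights introduced by the Korean tokens (under the assumed hidden size).
print(embedding_parameters(ADDED_KOREAN_TOKENS, HIDDEN_SIZE))
```

For a model of roughly 135M parameters, any added embedding rows are a noticeable share of the total, which is one reason to keep vocabulary extensions compact.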
💻 Hardware & Compute
As with the base model's training, SFT was performed under strict compute constraints:
- Hardware: Kaggle Notebooks, 2x NVIDIA T4 GPUs (16GB VRAM each).
- Optimization: Distributed multi-GPU training with Accelerate maximized the effective batch size across both T4s, combined with gradient accumulation and FP16 mixed precision.
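How data parallelism and gradient accumulation combine into an effective batch size can be sketched as follows. Only the GPU count comes from this card. The per-device batch size and accumulation steps are hypothetical placeholders, since the values used in this run are not stated.

```python
def effective_batch_size(per_device_batch: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Number of sequences that contribute to a single optimizer step."""
    return per_device_batch * num_gpus * grad_accum_steps


NUM_GPUS = 2           # from the card: 2x T4
PER_DEVICE_BATCH = 4   # HYPOTHETICAL: not stated in the card
GRAD_ACCUM_STEPS = 8   # HYPOTHETICAL: not stated in the card

print(effective_batch_size(PER_DEVICE_BATCH, NUM_GPUS, GRAD_ACCUM_STEPS))  # 64
```

On memory-limited GPUs such as the T4, raising the accumulation steps grows the effective batch without increasing peak activation memory, at the cost of more forward/backward passes per optimizer step.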
📚 Training Dataset
The SFT data mixture was heavily interleaved to balance English reasoning with Korean generation:
- HuggingFaceH4/ultrachat_200k (50%, English conversational)
- brandonbaek/konglish-synthetic-instruct (40%, Bilingual/Korean synthetic instructions)
- jojo0217/korean_safe_conversation (10%, Korean safety & alignment)
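The card does not say how the interleaving was implemented, so the following is only a minimal sketch of probability-weighted interleaving with the 50/40/10 mixture above. The `interleave` helper and the toy rows are hypothetical. In practice a library routine such as `datasets.interleave_datasets` with its `probabilities` argument covers the same idea.

```python
import random
from collections import Counter
from typing import Dict, Iterator, List, Tuple

# Mixture weights taken from the card.
MIXTURE: Dict[str, float] = {
    "HuggingFaceH4/ultrachat_200k": 0.5,
    "brandonbaek/konglish-synthetic-instruct": 0.4,
    "jojo0217/korean_safe_conversation": 0.1,
}


def interleave(
    sources: Dict[str, List[str]],
    weights: Dict[str, float],
    n: int,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield n (source_name, row) pairs, picking the source by weight each step.

    Hypothetical helper: sources that run out of rows wrap around to the start.
    """
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    cursors = {name: 0 for name in names}
    for _ in range(n):
        name = rng.choices(names, weights=probs, k=1)[0]
        rows = sources[name]
        yield name, rows[cursors[name] % len(rows)]
        cursors[name] += 1


# Toy stand-ins for the real datasets (placeholder strings, not real rows).
toy_sources = {name: [f"{name}#{i}" for i in range(5)] for name in MIXTURE}

N_SAMPLES = 10_000
counts = Counter(name for name, _ in interleave(toy_sources, MIXTURE, n=N_SAMPLES))
for name in MIXTURE:
    print(f"{name}: {counts[name] / N_SAMPLES:.1%}")
```

Sampling per example, rather than concatenating the datasets, keeps every batch close to the target language balance throughout training.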
⚠️ Known Issues & Failure Modes
We are publishing this checkpoint specifically so the open-source community can study the dynamics of SFT on extremely small language models (SLMs) when parameters and datasets are sub-optimal.
- Collator Masking Bug (Response-Only Loss Failure):
Standard instruction tuning uses "response-only loss": the loss is computed only on the assistant's response tokens, and the user and system prompt tokens are ignored by setting their labels to -100. During this training run, DataCollatorForLanguageModeling(mlm=False) stripped the -100 ignore-index masks from the user prompts, because with mlm=False the collator rebuilds the labels from input_ids and masks only padding tokens. Consequently, the loss was computed over the entire sequence, which severely degraded instruction following and caused the model to often mimic user prompts rather than answer them.
- Dataset Over-Complexity:
The heavy reliance on ultrachat_200k (50% of the mixture) overwhelmed the capacity of a 135M-parameter model. The complex, multi-turn reasoning and long conversational histories in UltraChat led to severe hallucinations, logical breakdowns, and failures on basic arithmetic.
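The collator masking failure described above can be illustrated without any training framework. This is a minimal sketch, not the code used in this run: the token IDs, the assistant mask, the pad ID, and all helper names are hypothetical. It contrasts response-only labels with labels rebuilt from `input_ids`, which is how a causal-LM collator that masks only padding behaves.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_response_only_labels(input_ids: List[int], assistant_mask: List[int]) -> List[int]:
    """Keep labels on assistant tokens only; everything else is ignored."""
    return [tok if is_assistant else IGNORE_INDEX
            for tok, is_assistant in zip(input_ids, assistant_mask)]


def rebuild_labels_from_inputs(input_ids: List[int], pad_token_id: int) -> List[int]:
    """Mimic a collator that copies input_ids into labels and masks only padding."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def supervised_fraction(labels: List[int]) -> float:
    """Share of positions that contribute to the loss."""
    return sum(label != IGNORE_INDEX for label in labels) / len(labels)


# Hypothetical toy sequence: 4 prompt tokens, 3 assistant tokens, 1 pad token.
PAD_ID = 0
input_ids = [11, 12, 13, 14, 21, 22, 23, PAD_ID]
assistant_mask = [0, 0, 0, 0, 1, 1, 1, 0]

masked = build_response_only_labels(input_ids, assistant_mask)
overwritten = rebuild_labels_from_inputs(input_ids, PAD_ID)

print(masked)       # [-100, -100, -100, -100, 21, 22, 23, -100]
print(overwritten)  # [11, 12, 13, 14, 21, 22, 23, -100]
print(supervised_fraction(masked), supervised_fraction(overwritten))  # 0.375 0.875
```

A cheap guard is to assert, on a few collated batches before training starts, that the supervised fraction is well below 1.0 and that every prompt position carries the ignore index.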
🎯 Intended Use
This checkpoint is highly experimental and not recommended for application deployment. It serves as a case study in why models under 1B parameters need tailored, high-quality, and appropriately scaled SFT datasets, as well as rigorous testing of data collator masks.