🤖 Model Details
- Base Architecture: SmolLM2 (Llama-based)
- Parameter Count: ~135M
- Languages: Korean, English
- Vocabulary Size: 49,152 (Base) + 8,981 (Korean tokens) = 58,133 tokens
- Context Length: 2048 tokens
- License: Apache 2.0
💻 Hardware & Compute Constraints
A core goal of the Bori project is achieving meaningful language adaptation under strict free-tier compute limitations.
- Hardware: Trained entirely on Kaggle Notebooks utilizing 2x NVIDIA T4 GPUs (16GB VRAM each).
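- Optimization: The training pipeline used PyTorch's native sdpa (Scaled Dot-Product Attention), which runs efficiently on the Turing architecture, together with FP16 mixed precision and gradient checkpointing. Gradient checkpointing recomputes activations during the backward pass instead of storing them, which keeps the weights, optimizer states, and activations within the tight 16GB VRAM limit.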
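To give a feel for the memory budget, the following back-of-the-envelope sketch estimates the static training state for a ~135M-parameter model. It is an illustration only: the optimizer (AdamW), FP32 master weights, and FP32 gradients are assumptions that this card does not state, and the parameter count ignores the extra rows added by the expanded vocabulary.

```python
# Rough static-memory estimate for mixed-precision training.
# Assumptions (not stated in this card): AdamW optimizer, FP32 master weights,
# FP32 gradients, and two FP32 moment buffers per parameter.
PARAMS = 135_000_000  # approximate parameter count from Model Details
BYTES_FP32 = 4


def static_training_bytes(num_params: int) -> dict:
    """Return bytes needed for weights, gradients, and AdamW moment buffers."""
    weights = num_params * BYTES_FP32
    grads = num_params * BYTES_FP32
    adam_moments = 2 * num_params * BYTES_FP32  # first and second moments
    return {
        "weights": weights,
        "grads": grads,
        "adam_moments": adam_moments,
        "total": weights + grads + adam_moments,
    }


budget = static_training_bytes(PARAMS)
total_gib = budget["total"] / 2**30  # roughly 2 GiB under these assumptions
```

Under these assumptions the static state is only about 2 GiB, which suggests that activation memory, which grows with batch size and sequence length, is the larger part of the pressure on a 16GB card. That is the part gradient checkpointing reduces.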
🛠️ Training Methodology (CPT)
The model was adapted via Continual Pre-Training (CPT) using a two-phase approach designed to add Korean language capability without catastrophic forgetting of the base model's world knowledge and English representations.
1. Vocabulary Expansion (EEVE Initialization)
Pre-trained English-centric SLMs represent Korean prose very inefficiently, often splitting a single Hangul syllable into several byte-level tokens. To solve this, we trained a custom standalone Korean Byte-Level BPE tokenizer and merged it with the base tokenizer, adding 8,981 Korean tokens.
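The inefficiency comes from how byte-level BPE sees text: it works on UTF-8 bytes, and each Hangul syllable occupies three bytes. The small sketch below only counts bytes. It does not run the SmolLM2 tokenizer, so the actual token counts (which depend on the merges that tokenizer learned) are not measured here.

```python
def utf8_bytes_per_char(text: str) -> list[int]:
    """Number of UTF-8 bytes used by each character in `text`."""
    return [len(ch.encode("utf-8")) for ch in text]


korean = "한국어"    # three Hangul syllables
english = "Korean"  # six ASCII letters

print(utf8_bytes_per_char(korean))   # → [3, 3, 3]
print(utf8_bytes_per_char(english))  # → [1, 1, 1, 1, 1, 1]
```

Without Korean-specific merges, a byte-level tokenizer can need up to one token per byte in the worst case, so three syllables may cost as many as nine tokens. Dedicated Korean tokens remove that overhead.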
Crucially, in src/model.py, the newly added Korean token embeddings are not initialized randomly. Instead, we follow the EEVE (Efficient and Effective Vocabulary Expansion) strategy, which initializes each new token from the mean of the embeddings of the subwords that the base tokenizer splits it into. This gives the model a good starting approximation and lowers the initial cross-entropy loss compared with random initialization.
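The mean-of-subwords idea can be sketched without any framework, as below. This is not the code from src/model.py: the function name `mean_init_embeddings`, the list-of-lists representation, and the toy numbers are all hypothetical, chosen only to show the averaging step.

```python
def mean_init_embeddings(base_embeddings, constituents):
    """Build one embedding row per new token by averaging base-token rows.

    base_embeddings: list of rows (one list of floats per base-vocabulary id).
    constituents: for each new token, the base-token ids that the base
        tokenizer produces when it encodes that token's text.
    """
    dim = len(base_embeddings[0])
    new_rows = []
    for piece_ids in constituents:
        if not piece_ids:
            raise ValueError("every new token needs at least one constituent id")
        row = [
            sum(base_embeddings[i][d] for i in piece_ids) / len(piece_ids)
            for d in range(dim)
        ]
        new_rows.append(row)
    return new_rows


# Toy example: a 3-token base vocabulary with 2-dimensional embeddings.
base = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0]]
# First new token splits into base ids 0 and 1; the second into id 2 only.
new_rows = mean_init_embeddings(base, [[0, 1], [2]])
print(new_rows)  # → [[1.0, 3.0], [4.0, 0.0]]
```

In a real model the same averaging would be applied to rows of the embedding matrix after it has been resized to the expanded vocabulary.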
2. Phase 1A: Embedding Warmup
- Objective: Stabilize the newly added Korean token embeddings without distorting the pre-trained weights.
- Duration: 1,000 steps
- Data: 100% Korean text (HuggingFaceFW/fineweb-2:kor_Hang)
- Parameters: Backbone frozen; only the embedding layer and LM head were trained.
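A minimal sketch of how the warmup phase could select its trainable parameters is shown below. It assumes the parameter naming used by Hugging Face Llama-style models (`model.embed_tokens.weight`, `lm_head.weight`); the project's own code may select them differently, and the helper names here are hypothetical.

```python
# Substrings that mark the parameters trained during embedding warmup.
# Assumption: Hugging Face Llama-style parameter names.
TRAINABLE_KEYWORDS = ("embed_tokens", "lm_head")


def is_trainable_in_warmup(param_name: str) -> bool:
    """True for the embedding layer and LM head, False for the backbone."""
    return any(key in param_name for key in TRAINABLE_KEYWORDS)


def split_parameters(param_names):
    """Partition parameter names into (trainable, frozen) lists."""
    trainable = [n for n in param_names if is_trainable_in_warmup(n)]
    frozen = [n for n in param_names if not is_trainable_in_warmup(n)]
    return trainable, frozen


example_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
trainable, frozen = split_parameters(example_names)
```

With a PyTorch model, the same predicate could be applied as `for name, p in model.named_parameters(): p.requires_grad = is_trainable_in_warmup(name)`, so that only the selected tensors receive gradient updates.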
3. Phase 1B: Full CPT (WSD Scheduler)
- Objective: Deep language acquisition and alignment.
- Duration: 10,000 steps (Final Checkpoint)
- Data Mixture: 90% Korean (fineweb-2:kor_Hang) and 10% English replay (fineweb-edu-dedup) to prevent catastrophic forgetting.
- Parameters: All model parameters unfrozen.
- Scheduler: A custom PyTorch Warmup-Stable-Decay (WSD) scheduler holds the learning rate at its peak through the stable phase to maximize optimizer progress over high-entropy web text, then decays the learning rate to 10% to consolidate the weights.
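The two Phase 1B mechanics above, the 90/10 data mixture and the WSD schedule, can be sketched in plain Python as follows. This is an illustration under assumptions, not the project's scheduler: the phase lengths (500 warmup, 8,500 stable, 1,000 decay steps), the linear shape of warmup and decay, and the reading of "10%" as 10% of the peak learning rate are all hypothetical choices; only the 10,000-step total and the 90/10 ratio come from this card.

```python
import random


def pick_source(rng: random.Random, korean_ratio: float = 0.9) -> str:
    """Choose which corpus the next training example is drawn from."""
    return "korean" if rng.random() < korean_ratio else "english"


def wsd_lr_factor(step, warmup_steps, stable_steps, decay_steps, min_ratio=0.1):
    """Multiplier applied to the peak learning rate at a given step.

    Linear warmup to 1.0, a flat stable phase, then a linear decay that
    ends at `min_ratio` and stays there.
    """
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        return 1.0
    decay_step = min(step - warmup_steps - stable_steps, decay_steps)
    progress = decay_step / decay_steps
    return 1.0 - (1.0 - min_ratio) * progress


# Hypothetical split of the 10,000 training steps.
WARMUP, STABLE, DECAY = 500, 8_500, 1_000

rng = random.Random(0)
draws = [pick_source(rng) for _ in range(10_000)]
korean_share = draws.count("korean") / len(draws)  # close to 0.9

midpoint_factor = wsd_lr_factor(5_000, WARMUP, STABLE, DECAY)  # stable phase
final_factor = wsd_lr_factor(10_000, WARMUP, STABLE, DECAY)    # end of decay
```

A function of this shape can be passed to `torch.optim.lr_scheduler.LambdaLR` as `lr_lambda`, since that scheduler multiplies the optimizer's base learning rate by the returned factor. For streaming corpora, `datasets.interleave_datasets` with a `probabilities` argument is one existing way to get the same kind of weighted mixing.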
⚠️ Limitations & Intended Use
- Not an Instruct Model: This is a base completion model. It has not undergone Supervised Fine-Tuning (SFT) or RLHF and will not follow instructions out of the box. Please see the Bori-2 Instruct model for chat capabilities.
- Reasoning Capacity: At only 135M parameters, the model's capacity for complex reasoning, logic, or deep factual recall is inherently limited.
- Intended Use: This model is published for researchers and developers interested in SLM vocabulary expansion, extreme compute-constrained training, and bilingual adaptation methodologies. It is intended as a computationally cheap base for downstream Korean fine-tuning.