Training Data
- Corpus: Unified State Register of Court Decisions of Ukraine (EDRSR)
- Documents: 33.9M court decisions (after deduplication and quality filtering, down from 38.5M)
- Tokens: 161.4B tokens (Qwen2 BPE tokenizer, fertility = 0.515 for Ukrainian legal text)
- Sequence length: 8,192 tokens
- Shards: 1,233 pre-packaged numpy shards
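
The corpus figures above imply a few derived quantities that are useful for sanity-checking a data loader. The following is a rough sketch, not a description of the actual pipeline: it assumes documents are packed back to back into 8,192-token sequences with no padding and that shards are equally sized. Neither is stated in this card.

```python
# Figures quoted in the Training Data list.
DOCS_RAW = 38.5e6        # court decisions before filtering
DOCS_KEPT = 33.9e6       # court decisions after dedup + quality filtering
CORPUS_TOKENS = 161.4e9  # Qwen2 BPE tokens
SEQ_LEN = 8192
NUM_SHARDS = 1233

# Fraction of documents that survived filtering (about 88%).
retention = DOCS_KEPT / DOCS_RAW

# Mean document length in tokens (roughly 4,760).
tokens_per_doc = CORPUS_TOKENS / DOCS_KEPT

# ASSUMPTION: perfect packing, no padding, equally sized shards.
total_sequences = CORPUS_TOKENS / SEQ_LEN
sequences_per_shard = total_sequences / NUM_SHARDS   # about 16K
tokens_per_shard = CORPUS_TOKENS / NUM_SHARDS        # about 131M

print(f"retention:           {retention:.1%}")
print(f"tokens per document: {tokens_per_doc:,.0f}")
print(f"sequences per shard: {sequences_per_shard:,.0f}")
print(f"tokens per shard:    {tokens_per_shard / 1e6:.1f}M")
```

Under these assumptions an average decision fills a little over half of one training sequence, so packing several documents per sequence matters for efficiency.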
Training Details
- Hardware: 8x NVIDIA H100 SXM 80GB (NVIDIA Innovation Lab via Brev)
- Framework: HuggingFace Trainer + DeepSpeed ZeRO-3
- Precision: bfloat16
- Global batch size: 128 sequences (1.05M tokens/step)
- Total steps: 9,536 (10B tokens processed)
- Learning rate: 1e-4, cosine schedule, 300-step linear warmup
- Training time: 31 hours
- Throughput: 91K tokens/sec, 11.5 sec/step
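
The numbers in this list can be cross-checked with a few lines of arithmetic, and the learning-rate schedule can be written out explicitly. This is a minimal sketch under stated assumptions: `lr_at` is a hypothetical helper, and it assumes the cosine decays to zero at the final step, which is the usual default for a cosine schedule with warmup but is not confirmed by this card.

```python
import math

NUM_GPUS = 8
SEQ_LEN = 8192
GLOBAL_BATCH = 128       # sequences per optimizer step
TOTAL_STEPS = 9536
SEC_PER_STEP = 11.5
CORPUS_TOKENS = 161.4e9
PEAK_LR = 1e-4
WARMUP_STEPS = 300

tokens_per_step = GLOBAL_BATCH * SEQ_LEN        # 1,048,576 (about 1.05M)
total_tokens = tokens_per_step * TOTAL_STEPS    # about 10B
throughput = tokens_per_step / SEC_PER_STEP     # about 91K tokens/sec
compute_hours = TOTAL_STEPS * SEC_PER_STEP / 3600
corpus_fraction = total_tokens / CORPUS_TOKENS  # share of the corpus seen

# Sequences each GPU handles per step. How this splits into micro-batch
# size and gradient accumulation is not stated in the card.
sequences_per_gpu = GLOBAL_BATCH // NUM_GPUS


def lr_at(step: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay.

    ASSUMPTION: the schedule decays to 0 at TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"tokens/step:     {tokens_per_step:,}")
print(f"total tokens:    {total_tokens / 1e9:.2f}B")
print(f"throughput:      {throughput / 1e3:.1f}K tokens/sec")
print(f"step time total: {compute_hours:.1f} h")
print(f"corpus fraction: {corpus_fraction:.1%}")
print(f"lr at step 150:  {lr_at(150):.2e}")
```

Pure step time comes to about 30.5 hours, in line with the reported 31 hours once checkpointing and other overhead are included. The 10B tokens processed correspond to roughly 6% of the 161.4B-token corpus, so the run covers well under one epoch.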
Results
| Metric | Value |
|---|---|
| Initial loss (step 10) | 1.08 |
| Final loss (step 9,536) | 0.231 |
| Loss reduction | -79% |
| Base perplexity | 3.83 |
| CPT perplexity | 1.30 |
| Perplexity reduction | -66.1% |
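
The percentage rows follow from the raw values by a simple relative change, and loss and perplexity are linked by `perplexity = exp(loss)` when the loss is the mean per-token cross-entropy in nats. The sketch below recomputes both. It assumes the reported loss is in nats, which is the usual convention but is not stated here.

```python
import math

initial_loss, final_loss = 1.08, 0.231
base_ppl, cpt_ppl = 3.83, 1.30


def relative_change(before: float, after: float) -> float:
    """Signed relative change, e.g. -0.79 for a 79% reduction."""
    return (after - before) / before


loss_change = relative_change(initial_loss, final_loss)
ppl_change = relative_change(base_ppl, cpt_ppl)

# ASSUMPTION: loss is mean cross-entropy in nats, so perplexity = exp(loss).
ppl_from_final_loss = math.exp(final_loss)   # about 1.26
loss_from_cpt_ppl = math.log(cpt_ppl)        # about 0.262

print(f"loss change:               {loss_change:.1%}")
print(f"perplexity change:         {ppl_change:.1%}")
print(f"exp(final training loss):  {ppl_from_final_loss:.3f}")
print(f"log(reported CPT PPL):     {loss_from_cpt_ppl:.3f}")
```

The final training loss corresponds to a perplexity of about 1.26, slightly below the reported 1.30. A gap of that size would be expected if the perplexity was measured on a separate evaluation set rather than on training batches, though the card does not say which data was used.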
Scaling Law
All four models in the series converge to similar perplexity after CPT:
| Model | Base PPL | CPT PPL | Reduction |
|---|---|---|---|
| 0.5B | 6.83 | 1.35 | -80% |
| 1.5B | 4.61 | 1.31 | -72% |
| 3B | 3.83 | 1.30 | -66% |
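
As a quick check on the convergence claim, the reductions can be recomputed from the perplexity columns, and the spread between the best and worst model can be compared before and after CPT. This is an illustrative sketch that uses only the three rows listed in the table.

```python
# (model, base perplexity, CPT perplexity) as listed in the table.
rows = [
    ("0.5B", 6.83, 1.35),
    ("1.5B", 4.61, 1.31),
    ("3B", 3.83, 1.30),
]

# Relative perplexity change per model, rounded to whole percent.
reductions = {
    name: round((cpt - base) / base * 100) for name, base, cpt in rows
}

# Ratio of worst to best perplexity across model sizes.
base_spread = max(b for _, b, _ in rows) / min(b for _, b, _ in rows)
cpt_spread = max(c for _, _, c in rows) / min(c for _, _, c in rows)

for name, pct in reductions.items():
    print(f"{name}: {pct}%")
print(f"base PPL spread (max/min): {base_spread:.2f}x")
print(f"CPT PPL spread (max/min):  {cpt_spread:.2f}x")
```

The same formula gives about -66% for the 3B row, consistent with the Results section. Before CPT the smallest model's perplexity is about 1.8 times that of the largest. After CPT the ratio drops to about 1.04, which is what "converge to similar perplexity" means in numbers.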
Intended Use
This is a base model (not instruction-tuned). It is intended for:
- Research on domain adaptation of LLMs for low-resource legal languages
- Downstream fine-tuning for Ukrainian legal NLP tasks
- Scaling law analysis of continued pretraining
- Perplexity evaluation on Ukrainian legal text
Limitations
- Not instruction-tuned; will not follow instructions or chat
- Trained on Ukrainian court decisions only; may not generalize to other legal systems