Shamima/babylm-2026-multilingual-uniform-100M-v2

Architecture

Llama (HF LlamaForCausalLM) — RoPE, RMSNorm, SwiGLU, no biases, tied embeddings
12 layers · 768 hidden · 12 heads · 2048 FFN
1024 sequence length
110,119,680 parameters
Tokenizer: joint byte-level BPE with a 32,768-token vocabulary (same as v1; reused so the two versions are directly comparable)
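
The parameter count above can be reproduced from the listed dimensions. The sketch below is a back-of-the-envelope check, not code from the training repository; it assumes standard multi-head attention (no grouped-query attention) and one RMSNorm weight vector per norm, and the function name `llama_param_count` is hypothetical.

```python
def llama_param_count(vocab, hidden, layers, ffn, tied=True):
    """Count the weights of a bias-free Llama-style decoder."""
    embed = vocab * hidden            # token embedding matrix
    attn = 4 * hidden * hidden        # q, k, v, o projections (assumes full MHA)
    mlp = 3 * hidden * ffn            # SwiGLU: gate, up, down projections
    norms = 2 * hidden                # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    final_norm = hidden               # RMSNorm before the output head
    lm_head = 0 if tied else vocab * hidden  # tied head shares the embedding
    return embed + layers * per_layer + final_norm + lm_head


total = llama_param_count(vocab=32768, hidden=768, layers=12, ffn=2048)
print(f"{total:,}")  # 110,119,680
```

With tied embeddings the output head adds no extra weights, which is why the 25.2M-parameter embedding matrix is counted only once.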

Training

Data: BabyBabelLM 2026 100M tier (EN/NL/ZH); full corpora loaded in memory and shuffled
Mixture: byte-premium-uniform via deficit-driven selection (1/3 of reference tokens per language)
Optimiser: AdamW (β1=0.9, β2=0.95, wd=0.1)
LR: 6e-4 peak, WSD schedule (warmup 200 → constant peak → linear 25% decay tail to 6e-5)
Compute: 4× NVIDIA A10G (23 GB), bf16, DDP, micro-batch 16 × grad-accum 2 (effective batch 128 sequences = 131,072 tokens/step)
Tokens consumed at this checkpoint: 100,016,896 byte-premium-adjusted reference tokens (= 1 epoch over the corpus)
Per-language epochs at this checkpoint: ~1.0 each (well within the BabyLM ≤10-epoch cap)
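
The batch arithmetic and the WSD (warmup, stable, decay) schedule described above can be sketched as follows. This is an illustrative reading, not the training code: it assumes the warmup of 200 is measured in optimiser steps, that the warmup is linear, and that the "25% decay tail" means the final 25% of steps. The function name `wsd_lr` and the value `total_steps=763` (roughly 100M tokens divided by the tokens per step) are hypothetical.

```python
# Effective batch size from the compute line.
gpus, micro_batch, grad_accum, seq_len = 4, 16, 2, 1024
sequences_per_step = gpus * micro_batch * grad_accum   # 128
tokens_per_step = sequences_per_step * seq_len         # 131,072


def wsd_lr(step, total_steps, peak_lr=6e-4, final_lr=6e-5,
           warmup_steps=200, decay_frac=0.25):
    """Learning rate at `step` under a warmup-stable-decay schedule."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        # Linear warmup (assumed) that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Constant plateau at the peak learning rate.
        return peak_lr
    # Linear decay from peak_lr down to final_lr over the tail.
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + (final_lr - peak_lr) * progress


total_steps = 763  # illustrative value only
schedule = [wsd_lr(s, total_steps) for s in range(total_steps + 1)]
```

In a PyTorch training loop, a function like this could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to the peak learning rate instead of the absolute value.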

Revisions

19 fast-eval branches: chck_1M, chck_2M, …, chck_9M, chck_10M, chck_20M, …, chck_90M, chck_100M. main is chck_100M.
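
A specific checkpoint can be loaded by passing its branch name as `revision` to the Transformers `from_pretrained` methods. The sketch below assumes the elided branch names follow the visible pattern (1M steps up to 9M, then 10M steps up to 100M); the helper names `checkpoint_branches` and `load_checkpoint` are hypothetical.

```python
REPO_ID = "Shamima/babylm-2026-multilingual-uniform-100M-v2"


def checkpoint_branches():
    """Branch names as inferred from the pattern in the list above."""
    ones = [f"chck_{n}M" for n in range(1, 10)]         # chck_1M .. chck_9M
    tens = [f"chck_{n}M" for n in range(10, 101, 10)]   # chck_10M .. chck_100M
    return ones + tens


def load_checkpoint(revision="main"):
    """Download the tokenizer and model weights from one branch of the repo."""
    # Imported lazily so that listing branch names does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID, revision=revision)
    return tokenizer, model


branches = checkpoint_branches()
print(len(branches))  # 19
```

For example, `load_checkpoint("chck_50M")` would fetch the intermediate checkpoint on that branch, assuming it exists under that name.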

How to evaluate

```bash
git clone https://github.com/babylm-org/babylm-eval
cd babylm-eval/multilingual
bash scripts/zeroshot_model.sh --model_name Shamima/babylm-2026-multilingual-uniform-100M-v2
bash scripts/zeroshot_model_fast_all.sh --model_name Shamima/babylm-2026-multilingual-uniform-100M-v2
```

Comparison vs v1

See https://github.com/silvererudite/bb-lm-challenge-sub for the iteration log, scaffold, and ablation configs.

Citation

```bibtex
@misc{babylm-2026-uniform-v2,
  title  = {BabyLM 2026 MultiLingual baseline v2 (WSD schedule)},
  author = {Hossain, Shamima},
  year   = {2026},
  url    = {https://huggingface.co/Shamima/babylm-2026-multilingual-uniform-100M-v2}
}
```
