Shamima/babylm-2026-multilingual-uniform-100M

Architecture

Llama (HF LlamaForCausalLM) — RoPE, RMSNorm, SwiGLU, no biases, tied embeddings
12 layers · 768 hidden · 12 heads · 2048 FFN
1024 sequence length
110,119,680 parameters
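
The parameter count above can be reproduced from the listed dimensions. The following is a minimal sketch, assuming standard multi-head attention (12 key/value heads, so all four attention projections are 768 × 768), one RMSNorm weight vector before each sub-block plus a final one, and the 32,768-entry vocabulary from the Tokenizer section. The variable names are illustrative and not taken from the training code.

```python
# Parameter count for a Llama-style decoder with tied embeddings and no biases.
vocab_size = 32_768  # from the Tokenizer section
hidden = 768
layers = 12
ffn = 2_048

embedding = vocab_size * hidden   # shared with the output head (tied), counted once
attention = 4 * hidden * hidden   # q, k, v, o projections (assumes no grouped-query attention)
mlp = 3 * hidden * ffn            # SwiGLU: gate, up, and down projections
norms = 2 * hidden                # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

# RoPE has no learned parameters; the last term is the final RMSNorm.
total = embedding + layers * per_layer + hidden
print(f"{total:,}")  # 110,119,680
```

Roughly 25.2M of the total sits in the tied embedding matrix, and the remaining 85M is in the 12 transformer blocks.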

Tokenizer

Joint byte-level BPE with a 32,768-token vocabulary, trained on a balanced sample of 50M characters from each of EN, NL, and ZH. The same tokenizer is shared across all three languages (the data card explains why a joint tokenizer is required: ZH is 6.8% Latin script).

Training

Data: BabyLM-community/babylm-eng + babylm-nld + babylm-zho (BabyBabelLM 2026 100M tier). The full corpora are loaded into memory and shuffled, because the Hub layout is clustered by category and streaming with a reasonably sized shuffle buffer produces a biased sample.
Mixture: byte-premium-uniform. Each language receives an equal share of reference tokens (1/3 each). This is achieved by deficit-driven selection rather than uniform document sampling, because mean document sizes differ across languages.
Optimizer: AdamW (β₁=0.9, β₂=0.95, wd=0.1), peak lr 6e-4, cosine decay to 10% of peak, 100-step warmup
Compute: 4× NVIDIA A10G (23 GB), bf16, DDP, micro-batch 16 × grad-accum 2 (effective batch 4 × 16 × 2 = 128 sequences = 131,072 tokens/step)
Tokens consumed at this checkpoint: 100,000,000 byte-premium-adjusted reference tokens
Per-language epochs at this checkpoint: ≈1.0 each (within the BabyLM ≤10-epoch cap)
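
The deficit-driven selection named in the Mixture entry is not spelled out in this card. The sketch below shows one plausible reading: at every step, draw the next document from whichever language is furthest below its target share of tokens consumed so far. The function name, its arguments, and the toy document sizes are hypothetical, and sizes are taken as already measured in reference tokens (the byte-premium adjustment itself is not reproduced here).

```python
import random
from collections import Counter


def deficit_driven_mixture(docs_by_lang, total_budget, seed=0):
    """Order documents so every language gets an equal share of the token budget.

    docs_by_lang maps a language code to a list of document sizes in
    reference tokens. Returns (order, consumed): the sequence of language
    codes drawn and the tokens consumed per language.
    """
    rng = random.Random(seed)
    # Shuffle within each language so the draw order is not corpus order.
    pools = {lang: rng.sample(sizes, len(sizes)) for lang, sizes in docs_by_lang.items()}
    target_share = 1.0 / len(pools)
    consumed = {lang: 0 for lang in pools}
    order = []
    total = 0
    while total < total_budget:
        candidates = [lang for lang in pools if pools[lang]]
        if not candidates:
            break  # every pool is exhausted; this sketch does not start a new epoch
        # Deficit = tokens the language should have by now minus tokens it has.
        lang = max(candidates, key=lambda l: target_share * total - consumed[l])
        size = pools[lang].pop()
        consumed[lang] += size
        total += size
        order.append(lang)
    return order, consumed


# Toy corpora with deliberately different document sizes (hypothetical numbers).
toy_sizes = {"en": [100] * 1000, "nl": [300] * 1000, "zh": [50] * 1000}
order, consumed = deficit_driven_mixture(toy_sizes, total_budget=30_000)
doc_counts = Counter(order)
print("tokens per language:", consumed)
print("documents per language:", dict(doc_counts))
```

In this toy setup the per-language token totals stay within one document of each other while the document counts differ widely. Uniform document sampling would do the opposite: equal document counts and unequal token shares.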
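
The Optimizer and Compute entries above imply a learning-rate curve and a per-step token count that can be written out directly. This is a sketch under stated assumptions: warmup is taken to be linear, "cosine to 10%" is read as decay to 10% of the peak rate, and the total step count is a free parameter because this card does not state it. The value 1,000 used in the demo is purely illustrative.

```python
import math

PEAK_LR = 6e-4
MIN_LR_RATIO = 0.10   # floor of the cosine decay, as a fraction of PEAK_LR
WARMUP_STEPS = 100


def lr_at(step, total_steps):
    """Learning rate at a 0-indexed optimizer step.

    Linear warmup (an assumption) for WARMUP_STEPS steps, then cosine
    decay from PEAK_LR down to MIN_LR_RATIO * PEAK_LR at total_steps.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


# Effective batch size from the Compute entry.
gpus, micro_batch, grad_accum, seq_len = 4, 16, 2, 1024
sequences_per_step = gpus * micro_batch * grad_accum  # 128
tokens_per_step = sequences_per_step * seq_len        # 131,072

ILLUSTRATIVE_TOTAL_STEPS = 1_000  # hypothetical; not taken from the training run
for step in (0, 99, 500, 1_000):
    print(step, lr_at(step, ILLUSTRATIVE_TOTAL_STEPS))
```

With PyTorch, the same function could be wrapped in torch.optim.lr_scheduler.LambdaLR by returning the ratio lr_at(step, total_steps) / PEAK_LR.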

Revisions

The chck_{N}M revisions match the BabyLM eval pipeline's fast-eval naming:

```text
chck_1M, chck_2M, ..., chck_9M, chck_10M, chck_20M, ..., chck_90M, chck_100M
```

Pass revision="chck_{N}M" (for example, revision="chck_20M") to load any milestone. The default branch (main) is chck_100M.
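
A short sketch of how the milestone revisions might be enumerated and loaded with the transformers library. It assumes the elided entries in the list above follow the visible pattern (1M steps up to 10M, then 10M steps up to 100M). The helper names checkpoint_revisions and load_checkpoint are illustrative, not part of any library.

```python
REPO_ID = "Shamima/babylm-2026-multilingual-uniform-100M"


def checkpoint_revisions():
    """Return the milestone revision names: chck_1M..chck_9M, then chck_10M..chck_100M."""
    millions = list(range(1, 10)) + list(range(10, 101, 10))
    return [f"chck_{n}M" for n in millions]


def load_checkpoint(revision="main"):
    """Download the tokenizer and model for one revision from the Hub."""
    # Imported lazily so the name helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID, revision=revision)
    return tokenizer, model


revisions = checkpoint_revisions()
print(len(revisions), revisions[0], revisions[-1])  # 19 chck_1M chck_100M
```

Calling load_checkpoint("chck_20M") would then fetch the 20M-token milestone. This requires network access and the transformers package.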

How to evaluate

```bash
git clone https://github.com/babylm-org/babylm-eval
cd babylm-eval/multilingual
bash scripts/zeroshot_model.sh --model_name Shamima/babylm-2026-multilingual-uniform-100M
bash scripts/zeroshot_model_fast_all.sh --model_name Shamima/babylm-2026-multilingual-uniform-100M
```

Citation

```bibtex
@misc{babylm-2026-uniform,
  title  = {BabyLM 2026 MultiLingual baseline (byte-premium-uniform)},
  author = {Hossain, Shamima},
  year   = {2026},
  url    = {https://huggingface.co/Shamima/babylm-2026-multilingual-uniform-100M}
}
```

Companion repo with audit, scaffold, and ablation configs: https://github.com/silvererudite/bb-lm-challenge-sub
