TilQazyna/Til-mini-1B API & Inference Endpoint

Model details

Architecture	DeepSeek-V3-style dense decoder with MLA (Multi-head Latent Attention)
Parameters	956.3M (tied input/output embeddings)
Hidden / layers	1792 / 24
Attention	16 heads, MLA: q_lora_rank 384, kv_lora_rank 192, qk_rope 32, qk_nope 64, v_head 64
FFN intermediate	4864 (SwiGLU)
Context length	2048
Position encoding	RoPE, θ = 100 000
Vocab	131 072 — Til-Tokenizer-128k
Precision	bf16
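
As a sanity check, the parameter count can be reproduced from the table. This is a minimal sketch that assumes the usual DeepSeek-V3 MLA projection layout (low-rank query and KV projections, RMSNorm weights, no biases); the function name and the exact layout are assumptions, not taken from the checkpoint.

```python
def count_parameters(vocab=131_072, hidden=1792, layers=24, heads=16,
                     q_lora=384, kv_lora=192, rope=32, nope=64,
                     v_head=64, ffn=4864):
    """Estimate parameters under an assumed DeepSeek-V3-style layout."""
    qk_head = nope + rope
    attn = (
        hidden * q_lora                      # query down-projection
        + q_lora * heads * qk_head           # query up-projection
        + hidden * (kv_lora + rope)          # KV down-projection + shared RoPE key
        + kv_lora * heads * (nope + v_head)  # KV up-projection
        + heads * v_head * hidden            # output projection
    )
    mlp = 3 * hidden * ffn                   # SwiGLU: gate, up, down
    norms = 2 * hidden + q_lora + kv_lora    # two block norms + two latent norms
    embed = vocab * hidden                   # tied embeddings, counted once
    return embed + layers * (attn + mlp + norms) + hidden  # + final norm

total = count_parameters()
print(f"{total / 1e6:.1f}M")
```

Under these assumptions the estimate agrees with the 956.3M figure in the table.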

MLA compresses the KV cache through low-rank latent projections, which reduces memory use at inference time. Combined with 4-bit weight quantization (≈0.5 GB), this makes the model suitable for mobile-class hardware.
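
The saving can be estimated from the attention dimensions in the table. The sketch below assumes the runtime caches only the compressed KV latent plus one shared RoPE key per token, as in the DeepSeek MLA design; whether a particular inference stack actually stores the cache this way is implementation-dependent.

```python
# Per-token, per-layer KV-cache footprint, from the attention row of the table.
HEADS, QK_ROPE, QK_NOPE, V_HEAD, KV_LORA = 16, 32, 64, 64, 192
LAYERS, CONTEXT, BYTES_BF16 = 24, 2048, 2

# Assumption: MLA caches one compressed KV latent plus one shared RoPE key.
mla_values = KV_LORA + QK_ROPE
# Baseline: uncompressed keys and values for every head.
mha_values = HEADS * (QK_NOPE + QK_ROPE) + HEADS * V_HEAD

def cache_bytes(values_per_token):
    """Cache size in bytes for one full-context sequence in bf16."""
    return values_per_token * LAYERS * CONTEXT * BYTES_BF16

mla_bytes = cache_bytes(mla_values)
mha_bytes = cache_bytes(mha_values)
print(f"MLA: {mla_bytes / 2**20:.0f} MiB, uncompressed: {mha_bytes / 2**20:.0f} MiB")
```

For a full 2048-token sequence this works out to about 21 MiB versus 240 MiB, roughly an 11× reduction under the stated assumption.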

Tokenizer

TilQazyna/Til-Tokenizer-128k — 131 072 BPE vocabulary trained with a focus on Kazakh morphology (≈1 token per Kazakh word on average), while remaining efficient for Russian, English, code and math. Special tokens: pad=0, <s>=1, </s>=2, <|im_start|>=6, <|im_end|>=7.
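
The tokens-per-word figure can be checked on your own text with a small helper. This is a sketch only: `tokens_per_word` is a hypothetical name, words are approximated by whitespace splitting, and the toy encoder stands in for the real tokenizer.

```python
def tokens_per_word(encode, texts):
    """Average number of tokens per whitespace-separated word."""
    n_tokens = sum(len(encode(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Toy encoder (one token per word) used only to demonstrate the helper.
toy_ratio = tokens_per_word(str.split, ["бір екі үш", "төрт бес"])
print(toy_ratio)
```

With the real tokenizer, `encode` could be `lambda t: tok.encode(t, add_special_tokens=False)`, where `tok` is the loaded `AutoTokenizer`.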

Training data

One full epoch over Til-Corpus — 47.0B tokens, ~71M documents:

Domain	Tokens	Share
English	11.9B	25%
Code	9.9B	21%
Kazakh	9.7B	21%
Math	9.0B	19%
Russian	6.6B	14%

Documents are tokenized, concatenated with </s> separators and packed into fixed 2048-token sequences. Batches are fully shuffled across domains.
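
The packing step can be sketched as follows. This is an illustration rather than the actual training pipeline: the function name is hypothetical, and it assumes a separator is appended after every document and that the trailing partial sequence is dropped.

```python
def pack_documents(docs, eos_id=2, seq_len=2048):
    """Concatenate tokenized documents with </s> and cut fixed-length sequences."""
    stream = []
    for ids in docs:
        stream.extend(ids)
        stream.append(eos_id)  # </s> separator (id 2 in Til-Tokenizer-128k)
    n_full = len(stream) // seq_len  # assumption: the incomplete tail is dropped
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

# Tiny example with seq_len=4 so the result is easy to inspect.
packed = pack_documents([[10, 11, 12], [13, 14], [15]], seq_len=4)
print(packed)
```

Note that a packed sequence can start or end in the middle of a document, which is the usual trade-off for having no padding tokens.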

Training procedure

Steps	89 690 (1 epoch)
Global batch	256 sequences × 2048 = 0.52M tokens/step
Optimizer	AdamW, lr 6e-4, weight decay 0.1, grad clip 1.0
LR schedule	WSD (warmup-stable-decay): 1000 warmup steps → stable → linear decay over the final 30% of steps
Precision	bf16
Hardware	8×H200, DDP, 35.5 h
Tokens / parameter	≈49 (47.0B tokens / 956.3M parameters; deliberately overtrained for deployment quality)
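
The learning-rate schedule in the table can be sketched as a function of the step index. The warmup shape (linear) and the final learning rate (zero) are assumptions; the source only gives the peak rate, the warmup length, and the decay fraction.

```python
def wsd_lr(step, peak_lr=6e-4, total_steps=89_690, warmup_steps=1000,
           decay_fraction=0.3, final_lr=0.0):
    """Warmup-stable-decay learning rate for a given optimizer step."""
    decay_steps = round(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Assumption: linear warmup from near zero up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr  # stable phase
    # Linear decay over the final 30% of steps (final_lr is an assumption).
    progress = (step - decay_start) / decay_steps
    return peak_lr + (final_lr - peak_lr) * progress

for s in (0, 999, 50_000, 89_690):
    print(s, wsd_lr(s))
```

With these numbers the decay phase starts at step 62 783 and covers the last 26 907 steps.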

Evaluation

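Bits-per-byte (BPB) on a frozen held-out set covering 5 domains. BPB normalizes the loss by the number of UTF-8 bytes in the scored text, so scores are comparable across models with different tokenizers. Macro is the unweighted mean over the five domains: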

Domain	BPB ↓
Kazakh (kk)	0.4645
Code	0.4389
Russian (ru)	0.5079
Math	0.7715
English (en)	0.9208
Macro	0.6207
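
For reference, BPB can be computed from a summed negative log-likelihood as sketched below. The helper names are illustrative, and the sketch assumes the loss is summed over tokens and measured in nats (as with a standard cross-entropy loss); it is not the evaluation code used for the table.

```python
import math

def bits_per_byte(total_nll_nats, text):
    """Convert a summed NLL (in nats) into bits per UTF-8 byte of the text."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

def macro_average(scores):
    """Unweighted mean over domains."""
    return sum(scores.values()) / len(scores)

# Cyrillic letters take 2 bytes each in UTF-8, so this word is 10 bytes long.
n_bytes_kk = len("Қазақ".encode("utf-8"))

table = {"kk": 0.4645, "code": 0.4389, "ru": 0.5079, "math": 0.7715, "en": 0.9208}
print(round(macro_average(table), 4))
```

Averaging the five per-domain values this way reproduces the Macro row of the table.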

Usage

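```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TilQazyna/Til-mini-1B"

# Load the tokenizer and the bf16 weights; device_map="auto" places the model
# on the available device(s). Older transformers releases spell the dtype
# argument torch_dtype.
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto")

# Tokenize a Kazakh prompt and move it to the model's device.
ids = tok("Абай Құнанбайұлы — қазақ халқының", return_tensors="pt").input_ids.to(model.device)

# Sample a continuation; pad_token_id=0 matches the tokenizer's pad token.
out = model.generate(ids, max_new_tokens=80, do_sample=True,
                     temperature=0.7, top_p=0.9, repetition_penalty=1.1,
                     pad_token_id=0)
print(tok.decode(out[0], skip_special_tokens=True))
```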

Sample completions (temperature 0.7, base model, no SFT):

Қазақстан Республикасының астанасы - Астана қаласы.

Абай Құнанбайұлы — қазақ халқының ұлы ақыны, ағартушы, қазақтың жазба әдебиетінің және әдеби тілінің негізін қалаушы, философ, композитор.

Intended use & limitations

Intended: research on Kazakh/multilingual NLP; foundation for fine-tunes (instruct, GEC, domain adaptation); on-device text completion after quantization.
Base model: completes text, does not answer questions or follow instructions.
Factuality: like all sub-1B models, it hallucinates facts and numbers; do not use raw outputs as a source of truth.
Reasoning/code: surface form is fluent; logical and arithmetic correctness is not guaranteed.
Context window is 2048 tokens.
No safety alignment has been applied.
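
Because of the 2048-token window, long prompts need to be trimmed so that the prompt plus the generated tokens still fit. A minimal sketch, assuming the most recent tokens matter most (the helper name and the keep-the-tail policy are illustrative choices):

```python
def fit_prompt(input_ids, max_new_tokens=80, context_length=2048):
    """Keep only the last tokens so prompt + generation fits the context."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than context_length")
    return input_ids[-budget:]

long_prompt = list(range(3000))  # stand-in for real token ids
trimmed = fit_prompt(long_prompt)
print(len(trimmed), trimmed[0])
```

Prompts that already fit are returned unchanged.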

License

Apache 2.0. Access is gated (manual approval) for usage tracking.
