TilQazyna/Til-2B API & Inference Endpoint

Model details


Architecture	DeepSeek-V3-style dense decoder with MLA (Multi-head Latent Attention)
Parameters	1977.1M (tied input/output embeddings)
Hidden / layers	2048 / 30
Attention	16 heads, MLA: q_lora_rank 512, kv_lora_rank 256, qk_rope 32, qk_nope 96, v_head 96
FFN intermediate	8192 (SwiGLU)
Context length	4096
Position encoding	RoPE, θ = 100 000
Vocab	131 072 — Til-Tokenizer-128k
Precision	bf16

MLA compresses the KV-cache via low-rank latent projections, keeping the model memory-efficient at inference time.

Tokenizer

TilQazyna/Til-Tokenizer-128k — 131 072 BPE vocabulary trained with a focus on Kazakh morphology (≈1 token per Kazakh word on average), efficient for Russian, English, code and math. Special tokens: pad=0, <s>=1, </s>=2, <|im_start|>=6, <|im_end|>=7.

Training data

One full epoch over Til-Corpus — 47.0B tokens:

Domain	Tokens	Share
English	11.9B	25%
Code	9.9B	21%
Kazakh	9.7B	21%
Math	9.0B	19%
Russian	6.6B	14%

Documents are tokenized, concatenated with </s> separators and packed into fixed 4096-token sequences. Batches are fully shuffled across domains.

Training procedure


Global batch	256 sequences × 4096 = 1.05M tokens/step
Optimizer	AdamW, lr 6e-4, weight decay 0.1, grad clip 1.0
LR schedule	WSD (warmup → stable → linear decay over final 30%)
Precision	bf16
Hardware	8×H200, DDP

Evaluation

Bits-per-byte (BPB) on a frozen held-out set, 5 domains. BPB normalizes by UTF-8 bytes, so the number is independent of the tokenizer:

Domain	BPB ↓
Kazakh (kk)	0.4521
Code	0.4250
Russian (ru)	0.4969
Math	0.7524
English (en)	0.9056
Macro	0.6064

Usage

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TilQazyna/Til-2B"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto")

ids = tok("Абай Құнанбайұлы — қазақ халқының", return_tensors="pt").input_ids.to(model.device)
out = model.generate(ids, max_new_tokens=80, do_sample=True,
                     temperature=0.7, top_p=0.9, repetition_penalty=1.1, pad_token_id=0)
print(tok.decode(out[0], skip_special_tokens=True))

Intended use & limitations

Intended: research on Kazakh/multilingual NLP; foundation for fine-tunes (instruct, GEC, domain adaptation); on-device text completion after quantization.
Base model: completes text, does not answer questions or follow instructions.
Factuality: hallucinates facts and numbers; do not use raw outputs as truth.
Reasoning/code: surface form is fluent; logical/arithmetic correctness is not guaranteed.
No safety alignment has been applied.

License

Apache 2.0. Access is gated (manual approval) for usage tracking.

Til-2B

Get help setting up a custom Dedicated Endpoints.

README