License: apache-2.0
Model details
| Property | Value |
|---|---|
| Architecture | DeepSeek-V3-style dense decoder with MLA (Multi-head Latent Attention) |
| Parameters | 1977.1M (tied input/output embeddings) |
| Hidden / layers | 2048 / 30 |
| Attention | 16 heads, MLA: q_lora_rank 512, kv_lora_rank 256, qk_rope 32, qk_nope 96, v_head 96 |
| FFN intermediate | 8192 (SwiGLU) |
| Context length | 4096 |
| Position encoding | RoPE, θ = 100 000 |
| Vocab | 131 072 — Til-Tokenizer-128k |
| Precision | bf16 |
MLA compresses the KV-cache via low-rank latent projections, keeping the model memory-efficient at inference time.
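To give a rough sense of the saving, the sketch below compares cache sizes using the dimensions from the table above. It assumes the inference code caches only the compressed KV latent (`kv_lora_rank`) plus the shared RoPE key (`qk_rope`) per layer, as in the original MLA formulation. An implementation that caches decompressed keys and values would not see this saving. The helper name `kv_cache_bytes` is hypothetical.

```python
# Dimensions taken from the model details table.
LAYERS = 30
HEADS = 16
KV_LORA_RANK = 256
QK_ROPE = 32
QK_NOPE = 96
V_HEAD = 96
CONTEXT = 4096
BYTES_PER_VALUE = 2  # bf16


def kv_cache_bytes(values_per_token_per_layer: int,
                   layers: int = LAYERS,
                   context: int = CONTEXT,
                   bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Cache size in bytes for one sequence at full context length."""
    return values_per_token_per_layer * layers * context * bytes_per_value


# Assumption: MLA caches one latent vector plus one shared RoPE key per token.
mla_values = KV_LORA_RANK + QK_ROPE
# Baseline for comparison: standard multi-head attention with full K and V per head.
mha_values = HEADS * (QK_NOPE + QK_ROPE) + HEADS * V_HEAD

print(mla_values, mha_values)                  # 288 3584
print(round(mha_values / mla_values, 2))       # 12.44
print(kv_cache_bytes(mla_values) / 2**20)      # 67.5 (MiB)
print(kv_cache_bytes(mha_values) / 2**20)      # 840.0 (MiB)
```

Under these assumptions the cache for one 4096-token sequence shrinks from about 840 MiB to about 67.5 MiB in bf16.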
Tokenizer
TilQazyna/Til-Tokenizer-128k: a BPE tokenizer with a 131 072-entry vocabulary,
trained with a focus on Kazakh morphology (≈1 token per Kazakh word on average)
and also efficient for Russian, English, code and math.
Special tokens: pad=0, <s>=1, </s>=2, <|im_start|>=6, <|im_end|>=7.
Training data
One full epoch over Til-Corpus — 47.0B tokens:
| Domain | Tokens | Share |
|---|---|---|
| English | 11.9B | 25% |
| Code | 9.9B | 21% |
| Kazakh | 9.7B | 21% |
| Math | 9.0B | 19% |
| Russian | 6.6B | 14% |
Documents are tokenized, concatenated with </s> separators and packed into
fixed 4096-token sequences. Batches are fully shuffled across domains.
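The packing step can be sketched as follows. This is a minimal illustration, not the project's actual data pipeline: the function name `pack_documents` is hypothetical, and dropping the final partial sequence is an assumption, since the handling of the remainder is not described here. The separator id 2 is the `</s>` id from the tokenizer section.

```python
from typing import Iterable, List

EOS_ID = 2      # </s>, per the tokenizer section
SEQ_LEN = 4096  # fixed training sequence length


def pack_documents(docs: Iterable[List[int]],
                   seq_len: int = SEQ_LEN,
                   eos_id: int = EOS_ID) -> List[List[int]]:
    """Concatenate tokenized documents with EOS separators and cut the
    resulting stream into fixed-length sequences."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # separator after every document
    n_full = len(stream) // seq_len
    # Assumption: the trailing partial sequence is dropped.
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy example with a short sequence length to make the behaviour visible.
packed = pack_documents([[5, 6, 7], [8, 9]], seq_len=3)
print(packed)  # [[5, 6, 7], [2, 8, 9]]
```

Note that a packed sequence can start mid-document or span several documents, which is why the separator token matters.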
Training procedure
| Setting | Value |
|---|---|
| Global batch | 256 sequences × 4096 = 1.05M tokens/step |
| Optimizer | AdamW, lr 6e-4, weight decay 0.1, grad clip 1.0 |
| LR schedule | WSD (warmup → stable → linear decay over final 30%) |
| Precision | bf16 |
| Hardware | 8×H200, DDP |
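The numbers in this table can be tied together with a small sketch of the schedule. The peak learning rate, the 30% linear decay and the batch size come from the table, and the 47.0B-token epoch comes from the training data section. The warmup length and the final learning rate are not stated, so `warmup_steps=1000` and `final_lr=0.0` are placeholder assumptions, and `wsd_lr` is a hypothetical helper name.

```python
TOKENS_PER_STEP = 256 * 4096   # global batch × sequence length
CORPUS_TOKENS = 47.0e9         # one epoch over Til-Corpus

# Approximate step count for one epoch (about 44.8k steps).
total_steps = round(CORPUS_TOKENS / TOKENS_PER_STEP)


def wsd_lr(step: int, total_steps: int, peak_lr: float = 6e-4,
           warmup_steps: int = 1000,   # assumption: not stated in the card
           decay_frac: float = 0.3,
           final_lr: float = 0.0) -> float:   # assumption: not stated
    """Warmup -> stable -> linear decay over the final `decay_frac` of steps."""
    decay_steps = round(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from near zero to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase at the peak rate.
        return peak_lr
    # Linear decay from peak_lr to final_lr.
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress


print(TOKENS_PER_STEP)  # 1048576
print(total_steps)
for s in (0, total_steps // 2, total_steps - 1):
    print(s, wsd_lr(s, total_steps))
```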
Evaluation
Bits-per-byte (BPB) on a frozen held-out set covering 5 domains. BPB normalizes the loss by UTF-8 bytes rather than by tokens, so the scores are comparable across tokenizers. Macro is the unweighted mean of the five domain scores:
| Domain | BPB ↓ |
|---|---|
| Kazakh (kk) | 0.4521 |
| Code | 0.4250 |
| Russian (ru) | 0.4969 |
| Math | 0.7524 |
| English (en) | 0.9056 |
| Macro | 0.6064 |
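For reference, BPB can be computed from a model's summed negative log-likelihood as sketched below. This is a generic illustration and not the evaluation script behind the table; `bits_per_byte` is a hypothetical helper, and it assumes the loss is summed over all tokens of the text and measured in nats (the usual cross-entropy unit in PyTorch).

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed token-level NLL (in nats) into bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)


# Toy check: "қазақ" is 5 Cyrillic characters, i.e. 10 UTF-8 bytes.
# A total loss of 10·ln(2) nats therefore corresponds to exactly 1 bit per byte.
print(bits_per_byte(10 * math.log(2), "қазақ"))

# The Macro row is the unweighted mean of the five per-domain scores.
scores = {"kk": 0.4521, "code": 0.4250, "ru": 0.4969, "math": 0.7524, "en": 0.9056}
macro = sum(scores.values()) / len(scores)
print(round(macro, 4))  # 0.6064
```

Because the denominator counts bytes of the raw text, a tokenizer that splits the same text into more or fewer tokens does not change the score.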
Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TilQazyna/Til-2B"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto")

# Base model: pass a text prefix to continue, not an instruction.
ids = tok("Абай Құнанбайұлы — қазақ халқының", return_tensors="pt").input_ids.to(model.device)

# Sample a continuation; pad_token_id=0 matches the tokenizer's pad token.
out = model.generate(
    ids, max_new_tokens=80, do_sample=True,
    temperature=0.7, top_p=0.9, repetition_penalty=1.1, pad_token_id=0,
)
print(tok.decode(out[0], skip_special_tokens=True))
```
Intended use & limitations
- Intended: research on Kazakh/multilingual NLP; foundation for fine-tunes (instruct, grammatical error correction (GEC), domain adaptation); on-device text completion after quantization.
- Base model: completes text, does not answer questions or follow instructions.
- Factuality: hallucinates facts and numbers; do not use raw outputs as truth.
- Reasoning/code: surface form is fluent; logical/arithmetic correctness is not guaranteed.
- No safety alignment has been applied.
License
Apache 2.0. Access is gated (manual approval) for usage tracking.
Model provider
TilQazyna
Modalities
Input: Text
Output: Text