TilQazyna
Til-Core-0.5B
License: apache-2.0
Why a 256K morpheme-aware vocabulary?
Kazakh is highly agglutinative — a single root takes long chains of suffixes. Standard byte-level BPE fragments these into many sub-tokens, wasting context and parameters. Til Core uses a 256,000-token morpheme-aware BPE (stukenov/sozkz-morphbpe-256k-kk-v1) that aligns tokens with morphological boundaries, giving ~15–20% better compression on Kazakh text. The trade-off — a heavier embedding table — is absorbed by tying input/output embeddings and using a deeper-than-usual transformer body.
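One way to check a compression claim like this on your own text is to compare tokens per character between two tokenizers. The sketch below is illustrative only: `tokens_per_char`, `relative_saving`, and the two toy encoders are hypothetical helpers, not part of this repository, and the toy encoders merely stand in for a byte-level BPE and the morpheme-aware BPE.

```python
from typing import Callable, Iterable, Sequence


def tokens_per_char(encode: Callable[[str], Sequence], texts: Iterable[str]) -> float:
    """Average number of tokens emitted per input character (lower is better)."""
    total_tokens = 0
    total_chars = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_chars += len(text)
    return total_tokens / total_chars


def relative_saving(baseline: float, candidate: float) -> float:
    """Fraction of tokens saved by `candidate` relative to `baseline`."""
    return 1.0 - candidate / baseline


# Hypothetical toy encoders: fixed-width chunking, NOT real tokenizers.
def toy_fine(text: str) -> list:
    return [text[i:i + 2] for i in range(0, len(text), 2)]


def toy_coarse(text: str) -> list:
    return [text[i:i + 3] for i in range(0, len(text), 3)]


texts = ["Қазақстан Республикасының астанасы"]
fine = tokens_per_char(toy_fine, texts)
coarse = tokens_per_char(toy_coarse, texts)
print(f"fine:   {fine:.3f} tokens/char")
print(f"coarse: {coarse:.3f} tokens/char")
print(f"saving: {relative_saving(fine, coarse):.1%}")
```

With real tokenizers, the `encode` method of a loaded `AutoTokenizer` can be passed in place of the toy functions. Depending on tokenizer configuration it may add special tokens, which slightly inflates counts on short texts.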
Model details
| Property | Value |
|---|---|
| Architecture | Qwen2 (decoder-only, SwiGLU, RoPE, GQA) |
| Parameters | 497.8M (embedding ≈ 229M, transformer ≈ 268M) |
| Vocabulary | 256,000 (morpheme-aware BPE) |
| Hidden size | 896 |
| Layers | 18 |
| Attention heads | 14 (GQA, 2 KV heads) |
| Intermediate size | 4864 |
| Context length | 32,768 (rope_theta = 1e6) |
| Tied embeddings | yes |
| Precision | bf16 |
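The parameter split in the table can be reproduced from the listed dimensions. The sketch below is a back-of-the-envelope count, assuming the usual Qwen2 layer layout (biases on the query/key/value projections only, bias-free SwiGLU MLP, two RMSNorm weights per layer plus a final norm). That layout is an assumption about the checkpoint, not something stated in the table.

```python
# Values from the "Model details" table.
VOCAB = 256_000
HIDDEN = 896
LAYERS = 18
HEADS = 14
KV_HEADS = 2
INTERMEDIATE = 4864

head_dim = HIDDEN // HEADS      # 64
kv_dim = KV_HEADS * head_dim    # 128 (grouped-query attention)

# Assumption: q/k/v projections carry biases, the output projection does not.
attn = (HIDDEN * HIDDEN + HIDDEN) + 2 * (HIDDEN * kv_dim + kv_dim) + HIDDEN * HIDDEN
# SwiGLU MLP: gate, up, and down projections, no biases.
mlp = 3 * HIDDEN * INTERMEDIATE
# Two RMSNorm weight vectors per layer.
norms = 2 * HIDDEN

transformer = LAYERS * (attn + mlp + norms) + HIDDEN  # + final RMSNorm
embedding = VOCAB * HIDDEN   # counted once because input/output embeddings are tied
total = embedding + transformer
untied_total = total + embedding  # what an untied output head would cost

print(f"embedding:   {embedding / 1e6:.1f}M")
print(f"transformer: {transformer / 1e6:.1f}M")
print(f"total:       {total / 1e6:.1f}M")
print(f"untied:      {untied_total / 1e6:.1f}M")
```

Under these assumptions the count matches the table, and it shows why tying matters at this vocabulary size: an untied output head would add a second embedding-sized matrix.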
Training
| Setting | Value |
|---|---|
| Data | stukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1 — pre-tokenized clean Kazakh (~1.44M sequences × 2048 tokens ≈ 2.94B tokens) |
| Tokens seen | ≈ 5.88B (2 epochs) |
| Steps | 11,222 |
| Global batch | 524,288 tokens/step (8 × 8 × grad-accum 4 × 2048) |
| Optimizer | AdamW (default betas), weight decay 0.1, grad clip 1.0 |
| LR schedule | peak 4e-4, cosine decay, 500 warmup steps |
| Sequence length | 2048 |
| Hardware | 8 × NVIDIA H200 (140 GB), ~3h15m |
| Final eval loss | 2.436 (validation), perplexity ≈ 11.4 |
Chinchilla-style budget: ~498M parameters trained on ≈5.9B tokens (~11.8 tokens/param, versus the roughly 20 tokens/param commonly cited as compute-optimal).
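The training figures above are mutually consistent, which the following arithmetic sketch checks. Reading "8 × 8" as 8 GPUs times a per-GPU batch of 8 sequences is an assumption based on the hardware row; the table does not label the factors.

```python
import math

GPUS = 8             # assumption: first factor of "8 × 8"
PER_GPU_BATCH = 8    # assumption: second factor of "8 × 8"
GRAD_ACCUM = 4
SEQ_LEN = 2048
STEPS = 11_222
PARAMS = 497.8e6
CORPUS_TOKENS = 2.94e9
EVAL_LOSS = 2.436

tokens_per_step = GPUS * PER_GPU_BATCH * GRAD_ACCUM * SEQ_LEN
tokens_seen = STEPS * tokens_per_step
epochs = tokens_seen / CORPUS_TOKENS
tokens_per_param = tokens_seen / PARAMS
perplexity = math.exp(EVAL_LOSS)  # perplexity is exp of the mean token loss

print(f"tokens/step:  {tokens_per_step:,}")
print(f"tokens seen:  {tokens_seen / 1e9:.2f}B")
print(f"epochs:       {epochs:.2f}")
print(f"tokens/param: {tokens_per_param:.1f}")
print(f"perplexity:   {perplexity:.1f}")
```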
Usage
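```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TilQazyna/Til-Core-0.5B"

# Load the bundled tokenizer and the bf16 weights.
# Older transformers releases expect torch_dtype= instead of dtype=.
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto").eval()

# Base model: give it a Kazakh prefix to continue, not an instruction.
prompt = "Абай Құнанбайұлы — қазақтың"
ids = tok(prompt, return_tensors="pt").to(model.device)

# Nucleus sampling with a mild repetition penalty.
out = model.generate(**ids, max_new_tokens=60, do_sample=True,
                     temperature=0.8, top_p=0.9, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))
```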
The tokenizer is bundled with this repository (tokenizer.json, tokenizer_config.json).
Sample generations
```markdown
Қазақстан Республикасының астанасы
→ … Астана қаласында орналасқан, Қазақстан Республикасы Президентінің
резиденциясы. Сарайдың негізгі ғимараттары: «Ақорда» залы …

Абай Құнанбайұлы — қазақтың
→ … рухани мәдениетінің көрнекті өкілі. Ол – ақын, ағартушы, жазба
әдебиетінің негізін салушы әрі дамытушы …

Жасанды интеллект дегеніміз —
→ … ақпаратты беру мен оны өңдеудің үздіксіз және тиімді жұмыс жасауын
қамтамасыз ететін технологиялар жиынтығы.
```
Limitations
- Base model, not instruction-tuned — it continues text, it does not follow chat instructions out of the box. Fine-tune for downstream tasks.
- Trained on web/encyclopedic Kazakh, so it can emit corpus artifacts (URLs, site names, boilerplate).
- No safety alignment — outputs are unfiltered.
- Knowledge is limited to the training corpus.
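Because the model can emit corpus artifacts such as URLs, a light post-processing pass on generations may help in some applications. The following is a minimal sketch, not something shipped with this repository: `strip_urls` and its regular expression are hypothetical, and the pattern is deliberately simple (it also removes punctuation directly attached to a URL and does not catch bare domain names).

```python
import re

# Matches http(s):// links and www.-prefixed hosts up to the next whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def strip_urls(text: str) -> str:
    """Remove URL-like spans and collapse the spaces left behind."""
    cleaned = URL_RE.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()


print(strip_urls("Толығырақ https://example.kz/news оқыңыз"))
```

Site names and boilerplate phrases are corpus-specific, so filtering them would need a list built from your own inspection of model outputs.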
Citation
```bibtex
@misc{tilcore05b2026,
  title  = {Til Core 0.5B: a morpheme-aware Kazakh language model},
  author = {TilQazyna},
  year   = {2026},
  url    = {https://huggingface.co/TilQazyna/Til-Core-0.5B}
}
```
Tokenizer: stukenov/sozkz-morphbpe-256k-kk-v1 · Dataset: stukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1