TilQazyna

TilQazyna

Til-Core-0.5B

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Why a 256K morpheme-aware vocabulary?

Kazakh is highly agglutinative — a single root takes long chains of suffixes. Standard byte-level BPE fragments these into many sub-tokens, wasting context and parameters. Til Core uses a 256,000-token morpheme-aware BPE (stukenov/sozkz-morphbpe-256k-kk-v1) that aligns tokens with morphological boundaries, giving ~15–20% better compression on Kazakh text. The trade-off — a heavier embedding table — is absorbed by tying input/output embeddings and using a deeper-than-usual transformer body.

Model details

Table
ArchitectureQwen2 (decoder-only, SwiGLU, RoPE, GQA)
Parameters497.8M (embedding ≈ 229M, transformer ≈ 268M)
Vocabulary256,000 (morpheme-aware BPE)
Hidden size896
Layers18
Attention heads14 (GQA, 2 KV heads)
Intermediate size4864
Context length32,768 (rope_theta = 1e6)
Tied embeddingsyes
Precisionbf16

Training

Table
Datastukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1 — pre-tokenized clean Kazakh (~1.44M sequences × 2048 tokens ≈ 2.94B tokens)
Tokens seen≈ 5.88B (2 epochs)
Steps11,222
Global batch524,288 tokens/step (8 × 8 × grad-accum 4 × 2048)
OptimizerAdamW (β default), weight decay 0.1, grad clip 1.0
LR schedule4e-4, cosine, 500 warmup steps
Sequence length2048
Hardware8 × NVIDIA H200 (140 GB), ~3h15m
Final eval loss2.436 (validation), perplexity ≈ 11.4

Chinchilla-style budget: ~498M params with ≈5.9B tokens (~11.8 tokens/param).

Usage

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
repo = "TilQazyna/Til-Core-0.5B"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto").eval()
prompt = "Абай Құнанбайұлы — қазақтың"
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=60, do_sample=True,
temperature=0.8, top_p=0.9, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))

The tokenizer is bundled with this repository (tokenizer.json, tokenizer_config.json).

Sample generations

markdown

Қазақстан Республикасының астанасы
→ … Астана қаласында орналасқан, Қазақстан Республикасы Президентінің
резиденциясы. Сарайдың негізгі ғимараттары: «Ақорда» залы …
Абай Құнанбайұлы — қазақтың
→ … рухани мәдениетінің көрнекті өкілі. Ол – ақын, ағартушы, жазба
әдебиетінің негізін салушы әрі дамытушы …
Жасанды интеллект дегеніміз —
→ … ақпаратты беру мен оны өңдеудің үздіксіз және тиімді жұмыс жасауын
қамтамасыз ететін технологиялар жиынтығы.

Limitations

  • Base model, not instruction-tuned — it continues text, it does not follow chat instructions out of the box. Fine-tune for downstream tasks.
  • Trained on web/encyclopedic Kazakh, so it can emit corpus artifacts (URLs, site names, boilerplate).
  • No safety alignment — outputs are unfiltered.
  • Knowledge is limited to the training corpus.

Citation

bibtex

@misc{tilcore05b2026,
title = {Til Core 0.5B: a morpheme-aware Kazakh language model},
author = {TilQazyna},
year = {2026},
url = {https://huggingface.co/TilQazyna/Til-Core-0.5B}
}

Tokenizer: stukenov/sozkz-morphbpe-256k-kk-v1 · Dataset: stukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1

Model provider

TilQazyna

TilQazyna

Model tree

Base

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today