Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Model details

ArchitectureDeepSeek-V3-style dense decoder with MLA (Multi-head Latent Attention)
Parameters1977.1M (tied input/output embeddings)
Hidden / layers2048 / 30
Attention16 heads, MLA: q_lora_rank 512, kv_lora_rank 256, qk_rope 32, qk_nope 96, v_head 96
FFN intermediate8192 (SwiGLU)
Context length4096
Position encodingRoPE, θ = 100 000
Vocab131 072 — Til-Tokenizer-128k
Precisionbf16

MLA compresses the KV-cache via low-rank latent projections, keeping the model memory-efficient at inference time.

Tokenizer

TilQazyna/Til-Tokenizer-128k — 131 072 BPE vocabulary trained with a focus on Kazakh morphology (≈1 token per Kazakh word on average), efficient for Russian, English, code and math. Special tokens: pad=0, <s>=1, </s>=2, <|im_start|>=6, <|im_end|>=7.

Training data

One full epoch over Til-Corpus — 47.0B tokens:

DomainTokensShare
English11.9B25%
Code9.9B21%
Kazakh9.7B21%
Math9.0B19%
Russian6.6B14%

Documents are tokenized, concatenated with </s> separators and packed into fixed 4096-token sequences. Batches are fully shuffled across domains.

Training procedure

Global batch256 sequences × 4096 = 1.05M tokens/step
OptimizerAdamW, lr 6e-4, weight decay 0.1, grad clip 1.0
LR scheduleWSD (warmup → stable → linear decay over final 30%)
Precisionbf16
Hardware8×H200, DDP

Evaluation

Bits-per-byte (BPB) on a frozen held-out set, 5 domains. BPB normalizes by UTF-8 bytes, so the number is independent of the tokenizer:

DomainBPB ↓
Kazakh (kk)0.4521
Code0.4250
Russian (ru)0.4969
Math0.7524
English (en)0.9056
Macro0.6064

Usage

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
repo = "TilQazyna/Til-2B"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto")
ids = tok("Абай Құнанбайұлы — қазақ халқының", return_tensors="pt").input_ids.to(model.device)
out = model.generate(ids, max_new_tokens=80, do_sample=True,
temperature=0.7, top_p=0.9, repetition_penalty=1.1, pad_token_id=0)
print(tok.decode(out[0], skip_special_tokens=True))

Intended use & limitations

  • Intended: research on Kazakh/multilingual NLP; foundation for fine-tunes (instruct, GEC, domain adaptation); on-device text completion after quantization.
  • Base model: completes text, does not answer questions or follow instructions.
  • Factuality: hallucinates facts and numbers; do not use raw outputs as truth.
  • Reasoning/code: surface form is fluent; logical/arithmetic correctness is not guaranteed.
  • No safety alignment has been applied.

License

Apache 2.0. Access is gated (manual approval) for usage tracking.

Model provider

TilQazyna

TilQazyna

Model tree

Base

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today