License: apache-2.0
Highlights
- 🇰🇿 Kazakh-first: the majority of the instruction data is native Kazakh, with Russian and English mixed in for cross-lingual robustness.
- 🧠 Reasoning: supports an optional step-by-step "thinking" mode that can be toggled on or off at request time.
- 🔧 Tool calling: emits Hermes-style `<tool_call>` blocks and is compatible with the OpenAI-style function-calling interface and agent frameworks.
- 📚 Grounded answering: trained to answer from provided documents and context, including longer inputs.
- 🪶 Small & deployable: 0.6B parameters, runs comfortably on a single modest GPU.
Languages
| Language | Approx. share of instruction data |
|---|---|
| Kazakh (kk) | ~56% |
| English (en) | ~33% |
| Russian (ru) | ~10% |
Data coverage by domain
The model was instruction-tuned on a broad, internally curated mixture. Described in general terms (no technical specifics), the approximate domain composition is:
| Domain | Approx. share |
|---|---|
| General instruction following & multi-turn conversation | ~45% |
| Reasoning & step-by-step problem solving | ~27% |
| Retrieval-grounded answering, long context & document Q&A | ~13% |
| Tool use, function calling & agentic interaction | ~7% |
| Knowledge, culture, news & encyclopedic content | ~4% |
| Mathematics, language tasks (grammar / translation), safety & appropriate refusal, device & environment control, and assistant identity | ~4% |
Shares are approximate and reflect general domain proportions rather than exact figures.
Data provenance & acknowledgments
The training datasets were created internally by the author, including original synthesis as well as additionally processed and enriched material.
Approximately 5.4% of all data used for instruction tuning was derived (with additional processing and enrichment) from resources of two organizations, whose contributions to the Kazakh language are gratefully acknowledged:
- Институт языкознания имени А. Байтурсынова — Institute of Linguistics named after A. Baitursynov
- ННПЦ «Тіл-Қазына» имени Шайсултана Шаяхметова — Sh. Shayakhmetov National Research and Practical Center "Til-Qazyna"
Recommended sampling parameters
A good starting point for general use:
```json
{
  "temperature": 0.15,
  "top_p": 0.95,
  "max_tokens": 1024,
  "repetition_penalty": 1.05,
  "stream": true,
  "chat_template_kwargs": {"enable_thinking": true},
  "continue_final_message": true
}
```
Set `"enable_thinking": false` to get direct answers without an explicit reasoning step.
Raise `temperature` for more creative / open-ended generation.
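The two adjustments above can be wrapped in a small helper. This is a minimal sketch rather than part of the model's tooling: `build_sampling_params` is a hypothetical name, its defaults mirror the recommended values, and it leaves `stream` and `continue_final_message` for the caller to set.

```python
import json


def build_sampling_params(thinking: bool = True, temperature: float = 0.15) -> dict:
    """Return the recommended sampling parameters with two optional overrides.

    Hypothetical helper: the defaults mirror the recommended values above.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    return {
        "temperature": temperature,
        "top_p": 0.95,
        "max_tokens": 1024,
        "repetition_penalty": 1.05,
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


# Direct answers, no explicit reasoning step.
direct = build_sampling_params(thinking=False)

# More open-ended generation; 0.7 is an illustrative value, not a
# recommendation from the model card.
creative = build_sampling_params(temperature=0.7)

print(json.dumps(direct, indent=2))
```

The resulting dictionary can be sent as the body of a chat completions request alongside `model` and `messages`.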
Serving with vLLM
Start an OpenAI-compatible server with tool-calling enabled:
```bash
vllm serve nur-dev/farabi-0.6B \
  --served-model-name farabi-0.6b \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```
Query it with the standard OpenAI client (and the recommended sampling params):
```python
from openai import OpenAI

# vLLM ignores the API key by default, but the client requires a value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="farabi-0.6b",
    messages=[
        # "You are a helpful and accurate assistant."
        {"role": "system", "content": "Сіз пайдалы әрі дәл көмекшісіз."},
        # "Tell me briefly about Almaty."
        {"role": "user", "content": "Алматы туралы қысқаша айтып бер."},
    ],
    temperature=0.15,
    top_p=0.95,
    max_tokens=1024,
    # vLLM-specific parameters go through extra_body.
    extra_body={
        "repetition_penalty": 1.05,
        "chat_template_kwargs": {"enable_thinking": True},
    },
    stream=True,
)

# Print tokens as they arrive.
for chunk in resp:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
Tool calling works through the standard `tools=[...]` argument: the model returns
function calls that the server parses into structured `tool_calls`.
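A minimal sketch of that flow against the vLLM server started above. The `get_weather` tool, the `REGISTRY` mapping, and the helper names are hypothetical and exist only for illustration. The Evaluation section reports that the model tends to emit a call even when no tool fits, so the sketch rejects unknown tool names and malformed arguments instead of executing them.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def get_weather(city: str) -> dict:
    """Stand-in implementation; a real tool would call a weather service."""
    return {"city": city, "forecast": "unknown"}


REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one parsed tool call and return its result as a JSON string."""
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        arguments = json.loads(arguments_json)
        result = REGISTRY[name](**arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        # Invalid JSON, a non-object payload, or wrong argument names.
        return json.dumps({"error": str(exc)})
    return json.dumps(result, ensure_ascii=False)


def ask_with_tools(question: str) -> list[str]:
    """Send one request to the local server and run any returned tool calls."""
    from openai import OpenAI  # imported here so the helpers above work without it

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="farabi-0.6b",
        messages=[{"role": "user", "content": question}],
        tools=TOOLS,
        temperature=0.15,
    )
    calls = resp.choices[0].message.tool_calls or []
    return [run_tool_call(c.function.name, c.function.arguments) for c in calls]
```

With the server running, `ask_with_tools("Алматыда ауа райы қандай?")` returns one JSON string per tool call the model made, or an empty list if it answered directly.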
Serving with PyTorch / Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nur-dev/farabi-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    # "You are a helpful and accurate assistant."
    {"role": "system", "content": "Сіз пайдалы әрі дәл көмекшісіз."},
    # "Which city is the capital of Kazakhstan?"
    {"role": "user", "content": "Қазақстанның астанасы қай қала?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,  # set False for direct answers
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.15,
    top_p=0.95,
    repetition_penalty=1.05,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
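With plain Transformers there is no server to parse tool calls, so the decoded text contains the raw `<tool_call>` blocks. Below is a minimal sketch of extracting them. It assumes the usual Hermes layout of one JSON object with `name` and `arguments` keys per block, which this model card does not spell out, and `parse_tool_calls` is a hypothetical helper.

```python
import json
import re

# One <tool_call>...</tool_call> block; DOTALL lets the JSON span several lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Return the well-formed tool calls found in generated text.

    Blocks that are not valid JSON, or that lack a string "name", are skipped.
    """
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and isinstance(payload.get("name"), str):
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls


# Illustrative model output, not a real generation.
sample = (
    "Let me check.\n<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Almaty"}}\n'
    "</tool_call>"
)
print(parse_tool_calls(sample))
```

If the tags are missing from the decoded text, it may be worth decoding with `skip_special_tokens=False`, since whether the tokenizer treats them as special tokens is not stated here.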
Evaluation
⚠️ Interim results. The numbers below were measured on an early checkpoint (~17% through instruction tuning). They are expected to improve as training continues, but already show meaningful capability.
Tool / function calling — BFCL v4
Berkeley Function-Calling Leaderboard (v4), 1,040 cases, evaluated with the HuggingFace backend.
| Category | Accuracy | n | What it measures |
|---|---|---|---|
| Simple | 80.5% | 322/400 | one call, one tool available |
| Multiple | 71.5% | 143/200 | pick the right tool from several |
| Parallel | 65.5% | 131/200 | several calls in one turn |
| Irrelevance | 5.4% | 13/240 | abstain when no tool fits |
| Overall | 58.6% | 609/1040 | |
| Function-calling avg | 74.5% | 596/800 | excludes irrelevance |
Takeaways:
- Strong calling ability for a 0.6B model. When a call is warranted it is correct ~74.5% of the time — right tool, valid arguments, clean JSON — including 65.5% on the hard parallel / multi-call category.
- The weakness is abstention, not calling. On queries that match no available tool, the model still tends to emit a call (irrelevance 5.4% → it over-triggers). This is the main driver of the lower overall score and the clearest area for improvement.
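The two summary rows can be reproduced from the per-category counts. The short check below uses only the numbers in the table and shows that both rows are pooled over cases rather than averaged over category percentages.

```python
# (correct, total) per BFCL v4 category, copied from the table above.
results = {
    "simple": (322, 400),
    "multiple": (143, 200),
    "parallel": (131, 200),
    "irrelevance": (13, 240),
}


def pooled_accuracy(categories: list[str]) -> float:
    """Accuracy over the pooled cases of the given categories, in percent."""
    correct = sum(results[c][0] for c in categories)
    total = sum(results[c][1] for c in categories)
    return 100.0 * correct / total


overall = pooled_accuracy(list(results))
calling = pooled_accuracy(["simple", "multiple", "parallel"])

print(f"overall: {overall:.1f}%")           # 609 / 1040
print(f"function-calling: {calling:.1f}%")  # 596 / 800
```

Because Simple has twice as many cases as the other two calling categories, the pooled 74.5% sits above the unweighted mean of the three percentages, which is 72.5%.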
Multilingual comprehension — 4-way multiple choice
Multiple-choice comprehension across the model's three languages (random baseline = 25%),
evaluated with the chat template and `enable_thinking=False`.
| Language | Accuracy |
|---|---|
| English | 53.7% ±1.7 |
| Russian | 50.0% ±1.7 |
| Kazakh | 41.8% ±1.6 |
Takeaways:
- Well above the 25% random baseline in all three languages — real comprehension in English, Russian, and Kazakh.
- Resource ordering (en > ru > kk) is as expected; Kazakh at 41.8% is clearly non-trivial.
- Evaluating with the chat template and `enable_thinking=False` adds ~5–6 points per language versus a raw prompt, which is another reason to serve the model with its chat template (see the serving instructions above).
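The table does not say what the ± values denote or how many questions were scored per language. One common reading is a binomial standard error, and the sketch below shows how such a figure would be computed under that assumption. The question count of 900 is hypothetical, chosen only to make the example concrete.

```python
import math


def binomial_standard_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimated from n independent questions."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


# Accuracies from the table above.
accuracies = {"English": 0.537, "Russian": 0.500, "Kazakh": 0.418}

# ASSUMPTION: the card does not state the number of questions per language;
# 900 is a hypothetical value used only for illustration.
N_QUESTIONS = 900

for language, acc in accuracies.items():
    se = binomial_standard_error(acc, N_QUESTIONS)
    print(f"{language}: {100 * acc:.1f}% ± {100 * se:.1f}")
```

Under this reading, every language's margin over the 25% baseline and the English-to-Kazakh gap are many standard errors wide, while the English-to-Russian gap is comparatively small.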
Intended use & limitations
Farabi-0.6B is intended as a helpful general-purpose and agentic assistant, with a focus on Kazakh-language use cases. As a small model, it can make factual mistakes, and outputs should be verified for high-stakes or factual-critical applications. It should be used responsibly and in accordance with applicable laws and the base model's license.
Citation
If you use this model, please credit the author:
Nurgali Kadyrbek — Farabi-0.6B. https://www.linkedin.com/in/nurgali-kadyrbek-504260231/
Model provider
nur-dev
Model tree
Base
nur-dev/farabi-0.6B-base
Fine-tuned
this model
Modalities
Input
Text
Output
Text