
License: apache-2.0

Highlights

  • 🇰🇿 Kazakh-first: the majority of the instruction data is native Kazakh, with Russian and English mixed in for cross-lingual robustness.
  • 🧠 Reasoning: supports an optional step-by-step "thinking" mode that can be toggled on or off at request time.
  • 🔧 Tool calling: emits Hermes-style <tool_call> blocks and is compatible with the OpenAI-style function-calling interface and agent frameworks.
  • 📚 Grounded answering: trained to answer from provided documents and context, including longer inputs.
  • 🪶 Small & deployable: 0.6B parameters, runs comfortably on a single modest GPU.

Languages

| Language | Approx. share of instruction data |
|---|---|
| Kazakh (kk) | ~56% |
| English (en) | ~33% |
| Russian (ru) | ~10% |

Data coverage by domain

The model was instruction-tuned on a broad, internally curated mixture. Described in general terms (no technical specifics), the approximate domain composition is:

| Domain | Approx. share |
|---|---|
| General instruction following & multi-turn conversation | ~45% |
| Reasoning & step-by-step problem solving | ~27% |
| Retrieval-grounded answering, long context & document Q&A | ~13% |
| Tool use, function calling & agentic interaction | ~7% |
| Knowledge, culture, news & encyclopedic content | ~4% |
| Mathematics, language tasks (grammar / translation), safety & appropriate refusal, device & environment control, and assistant identity | ~4% |

Shares are approximate and reflect general domain proportions rather than exact figures.


Data provenance & acknowledgments

The training datasets were created internally by the author, including original synthesis as well as additionally processed and enriched material.

Approximately 5.4% of all data used for instruction tuning was derived (with additional processing and enrichment) from resources of two organizations, whose contributions to the Kazakh language are gratefully acknowledged:

  1. Institute of Linguistics named after A. Baitursynov (Институт языкознания имени А. Байтурсынова)
  2. Sh. Shayakhmetov National Research and Practical Center "Til-Qazyna" (ННПЦ «Тіл-Қазына» имени Шайсултана Шаяхметова)

Recommended sampling parameters

A good starting point for general use:

```json
{
  "temperature": 0.15,
  "top_p": 0.95,
  "max_tokens": 1024,
  "repetition_penalty": 1.05,
  "stream": true,
  "chat_template_kwargs": {
    "enable_thinking": true
  },
  "continue_final_message": true
}
```

Set "enable_thinking": false to get direct answers without an explicit reasoning step. Raise temperature for more creative / open-ended generation.
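
The toggle and temperature advice above can be wrapped in a small helper. This is a minimal sketch, not part of the original card: `build_request_params` is a hypothetical name, only the sampling-related keys are included, and the temperature of 0.7 used for the "creative" case is an illustrative assumption rather than a recommended value.

```python
import copy

# Recommended starting point, copied from the JSON above (sampling keys only).
BASE_PARAMS = {
    "temperature": 0.15,
    "top_p": 0.95,
    "max_tokens": 1024,
    "repetition_penalty": 1.05,
    "chat_template_kwargs": {"enable_thinking": True},
}


def build_request_params(enable_thinking=True, temperature=None):
    """Return a copy of BASE_PARAMS with thinking toggled and an optional temperature override."""
    params = copy.deepcopy(BASE_PARAMS)  # deep copy so the nested dict is not shared
    params["chat_template_kwargs"]["enable_thinking"] = enable_thinking
    if temperature is not None:
        params["temperature"] = temperature
    return params


direct = build_request_params(enable_thinking=False)  # direct answers, no reasoning step
creative = build_request_params(temperature=0.7)      # 0.7 is an illustrative value only
```

The deep copy keeps `BASE_PARAMS` unchanged between calls, so one request's override cannot leak into the next.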


Serving with vLLM

Start an OpenAI-compatible server with tool-calling enabled:

```bash
vllm serve nur-dev/farabi-0.6B \
  --served-model-name farabi-0.6b \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

Query it with the standard OpenAI client (and the recommended sampling params):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="farabi-0.6b",
    messages=[
        # "You are a helpful and accurate assistant."
        {"role": "system", "content": "Сіз пайдалы әрі дәл көмекшісіз."},
        # "Tell me briefly about Almaty."
        {"role": "user", "content": "Алматы туралы қысқаша айтып бер."},
    ],
    temperature=0.15,
    top_p=0.95,
    max_tokens=1024,
    # Parameters outside the OpenAI schema are passed through extra_body.
    extra_body={
        "repetition_penalty": 1.05,
        "chat_template_kwargs": {"enable_thinking": True},
    },
    stream=True,
)

# Print tokens as they arrive; some chunks carry no text content.
for chunk in resp:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

Tool calling works through the standard tools=[...] argument: the model emits Hermes-style <tool_call> blocks, and the server (started with --tool-call-parser hermes) parses them into structured tool_calls.
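
A minimal sketch of a tool-calling round trip against the server started above. The `get_weather` tool, the `REGISTRY` mapping, and the `dispatch` and `ask_with_tools` helpers are hypothetical names introduced for illustration; only the `tools=` argument and the `tool_calls` field belong to the standard OpenAI-style interface.

```python
import json


# Hypothetical tool for illustration; it is not part of the model card.
def get_weather(city: str) -> str:
    return f"Weather lookup for {city} is not implemented in this sketch."


# OpenAI-style function schema describing the hypothetical tool.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

REGISTRY = {"get_weather": get_weather}


def dispatch(name: str, arguments_json: str) -> str:
    """Run one parsed tool call; the arguments arrive as a JSON string."""
    if name not in REGISTRY:
        return f"unknown tool: {name}"
    return REGISTRY[name](**json.loads(arguments_json))


def ask_with_tools(question: str) -> list:
    """Send one request to the local vLLM server and run any tool calls it returns."""
    from openai import OpenAI  # imported lazily so the helpers above work without it

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="farabi-0.6b",
        messages=[{"role": "user", "content": question}],
        tools=TOOLS,
        temperature=0.15,
        top_p=0.95,
        max_tokens=1024,
    )
    calls = resp.choices[0].message.tool_calls or []
    return [dispatch(c.function.name, c.function.arguments) for c in calls]
```

Returning a message for unknown tool names, instead of raising, is a deliberate choice here: the evaluation section reports that the model tends to emit a call even when no tool fits, so the caller should expect occasional spurious calls.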


Serving with PyTorch / Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nur-dev/farabi-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    # "You are a helpful and accurate assistant."
    {"role": "system", "content": "Сіз пайдалы әрі дәл көмекшісіз."},
    # "Which city is the capital of Kazakhstan?"
    {"role": "user", "content": "Қазақстанның астанасы қай қала?"},
]

# return_dict=True yields input_ids plus attention_mask, so generate()
# receives both regardless of the library's default return type.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,  # set False for direct answers
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.15,
    top_p=0.95,
    repetition_penalty=1.05,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```
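
When generating through Transformers directly, no server-side parser turns the Hermes-style <tool_call> blocks into structured calls. The sketch below shows one way to extract them from decoded text. It assumes that the tags survive decoding and that each block holds a JSON object with `name` and `arguments` keys, as in the usual Hermes convention; neither detail is confirmed by this card, and `extract_tool_calls` is a hypothetical helper.

```python
import json
import re

# Matches the body of each <tool_call>...</tool_call> block, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list:
    """Return the JSON payload of every well-formed <tool_call> block in text."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of raising
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls


# Assumed payload layout, for illustration only.
sample = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Almaty"}}\n'
    "</tool_call>"
)
calls = extract_tool_calls(sample)
```

Malformed blocks are skipped silently in this sketch; a production caller might prefer to log them.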

Evaluation

⚠️ Interim results. The numbers below were measured on an early checkpoint (~17% through instruction tuning). They are expected to improve as training continues, but already show meaningful capability.

Tool / function calling — BFCL v4

Berkeley Function-Calling Leaderboard (v4), 1,040 cases, evaluated with the HuggingFace backend.

| Category | Accuracy | n (correct / total) | What it measures |
|---|---|---|---|
| Simple | 80.5% | 322/400 | one call, one tool available |
| Multiple | 71.5% | 143/200 | pick the right tool from several |
| Parallel | 65.5% | 131/200 | several calls in one turn |
| Irrelevance | 5.4% | 13/240 | abstain when no tool fits |
| Overall | 58.6% | 609/1040 | |
| Function-calling avg | 74.5% | 596/800 | excludes irrelevance |

Takeaways:

  • Strong calling ability for a 0.6B model. When a call is warranted it is correct ~74.5% of the time — right tool, valid arguments, clean JSON — including 65.5% on the hard parallel / multi-call category.
  • The weakness is abstention, not calling. On queries that match no available tool, the model still tends to emit a call (irrelevance 5.4% → it over-triggers). This is the main driver of the lower overall score and the clearest area for improvement.
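
The aggregate rows can be reproduced from the per-category counts. The sketch below pools correct and total counts across categories; the counts are copied from the table, while the `RESULTS` and `accuracy` names are introduced here for illustration.

```python
# (correct, total) per category, copied from the BFCL v4 table.
RESULTS = {
    "simple": (322, 400),
    "multiple": (143, 200),
    "parallel": (131, 200),
    "irrelevance": (13, 240),
}


def accuracy(pairs):
    """Pooled accuracy in percent over (correct, total) pairs."""
    pairs = list(pairs)  # materialize so the pairs can be iterated twice
    correct = sum(c for c, _ in pairs)
    total = sum(t for _, t in pairs)
    return 100.0 * correct / total


overall = accuracy(RESULTS.values())
calling = accuracy(v for k, v in RESULTS.items() if k != "irrelevance")

print(f"overall: {overall:.1f}%")               # 609 / 1040
print(f"function-calling avg: {calling:.1f}%")  # 596 / 800
```

Both aggregates are pooled over cases, so the 400-case Simple category carries twice the weight of Multiple or Parallel. An unweighted mean of the three calling categories would give 72.5% rather than 74.5%.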

Multilingual comprehension — 4-way multiple choice

Multiple-choice comprehension across the model's three languages (random baseline = 25%), evaluated with the chat template and enable_thinking=False.

| Language | Accuracy |
|---|---|
| English | 53.7% ±1.7 |
| Russian | 50.0% ±1.7 |
| Kazakh | 41.8% ±1.6 |

Takeaways:

  • Well above the 25% random baseline in all three languages — real comprehension in English, Russian, and Kazakh.
  • Resource ordering (en > ru > kk) is as expected; Kazakh at 41.8% is clearly non-trivial.
  • Evaluating with the chat template and enable_thinking=False adds ~5–6 points per language versus a raw prompt — another reason to serve the model with its chat template (see serving instructions above).
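
To put "well above the random baseline" in numbers, the sketch below computes each language's gap over 25% and expresses it in multiples of the reported ± value. The card does not say what the ± values represent (for example, a standard error or a confidence half-width), so the ratio is only a rough indicator; `margin_over_baseline` is a hypothetical helper.

```python
BASELINE = 25.0  # random-guess accuracy for 4-way multiple choice, in percent

# (accuracy, reported +/- value) in percent, copied from the table.
SCORES = {
    "English": (53.7, 1.7),
    "Russian": (50.0, 1.7),
    "Kazakh": (41.8, 1.6),
}


def margin_over_baseline(acc, err):
    """Return (gap in points, gap in multiples of the reported +/- value)."""
    gap = acc - BASELINE
    return gap, gap / err


for lang, (acc, err) in SCORES.items():
    gap, ratio = margin_over_baseline(acc, err)
    print(f"{lang}: +{gap:.1f} points over baseline, {ratio:.1f}x the reported uncertainty")
```

Even for Kazakh, the lowest of the three, the gap is roughly ten times the reported ± value.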

Intended use & limitations

Farabi-0.6B is intended as a helpful general-purpose and agentic assistant, with a focus on Kazakh-language use cases. As a small model, it can make factual mistakes, so outputs should be verified before use in high-stakes or accuracy-critical applications. It should be used responsibly and in accordance with applicable laws and the base model's license.


Citation

If you use this model, please credit the author:

Nurgali Kadyrbek — Farabi-0.6B. https://www.linkedin.com/in/nurgali-kadyrbek-504260231/

Model provider: nur-dev

Model tree: base model nur-dev/farabi-0.6B-base; this model is the fine-tuned version.

Modalities: text input, text output
