nur-dev/farabi-1.7b-agent-rag API & Inference Endpoint

Capabilities

Multilingual (kk / ru / en). Understands and answers in Kazakh, Russian, and English, including mixed-language prompts.
Grounded RAG. Answers from provided passages/documents, ties claims to the supplied evidence, and abstains when the context is insufficient instead of hallucinating.
Agentic tool calling (Hermes / function calling). Decides whether a tool is needed, asks for missing required arguments, confirms before destructive or mutating actions, emits a valid tool call, and grounds the final answer in the tool result.
Multi-step tool chaining & error recovery. Sequences dependent calls without answering prematurely, and recovers gracefully from not_found / denied / empty results.
Numeric & rule reasoning. Table/fee arithmetic, deadline/eligibility/business-day rules, and structured-output / slot-completion tasks.
Clean, no-think outputs. Training targets are final answers and tool calls only (no exposed chain-of-thought), so responses can be used directly without stripping reasoning traces.

How to use

Serve with vLLM (OpenAI-compatible, Hermes tool calls)

bash
vllm serve nur-dev/farabi-1.7b-agent-rag \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 8192

The chat template (chat_template.jinja) ships with the model. If your vLLM version does not auto-apply it, add --chat-template chat_template.jinja.

Chat (OpenAI Python SDK)

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[
        # Kazakh prompt: "What will the weather be like in Almaty tomorrow?"
        {"role": "user", "content": "Алматыдағы ауа райы қандай болады ертең?"},
    ],
)
print(resp.choices[0].message.content)

Tool calling (function calling)

python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[{"role": "user", "content": "What's the weather in Astana?"}],
    tools=tools,
)
msg = resp.choices[0].message
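# When the model decides to call the tool, msg.tool_calls holds the call, e.g.:
#   msg.tool_calls[0].function.name      -> "get_weather"
#   msg.tool_calls[0].function.arguments -> '{"city": "Astana"}'  (a JSON string)
# Run the tool, append the assistant message and a {"role": "tool", ...} result message,
# then call the API again to get the grounded final answer.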
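
The follow-up step described in the comments above could look roughly like the sketch below. It is not part of the model card: `get_weather`, `TOOL_REGISTRY`, `execute_tool_call`, and `run_with_tools` are hypothetical names, the weather data is a stand-in, and the error payload shape (`{"error": "not_found", ...}`) is an assumption chosen to match the not_found / denied / empty cases mentioned under Capabilities.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical stand-in for a real weather lookup."""
    return {"city": city, "forecast": "sunny", "temp_c": 21}


# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def execute_tool_call(tool_call, registry=TOOL_REGISTRY) -> dict:
    """Run one tool call and return it as a {"role": "tool", ...} message."""
    name = tool_call.function.name
    try:
        args = json.loads(tool_call.function.arguments or "{}")
        if name not in registry:
            result = {"error": "not_found", "detail": f"unknown tool: {name}"}
        else:
            result = registry[name](**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the function signature.
        result = {"error": "invalid_arguments", "detail": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result, ensure_ascii=False),
    }


def run_with_tools(client, model, messages, tools, registry=TOOL_REGISTRY, max_steps=4):
    """Call the model, execute any requested tools, repeat until it answers in text."""
    messages = list(messages)  # do not mutate the caller's history
    for _ in range(max_steps):
        resp = client.chat.completions.create(model=model, messages=messages, tools=tools)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        # Echo the assistant turn (with its tool calls) back into the history.
        messages.append({
            "role": "assistant",
            "content": msg.content or "",
            "tool_calls": [
                {"id": tc.id, "type": "function",
                 "function": {"name": tc.function.name,
                              "arguments": tc.function.arguments}}
                for tc in msg.tool_calls
            ],
        })
        messages.extend(execute_tool_call(tc, registry) for tc in msg.tool_calls)
    raise RuntimeError("no final answer within max_steps")
```

With the `client` and `tools` defined earlier, a call such as `run_with_tools(client, "nur-dev/farabi-1.7b-agent-rag", [{"role": "user", "content": "What's the weather in Astana?"}], tools)` would return the final text answer. The `max_steps` cap is a design choice to stop a chain of dependent calls from looping indefinitely.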

RAG (answer from provided context)

python
context = """[1] The library is open 09:00–18:00 on weekdays.
[2] On Saturdays it closes at 14:00. It is closed on Sundays."""

resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[
        {"role": "system", "content": "Answer only from the provided context. "
                                       "If the context is insufficient, say so."},
        {"role": "user", "content": f"{context}\n\nWhen does the library close on Saturday?"},
    ],
)
print(resp.choices[0].message.content)
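
When passages come from a retriever rather than a hand-written string, the same prompt layout can be assembled programmatically. The helper below is a minimal sketch, not something the model card defines: `build_rag_messages` is a hypothetical name, and it simply reproduces the `[1]`, `[2]` numbering and the system instruction used in the example above.

```python
SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "If the context is insufficient, say so."
)


def build_rag_messages(passages, question):
    """Number passages as [1], [2], ... and wrap them in the grounded-answer prompt."""
    context = "\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(passages, start=1)
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{context}\n\n{question}"},
    ]


messages = build_rag_messages(
    [
        "The library is open 09:00–18:00 on weekdays.",
        "On Saturdays it closes at 14:00. It is closed on Sundays.",
    ],
    "When does the library close on Saturday?",
)
```

The resulting `messages` list can be passed directly as the `messages` argument of `client.chat.completions.create`.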

Inference notes

Architecture: Qwen3-compatible causal LM (1.7B), bfloat16.
Context length: 8192 tokens.
Tool-call format: Hermes (--tool-call-parser hermes).
Works with the OpenAI Agents SDK via base_url + any placeholder api_key.
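
If the model is run without vLLM's `--tool-call-parser hermes` (for example, when generating raw text with another runtime), tool calls have to be extracted from the output by hand. The sketch below assumes the usual Hermes convention of a JSON object with `name` and `arguments` keys wrapped in `<tool_call>...</tool_call>` tags; the model card does not show the raw format, so treat that as an assumption. `parse_hermes_tool_calls` is a hypothetical helper name.

```python
import json
import re

# Assumed Hermes-style wrapper: <tool_call>{...JSON...}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in raw model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole turn
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls


raw = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Astana"}}\n</tool_call>'
calls = parse_hermes_tool_calls(raw)
```

Output without any `<tool_call>` block yields an empty list, which a caller can treat as a plain text answer.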
