README

License: apache-2.0

Capabilities

  • Multilingual (kk / ru / en). Understands and answers in Kazakh, Russian, and English, including mixed-language prompts.
  • Grounded RAG. Answers from provided passages/documents, ties claims to the supplied evidence, and abstains when the context is insufficient instead of hallucinating.
  • Agentic tool calling (Hermes / function calling). Decides whether a tool is needed, asks for missing required arguments, confirms before destructive or mutating actions, emits a valid tool call, and grounds the final answer in the tool result.
  • Multi-step tool chaining & error recovery. Sequences dependent calls without answering prematurely, and recovers gracefully from not_found / denied / empty results.
  • Numeric & rule reasoning. Table/fee arithmetic, deadline/eligibility/business-day rules, and structured-output / slot-completion tasks.
  • Clean, no-think outputs. Training targets are final answers and tool calls only, so responses contain no exposed chain-of-thought and can be passed to downstream code without stripping reasoning text.

How to use

Serve with vLLM (OpenAI-compatible, Hermes tool calls)

bash

vllm serve nur-dev/farabi-1.7b-agent-rag \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 8192

The chat template (chat_template.jinja) ships with the model. If your vLLM version does not auto-apply it, add --chat-template chat_template.jinja.
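If the server is started from a Python script rather than a shell, the command above can be assembled as an argument list. This is only a sketch: `build_serve_command` and `launch` are hypothetical helpers, not part of vLLM, and they use only the flags shown above.

```python
import subprocess
from typing import List, Optional

MODEL_ID = "nur-dev/farabi-1.7b-agent-rag"


def build_serve_command(
    model: str = MODEL_ID,
    max_model_len: int = 8192,
    chat_template: Optional[str] = None,
) -> List[str]:
    """Assemble the `vllm serve` argument list (hypothetical helper)."""
    cmd = [
        "vllm", "serve", model,
        "--enable-auto-tool-choice",
        "--tool-call-parser", "hermes",
        "--max-model-len", str(max_model_len),
    ]
    # Only pass the template explicitly when vLLM does not pick it up itself.
    if chat_template is not None:
        cmd += ["--chat-template", chat_template]
    return cmd


def launch(cmd: List[str]) -> subprocess.Popen:
    """Start the server as a child process; the caller must stop it."""
    return subprocess.Popen(cmd)
```

For example, `launch(build_serve_command(chat_template="chat_template.jinja"))` starts the server with the template passed explicitly.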

Chat (OpenAI Python SDK)

python

from openai import OpenAI

# vLLM does not check the key by default, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[
        # Kazakh: "What will the weather be like in Almaty tomorrow?"
        {"role": "user", "content": "Алматыдағы ауа райы қандай болады ертең?"},
    ],
)
print(resp.choices[0].message.content)

Tool calling (function calling)

python

# Reuses `client` from the chat example above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[{"role": "user", "content": "What's the weather in Astana?"}],
    tools=tools,
)
msg = resp.choices[0].message
# Expected shape of msg.tool_calls:
#   [{function: {name: "get_weather", arguments: '{"city": "Astana"}'}}]
# Run the tool, append the tool result as a {"role": "tool", ...} message,
# then call the API again to get the grounded final answer.
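The round trip described in the comments above (run the tool, append its result, call the API again) can be sketched as a small loop. This is an illustration under assumptions, not the author's implementation: `get_weather` is a placeholder stub, `execute_tool_call` and `run_tool_loop` are hypothetical helpers, and the `{"error": ...}` result shape is an assumed convention for reporting failures back to the model. `client` is expected to be an OpenAI-style client like the one created earlier.

```python
import json
from typing import Any, Callable, Dict, List


def get_weather(city: str) -> Dict[str, Any]:
    """Hypothetical stub; replace with a real weather lookup."""
    return {"city": city, "forecast": "placeholder"}


def execute_tool_call(call: Any, registry: Dict[str, Callable[..., Any]]) -> Dict[str, str]:
    """Run one tool call and wrap the outcome as a {"role": "tool"} message."""
    name = call.function.name
    if name not in registry:
        result: Any = {"error": "not_found", "tool": name}
    else:
        try:
            args = json.loads(call.function.arguments or "{}")
            result = registry[name](**args)
        except (TypeError, ValueError) as exc:
            # Malformed JSON or arguments that do not match the function.
            result = {"error": "invalid_arguments", "detail": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": call.id,
        "content": json.dumps(result, ensure_ascii=False),
    }


def run_tool_loop(client: Any, model: str, messages: List[dict], tools: List[dict],
                  registry: Dict[str, Callable[..., Any]], max_rounds: int = 4) -> str:
    """Call the API until the model answers without requesting a tool."""
    for _ in range(max_rounds):
        resp = client.chat.completions.create(model=model, messages=messages, tools=tools)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content or ""
        # Echo the assistant turn (with its tool calls) before the tool results.
        messages.append({
            "role": "assistant",
            "content": msg.content,
            "tool_calls": [
                {"id": c.id, "type": "function",
                 "function": {"name": c.function.name,
                              "arguments": c.function.arguments}}
                for c in msg.tool_calls
            ],
        })
        messages.extend(execute_tool_call(c, registry) for c in msg.tool_calls)
    raise RuntimeError("tool loop did not finish within max_rounds")
```

With the `tools` list defined above, a call would look like `run_tool_loop(client, "nur-dev/farabi-1.7b-agent-rag", messages, tools, {"get_weather": get_weather})`. The `max_rounds` cap is a design choice that keeps a misbehaving chain of calls from looping indefinitely.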

RAG (answer from provided context)

python

context = """[1] The library is open 09:00–18:00 on weekdays.
[2] On Saturdays it closes at 14:00. It is closed on Sundays."""
resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[
        {"role": "system", "content": "Answer only from the provided context. "
                                      "If the context is insufficient, say so."},
        {"role": "user", "content": f"{context}\n\nWhen does the library close on Saturday?"},
    ],
)
print(resp.choices[0].message.content)
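When the passages come from a retriever rather than a hand-written string, the same prompt layout can be built programmatically. The sketch below reuses the system prompt and the `[n]` numbering from the example above; `format_context` and `build_rag_messages` are hypothetical helper names, and the layout is one reasonable choice, not a format the model is documented to require.

```python
from typing import Dict, List

SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "If the context is insufficient, say so."
)


def format_context(passages: List[str]) -> str:
    """Number passages as [1], [2], ... so an answer can point at its evidence."""
    return "\n".join(f"[{i}] {p.strip()}" for i, p in enumerate(passages, start=1))


def build_rag_messages(passages: List[str], question: str) -> List[Dict[str, str]]:
    """Build the system + user messages: numbered context first, question last."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{format_context(passages)}\n\n{question}"},
    ]
```

The returned list can be passed directly as `messages=` to `client.chat.completions.create`.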

Inference notes

  • Architecture: Qwen3-compatible causal LM (1.7B), bfloat16.
  • Context length: 8192 tokens.
  • Tool-call format: Hermes (--tool-call-parser hermes).
  • Works with the OpenAI Agents SDK: point base_url at the vLLM server and pass any placeholder api_key.
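Because the 8192-token window has to hold both the prompt and the generated answer, retrieved passages may need trimming before they go into a RAG prompt. The following is a minimal sketch of one way to do that. `fit_passages` is a hypothetical helper, and `rough_token_count` is a crude word-count stand-in, not the model's tokenizer; for real budgeting, pass a counter backed by the actual tokenizer.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude stand-in: whitespace-separated words (assumption, not real tokens)."""
    return len(text.split())


def fit_passages(
    passages: List[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Keep passages in order until the next one would exceed the budget."""
    kept: List[str] = []
    used = 0
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost > budget_tokens:
            break
        kept.append(passage)
        used += cost
    return kept
```

The budget should be the context length minus whatever is reserved for the system prompt, the question, and the answer; how much to reserve is left to the caller.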

Model provider

nur-dev

Modalities

  • Input: Text
  • Output: Text
