README
License: apache-2.0
Capabilities
- Multilingual (kk / ru / en). Understands and answers in Kazakh, Russian, and English, including mixed-language prompts.
- Grounded RAG. Answers from provided passages/documents, ties claims to the supplied evidence, and abstains when the context is insufficient instead of hallucinating.
- Agentic tool calling (Hermes / function calling). Decides whether a tool is needed, asks for missing required arguments, confirms before destructive or mutating actions, emits a valid tool call, and grounds the final answer in the tool result.
- Multi-step tool chaining & error recovery. Sequences dependent calls without answering prematurely, and recovers gracefully from `not_found`/`denied`/empty results.
- Numeric & rule reasoning. Table/fee arithmetic, deadline/eligibility/business-day rules, and structured-output / slot-completion tasks.
- Clean, no-think outputs. Trainable targets are final answers and tool calls (no exposed chain-of-thought), so responses are production-ready.
How to use
Serve with vLLM (OpenAI-compatible, Hermes tool calls)
```bash
vllm serve nur-dev/farabi-1.7b-agent-rag \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 8192
```
The chat template (chat_template.jinja) ships with the model. If your vLLM
version does not auto-apply it, add --chat-template chat_template.jinja.
Chat (OpenAI Python SDK)
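```python
from openai import OpenAI

# Point the SDK at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[
        # Kazakh prompt: "What will the weather in Almaty be like tomorrow?"
        {"role": "user", "content": "Алматыдағы ауа райы қандай болады ертең?"},
    ],
)
print(resp.choices[0].message.content)
```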
Tool calling (function calling)
```python
# Reuses `client` from the chat example above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[{"role": "user", "content": "What's the weather in Astana?"}],
    tools=tools,
)
msg = resp.choices[0].message
# msg.tool_calls -> [{function: {name: "get_weather", arguments: '{"city": "Astana"}'}}]
# Run the tool, append the tool result as a {"role": "tool", ...} message,
# then call the API again to get the grounded final answer.
```
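The last step described in the comments above (run the tool, append a `tool` message, call the API again) can be sketched as follows. This is a minimal sketch rather than the model author's reference code: `get_weather` is a stub that returns placeholder data, `TOOL_REGISTRY` and `tool_messages_for` are hypothetical helper names, and tool calls are assumed to arrive in the OpenAI Chat Completions shape (`id`, `function.name`, and `function.arguments` as a JSON string). The demo uses a hand-built call object so it runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> dict:
    """Stub tool (assumption): returns placeholder data, not a real forecast."""
    return {"city": city, "forecast": "placeholder"}


# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def tool_messages_for(tool_calls, registry=TOOL_REGISTRY):
    """Run each tool call and return one {"role": "tool"} message per call."""
    messages = []
    for call in tool_calls:
        name = call.function.name
        if name not in registry:
            result = {"error": f"unknown tool: {name}"}
        else:
            try:
                args = json.loads(call.function.arguments or "{}")
                result = registry[name](**args)
            except (json.JSONDecodeError, TypeError) as exc:
                # Malformed JSON or arguments that do not match the function.
                result = {"error": f"bad arguments: {exc}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result, ensure_ascii=False),
        })
    return messages


# Offline demo: an object shaped like one entry of msg.tool_calls.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Astana"}'),
)
tool_msgs = tool_messages_for([fake_call])
print(tool_msgs[0]["content"])
```

Against a live server, you would append the assistant message that carried `tool_calls`, followed by these tool messages, to `messages`, and then call `client.chat.completions.create` again with the same `tools` to obtain the final answer.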
RAG (answer from provided context)
```python
context = """[1] The library is open 09:00–18:00 on weekdays.
[2] On Saturdays it closes at 14:00. It is closed on Sundays."""

resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[
        {"role": "system", "content": "Answer only from the provided context. "
                                      "If the context is insufficient, say so."},
        {"role": "user", "content": f"{context}\n\nWhen does the library close on Saturday?"},
    ],
)
print(resp.choices[0].message.content)
```
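When passages come from a retriever rather than a hard-coded string, the same prompt layout can be assembled programmatically. The sketch below is one possible way to do it, assuming the `[n]` numbering and system prompt from the example above; `build_rag_messages` and `SYSTEM_PROMPT` are hypothetical names, not part of the model repository.

```python
SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "If the context is insufficient, say so."
)


def build_rag_messages(passages, question):
    """Number passages as [1], [2], ... and return chat messages for the API."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{context}\n\n{question}"},
    ]


passages = [
    "The library is open 09:00–18:00 on weekdays.",
    "On Saturdays it closes at 14:00. It is closed on Sundays.",
]
rag_messages = build_rag_messages(passages, "When does the library close on Saturday?")
print(rag_messages[1]["content"])
```

The returned list can be passed directly as the `messages` argument of `client.chat.completions.create`.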
Inference notes
- Architecture: Qwen3-compatible causal LM (1.7B), `bfloat16`.
- Context length: 8192 tokens.
- Tool-call format: Hermes (`--tool-call-parser hermes`).
- Works with the OpenAI Agents SDK via `base_url` + any placeholder `api_key`.
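For context on the Hermes tool-call format: with `--tool-call-parser hermes`, vLLM converts the model's raw text into structured `tool_calls`, so no manual parsing is needed. If you read raw generations instead (for example from a different runtime), Hermes-style calls are commonly emitted as a JSON object wrapped in `<tool_call>...</tool_call>` tags. The sketch below assumes that tag convention, which should be checked against the shipped `chat_template.jinja`; `parse_hermes_tool_calls` is a hypothetical helper, not part of vLLM or the model repository.

```python
import json
import re

# Assumed Hermes convention: JSON payload between <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text):
    """Extract {"name", "arguments"} dicts from raw Hermes-style output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip blocks that are not valid JSON.
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls


raw_output = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Astana"}}\n'
    "</tool_call>"
)
parsed_calls = parse_hermes_tool_calls(raw_output)
print(parsed_calls)
```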
Model provider: nur-dev
Modalities: text input, text output