As an official launch partner with NVIDIA, FriendliAI provides full Day 0 support for the new Nemotron 3 Nano on Friendli Dedicated Endpoints. This guide walks you through running Nemotron 3 Nano on FriendliAI: you will learn how to authenticate, send inference requests, configure model parameters, and optimize for both performance and cost. You will also learn how to enable or disable reasoning at request time.
Introduction
Nemotron 3 Nano is a high-performance, high-efficiency foundation model designed for agentic AI applications. Built using a hybrid Mamba–Transformer Mixture-of-Experts (MoE) architecture and supporting 1M-token context windows, Nemotron 3 enables developers to build reliable, high-throughput agents that operate across complex workflows, multi-document reasoning, and long-duration tasks.
Nemotron 3 Nano is optimized for targeted workloads, multi-agent collaboration, and mission-critical applications. With open weights, datasets, and tools, developers can fine-tune, optimize, and deploy the model on their own infrastructure for maximum privacy and security.
To learn more, check out our partnership announcement and NVIDIA’s Nemotron 3 page.
Friendli Dedicated Endpoints
Friendli Dedicated Endpoints give you full control over deployment, scaling, and hardware selection, making them ideal for production workloads and mission-critical applications.
- High throughput with consistent, guaranteed performance
- 50%+ GPU savings
- Full control over GPU resources, 99.99% availability
- Ideal for high-volume production applications and long-running services
Prerequisites
Before you get started, ensure that you have the following:
- A FriendliAI account
Sign up or log in at https://friendli.ai/suite.
- An API token
You can create and manage API tokens in Friendli Suite under Settings -> API Tokens.
Set your key as an environment variable:
export FRIENDLI_TOKEN="YOUR API KEY HERE"
- The Friendli Python SDK, or any OpenAI-compatible SDK
Install the Friendli SDK with:
# uv
uv add friendli
# pip
pip install friendli
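If you plan to use an OpenAI-compatible SDK instead, install that client the same way; the optional example later in this guide uses the OpenAI Python SDK:
# uv
uv add openai
# pip
pip install openai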
Deploying Nemotron 3 Nano on Dedicated Endpoints
- Select your base model and multi-LoRA adapters.
Search for the model you want to deploy. For Nemotron 3 models, just type “Nemotron-3” in the search bar and select the variant you want.
You can also apply as many multi-LoRA adapters as you want.
Note that you can deploy any of the 485,000+ supported models, including your own custom fine-tuned models, directly from our Models page, Hugging Face model repositories, or your Weights & Biases model artifacts.
- Customize your endpoint. Here you can:
  - Choose the GPU type.
  - Customize autoscaling parameters. Friendli lets you customize the autoscaling parameters to tailor endpoints to your workloads.
  - Customize the inference engine to match your application’s requirements. These settings control how the model processes inputs, handles tokens, and logs requests during use. You can:
    - Add special tokens
    - Skip special tokens
    - Set the maximum batch size
    - Log request content
- Deploy. Click “Create” to deploy your Dedicated Endpoint and start using your model!
- Send inference requests.
As with Serverless Endpoints, you can try the model immediately in the Playground, which provides a chat-style interface for quick experimentation with built-in tools such as a calculator, a Python interpreter, and Linkup web search.
You can also start sending API requests right away. Copy the endpoint ID from the endpoint overview page and paste it into the model field of your requests.
Example code using Friendli Python SDK:
import os
from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.dedicated.chat.stream(
        model="your-dedicated-endpoint-id",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, Nemotron 3!"},
        ],
    )
    for chunk in res:
        if content := chunk.data.choices[0].delta.content:
            print(content, end="")
Example curl script:
curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \
  -H "Authorization: Bearer $FRIENDLI_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-dedicated-endpoint-id",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello, Nemotron 3!"
      }
    ],
    "stream": true
  }'
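Because Dedicated Endpoints expose an OpenAI-compatible API, you can also call them with the OpenAI Python SDK. The following is a minimal sketch, assuming the same base URL as the curl script above; the endpoint ID is a placeholder:
import os
from openai import OpenAI

# Point the OpenAI client at the Friendli dedicated chat-completions host.
client = OpenAI(
    api_key=os.environ["FRIENDLI_TOKEN"],
    base_url="https://api.friendli.ai/dedicated/v1",
)

stream = client.chat.completions.create(
    model="your-dedicated-endpoint-id",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, Nemotron 3!"},
    ],
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries an incremental delta of the response.
    if content := chunk.choices[0].delta.content:
        print(content, end="")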
- Monitor the endpoint behavior.
Real-time metrics and logs give teams immediate visibility into system behavior, making it far easier to understand and resolve issues quickly.
- View full metrics and request activity.
- Monitor real-time throughput, latency, tokens processed, and replica counts over time.
- Review request activity and troubleshoot issues more quickly.
- View specific request and response content (when explicitly enabled).
- Get a clearer view of how the model is behaving.
- Spot and investigate requests that may require attention.
Enabling and Disabling Reasoning
Nemotron 3 models support an explicit reasoning (or “thinking”) mode, which allows the model to internally reason step-by-step before producing a final answer. You can enable or disable this behavior at request time, depending on whether you want maximum reasoning quality or fast, deterministic responses.
By default, Nemotron 3 uses reasoning when the enable_thinking parameter is not specified.
When to Enable Reasoning
Enable reasoning when:
- You want higher-quality answers for complex or open-ended questions.
- Creativity and exploration are more important than determinism.
- Higher latency or token usage is acceptable.
When reasoning is enabled:
- Use temperature=1.0 and top_p=1.0 for best performance.
import os
from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.dedicated.chat.stream(
        model="your-dedicated-endpoint-id",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, Nemotron 3!"},
        ],
        chat_template_kwargs={
            "enable_thinking": True,
        },
        temperature=1.0,
        top_p=1.0,
    )
    for chunk in res:
        if content := chunk.data.choices[0].delta.content:
            print(content, end="")
When to Disable Reasoning
Disable reasoning when:
- You want fast, predictable, and deterministic outputs.
- The task is simple (e.g., classification, extraction, short factual answers).
- You want minimal token usage.
When reasoning is disabled:
- Set enable_thinking=False.
- Use temperature=0 for deterministic behavior.
- top_p can be omitted or left at its default.
import os
from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.dedicated.chat.stream(
        model="your-dedicated-endpoint-id",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, Nemotron 3!"},
        ],
        chat_template_kwargs={
            "enable_thinking": False,
        },
        temperature=0,
    )
    for chunk in res:
        if content := chunk.data.choices[0].delta.content:
            print(content, end="")
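If you call the HTTP API directly rather than through the SDK, the same switch can be sent in the request body. This is a hedged sketch, assuming the chat completions endpoint forwards a chat_template_kwargs field to the chat template the way the SDK does:
curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \
  -H "Authorization: Bearer $FRIENDLI_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-dedicated-endpoint-id",
    "messages": [
      {"role": "user", "content": "Classify this review as positive or negative: I love it!"}
    ],
    "chat_template_kwargs": {"enable_thinking": false},
    "temperature": 0
  }'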
Choosing the right mode lets you balance answer quality, determinism, latency, and cost for your specific use case.
Conclusion
Nemotron 3 unlocks a new generation of high-performance, long-context, agent-ready AI capabilities, and FriendliAI makes it easy to deploy them from Day 0. Whether you want instant access through Serverless Endpoints or full control and maximum efficiency with Dedicated Endpoints, FriendliAI provides the infrastructure, tooling, and reliability needed to build and scale production-grade AI systems. With flexible configuration options, seamless deployment workflows, and real-time observability, you can confidently bring Nemotron-powered applications to life and optimize them for your team’s workflow, performance needs, and budget.
Ready to get started? Sign in to Friendli Suite and launch your first Nemotron 3 deployment today.