As an official launch partner for NVIDIA Nemotron 3 Nano and Nemotron 3 Super, FriendliAI provides high-performance deployment options for the Nemotron 3 family on Friendli Dedicated Endpoints. This guide focuses on how to run Nemotron 3 models on FriendliAI: you will learn how to authenticate, deploy the model, send inference requests, configure model parameters, and optimize for both performance and cost. You will also learn how to enable or disable reasoning at request time.
If you are exploring which Nemotron model to deploy, visit our Nemotron landing page, where you can browse supported Nemotron models on FriendliAI and jump directly into deployment options.
Introduction
Nemotron 3 is a family of high-performance, high-efficiency foundation models designed for agentic AI applications. Built using a hybrid Mamba–Transformer Mixture-of-Experts (MoE) architecture and supporting 1M-token context windows, Nemotron 3 enables developers to build reliable, high-throughput agents that operate across complex workflows, multi-document reasoning, and long-duration tasks.
FriendliAI participated in the Nemotron 3 Nano launch and now provides Day-0 support for Nemotron 3 Super. While Nano is optimized for efficient targeted workloads, Nemotron 3 Super is purpose-built for more advanced multi-agent systems, complex tool-calling, and production-scale agentic AI workloads. This tutorial covers deploying the Nemotron 3 model family on Friendli Dedicated Endpoints.
To learn more, check out our Nemotron 3 Nano launch announcement, our Nemotron 3 Super launch announcement, and NVIDIA’s Nemotron 3 page.
Friendli Dedicated Endpoints
Friendli Dedicated Endpoints give you full control over deployment, scaling, and hardware selection, making it ideal for production workloads and mission-critical applications.
- High-throughput with consistent, guaranteed performance
- 50%+ GPU savings
- Full control over GPU resources, 99.99% availability
- Ideal for high-volume production applications, long-running services, and advanced agentic workloads such as multi-agent systems and complex tool-calling
Prerequisites
Before you get started, ensure that you have the following:
- A FriendliAI account
Sign up or log in at https://friendli.ai/suite.
- An API token
You can create and manage API tokens in Friendli Suite under Settings -> API Tokens.
Set your token as an environment variable:
export FRIENDLI_TOKEN="YOUR_API_TOKEN"
- Install Friendli Python SDK or any OpenAI-compatible SDK
# uv
uv add friendli
# pip
pip install friendli
Deployment Steps
- Select your base model and multi-LoRA adapters.
Search for the model you want to deploy. For Nemotron 3 models, type Nemotron-3 in the search bar and select the variant you want to run.
Note that you can deploy any of the 520,000+ supported models, including your own custom fine-tuned models, directly from our Models page, Hugging Face model repositories, or your Weights & Biases model artifacts.
- Choose the GPU type.
Pick the option that best matches the latency, throughput, and budget requirements of your Nemotron 3 deployment.
- Customize autoscaling parameters.
Friendli lets you customize the autoscaling parameters to tailor endpoints to your workloads.
- Customize the inference engine.
Tune the inference engine to match your application's requirements; for example, you can set the maximum batch size.
- Deploy.
Click "Deploy" to create your Dedicated Endpoint and start using your model.
- Send inference requests.
As with Serverless Endpoints, you can immediately try the model on Dedicated Endpoints in the Playground, which provides a chat-style interface for quick experimentation. You can also customize the system prompt and tune various parameters to explore different behaviors and response styles.
You can also start sending API requests right away: copy the endpoint ID from the endpoint overview page and use it as the model field in your requests.
Example code using Friendli Python SDK:
import os

from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.dedicated.chat.stream(
        model="your-dedicated-endpoint-id",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": "Summarize the architectural advantages of Nemotron 3 Super for long-context, tool-using agent workflows.",
            },
        ],
    )
    for chunk in res:
        if content := chunk.data.choices[0].delta.content:
            print(content, end="")
Example curl script:
curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \
  -H "Authorization: Bearer $FRIENDLI_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-dedicated-endpoint-id",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Summarize the architectural advantages of Nemotron 3 Super for long-context, tool-using agent workflows."
      }
    ],
    "stream": true
  }'
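Since Dedicated Endpoints expose an OpenAI-compatible chat completions API, any HTTP client or OpenAI-compatible SDK can send the same JSON body shown in the curl example. A minimal sketch of building that body in Python (the helper name build_chat_request is ours, purely illustrative):

```python
def build_chat_request(endpoint_id: str, user_prompt: str, stream: bool = True) -> dict:
    """Build the JSON body for a chat completions request to a Dedicated Endpoint."""
    return {
        # The endpoint ID copied from the endpoint overview page.
        "model": endpoint_id,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        # Stream tokens back as they are generated.
        "stream": stream,
    }

body = build_chat_request("your-dedicated-endpoint-id", "Hello, Nemotron!")
print(body["model"])  # → your-dedicated-endpoint-id
```

POST this body to https://api.friendli.ai/dedicated/v1/chat/completions with the Authorization header shown in the curl example above.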
- Monitor endpoint behavior.
Real-time metrics and logs give teams immediate visibility into system behavior, making it far easier to understand and resolve issues quickly.
- View full metrics and request activity.
- Monitor real-time throughput, latency, tokens processed, and replica counts over time.
- Review request activity and troubleshoot issues more quickly.
- View specific request and response content (when explicitly enabled).
- Get a clearer view of how the model is behaving.
- Spot and investigate requests that may require attention.
Enabling and Disabling Reasoning
Nemotron 3 models support an explicit reasoning (or “thinking”) mode, which allows the model to internally reason step-by-step before producing a final answer. You can enable or disable this behavior at request time, depending on whether you want maximum reasoning quality or fast, deterministic responses.
Reasoning is enabled by default when the enable_thinking parameter is not specified.
When to Enable Reasoning
Enable reasoning when:
- You want higher-quality answers for complex or open-ended questions.
- Creativity and exploration are more important than determinism.
- Higher latency or token usage is acceptable.
When reasoning is enabled:
- Use temperature=1.0 and top_p=1.0 for best performance.
import os

from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.dedicated.chat.stream(
        model="your-dedicated-endpoint-id",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": "Design a multi-agent workflow using Nemotron 3 Super for research, planning, and tool execution.",
            },
        ],
        chat_template_kwargs={
            "enable_thinking": True,
        },
        temperature=1.0,
        top_p=1.0,
    )
    for chunk in res:
        if content := chunk.data.choices[0].delta.content:
            print(content, end="")
When to Disable Reasoning
Disable reasoning when:
- You want fast, predictable, and deterministic outputs.
- The task is simple (e.g., classification, extraction, short factual answers).
- You want minimal token usage.
When reasoning is disabled:
- Set enable_thinking=False.
- Use temperature=0 for deterministic behavior.
- top_p can be omitted or left at its default.
import os

from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.dedicated.chat.stream(
        model="your-dedicated-endpoint-id",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": "Extract the key deployment settings from this endpoint configuration.",
            },
        ],
        chat_template_kwargs={
            "enable_thinking": False,
        },
        temperature=0,
    )
    for chunk in res:
        if content := chunk.data.choices[0].delta.content:
            print(content, end="")
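The parameter recommendations for the two modes can be condensed into a small helper that returns the request arguments for each case (a sketch; the function name sampling_params_for is ours, not part of the SDK):

```python
def sampling_params_for(enable_thinking: bool) -> dict:
    """Return suggested request parameters for reasoning on/off."""
    params = {"chat_template_kwargs": {"enable_thinking": enable_thinking}}
    if enable_thinking:
        # Reasoning enabled: recommended sampling for best quality.
        params.update(temperature=1.0, top_p=1.0)
    else:
        # Reasoning disabled: deterministic output; top_p left at its default.
        params.update(temperature=0)
    return params
```

The returned dict can then be passed as extra keyword arguments to friendli.dedicated.chat.stream(...), alongside model and messages.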
Choosing the right mode lets you balance answer quality, determinism, latency, and cost for your specific use case.
Conclusion
Nemotron 3 unlocks a new generation of high-performance, long-context, agent-ready AI capabilities, and FriendliAI makes it fast and easy to deploy them.
As an official launch partner for Nemotron 3 models, FriendliAI provides the infrastructure, tooling, and reliability needed to build and scale production-grade AI systems across the Nemotron 3 family. With flexible configuration options, seamless deployment workflows, and real-time observability, you can confidently bring Nemotron-powered applications to life and optimize them for your team’s workflow, performance needs, and budget.
Ready to get started? Sign in to Friendli Suite and launch your first Nemotron 3 deployment today.