As an official launch partner for NVIDIA Nemotron 3 Nano and Nemotron 3 Super, FriendliAI provides high-performance deployment options for the Nemotron 3 family on Friendli Dedicated Endpoints. This guide focuses on how to run Nemotron 3 models on FriendliAI. You will learn how to authenticate, deploy the model, send inference requests, configure model parameters, and optimize for both performance and cost. You will also learn how to enable or disable reasoning at request time. If you are exploring which Nemotron model to deploy, visit our Nemotron landing page, where you can browse supported Nemotron models on FriendliAI and jump directly into deployment options.
Introduction
Nemotron 3 is a family of high-performance, high-efficiency foundation models designed for agentic AI applications. Built using a hybrid Mamba–Transformer Mixture-of-Experts (MoE) architecture and supporting 1M-token context windows, Nemotron 3 enables you to build reliable, high-throughput agents that operate across complex workflows, multi-document reasoning, and long-duration tasks. FriendliAI participated in the Nemotron 3 Nano launch and now provides Day-0 support for Nemotron 3 Super. While Nano is optimized for efficient targeted workloads, Nemotron 3 Super is purpose-built for more advanced multi-agent systems, complex tool-calling, and production-scale agentic AI workloads. This tutorial covers deploying the Nemotron 3 model family on Friendli Dedicated Endpoints. To learn more, check out our Nemotron 3 Nano launch announcement, our Nemotron 3 Super launch announcement, and NVIDIA’s Nemotron 3 page.
Friendli Dedicated Endpoints
Friendli Dedicated Endpoints give you full control over deployment, scaling, and hardware selection, making them ideal for production workloads and mission-critical applications.
- High throughput with consistent, guaranteed performance
- 50%+ GPU savings
- Full control over GPU resources, 99.99% availability
- Ideal for high-volume production applications, long-running services, and advanced agentic workloads such as multi-agent systems and complex tool-calling
Prerequisites
Before you get started, ensure that you have the following:
- A FriendliAI account
- A Personal API Key
- The Friendli Python SDK or any OpenAI-compatible SDK installed
Run Nemotron 3 on Dedicated Endpoints for maximum performance
1. Select your base model and multi-LoRA adapters.
   Type Nemotron-3 in the search bar and select the variant you want to run.
   Note that you can deploy any of the 540,000+ supported models, including your own custom fine-tuned models, directly from our Models page, Hugging Face model repositories, or your Weights & Biases model artifacts.
2. Choose the GPU type.
3. Customize autoscaling parameters.
4. Configure the inference engine.
5. Deploy.
6. Send inference requests.
   Once the endpoint is running, specify your endpoint ID in the model field of your requests. You can send requests with the Friendli Python SDK, any OpenAI-compatible SDK, or a plain curl command.
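As a minimal sketch of such a request, the following uses only the Python standard library. The base URL and endpoint ID below are placeholder assumptions, not values from this guide; copy the actual values from your FriendliAI dashboard.

```python
import json
import urllib.request

FRIENDLI_TOKEN = "YOUR_FRIENDLI_TOKEN"  # your Personal API Key
ENDPOINT_ID = "YOUR_ENDPOINT_ID"        # goes in the "model" field
BASE_URL = "https://api.friendli.ai/dedicated/v1"  # assumed; verify in your dashboard


def build_chat_request(endpoint_id: str, messages: list) -> dict:
    """Build an OpenAI-compatible chat-completion payload."""
    return {"model": endpoint_id, "messages": messages, "max_tokens": 256}


payload = build_chat_request(
    ENDPOINT_ID, [{"role": "user", "content": "Hello!"}]
)
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {FRIENDLI_TOKEN}",
        "Content-Type": "application/json",
    },
)
# Uncomment to send the request against a live endpoint:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the request body is plain JSON, the same payload works unchanged with curl or any OpenAI-compatible SDK.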
7. Monitor the endpoint behavior.
- You can view full metrics and request activity.
- Monitor real-time throughput, latency, tokens processed, and replica counts over time.
- Review request activity and troubleshoot issues more quickly.
- View specific request and response content (when explicitly enabled).
- Get a clearer view of how the model is behaving.
- Spot and investigate requests that may require attention.
Enabling and disabling reasoning
Nemotron 3 models support an explicit reasoning (or “thinking”) mode, which allows the model to reason step-by-step internally before producing a final answer. You can enable or disable this behavior at request time, depending on whether you want maximum reasoning quality or fast, deterministic responses. By default, Nemotron 3 uses reasoning when the enable_thinking parameter is not specified.
When to enable reasoning
Enable reasoning when:
- You want higher-quality answers for complex or open-ended questions.
- Creativity and exploration are more important than determinism.
- Higher latency or token usage is acceptable.
- When reasoning is enabled, use temperature=1.0 and top_p=1.0 for best performance.
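As an illustration, here is a minimal payload sketch with reasoning explicitly enabled and the recommended sampling settings. The parameter name enable_thinking comes from this guide, but its exact placement in the request body is an assumption to verify against the Friendli API reference.

```python
# Sketch: chat payload with reasoning enabled. Top-level placement of
# enable_thinking is an assumption; check the Friendli API reference.
def reasoning_on_payload(endpoint_id: str, messages: list) -> dict:
    return {
        "model": endpoint_id,     # your Dedicated Endpoint ID
        "messages": messages,
        "enable_thinking": True,  # explicit, though this is the default
        "temperature": 1.0,       # recommended with reasoning enabled
        "top_p": 1.0,             # recommended with reasoning enabled
    }


payload = reasoning_on_payload(
    "YOUR_ENDPOINT_ID",
    [{"role": "user", "content": "Outline a migration plan in three steps."}],
)
```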
When to disable reasoning
Disable reasoning when:
- You want fast, predictable, and deterministic outputs.
- The task is simple (e.g., classification, extraction, short factual answers).
- You want minimal token usage.
- Set enable_thinking=False in your request.
- Use temperature=0 for deterministic behavior.
- top_p can be omitted or left at its default.
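The settings above can be sketched as a complementary payload with reasoning turned off. As before, the placement of enable_thinking in the request body is an assumption to verify against the Friendli API reference.

```python
# Sketch: chat payload with reasoning disabled for fast, deterministic
# output. Top-level placement of enable_thinking is an assumption.
def reasoning_off_payload(endpoint_id: str, messages: list) -> dict:
    return {
        "model": endpoint_id,
        "messages": messages,
        "enable_thinking": False,  # skip the thinking phase
        "temperature": 0,          # deterministic decoding
        # top_p omitted: the default is fine when reasoning is off
    }


payload = reasoning_off_payload(
    "YOUR_ENDPOINT_ID",
    [{"role": "user", "content": "Classify this ticket: 'login page is down'."}],
)
```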