As an official launch partner with NVIDIA, FriendliAI provides full Day 0 support for the new Nemotron 3 Nano on Friendli Dedicated Endpoints. This guide walks you through how to run Nemotron 3 Nano on FriendliAI. You will learn how to authenticate, send inference requests, configure model parameters, and optimize for both performance and cost. You will also learn how to enable or disable reasoning at request time.

Introduction

Nemotron 3 Nano is a high-performance, high-efficiency foundation model designed for agentic AI applications. Built using a hybrid Mamba–Transformer Mixture-of-Experts (MoE) architecture and supporting 1M-token context windows, Nemotron 3 enables developers to build reliable, high-throughput agents that operate across complex workflows, multi-document reasoning, and long-duration tasks. Nemotron 3 Nano is optimized for targeted workloads, multi-agent collaboration, and mission-critical applications. With open weights, datasets, and tools, developers can fine-tune, optimize, and deploy the models in their own infrastructure for maximum privacy and security. To learn more, check out our partnership announcement and NVIDIA’s Nemotron 3 page.

Friendli Dedicated Endpoints

Friendli Dedicated Endpoints give you full control over deployment, scaling, and hardware selection, making them ideal for production workloads and mission-critical applications.
  • High-throughput with consistent, guaranteed performance
  • 50%+ GPU savings
  • Full control over GPU resources, 99.99% availability
  • Ideal for high-volume production applications and long-running services

Prerequisites

Before you get started, ensure that you have the following:
  1. A FriendliAI account
Sign up or log in at https://friendli.ai/suite.
  2. An API token
You can create and manage API tokens in Suite -> Settings -> API Tokens. Set your token as an environment variable:
export FRIENDLI_TOKEN="YOUR API KEY HERE"
  3. The Friendli Python SDK (or any OpenAI-compatible SDK)
# uv
uv add friendli

# pip
pip install friendli
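
If you prefer an OpenAI-compatible client instead, the standard OpenAI Python SDK also works, since the endpoints expose an OpenAI-compatible chat completions API (see the curl example later in this guide):
# optional: any OpenAI-compatible SDK can be used instead
pip install openai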

Run Nemotron 3 Nano on Dedicated Endpoints for Maximum Performance

  1. Go to the Dedicated Endpoint creation page.

  2. Select your base model and multi-LoRA adapters.

Search for the model you want to deploy. For Nemotron 3 models, just type “Nemotron-3” in the search bar and select the variant you want. You can also apply as many multi-LoRA adapters as you want. Note that you can deploy any of the 484,000+ supported models, including your own custom fine-tuned models, directly from our Models page, Hugging Face model repositories, or your Weights & Biases model artifacts.
  3. Configure endpoint features.

Here you can customize your endpoint's features to fit your workload.
  4. Choose the GPU type.

  5. Customize autoscaling parameters.

Friendli lets you customize the autoscaling parameters to tailor endpoints to your workloads.
  6. Configure the inference engine.

Customize the inference engine to match your application’s requirements. These settings control how the model processes inputs, handles tokens, and logs requests during use. You can:
  • Add special tokens
  • Skip special tokens
  • Set maximum batch size
  • Log request content
  7. Deploy.

Click “Create” to deploy your Dedicated Endpoint and start using your model!
  8. Send inference requests.

As with Serverless Endpoints, you can immediately try the model on Dedicated Endpoints in the Playground, which provides a chat-style interface for quick experimentation with built-in tools such as a calculator, a Python interpreter, and Linkup web search. You can also start sending API requests right away: copy the endpoint ID from the endpoint overview page and paste it into the model field of your requests. Example code using the Friendli Python SDK:
import os

from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.dedicated.chat.stream(
        model="your-dedicated-endpoint-id",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, Nemotron 3!"},
        ],
    )
    for chunk in res:
        if content := chunk.data.choices[0].delta.content:
            print(content, end="")
Example curl script:
curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \
  -H "Authorization: Bearer $FRIENDLI_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-dedicated-endpoint-id",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello, Nemotron 3!"
      }
    ],
    "stream": true
  }'
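
Because the endpoint is OpenAI-compatible, you can also call it with the OpenAI Python SDK. Here is a minimal sketch, assuming the same base URL shown in the curl example above and your endpoint ID in the model field:
import os

from openai import OpenAI

# Point the OpenAI client at Friendli's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["FRIENDLI_TOKEN"],
    base_url="https://api.friendli.ai/dedicated/v1",
)

stream = client.chat.completions.create(
    model="your-dedicated-endpoint-id",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, Nemotron 3!"},
    ],
    stream=True,
)
for chunk in stream:
    # delta.content is None for some chunks (e.g., role-only deltas)
    if content := chunk.choices[0].delta.content:
        print(content, end="")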
  9. Monitor the endpoint behavior.

Real-time metrics and logs give teams immediate visibility into system behavior, making it far easier to understand and resolve issues quickly.
  • You can view full metrics and request activities.
    • Monitor real-time throughput, latency, tokens processed, and replica counts over time.
    • Review request activity and troubleshoot issues more quickly.
  • View specific request and response content (when explicitly enabled).
    • Get a clearer view of how the model is behaving.
    • Spot and investigate requests that may require attention.

Enabling and Disabling Reasoning

Nemotron 3 models support an explicit reasoning (or “thinking”) mode, which allows the model to internally reason step-by-step before producing a final answer. You can enable or disable this behavior at request time, depending on whether you want maximum reasoning quality or fast, deterministic responses.

When to Enable Reasoning

Enable reasoning when:
  • You want higher-quality answers for complex or open-ended questions.
  • Creativity and exploration are more important than determinism.
  • Higher latency or token usage is acceptable.
When reasoning is enabled:
  • Use temperature=1.0 and top_p=1.0 for best performance.
  • Optionally set a reasoning_budget to cap how many tokens the model can spend on its internal reasoning trace.
import os

from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.dedicated.chat.stream(
        model="your-dedicated-endpoint-id",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, Nemotron 3!"},
        ],
        chat_template_kwargs={
            "enable_thinking": True,
            "reasoning_budget": 50,  # optional - limits thinking tokens
        },
        temperature=1.0,
        top_p=1.0,
    )
    for chunk in res:
        if content := chunk.data.choices[0].delta.content:
            print(content, end="")
reasoning_budget
  • Optional
  • Sets a strict upper bound on the number of tokens used for internal reasoning.
  • Useful for controlling latency and cost while still benefiting from reasoning.
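If you are calling the REST API directly rather than through the SDK, the same fields go in the JSON request body. A minimal sketch mirroring the Python call above, assuming the HTTP API accepts chat_template_kwargs the same way the SDK does:
curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \
  -H "Authorization: Bearer $FRIENDLI_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-dedicated-endpoint-id",
    "messages": [
      {"role": "user", "content": "Hello, Nemotron 3!"}
    ],
    "chat_template_kwargs": {
      "enable_thinking": true,
      "reasoning_budget": 50
    },
    "temperature": 1.0,
    "top_p": 1.0,
    "stream": true
  }'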

When to Disable Reasoning

Disable reasoning when:
  • You want fast, predictable, and deterministic outputs.
  • The task is simple (e.g., classification, extraction, short factual answers).
  • You want minimal token usage.
When reasoning is disabled:
  • Set enable_thinking=False.
  • Use temperature=0 for deterministic behavior.
  • top_p can be omitted or left at its default.
import os

from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.dedicated.chat.stream(
        model="your-dedicated-endpoint-id",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, Nemotron 3!"},
        ],
        chat_template_kwargs={
            "enable_thinking": False,
        },
        temperature=0,
    )
    for chunk in res:
        if content := chunk.data.choices[0].delta.content:
            print(content, end="")
Choosing the right mode lets you balance answer quality, determinism, latency, and cost for your specific use case.

Conclusion

Nemotron 3 unlocks a new generation of high-performance, long-context, agent-ready AI capabilities, and FriendliAI makes it easy to deploy them from day one. Whether you want instant access through Serverless Endpoints or full control and maximum efficiency with Dedicated Endpoints, FriendliAI provides the infrastructure, tooling, and reliability needed to build and scale production-grade AI systems. With flexible configuration options, seamless deployment workflows, and real-time observability, you can confidently bring Nemotron-powered applications to life and optimize them for your team’s workflow, performance needs, and budget. Ready to get started? Sign in to Friendli Suite and launch your first Nemotron 3 deployment today.