
Introduction
Nemotron 3 Nano is a family of high-performance, high-efficiency foundation models designed for agentic AI applications. Built on a hybrid Mamba–Transformer Mixture-of-Experts (MoE) architecture and supporting 1M-token context windows, Nemotron 3 enables developers to build reliable, high-throughput agents that operate across complex workflows, multi-document reasoning, and long-duration tasks. Nemotron 3 Nano is optimized for targeted workloads, multi-agent collaboration, and mission-critical applications. With open weights, datasets, and tools, developers can fine-tune, optimize, and deploy the models on their own infrastructure for maximum privacy and security. To learn more, check out our partnership announcement and NVIDIA’s Nemotron 3 page.

Friendli Dedicated Endpoints
Friendli Dedicated Endpoints give you full control over deployment, scaling, and hardware selection, making them ideal for production workloads and mission-critical applications:
- High-throughput with consistent, guaranteed performance
- 50%+ GPU savings
- Full control over GPU resources, 99.99% availability
- Ideal for high-volume production applications and long-running services
Prerequisites
Before you get started, ensure that you have the following:
- A FriendliAI account
- An API token

- The Friendli Python SDK or any OpenAI-compatible SDK installed
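As a quick sanity check of your setup, the snippet below builds the bearer-token headers that any OpenAI-compatible client or raw HTTP call would send. The `FRIENDLI_TOKEN` environment variable name is an illustrative assumption; store your token however your infrastructure prefers.

```python
import os

# Read your FriendliAI API token from the environment.
# (The variable name FRIENDLI_TOKEN is an illustrative assumption.)
token = os.environ.get("FRIENDLI_TOKEN", "<your-api-token>")

# Bearer-token headers usable with any OpenAI-compatible client
# or a raw HTTP request.
headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
}
```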
Run Nemotron 3 Nano on Dedicated Endpoints for Maximum Performance

- Select your base model and multi-LoRA adapters.
- Configure endpoint features.
  - Set Online Quantization for higher throughput and fewer GPUs.
  - Enable N-gram Speculative Decoding for a faster Time-per-Output-Token (TPOT).

- Choose the GPU type.
- Customize autoscaling parameters.

- Configure the inference engine.
  - Add special tokens
  - Skip special tokens
  - Set maximum batch size
  - Log request content

- Deploy.
- Send inference requests.
  - Specify your endpoint ID in the model field of your requests.
  - You can send requests with the Friendli Python SDK, any OpenAI-compatible SDK, or a curl script.
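As a sketch of what an inference request looks like once the endpoint is live, the snippet below builds an OpenAI-compatible chat-completion payload. The endpoint ID is a placeholder, not a real value; the same JSON body works with curl or any OpenAI-compatible SDK.

```python
import json

# Placeholder: on Dedicated Endpoints, your endpoint ID goes in the
# "model" field (illustrative value only).
ENDPOINT_ID = "YOUR_ENDPOINT_ID"

payload = {
    "model": ENDPOINT_ID,
    "messages": [
        {
            "role": "user",
            "content": "Summarize the key ideas of hybrid Mamba-Transformer models.",
        }
    ],
    "max_tokens": 512,
}

# Serialize exactly as it would appear in the HTTP request body.
body = json.dumps(payload)
```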
- Monitor the endpoint behavior.
  - View full metrics and request activity.
  - Monitor real-time throughput, latency, tokens processed, and replica counts over time.
  - Review request activity and troubleshoot issues more quickly.
  - View specific request and response content (when explicitly enabled).
  - Get a clearer view of how the model is behaving.
  - Spot and investigate requests that may require attention.

Enabling and Disabling Reasoning
Nemotron 3 models support an explicit reasoning (or “thinking”) mode, which allows the model to internally reason step-by-step before producing a final answer. You can enable or disable this behavior at request time, depending on whether you want maximum reasoning quality or fast, deterministic responses.

When to Enable Reasoning
Enable reasoning when:
- You want higher-quality answers for complex or open-ended questions.
- Creativity and exploration are more important than determinism.
- Higher latency or token usage is acceptable.
Recommended settings:
- Use temperature=1.0 and top_p=1.0 for best performance.
- Optionally set a reasoning_budget to cap how many tokens the model can spend on its internal reasoning trace:
  - Sets a strict upper bound on the number of tokens used for internal reasoning.
  - Useful for controlling latency and cost while still benefiting from reasoning.
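The settings above can be sketched as a request payload. The parameter names enable_thinking and reasoning_budget come from this guide, but passing them inside chat_template_kwargs is an assumption based on common OpenAI-compatible servers; check your endpoint's documentation for the exact transport. The endpoint ID and budget value are placeholders.

```python
import json

# Reasoning-enabled request sketch (chat_template_kwargs transport
# is an assumption; parameter names are from the guide above).
payload = {
    "model": "YOUR_ENDPOINT_ID",  # placeholder endpoint ID
    "messages": [
        {"role": "user", "content": "Plan a 3-step data-migration strategy."}
    ],
    "temperature": 1.0,  # recommended with reasoning on
    "top_p": 1.0,        # recommended with reasoning on
    "chat_template_kwargs": {
        "enable_thinking": True,   # turn the thinking mode on
        "reasoning_budget": 1024,  # optional cap on reasoning tokens (illustrative)
    },
}
body = json.dumps(payload)
```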
When to Disable Reasoning
Disable reasoning when:
- You want fast, predictable, and deterministic outputs.
- The task is simple (e.g., classification, extraction, short factual answers).
- You want minimal token usage.
Recommended settings:
- Set enable_thinking=False.
- Use temperature=0 for deterministic behavior.
- top_p can be omitted or left at its default.
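For comparison, a reasoning-disabled request might look like the sketch below. As before, routing enable_thinking through chat_template_kwargs is an assumption about the server's OpenAI-compatible interface, and the endpoint ID is a placeholder.

```python
import json

# Reasoning-disabled request sketch for a simple classification task.
payload = {
    "model": "YOUR_ENDPOINT_ID",  # placeholder endpoint ID
    "messages": [
        {"role": "user", "content": "Classify the sentiment: 'Great service!'"}
    ],
    "temperature": 0,  # deterministic behavior
    # top_p omitted: left at its server-side default
    "chat_template_kwargs": {"enable_thinking": False},  # reasoning off
}
body = json.dumps(payload)
```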