June 4, 2026

Run NVIDIA's Most Powerful Open Reasoning Model on Day 0 — Nemotron 3 Ultra on FriendliAI

TL;DR

NVIDIA Nemotron 3 Ultra is a 550B-A55B open frontier-reasoning model built for orchestration in long-running agentic workflows — with the highest throughput among open frontier models and up to 1M token context
The metric that matters for agentic AI is the speed of task completion; Nemotron 3 Ultra is designed around exactly that
FriendliAI supports Nemotron 3 Ultra on Day 0 — available now via Dedicated Endpoint


What is NVIDIA Nemotron 3 Ultra?

NVIDIA Nemotron 3 Ultra is an open frontier-reasoning and orchestration model built for long-running autonomous agents. As part of the NVIDIA Nemotron family of open models for agentic AI, Nemotron 3 Ultra is designed to handle the hardest, most reasoning-intensive steps in an agent workflow: orchestration, planning, error recovery, and synthesis.

In agentic AI, models are in the service of agents. Agents plan, call tools, delegate work, check results, and complete tasks. The measure that matters is no longer just model quality—it is the speed of task completion. Nemotron 3 Ultra is designed around exactly that metric.

The model ships with a Hybrid Transformer-Mamba MoE architecture at 550B total parameters with 55B active parameters, which keeps it efficient in long-running workflows. It supports up to 1M tokens of context, accepts text input, and produces text output. It is available in BF16 and NVFP4 precisions and is optimized for deployment on H100 and B200 GPUs, with software support across vLLM, SGLang, and TensorRT-LLM.

Key specs at a glance:

Architecture	Hybrid Transformer-Mamba MoE
Parameters	550B total / 55B active
Context Length	Up to 1M tokens
Modality	Text input, text output
Precision	BF16, NVFP4
License	Open, most permissive license

Why Nemotron 3 Ultra for Agentic AI?

Long-running agents work in turns: plan, act, observe, reflect. As tasks grow more complex, the number of turns increases, and every added turn means more tokens to generate. The metric that matters isn't benchmark accuracy; it's speed of task completion. Nemotron 3 Ultra delivers the highest throughput among open frontier models, so agent orchestrators make better decisions faster and complex tasks actually complete.
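To make the link between throughput and task completion concrete, here is a back-of-the-envelope sketch. The function name, the turn count, and the tokens-per-second figures are illustrative assumptions, not measurements of Nemotron 3 Ultra or FriendliAI, and the model deliberately ignores prompt processing time.

```python
def task_completion_seconds(
    turns: int,
    output_tokens_per_turn: int,
    tokens_per_second: float,
    overhead_seconds_per_turn: float = 0.0,
) -> float:
    """Rough wall-clock estimate for a multi-turn agent task.

    Assumes every turn generates the same number of tokens and that
    generation speed is constant. Prompt (prefill) time is ignored.
    """
    generation_seconds = turns * output_tokens_per_turn / tokens_per_second
    # Tool calls, network hops, and similar fixed costs per turn.
    overhead_seconds = turns * overhead_seconds_per_turn
    return generation_seconds + overhead_seconds


if __name__ == "__main__":
    # Hypothetical task: 40 turns, 1,500 generated tokens per turn.
    for speed in (100.0, 200.0):
        seconds = task_completion_seconds(40, 1500, speed)
        print(f"{speed:.0f} tokens/s: {seconds:.0f} s")
```

Under these assumptions, doubling generation speed halves the generation part of the task time, while the per-turn overhead stays fixed.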

Primary use cases:

Agent Orchestration & Planning — Decomposes complex goals into sub-tasks, delegates to specialized sub-agents, and synthesizes results across the full agent loop
Coding Agents — Reads entire codebases, plans changes, and coordinates parallel execution across files and modules; the 1M context window eliminates the need to chunk large repos
Deep Research — Breaks a complex thesis into parallel research tracks and synthesizes findings from dozens of concurrent sub-agents into a single output
Complex Enterprise Workflows — Handles multi-hop reasoning for customer service resolution, document intelligence, and other workflows requiring deep context
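The orchestration pattern behind these use cases (decompose, delegate, synthesize) can be sketched in a few lines. This is a minimal illustration, not NVIDIA's or FriendliAI's implementation: `plan`, `call_model`, and `orchestrate` are hypothetical names, and the model call is a local stub so the example runs without any endpoint.

```python
import asyncio


async def call_model(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned string."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"result for: {prompt}"


def plan(goal: str) -> list[str]:
    """Hypothetical planner with fixed sub-tasks.

    A real orchestrator would ask the model to decompose the goal.
    """
    return [f"{goal} / research", f"{goal} / draft", f"{goal} / review"]


async def orchestrate(goal: str) -> str:
    subtasks = plan(goal)
    # Delegate: run one sub-agent call per sub-task concurrently.
    results = await asyncio.gather(*(call_model(task) for task in subtasks))
    # Synthesize: joined here; in practice this would be another model call.
    return "\n".join(results)


if __name__ == "__main__":
    print(asyncio.run(orchestrate("quarterly report")))
```

In a real agent, each stubbed call would become a request to the model, which is why generation speed on the orchestrator's own turns tends to dominate end-to-end task time.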

Running Nemotron 3 Ultra on FriendliAI

FriendliAI is the Frontier Inference Cloud for Agents, delivering frontier-level intelligence, throughput, and lower cost of inference to complete agentic tasks.

Nemotron 3 Ultra is a 550B-parameter model. Running it in production requires not just raw GPU capacity, but an inference stack that is optimized specifically for the model's architecture. FriendliAI's engine is continuously updated to support state-of-the-art open-weight models at production scale—and Nemotron 3 Ultra is available from Day 0.

What FriendliAI brings to Nemotron 3 Ultra:

5x throughput for long-running agents: Nemotron 3 Ultra is designed for high-throughput reasoning workflows. FriendliAI's inference engine maximizes tokens per second per GPU through continuous batching, speculative decoding, and kernel-level optimizations. The result is more reasoning cycles per time budget, which is exactly the metric that matters for agentic workloads.

Cost-efficient inference at scale: Running a 550B model repeatedly across hundreds of agent turns gets expensive fast. FriendliAI helps reduce inference costs while delivering higher tokens-per-dollar than alternative OSS serving stacks. For production agents where each task involves dozens of reasoning turns, this cost efficiency directly impacts what is viable to build.

Production-grade reliability: Agentic workflows cannot tolerate dropped context or unstable serving. FriendliAI provides enterprise-grade reliability with dedicated endpoints, auto-scaling, and operational monitoring, giving teams the infrastructure confidence to run long-horizon agent workflows in production.

Full model ownership: Nemotron 3 Ultra is an open model. FriendliAI lets you deploy it in your own environment (cloud, data center, or dedicated GPU cluster) so your data stays in your infrastructure and you maintain full control over the model, its fine-tunes, and the inference stack.
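For readers unfamiliar with continuous batching, one of the techniques named above, the toy simulation below shows why it helps. It is a simplified sketch and not FriendliAI's engine: it counts decode steps only, assumes every request is already queued and needs at least one token, and uses made-up request lengths. Both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when a freed slot is refilled from the queue right away."""
    queue = deque(lengths)
    active: list[int] = []  # tokens still to generate per in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots before the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each in-flight request emits one token; finished requests leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2]  # output tokens per request (illustrative)
    print("static:", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In this example the short requests no longer wait behind the long one, so the same work finishes in fewer decode steps. Real engines also have to account for prefill, memory limits, and requests arriving over time.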

Get Started

Nemotron 3 Ultra is available on FriendliAI starting today. You can deploy it via Friendli Dedicated Endpoints.

Dedicated Endpoint: Reserved GPU capacity for consistent, high-throughput production workloads where latency predictability matters.

To deploy NVIDIA Nemotron 3 Ultra on Friendli Dedicated Endpoints:

1️⃣ Navigate to the dedicated endpoint creation page.

2️⃣ Choose your desired model: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16

3️⃣ Click "Create"

You can send requests to Nemotron 3 Ultra using any OpenAI-compatible inference API/SDK. For example, using the Friendli Python SDK:

Prerequisites

Before getting started, you'll need:

A FriendliAI account
A Friendli Token from Friendli Suite settings

Install the package

shell

uv pip install friendli

Environment Setup

Set up your FriendliAI API key (aka Friendli Token):

shell

export FRIENDLI_TOKEN="your-token-here"

Example Code

python

import os

from friendli import SyncFriendli

# Read the Friendli Token from the environment rather than hard-coding it.
with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    # Stream a chat completion from your dedicated endpoint.
    res = friendli.dedicated.chat.stream(
        model="your-dedicated-endpoint-id",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, Nemotron 3 Ultra!"},
        ],
    )
    # Print each content delta as it arrives.
    for chunk in res:
        if content := chunk.data.choices[0].delta.content:
            print(content, end="")
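Because the endpoint speaks the OpenAI-compatible chat completions format, the same request can also be sketched without the SDK, using only the Python standard library. Treat the details as assumptions to check against your endpoint's settings: the base URL below is what we assume for Friendli Dedicated Endpoints, and `build_chat_request` and the `FRIENDLI_ENDPOINT_ID` variable are hypothetical names introduced for this example.

```python
import json
import os
import urllib.request

# Assumed base URL for Friendli Dedicated Endpoints; verify it for your account.
BASE_URL = "https://api.friendli.ai/dedicated/v1"


def build_chat_request(
    token: str, endpoint_id: str, user_message: str
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completions request."""
    body = {
        "model": endpoint_id,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Only send when both variables are set (FRIENDLI_ENDPOINT_ID is hypothetical).
if __name__ == "__main__" and {"FRIENDLI_TOKEN", "FRIENDLI_ENDPOINT_ID"} <= set(os.environ):
    request = build_chat_request(
        os.environ["FRIENDLI_TOKEN"],
        os.environ["FRIENDLI_ENDPOINT_ID"],
        "Hello, Nemotron 3 Ultra!",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    # OpenAI-style responses carry the text under choices[0].message.content.
    print(reply["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes it easy to inspect the payload, or to swap in any OpenAI-compatible client later.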

Deploy NVIDIA Nemotron at Scale with FriendliAI

FriendliAI is proud to partner with NVIDIA to offer the Nemotron family of models to the developer community and businesses building production-ready, agentic AI systems.

NVIDIA Nemotron 3 Ultra represents the frontier of open reasoning models—built to orchestrate long-running agents that plan, delegate, and synthesize at scale. FriendliAI is your platform to take it to production. From experimentation to enterprise deployment, FriendliAI delivers the performance, reliability, and control required to scale agentic AI in real-world applications.

👉 Launch your Nemotron 3 Ultra Dedicated Endpoint today and start building the next generation of frontier agentic AI with FriendliAI.

Written by

FriendliAI Tech & Research

General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity.

How does FriendliAI help my business?

Friendli Inference lets you squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper.
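As a rough illustration of that arithmetic, the sketch below compares two serving stacks at the same hourly GPU rate. All figures and function names are hypothetical and chosen only to show the calculation; they are not FriendliAI pricing or benchmark results.

```python
import math


def tokens_per_dollar(tokens_per_second_per_gpu: float, gpu_hourly_rate_usd: float) -> float:
    """Tokens generated per dollar of GPU time on a fully used GPU."""
    return tokens_per_second_per_gpu * 3600 / gpu_hourly_rate_usd


def gpus_needed(load_tokens_per_second: float, tokens_per_second_per_gpu: float) -> int:
    """Whole GPUs required to sustain a given aggregate load."""
    return math.ceil(load_tokens_per_second / tokens_per_second_per_gpu)


if __name__ == "__main__":
    rate = 4.0  # USD per GPU-hour, identical for both stacks
    load = 10_000  # aggregate tokens per second to serve
    for name, speed in (("baseline", 1_000.0), ("faster", 2_000.0)):
        gpus = gpus_needed(load, speed)
        print(name, gpus, "GPUs,", f"{tokens_per_dollar(speed, rate):,.0f} tokens per dollar")
```

With these made-up numbers, doubling per-GPU throughput halves the GPU count and doubles tokens per dollar, even though the hourly rate is unchanged.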

Which models and modalities are supported?

Over 580,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters.

Can I deploy models from Hugging Face directly?

Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub takes you, in one click, to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference.

Still have questions?

If you want a customized solution for the key issue that is slowing your growth, email support@friendli.ai or click Talk to an engineer. Our engineers (not a bot) will reply within one business day.