- June 4, 2026
- 4 min read
Run NVIDIA's Most Powerful Open Reasoning Model on Day 0 — Nemotron 3 Ultra on FriendliAI
- NVIDIA Nemotron 3 Ultra is a 550B-A55B open frontier-reasoning model built for orchestration in long-running agentic workflows — with the highest throughput among open frontier models and up to 1M token context
- The metric that matters for agentic AI is the speed of task completion; Nemotron 3 Ultra is designed around exactly that
- FriendliAI supports Nemotron 3 Ultra on Day 0 — available now via Dedicated Endpoint

What is NVIDIA Nemotron 3 Ultra?
NVIDIA Nemotron 3 Ultra is an open frontier-reasoning and orchestration model built for long-running autonomous agents. As part of the NVIDIA Nemotron family of open models for agentic AI, Nemotron 3 Ultra is designed to handle the hardest, most reasoning-intensive steps in an agent workflow: orchestration, planning, error recovery, and synthesis.
In agentic AI, models are in the service of agents. Agents plan, call tools, delegate work, check results, and complete tasks. The measure that matters is no longer just model quality—it is the speed of task completion. Nemotron 3 Ultra is designed around exactly that metric.
The model ships with a Hybrid Transformer-Mamba MoE architecture at 550B total parameters with 55B active parameters. This architecture makes it uniquely efficient for long-running workflows: it supports up to 1M token context, accepts text input, and produces text output. It is available in BF16 and NVFP4 precisions and is optimized for deployment on H100 and B200 GPUs, with software support across vLLM, SGLang, and TensorRT-LLM.
Key specs at a glance:
| Architecture | Hybrid Transformer-Mamba MoE |
| Parameters | 550B total / 55B active |
| Context Length | Up to 1M tokens |
| Modality | Text input, text output |
| Precision | BF16, NVFP4 |
| License | Open, most permissive license |
Why Nemotron 3 Ultra for Agentic AI?
Long-running agents work in turns: plan, act, observe, reflect. As tasks grow more complex, the number of turns increases—and more turns drive exponential token generation. The metric that matters isn't benchmark accuracy; it's speed of task completion. Nemotron 3 Ultra delivers the highest throughput among open frontier models, so agent orchestrators make better decisions faster and complex tasks actually complete.
Primary use cases:
- Agent Orchestration & Planning — Decomposes complex goals into sub-tasks, delegates to specialized sub-agents, and synthesizes results across the full agent loop
- Coding Agents — Reads entire codebases, plans changes, and coordinates parallel execution across files and modules; the 1M context window eliminates the need to chunk large repos
- Deep Research — Breaks a complex thesis into parallel research tracks and synthesizes findings from dozens of concurrent sub-agents into a single output
- Complex Enterprise Workflows — Handles multi-hop reasoning for customer service resolution, document intelligence, and other workflows requiring deep context
Running Nemotron 3 Ultra on FriendliAI
FriendliAI is the Frontier Inference Cloud for Agents, delivering frontier-level intelligence, throughput, and lower cost of inference to complete agentic tasks.
Nemotron 3 Ultra is a 550B-parameter model. Running it in production requires not just raw GPU capacity, but an inference stack that is optimized specifically for the model's architecture. FriendliAI's engine is continuously updated to support state-of-the-art open-weight models at production scale—and Nemotron 3 Ultra is available from Day 0.
What FriendliAI brings to Nemotron 3 Ultra:
5x throughput for long-running agents Nemotron 3 Ultra is designed for high-throughput reasoning workflows. FriendliAI's inference engine maximizes tokens per second per GPU through continuous batching, speculative decoding, and kernel-level optimizations—delivering more reasoning cycles per time budget, exactly the metric that matters for agentic workloads.
Cost-efficient inference at scale Running a 550B model repeatedly across hundreds of agent turns gets expensive fast. FriendliAI helps reduce inference costs while delivering higher tokens-per-dollar than alternative OSS serving stacks. For production agents where each task involves dozens of reasoning turns, this cost efficiency directly impacts what is viable to build.
Production-grade reliability Agentic workflows cannot tolerate dropped context or unstable serving. FriendliAI provides enterprise-grade reliability with dedicated endpoints, auto-scaling, and operational monitoring—giving teams the infrastructure confidence to run long-horizon agent workflows in production.
Full model ownership Nemotron 3 Ultra is an open model. FriendliAI lets you deploy it in your own environment—cloud, data center, or dedicated GPU cluster—so your data stays in your infrastructure and you maintain full control over the model, its fine-tunes, and the inference stack.
Get Started
Nemotron 3 Ultra is available on FriendliAI starting today. You can deploy it via Friendli Dedicated Endpoint, depending on your throughput and latency requirements.
- Dedicated Endpoint: Reserved GPU capacity for consistent, high-throughput production workloads where latency predictability matters.
To deploy NVIDIA Nemotron 3 Ultra on Friendli Dedicated Endpoints:
1️⃣ Navigate to the dedicated endpoint creation page.
2️⃣ Choose your desired model: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
3️⃣ Click "Create"
You can send requests to Nano Omni using any OpenAI-compatible inference API/SDK. For example, using the Friendli Python SDK:
Prerequisites
Before getting started, you'll need:
- A FriendliAI account
- A Friendli Token from Friendli Suite settings
Install the package
Environment Setup
Set up your FriendliAI API key (aka Friendli Token):
Example Code
Deploy NVIDIA Nemotron at Scale with FriendliAI
FriendliAI is proud to partner with NVIDIA to offer the Nemotron family of models to the developer community and businesses building production-ready, agentic AI systems.
NVIDIA Nemotron 3 Ultra represents the frontier of open reasoning models—built to orchestrate long-running agents that plan, delegate, and synthesize at scale. FriendliAI is your platform to take it to production. From experimentation to enterprise deployment, FriendliAI delivers the performance, reliability, and control required to scale agentic AI in real-world applications.
👉 Launch your Nemotron 3 Ultra Dedicated Endpoint today and start building the next generation of frontier agentic AI with FriendliAI.
Written by
FriendliAI Tech & Research
Share
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: Unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you in here.
How does FriendliAI help my business?
Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 570,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you want a customized solution for that key issue that is slowing your growth, support@friendli.ai or click Talk to an engineer — our engineers (not a bot) will reply within one business day.

