• April 29, 2026
  • 5 min read

NVIDIA Nemotron™ 3 Nano Omni, Day-0 on FriendliAI: Unified Multimodal Reasoning, at Peak Performance

TL;DR
  • FriendliAI has launched Day-0 support for NVIDIA's Nemotron 3 Nano Omni, available via FriendliAI Dedicated Endpoints for production-scale multimodal AI.
  • The "Eyes and Ears" of Agentic AI: A unified foundation model reasons across video, audio, images, documents, charts, and text within a single loop, eliminating fragmented modality-specific pipelines.
  • Hybrid Breakthrough Architecture: Built on a Transformer–Mamba MoE architecture (30B total/3B active parameters) with Conv3D and Efficient Video Sampling (EVS), delivering ~9× higher throughput.
  • World-Class Accuracy: Outperforms the best open alternatives in multimodal intelligence, purpose-built for long-context production agentic workflows.

NVIDIA Nemotron™ 3 Nano Omni is a production-ready foundation model designed to redefine how AI agents interact with the world. By unifying reasoning across video, audio, images, and text into a single hybrid architecture, it eliminates the need for fragmented, modality-specific pipelines. It serves as the intelligent "perception layer" for agentic systems, allowing them to see, hear, and read with a depth and coherence previously reserved for much larger, proprietary models.

FriendliAI is proud to launch Day-0 support for this breakthrough model, adding to our collection of NVIDIA models available for one-click deployment on FriendliAI. Our Dedicated Endpoints are specifically engineered to unlock the full potential of Nemotron 3 Nano Omni’s hybrid MoE architecture, delivering peak performance through high-throughput multimodal inference and low-latency execution. Whether you are building real-time computer-use agents or long-horizon document intelligence workflows, FriendliAI provides the specialized infrastructure required to run Nemotron 3 Nano Omni at its absolute limit in production.

Nemotron 3 Nano Omni replaces fragmented modality‑specific pipelines with a single coherent model. In a system of agents, it functions as the multimodal perception sub-agent, maintaining a converged multimodal context across loops and feeding structured understanding into orchestration and execution agents.

Provision your Nemotron 3 Nano Omni endpoint here.

Nemotron 3 Nano Omni Unlocks the Power of Open, Unified Multimodal Reasoning

Real-world enterprise tasks are inherently multimodal. A customer service agent may need to analyze audio, interpret a screen recording, and reference policy documents simultaneously. A financial analyst needs to cross-reference earnings call audio with scanned charts and a filing's full text. Today, most systems handle these inputs through separate modality-specific models, each with its own inference pass, data transformation, and orchestration logic, before attempting to fuse results after the fact.

Nemotron 3 Nano Omni replaces that fragmented architecture with a single foundation model that natively understands and reasons across video, audio, images, documents, charts, GUIs, and text, with a shared multimodal context window of up to 256K tokens. When perception and decision-making share a single reasoning loop, cross-modal understanding improves, failure modes shrink, and agent development becomes dramatically simpler.

Model Highlights

  • Unified Multimodal Understanding: Natively processes video, audio, images, documents, charts, and text in one model, with up to 256K tokens of shared multimodal context. Eliminates fragmented pipelines and cross-model compatibility issues.
  • Best-in-Class Efficiency: Achieves 9× higher throughput than other open omni models at the same interactivity, resulting in lower cost and better scalability without sacrificing responsiveness.
  • World-Class Accuracy: ~20% higher multimodal intelligence via multi-environment RL training with NeMo RL. Leading results on video + audio understanding, OCR-based reasoning, chart and table analysis, document intelligence, and GUI/screen comprehension.
  • Open by Design: Open weights (deploy anywhere with full data control), open post-training and optimization techniques, open high-quality synthetic datasets, and open recipes for customization.
  • Run Anywhere: Available on Hugging Face, supported across leading inference platforms, packaged as NVIDIA NIM, and productionized on FriendliAI via Dedicated Endpoints.

What Can You Build with Nemotron 3 Nano Omni on FriendliAI?

Nemotron 3 Nano Omni is purpose-built for agents that operate in the real world, where inputs arrive across multiple modalities, context must be preserved across long interactions, and decisions require coherent cross-modal reasoning. FriendliAI's high-performance hosting unlocks these capabilities at production scale.

Customer Service Agents

A customer service agent operates in a deeply multimodal environment, working across customer call recordings, screen recordings of user sessions, screenshots of errors or invoices, and knowledge-base articles or CRM history, all at once. Nemotron 3 Nano Omni unifies all of these signals so the agent understands not just what the customer said, but what they experienced and what the business rules allow, enabling accurate, context-aware resolution in a single reasoning loop.

Financial Analyst Agents

Financial analysis depends on more than text alone. Nemotron 3 Nano Omni reasons across earnings transcripts, scanned charts and reports, earnings call recordings, and investor presentation video simultaneously, tying together what executives say, how numbers are presented visually, and what the underlying documents show. The result is grounded insight rather than surface-level summarization.

Computer Use Agents

Computer use is one of the clearest demonstrations of unified multimodality. The agent processes screen recording video to understand UI state over time, interprets instructions and system signals, and reads task instructions and validation policies. Nano Omni enables the agent to see the interface, understand intent, read constraints, and take the correct action, all within one reasoning loop. This kind of workflow collapses when perception and decision-making are split across models.

Friendli Dedicated Endpoints for Nemotron 3 Nano Omni

FriendliAI enables production deployment of Nemotron 3 Nano Omni with dedicated GPUs, predictable performance, and a 99.99% uptime SLA—so teams can scale multimodal agentic workloads without managing infrastructure.

Built for enterprise AI systems, Friendli Dedicated Endpoints are optimized for the hybrid MoE inference pattern that makes Nano Omni efficient: routing tokens to the right experts, batching multimodal inputs efficiently, and sustaining the throughput required for long-context, multi-step agent workflows.

Key Capabilities

  • Private Nano Omni deployment on dedicated GPUs for consistent, predictable performance
  • High-throughput, low-latency MoE-optimized inference—maximizing token throughput for multimodal workloads
  • Production-ready endpoints with built-in observability and autoscaling
  • Flexible GPU options (B200, H100/H200) based on workload and performance requirements
  • Enterprise-grade reliability, security, and full data ownership in a private environment

With Friendli Dedicated Endpoints, you can:

  • Deploy Nemotron 3 Nano Omni into production within minutes
  • Scale multimodal agents without rebuilding infrastructure
  • Run 256K-token long-context reasoning pipelines with predictable latency
  • Autoscale inference dynamically across GPUs, instantly right-sizing capacity to match demand
  • Operate customized Nano Omni deployments securely and privately

FriendliAI handles serving, batching, scaling, and GPU orchestration, letting your team focus on building, not infrastructure.

Get Started with Nemotron 3 Nano Omni on FriendliAI

You can deploy NVIDIA Nemotron 3 Nano Omni on Friendli Dedicated Endpoints in just a few steps:

1️⃣ Navigate to the Dedicated Endpoint creation page.

2️⃣ Choose your desired model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning

3️⃣ Click "Create"

Once deployed, you can send requests using any OpenAI-compatible inference API/SDK.
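Because the endpoint speaks the OpenAI-compatible chat completions protocol, you can also call it without the Friendli SDK. Below is a minimal sketch using only the Python standard library; the base URL shown and the `your-dedicated-endpoint-id` placeholder are assumptions to verify against your Friendli Suite console before use.

```python
import json
import os
import urllib.request

# Assumed OpenAI-compatible base URL for Friendli Dedicated Endpoints;
# confirm the exact URL in your Friendli Suite console.
BASE_URL = "https://api.friendli.ai/dedicated/v1"
ENDPOINT_ID = "your-dedicated-endpoint-id"  # placeholder, replace with yours


def build_chat_request(prompt: str) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": ENDPOINT_ID,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }


payload = build_chat_request("Hello, Nemotron 3 Nano Omni!")

token = os.environ.get("FRIENDLI_TOKEN")
# Only send the request when a token is configured and the placeholder
# endpoint ID has been replaced with a real one.
if token and ENDPOINT_ID != "your-dedicated-endpoint-id":
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

The same payload shape works from any OpenAI-compatible client; only the base URL and bearer token differ.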

Prerequisites

Before getting started, you'll need:

  • A FriendliAI account
  • A Friendli Token (available in Friendli Suite settings)

Install the SDK

shell
uv pip install friendli

Set Up Your Environment

Set up your FriendliAI API key (aka Friendli Token):

shell
export FRIENDLI_TOKEN="your-token-here"

Example: Streaming Chat Completion

python
import os

from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.dedicated.chat.stream(
        model="your-dedicated-endpoint-id",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, Nemotron 3 Nano Omni!"},
        ],
    )
    for chunk in res:
        if content := chunk.data.choices[0].delta.content:
            print(content, end="")

Streaming applies to response generation (text output).
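To exercise the model's multimodal side, you can pass OpenAI-style structured content parts in place of a plain string. The sketch below builds a text-plus-image message; the image URL and endpoint ID are placeholders, and the exact content-part types accepted for video and audio inputs are assumptions to check against the model's API documentation.

```python
import os

ENDPOINT_ID = "your-dedicated-endpoint-id"  # placeholder, replace with yours
IMAGE_URL = "https://example.com/invoice.png"  # placeholder image URL


def build_multimodal_messages(question: str, image_url: str) -> list:
    """Build OpenAI-style messages mixing a text part and an image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


messages = build_multimodal_messages("What is the invoice total?", IMAGE_URL)

# Stream only when a token is set and the placeholder ID is replaced.
if os.environ.get("FRIENDLI_TOKEN") and ENDPOINT_ID != "your-dedicated-endpoint-id":
    from friendli import SyncFriendli

    with SyncFriendli(token=os.environ["FRIENDLI_TOKEN"]) as friendli:
        res = friendli.dedicated.chat.stream(
            model=ENDPOINT_ID,
            messages=messages,
        )
        for chunk in res:
            if content := chunk.data.choices[0].delta.content:
                print(content, end="")
```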

Deploy NVIDIA Nemotron at Scale with FriendliAI

FriendliAI is proud to partner with NVIDIA to offer the Nemotron family of models to the developer community and businesses building production-ready, agentic AI systems.

Nemotron 3 Nano Omni represents a new class of open foundation model, one that reasons coherently across video, audio, images, documents, and text within a unified context. FriendliAI is your platform to take it to production. From experimentation to enterprise deployment, FriendliAI provides the performance, reliability, and control required to scale multimodal agentic AI in real-world applications.

👉 Launch your Nemotron 3 Nano Omni Dedicated Endpoint today and start building the next generation of omni-modal agentic AI with FriendliAI.


Written by

FriendliAI Tech & Research


General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 540,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub takes you to our one-click model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Talk to an engineer; our engineers (not a bot) will reply within one business day.


Explore FriendliAI today