- December 15, 2025
- 5 min read
Enabling the Next Level of Efficient Agentic AI: FriendliAI Supports NVIDIA Nemotron 3 Nano Launch

We’re excited to share that FriendliAI is an official launch partner for NVIDIA’s Nemotron 3, reflecting NVIDIA’s confidence in our technology and our track record of success in delivering high-performance inference at scale. As a trusted partner of NVIDIA, FriendliAI is proud to help bring this new class of highly efficient reasoning models to developers worldwide.
NVIDIA Nemotron 3 is the most efficient, accuracy-leading family of open models for building agentic AI applications. It uses a hybrid Mamba-Transformer mixture-of-experts (MoE) architecture and 1 million-token context length to enable developers to build reliable, high-throughput agents across complex, multi-document, and long-duration operations.
Starting today, developers worldwide can deploy Nemotron 3 Nano on our high-performance inference platform, along with the remarkable speed, cost-efficiency, reliability, and scalability you expect from FriendliAI.
What You Need to Know About NVIDIA Nemotron 3
Nemotron 3 is specifically designed to accomplish targeted tasks with high accuracy, enable multi-agent collaboration, and power mission-critical applications.
With open weights, datasets, tools, and techniques, developers can easily customize, optimize, and deploy the models on their own infrastructure for maximum privacy and security.
According to NVIDIA, Nemotron 3 Nano provides:
- Highest Efficiency
- The model’s hybrid Mamba-Transformer MoE architecture delivers up to 13x higher token-generation throughput, enabling the model to think faster and deliver higher accuracy.
- By predicting multiple future tokens simultaneously in one forward pass, multi-token prediction (MTP) dramatically reduces the time required to generate long sequences of text.
- NVFP4 quantization allows the model to deliver high throughput and compute efficiency while maintaining accuracy.
- A thinking budget keeps the model from overthinking and optimizes for lower, more predictable inference costs.
- Leading Accuracy
- Nemotron 3 Nano achieves leading accuracy through multi-environment reinforcement learning (RL) post-training across 10 environments available in the newly open-sourced NVIDIA NeMo Gym.
- Latent MoE achieves higher accuracy by capturing more nuanced patterns and handling diverse inputs better.
- 1 million-token context length enhances agent coherence by allowing agents to retain extensive conversation history and plan states for longer, more consistent multi-step processes. It also improves cross-document reasoning by enabling RAG pipelines to feed more information that may be relevant.
- The models achieve the highest accuracy across leading benchmarks, including SWE-bench, GPQA Diamond, AIME 2025, Humanity’s Last Exam, IFBench, RULER, and Arena Hard.
- Fully Open
- Fully open source, including model weights, ensuring full control and deployment flexibility for enterprises.
- Open 10 trillion-token training dataset, offering transparency, data control, and reproducibility.
- Open recipes and tools, allowing researchers to replicate or extend the development pipeline with full transparency and model customization.
- Model Specs
- Architecture: Hybrid Mamba-Transformer + MoE
- Size: 30B parameters (3B active)
- Context: 1 million tokens
- Input modality: Text
- Output modality: Text
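The efficiency gain from multi-token prediction described above boils down to simple arithmetic: if each forward pass emits k tokens instead of 1, the number of sequential passes needed for an n-token completion drops from n to ⌈n/k⌉. This toy sketch (not the real model, purely illustrative) makes the scaling concrete:

```python
import math

def forward_passes(n_tokens: int, tokens_per_pass: int) -> int:
    """Sequential forward passes needed to emit n_tokens when each
    pass predicts tokens_per_pass tokens (multi-token prediction)."""
    return math.ceil(n_tokens / tokens_per_pass)

print(forward_passes(1024, 1))  # baseline, one token per pass: 1024 passes
print(forward_passes(1024, 4))  # MTP with k=4: 256 passes
```

The illustrative k=4 value is an assumption for the example; the actual number of tokens predicted per pass depends on the model's MTP head configuration.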
Real-World Applications for Nemotron 3
Nemotron 3’s strengths in long-context reasoning, tool integration, and efficient execution make it ideal for real-world, reasoning-intensive agentic AI applications such as:
1. Software Development & Engineering Productivity
- Code review for large repositories
- Debugging and patch suggestions
- Summarization of long codebases
- Generating new functions or modules
2. Customer Support & IT Operations Automation
- Multi-step task execution through function calling
- Automated IT ticket triage and resolution
- Customer-facing assistants for handling structured workflows
3. Finance & Cybersecurity Intelligence
- Financial transaction compliance monitoring
- Fraud and anomaly detection
- Cybersecurity threat triage using large event logs
4. Enterprise Knowledge Management & Reporting (RAG)
- Long-form business, compliance, or audit report generation
- Synthesizing dashboards, logs, emails, memos, and meeting notes
- Producing consistent multi-section documents from heterogeneous inputs
Why Nemotron 3 on FriendliAI?
At FriendliAI, our mission is to deliver high-performance, cost-efficient inference for AI-native startups and enterprises running AI workloads. Nemotron 3 pairs naturally with our platform:
- Optimized Kernels for Faster Inference: Our custom GPU kernels unlock the model’s maximum capabilities.
- More Efficient MoE Serving: Cutting-edge inference technologies such as Online Quantization and Speculative Decoding further optimize MoE models like Nemotron 3.
- Scale Without Compromise: With FriendliAI, you can enjoy:
- Predictable and stable latency
- Autoscaling for traffic spikes, with customizable tuning parameters
- 50%+ GPU cost savings via optimized serving paths on Dedicated Endpoints
- OpenAI-compatible APIs for easy integration
- Support for many LoRA Adapters on a single endpoint
- Enhanced Observability & Debugging: With a built-in monitoring dashboard, you can:
- Monitor real-time throughput, latency, and usage trends
- Investigate request flows with detailed logs
- Gain clearer operational insight for faster decision-making
- Review request activity and troubleshoot issues more quickly
- Enterprise-grade reliability: With a 99.99% uptime SLA, we deliver stable production performance, which is critical when running high-value workloads.
- Global, compliance-ready deployment with full privacy and control: Geo-distributed infrastructure, data-locality controls, and governance features give you full control and help meet regulated-industry requirements, with FriendliAI Containers running seamlessly on AWS EKS or any on-premises environment. With our SOC 2-certified platform, organizations can confidently operate sensitive, regulated workloads.
Get started with Nemotron 3 on FriendliAI
We are proud to partner with NVIDIA to bring Nemotron 3 Nano to the general public, combining their cutting-edge model innovation with FriendliAI’s high-performance inference platform to unlock a new level of efficient, scalable agentic AI.
NVIDIA’s Nemotron 3 Nano is available to use through Dedicated Endpoints.
1️⃣ To deploy NVIDIA’s Nemotron 3 Nano on Friendli Dedicated Endpoints:
- Navigate to the dedicated endpoint creation page.
- Choose your desired model, such as "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16".
- Click “Create”
You can send requests to Nemotron 3 using any OpenAI-compatible inference API/SDK. For example, using Friendli Python SDK:
Prerequisites
Before getting started, you need to set up:
- A FriendliAI account
- A Friendli Token from the Friendli Suite settings tab
Install the Package
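A minimal install sketch; the PyPI package name `friendli` is an assumption, so check the Friendli SDK documentation for the exact package:

```shell
# Install the Friendli Python SDK (package name assumed; see the SDK docs)
pip install friendli
```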
Environment Setup
Set up your FriendliAI API key (aka Friendli Token):
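For example, you can export the token as an environment variable (the variable name `FRIENDLI_TOKEN` is the convention assumed here; substitute your own token value):

```shell
# Replace the placeholder with the token from your Friendli Suite settings tab
export FRIENDLI_TOKEN="YOUR_FRIENDLI_TOKEN"
```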
Example Code
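Since the endpoint is OpenAI-compatible, a dependency-free sketch using only the Python standard library looks like the following. The API URL shown is an assumption for a Dedicated Endpoint; check your endpoint's details page for the exact URL, and the model ID matches the deployment step above:

```python
import json
import os
import urllib.request

# Assumed Dedicated Endpoint URL; verify against your endpoint's details page.
API_URL = "https://api.friendli.ai/dedicated/v1/chat/completions"
MODEL = "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16"

# OpenAI-compatible chat completion payload.
payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": "Summarize the benefits of a hybrid "
                       "Mamba-Transformer MoE architecture in two sentences.",
        }
    ],
    "max_tokens": 256,
}

token = os.environ.get("FRIENDLI_TOKEN")
if token:  # only send the request when a Friendli Token is configured
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
        print(body["choices"][0]["message"]["content"])
```

The same request works through any OpenAI-compatible SDK by pointing its base URL at the endpoint and passing the Friendli Token as the API key.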
Run better with FriendliAI
FriendliAI offers a purpose-built inference platform designed to further unlock the full potential of Nemotron 3 for real-world production workloads. With industry-leading speed, cost-efficiency, reliability, and scalability, FriendliAI enables teams to run advanced models with confidence and control.
Start building your AI applications faster and smarter by running Nemotron 3 on FriendliAI’s high-performance inference platform.
Written by
FriendliAI Tech & Research
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Our Friendli Inference allows you to squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Talk to an engineer; our engineers (not a bot) will reply within one business day.

