July 1, 2026
6 min read

How Kilo Code and FriendliAI Bring Open Source AI Coding Agents to Production with NVIDIA Nemotron

Q: What is FriendliAI?

FriendliAI is the Frontier Inference Cloud for Agents, delivering high throughput, low latency, and reliability at scale for agentic workloads. Through vertically optimized inference infrastructure, it delivers 2–5× faster output token speed and a 99.99% uptime SLA for high-volume production traffic.

TL;DR

Kilo and FriendliAI are partnering to make production-grade AI coding agents faster, more accurate, and significantly more cost-efficient with NVIDIA Nemotron open models.
Configure Nemotron 3 Ultra in Kilo Code using FriendliAI as your inference provider, directly inside your IDE.
Using FriendliAI, Kilo achieved up to 7x faster inference compared to several other providers and cut costs by up to 72% on complex agent workloads by routing to NVIDIA Nemotron 3 Ultra.

How Kilo Code and FriendliAI Bring Open Source AI Coding Agents to Production with NVIDIA Nemotron thumbnail

Introduction

Kilo and FriendliAI are partnering to help engineers build AI agents at scale with NVIDIA Nemotron open models. Production agents thrive when built using specialized models, both closed and open, and our priority is to make developing high-performance agents accessible, performant, and cost-effective.

NVIDIA is a leader in open models that enable developers, enterprises and nations to build AI applications and agents that they can trust, control, and customize.

NVIDIA open Nemotron family combines strong reasoning performance with efficient deployment, providing open weights, training data, and recipes for building specialized AI agents. The Nemotron family offers reasoning models in 3 sizes: Nano, Super, and Ultra for different deployment needs. Nano provides cost efficiency with high accuracy specialized sub-agents, Super delivers highest efficiency with leading accuracy for reasoning and tool calling for multi-agent applications, and Ultra is designed for applications demanding the highest reasoning accuracy for complex agentic tasks.

Nemotron 3 Ultra, newest member of the Nemotron family, is a 550B-parameter Mixture-of-Experts model with 55B active parameters, built for frontier reasoning and orchestration in agentic systems.

Kilo is an all-in-one agentic engineering platform for software developers. Their open-source coding harness, Kilo Code, is a flexible AI coding assistant centered on model freedom — giving developers the ability to plug in their existing model providers alongside open-weight models. You can experience top open models on Kilo Code, like Nemotron 3 Ultra, directly in your terminal or VS Code.

FriendliAI sits directly within Kilo’s orchestrator and maximizes efficiency. Our inference platform is optimized for real-time coding workloads, giving Kilo Code the speed and efficiency it needs to scale. Through continuous batching, also known as iteration batching, and memory optimization, FriendliAI enables Kilo Code to feed massive multi-file codebases into NVIDIA Nemotron models without degrading performance, running out of memory, or spiking costs.

Together, Kilo, FriendliAI, and NVIDIA provide an open stack for building production AI coding agents—from intelligent orchestration and optimized inference to frontier reasoning. In this blog, we will walk through configuring Kilo Code with FriendliAI and Nemotron 3 Ultra in your existing workflow.

Configure NVIDIA Nemotron 3 Ultra in Kilo Code with FriendliAI

Kilo Code is built around a simple idea: developers should be free to choose the models and infrastructure that best fit their workflows. Rather than locking teams into a single AI provider, Kilo Code enables developers to connect their preferred inference platforms and open models directly into their coding environment.

By combining Kilo Code, NVIDIA Nemotron 3 Ultra, and FriendliAI, developers gain access to a frontier reasoning model optimized for complex software engineering tasks while maintaining the performance, scalability, and cost efficiency required for production workloads.

Connecting FriendliAI to Kilo Code

Getting started requires only a few configuration steps:

Deploy Nemotron 3 Ultra on FriendliAI

Begin by deploying Nemotron 3 Ultra through the Friendli Suite. Once deployed, you'll receive an Endpoint ID and API credentials that can be used by external applications.

Nvidia-Nemotron-3-Ultra model page on FriendliAI

2. Configure FriendliAI as a Custom Provider in Kilo Code

Within Kilo Code, navigate to the Provider settings and select a Custom Provider configuration. Enter your FriendliAI endpoint URL, Endpoint ID, and API key to connect your deployment.

3. Start Building with Nemotron 3 Ultra

After configuration is complete, Nemotron 3 Ultra becomes available directly within Kilo Code's model selector. Developers can immediately begin using the model for code generation, repository analysis, debugging, tool use, long-context reasoning, and multi-step agentic workflows—all without leaving their IDE.

With the integration complete, Kilo Code can route requests directly to your FriendliAI deployment, providing access to Nemotron 3 Ultra inside your existing development workflow.

Why FriendliAI for Nemotron 3

Large reasoning models are most valuable when they can process substantial context, reason across multiple files, and respond quickly enough to keep developers in flow. Running these workloads efficiently requires more than simply hosting a model.

Production inference depends on optimized scheduling, batching, memory management, and efficient GPU utilization. FriendliAI's inference platform is optimized for high-throughput, low-latency agentic workloads.

This enables coding agents powered by NVIDIA Nemotron 3 Ultra to work with large codebases and long-context prompts while maintaining responsiveness and controlling infrastructure costs.

How Kilo uses FriendliAI to optimize cost and performance

Over the past year, Kilo Code has tested several different inference providers hosting a range of both open and proprietary models. According to Kilo’s internal evaluations using GLM-5 usage, they were especially impressed by FriendliAI, which consistently delivered up to 7x faster inference than several other providers while significantly reducing error rates.

FriendliAI is now a core component of the Kilo stack, enabling high-performance access to the latest open models. To further reduce friction, Kilo developed an auto-routing feature that automatically selects the optimal model for specific tasks like planning, coding, or data analysis.

This routing is organized into four distinct modes:

Auto: Frontier, which offers maximum capability with the best available models when cost is not an issue.
Auto: Balanced, which offers strong performance at a lower cost.
Auto: Efficient, which offers the lowest cost per task, with capability matched to difficulty.
Auto: Free, which features the best available free models.

These modes leverage models from labs like OpenAI and Anthropic alongside open models from MiniMax, Z.ai, Alibaba Qwen, and NVIDIA.

The centerpiece of this collaboration is NVIDIA Nemotron 3 Ultra, an open model built for long-running agents. Nemotron 3 Ultra, recently led the open-weight category of PinchBench leaderboard, a benchmark that evaluates models’ performance on real-world agentic tasks in OpenClaw.

By intelligently routing complex reasoning tasks to Nemotron 3 Ultra, Kilo reports cost reductions of up to 72% on complex agentic coding tasks while maintaining frontier-level performance.

As agentic engineering continues to evolve to support an even wider range of tasks, Kilo is excited to continue working with both FriendliAI and NVIDIA to optimize cost and performance hand-in-hand. This strategic partnership integrates Kilo Code's sophisticated agent orchestration layer with FriendliAI’s optimized inference platform and NVIDIA’s remarkably flexible Nemotron open models. There’s no longer a reason to overpay for dependable AI.

Build the stack that works for you.

Production AI coding agents work best when each layer of the stack handles what it's built for. Kilo Code orchestrates across models intelligently, FriendliAI delivers production-grade inference performance that keeps developers productive. NVIDIA Nemotron 3 Ultra open model provides advanced reasoning that you can customize, fine-tune, and deploy for specialized AI agents.

Together, they're already powering real results — Kilo users are running 7x faster on FriendliAI and cutting costs by up to 72% on complex agentic tasks by routing to Nemotron 3 Ultra. Open, optimized AI development is no longer a trade-off between performance and price.

Together, the three layers give developers everything needed to ship production-grade AI coding agents without compromising on performance or cost, while leaving more room for model freedom.

Explore NVIDIA Nemotron 3 Ultra, bring frontier open models directly into your IDE with Kilo Code, and get started with FriendliAI to deploy and serve your favorite models with optimized inference.

Written by

FriendliAI Tech & Research

General FAQ

What is FriendliAI?

FriendliAI is the Frontier Inference Cloud for Agents, delivering high throughput, low latency, and reliability at scale for agentic workloads. Through vertically optimized inference infrastructure, it delivers 2–5× faster output token speed and a 99.99% uptime SLA for high-volume production traffic.

How does FriendliAI reduce inference costs?

FriendliAI reduces inference costs through higher GPU utilization and optimized inference performance. FriendliAI's patented continuous batching technique, along with quantization, speculative decoding, KV cache offloading, multi-LoRA serving, and autoscaling, helps you serve more tokens with fewer GPUs, lowering your infrastructure costs without sacrificing performance.

Why should I choose FriendliAI over other inference providers?

FriendliAI is built for production AI agents, combining speed, reliability, and efficiency at scale. It delivers low-latency streaming, reliable long-context inference, and robust tool calling without compromising stability. According to independent OpenRouter benchmarks, FriendliAI consistently ranks among the top providers for throughput, latency, and reliability across leading open-weight models. See why customers choose FriendliAI

Which open-weight models does FriendliAI support?

Run today’s frontier open-weight models—including GLM, MiniMax, Kimi, DeepSeek, Qwen, Gemma, and more—with a simple API call. FriendliAI Model API gives you instant access to the latest models with optimized inference performance for production workloads. Explore models and pricing

How do I get started?

Getting started takes just a few minutes. [1] Sign up for FriendliAI, [2] Generate your API key, and [3] Make your first inference request with frontier open-weight models.

Still have questions?

If you want a customized solution for that key issue that is slowing your growth, support@friendli.ai or click Talk to an engineer — our engineers (not a bot) will reply within one business day.