September 27, 2023
1 min read

Iteration Batching (a.k.a. Continuous Batching): Accelerate LLM Inference Serving with Flexible Scheduling

FriendliAI is on a mission to supercharge generative AI serving. Driving this is Friendli Inference, our cutting-edge engine that makes serving generative AI (LLMs, etc.) easier, cheaper, and faster than ever before.

Friendli Inference is blazingly fast at serving generative AI models, especially large language models (LLMs). It was born out of our Orca research paper published in OSDI 2022. Friendli Inference supports a wide range of models and workloads, from language models to multi-modal models, and contains many deep optimizations to speed up LLM serving.

One such optimization is a method we invented and named iteration batching (a.k.a. continuous batching), which is batching through iteration-level scheduling. Our method is protected by patents in the US and Korea.

How does iteration-level scheduling help? We observed that existing systems for LLM inference serving perform poorly due to their inflexibility around changing the current batch of requests. Requests that have finished earlier than other requests in a batch cannot return immediately to the client, while newly queued requests must wait to begin until the current batch completely finishes. The iteration batching we invented solves both of these problems by dynamically changing the requests that make up the batch while it is in progress. Iteration batching can achieve up to tens of times higher throughput than conventional batching while satisfying the same latency requirement. If you encounter this type of batching in any other framework, it’s probably iteration batching, regardless of its branding.

The following animations show how iteration batching is different from conventional batching.

Iteration batching

Conventional batching

To summarize, Friendli Inference is highly optimized to make LLM serving fast and cost-effective. One of these optimizations is iteration batching, which we are proud to have pioneered. There are currently two ways to use Friendli Inference: Friendli Container and Friendli Dedicated Endpoints. Get started today with Friendli Inference!

Written by

FriendliAI Tech & Research

General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: Unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you in here.

How does FriendliAI help my business?

Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for that key issue that is slowing your growth, contact@friendli.ai or click Contact Sales — our experts (not a bot) will reply within one business day.