Iteration batching (a.k.a. continuous batching) to increase LLM inference serving throughput

FriendliAI is on a mission to supercharge generative AI serving. Driving this is Friendli Engine, our cutting-edge engine that makes serving generative AI (LLMs, etc.) easier, cheaper, and faster than ever before.

Friendli Engine is blazingly fast at serving generative AI models, especially large language models (LLMs). It was born out of our Orca research paper published in OSDI 2022. Friendli Engine supports a wide range of models and workloads, from language models to multi-modal models, and contains many deep optimizations to speed up LLM serving.

One such optimization is iteration batching, a method we invented that batches requests through iteration-level scheduling. The method is protected by patents in the US and Korea and cannot be used without our authorization.

How does iteration-level scheduling help? We observed that existing systems for LLM inference serving perform poorly because they cannot change the composition of a batch once it starts. Requests that finish earlier than others in the batch cannot return to the client immediately, and newly queued requests must wait until the current batch finishes entirely. Iteration batching solves both problems by dynamically changing which requests make up the batch while it is in progress. It can achieve up to tens of times higher throughput than conventional batching while satisfying the same latency requirement. If you encounter this type of batching in any other framework, it’s probably iteration batching, regardless of its branding.
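The scheduling difference can be sketched in a few lines. The toy scheduler below is our own illustration, not Friendli Engine code: the `Request` class and `iteration_batching` function are hypothetical names, and each "iteration" stands in for one token-generation step. The key point is that the batch is rebuilt every iteration, so finished requests leave immediately and waiting requests are admitted mid-generation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """A hypothetical generation request: how many tokens it still needs."""
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens


def iteration_batching(requests, max_batch=4):
    """Toy iteration-level scheduler: re-forms the batch on every iteration."""
    queue = deque(requests)
    batch, completed, steps = [], [], 0
    while queue or batch:
        # Admit waiting requests the moment a slot opens -- even mid-generation.
        while queue and len(batch) < max_batch:
            batch.append(queue.popleft())
        # One decoding iteration: every active request emits one token.
        for req in batch:
            req.generated += 1
        steps += 1
        # Finished requests return immediately; they don't wait for batch-mates.
        completed += [r for r in batch if r.finished()]
        batch = [r for r in batch if not r.finished()]
    return completed, steps
```

With requests needing 2, 5, and 3 tokens and a batch size of 2, this scheduler finishes in 5 iterations, whereas conventional batching would run the first pair for max(2, 5) = 7 iterations before the third request could even start, for 10 total.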

The following animations show how iteration batching is different from conventional batching.

Iteration batching