- September 27, 2023
Iteration batching (aka continuous batching) to increase LLM inference serving throughput
FriendliAI is on a mission to supercharge generative AI serving. Driving this is PeriFlow, our cutting-edge engine that makes serving generative AI (LLMs, etc.) easier, cheaper, and faster than ever before.
PeriFlow is blazingly fast at serving generative AI models, especially large language models (LLMs). It grew out of our Orca research paper, published at OSDI 2022. PeriFlow contains many deep optimizations that speed up LLM serving, and it supports a wide range of models and workloads, from language models to multi-modal models.
One such optimization is a method we invented and named iteration batching: batching driven by iteration-level scheduling. The method is protected by our patents in the US and Korea and cannot be used without our authorization.
How does iteration-level scheduling help? We observed that existing systems for LLM inference serving perform poorly because they cannot change the set of requests in a batch once it has started. A request that finishes earlier than the others in its batch cannot be returned to the client right away, and a newly queued request cannot start until the current batch has completely finished. Iteration batching solves both problems by changing the requests that make up the batch while it is in progress: after every iteration, finished requests leave the batch and waiting requests can join it.
Iteration batching can achieve up to tens of times higher throughput than conventional batching while satisfying the same latency requirement. If you encounter this type of batching in any other framework, it’s probably iteration batching, regardless of branding.
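To make the difference concrete, here is a toy simulation of the two scheduling policies. It is only a sketch under simplifying assumptions, not PeriFlow's or Orca's implementation: every iteration is assumed to cost one time unit regardless of batch size, each request needs a fixed, known number of iterations (at least one), and prompt processing and memory limits are ignored. The names `Request`, `conventional_batching`, and `iteration_batching` are hypothetical and exist only for this illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    arrival: int  # time (in iterations) at which the request enters the queue
    tokens: int   # number of iterations the request needs (assumed >= 1)


def conventional_batching(requests, max_batch):
    """Request-level scheduling: a batch is fixed until all its members finish."""
    pending = deque(sorted(requests, key=lambda r: r.arrival))
    finish, now = {}, 0
    while pending:
        # If nothing has arrived yet, idle until the next arrival.
        now = max(now, pending[0].arrival)
        batch = []
        while pending and pending[0].arrival <= now and len(batch) < max_batch:
            batch.append(pending.popleft())
        # The whole batch runs for as long as its longest request.
        now += max(r.tokens for r in batch)
        for r in batch:
            finish[r.name] = now  # short requests return only with the batch
    return finish


def iteration_batching(requests, max_batch):
    """Iteration-level scheduling: batch membership is revisited every iteration."""
    pending = deque(sorted(requests, key=lambda r: r.arrival))
    running = {}  # request name -> remaining iterations
    finish, now = {}, 0
    while pending or running:
        # Admit already-arrived requests into any free batch slots.
        while pending and pending[0].arrival <= now and len(running) < max_batch:
            r = pending.popleft()
            running[r.name] = r.tokens
        if not running:
            now = pending[0].arrival  # idle until the next arrival
            continue
        # Run one iteration: every running request produces one token.
        now += 1
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                finish[name] = now  # return to the client immediately
                del running[name]
    return finish


if __name__ == "__main__":
    workload = [
        Request("A", arrival=0, tokens=2),
        Request("B", arrival=0, tokens=8),
        Request("C", arrival=1, tokens=3),
        Request("D", arrival=2, tokens=2),
    ]
    print("conventional:", conventional_batching(workload, max_batch=2))
    print("iteration:   ", iteration_batching(workload, max_batch=2))
```

In this toy workload with a batch size of 2, conventional batching holds the short request A until time 8, when the long request B finishes, and completes all four requests at time 11. Iteration batching returns A at time 2, lets C and D take over the freed slot in turn, and completes everything at time 8.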
The following animations show how iteration batching is different from conventional batching.