- October 30, 2023
Comparing two LLM serving frameworks: PeriFlow vs. vLLM
FriendliAI is on a mission to supercharge generative AI serving. Driving this is PeriFlow, our cutting-edge engine that makes serving generative AI, such as LLMs, easier, cheaper, and faster than ever before. In this analysis, we show that PeriFlow is significantly faster than vLLM, another serving framework.
PeriFlow is blazingly fast at serving LLMs (large language models). PeriFlow was born out of our Orca research; the Orca paper was published at OSDI 2022. Our team has built many different optimizations into PeriFlow, and one of the most important that we are proud to have pioneered is iteration batching. Iteration batching is protected by our patents in the US and Korea and cannot be used without our authorization.
Recently, we’ve seen other frameworks claim to surpass Orca’s performance. Because Orca is not publicly available and PeriFlow is only used in production, outside sources are not able to compare PeriFlow directly with other LLM frameworks. Thus, we decided to share some performance results to set things straight and illustrate the power of PeriFlow.
In particular, we compare PeriFlow with vLLM. vLLM is an LLM serving framework by a team of researchers mostly from UC Berkeley. Of course, PeriFlow has many more features, such as supporting encoder-decoder models and quantized models, but we will focus on performance here.
The vLLM team released a research paper describing vLLM, which they presented at SOSP 2023 and which is now available on arXiv. At the beginning of the paper, the authors claim that vLLM improves throughput compared to systems like Orca, but later in the paper they explain that “[they] implement [their] own version of Orca,” assuming various implementation details. The paper thus does not actually compare vLLM to Orca, nor does it compare vLLM to PeriFlow. To test their claim, we decided to compare vLLM with PeriFlow ourselves.
We run the Llama 2 13B model from Meta on an NVIDIA A100 80GB GPU. For workloads, we use the online inference serving scenario, generating traffic using a Poisson process with generation requests derived from the Databricks Dolly dataset.
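For readers who want to reproduce this kind of workload, here is a minimal sketch of Poisson traffic generation in Python. It is not our benchmark harness: the function names, the request rate, and the stand-in prompts are all hypothetical, and in a real run the prompts would be drawn from the Databricks Dolly dataset. The sketch relies on the fact that the gaps between arrivals in a Poisson process are independent exponential random variables.

```python
import random


def poisson_arrival_times(rate_per_sec, duration_sec, seed=0):
    """Return arrival timestamps (in seconds) of a Poisson process.

    Inter-arrival gaps are drawn from an exponential distribution with
    mean 1 / rate_per_sec, which is what defines a Poisson process.
    """
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    arrivals = []
    t = rng.expovariate(rate_per_sec)
    while t < duration_sec:
        arrivals.append(t)
        t += rng.expovariate(rate_per_sec)
    return arrivals


def build_schedule(prompts, rate_per_sec, duration_sec, seed=0):
    """Pair each arrival time with a prompt, cycling through the prompt list."""
    times = poisson_arrival_times(rate_per_sec, duration_sec, seed)
    return [(t, prompts[i % len(prompts)]) for i, t in enumerate(times)]


if __name__ == "__main__":
    # Hypothetical stand-in prompts; not taken from the Dolly dataset.
    prompts = ["Summarize this paragraph.", "Explain what a GPU is."]
    for t, prompt in build_schedule(prompts, rate_per_sec=2.0, duration_sec=5.0):
        print(f"{t:7.3f}s  {prompt}")
```

A load generator would then sleep until each timestamp and send the paired prompt to the serving endpoint. Raising `rate_per_sec` increases the offered load, which is how one sweeps from light to heavy traffic.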
Under the same load, PeriFlow achieves approximately 2x to 4x lower latency than vLLM. Under the same latency requirement, PeriFlow achieves up to approximately 6x higher throughput than vLLM, with the degree of improvement depending on the latency requirement, model, and workload.
PeriFlow is highly optimized to make LLM serving fast and cost-effective. It’s the fastest on the market, with our performance testing showing that PeriFlow is significantly faster than vLLM. We’re thrilled to be helping companies relax and cut costs while running their LLMs with PeriFlow.
There are two ways to use PeriFlow: PeriFlow Container and PeriFlow Cloud. Try PeriFlow Container for four weeks free of charge in your own environment, or sign up and start using PeriFlow Cloud, our managed service that eliminates all the operational burdens of serving LLMs. Get started today with PeriFlow!
FriendliAI Tech & Research