Comparing two LLM serving frameworks: PeriFlow vs. vLLM


FriendliAI is on a mission to supercharge generative AI serving. Driving this is PeriFlow, our cutting-edge engine that makes serving generative AI, such as LLMs, easier, cheaper, and faster than ever before. In this analysis, we show that PeriFlow is significantly faster than vLLM, another serving framework.

PeriFlow is blazingly fast at serving LLMs (large language models). PeriFlow was born out of our Orca research, published at OSDI 2022. Our team has built many optimizations into PeriFlow, but one we are proud to have pioneered is iteration batching, which is protected by our patents in the US and Korea and cannot be used without our authorization.
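To make the idea concrete, here is a minimal conceptual sketch of iteration-level batching. It is not PeriFlow's actual implementation: the scheduler works at the granularity of a single decoding iteration, so finished sequences leave the batch immediately and waiting requests join without waiting for the whole batch to drain. The `model.step` and `seq.complete` calls and the FIFO queue are illustrative assumptions.

```python
# Conceptual sketch of iteration-level batching (illustrative only, not
# PeriFlow's code). Assumptions: model.step(batch) runs one decoding
# iteration for every active sequence and returns those that just emitted
# an end-of-sequence token; queue is a FIFO of pending requests.
from collections import deque

def serve(model, queue: deque, max_batch_size: int):
    active = []  # sequences currently being decoded
    while queue or active:
        # Admit new requests at every iteration, not only when a batch finishes.
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())

        # One decoding iteration for the whole batch.
        finished = model.step(active)

        # Finished sequences return to the caller right away and free their
        # batch slot, so a long request never blocks short ones behind it.
        for seq in finished:
            seq.complete()
            active.remove(seq)
```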

Recently, we’ve seen other frameworks claim to surpass Orca’s performance. Because Orca is not publicly available and PeriFlow is used only in production settings, outside parties cannot compare PeriFlow directly with other LLM serving frameworks. We therefore decided to share some performance results to set things straight and illustrate the power of PeriFlow.

In particular, we compare PeriFlow with vLLM. vLLM is an LLM serving framework by a team of researchers mostly from UC Berkeley. Of course, PeriFlow has many more features, such as supporting encoder-decoder models and quantized models, but we will focus on performance here.

The vLLM team released a research paper describing vLLM, presented at SOSP 2023 and available on arXiv. At the beginning of the paper, the authors claim that vLLM improves throughput compared to systems like Orca, but later in the paper they explain that “[they] implement [their] own version of Orca,” assuming various implementation details. The paper therefore compares vLLM against a reimplementation rather than the real Orca, and it does not compare vLLM to PeriFlow at all. To test their claim, we decided to compare vLLM with PeriFlow ourselves.

Comparison

We run Meta’s Llama 2 13B model on a single NVIDIA A100 80GB GPU. For the workload, we use an online inference serving scenario: request arrivals follow a Poisson process, and the generation requests are derived from the Databricks Dolly dataset.
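For readers who want to build a similar setup, the following is a minimal sketch of this kind of load generator under our own assumptions: prompts come from a local copy of databricks-dolly-15k.jsonl, and `send_request` is a hypothetical client call to whichever serving endpoint is being measured. It is not the exact harness used in our measurements.

```python
# Minimal load-generator sketch for an online serving benchmark
# (illustrative; not the exact harness used in our measurements).
import json
import random
import time

def load_prompts(path: str = "databricks-dolly-15k.jsonl") -> list[str]:
    # Each line of the Dolly dataset is a JSON object with an "instruction" field.
    with open(path) as f:
        return [json.loads(line)["instruction"] for line in f]

def send_request(prompt: str) -> None:
    # Placeholder: a real benchmark would POST the prompt to the serving
    # endpoint (PeriFlow or vLLM) asynchronously and record its latency.
    pass

def run_load(prompts: list[str], rate_rps: float = 2.0,
             duration_s: float = 60.0, seed: int = 0) -> None:
    """Issue requests with exponentially distributed inter-arrival gaps,
    which produces Poisson-distributed arrivals at rate_rps requests/sec."""
    rng = random.Random(seed)
    start = time.time()
    while time.time() - start < duration_s:
        send_request(rng.choice(prompts))
        time.sleep(rng.expovariate(rate_rps))  # exponential gap => Poisson arrivals
```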

Under the same load, PeriFlow shows approximately 2x to 4x lower latency than vLLM. Under the same latency requirement, PeriFlow achieves up to roughly 6x higher throughput than vLLM, with the exact improvement depending on the latency requirement, model, and workload.

Summary

PeriFlow is highly optimized to make LLM serving fast and cost-effective. It is the fastest engine on the market: our performance testing shows that PeriFlow is significantly faster than vLLM. We’re thrilled to help companies cut costs, and worry less, while running their LLMs with PeriFlow.

There are two ways to use PeriFlow: PeriFlow Container and PeriFlow Cloud. Try PeriFlow Container for four weeks free of charge in your own environment, or sign up and start using PeriFlow Cloud, our managed service that eliminates all the operational burdens of serving LLMs. Get started today with PeriFlow!


