Comparing two LLM serving frameworks: Friendli Engine vs. vLLM

FriendliAI is on a mission to supercharge generative AI serving. At the core of that mission is Friendli Engine, our cutting-edge engine that makes serving generative AI models such as LLMs easier, cheaper, and faster than ever before. In this analysis, we show that Friendli Engine is significantly faster than vLLM, another LLM serving framework.

Friendli Engine is blazingly fast at serving LLMs (large language models). It grew out of our Orca research; the Orca paper was published at OSDI 2022. Our team has built many optimizations into Friendli Engine, but one we are especially proud to have pioneered is iteration batching: instead of waiting for an entire batch of requests to finish, the engine forms a new batch at every generation iteration, so incoming requests join and completed requests leave immediately. Iteration batching is protected by our patents in the US and Korea and cannot be used without our authorization.
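To make the idea concrete, here is a minimal conceptual sketch of iteration-level batching in Python. It is not Friendli Engine's implementation (which is proprietary); the `Request` class, the `model_step` stub, and the fixed batch-size cap are illustrative assumptions.

```python
# Conceptual sketch of iteration batching (iteration-level scheduling), for illustration only.
# Not Friendli Engine's implementation; `Request` and `model_step` are hypothetical stand-ins.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_tokens: list
    max_new_tokens: int
    output_tokens: list = field(default_factory=list)

    def finished(self) -> bool:
        return len(self.output_tokens) >= self.max_new_tokens


def model_step(batch):
    """Hypothetical stand-in: run ONE decoding iteration for every request
    in the batch and return one new token per request."""
    return [0 for _ in batch]


def serve(waiting: deque, max_batch_size: int = 8):
    running = []
    while waiting or running:
        # Admit newly arrived requests at every iteration,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decoding iteration over the current batch.
        for req, tok in zip(running, model_step(running)):
            req.output_tokens.append(tok)
        # Retire finished requests immediately so their slots free up.
        running = [r for r in running if not r.finished()]


if __name__ == "__main__":
    requests = deque(Request(prompt_tokens=[1, 2, 3], max_new_tokens=n) for n in (2, 5, 3))
    serve(requests)
```

The key property is that requests with short outputs stop occupying batch slots as soon as they finish, and new requests do not have to wait for the slowest request in the current batch.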

Recently, we’ve seen other frameworks claim to surpass Orca’s performance. Because Orca is not publicly available and Friendli Engine is used only in production deployments, outside parties have not been able to compare Friendli Engine directly with other LLM serving frameworks. We therefore decided to share some performance results to set things straight and illustrate the power of Friendli Engine.

In particular, we compare Friendli Engine with vLLM, an LLM serving framework built by a team of researchers mostly from UC Berkeley. Of course, Friendli Engine has many more features, such as support for encoder-decoder models and quantized models, but we focus on performance here.

The vLLM team released a research paper describing vLLM, which they presented at SOSP 2023 and which is now available on arXiv. At the beginning of the paper, the authors claim that vLLM improves throughput compared to systems like Orca, but later they explain that “[they] implement [their] own version of Orca,” making assumptions about various implementation details. The paper therefore does not actually compare vLLM to Orca, nor does it compare vLLM to Friendli Engine. To test the claim, we decided to compare vLLM with Friendli Engine ourselves.

Comparison

[Figure: vLLM and Friendli Engine comparison]

We run Meta's Llama 2 13B model on an NVIDIA A100 80GB GPU. For the workload, we use an online inference serving scenario, generating traffic with a Poisson process whose generation requests are derived from the Databricks Dolly dataset.
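For readers who want to set up a similar workload, here is a minimal sketch of how such traffic can be generated. It is not our exact benchmark harness; the `send_request` callback, the request rate, and the way prompts are taken from the dataset are assumptions for illustration.

```python
# Illustrative sketch: replay prompts against a serving endpoint with Poisson arrivals.
# Not the exact harness used in this post; `send_request` is an assumed callback.
import random
import time


def poisson_gaps(rate_rps: float, num_requests: int, seed: int = 0):
    """Yield inter-arrival gaps (seconds) of a Poisson process with the given request rate."""
    rng = random.Random(seed)
    for _ in range(num_requests):
        yield rng.expovariate(rate_rps)


def run_benchmark(prompts, rate_rps: float, send_request):
    """Send prompts (e.g., instructions sampled from the Databricks Dolly dataset)
    at Poisson-distributed arrival times."""
    for prompt, gap in zip(prompts, poisson_gaps(rate_rps, len(prompts))):
        time.sleep(gap)        # wait for the next arrival
        send_request(prompt)   # fire the generation request (assumed to be non-blocking)


if __name__ == "__main__":
    demo_prompts = ["Explain iteration batching.", "Summarize the Dolly dataset."]
    run_benchmark(demo_prompts, rate_rps=2.0, send_request=lambda p: print("sent:", p))
```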

Under the same load, Friendli Engine shows roughly 2x to 4x lower latency than vLLM. Under the same latency requirement, it achieves up to roughly 6x higher throughput than vLLM, with the exact improvement depending on the latency requirement, model, and workload.

Summary

Friendli Engine is highly optimized to make LLM serving fast and cost-effective. It's the fastest on the market; our performance testing shows that Friendli Engine is significantly faster than vLLM. We're thrilled to be helping companies relax and cut costs while running their LLMs with Friendli Engine.

There are two ways to use Friendli Engine: Friendli Container and Friendli Dedicated Endpoints. Try Friendli Container for four weeks free of charge in your own environment, or sign up and start using Friendli Dedicated Endpoints, our managed service that eliminates all the operational burdens of serving LLMs. Get started today with Friendli Engine!


