• October 30, 2023
  • 2 min read

Comparing two LLM serving frameworks: Friendli Inference vs. vLLM


FriendliAI is on a mission to supercharge generative AI serving. Driving this is Friendli Inference, our cutting-edge engine that makes serving generative AI, such as LLMs, easier, cheaper, and faster than ever before. In this analysis, we show that Friendli Inference is significantly faster than vLLM, another serving framework.

Friendli Inference is blazingly fast at serving LLMs (large language models). Friendli Inference was born out of our Orca research; the Orca paper was published at OSDI 2022. Our team has built many different optimizations into Friendli Inference, but one of the most important, which we are proud to have pioneered, is iteration batching. Iteration batching is protected by our patents in the US and Korea and cannot be used without our authorization.
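To give a flavor of the idea, here is a minimal conceptual sketch of iteration-level batching in plain Python. It is not our implementation, and the names involved (engine.step, eos_token_id, req.output, req.max_tokens) are hypothetical placeholders: the point is that the scheduler re-forms the batch at every generation step, so finished sequences leave immediately and queued requests join without waiting for the whole batch to drain.

```python
from collections import deque

def serve(engine, request_queue: deque, max_batch_size: int = 8):
    """Conceptual sketch of iteration-level batching (hypothetical engine API)."""
    active = []  # requests currently being decoded
    while request_queue or active:
        # Admit waiting requests into the batch at every iteration.
        while request_queue and len(active) < max_batch_size:
            active.append(request_queue.popleft())

        # Run a single decoding iteration: one new token per active request.
        tokens = engine.step(active)

        # Retire requests that finished; the rest stay for the next iteration.
        still_running = []
        for req, tok in zip(active, tokens):
            req.output.append(tok)
            if tok != engine.eos_token_id and len(req.output) < req.max_tokens:
                still_running.append(req)
        active = still_running
```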

Recently, we’ve seen other frameworks claim to surpass Orca’s performance. Because Orca is not publicly available and Friendli Inference is only used in production, outside parties are not able to compare Friendli Inference directly with other LLM serving frameworks. Thus, we decided to share some performance results to set things straight and illustrate the power of Friendli Inference.

In particular, we compare Friendli Inference with vLLM. vLLM is an LLM serving framework by a team of researchers mostly from UC Berkeley. Of course, Friendli Inference has many more features, such as supporting encoder-decoder models and quantized models, but we will focus on performance here.

The vLLM team released a research paper describing vLLM, which they presented at SOSP 2023 and which is now available on arXiv. At the beginning of the paper, the authors claim that vLLM improves throughput compared to systems like Orca, but later in the paper they explain that “[they] implement [their] own version of Orca,” assuming various implementation details. The paper thus does not actually compare vLLM to Orca, nor does it compare vLLM to Friendli Inference. To test their claim, we decided to compare vLLM with Friendli Inference ourselves.

Comparison

vLLM and Friendli Inference comparison

We run Meta’s Llama 2 13B model on an NVIDIA A100 80GB GPU. For the workload, we use an online inference serving scenario, generating traffic with a Poisson process and drawing generation requests from the Databricks Dolly dataset.
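To make the workload concrete, here is a small illustrative sketch (not our actual benchmark harness) of how request arrival times can be drawn from a Poisson process: inter-arrival gaps are exponentially distributed with the target request rate. The rate, request count, and seed below are arbitrary example values.

```python
import random

def poisson_arrival_times(num_requests: int, requests_per_sec: float, seed: int = 0):
    """Return arrival timestamps (seconds) for a Poisson process at the given rate."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_requests):
        t += rng.expovariate(requests_per_sec)  # exponential inter-arrival gap
        times.append(t)
    return times

# Example: 1,000 requests at an average load of 2 requests per second.
arrivals = poisson_arrival_times(1000, 2.0)
```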

Under the same load, Friendli Inference shows approximately 2x to 4x lower latency than vLLM. Under the same latency requirement, Friendli Inference achieves up to 6x higher throughput than vLLM, with the degree of improvement depending on the latency requirement, model, and workload.

Summary

Friendli Inference is highly optimized to make LLM serving fast and cost-effective. It’s the fastest on the market, with our performance testing showing that Friendli Inference is significantly faster than vLLM. We’re thrilled to be helping companies relax and cut costs while running their LLMs with Friendli Inference.

There are two ways to use Friendli Inference: Friendli Container and Friendli Dedicated Endpoints. Try Friendli Container for four weeks free of charge in your own environment, or sign up and start using Friendli Dedicated Endpoints, our managed service that eliminates all the operational burdens of serving LLMs. Get started today with Friendli Inference!


Written by

FriendliAI Tech & Research




General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Friendli Inference allows you to squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the metric that actually matters, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing
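As a hypothetical back-of-the-envelope illustration (the throughput and price figures below are made up, not measured results), tokens per dollar follows directly from per-GPU throughput and the hourly GPU rate:

```python
def tokens_per_dollar(tokens_per_sec_per_gpu: float, gpu_hourly_rate: float) -> float:
    """Tokens generated per dollar of GPU time (hypothetical inputs)."""
    return tokens_per_sec_per_gpu * 3600 / gpu_hourly_rate

# Same GPU price per hour; 3x higher throughput means 3x more tokens per dollar.
baseline = tokens_per_dollar(tokens_per_sec_per_gpu=1_000, gpu_hourly_rate=4.0)
optimized = tokens_per_dollar(tokens_per_sec_per_gpu=3_000, gpu_hourly_rate=4.0)
print(baseline, optimized)
```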

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multimodal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub takes you to our model deployment page with one click. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, our managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Contact Sales. Our experts (not a bot) will reply within one business day.