January 12, 2024

LLM Serving Engine Comparative Analysis: Friendli Inference vs. vLLM vs. TensorRT-LLM

The Rise of LLMs, the Challenge of Serving: Large language models (LLMs) like Meta’s Llama 2 are revolutionizing text generation, dialogue, and code creation. But unleashing their full potential requires efficient serving engines, the bridge between models and real-world applications. Today, we dive into a head-to-head comparison of three popular serving engines (TensorRT-LLM, vLLM, and Friendli Inference) to see how each performs under enterprise-scale load.

The Serving Engines:

TensorRT-LLM: NVIDIA’s recently released serving engine, which integrates easily with NVIDIA's Triton inference servers. TensorRT-LLM provides a way to serve LLMs, but it is still a young project and has some rough edges. For example, it requires repeated “engine build” runs to find an optimal configuration, a step that is crucial for getting the best performance out of the engine without out-of-memory (OOM) errors.
vLLM: UC Berkeley's open-source option. It is simple to use, but in our evaluation it lacked the scalability and optimizations needed for heavy workloads, which led to high latencies and low throughput.
Friendli Inference: Our very own serving engine, built for low-latency, high-throughput serving of diverse generative AI models, including LLMs. Its optimizations cover a wide range of use cases, delivering high serving performance while sparing users the configuration burden.

The Evaluation Set-up:

We put all three engines to the test, serving Meta's Llama 2 70B model on four NVIDIA A100 80GB GPUs and simulating varying load intensities with the Databricks Dolly dataset. The evaluation deliberately includes very stressful cases with high input load, so that each engine is tested under enterprise-scale conditions. Requests arrive according to a Poisson process, which makes the scenario more realistic and more challenging than experiments with a static number of concurrent requests. We vary the Poisson rate parameter from 1N to 5N req/s throughout the evaluation, resulting in about 20 to 200 concurrent requests on average. The metric shown on the graph is the 90th percentile of time per output token (p90 TPOT, lower is better), computed over all requests in the experiment.
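To make the set-up more concrete, here is a minimal Python sketch of the two ingredients described above: Poisson request arrivals and a 90th percentile over per-request TPOT values. It is not our benchmark harness. The function names, the seed, the nearest-rank percentile rule, and the TPOT definition used here (time between the first and last output token divided by the number of token gaps) are all assumptions made for illustration.

```python
import math
import random


def poisson_arrival_times(rate_per_s, duration_s, seed=0):
    """Arrival timestamps (in seconds) of a Poisson process.

    The gaps between consecutive arrivals of a Poisson process are
    exponentially distributed with mean 1 / rate_per_s.
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate_per_s)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)
    return times


def time_per_output_token(first_token_s, last_token_s, num_output_tokens):
    """Assumed TPOT definition: mean gap between consecutive output tokens."""
    if num_output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (last_token_s - first_token_s) / (num_output_tokens - 1)


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical load: 2 req/s for one minute.
    arrivals = poisson_arrival_times(rate_per_s=2.0, duration_s=60.0)
    print(f"{len(arrivals)} requests scheduled over 60 s")

    # Placeholder TPOT values in seconds, not measured results.
    tpots = [0.031, 0.035, 0.040, 0.042, 0.047, 0.051, 0.058, 0.066, 0.079, 0.120]
    print(f"p90 TPOT: {percentile(tpots, 90) * 1000:.0f} ms")
```

A load generator would send one request at each timestamp regardless of how many earlier requests are still in flight. That is why the number of concurrent requests grows as the engine slows down, unlike a test with a fixed concurrency level.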

The Results:

[Figure: p90 TPOT comparison for Meta's Llama 2 70B on four NVIDIA A100 80GB GPUs with the Databricks Dolly dataset]

vLLM: Showed decent performance at low loads, but quickly crumbled under pressure, failing to respond within the allotted time under moderate load. TPOTs exceeding the threshold (>1000 ms, >2 seconds per word) are omitted from the graph.
TensorRT-LLM: Showed initial promise, outperforming vLLM at moderate load. However, as the pressure intensified, latency escalated to the point where additional GPUs would be needed to maintain service level objectives.
Friendli Inference: Made all the difference, remaining stable at every load level in the evaluation. Even at extremely high loads, it kept the 90th percentile latency within 100 ms per token, and its p90 TPOT was lower than that of the other engines at every load level we tested. Under an 80 ms constraint on the p90 TPOT, Friendli Inference sustains 4x the throughput of TensorRT-LLM (4N req/s vs. 1N req/s).
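The throughput comparison under a latency constraint can be sketched as follows. This is only an illustration of the calculation: `max_rate_within_slo` is a hypothetical helper, the engine labels are generic, and the latency numbers are placeholders chosen to mirror the 4N vs. 1N outcome. They are not values read from the graph. Rates are expressed in multiples of N.

```python
def max_rate_within_slo(measurements, slo_ms):
    """Highest request rate whose p90 TPOT meets the SLO.

    `measurements` is a list of (rate, p90_tpot_ms) pairs. Rates are
    scanned in increasing order and the scan stops at the first
    violation, so a rate only counts if every lower rate also passes.
    Returns None when even the lowest rate violates the SLO.
    """
    best = None
    for rate, p90_ms in sorted(measurements):
        if p90_ms > slo_ms:
            break
        best = rate
    return best


if __name__ == "__main__":
    # Placeholder (rate in multiples of N, p90 TPOT in ms) pairs.
    engine_a = [(1, 40), (2, 50), (3, 60), (4, 75), (5, 95)]
    engine_b = [(1, 70), (2, 120), (3, 300)]

    rate_a = max_rate_within_slo(engine_a, slo_ms=80)
    rate_b = max_rate_within_slo(engine_b, slo_ms=80)
    print(f"engine_a: {rate_a}N req/s, engine_b: {rate_b}N req/s")
    print(f"throughput ratio under the SLO: {rate_a / rate_b:.1f}x")
```

Stopping at the first violation is a design choice: it treats the SLO as a ceiling on sustainable load, so an engine gets no credit for a higher rate that happens to pass after a lower one failed.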

Why Friendli? The Verdict is In:

The results paint a clear picture. Friendli Inference handled every load level in our evaluation, offering:

Unmatched Scalability: Handles even the most demanding workloads in our tests without sacrificing performance, and stays stable under high input load.
Low Latency and High Throughput: Responds to prompts lightning-fast, ensuring smooth user experiences.
Stability and Reliability: Count on consistent performance no matter the pressure.

Ready to Unleash the Power of Your LLM? Experience Friendli Inference's magic firsthand! We offer three options to suit your preferences:

Friendli Dedicated Endpoints: Run any custom generative AI model on dedicated GPU instances on autopilot.
Friendli Serverless Endpoints: No setup required, simply call our APIs and let us handle the rest.
Friendli Container: Deploy the engine on your own infrastructure for ultimate control.

Visit Friendli to begin your journey into the world of high-performance LLM serving with Friendli Inference!

Written by

FriendliAI Tech & Research

General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Friendli Inference allows you to squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Contact Sales. Our experts (not a bot) will reply within one business day.