- January 12, 2024
The LLM Serving Engine Showdown: Friendli Engine Outshines
The Rise of LLMs, the Challenge of Serving: Large Language Models (LLMs) like Meta’s Llama 2 are revolutionizing text generation, dialogue, and code creation. But unleashing their full potential requires an efficient serving engine, the bridge between models and real-world applications. Today, we dive into a head-to-head comparison of three popular engines (TensorRT-LLM, vLLM, and Friendli Engine) to see how they perform at enterprise scale.
The Serving Engines:
- TensorRT-LLM: NVIDIA’s recently released serving engine, which integrates easily with NVIDIA's Triton Inference Server. TensorRT-LLM provides a way to serve LLMs, but it is still a young project and has some rough edges. For example, finding an optimal configuration requires repeated “engine build” procedures, which is crucial for getting the most out of the engine without out-of-memory (OOM) errors.
- vLLM: UC Berkeley's open-source option. It is simple to use, but in our evaluation it lacked the scalability and optimizations needed for heavy workloads, which led to high latencies and low throughput.
- Friendli Engine: Our very own serving engine, built for low-latency, high-throughput serving of diverse generative AI models, including LLMs. Its optimizations cover a wide range of use cases, so it delivers high inference serving performance while sparing users the configuration burden.
The Evaluation Set-up:
We put all three engines to the test by serving Meta's Llama 2 70B model on four NVIDIA A100 80GB GPUs, simulating varying load intensities with the Databricks Dolly dataset. The evaluation deliberately includes very stressful cases with high input load, so that the engines are tested under enterprise-scale conditions. Requests arrive according to a Poisson process, which makes the scenario more realistic and challenging than experiments with a fixed number of concurrent requests. We vary the Poisson rate from 1N to 5N req/s throughout the evaluation, which results in about 20 to 200 concurrent requests on average. The metric shown on the graph is the 90th percentile time per output token (TPOT, lower is better), taken over all requests in each experiment.
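The load pattern and the metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark harness used for the evaluation: the function names are hypothetical, the TPOT formula (decode time divided by the number of tokens after the first) is one common definition that the post does not spell out, and the nearest-rank percentile is an assumed choice.

```python
import math
import random


def poisson_arrival_times(rate_per_s, duration_s, seed=0):
    """Arrival timestamps (seconds) of a Poisson process.

    A Poisson process with rate `rate_per_s` has independent,
    exponentially distributed gaps between consecutive arrivals.
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


def tpot_ms(first_token_s, last_token_s, num_output_tokens):
    """Time per output token for one request, in milliseconds.

    Assumed definition: time between the first and last output token,
    divided by the number of tokens generated after the first.
    """
    if num_output_tokens < 2:
        raise ValueError("need at least two output tokens")
    return (last_token_s - first_token_s) * 1000.0 / (num_output_tokens - 1)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

A load generator would send one request at each timestamp returned by `poisson_arrival_times`, record a TPOT per request, and report `percentile(tpots, 90)` for each tested rate.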
- vLLM: Showed decent performance at low loads, but quickly crumbled under pressure, failing to respond within the allotted time under moderate load. TPOTs exceeding the threshold (over 1000 ms per token, or more than 2 seconds per word) are omitted from the graph.
- TensorRT-LLM: Showed initial promise, outperforming vLLM at moderate load. However, as the pressure intensified, latency escalated, so additional GPUs would be needed to maintain service level objectives.
- Friendli Engine: Made all the difference, remaining stable across every load in the evaluation. Even at extremely high loads, it delivered 90th percentile latencies within 100 ms per token, a clearly lower TPOT than the other engines at every load level we tested. Under an 80 ms constraint on p90 TPOT, Friendli Engine sustains 4x the throughput of TensorRT-LLM (4N req/s vs. 1N req/s).
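The "throughput under a latency constraint" comparison can be read as: for each engine, find the highest tested request rate at which p90 TPOT still meets the target, then compare those rates. The sketch below illustrates that reading only. The function name is hypothetical, and the numbers in the example are invented placeholders chosen to exercise the code; they are not measurements from this evaluation.

```python
def max_load_under_slo(p90_tpot_by_load, slo_ms):
    """Highest tested load whose p90 TPOT meets the SLO, or None.

    `p90_tpot_by_load` maps a request rate (in multiples of N req/s) to the
    measured p90 TPOT in ms; use None for runs that did not complete.
    """
    passing = [
        load
        for load, tpot in p90_tpot_by_load.items()
        if tpot is not None and tpot <= slo_ms
    ]
    return max(passing) if passing else None


# Invented placeholder values, NOT results from the benchmark.
engine_a = {1: 30.0, 2: 45.0, 3: 70.0, 4: 110.0}
engine_b = {1: 50.0, 2: 75.0, 3: 140.0, 4: None}

load_a = max_load_under_slo(engine_a, slo_ms=80.0)
load_b = max_load_under_slo(engine_b, slo_ms=80.0)
throughput_ratio = load_a / load_b
```

Applied to the figures reported above, the same ratio is 4N / 1N = 4.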
Why Friendli? The Verdict is In:
The results paint a clear picture. Friendli Engine handled every load level in our evaluation, offering:
- Unmatched Scalability: Handles even the most demanding workloads without sacrificing performance, staying stable under high input load.
- Low Latency and High Throughput: Responds to prompts lightning-fast, ensuring smooth user experiences.
- Stability and Reliability: Count on consistent performance no matter the pressure.
Ready to Unleash the Power of Your LLM? Experience Friendli Engine's magic firsthand! We offer three options to suit your preferences:
- Friendli Dedicated Endpoints: Run any custom generative AI model on dedicated GPU instances, on autopilot.
- Friendli Serverless Endpoints: No setup required, simply call our APIs and let us handle the rest.
- Friendli Container: Deploy the engine on your own infrastructure for ultimate control.
FriendliAI Tech & Research