- November 7, 2023
Faster serving of the 4-bit quantized Llama 2 70B model with fewer GPUs: Friendli Inference vs. vLLM
In this post, we show how well AWQ-quantized models perform on FriendliAI’s Friendli Inference. Friendli Inference uses a range of optimizations, including iteration batching, which we pioneered. To illustrate the efficiency of running AWQ-quantized models on Friendli Inference, we compare it with vLLM.
Before we dive into the performance comparison, let's quickly revisit Activation-Aware Weight Quantization (AWQ). As mentioned in our previous articles, "Understanding AWQ: Boosting Inference Serving Efficiency in LLMs" and "Unlocking Efficiency of Serving LLMs with AWQ on Friendli Inference", AWQ is a powerful technique for reducing a model's size and memory footprint, making it more efficient without compromising accuracy. By representing weights with fewer bits, it lowers the computational and memory requirements while keeping the model's results accurate. Now, let's turn our attention to how well AWQ performs in Friendli Inference, using the example of running Meta’s Llama 2 70B model on an NVIDIA A100 80GB GPU.
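To make the memory argument concrete, here is a rough back-of-the-envelope sketch of the weight footprint at 16 bits versus 4 bits. It assumes a nominal 70 billion parameters and counts weights only: it ignores the KV cache, activations, and the per-group scales and zero points that a 4-bit format stores alongside the weights, so real requirements are higher. The function and constant names are ours, chosen for illustration.

```python
import math


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def min_gpus_for_weights(weights_gb: float, gpu_memory_gb: float) -> int:
    """Lower bound on the GPU count needed just to hold the weights."""
    return math.ceil(weights_gb / gpu_memory_gb)


PARAMS_70B = 70e9    # nominal parameter count (assumption for illustration)
GPU_MEMORY_GB = 80   # NVIDIA A100 80GB

fp16_gb = weight_memory_gb(PARAMS_70B, 16)
int4_gb = weight_memory_gb(PARAMS_70B, 4)

print(f"FP16 weights:  {fp16_gb:.0f} GB, "
      f"at least {min_gpus_for_weights(fp16_gb, GPU_MEMORY_GB)} GPU(s)")
print(f"4-bit weights: {int4_gb:.0f} GB, "
      f"at least {min_gpus_for_weights(int4_gb, GPU_MEMORY_GB)} GPU(s)")
```

Under these assumptions, the FP16 weights alone exceed the memory of a single 80 GB GPU, while the 4-bit weights fit on one with room left for the KV cache.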
Performance on the Stanford Alpaca Dataset
In a recent evaluation, we put AWQ to the test by running Meta’s Llama 2 70B model on NVIDIA A100 80GB GPUs against the Stanford Alpaca dataset under varying workloads. The workloads are labeled "1N", "2N", and "3N", denoting increasing levels of requests per second.
- Under the "1N" load, Friendli Inference on a single GPU delivered ~1.5x and ~2.2x faster responses than vLLM on 4 and 2 GPUs, respectively.
- At the "2N" load, Friendli Inference continued to impress, handling the increased demand without compromising latency or throughput. On a single GPU, it delivered ~2x and ~3.6x faster responses than vLLM on 4 and 2 GPUs, respectively.
- Even under the most challenging "3N" load, with the highest number of requests per second, Friendli Inference maintained its efficiency on a single GPU, offering over 2x lower latency than vLLM along with contextually accurate responses.
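For readers who want to set up a similar load pattern, the following is a minimal sketch of generating request arrival times at rates N, 2N, and 3N. It is not the harness used for the evaluation above: the post does not specify the value of N or how arrivals were distributed, so the Poisson arrival process, the `BASE_RATE_N` value, and the 60-second window are all assumptions made for illustration.

```python
import random


def arrival_times(rate_per_s: float, duration_s: float, seed: int = 0) -> list[float]:
    """Sample request arrival times (in seconds) from a Poisson process.

    Inter-arrival gaps are exponentially distributed with mean 1 / rate_per_s.
    """
    rng = random.Random(seed)
    t = 0.0
    times = []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return times
        times.append(t)


BASE_RATE_N = 1.0   # hypothetical base rate in requests per second
DURATION_S = 60.0   # hypothetical length of each run

# One schedule per load level: "1N", "2N", "3N".
schedules = {
    f"{k}N": arrival_times(k * BASE_RATE_N, DURATION_S, seed=k)
    for k in (1, 2, 3)
}

for label, times in schedules.items():
    print(f"{label}: {len(times)} requests scheduled in {DURATION_S:.0f} s")
```

A client would then send one prompt from the dataset at each scheduled time and record the per-request latency for each load level.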
Comparing FP16 Performance on vLLM
It's worth noting that we ran vLLM in "fp16" mode rather than in "AWQ" mode. While vLLM supports AWQ, it performs better in "fp16" mode, as described in this comment. To provide a credible comparison of the two engines, we compared Friendli Inference against the best-performing option on vLLM.
Unlocking the Power of Friendli Inference
From the numbers in our evaluation, we can see that FriendliAI's Friendli Inference not only uses fewer GPUs than vLLM but also delivers better latency and throughput. These figures reflect the strength of Friendli Inference as an LLM serving engine for production environments. We invite you to explore the power and efficiency of Friendli Inference, our cutting-edge LLM serving engine that lets you leverage the full potential of AWQ. Try out Friendli Inference today!
Written by
FriendliAI Tech & Research