Faster serving of the 4-bit quantized Llama 2 70B model with fewer GPUs: Friendli Engine vs. vLLM


In this blog, we showcase the performance of running AWQ-ed models on FriendliAI’s Friendli Engine, which employs innovative optimizations including iteration batching, a technique we pioneered. To illustrate this efficiency, we compare Friendli Engine against vLLM.
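As a rough illustration of the idea (and not the engine's actual scheduler), iteration batching admits new requests into the running batch at every decoding step instead of waiting for the whole batch to drain. The `pending`, `step`, and `finished` names in this Python sketch are hypothetical placeholders:

```python
# Conceptual sketch of iteration-level (continuous) batching.
# `pending`, `step`, and `finished` are hypothetical placeholders,
# not Friendli Engine APIs.
def serve(pending, step, finished, max_batch_size=32):
    running = []
    while pending or running:
        # Admit new requests at every iteration, not only when the batch drains.
        while pending and len(running) < max_batch_size:
            running.append(pending.pop(0))
        # Run one decoding step for every active request.
        step(running)
        # Retire requests that produced their final token and free their slots.
        running = [r for r in running if not finished(r)]
```

Because slots are recycled as soon as a request finishes, short and long generations can share a batch without the short ones waiting on the long ones.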

Before we dive into the performance comparison, let's quickly revisit Activation-Aware Weight Quantization (AWQ). As mentioned in our previous articles, Understanding AWQ: Boosting Inference Serving Efficiency in LLMs and Unlocking Efficiency of Serving LLMs with AWQ on Friendli Engine, AWQ is a powerful technique for reducing the model size and memory footprint, enhancing efficiency without compromising accuracy. By representing weights with fewer bits, it lowers computational and memory requirements while keeping the model's results accurate. Now, let's turn our attention to how well AWQ performs on Friendli Engine with an example of running Meta’s Llama 2 70B model on an NVIDIA A100 80GB GPU.
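To make the idea concrete, here is a minimal PyTorch sketch of 4-bit group-wise weight quantization with an activation-aware per-channel scale, in the spirit of AWQ. The group size and the scaling rule are simplified illustrations, not FriendliAI's or the AWQ authors' implementation:

```python
import torch

def awq_style_quantize(weight, act_scale, group_size=128, n_bits=4):
    """Toy 4-bit group-wise quantization with an activation-aware scale.

    weight:    (out_features, in_features) FP16/FP32 weight matrix
    act_scale: (in_features,) average activation magnitude per input channel
    Assumes in_features is divisible by group_size.
    """
    # Scale up salient input channels (those with large activations) before
    # quantizing, so their weights lose less precision; the inverse scale is
    # folded back in at dequantization time.
    s = act_scale.clamp(min=1e-5).sqrt()
    w = weight * s

    qmax = 2 ** n_bits - 1
    out_f, in_f = w.shape
    w_groups = w.reshape(out_f, in_f // group_size, group_size)

    # One scale and zero-point per group (asymmetric quantization).
    w_min = w_groups.amin(dim=-1, keepdim=True)
    w_max = w_groups.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = (-w_min / scale).round()

    q = (w_groups / scale + zero).round().clamp(0, qmax)    # 4-bit codes
    deq = ((q - zero) * scale).reshape(out_f, in_f) / s     # fold scale back
    return q.to(torch.uint8), deq
```

The 4-bit codes are what get stored and moved through memory at serving time, which is where the footprint and bandwidth savings come from.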

Performance on the Stanford Alpaca Dataset

In a recent evaluation, we put AWQ to the test by running Meta’s Llama 2 70B model on NVIDIA A100 80GB GPUs while handling the Stanford Alpaca dataset under varying workloads. The workloads were labeled "1N", "2N", and "3N", representing increasing request rates (requests per second).

Figure: Mean TPOT (time per output token) comparison for Llama 2 70B on A100 80GB GPUs with the Stanford Alpaca dataset.

  • Under the "1N" load, Friendli Engine demonstrated remarkable efficiency on a single GPU, delivering ~1.5x and ~2.2x faster responses than vLLM running on 4 and 2 GPUs, respectively.
  • At the "2N" load, Friendli Engine handled the increased demand without compromising latency or throughput, delivering ~2x and ~3.6x faster responses on a single GPU than vLLM on 4 and 2 GPUs, respectively.
  • Even under the most demanding "3N" load, with the highest request rate, Friendli Engine maintained its efficiency on a single GPU, offering over 2x lower latency than vLLM while still producing contextually accurate responses.
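The latency figures above are mean TPOT, i.e. the average time between generated tokens. Below is a minimal sketch of how such a metric can be measured against a streaming endpoint; the `stream_completion` client is a hypothetical stand-in for the actual benchmark harness, not a Friendli or vLLM API:

```python
import time
import statistics

def measure_mean_tpot(prompts, stream_completion):
    """Mean time per output token (TPOT) across a set of prompts.

    `stream_completion` is a hypothetical client that yields one generated
    token at a time for a given prompt.
    """
    per_request_tpot = []
    for prompt in prompts:
        token_gaps = []
        last = time.perf_counter()
        for _token in stream_completion(prompt):
            now = time.perf_counter()
            token_gaps.append(now - last)
            last = now
        if len(token_gaps) > 1:
            # Skip the first gap: it is dominated by prefill (time to first
            # token), while TPOT measures the steady decoding rate.
            per_request_tpot.append(statistics.mean(token_gaps[1:]))
    return statistics.mean(per_request_tpot)
```

In the actual evaluation, requests were issued at the 1N, 2N, and 3N rates while this kind of per-token timing was collected for each engine.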

Comparing FP16 Performance on vLLM

It's worth noting that the vLLM baseline ran in "fp16" mode rather than "AWQ" mode. While vLLM supports AWQ, it performs better in fp16 mode, as described in this comment. To keep the comparison credible, we benchmarked Friendli Engine against the best-performing vLLM configuration.
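For reference, an fp16 vLLM baseline with tensor parallelism can be launched roughly as follows. Argument names follow vLLM's Python API at the time of writing and may differ across versions; the model ID and parallel degree are shown only to match the 4-GPU configuration above:

```python
from vllm import LLM, SamplingParams

# FP16 baseline spread over 4 GPUs; Llama 2 70B does not fit on a single
# 80GB GPU without quantization. Passing quantization="awq" instead would
# load a 4-bit checkpoint, but fp16 was the faster vLLM configuration.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    dtype="float16",
    tensor_parallel_size=4,
)

outputs = llm.generate(
    ["Explain activation-aware weight quantization in one sentence."],
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```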

Unlocking the Power of Friendli Engine

The numbers from our evaluation show that FriendliAI's Friendli Engine not only uses fewer GPUs than vLLM but also excels in both latency and throughput. These figures reflect the strength of Friendli Engine as an LLM serving engine in production environments. We invite you to explore the power and efficiency of FriendliAI's Friendli Engine, our cutting-edge LLM serving engine that enables you to leverage the full potential of AWQ. Friendli Engine is a game-changer for LLM serving. Try out Friendli Engine today!


