Faster serving of the 4-bit quantized Llama 2 70B model with fewer GPUs: PeriFlow vs. vLLM


In this post, we showcase the performance of running AWQ-quantized models on FriendliAI’s PeriFlow. PeriFlow incorporates innovative optimizations, including the iteration batching technique we pioneered. To illustrate the benefits of running AWQ-quantized models on PeriFlow, we compare PeriFlow with vLLM.

Before we dive into the performance comparison, let's quickly revisit Activation-aware Weight Quantization (AWQ). As discussed in our previous articles, Understanding AWQ: Boosting Inference Serving Efficiency in LLMs and Unlocking Efficiency of Serving LLMs with AWQ on PeriFlow, AWQ is a powerful technique for reducing a model's size and memory footprint without compromising accuracy. By representing weights with fewer bits, it lowers computational and memory requirements while preserving output quality. Now, let's turn to how well AWQ performs on PeriFlow, using Meta’s Llama 2 70B model running on NVIDIA A100 80GB GPUs as an example.
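To make the idea concrete, here is a minimal NumPy sketch of AWQ-style 4-bit group quantization. This is a toy illustration, not PeriFlow's or AWQ's actual implementation: the function name, the `alpha=0.5` scaling exponent, and the group size are illustrative choices.

```python
import numpy as np

def awq_style_quantize(w, act_scale, group_size=128, alpha=0.5):
    """Toy AWQ-style 4-bit group quantization.

    w: (out_features, in_features) weight matrix
    act_scale: (in_features,) average activation magnitude per input channel
    Returns the dequantized weights the model effectively computes with.
    """
    # AWQ's key idea: scale up weight channels that see large activations,
    # shrinking their relative quantization error; the inverse scale is
    # folded into the preceding operation at inference time.
    s = act_scale ** alpha
    w_scaled = w * s
    deq = np.empty_like(w_scaled)
    for i in range(0, w_scaled.shape[1], group_size):
        g = w_scaled[:, i:i + group_size]
        g_min = g.min(axis=1, keepdims=True)
        g_max = g.max(axis=1, keepdims=True)
        scale = np.maximum((g_max - g_min) / 15.0, 1e-8)   # 4 bits -> 16 levels
        q = np.clip(np.round((g - g_min) / scale), 0, 15)  # the stored 4-bit ints
        deq[:, i:i + group_size] = q * scale + g_min
    return deq / s  # undo the activation-aware scaling

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256)).astype(np.float32)
act_scale = rng.uniform(0.5, 2.0, size=256).astype(np.float32)
w_deq = awq_style_quantize(w, act_scale)
```

Even in this toy form, the dequantized weights stay close to the originals while the stored representation needs only 4 bits per weight plus per-group scales and zero points.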

Performance on the Stanford Alpaca Dataset

In a recent evaluation, we put AWQ to the test by running Meta’s Llama 2 70B model on NVIDIA A100 80GB GPUs against the Stanford Alpaca dataset under varying workloads. The workloads, denoted "1N", "2N", and "3N", represent increasing levels of requests per second.

  • Under the "1N" load, PeriFlow demonstrated remarkable efficiency on a single GPU, delivering ~1.5x and ~2.2x faster responses than vLLM running on 4 and 2 GPUs, respectively.
  • At the "2N" load, PeriFlow continued to impress, handling the increased demand without compromising latency or throughput: it delivered ~2x and ~3.6x faster responses on a single GPU than vLLM on 4 and 2 GPUs, respectively.
  • Even under the most challenging "3N" load, PeriFlow maintained its efficiency on a single GPU, delivering over 2x faster responses than vLLM at the higher request rate.
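As a back-of-the-envelope illustration of what these numbers mean for serving cost, we can combine latency and GPU count into a rough "GPU-time per response" proxy. This is a simplification that ignores batching and utilization effects, but it shows how the two advantages compound:

```python
# Using the "1N" figures above: PeriFlow on 1 GPU responds ~1.5x faster
# than vLLM on 4 GPUs. Treating latency x GPU-count as a rough proxy for
# GPU-time spent per response (a simplification), vLLM consumes:
periflow_gpus, vllm_gpus = 1, 4
speedup = 1.5  # PeriFlow's latency advantage at 1N vs. vLLM on 4 GPUs

relative_cost = (vllm_gpus * speedup) / (periflow_gpus * 1.0)
print(f"vLLM uses ~{relative_cost:.0f}x the GPU-time per response")
# prints: vLLM uses ~6x the GPU-time per response
```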

Comparing FP16 Performance on vLLM

It's worth noting that we ran vLLM in "fp16" mode rather than "AWQ" mode. While vLLM supports AWQ, it performs better in "fp16" mode, as described in this comment. To provide a credible comparison of the two engines, we benchmarked PeriFlow against the best-performing vLLM configuration.
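For context, the two vLLM configurations in question can be launched roughly as follows. This is a hedged sketch: exact flags depend on your vLLM version, and the model paths are illustrative.

```shell
# vLLM in fp16 mode across 4 GPUs (the better-performing option)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-70b-hf \
  --dtype float16 \
  --tensor-parallel-size 4

# vLLM in AWQ mode (supported, but slower than fp16 at the time of writing)
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-70B-AWQ \
  --quantization awq
```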

Unlocking the Power of PeriFlow

The numbers from our evaluation show that FriendliAI's PeriFlow not only uses fewer GPUs than vLLM but also delivers better latency and throughput at the same time, making it a compelling LLM serving engine for production environments. Based on these results, we invite you to explore the power and efficiency of FriendliAI's PeriFlow, our cutting-edge LLM serving engine that lets you leverage the full potential of AWQ. Try out PeriFlow today!

