• November 7, 2023
  • 2 min read

Faster serving of the 4-bit quantized Llama 2 70B model with fewer GPUs: Friendli Inference vs. vLLM

In this blog, we show the excellent performance of running AWQ-ed models on FriendliAI's Friendli Inference. Friendli Inference uses innovative optimizations, including iteration batching, a technique we pioneered. To illustrate the mind-blowing efficiency of running AWQ-ed models on Friendli Inference, we compare it against vLLM.

Before we dive into the performance comparison, let's quickly revisit Activation-aware Weight Quantization (AWQ). As discussed in our previous articles, Understanding AWQ: Boosting Inference Serving Efficiency in LLMs and Unlocking Efficiency of Serving LLMs with AWQ on Friendli Inference, AWQ is a powerful technique for reducing a model's size and memory footprint without compromising accuracy: by representing weights with fewer bits, it cuts computational and memory requirements while preserving the quality of the model's outputs. Now, let's turn our attention to how well AWQ performs on Friendli Inference, using Meta's Llama 2 70B model on an NVIDIA A100 80GB GPU as our example.
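
As a rough illustration of what activation-aware 4-bit weight quantization does, here is a minimal PyTorch sketch. It is not FriendliAI's (or the original AWQ authors') implementation; the function name, the group size of 128, and the square-root scaling heuristic are assumptions made purely for illustration. The key idea is that each weight is stored in 4 bits plus a small per-group scale and zero point, with salient input channels scaled up before quantization based on activation statistics.

```python
import torch

def quantize_4bit_groupwise(weight: torch.Tensor,
                            act_scale: torch.Tensor,
                            group_size: int = 128):
    """Illustrative AWQ-style 4-bit quantization of a [out, in] weight matrix.

    act_scale holds per-input-channel activation magnitudes; AWQ scales
    salient channels up before quantizing so their precision is preserved,
    and folds the inverse scale into the preceding activations at runtime.
    """
    # Activation-aware scaling (simple sqrt heuristic, an assumption here).
    s = act_scale.clamp(min=1e-5).sqrt()
    w = weight * s                                   # [out, in]

    out_f, in_f = w.shape                            # in_f must be divisible by group_size
    w = w.reshape(out_f, in_f // group_size, group_size)

    # Asymmetric 4-bit quantization per group: 16 integer levels (0..15).
    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = ((w_max - w_min) / 15.0).clamp(min=1e-8)
    zero = (-w_min / scale).round()
    q = ((w / scale) + zero).round().clamp(0, 15).to(torch.uint8)

    # Dequantized copy, just to check the round trip; a serving engine keeps
    # q packed (two 4-bit values per byte) and dequantizes inside its kernels.
    w_hat = ((q.float() - zero) * scale).reshape(out_f, in_f) / s
    return q, scale, zero, w_hat

# Example: quantize one layer's weights with dummy activation statistics.
q, scale, zero, w_hat = quantize_4bit_groupwise(torch.randn(4096, 4096),
                                                torch.rand(4096))
print(q.shape, q.dtype)  # torch.Size([4096, 32, 128]) torch.uint8
```

At 4 bits per weight (plus small per-group scales and zero points), the roughly 140 GB of fp16 weights in Llama 2 70B shrink to around 35-40 GB, which is what allows the model to fit on a single A100 80GB GPU.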

Performance on the Stanford Alpaca Dataset

In a recent evaluation, we put AWQ to the test by running Meta's Llama 2 70B model on NVIDIA A100 80GB GPUs while handling requests drawn from the Stanford Alpaca dataset under varying workloads. The workloads are labeled "1N", "2N", and "3N", signifying increasing levels of requests per second.
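
For context, a load like "2N" simply means requests arriving at twice the base rate. The sketch below shows one common way such a load can be generated; it is purely illustrative and not our actual benchmark harness, and send_request is a hypothetical placeholder.

```python
import asyncio, random

async def send_request(prompt: str) -> None:
    # Hypothetical placeholder: a real benchmark would call the serving
    # engine's HTTP API here and record per-token timings for each request.
    await asyncio.sleep(random.uniform(0.5, 3.0))

async def run_load(prompts: list[str], rate_per_sec: float) -> None:
    """Open-loop generator: fire one request every 1/rate seconds,
    without waiting for earlier requests to finish."""
    tasks = []
    for prompt in prompts:
        tasks.append(asyncio.create_task(send_request(prompt)))
        await asyncio.sleep(1.0 / rate_per_sec)
    await asyncio.gather(*tasks)

# "1N", "2N", and "3N" correspond to increasing values of rate_per_sec.
asyncio.run(run_load([f"prompt {i}" for i in range(100)], rate_per_sec=4.0))
```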

Figure: Mean TPOT (time per output token) comparison for Llama 2 70B on A100 80GB GPUs over the Stanford Alpaca dataset, Friendli Inference vs. vLLM.

  • Under the "1N" load, Friendli Inference demonstrated remarkable efficiency with a single GPU, providing ~1.5x and ~2.2x faster responses compared to vLLM using 4 and 2 GPUs, respectively.
  • At the "2N" load, Friendli Inference continued to impress, showcasing its ability to handle increased demand without compromising latency or throughput. It exhibited ~2x and ~3.6x faster responses on a single GPU compared to vLLM on 4 and 2 GPUs, respectively.
  • Even under the most challenging "3N" load, with the highest number of requests per second, Friendli Inference maintained its efficiency with a single GPU, delivering responses over 2x faster than vLLM while keeping them contextually accurate.
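
For reference, the metric in the chart above, mean TPOT (time per output token), measures how long the engine takes to produce each token after the first one. Below is a minimal sketch of how it can be computed from per-request timings; the timing fields and example numbers are assumptions, not the exact format or data from our benchmark.

```python
from statistics import mean

def request_tpot(first_token_time: float, last_token_time: float,
                 num_output_tokens: int) -> float:
    """TPOT = time spent generating tokens after the first one,
    divided by the number of those tokens."""
    return (last_token_time - first_token_time) / max(num_output_tokens - 1, 1)

# Example: three requests as (first-token time, last-token time, output tokens).
requests = [(0.21, 4.85, 128), (0.18, 9.40, 256), (0.25, 2.60, 64)]
tpots = [request_tpot(ft, lt, n) for ft, lt, n in requests]
print(f"mean TPOT: {mean(tpots) * 1000:.1f} ms/token")
```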

Comparing FP16 Performance on vLLM

It's worth noting that we ran vLLM in "fp16" mode rather than in "AWQ" mode. While vLLM supports AWQ, its performance is better in "fp16" mode, as described in this comment. To provide a credible performance comparison of the two engines, we compared Friendli Inference against the best-performing option on vLLM.
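
For reference, the vLLM baseline corresponds to a configuration along the following lines, expressed with vLLM's offline LLM API. This is a simplified sketch rather than our exact benchmark setup; the model ID, sampling parameters, and tensor-parallel degree are illustrative assumptions.

```python
from vllm import LLM, SamplingParams

params = SamplingParams(max_tokens=256, temperature=0.0)

# fp16 baseline used in this comparison: the full-precision Llama 2 70B
# weights (~140 GB) require several A100 80GB GPUs via tensor parallelism.
llm = LLM(model="meta-llama/Llama-2-70b-hf",
          dtype="float16",
          tensor_parallel_size=4)

# The AWQ alternative would instead pass quantization="awq" with a 4-bit
# checkpoint of the same model, but vLLM's AWQ path was slower than fp16
# at the time this post was written.

outputs = llm.generate(["Summarize AWQ in one sentence."], params)
print(outputs[0].outputs[0].text)
```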

Unlocking the Power of Friendli Inference

From the numbers in our evaluation, we can see that FriendliAI's Friendli Inference not only uses fewer GPUs than vLLM but also excels in both latency and throughput. These figures reflect the superiority of Friendli Inference as an LLM serving engine in production environments. We invite you to explore the power and efficiency of Friendli Inference, our cutting-edge LLM serving engine that lets you leverage the full potential of AWQ. Friendli Inference is a game-changer for LLM serving. Try out Friendli Inference today!


Written by

FriendliAI Tech & Research


General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Our Friendli Inference allows you to squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the metric that matters, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multimodal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. Selecting "Friendli Endpoints" on the Hugging Face Hub gives you one-click deployment, taking you straight to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Contact Sales; our experts (not a bot) will reply within one business day.