- October 16, 2023
- 4 min read
Understanding Activation-Aware Weight Quantization (AWQ): Boosting Inference Serving Efficiency in LLMs
In the world of LLMs (large language models) such as Llama 2 and MPT, inference serving efficiency is paramount. As models grow in size and complexity, the computational and memory requirements can become prohibitive, limiting their deployment. Activation-Aware Weight Quantization (AWQ) is a technique that seeks to address this challenge by optimizing LLMs, or more broadly deep neural networks, for efficient execution. In this article, we will delve into what AWQ is and how it benefits LLM inference serving. If you would like to make use of AWQ-ed LLMs, try out Friendli Engine! It runs AWQ-ed LLMs natively, e.g., Llama 2 70B in 4-bit on a single NVIDIA A100 80 GB GPU.
The Basics of Weight Quantization
Before we dive into Activation-Aware Weight Quantization, let’s first understand the concept of weight quantization. Weight quantization is the process of reducing the precision of the parameters (weights) in a neural network. In typical neural networks, weights are represented as floating-point numbers with relatively high precision, often 16 bits (e.g., fp16 and bf16 formats). However, this level of precision requires significant GPU memory resources.
Weight quantization aims to represent these weights with a smaller number of bits, such as 8-bit or even 4-bit integers. This reduction in precision can significantly reduce the memory requirements, making it feasible to deploy LLMs on a smaller number of GPUs.
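To make this concrete, here is a minimal sketch of plain (activation-agnostic) weight quantization in NumPy. The symmetric per-output-channel scheme and the matrix sizes are illustrative assumptions, not a description of any particular library; real 4-bit kernels also pack two values per byte, whereas this sketch stores them in int8 for simplicity.

```python
import numpy as np

def quantize_weights(w: np.ndarray, n_bits: int = 4):
    """Quantize a float weight matrix to signed n-bit integers, one scale per output channel."""
    q_max = 2 ** (n_bits - 1) - 1                          # e.g., 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / q_max   # per-row (output-channel) scale
    w_q = np.clip(np.round(w / scale), -q_max - 1, q_max).astype(np.int8)
    return w_q, scale

def dequantize(w_q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix from the quantized integers."""
    return w_q.astype(np.float32) * scale

# Example: a 4-bit weight needs roughly 1/4 of the memory of its fp16 original.
w = np.random.randn(4096, 4096).astype(np.float32)        # stand-in for one fp16 weight matrix
w_q, scale = quantize_weights(w, n_bits=4)
print("max reconstruction error:", np.abs(w - dequantize(w_q, scale)).max())
```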
The Role of Activation in Weight Quantization
AWQ takes the concept of weight quantization to the next level by considering the activations of the model during the quantization process. In traditional weight quantization, the weights are quantized independently of the data they process. In AWQ, the quantization process takes into account the actual data distribution in the activations produced by the model during inference.
Here’s how AWQ works:
- Collect Activation Statistics: During a calibration phase, a small subset of the data is used to collect statistics on the activations produced by the model. This involves running the model on the calibration data and recording the range and distribution of the activation values.
- Search Weight Quantization Parameters: Weights are quantized with the activation statistics taken into account. Concretely, we search over the space of quantization parameters (e.g., scales and zero points) to minimize the distortion that quantization introduces into the output activations. As a result, the quantized weights can be represented accurately with fewer bits.
- Quantize: With the quantization parameters in place, the model weights are quantized to the reduced number of bits (see the simplified sketch after this list).
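The sketch below is a simplified, hypothetical illustration of these steps, not FriendliAI's or the AWQ authors' actual implementation. It reuses quantize_weights and dequantize from the earlier snippet, scales salient input channels according to their average activation magnitude before quantizing, and grid-searches a single exponent alpha so that the quantized layer's output stays close to the full-precision output on calibration data.

```python
def awq_search(w: np.ndarray, calib_x: np.ndarray, n_bits: int = 4):
    """Grid-search an activation-aware per-input-channel scaling exponent alpha in [0, 1]."""
    act_mag = np.abs(calib_x).mean(axis=0) + 1e-8           # per-input-channel activation statistics
    y_ref = calib_x @ w.T                                    # full-precision reference output
    best_alpha, best_err, best_w_hat = None, np.inf, None
    for alpha in np.linspace(0.0, 1.0, 11):                  # candidate scaling strengths
        s = act_mag ** alpha                                 # scale salient channels more strongly
        w_q, scale = quantize_weights(w * s, n_bits)         # quantize the scaled weights
        w_hat = dequantize(w_q, scale) / s                   # fold the scaling back out
        err = np.mean((calib_x @ w_hat.T - y_ref) ** 2)      # output distortion on calibration data
        if err < best_err:
            best_alpha, best_err, best_w_hat = alpha, err, w_hat
    return best_alpha, best_err, best_w_hat

# Usage with stand-in data; alpha = 0 corresponds to plain weight quantization.
calib_x = np.random.randn(128, 4096).astype(np.float32)     # stand-in for calibration activations
alpha, err, w_hat = awq_search(w, calib_x)
print(f"best alpha: {alpha:.1f}, output MSE: {err:.6f}")
```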
Benefits of AWQ
AWQ offers several advantages for neural networks:
- Improved Accuracy: By considering the distribution of activations during quantization, the technique can achieve better preservation of model accuracy compared to traditional weight quantization, where activations are not taken into account.
- Efficiency: Using AWQ, weights can be represented with fewer bits, such as 4-bit integers, without accuracy degradation. This reduces memory requirements by up to 4x, making it feasible to deploy large models on a wider range of devices (see the back-of-the-envelope calculation after this list). In addition, smaller weights consume less GPU memory bandwidth, which can reduce the latency of token generation.
- Robustness: AWQ helps ensure that the model remains accurate even when faced with challenging or varied input data.
- No Training Required: AWQ falls into the PTQ (Post-Training Quantization) category among various quantization techniques; it does not require costly re-training or vast amounts of training data. For example, quantizing a 70B Llama model uses only around a hundred example sentences and takes a couple of hours on a single NVIDIA A100 80GB GPU.
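As a back-of-the-envelope check of the memory numbers above (weights only, ignoring the KV cache and activation memory):

```python
# 70B parameters in fp16 vs. 4-bit: 2 bytes vs. 0.5 bytes per weight.
params = 70e9
fp16_gb = params * 2   / 1e9   # ~140 GB: does not fit on a single 80 GB GPU
int4_gb = params * 0.5 / 1e9   # ~35 GB: fits on one A100 80GB with room for the KV cache
print(f"fp16: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB, reduction: {fp16_gb / int4_gb:.0f}x")
```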
AWQ is a powerful technique that optimizes LLMs for efficiency without sacrificing model accuracy. By considering the data distribution in activations during the quantization process, it tailors the precision of weights to the specific characteristics of the model’s input data. This approach not only reduces the memory and computational requirements of neural networks but also ensures that the model remains accurate and robust in a variety of real-world scenarios.
For practical use of AWQ-ed LLMs, you can explore the capabilities of FriendliAI’s Friendli Engine, a cutting-edge LLM serving platform that facilitates the deployment and execution of quantized LLMs. By harnessing Friendli Engine, you can experience the benefits of AWQ firsthand and witness its impact on the efficiency and performance of deep learning models. We believe that Friendli Engine is a key enabler for the widespread deployment of efficient and accurate LLMs. In our next blog, we will share the details of running AWQ-ed LLMs on Friendli Engine. Stay tuned!
Written by
FriendliAI Tech & Research