- October 16, 2023
- 4 min read
Activation-aware Weight Quantization (AWQ): Unlocking LLM Efficiency—Part 1: Understanding the Basics

In the world of LLMs (large language models) such as Llama 2 and MPT, inference serving efficiency is paramount. As models grow in size and complexity, their computational and memory requirements can become prohibitive, limiting deployment. Activation-Aware Weight Quantization (AWQ) is a technique that addresses this challenge by optimizing LLMs, or more broadly deep neural networks, for efficient execution. In this article, we will delve into what AWQ is and how it benefits LLM inference serving. If you would like to make use of AWQ-ed LLMs, try out Friendli Inference! It runs AWQ-ed LLMs natively, e.g., a 4-bit Llama 2 70B on a single A100 80 GB GPU.
The Basics of Weight Quantization
Before we dive into Activation-Aware Weight Quantization, let’s first understand the concept of weight quantization. Weight quantization is the process of reducing the precision of the parameters (weights) in a neural network. In typical neural networks, weights are represented as floating-point numbers with relatively high precision, often 16 bits (e.g., fp16 and bf16 formats). However, this level of precision requires significant GPU memory resources.
Weight quantization aims to represent these weights with a smaller number of bits, such as 8-bit or even 4-bit integers. This reduction in precision can significantly reduce the memory requirements, making it feasible to deploy LLMs on a smaller number of GPUs.
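To make this concrete, here is a minimal sketch of plain round-to-nearest 4-bit weight quantization, with one scale and zero-point per group of weights. The group size, tensor shapes, and helper names (`quantize_int4`, `dequantize_int4`) are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def quantize_int4(w: np.ndarray, group_size: int = 128):
    """Round-to-nearest 4-bit quantization of a [out_features, in_features] weight
    matrix, with one scale and zero-point per group of `group_size` input channels."""
    out_f, in_f = w.shape
    w = w.reshape(out_f, in_f // group_size, group_size).astype(np.float32)

    w_min = w.min(axis=-1, keepdims=True)
    w_max = w.max(axis=-1, keepdims=True)
    scale = np.maximum((w_max - w_min) / 15.0, 1e-8)   # 4 bits -> 16 levels (0..15)
    zero = np.round(-w_min / scale)                    # per-group zero-point

    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_int4(q, scale, zero, shape):
    """Reconstruct an approximate fp32 weight matrix from the 4-bit representation."""
    return ((q.astype(np.float32) - zero) * scale).reshape(shape)

# The 4-bit reconstruction only approximates the original fp16 weights,
# but it takes roughly a quarter of the memory.
w_fp16 = np.random.randn(4096, 4096).astype(np.float16)
q, s, z = quantize_int4(w_fp16)
w_hat = dequantize_int4(q, s, z, w_fp16.shape)
print("mean absolute error:", np.abs(w_fp16.astype(np.float32) - w_hat).mean())
```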
The Role of Activation in Weight Quantization
AWQ takes the concept of weight quantization to the next level by considering the activations of the model during the quantization process. In traditional weight quantization, the weights are quantized independently of the data they process. In AWQ, the quantization process takes into account the actual data distribution in the activations produced by the model during inference.
Here’s how AWQ works:
- Collect Activation Statistics: During a calibration phase, a small subset of the data is used to collect statistics on the activations produced by the model. This involves running the model on this data and recording the range and distribution of activation values.
- Search Weight Quantization Parameters: Weights are quantized with the activation statistics taken into account. Concretely, we search the space of quantization parameters (e.g., scales and zero-points) to minimize the distortion that quantization introduces into the output activations. As a result, the quantized weights can be accurately represented with fewer bits.
- Quantize: With the quantization parameters in place, the model weights are quantized using a reduced number of bits (a simplified sketch of these three steps follows this list).
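Below is a simplified sketch of an activation-aware scale search in the spirit of AWQ, reusing the hypothetical `quantize_int4`/`dequantize_int4` helpers from the earlier snippet. The per-channel activation statistic, the `alpha` grid, and the mean-squared-error objective are illustrative assumptions, not the reference implementation (which, for example, typically folds the activation scales into the preceding layer so inference cost is unchanged).

```python
import numpy as np

def awq_quantize_layer(w: np.ndarray, x_calib: np.ndarray, group_size: int = 128):
    """Activation-aware 4-bit quantization of one linear layer's weights.
    `w` is [out_features, in_features]; `x_calib` is a small calibration batch
    of activations feeding this layer, shaped [num_tokens, in_features]."""
    # Step 1 - collect activation statistics: average magnitude per input channel.
    act_scale = np.abs(x_calib).mean(axis=0) + 1e-8

    # Step 2 - search the scaling parameter: pick the exponent `alpha` whose
    # per-channel scales minimize the error on the layer's output activations.
    y_ref = x_calib @ w.T
    best = None
    for alpha in np.linspace(0.0, 1.0, 11):
        s = act_scale ** alpha                          # per-channel weight scale
        q, sc, z = quantize_int4(w * s, group_size)     # quantize the scaled weights
        w_hat = dequantize_int4(q, sc, z, w.shape) / s  # fold the scale back out
        err = np.mean((y_ref - x_calib @ w_hat.T) ** 2)
        if best is None or err < best[0]:
            best = (err, alpha, s)

    # Step 3 - quantize with the chosen scales.
    _, alpha, s = best
    q, sc, z = quantize_int4(w * s, group_size)
    return q, sc, z, s, alpha

# Example: calibrate on a few hundred activation vectors.
w = np.random.randn(4096, 4096).astype(np.float32)
x_calib = np.random.randn(256, 4096).astype(np.float32)
q, sc, z, s, alpha = awq_quantize_layer(w, x_calib)
print("selected alpha:", alpha)
```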
Benefits of AWQ
AWQ offers several advantages for neural networks:
- Improved Accuracy: By considering the distribution of activations during quantization, the technique can achieve better preservation of model accuracy compared to traditional weight quantization, where activations are not taken into account.
- Efficiency: Using AWQ, weights can be represented with fewer bits, such as 4-bit integers, without accuracy degradation. This reduces the memory requirements by up to 4x, making it feasible to deploy large models on a wider range of devices; a rough back-of-the-envelope estimate follows this list. In addition, smaller weights save GPU memory bandwidth, which reduces the latency of token generation.
- Robustness: AWQ helps ensure that the model remains accurate even when faced with challenging or varied input data.
- No Training Required: AWQ falls into the PTQ (Post-Training Quantization) category among various quantization techniques; it does not require costly re-training or vast amounts of training data. For example, quantizing a 70B Llama model takes only about a hundred example sentences and a couple of hours on a single NVIDIA A100 80GB GPU.
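As a rough sanity check on the 4x figure and the single-A100 claim, the arithmetic below estimates the weight memory of a 70B-parameter model at fp16 versus 4-bit. The group size and per-group overhead are illustrative assumptions, and the real footprint also includes the KV cache and activations.

```python
params = 70e9                              # 70B parameters
fp16_gb = params * 2 / 1e9                 # 2 bytes/weight  -> ~140 GB
int4_gb = params * 0.5 / 1e9               # 4 bits/weight   -> ~35 GB
overhead_gb = params / 128 * 4 / 1e9       # fp16 scale + zero-point per 128-weight group
print(fp16_gb, int4_gb + overhead_gb)      # ~140 GB vs ~37 GB: roughly a 4x reduction
```

At roughly 37 GB of weights, a 4-bit Llama 2 70B fits on a single 80 GB A100 with room left over for the KV cache, whereas the fp16 weights alone would need two or more such GPUs.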
AWQ is a powerful technique that optimizes LLMs for efficiency without sacrificing model accuracy. By considering the data distribution in activations during the quantization process, it tailors the precision of weights to the specific characteristics of the model’s input data. This approach not only reduces the memory and computational requirements of neural networks but also ensures that the model remains accurate and robust in a variety of real-world scenarios.
For practical use of AWQ-ed LLMs, you can explore the capabilities of FriendliAI’s Friendli Inference, a cutting-edge LLM serving platform that facilitates the deployment and execution of quantized LLMs. By harnessing Friendli Inference, you can experience the benefits of AWQ firsthand and witness its impact on the efficiency and performance of deep learning models. We believe that Friendli Inference is a key enabler for the widespread deployment of efficient and accurate LLMs. In our [next blog](https://friendli.ai/blog/activation-aware-weight-quantization-friendli), we will share the details of running AWQ-ed LLMs on Friendli Inference. Stay tuned!
Written by
FriendliAI Tech & Research