October 16, 2023
4 min read

Activation-aware Weight Quantization (AWQ): Unlocking LLM Efficiency—Part 1: Understanding the Basics

Q: What is FriendliAI?

FriendliAI is the Frontier Inference Cloud for Agents, delivering high throughput, low latency, and reliability at scale for agentic workloads. Through vertically optimized inference infrastructure, it delivers 2–5× faster output token speed and a 99.99% uptime SLA for high-volume production traffic.


In the world of large language models (LLMs) such as Llama 2 and MPT, inference serving efficiency is paramount. As models grow in size and complexity, their computational and memory requirements can become prohibitive, limiting where they can be deployed. Activation-aware Weight Quantization (AWQ) is a technique that addresses this challenge by optimizing LLMs, or more broadly deep neural networks, for efficient execution. In this article, we explain what AWQ is and how it benefits LLM inference serving. If you would like to use AWQ-ed LLMs, try out Friendli Inference! You can run AWQ-ed LLMs (e.g., Llama 2 70B in 4-bit on a single A100 80 GB GPU) natively on FriendliAI with Friendli Inference.

The Basics of Weight Quantization

Before we dive into Activation-aware Weight Quantization, let’s first understand the concept of weight quantization. Weight quantization is the process of reducing the precision of the parameters (weights) in a neural network. In typical neural networks, weights are stored as floating-point numbers with relatively high precision, often 16 bits (e.g., the fp16 and bf16 formats). At this precision, the weights of a large model alone occupy a significant amount of GPU memory.

Weight quantization aims to represent these weights with a smaller number of bits, such as 8-bit or even 4-bit integers. This reduction in precision can significantly reduce the memory requirements, making it feasible to deploy LLMs on a smaller number of GPUs.
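To make the idea concrete, here is a minimal NumPy sketch of plain round-to-nearest weight quantization to 4-bit integers. It is an illustration under our own assumptions, not the scheme used by any particular library: the function names, the group size of 8, and the asymmetric (scale plus zero-point) layout are all chosen for this example.

```python
import numpy as np

def quantize_rtn(w, n_bits=4, group_size=8):
    """Asymmetric round-to-nearest quantization, one scale/zero-point per group.

    Hypothetical helper for illustration; group_size=8 is an arbitrary choice.
    """
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must divide evenly into groups"
    g = w.reshape(rows, cols // group_size, group_size)
    qmax = 2 ** n_bits - 1                      # 15 for 4-bit
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(g / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point, shape):
    """Map integer codes back to approximate floating-point weights."""
    return ((q.astype(np.float32) - zero_point) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)

q, scale, zero_point = quantize_rtn(w)
w_hat = dequantize(q, scale, zero_point, w.shape)
max_err = float(np.abs(w - w_hat).max())       # bounded by the quantization step
```

Each weight is stored as an integer code between 0 and 15, and every group keeps a small amount of floating-point metadata (its scale and zero-point) so the original values can be approximately recovered.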

The Role of Activation in Weight Quantization

AWQ takes the concept of weight quantization to the next level by considering the activations of the model during the quantization process. In traditional weight quantization, the weights are quantized independently of the data they process. In AWQ, the quantization process takes into account the actual data distribution in the activations produced by the model during inference.
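A toy example may help show why activations matter. In the sketch below (made-up numbers, chosen only for illustration), the same rounding error is applied to two different weights; the one that multiplies a large-magnitude activation channel distorts the layer output far more.

```python
import numpy as np

# Two input channels: channel 0 carries small activations, channel 1 large ones.
x = np.array([[0.1, 10.0],
              [0.2, 12.0]])
w = np.array([[0.5],
              [0.5]])                  # shape (in_features, out_features)

delta = 0.05                           # identical weight error in both cases

# Perturb only the weight on channel 0, then only the weight on channel 1.
err_ch0 = np.abs(x @ (w + np.array([[delta], [0.0]])) - x @ w).mean()
err_ch1 = np.abs(x @ (w + np.array([[0.0], [delta]])) - x @ w).mean()
```

Here `err_ch1` is much larger than `err_ch0`, even though the weight error is the same. A quantizer that only looks at the weights cannot see this difference; one that looks at activations can protect the weights that matter most.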

Here’s how AWQ works:

Collect Activation Statistics: During a calibration phase, a small subset of data is used to collect statistics on the activations produced by the model. This involves running the model on this data and recording the range and distribution of the activation values.
Search Weight Quantization Parameters: Weights are quantized by taking the activation statistics into account. Concretely, we search the space of quantization parameters (e.g., scales and zero-points) for the values that minimize the distortion that quantization introduces in the output activations. As a result, the weights can be represented accurately with fewer bits.
Quantize: With the quantization parameters in place, the model weights are quantized using a reduced number of bits.
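The three steps above can be sketched for a single linear layer as follows. This is a simplified illustration, not the Friendli Inference implementation: it assumes the per-input-channel scaling search described in the published AWQ method, uses per-output-row quantization instead of groups, and all function names, the grid size, and the synthetic calibration data are hypothetical.

```python
import numpy as np

def fake_quant(w, n_bits=4):
    """Quantize then dequantize each output row (asymmetric round-to-nearest)."""
    qmax = 2 ** n_bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return (q - zero) * scale

def awq_like_search(w, x_calib, n_bits=4, n_grid=20):
    """w: (out, in); x_calib: (samples, in). Returns (alpha, channel scales, error)."""
    # Step 1: activation statistics, here the mean magnitude per input channel.
    act_mag = np.abs(x_calib).mean(axis=0)
    y_ref = x_calib @ w.T
    best = (None, None, np.inf)
    # Step 2: grid search over an exponent alpha that controls how strongly
    # activation magnitude drives the per-channel weight scaling.
    for i in range(n_grid):
        alpha = i / n_grid                       # alpha = 0 means no scaling
        s = np.maximum(act_mag ** alpha, 1e-4)
        s = s / np.sqrt(s.max() * s.min())       # keep scales centered around 1
        # Scale weights up, quantize, then fold the scale back out. In a real
        # model the inverse scale would be applied to the activations instead.
        w_q = fake_quant(w * s, n_bits) / s
        err = float(np.mean((x_calib @ w_q.T - y_ref) ** 2))
        if err < best[2]:
            best = (alpha, s, err)
    return best

rng = np.random.default_rng(0)
x_calib = rng.normal(size=(128, 64))
x_calib[:, :4] *= 30.0                           # a few high-magnitude channels
w = rng.normal(size=(32, 64))

alpha, s, awq_err = awq_like_search(w, x_calib)
# Baseline: quantize the weights without looking at activations.
rtn_err = float(np.mean((x_calib @ fake_quant(w).T - x_calib @ w.T) ** 2))
# Step 3: quantize with the chosen parameters.
w_q = fake_quant(w * s) / s
```

Because `alpha = 0` is part of the grid and reproduces the activation-agnostic baseline, the searched result can never have a larger output error on the calibration data than plain round-to-nearest.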

Benefits of AWQ

AWQ offers several advantages for neural networks:

Improved Accuracy: By considering the distribution of activations during quantization, the technique can achieve better preservation of model accuracy compared to traditional weight quantization, where activations are not taken into account.
Efficiency: Using AWQ, weights can be represented with fewer bits, such as 4-bit integers, without accuracy degradation. Compared with 16-bit weights, this reduces the memory required for weights by up to 4x, making it feasible to deploy large models on a wider range of devices. In addition, it can reduce token generation latency, because smaller weights consume less GPU memory bandwidth.
Robustness: AWQ helps ensure that the model remains accurate even when faced with challenging or varied input data.
No Training Required: AWQ is a PTQ (Post-Training Quantization) technique; it does not require costly additional re-training or vast amounts of training data. For example, quantizing a 70B Llama model takes only about a hundred example sentences and a couple of hours on a single NVIDIA A100 80GB GPU.
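The memory saving listed under Efficiency can be checked with simple arithmetic. The sketch below estimates weight storage for a 70B-parameter model; the group size of 128 and the 16-bit scale plus 16-bit zero-point per group are assumptions made for this estimate, and the figures exclude activations and the KV cache.

```python
def weight_memory_gb(n_params, bits_per_weight, group_size=None, meta_bits=32):
    """Approximate weight storage in GB (1 GB = 10**9 bytes).

    meta_bits is the assumed per-group overhead (16-bit scale + 16-bit zero-point).
    """
    bits = n_params * bits_per_weight
    if group_size:
        bits += (n_params / group_size) * meta_bits
    return bits / 8 / 1e9

n_params = 70e9                                   # e.g., a 70B-parameter model

fp16_gb = weight_memory_gb(n_params, 16)          # 140 GB: more than one 80 GB GPU
int4_gb = weight_memory_gb(n_params, 4)           # 35 GB: exactly 4x smaller
int4_grouped_gb = weight_memory_gb(n_params, 4, group_size=128)
```

Even with the per-group metadata included, the 4-bit weights stay well under 80 GB, which is consistent with running a 4-bit 70B model on a single A100 80 GB GPU.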

AWQ is a powerful technique that optimizes LLMs for efficiency without sacrificing model accuracy. By considering the data distribution in activations during the quantization process, it tailors the precision of weights to the specific characteristics of the model’s input data. This approach not only reduces the memory and computational requirements of neural networks but also ensures that the model remains accurate and robust in a variety of real-world scenarios.

For practical use of AWQ-ed LLMs, you can explore FriendliAI’s Friendli Inference, an LLM serving platform that facilitates the deployment and execution of quantized LLMs. With Friendli Inference, you can experience the benefits of AWQ firsthand and see its impact on the efficiency and performance of deep learning models. We believe that Friendli Inference is a key enabler for the widespread deployment of efficient and accurate LLMs. In our next blog post, we will share the details of running AWQ-ed LLMs on Friendli Inference. Stay tuned!

Written by

FriendliAI Tech & Research

