April 3, 2024

Improve Latency and Throughput with Weight-Activation Quantization in FP8

Quantization is a popular technique used to reduce the size of a machine learning model by lowering the numerical precision of some of its components. As the sizes of recent large language models (LLMs) reach hundreds of billions of parameters, deploying them efficiently becomes increasingly challenging. The goal of quantization is to make a model smaller, and thus faster, ideally without compromising model quality.

In this article, we look at 8-bit floating-point (FP8) weight-activation quantization, which Friendli Inference supports and which has been shown to improve both latency and throughput compared to other quantization schemes. To fully understand its advantages, we examine two aspects in turn: the components being quantized and the precision type used.

If you want to skip the details and jump straight into using FP8 quantization, great news! Friendli Inference supports FP8 quantization, and you can get started today. Head straight to the conclusion for a concise summary of FP8 quantization and explore the various options for getting started with Friendli Inference.

What to quantize

The two main components that can be quantized in an LLM are (1) the model parameters or weights, and (2) the activations, which are the outputs of each layer. Quantizing the parameters reduces the size of the model, while quantizing the activations is akin to performing computations like matrix multiplications with a lower precision.

Weight-only quantization (WOQ) techniques such as Activation-aware Weight Quantization (AWQ) and GPTQ quantize only the model weights. Because the model is smaller, it uses less GPU memory and needs less bandwidth to move the weights from memory. After the quantized weights are loaded from GPU memory into GPU registers, they are upcast to higher-bit floats on the fly, which adds a small dequantization overhead, and the computation itself still runs at the higher precision. Thus, WOQ improves latency but not throughput, and the latency gain is largest in low batch size scenarios where the limiting factor is bandwidth, not computation. Learn more about AWQ in our blog series here.
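The weight-only path described above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes simple round-to-nearest quantization with one scale per weight row; it is not AWQ or GPTQ, which choose the quantized values more carefully. The function names and the toy values are hypothetical.

```python
def quantize_weights_int4(row):
    """Symmetric round-to-nearest quantization of one weight row to integers in [-7, 7]."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-7, min(7, round(w / scale))) for w in row]
    return q, scale

def woq_matvec(q_rows, scales, x):
    """Weight-only path: upcast each stored integer weight to a float on the fly,
    then do the multiply-accumulate in full precision."""
    out = []
    for q_row, scale in zip(q_rows, scales):
        out.append(sum((q * scale) * xi for q, xi in zip(q_row, x)))
    return out

# Toy weights and activations (illustrative values only).
weights = [[0.12, -0.40, 0.33, 0.05],
           [-0.21, 0.07, 0.50, -0.48]]
x = [1.0, 2.0, -1.0, 0.5]

quantized = [quantize_weights_int4(row) for row in weights]  # done once, offline
q_rows = [q for q, _ in quantized]
scales = [s for _, s in quantized]

approx = woq_matvec(q_rows, scales, x)
exact = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
print(approx, exact)
```

Only the stored weights shrink here; the activations `x` and the arithmetic stay in full precision, which is why this scheme saves memory traffic rather than compute.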

Weight-activation quantization techniques (commonly named in a WxAx format, such as W8A8 to specify 8-bit weight and 8-bit activation precisions) quantize both the model parameters and the activations. Quantizing activations enables W8A8 to leverage hardware acceleration for lower-bit computation. For example, the NVIDIA H100 Tensor Core GPU can perform 8-bit integer or floating-point operations twice as fast as 16-bit floating-point math, measured in floating-point operations per second (FLOPS). At sufficiently large batch sizes, where computation rather than bandwidth is the bottleneck, higher FLOPS improves throughput. Therefore, in addition to the latency reduction from quantizing the weights, as explained for weight-only quantization, weight-activation quantization also boosts throughput.
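To make the contrast with weight-only quantization concrete, here is a minimal W8A8 sketch in plain Python. It assumes symmetric INT8 quantization with one scale per weight row and one scale for the whole activation vector; these choices, the function names, and the toy values are illustrative assumptions, not a description of how Friendli Inference or any GPU kernel is implemented.

```python
def quantize_int8(values):
    """Symmetric quantization to signed 8-bit integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale

def w8a8_matvec(q_rows, w_scales, x):
    """Quantize the activations at run time, multiply-accumulate in integers,
    and rescale once per output element."""
    qx, x_scale = quantize_int8(x)
    out = []
    for q_row, w_scale in zip(q_rows, w_scales):
        acc = sum(qw * qa for qw, qa in zip(q_row, qx))  # integer-only inner loop
        out.append(acc * w_scale * x_scale)
    return out

# Toy weights and activations (illustrative values only).
weights = [[0.12, -0.40, 0.33, 0.05],
           [-0.21, 0.07, 0.50, -0.48]]
x = [1.0, 2.0, -1.0, 0.5]

# Weights are quantized once, offline; activations are quantized per call.
q_rows, w_scales = zip(*(quantize_int8(row) for row in weights))

approx = w8a8_matvec(q_rows, w_scales, x)
exact = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
print(approx, exact)
```

The inner loop multiplies and accumulates 8-bit integers only. That low-precision multiply-accumulate is the part that Tensor Cores can run faster than 16-bit math.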

Precision

While quantization improves latency, and even throughput if the activations are also quantized, it may cause a degradation in accuracy depending on the precision type used. Most LLMs today have weights in 16-bit precision. Quantizing a model to use a lower precision means fewer bits to store and run computations on. This lower precision can be an integer or floating-point type. In this article, we’ll focus on the commonly used 8-bit integer (INT8) and 8-bit floating-point (FP8) types, but there are even smaller bit types as well. Whichever precision is used, the objective is to minimize the difference between the actual and quantized values by choosing the best set of representable values in the lower-precision data type.

With its 8 bits, the INT8 type can represent 2^8 = 256 integers. Like all integer types, INT8 uses a uniform scale, essentially scaling and rounding the parameters to one of the 256 possible values.
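The "scaling and rounding" step can be written out as follows. This is one common symmetric, per-tensor formulation, given here as an assumption for illustration; asymmetric variants with a zero point also exist. Here $x_i$ is an original value, $s$ the scale, $q_i$ the stored integer, and $\hat{x}_i$ the dequantized value.

```latex
s = \frac{\max_i |x_i|}{127}, \qquad
q_i = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{x_i}{s}\right),\, -127,\, 127\right), \qquad
\hat{x}_i = s \, q_i
```

Because adjacent representable values are always exactly $s$ apart, the rounding error $|x_i - \hat{x}_i|$ is at most $s/2$ for every value, whether that value is large or small. A single large $x_i$ therefore inflates $s$ and the error for all other values.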

FP8 is a floating-point format whose representable values are spaced non-uniformly. Besides a sign bit, the 8 bits are split into an exponent portion and a mantissa portion: the mantissa holds the significant digits of the number, and the exponent specifies how many positions to shift the binary point. With negative exponents, the format can represent many values close to zero with fine spacing.

The two commonly used 8-bit floating-point variants are:

E5M2: 1 sign bit, 5 exponent bits, 2 mantissa bits
E4M3: 1 sign bit, 4 exponent bits, 3 mantissa bits (more commonly used for inference)
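To see how these bit layouts turn into numbers, the sketch below decodes a single FP8 byte in plain Python. It assumes the commonly published FP8 conventions: an exponent bias of 15 for E5M2 with IEEE-style infinities and NaNs, and a bias of 7 for E4M3 with no infinities and a single NaN mantissa pattern. The function names are hypothetical.

```python
import math

def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one FP8 byte laid out as [sign | exponent | mantissa] into a Python float."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    max_exp = (1 << exp_bits) - 1
    max_man = (1 << man_bits) - 1

    if ieee_specials and exp == max_exp:
        # E5M2-style: the top exponent is reserved for infinity and NaN.
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == max_exp and man == max_man:
        # E4M3-style: only the all-ones pattern is NaN; there are no infinities.
        return math.nan
    if exp == 0:
        # Subnormal numbers: no implicit leading 1.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    # Normal numbers: implicit leading 1 plus the mantissa fraction.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def decode_e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)

def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)

print(decode_e4m3(0x7E))  # largest finite E4M3 value
print(decode_e5m2(0x7B))  # largest finite E5M2 value
print(decode_e4m3(0x01))  # smallest positive E4M3 value
```

Under these assumptions, E4M3 trades range for precision (its largest finite value is 448, with 3 mantissa bits), while E5M2 does the opposite (its largest finite value is 57344, with 2 mantissa bits).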

Outlier values are an important consideration in preserving accuracy with quantization. When the distribution of the original values does not match the spacing of the representable quantized values, outliers can diverge significantly from their original values after quantization. Model activations tend to have many more outliers than model weights, making outlier handling even more important in weight-activation quantization. Because weights have fewer outliers, weight-only quantization can use even lower-bit precisions like INT4 while maintaining comparable model quality, and it is already widely adopted, whereas lower-bit weight-activation quantization incurs noticeable accuracy drops due to the many outliers in the activations.

[Figure: unsigned INT4 grid and unsigned FP4 (2M2E) grids with different biases]

An example of integer vs. floating-point formats for 4 bits, with two floating-point examples centered at different values. Credit: https://arxiv.org/pdf/2303.17951.pdf

Between INT8 and FP8, INT8 is much more sensitive to outliers because its representable values are uniformly spaced. Model weights, and especially activations, tend to follow bell-shaped distributions with heavy tails, where the heavy tails indicate a high frequency of outliers. While techniques such as SmoothQuant can partially address outliers in INT8, a uniform grid still struggles to handle the abundance of outliers. On the other hand, the floating-point type FP8 is on an “exponential,” non-uniform scale, where the gap between adjacent values grows the further you go from the center, as seen in the figure above. This makes FP8 far more robust to outliers. As a result, FP8 quantization typically exhibits a notably milder accuracy drop than INT8 quantization on many models.
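The effect of a single outlier can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions: both formats use one per-tensor scale chosen so that the largest magnitude maps to the format's maximum (127 for INT8, 448 for E4M3), FP8 is only simulated with Python floats by rounding to the nearest E4M3 value, and the data is a hand-picked toy vector rather than real activations. All function names are hypothetical.

```python
import math

def e4m3_grid():
    """All non-negative finite E4M3 values (bias 7, no infinities), in increasing order."""
    grid = []
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # this bit pattern is NaN
            if exp == 0:
                grid.append((man / 8) * 2.0 ** -6)                 # subnormals
            else:
                grid.append((1 + man / 8) * 2.0 ** (exp - 7))      # normals
    return grid

E4M3 = e4m3_grid()

def fake_quant_int8(values):
    """Quantize to INT8 and immediately dequantize (uniform grid)."""
    scale = max(abs(v) for v in values) / 127
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def fake_quant_fp8(values):
    """Round to the nearest scaled E4M3 value and dequantize (non-uniform grid)."""
    scale = max(abs(v) for v in values) / 448
    out = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E4M3, key=lambda g: abs(g - target))
        out.append(math.copysign(nearest * scale, v))
    return out

def mean_abs_error(original, quantized):
    return sum(abs(a - b) for a, b in zip(original, quantized)) / len(original)

# Mostly small values plus one large outlier (toy data).
values = [0.01, -0.02, 0.03, 0.05, -0.04, 8.0]

int8_out = fake_quant_int8(values)
fp8_out = fake_quant_fp8(values)
int8_err = mean_abs_error(values, int8_out)
fp8_err = mean_abs_error(values, fp8_out)
print(int8_out, int8_err)
print(fp8_out, fp8_err)
```

In this toy case the outlier stretches the INT8 step size so far that the smallest values round to zero, while the scaled E4M3 grid still has finely spaced values near zero and keeps them distinct.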

However, INT8 is the winner in terms of hardware support. It is supported on the NVIDIA Turing, Ampere, Ada Lovelace, and Hopper Tensor Core GPU architectures, while FP8 is only supported on the more recent Ada Lovelace and Hopper architectures.

Conclusion

In conclusion, quantization is used to reduce the numerical precision of a model to enable more efficient deployment, and quantization schemes can vary in their quantized components and their precision types.

Weight-only quantization techniques like GPTQ and AWQ significantly reduce weight sizes by using very low-bit precision types such as INT4 while maintaining comparable model quality. This approach improves latency but does not improve throughput.

W8A8 quantization schemes such as SmoothQuant (INT8 weight, INT8 activation) and FP8 (FP8 weight, FP8 activation) improve both latency and throughput, since supporting hardware can run 8-bit math up to twice as fast as 16-bit math. However, SmoothQuant often suffers accuracy drops due to the uniform spacing of the integer type despite its “smoothing” of outliers, whereas FP8 typically shows a much milder drop.

FP8 emerges as the optimal choice for W8A8 quantization, preserving accuracy while delivering lower latency and higher throughput.

Experience the Friendli Advantage

Start using FP8 quantization with your models and experience the difference yourself with Friendli Inference, which natively supports FP8.

To get started, take a look at our official documentation for FP8 inference on how to quantize your own checkpoints in FP8 and run them on Friendli Inference. For your convenience, we also offer pre-converted FP8 checkpoints of well-known language models like Llama 2 and Mistral 7B so you can experience the power of FP8 on Friendli Inference right away.

Not sure which quantization scheme is right for you? Friendli Inference also supports AWQ for weight-only quantization, and our previous article is just the guide to help you choose.

Ready to unleash the power of your LLM? Experience Friendli Inference's performance! We offer three options to suit your preferences:

Friendli Dedicated Endpoints: Run any custom generative AI models on dedicated GPU instances in autopilot.
Friendli Serverless Endpoints: No setup required, simply call our APIs and let us handle the rest.
Friendli Container: Deploy the engine on your own infrastructure for ultimate control.

Visit https://friendli.ai/get-started/container to begin your journey into the world of high-performance LLM serving with Friendli Inference!

Written by

FriendliAI Tech & Research
