Get an Extra Speedup of LLM Inference with Integer Quantization on Friendli Engine


At FriendliAI, our top priority is to deliver a serving system with the best performance. We are excited to introduce a new feature that boosts serving performance by utilizing integer quantization, built on top of Friendli Engine.

What is integer quantization?

Large language models perform enormous numbers of operations over billions of parameters. Among these operations, matrix multiplication (‘matmul’) dominates, accounting for most of the computation time. To address this, modern NVIDIA GPUs are equipped with Tensor Cores for matmul operations, which can deliver processing speeds an order of magnitude faster than GPUs without Tensor Cores [1].

To maximize performance, recent GPU architectures like Turing, Ampere, and Hopper feature Tensor Cores capable of integer matrix multiplication at least twice as fast as float16 matmul operations. Taking advantage of this hardware enhancement, recent research has focused on employing small integer types for matmul, such as int8, int4, or even binary representations. This technique, known as integer quantization, involves representing weight and activation tensors with narrower integer types to harness the efficient computation capabilities of integer matmul.

Among the various quantization schemes, our team specifically focuses on int8 quantization, which effectively reduces CUDA kernel latency while preserving model accuracy.
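
To make the idea concrete, here is a minimal NumPy sketch of symmetric per-tensor int8 quantization, not the engine’s actual kernels: each float value is scaled into the int8 range so the matmul can run on integer inputs, and the scale is folded back in afterwards. The function names are ours, for illustration only.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = np.max(np.abs(x)) / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from its int8 representation."""
    return q.astype(np.float32) * scale

# Quantize a small weight matrix and check the reconstruction error.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
print("max abs error:", np.max(np.abs(w - w_hat)))
```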

Performance of int8 quantization

Figure: End-to-end mean latency comparison of int8 vs. fp16 on OPT-13B.

To evaluate the performance, we compared the mean end-to-end latency of an OPT-13B model [2] between int8 and fp16 modes on an NVIDIA A100 80 GB GPU. As the graph above shows, int8 quantization achieves 2.46x lower latency than fp16 mode when operating at the same level of throughput.
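
For reference, mean end-to-end latency here is simply the average wall-clock time from submitting a request until its final token arrives. A minimal sketch of that measurement, assuming a hypothetical blocking client call `send_request` (not part of Friendli Engine’s API):

```python
import time

def mean_end_to_end_latency(prompts, send_request):
    """Average wall-clock time per request, from submission to final token.

    `send_request` is a placeholder for whatever client call issues a prompt
    to the serving engine and blocks until generation finishes.
    """
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        send_request(prompt)               # blocks until the full response is generated
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)
```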

This result shows how much further quantization can push our engine’s performance: Friendli Engine already surpasses existing serving systems, and the int8 quantization scheme adds an extra speedup on top of that. For comparisons with other serving systems on various LLMs, please refer to the following links: #1, #2, #3.

Addressing accuracy drop in quantization

Typically, integer quantization results in some degree of accuracy reduction and requires additional fine-tuning to restore the model quality. Recent research has made significant advancements in mitigating this issue. Several techniques have been developed to maintain accuracy during quantization without needing to fine-tune the model.

One notable technique is SmoothQuant [3], which tackles the challenge of quantizing activation tensors by shifting the difficulty to weight matrices. This approach effectively “smooths” outlier values of activation tensors, enabling them to be safely included within the quantization range.
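
As a rough sketch of that idea, following the formulation in the SmoothQuant paper rather than Friendli Engine internals, a per-channel smoothing factor is computed from the activation and weight ranges and folded into both tensors so their product is unchanged. The small epsilon is our addition to guard against division by zero.

```python
import numpy as np

def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """Migrate activation outliers into the weights, per input channel.

    x: activations of shape [tokens, in_features]
    w: weights of shape [in_features, out_features]
    Returns (x_hat, w_hat) with x_hat @ w_hat == x @ w (up to float error),
    but with x_hat having a much narrower per-channel range, so it quantizes well.
    """
    act_max = np.max(np.abs(x), axis=0)        # per-channel activation range
    wgt_max = np.max(np.abs(w), axis=1)        # per-channel weight range
    s = (act_max ** alpha) / (wgt_max ** (1.0 - alpha) + 1e-8)
    x_hat = x / s                              # divide activations by s
    w_hat = w * s[:, None]                     # multiply weights by s
    return x_hat, w_hat
```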


A graph from the SmoothQuant paper [3], demonstrating that SmoothQuant preserves the accuracy of int8-quantized models compared to the fp16 models.

SmoothQuant demonstrates its effectiveness in preserving the accuracy of OPT models ranging from 1.3B to 175B when quantized to int8. This aligns with our own accuracy evaluations on int8-quantized OPT models using SmoothQuant on our Friendli Engine.

Summary

We introduced integer quantization, a new feature that significantly improves LLM serving performance on Friendli Engine. We employ SmoothQuant to preserve the accuracy of the quantized models. Our evaluation shows that, at the same level of throughput, int8 quantization achieves 2.46x faster mean end-to-end latency than fp16 mode.

For more information about FriendliAI, check out this link. For more about Friendli Engine, check out this link.

[1] NVIDIA Technical Blog, "NVIDIA Ampere Architecture In-Depth," https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/

[2] Zhang, Susan, et al. "OPT: Open Pre-trained Transformer Language Models." arXiv preprint arXiv:2205.01068 (2022).

[3] Xiao, Guangxuan, et al. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." International Conference on Machine Learning. PMLR, 2023.


