- June 27, 2023
- 3 min read
Get an Extra Speedup of LLM Inference with Integer Quantization on PeriFlow
At FriendliAI, our top priority is to deliver a serving system with the best performance. We are excited to introduce a new feature that boosts serving performance by utilizing integer quantization, built on top of PeriFlow Serving Engine.
What is integer quantization?
Large language models comprise vast numbers of operations over billions of parameters. Among these operations, ‘matmul’ (matrix multiplication) takes up the majority, resulting in substantial computation time. To address this, modern NVIDIA GPUs are equipped with Tensor Cores for matmul operations, which can deliver processing speeds an order of magnitude faster than GPUs without Tensor Cores.
To maximize performance, recent GPU architectures like Turing, Ampere, and Hopper feature Tensor Cores capable of integer matrix multiplication, which runs at least twice as fast as float16 matmul. Taking advantage of this hardware enhancement, recent research has focused on employing small integer types for matmul, such as int8, int4, or even binary representations. This technique, known as integer quantization, represents weight and activation tensors with narrower integer types to harness the efficient computation capabilities of integer matmul.
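As a concrete illustration of the idea, here is a minimal NumPy sketch of symmetric per-tensor int8 quantization: floats are scaled into the int8 range, the matmul accumulates in int32 (as integer Tensor Cores do), and the result is dequantized with the product of the two scales. This is only an illustrative sketch, not PeriFlow's actual kernel implementation.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)  # weight matrix
X = rng.standard_normal((8, 3)).astype(np.float32)  # activation tensor

qW, sW = quantize_int8(W)
qX, sX = quantize_int8(X)

# Integer matmul with int32 accumulation, then dequantize the result.
Y_int32 = qW.astype(np.int32) @ qX.astype(np.int32)
Y = Y_int32.astype(np.float32) * (sW * sX)

print(np.max(np.abs(Y - W @ X)))  # small quantization error
```

In a real kernel the int8 × int8 → int32 product maps directly onto a single Tensor Core instruction, which is where the latency reduction comes from.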
Among various quantization schemes, our team specifically focuses on int8 quantization, which effectively reduces the latency required for CUDA kernels while preserving model accuracy.
Performance of int8 quantization
To evaluate the performance, we compared the mean end-to-end latency of the OPT-13B model between int8 and fp16 modes on an NVIDIA A100 80 GB GPU. The graph above shows that int8 quantization achieves 2.46x faster latency compared to fp16 mode when operating at the same level of throughput.
This experiment offers valuable insight into enhancing our engine’s performance. In addition to our PeriFlow Serving Engine, whose performance already surpasses existing serving systems, we can utilize the described quantization scheme. For comparisons with other serving systems on various LLMs, please refer to the following links: #1, #2, #3.
Addressing accuracy drop in quantization
Typically, integer quantization results in some degree of accuracy reduction, requiring additional fine-tuning to restore model quality. Recent research has made significant advancements in mitigating this issue: several techniques have been developed that maintain accuracy during quantization without needing to fine-tune the model.
One notable technique is SmoothQuant, which tackles the challenge of quantizing activation tensors by shifting the difficulty to weight matrices. This approach effectively “smooths” outlier values of activation tensors, enabling them to be safely included within the quantization range.
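The smoothing transform can be sketched in a few lines of NumPy. Following the SmoothQuant paper, each input channel j gets a factor s_j = max|X[:, j]|^α / max|W[j, :]|^(1−α); dividing activations and multiplying weights by s leaves the product X @ W mathematically unchanged while flattening the activation outliers. The code below is an illustrative sketch under these assumptions, not PeriFlow's implementation.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """SmoothQuant-style smoothing: migrate activation outliers into weights.

    Scaling X down and W up by the same per-channel factor s keeps X @ W
    identical, but shrinks activation outliers so int8 quantization of the
    activations loses less accuracy.
    """
    act_max = np.abs(X).max(axis=0)  # per-channel activation range
    w_max = np.abs(W).max(axis=1)    # per-channel weight range
    s = act_max**alpha / w_max**(1 - alpha)
    return X / s, W * s[:, None], s

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8)).astype(np.float32)
X[:, 3] *= 50.0  # inject an outlier channel, as seen in real LLM activations
W = rng.standard_normal((8, 4)).astype(np.float32)

X_s, W_s, s = smooth(X, W)
assert np.allclose(X_s @ W_s, X @ W, atol=1e-2)  # output is unchanged

# Channel ranges are much flatter after smoothing, so one int8 scale fits all.
print(np.abs(X).max(axis=0))    # before: one channel dominates
print(np.abs(X_s).max(axis=0))  # after: ranges are comparable
```

The factor α controls how much of the quantization difficulty is migrated from activations to weights; the paper finds α = 0.5 works well for OPT models.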
SmoothQuant demonstrates its effectiveness in preserving the accuracy of OPT models ranging from 1.3B to 175B when quantized to int8. This aligns with our own accuracy evaluations on int8-quantized OPT models using SmoothQuant on our PeriFlow Serving Engine.
We introduce a new feature, integer quantization, which significantly improves the serving performance and speed of LLMs. We employ SmoothQuant to preserve the accuracy of the quantized models. Our evaluation demonstrates that at the same level of throughput, int8 quantization achieves a 2.46x faster mean end-to-end latency compared to fp16 mode.
NVIDIA Technical Blog, “NVIDIA Ampere Architecture In-Depth,” https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/
Zhang, Susan, et al. “OPT: Open Pre-trained Transformer Language Models.” arXiv preprint arXiv:2205.01068 (2022).
Xiao, Guangxuan, et al. “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.” International Conference on Machine Learning. PMLR, 2023.
FriendliAI Tech & Research