Get an Extra Speedup of LLM Inference with Integer Quantization on PeriFlow


At FriendliAI, our top priority is to deliver a serving system with the best performance. We are excited to introduce a new feature that boosts serving performance by utilizing integer quantization, built on top of the PeriFlow Serving Engine.

What is integer quantization?

Large language models perform vast numbers of operations over billions of parameters. Among these operations, matrix multiplication ('matmul') takes up the majority, resulting in substantial computation time. To address this, modern NVIDIA GPUs are equipped with Tensor Cores for matmul operations, which can deliver processing speeds an order of magnitude faster than GPUs without Tensor Cores [1].

To maximize performance, recent GPU architectures like Turing, Ampere, and Hopper feature Tensor Cores capable of integer matrix multiplication, which is at least twice as fast as float16 matmul. Taking advantage of this hardware enhancement, recent research has focused on employing small integer types for matmul, such as int8, int4, or even binary representations. This technique, known as integer quantization, represents weight and activation tensors with narrower integer types to harness the efficient computation capabilities of integer matmul.

Among various quantization schemes, our team specifically focuses on int8 quantization, which effectively reduces the latency required for CUDA kernels while preserving model accuracy.
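As a rough illustration of the idea (a minimal NumPy sketch, not PeriFlow's actual kernel code): symmetric per-tensor int8 quantization computes a scale from the tensor's largest magnitude, rounds values into the int8 range, and rescales after an int32-accumulated matmul. The function names here are hypothetical.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: x ~= q * scale."""
    scale = np.abs(x).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def int8_matmul(qa, sa, qb, sb):
    """Int8 'matmul': accumulate in int32, then rescale to float."""
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

np.random.seed(0)
x = np.random.randn(4, 8).astype(np.float32)
w = np.random.randn(8, 4).astype(np.float32)
qx, sx = quantize_int8(x)
qw, sw = quantize_int8(w)
approx = int8_matmul(qx, sx, qw, sw)
exact = x @ w  # approx stays close to exact: quantization error is small
```

On real hardware the int32 accumulation runs on integer Tensor Cores, which is where the speedup comes from; this sketch only shows the numerics.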

Performance of int8 quantization

To evaluate the performance, we compared the mean end-to-end latency of the OPT-13B model [2] in int8 and fp16 modes on an NVIDIA A100 80 GB GPU. The graph above shows that int8 quantization achieves 2.46x lower latency than fp16 mode when operating at the same level of throughput.

This experiment offers valuable insight into enhancing our engine’s performance. In addition to our PeriFlow Serving Engine, whose performance already surpasses existing serving systems, we can utilize the described quantization scheme. For comparisons with other serving systems on various LLMs, please refer to the following links: #1, #2, #3.

Addressing accuracy drop in quantization

Typically, integer quantization results in some degree of accuracy reduction, requiring additional fine-tuning to restore model quality. Recent research has made significant advancements in mitigating this issue: several techniques have been developed to maintain accuracy during quantization without fine-tuning the model.

One notable technique is SmoothQuant [3], which tackles the challenge of quantizing activation tensors by shifting the difficulty to weight matrices. This approach effectively “smooths” outlier values of activation tensors, enabling them to be safely included within the quantization range.
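A minimal sketch of the smoothing idea (simplified from the SmoothQuant paper; the per-channel scale formula and the `alpha` parameter follow the paper, but this is not PeriFlow's implementation): each activation channel is divided by a factor s, and the corresponding weight row is multiplied by s, so X @ W is mathematically unchanged while X / s has far smaller outliers and quantizes safely.

```python
import numpy as np

def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """Migrate quantization difficulty from activations to weights.

    x: activations of shape (tokens, channels)
    w: weights of shape (channels, out_features)
    Returns (x_s, w_s) with x_s @ w_s equal to x @ w.
    """
    act_max = np.abs(x).max(axis=0)  # per-channel activation range
    w_max = np.abs(w).max(axis=1)    # per-channel weight range
    s = (act_max ** alpha) / (w_max ** (1.0 - alpha) + 1e-8)
    s = np.maximum(s, 1e-5)          # guard against division by zero
    return x / s, w * s[:, None]

np.random.seed(0)
x = np.random.randn(16, 8).astype(np.float32)
x[:, 3] *= 50.0                      # inject an outlier activation channel
w = np.random.randn(8, 4).astype(np.float32)
xs, ws = smooth(x, w)                # product preserved, outliers tamed
```

Because `(x / s) @ (w * s[:, None])` equals `x @ w` exactly in real arithmetic, the smoothing can be folded into the weights offline at no runtime cost.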

SmoothQuant demonstrates its effectiveness in preserving the accuracy of OPT models ranging from 1.3B to 175B parameters when quantized to int8. This aligns with our own accuracy evaluations of int8-quantized OPT models using SmoothQuant on the PeriFlow Serving Engine.


Conclusion

We introduce a new feature, integer quantization, which significantly improves the serving performance and speed of LLMs. We employ SmoothQuant to preserve the accuracy of the quantized models. Our evaluation demonstrates that at the same level of throughput, int8 quantization achieves 2.46x lower mean end-to-end latency than fp16 mode.

For more information about FriendliAI and PeriFlow, check the links.

[1] NVIDIA Technical Blog.

[2] Zhang, Susan, et al. “OPT: Open Pre-trained Transformer Language Models.” arXiv preprint arXiv:2205.01068 (2022).

[3] Xiao, Guangxuan, et al. “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.” International Conference on Machine Learning. PMLR, 2023.

