- June 27, 2023
- 3 min read
Get an Extra Speedup of LLM Inference with Integer Quantization on Friendli Engine
At FriendliAI, our top priority is to deliver a serving system with the best performance. We are excited to introduce a new feature that boosts serving performance by utilizing integer quantization, built on top of Friendli Engine.
What is integer quantization?
Large language models involve vast numbers of operations over billions of parameters. Among these, the ‘matmul’ (matrix multiplication) operation dominates, accounting for most of the computation time. To address this, modern NVIDIA GPUs are equipped with Tensor Cores for matmul operations, which can deliver processing speeds an order of magnitude faster than GPUs without Tensor Cores [1].
To maximize performance, recent GPU architectures like Turing, Ampere, and Hopper feature Tensor Cores capable of integer matrix multiplication at least twice as fast as float16 matmul operations. Taking advantage of this hardware enhancement, recent research has focused on employing small integer types for matmul, such as int8, int4, or even binary representations. This technique, known as integer quantization, involves representing weight and activation tensors with narrower integer types to harness the efficient computation capabilities of integer matmul.
Among various quantization schemes, our team specifically focuses on int8 quantization, which effectively reduces the latency required for CUDA kernels while preserving model accuracy.
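To make the idea concrete, below is a minimal sketch of symmetric per-tensor int8 quantization followed by an integer matmul with int32 accumulation and dequantization. It uses NumPy purely for illustration and is not the kernel implementation used in Friendli Engine; the scale computation, clipping range, and tensor shapes are illustrative assumptions.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: x ~= scale * q."""
    scale = np.abs(x).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Quantize both operands, multiply with int32 accumulation,
    then dequantize the result back to float32."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    # Integer matmul; accumulate in int32 to avoid overflow
    # (mirroring what int8 Tensor Core instructions do in hardware).
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

# Toy example: compare against the float32 reference.
a = np.random.randn(64, 128).astype(np.float32)
b = np.random.randn(128, 32).astype(np.float32)
print("max abs error:", np.abs(a @ b - int8_matmul(a, b)).max())
```

In a real serving engine the integer matmul runs on Tensor Cores via dedicated GPU kernels, and the scales are computed per channel or calibrated offline; the sketch only shows the arithmetic that makes int8 inference possible.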
Performance of int8 quantization
To evaluate the performance, we compared the mean end-to-end latency of an OPT-13B model [2] between int8 and fp16 modes on an NVIDIA A100 80 GB GPU. As the graph above shows, int8 quantization achieves 2.46x lower mean end-to-end latency than fp16 mode at the same level of throughput.
This experiment offers valuable insight into enhancing our engine’s performance: the quantization scheme can be applied on top of Friendli Engine, whose performance already surpasses that of existing serving systems. For comparisons with other serving systems on various LLMs, please refer to the following links: #1, #2, #3.
Addressing accuracy drop in quantization
Typically, integer quantization causes some loss of accuracy and requires additional fine-tuning to restore model quality. Recent research has made significant progress in mitigating this issue, and several techniques can now preserve accuracy during quantization without fine-tuning the model.
One notable technique is SmoothQuant [3], which tackles the challenge of quantizing activation tensors by shifting the difficulty to weight matrices. This approach effectively “smooths” outlier values of activation tensors, enabling them to be safely included within the quantization range.
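Conceptually, SmoothQuant migrates quantization difficulty from activations to weights with a per-input-channel scale s_j = max|X_j|^α / max|W_j|^(1−α), rewriting Y = XW as Y = (X diag(s)^−1)(diag(s) W) so the output is mathematically unchanged. The sketch below illustrates that transformation in NumPy under those assumptions; it is not the engine’s implementation, and α = 0.5 is the default suggested in the paper.

```python
import numpy as np

def smooth_scales(x: np.ndarray, w: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel smoothing scales from SmoothQuant [3]:
    s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = np.abs(x).max(axis=0)   # per-channel activation range
    w_max = np.abs(w).max(axis=1)     # per-channel weight range
    return (act_max ** alpha) / (w_max ** (1.0 - alpha))

# X: [tokens, in_channels] activations, W: [in_channels, out_features] weights.
x = np.random.randn(256, 512).astype(np.float32)
x[:, 7] *= 50.0                       # inject an activation outlier channel
w = np.random.randn(512, 1024).astype(np.float32)

s = smooth_scales(x, w)
x_smooth = x / s                      # activations become easier to quantize
w_smooth = w * s[:, None]             # weights absorb the difficulty
# The product is unchanged, so the layer's output is preserved.
assert np.allclose(x @ w, x_smooth @ w_smooth, atol=1e-2)
```

After this offline rescaling, both the smoothed activations and the smoothed weights can be quantized to int8 with far less clipping error on the outlier channels.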
A graph from the SmoothQuant paper [3], demonstrating that SmoothQuant preserves the accuracy of int8-quantized models compared to the fp16 models.
SmoothQuant demonstrates its effectiveness in preserving the accuracy of OPT models ranging from 1.3B to 175B when quantized to int8. This aligns with our own accuracy evaluations on int8-quantized OPT models using SmoothQuant on our Friendli Engine.
Summary
We introduced integer quantization, a new feature that significantly improves the serving performance of LLMs. We employ SmoothQuant to preserve the accuracy of the quantized models, and our evaluation shows that, at the same level of throughput, int8 quantization achieves 2.46x lower mean end-to-end latency than fp16 mode.
For more information about FriendliAI, check the link.
To learn more about Friendli Engine, check the link.
[1] NVIDIA Technical blog, https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/
[2] Zhang, Susan, et al. “OPT: Open Pre-trained Transformer Language Models.” arXiv preprint arXiv:2205.01068 (2022).
[3] Xiao, Guangxuan, et al. “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.” International Conference on Machine Learning. PMLR, 2023.
Written by
FriendliAI Tech & Research