- February 15, 2024
- 3 min read
Which Quantization to Use to Reduce the Size of LLMs?

Interacting with powerful language models like Llama 2 and GPT-4 is becoming increasingly demanding because of their sheer size. The result is memory bottlenecks that hold back the potential of generative AI. Techniques like Activation-aware Weight Quantization (AWQ) (described in our articles [1], [2], and [3]) relieve models of these resource bottlenecks, optimizing LLMs for efficient "inference serving": the real-world deployment where they interact with users.
Many researchers have tackled this problem, and the resulting quantization methods each take a different approach, balancing speed, accuracy, and model size reduction. Some prioritize near-lossless compression, while others target hardware compatibility or lightweight quantization. Understanding these trade-offs is key to choosing the best method for your specific goals. This article dives into the world of LLM quantization and explores how different quantization methods benefit LLM inference serving, with a focus on finding the sweet spot between speed and accuracy. After all, when it comes to serving users, high latency translates directly into unhappy interactions, making it one of the most crucial factors to consider.
Several recent techniques exist for quantizing LLMs to reduce model sizes and mitigate the memory bottleneck during inference serving. Here's a quick overview of the key methods and their characteristics:
| Method | Contributions |
|---|---|
| GPTQ (2022) | Post-training quantization using Hessian information; pushes weights down to 4 bits. Pioneering work that inspired many subsequent weight-only post-training quantization methods. |
| OWQ (Outlier-Aware Weight Quantization, 2023) | Mixed-precision quantization scheme that accounts for activation outliers. |
| SpQR (Sparse-Quantized Representation, 2023) | Isolates outlier weights and keeps them in high precision. |
| SqueezeLLM (2023) | Sensitivity-based non-uniform quantization with outlier extraction. |
| SmoothQuant (2022) | Quantizes both weights and activations, utilizing faster acceleration units such as the INT8 Tensor Cores on an NVIDIA A100 GPU. |
| AWQ (Activation-aware Weight Quantization, 2023) | Searches for optimal per-channel scaling by observing activations. |
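Most of the weight-only methods above share the same underlying primitive: group-wise low-bit quantization of the weight matrix, where each group of weights stores its 4-bit codes plus a floating-point scale and zero-point. The sketch below illustrates that primitive in plain PyTorch; the group size of 128 and the asymmetric INT4 scheme are illustrative assumptions rather than the exact recipe of any single method, and real kernels would additionally pack two 4-bit codes per byte.

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Asymmetric 4-bit group-wise quantization of a 2-D weight matrix.

    Each contiguous group of `group_size` weights along the input dimension
    shares one float scale and zero-point; only the 4-bit codes plus this
    per-group metadata need to be stored.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    groups = w.reshape(out_features, in_features // group_size, group_size)

    w_min = groups.amin(dim=-1, keepdim=True)
    w_max = groups.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 16 levels -> 2^4 - 1 steps
    zero_point = torch.round(-w_min / scale)

    q = torch.clamp(torch.round(groups / scale) + zero_point, 0, 15)
    return q.to(torch.uint8), scale, zero_point

def dequantize_int4_groupwise(q, scale, zero_point):
    """Reconstruct an approximate float weight matrix from the 4-bit codes."""
    groups = (q.float() - zero_point) * scale
    return groups.reshape(q.shape[0], -1)

# Quick check of the reconstruction error on a random "weight" matrix.
w = torch.randn(4096, 4096)
q, s, z = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(q, s, z)
print("mean abs error:", (w - w_hat).abs().mean().item())
```

Where the methods differ is in how they decide which weights can tolerate this rounding: Hessian-guided updates (GPTQ), outlier separation (OWQ, SpQR, SqueezeLLM), or activation-aware scaling (AWQ).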
Key insights:
- In terms of accuracy, OWQ, SpQR, and SqueezeLLM are known to achieve slightly better accuracy than AWQ. AWQ in turn is more accurate than GPTQ, while SmoothQuant falls behind both.
- In terms of speed, SmoothQuant performs best, followed by AWQ. OWQ, SpQR, and SqueezeLLM require more complex computation for their quantization schemes and are therefore slower.
- AWQ stands out for its balance of accuracy and speed, making it well-suited for inference serving with strict latency requirements under high load. The sketch below illustrates the intuition behind its activation-aware scaling.
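To make the AWQ row in the table concrete: the observation is that a small fraction of weight channels matters disproportionately because the activations feeding them are large, so those channels are scaled up before quantization and the inverse scale is folded back into the preceding computation at inference. The following is a rough conceptual sketch, not AWQ's actual implementation; the exponent `alpha` is treated as a fixed hyperparameter here, whereas AWQ searches for the best scaling on calibration data.

```python
import torch

def awq_style_scales(act_samples: torch.Tensor, alpha: float = 0.5):
    """Derive a per-input-channel scale from observed activation magnitudes.

    `act_samples` holds calibration activations of shape (num_tokens, in_features).
    Channels whose activations are larger on average get larger scales, so their
    weights occupy more of the quantization grid and lose less precision.
    """
    channel_mag = act_samples.abs().mean(dim=0)   # (in_features,)
    scale = channel_mag.clamp(min=1e-5) ** alpha
    return scale / scale.mean()                   # keep the overall magnitude stable

def apply_awq_style_scaling(w: torch.Tensor, scale: torch.Tensor):
    """Scale weight columns up before quantization. At inference the input
    activations are divided by the same scale (or the division is folded into
    the preceding layer), so the layer's float output is unchanged."""
    return w * scale.unsqueeze(0)                 # w: (out_features, in_features)

# Illustrative usage with random calibration data (shapes only).
w = torch.randn(4096, 4096)
calib_acts = torch.randn(512, 4096)
s = awq_style_scales(calib_acts)
w_scaled = apply_awq_style_scaling(w, s)
# The scaled weights would then go through a group-wise quantizer such as the
# quantize_int4_groupwise() sketch above.
```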
We put AWQ to the test, serving an AWQ-quantized Llama-2 70B chat model on a single NVIDIA A100 80GB GPU with the Stanford Alpaca dataset.
We measured performance using two key metrics:
- Time to First Token (TTFT): How long it takes to generate the first token.
- Time Per Output Token (TPOT): How long it takes to generate each subsequent token.
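For reference, both metrics can be collected from any streaming, OpenAI-compatible endpoint (vLLM and Friendli Inference each expose one) by timestamping the streamed chunks. The sketch below is a single-request illustration only; the base URL, API key, and model name are placeholders, and a real benchmark would drive many concurrent requests to reproduce the load levels discussed here.

```python
import time
from openai import OpenAI

# Placeholder endpoint, key, and model name -- substitute your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_ttft_tpot(prompt: str, model: str = "meta-llama/Llama-2-70b-chat-hf"):
    """Time the first streamed token (TTFT) and the average gap between
    subsequent tokens (TPOT) for one request."""
    start = time.perf_counter()
    first_token_latency = None
    token_timestamps = []

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices or chunk.choices[0].delta.content is None:
            continue  # skip role-only or empty chunks
        now = time.perf_counter()
        if first_token_latency is None:
            first_token_latency = now - start        # TTFT
        token_timestamps.append(now)

    gaps = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0    # TPOT
    return first_token_latency, tpot

ttft, tpot = measure_ttft_tpot("Explain activation-aware weight quantization briefly.")
print(f"TTFT: {ttft:.3f}s  TPOT: {tpot:.4f}s/token")
```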
Compared to vLLM, an open-source serving engine, Friendli Inference (both using AWQ):
- Responds at least 2x faster on the first token, making those initial interactions snappy and responsive.
- Delivers up to 219x faster TTFT at higher loads, ensuring smooth, lag-free conversations even when things get intense.
Considering that we read about 4 words per second on average, vLLM starts to feel sluggish beyond a certain point. Friendli Inference, on the other hand, maintains its lightning speed, keeping you comfortably engaged no matter how complex the conversation gets.
- 1.7x to 18.3x faster TPOT, guaranteeing a natural, human-like conversation flow even under heavy loads.
Ready to Unleash the Power of Your LLM? Experience Friendli Inference's performance! We offer three options to suit your preferences:
- Friendli Dedicated Endpoints: Run any custom generative AI models on dedicated GPU instances on autopilot.
- Friendli Serverless Endpoints: No setup required, simply call our APIs and let us handle the rest.
- Friendli Container: Deploy the engine on your own infrastructure for ultimate control.
Visit https://friendli.ai/try-friendli to begin your journey into the world of high-performance LLM serving with Friendli Inference!
Written by
FriendliAI Tech & Research
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Friendli Inference lets you squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the metric that matters, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub gives you one-click deployment, taking you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you want a customized solution for that key issue that is slowing your growth, email contact@friendli.ai or click Contact Sales. Our experts (not a bot) will reply within one business day.