  • February 15, 2024
  • 4 min read

Which Quantization to Use to Reduce the Size of LLMs?

Interacting with powerful language models like Llama 2 and GPT-4 is becoming increasingly resource-intensive because of their size. Large, complex models commonly hit memory bottlenecks, which limits the potential of generative AI. However, techniques like Activation-Aware Weight Quantization (AWQ) (described in our articles [1], [2], and [3]) provide quantization methods that relieve models of these resource bottlenecks, optimizing LLMs for efficient "inference serving": the real-world deployment where they interact with users.

Many researchers have tackled this problem, and the resulting quantization methods each take a different approach, balancing speed, accuracy, and model size reduction. Some prioritize near-lossless compression, while others target hardware compatibility or lightweight quantization. Understanding these trade-offs is key to choosing the best method for your specific goals. This article explores how different quantization methods benefit LLM inference serving, with a focus on finding the sweet spot between speed and accuracy. After all, when it comes to serving users, high latencies translate directly into unhappy interactions, making latency one of the most crucial factors to consider.

Several recent techniques exist for quantizing LLMs (Large Language Models) to reduce the model sizes and mitigate the memory bottleneck during inference serving. Here's a quick overview of the key methods and their characteristics:

  • GPTQ (2022): Post-training quantization using Hessian information. Pushes weights down to 4 bits each. Pioneering work that led to many weight-only post-training quantization methods.
  • OWQ (Outlier-Aware Weight Quantization, 2023): Mixed-precision quantization scheme that takes activation outliers into account.
  • SpQR (Sparse-Quantized Representation, 2023): Isolates outlier weights and keeps them in high precision.
  • SqueezeLLM (2023): Sensitivity-based non-uniform quantization with outlier extraction.
  • SmoothQuant (2022): Quantizes both weights and activations. Utilizes faster acceleration units such as the INT8 Tensor Cores on an NVIDIA A100 GPU.
  • AWQ (Activation-aware Weight Quantization, 2023): Searches for optimal per-channel scaling by observing activations.
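
To make the role of outliers in the methods above more concrete, here is a minimal sketch of plain round-to-nearest 4-bit quantization over one group of weights. It is not the algorithm of GPTQ, AWQ, or any other method listed; it only illustrates the baseline they improve on, and why keeping an outlier weight in high precision (the idea behind SpQR-style approaches) can lower the error on the remaining weights. The function names and the sample weights are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights.

    Returns integer codes plus the (scale, lo) pair needed to dequantize.
    Illustrative baseline only, not any specific published method.
    """
    qmax = (1 << bits) - 1                  # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax
    if scale == 0:                          # constant group: any scale works
        scale = 1.0
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + lo for c in codes]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical group: seven small weights and one large outlier.
group = [0.01, -0.02, 0.03, 0.00, -0.01, 0.02, -0.03, 8.0]

# Naive: quantize everything together, so the outlier stretches the range.
naive = dequantize_group(*quantize_group(group))
err_naive = mean_abs_error(group, naive)

# Outlier-aware: keep the outlier in full precision, quantize the rest.
inliers = group[:-1]
isolated = dequantize_group(*quantize_group(inliers)) + [group[-1]]
err_isolated = mean_abs_error(group, isolated)

print(f"naive error:    {err_naive:.5f}")
print(f"isolated error: {err_isolated:.5f}")
```

In this toy group the single outlier forces a coarse quantization step on every other weight, so isolating it reduces the mean error. Real methods pay for that accuracy with extra bookkeeping at inference time, which is the speed trade-off discussed in the key insights.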

Key insights:

  • In terms of accuracy, OWQ, SpQR, and SqueezeLLM are known to show slightly better accuracy than AWQ. AWQ gives better accuracy than GPTQ, while SmoothQuant falls behind them.
  • In terms of speed, SmoothQuant shows the best performance, followed by AWQ. OWQ, SpQR, and SqueezeLLM require more complex computation for their quantization schemes and are therefore slower.
  • AWQ stands out for its balance of accuracy and speed, making it well-suited for inference serving with strict latency requirements under high load.

We put our quantization to the test by running the AWQ-quantized Llama 2 70B Chat model on a single NVIDIA A100 80GB GPU with the Stanford Alpaca dataset.
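
A back-of-the-envelope estimate shows why 4-bit weights matter for this setup. The sketch below counts weight storage only, assuming a round 70 billion parameters and ignoring the KV cache, activations, and the per-group scale metadata that quantized formats add, so the real footprint is somewhat larger. The function name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 70e9      # assumed round parameter count for a 70B model
GPU_GB = 80        # memory of one A100 80GB

fp16_gb = weight_memory_gb(PARAMS, 16)
int4_gb = weight_memory_gb(PARAMS, 4)

print(f"FP16 weights:  {fp16_gb:.0f} GB, fits on one GPU: {fp16_gb < GPU_GB}")
print(f"4-bit weights: {int4_gb:.0f} GB, fits on one GPU: {int4_gb < GPU_GB}")
```

Under these assumptions, 16-bit weights alone exceed a single 80 GB GPU, while 4-bit weights leave room for the KV cache and activations.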

We measured performance using two key metrics:

  • Time to First Token (TTFT): How long it takes to generate the first token.
  • Time Per Output Token (TPOT): How long it takes to generate each subsequent token.
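
As a rough illustration of how these two metrics can be derived from timestamps, the sketch below computes TTFT and TPOT for a single request and a nearest-rank 90th percentile across requests. The helper names and the sample timestamps are hypothetical; this is not the measurement harness used for the results in this article.

```python
import math


def ttft(request_time, token_times):
    """Seconds from sending the request to receiving the first token."""
    return token_times[0] - request_time


def tpot(token_times):
    """Average seconds per token after the first; None if only one token."""
    if len(token_times) < 2:
        return None
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(values, p):
    """Nearest-rank percentile, e.g. p=90 for p90."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request: sent at t=0.0 s, tokens arrive at these times.
token_times = [0.5, 0.6, 0.7, 0.8]
print("TTFT:", ttft(0.0, token_times))
print("TPOT:", tpot(token_times))

# Hypothetical TTFT samples (seconds) from ten requests.
samples = [0.4, 0.5, 0.45, 0.6, 0.55, 0.5, 0.48, 2.1, 0.52, 0.47]
print("p90 TTFT:", percentile(samples, 90))
```

Reporting a high percentile such as p90 rather than the mean captures the slow requests that users actually notice under load.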

Compared to vLLM, an open-source serving engine, with both engines serving the AWQ model, the Friendli Engine:

  • Responds at least 2x faster on the first token, making those initial interactions snappy and responsive.
  • Achieves up to 219x faster TTFT at higher loads, ensuring smooth, lag-free conversations even when things get intense.

[Figure: p90 time to first token comparison for the Llama 2 70B Chat AWQ model on an NVIDIA A100 80GB GPU with the Stanford Alpaca dataset]

Considering that we can read about 4 words per second on average, vLLM starts to feel sluggish beyond a certain point. On the other hand, the Friendli Engine maintains its lightning speed, keeping you comfortably engaged no matter how complex the conversation gets.
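
The reading-speed argument can be turned into a rough TPOT budget. The sketch below uses the 4 words per second from the text; the tokens-per-word ratio of 1.3 is an assumed rule of thumb for English text, not a figure from this article, and the function name is hypothetical.

```python
def tpot_budget_seconds(words_per_second, tokens_per_word):
    """Largest TPOT (seconds) at which generation keeps pace with reading."""
    return 1.0 / (words_per_second * tokens_per_word)


READING_SPEED = 4.0      # words per second, from the text above
TOKENS_PER_WORD = 1.3    # assumption: rough average for English text

budget = tpot_budget_seconds(READING_SPEED, TOKENS_PER_WORD)
print(f"TPOT budget: {budget * 1000:.0f} ms per token")
```

Under these assumptions, a TPOT above roughly a fifth of a second means the model produces text more slowly than a typical reader consumes it.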

The Friendli Engine also achieves 1.7x to 18.3x faster TPOT than vLLM, supporting a natural, human-like conversation flow even under heavy loads.

[Figure: p90 time per output token comparison for the Llama 2 70B Chat AWQ model on an NVIDIA A100 80GB GPU with the Stanford Alpaca dataset]

Ready to Unleash the Power of Your LLM? Experience Friendli Engine's performance! We offer three options to suit your preferences:

Visit https://friendli.ai/try-friendli/ to begin your journey into the world of high-performance LLM serving with the Friendli Engine!

Written by

FriendliAI Tech & Research