October 23, 2023
2 min read

Activation-aware Weight Quantization (AWQ): Unlocking LLM Efficiency—Part 2: Benchmarks and Practical Guide

As mentioned in our previous article, Activation-Aware Weight Quantization (AWQ) is a technique that optimizes the efficiency of a neural network without compromising its accuracy. Unlike traditional weight quantization methods, AWQ leverages a deep understanding of the data distribution within neural networks during inference. In the calibration phase, it collects statistics on the specific activations a model generates when exposed to input data. These statistics enable the precise determination of quantization parameters, such as scale and offset, tailored to the data distribution.

AWQ strikes a harmonious balance between model efficiency and accuracy, making it an invaluable tool for deploying LLMs efficiently. What’s more, running AWQ-ed models is made seamless with Friendli Inference, a powerful LLM serving engine from FriendliAI. For example, one can run AWQ-ed LLMs (e.g., Llama 2 70B 4-bit on a single A100 80 GB GPU) natively on Friendli Inference.

Benchmark Accuracy Numbers: Unlocking the Potential of AWQ on Friendli Inference

The accuracy of AWQ-ed models on Friendli Inference is remarkable. We run the following benchmark tests with Llama-2–13b-chat (meta-llama/Llama-2–13b-chat-hf in Hugging Face).

4bit AWQ-ed Llama-2–13b-chat model running on Friendli Inference shows similar performance to the original Llama-2–13b-chat model-FriendliAI

As you can see, the 4bit AWQ-ed Llama-2–13b-chat model running on Friendli Inference shows similar performance to the original Llama-2–13b-chat model. These results underscore the effectiveness of AWQ-ed models running on Friendli Inference in maintaining or even improving model accuracy while significantly reducing the memory and computational requirements.

Running AWQ-ed Models on Friendli Inference: A Step-by-Step Guide

Converting an Unquantized Model: To harness the power of AWQ, begin by converting your unquantized model to its quantized counterpart using the following commands:

bash
# Install friendli-client package.
$ pip install "friendli-client[mllib]"

# Start checkpoint conversion.
$ friendli checkpoint convert \
    --model-name-or-path $MODEL_NAME_OR_PATH \
    --output-dir $OUTPUT_DIR \
    --data-type $DTYPE \
    --quantize \
    --quant-config-file $QUANT_CONFIG_FILE

The content of file specified at $QUANT_CONFIG_FILE is as follows:

yaml
mode: awq
device: cuda:0
seed: 42
calibration_dataset:
  path_or_name: lambada
  format: json
  split: validation
  lookup_column_name: text
  num_samples: 128
  max_length: 512
awq_args:
  quant_bit: 4
  quant_group_size: 64

This step ensures that your model is quantized using AWQ to reduce the model size for efficiency while preserving its accuracy.

Running Friendli Inference: Once you have the quantized model checkpoint, load it into Friendli Inference, the versatile serving engine from FriendliAI.

bash
$ docker run --gpus=1 -v $LOCAL_CKPT_PATH:/model --network=host
$PERIFLOW_CONTAINER_IMAGE /bin/bash -c "/root/launcher --web-server-port 6000
--tokenizer-file-path /model/tokenizer.json --ckpt-path /model/model.h5 --dtype
fp16 --quant-scheme awq --awq-group-size 64 --model-type llama --num-layers 40
--num-heads 40 --head-size 128 --rotary-dim 128 --ff-intermediate-size 13824
--max-length 4096 --vocab-size 32000 --eos-token 2"

With the provided commands, Friendli Inference provides a seamless and efficient environment for serving your AWQ-ed models.

Sending Inference Requests to the AWQ-ed model on Friendli Inference: With Friendli Inference up and running, you can now send inference requests to the server.

bash
$ curl -X POST http://0.0.0.0:6000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Say this is an example.\n",
    "max_tokens": 100,
    "temperature": 0.5,
    "top_p": 0.5,
    "stream": true
}'

Friendli Inference takes care of the heavy lifting, delivering high-performance inference serving while sparing you the complexities of deployment.

Stay Tuned for our Performance Numbers!

Running LLMs with AWQ on Friendli Inference enables users to achieve efficient LLM deployment. This powerful feature achieves remarkable efficiency gains without sacrificing accuracy. Stay tuned for our next article, where we’ll discuss performance that demonstrates the true potential of AWQ-ed models on Friendli Inference.

Written by

FriendliAI Tech & Research

General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: Unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you in here.

How does FriendliAI help my business?

Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for that key issue that is slowing your growth, contact@friendli.ai or click Talk to an expert — our experts (not a bot) will reply within one business day.

October 26, 2023
2 min read

Retrieval-Augmented Generation: A Dive into Contextual AI

RAG

Contextual AI

Friendli Inference

October 16, 2023
4 min read

Activation-aware Weight Quantization (AWQ): Unlocking LLM Efficiency—Part 1: Understanding the Basics