Unlocking Efficiency of Serving LLMs with Activation-aware Weight Quantization (AWQ) on PeriFlow


As mentioned in our previous article, Activation-Aware Weight Quantization (AWQ) is a technique that optimizes neural networks for efficiency without compromising accuracy. Unlike traditional weight quantization methods, AWQ takes into account how activations are actually distributed when the model runs inference. In the calibration phase, it collects statistics on the activations the model produces on representative input data. These statistics make it possible to determine quantization parameters, such as scale and offset, that are tailored to that data distribution.
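
To make "scale and offset" concrete, below is a minimal, illustrative sketch of group-wise 4-bit quantization in Python (NumPy). It is not PeriFlow's or AWQ's actual implementation: the activation statistic and the channel-rescaling step are simplified stand-ins for the activation-aware calibration described above.

    import numpy as np

    def quantize_groupwise(weights, act_abs_mean, n_bits=4, group_size=64):
        # Hypothetical activation-aware step: rescale input channels so that
        # channels with large average activations lose less precision.
        channel_scale = np.sqrt(act_abs_mean + 1e-6)           # [in_features]
        w = weights * channel_scale                            # fold scaling into the weights

        out_f, in_f = w.shape
        w = w.reshape(out_f, in_f // group_size, group_size)   # one scale/offset per group

        qmax = 2 ** n_bits - 1                                 # 15 for 4-bit
        w_min = w.min(axis=-1, keepdims=True)
        w_max = w.max(axis=-1, keepdims=True)
        scale = np.maximum((w_max - w_min) / qmax, 1e-8)       # per-group scale
        zero = np.round(-w_min / scale)                        # per-group zero-point (the "offset")

        q = np.clip(np.round(w / scale) + zero, 0, qmax).astype(np.uint8)
        return q, scale, zero, channel_scale

    # Toy usage: a 128 x 256 weight matrix and a per-channel activation statistic.
    weights = np.random.randn(128, 256).astype(np.float32)
    act_abs_mean = np.abs(np.random.randn(256)).astype(np.float32)
    q, scale, zero, ch = quantize_groupwise(weights, act_abs_mean)
    print(q.shape, scale.shape)   # (128, 4, 64) (128, 4, 1)

The takeaway is that each small group of weights gets its own quantization parameters, and those parameters are chosen with the observed activations in mind rather than from the weights alone.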

AWQ strikes a harmonious balance between model efficiency and accuracy, making it an invaluable tool for deploying LLMs efficiently. What’s more, running AWQ-ed models is made seamless with PeriFlow, a powerful LLM serving engine from FriendliAI. For example, one can run AWQ-ed LLMs (e.g., Llama 2 70B 4-bit on a single A100 80 GB GPU) natively on PeriFlow.
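
As a back-of-the-envelope check on that example: the Llama 2 70B weights take roughly 70 billion parameters × 2 bytes ≈ 140 GB in FP16, but only about a quarter of that (≈ 35 GB, plus a small overhead for per-group scales) at 4 bits, which leaves room on a single 80 GB A100 for activations and the KV cache.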

Benchmark Accuracy Numbers: Unlocking the Potential of AWQ on PeriFlow

The accuracy of AWQ-ed models on PeriFlow is remarkable. We ran the following benchmark tests with Llama-2-13b-chat (meta-llama/Llama-2-13b-chat-hf on Hugging Face).

As you can see, the 4-bit AWQ-ed Llama-2-13b-chat model running on PeriFlow shows accuracy comparable to the original Llama-2-13b-chat model. These results underscore the effectiveness of AWQ-ed models running on PeriFlow in maintaining (and in some cases even improving) model accuracy while significantly reducing memory and computational requirements.

Running AWQ-ed Models on PeriFlow: A Step-by-Step Guide

  1. Conversion of Unquantized to Quantized Models: To harness the power of AWQ, begin by converting your unquantized model to its quantized counterpart using the following commands:

    # Install periflow-client package.
    $ pip install "periflow-client[mllib]"
    
    # Start checkpoint conversion.
    $ pf checkpoint convert \
        --model-name-or-path $MODEL_NAME_OR_PATH \
        --output-dir $OUTPUT_DIR \
        --data-type $DTYPE \
        --quantize \
        --quant-config-file $QUANT_CONFIG_FILE
    

    The content of the file specified by $QUANT_CONFIG_FILE is as follows:

    mode: awq
    device: cuda:0
    seed: 42
    calibration_dataset:
      path_or_name: lambada
      format: json
      split: validation
      lookup_column_name: text
      num_samples: 128
      max_length: 512
    awq_args:
      quant_bit: 4
      quant_group_size: 64
    

    This step quantizes your model with AWQ, reducing its size for efficient serving while preserving its accuracy.

  2. Running PeriFlow: Once you have the quantized model checkpoint, load it into PeriFlow, the versatile serving engine from FriendliAI.

    $ docker run --gpus=1 -v $LOCAL_CKPT_PATH:/model --network=host \
        $PERIFLOW_CONTAINER_IMAGE \
        /bin/bash -c "/root/launcher \
            --web-server-port 6000 \
            --tokenizer-file-path /model/tokenizer.json \
            --ckpt-path /model/model.h5 \
            --dtype fp16 \
            --quant-scheme awq \
            --awq-group-size 64 \
            --model-type llama \
            --num-layers 40 \
            --num-heads 40 \
            --head-size 128 \
            --rotary-dim 128 \
            --ff-intermediate-size 13824 \
            --max-length 4096 \
            --vocab-size 32000 \
            --eos-token 2"
    

    With this command, PeriFlow provides a seamless and efficient environment for serving your AWQ-ed model. Note that the model configuration flags (40 layers, 40 attention heads, head size 128, feed-forward intermediate size 13824, vocabulary size 32000) correspond to the Llama-2-13b architecture, and --awq-group-size 64 matches the quant_group_size used during conversion.

  3. Sending Inference Requests to the AWQ-ed model on PeriFlow: With PeriFlow up and running, you can now send inference requests to the server.

    $ curl -X POST http://0.0.0.0:6000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "prompt": "Say this is an example.\n",
        "max_tokens": 100,
        "temperature": 0.5,
        "top_p": 0.5,
        "stream": true
    }'
    

    PeriFlow takes care of the heavy lifting, delivering high-performance inference while sparing you the complexities of deployment.
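
    If you prefer Python to curl, here is a minimal client sketch using the requests library. It assumes the server from step 2 is listening on port 6000; the exact format of each streamed chunk depends on the server's response schema.

    import requests

    # Request body mirrors the curl example above.
    payload = {
        "prompt": "Say this is an example.\n",
        "max_tokens": 100,
        "temperature": 0.5,
        "top_p": 0.5,
        "stream": True,
    }

    # With "stream": true the completion is returned incrementally,
    # so we iterate over the response body line by line.
    with requests.post("http://0.0.0.0:6000/v1/completions",
                       json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if line:              # skip keep-alive blank lines
                print(line)       # each line carries a chunk of the completion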

Stay Tuned for our Performance Numbers!

Running AWQ-ed LLMs on PeriFlow lets you deploy LLMs efficiently, delivering remarkable efficiency gains without sacrificing accuracy. Stay tuned for our next article, where we'll share performance numbers that demonstrate the true potential of AWQ-ed models on PeriFlow.


