Serving Quantized Models
Tutorial for serving quantized models with Friendli Engine. Friendli Engine supports FP8, INT8, and AWQ model checkpoints.
Introduction
Quantization is a technique that reduces the precision of a generative AI model’s parameters, optimizing memory usage and inference speed while maintaining acceptable accuracy. This tutorial will walk you through the process of serving quantized models with Friendli Container.
Off-the-Shelf Model Checkpoints from Hugging Face Hub
To use model checkpoints that are already quantized and available on Hugging Face Hub, check the following options:
- Checkpoints quantized with friendli-model-optimizer
- Quantized model checkpoints by FriendliAI
- A subset of models quantized with:
For details on how to use these models, go directly to Serving Quantized Models.
Quantizing Your Own Models (FP8/INT8)
To quantize your own models with FP8 or INT8, follow these steps:
1. Install the friendli-model-optimizer package

This tool provides model quantization for efficient generative AI serving with Friendli Engine. Install it using the following command:
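```bash
# Install the Friendli Model Optimizer (FMO) package from PyPI.
pip install "friendli-model-optimizer"
```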
2. Prepare the original model

Ensure you have the original model checkpoint that can be loaded using Hugging Face’s transformers library.
3. Quantize the model with Friendli Model Optimizer (FMO)

You can run quantization with the command below:
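A representative invocation is sketched below. The flag names (--model-name-or-path, --output-dir, --mode, --device) follow the FMO documentation, but verify them against the FMO docs for your installed version; the model name and paths are placeholders.

```bash
# Placeholder values; substitute your own model and paths.
export MODEL_NAME_OR_PATH="meta-llama/Meta-Llama-3-8B-Instruct"
export OUTPUT_DIR="./quantized-model"
export QUANTIZATION_SCHEME="fp8"   # or "int8"

# Quantize the checkpoint with FMO (confirm flags against the FMO docs).
fmo quantize \
  --model-name-or-path $MODEL_NAME_OR_PATH \
  --output-dir $OUTPUT_DIR \
  --mode $QUANTIZATION_SCHEME \
  --device cuda:0
```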
When the model checkpoint is successfully quantized, the following files will be created at $OUTPUT_DIR:

- config.json
- model.safetensors
- special_tokens_map.json
- tokenizer_config.json
- tokenizer.json
If the size of the model exceeds 10GB, multiple sharded checkpoints are generated instead of a single model.safetensors, for example:

- model-00001-of-00005.safetensors
- model-00002-of-00005.safetensors
- model-00003-of-00005.safetensors
- model-00004-of-00005.safetensors
- model-00005-of-00005.safetensors
For more information about FMO, see the FMO documentation.
Serving Quantized Models
Search Optimal Policy
To serve quantized models efficiently, you must first run a policy search to find the optimal execution policy. Learn how to run the policy search at Running Policy Search; a sketch of a search-enabled launch is shown below.
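As an illustration only: the launch below assumes the --search-policy and --algo-policy-dir options described in the Friendli Container policy-search guide, plus the registry.friendli.ai/trial image from the quickstart. These names are assumptions here, so confirm them against Running Policy Search before use.

```bash
# POLICY_DIR persists the searched policy file across runs
# (assumed flags; see Running Policy Search for the authoritative options).
export POLICY_DIR=$PWD/policy
mkdir -p $POLICY_DIR

docker run --gpus '"device=0"' -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  registry.friendli.ai/trial \
  --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \
  --algo-policy-dir /policy \
  --search-policy true
```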
Serving FP8 Models
Once you have prepared the quantized model checkpoint, you are ready to create a serving endpoint.
Example: FriendliAI/Llama-3.1-8B-Instruct-fp8
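A minimal launch sketch for this checkpoint follows. It assumes the registry.friendli.ai/trial image, the FRIENDLI_CONTAINER_SECRET environment variable, and the --hf-model-name / --web-server-port options from the Friendli Container quickstart; adjust to match your environment and container version.

```bash
# Serve the FP8 checkpoint with Friendli Container (quickstart-style launch).
docker run --gpus '"device=0"' -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  registry.friendli.ai/trial \
  --web-server-port 8000 \
  --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8
```

Once the container is running, it accepts inference requests on port 8000, per the Friendli Container docs.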
FP8 model serving is only supported on NVIDIA Ada, Hopper, and Blackwell GPU architectures.