Serving Quantized Models
Tutorial for serving quantized models with Friendli Engine. Friendli Engine supports FP8, INT8, and AWQ model checkpoints.
Introduction
Quantization is a technique that reduces the precision of a generative AI model’s parameters, optimizing memory usage and inference speed while maintaining acceptable accuracy. This tutorial will walk you through the process of serving quantized models with Friendli Container.
Off-the-Shelf Model Checkpoints from Hugging Face Hub
To use model checkpoints that are already quantized and available on Hugging Face Hub, check the following options:
- Checkpoints quantized with friendli-model-optimizer
- Quantized model checkpoints by FriendliAI
- A subset of models quantized with:
For details on how to use these models, go directly to Serving Quantized Models.
Quantizing Your Own Models (FP8/INT8)
To quantize your own models with FP8 or INT8, follow these steps:
1. Install the friendli-model-optimizer package

This tool provides model quantization for efficient generative AI serving with Friendli Engine. Install it using the following command:
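```bash
# Install the Friendli Model Optimizer (FMO) package from PyPI.
pip install "friendli-model-optimizer"
```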
2. Prepare the original model

Ensure you have the original model checkpoint that can be loaded using Hugging Face’s transformers library.
3. Quantize the model with Friendli Model Optimizer (FMO)

You can run quantization with the command below:
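A representative invocation is sketched below. The flag names (--model-name-or-path, --output-dir, --mode, --device) follow the FMO documentation, but verify them against the FMO docs for your installed version; the model name and paths are placeholders.

```bash
# Placeholder values; substitute your own model and paths.
export MODEL_NAME_OR_PATH="meta-llama/Meta-Llama-3-8B-Instruct"
export OUTPUT_DIR="./quantized-model"
export QUANTIZATION_SCHEME="fp8"   # or "int8"

# Quantize the checkpoint with FMO (confirm flags against the FMO docs).
fmo quantize \
  --model-name-or-path $MODEL_NAME_OR_PATH \
  --output-dir $OUTPUT_DIR \
  --mode $QUANTIZATION_SCHEME \
  --device cuda:0
```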
When the model checkpoint is successfully quantized, the following files will be created at $OUTPUT_DIR:

- config.json
- model.safetensors
- special_tokens_map.json
- tokenizer_config.json
- tokenizer.json
If the size of the model exceeds 10GB, multiple sharded checkpoints are generated instead of a single model.safetensors, for example:

- model-00001-of-00005.safetensors
- model-00002-of-00005.safetensors
- model-00003-of-00005.safetensors
- model-00004-of-00005.safetensors
- model-00005-of-00005.safetensors
For more information about FMO, see the FMO documentation.
Serving Quantized Models
Search Optimal Policy
To serve quantized models efficiently, you must first run a policy search to find the optimal execution policy. Learn how to run the policy search at Running Policy Search; a sketch of a search-enabled launch is shown below.
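As an illustration only: the launch below assumes the --search-policy and --algo-policy-dir options described in the Friendli Container policy-search guide, plus the registry.friendli.ai/trial image from the quickstart. These names are assumptions here, so confirm them against Running Policy Search before use.

```bash
# POLICY_DIR persists the searched policy file across runs
# (assumed flags; see Running Policy Search for the authoritative options).
export POLICY_DIR=$PWD/policy
mkdir -p $POLICY_DIR

docker run --gpus '"device=0"' -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  registry.friendli.ai/trial \
  --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \
  --algo-policy-dir /policy \
  --search-policy true
```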
Serving FP8 Models
Once you have prepared the quantized model checkpoint, you are ready to create a serving endpoint.
Example: FriendliAI/Llama-3.1-8B-Instruct-fp8
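A minimal launch sketch for this checkpoint follows. It assumes the registry.friendli.ai/trial image, the FRIENDLI_CONTAINER_SECRET environment variable, and the --hf-model-name / --web-server-port options from the Friendli Container quickstart; adjust to match your environment and container version.

```bash
# Serve the FP8 checkpoint with Friendli Container (quickstart-style launch).
docker run --gpus '"device=0"' -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  registry.friendli.ai/trial \
  --web-server-port 8000 \
  --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8
```

Once the container is running, it accepts inference requests on port 8000, per the Friendli Container docs.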
FP8 model serving is only supported on NVIDIA Ada, Hopper, and Blackwell GPU architectures.