- September 2, 2024
- 5 min read
Compress Generative AI Models with Friendli Model Optimizer
We are excited to introduce Friendli Model Optimizer (FMO), a quantization-centered tool that optimizes generative AI models for efficient deployment with the Friendli Engine.
By leveraging FMO, you can:
- Improve Inference Speed: Deliver faster response times for applications requiring real-time interactions.
- Lower Resource Consumption: Create smaller models to reduce GPU resource consumption and save costs.
- Maintain Model Accuracy: Accurately compress your models, ensuring that your AI solutions deliver high-quality results.
Figure 1: Friendli Model Optimizer (FMO) Diagram - Friendli Container and Friendli Dedicated Endpoints are powered by the Friendli Engine
Deploying generative AI models presents significant challenges, from ensuring low latency and high throughput to managing the computational costs of running complex models. At FriendliAI, we address these challenges with a unified library of our empirical recipes that prepares models for efficient, real-world deployment using different optimization techniques.
In this article, we’ll explore how to use the FMO library and understand its quantization feature. Stay tuned for part 2 of this blog, where we'll delve deeper into the performance analysis.
Post-Training Quantization in Friendli Model Optimizer
Different Pedantic Levels
FMO includes a pedantic level setting that allows users to balance the trade-off between the accuracy of the quantized model and the processing time needed for quantization. Higher pedantic levels can increase the accuracy of quantized models but may also extend the time required for their generation and slow down inference due to more “pedantic” computations. On the other hand, lower pedantic levels speed up the quantization process, though they may slightly affect model accuracy.
It’s important to note that this balance is a trade-off rather than a straightforward benefit or drawback. The pedantic level does not affect the size of the compressed output model, and each quantization mode supports a different range of pedantic levels.
Quantization Modes
FMO currently supports the following PTQ (Post-Training Quantization) techniques:
INT8 Quantization represents weights and activations in the INT8 format while ensuring that model accuracy is maintained even after quantization. Friendli Engine enables dynamic activation scaling, where scales are computed on the fly at runtime; a minimal sketch of this idea appears after the list below.
- INT8 Quantization supports two pedantic levels: level 0 and level 1.
- Supported Model Architectures:
  - CohereForCausalLM
  - Gemma2ForCausalLM
  - LlamaForCausalLM
  - MistralForCausalLM
  - Qwen2ForCausalLM
  - and more to come
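To build intuition for what dynamic activation scaling means, here is a minimal per-tensor INT8 quantizer in PyTorch whose scale is computed from the runtime maximum. This is an illustrative sketch of the general technique, not FMO's or Friendli Engine's actual implementation:

import torch

def quantize_int8_dynamic(x: torch.Tensor):
    # Per-tensor dynamic quantization: the scale is derived from the
    # tensor's runtime maximum, so no calibration pass is needed.
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

x = torch.randn(4, 8)
q, scale = quantize_int8_dynamic(x)
print((dequantize(q, scale) - x).abs().max())  # small round-off error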
FP8 Quantization utilizes the FP8 format, an 8-bit floating-point representation that provides a higher dynamic range than INT8. This makes FP8 more effective for quantizing both weights and activations, resulting in increased throughput and reduced latency while preserving high output quality with minimal degradation. Currently, we employ the E4M3 encoding format, which is defined by a 4-bit exponent and a 3-bit mantissa; a short sketch after the list below makes the range difference concrete.
- FP8 Quantization supports three pedantic levels: level 0, level 1, and level 2.
- Supported Model Architectures:
  - ArcticForCausalLM
  - CohereForCausalLM
  - Gemma2ForCausalLM
  - LlamaForCausalLM
  - MistralForCausalLM
  - MixtralForCausalLM
  - MptForCausalLM
  - Phi3ForCausalLM
  - Qwen2ForCausalLM
  - and more to come
- Currently, FP8 is only supported on NVIDIA Ada, Hopper, and Blackwell GPU architectures. Also, FP8 quantization for Phi3ForCausalLM, MptForCausalLM, ArcticForCausalLM, and MixtralForCausalLM is available only with the pedantic level 0 setting.
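To make the dynamic-range difference between FP8 and INT8 concrete, the short check below prints the representable range of the E4M3 format and shows its coarser rounding at large magnitudes. It assumes PyTorch 2.1+ with the float8_e4m3fn dtype, which matches the E4M3 layout described above:

import torch

fp8 = torch.finfo(torch.float8_e4m3fn)
print(fp8.max, fp8.min)        # 448.0 -448.0, vs. INT8's fixed [-128, 127] grid
print(fp8.smallest_normal)     # 0.015625: small values keep fine resolution

# With only a 3-bit mantissa, the spacing between representable values
# grows with magnitude, so large inputs round more coarsely:
x = torch.tensor([0.1, 1.0, 100.0, 300.0])
print(x.to(torch.float8_e4m3fn).to(torch.float32))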
For more insights on quantization, refer to our blog posts: comparing the different quantization schemes, weight-activation quantization in fp8, serving performance comparison of quantized models, and activation-aware weight quantization (AWQ) (+ tutorial).
A Quick Quantization Demo
In this section, we provide a quick demonstration of how to use the Friendli Model Optimizer for quantizing large language models. Specifically, we showcase the process of applying FP8 quantization on Meta’s Llama 3.1 8B Instruct Model.
The following section will guide you through the installation of the FMO library and provide instructions for running the quantization example below. We also describe the command-line parameters used to configure quantization, allowing you to specify the model, output directory, quantization mode, and other settings.
FMO Installation and Example Usage
The FMO library can be easily installed using pip, the Python package manager. Once installed, FMO offers a command-line interface (CLI) that allows users to perform quantization with a single command.
You can install the FMO library with the following command:
pip install friendli-model-optimizer
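If you want to confirm the installation from Python, you can query the installed distribution by the same package name used above:

import importlib.metadata

# Prints the installed FMO version; raises PackageNotFoundError if absent.
print(importlib.metadata.version("friendli-model-optimizer"))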
After setting the environment variables, you can run a quantization process with the following command:
fmo quantize \
  --model-name-or-path $MODEL_NAME_OR_PATH \
  --output-dir $OUTPUT_DIR \
  --mode $QUANTIZATION_SCHEME \
  --pedantic-level $PEDANTIC_LEVEL \
  --device $DEVICE \
  --offload
An example of running FP8 quantization with Meta-Llama-3.1-8B-Instruct is:
export MODEL_NAME_OR_PATH="meta-llama/Meta-Llama-3.1-8B-Instruct"
export OUTPUT_DIR="./"

fmo quantize \
  --model-name-or-path $MODEL_NAME_OR_PATH \
  --output-dir $OUTPUT_DIR \
  --mode "fp8" \
  --device "cuda:0" \
  --pedantic-level 1 \
  --offload
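If you prefer to drive the same quantization from a Python script, for instance inside a larger model-preparation pipeline, a thin wrapper over the CLI works; the flags below simply mirror the shell example above:

import subprocess

# Invokes the fmo CLI with the same flags as the shell example.
subprocess.run(
    [
        "fmo", "quantize",
        "--model-name-or-path", "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "--output-dir", "./",
        "--mode", "fp8",
        "--device", "cuda:0",
        "--pedantic-level", "1",
        "--offload",
    ],
    check=True,  # raise CalledProcessError if quantization fails
)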
Command Line Parameters
- model-name-or-path: Model name or directory path of the saved model checkpoint.
- output-dir: Directory path to save the quantized checkpoint and related configurations.
- mode: Quantization mode. You can use fp8 or int8.
- pedantic-level: Pedantic level for the quantization process. Defaults to 1.
- device: Device to run the quantization process. Defaults to "cuda:0".
- offload: When enabled, significantly reduces GPU memory usage by offloading model layers onto CPU RAM. Defaults to False.
How to Run an Optimized Model with Friendli Container
Once your optimized model is ready, you can easily launch the model using Friendli Container. Please check out our official documentation to learn more!
Faster Inference with Policy Search
To further improve the inference speed of your optimized models, Friendli Container offers an automated Policy Search feature that enables the identification of the most efficient execution policies. Please refer to our guide for Optimizing Inference with Policy Search to learn more.
export QUANTIZED_MODEL_DIR=$OUTPUT_DIR
export FRIENDLI_CONTAINER_SECRET="{YOUR CONTAINER SECRET}"
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"
export GPU_ENUMERATION='"device=0"'
export POLICY_DIR=$PWD/policy

mkdir -p $POLICY_DIR

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v $QUANTIZED_MODEL_DIR:/model \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name /model \
  --algo-policy-dir /policy \
  --search-policy true
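Once the container is up, you can send a quick test request from Python. This sketch assumes the engine exposes an OpenAI-compatible completions endpoint on the mapped port; please check the Friendli Container documentation for the exact API surface:

import requests

# Hypothetical smoke test against the container started above; the endpoint
# path is an assumption based on OpenAI-compatible serving conventions.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "Explain FP8 quantization in one sentence.", "max_tokens": 64},
    timeout=60,
)
print(resp.status_code, resp.json())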
Conclusion
The Friendli Model Optimizer offers a powerful solution for compressing generative AI models using optimization techniques like FP8 and INT8 quantization. By leveraging advanced quantization strategies, powered by our empirical recipes, FMO allows developers to optimize their models for faster inference, reduced latency, and lower resource consumption.
We will soon come back with part 2 of this blog, which will further explore FMO with experimentation results. For more details, you can visit the Friendli Model Optimizer (FMO) GitHub Repository.
Written by
FriendliAI Tech & Research