- September 2, 2024
- 4 min read
Compress Generative AI Models with Friendli Model Optimizer

We are excited to introduce Friendli Model Optimizer (FMO), which is primarily a quantization tool that optimizes generative AI models for efficient deployment with Friendli Inference.
By leveraging FMO, you can:
- Improve Inference Speed: Deliver faster response times for applications requiring real-time interactions.
- Lower Resource Consumption: Create smaller models to reduce GPU resource consumption and save costs.
- Maintain Model Accuracy: Accurately compress your models, ensuring that your AI solutions deliver high-quality results.
Figure 1: Friendli Model Optimizer (FMO) Diagram - Friendli Container and Friendli Dedicated Endpoints are powered by Friendli Inference
Deploying generative AI models presents significant challenges, from ensuring low latency and high throughput to managing the computational costs of running complex models. At FriendliAI, we address these challenges by offering a unified library of our empirical recipes that prepares models for efficient, real-world deployment using different optimization techniques.
In this article, we’ll explore how to use the FMO library and understand its quantization feature. Stay tuned for part 2 of this blog, where we'll delve deeper into the performance analysis.
Post-Training Quantization in Friendli Model Optimizer
Different Pedantic Levels
FMO includes a pedantic level setting that allows users to balance the trade-off between the accuracy of the quantized model and the processing time needed for quantization. Higher pedantic levels can increase the accuracy of quantized models but may also extend the time required for their generation and slow down inference due to more “pedantic” computations. On the other hand, lower pedantic levels speed up the quantization process, though they may slightly affect model accuracy.
It’s important to note that this balance is a trade-off rather than a straightforward benefit or drawback. The pedantic level does not affect the size of the compressed output model, and each quantization mode supports a different range of pedantic levels.
Quantization Modes
FMO currently supports the following PTQ (Post-Training Quantization) techniques:
INT8 Quantization represents weights and activations using the INT8 format while ensuring that model accuracy is maintained even after quantization. Friendli Inference enables dynamic activation scaling, where scales are computed on the fly at runtime (a minimal usage sketch follows the list below).
- INT8 Quantization supports two pedantic levels: level 0 and level 1.
- Supported Model Architectures:
  - `CohereForCausalLM`
  - `Gemma2ForCausalLM`
  - `LlamaForCausalLM`
  - `MistralForCausalLM`
  - `Qwen2ForCausalLM`
  - and more to come
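As a quick preview of how this mode is selected (the `fmo quantize` CLI and its flags are introduced in the demo section below; the model ID and output path here are placeholders):

```bash
# Illustrative INT8 run; INT8 quantization supports pedantic levels 0 and 1
fmo quantize \
  --model-name-or-path "<Hugging Face model ID or local checkpoint path>" \
  --output-dir "<output directory>" \
  --mode int8 \
  --pedantic-level 0
```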
FP8 Quantization utilizes the FP8 format, an 8-bit floating-point representation that provides a higher dynamic range compared to INT8. This makes FP8 more effective for quantizing both weights and activations, resulting in increased throughput and reduced latency while preserving high output quality with minimal degradation. Currently, we employ the E4M3 encoding format, which is defined by a 4-bit exponent and a 3-bit mantissa configuration.
- FP8 Quantization supports three pedantic levels: level 0, level 1, and level 2.
- Supported Model Architectures:
  - `ArcticForCausalLM`
  - `CohereForCausalLM`
  - `Gemma2ForCausalLM`
  - `LlamaForCausalLM`
  - `MistralForCausalLM`
  - `MixtralForCausalLM`
  - `MptForCausalLM`
  - `Phi3ForCausalLM`
  - `Qwen2ForCausalLM`
  - and more to come
- Currently, FP8 is only supported on NVIDIA Ada, Hopper, and Blackwell GPU architectures. Also, FP8 quantization for `Phi3ForCausalLM`, `MptForCausalLM`, `ArcticForCausalLM`, and `MixtralForCausalLM` is available only with the pedantic level 0 setting.
For more insights on quantization, refer to our blog posts: comparing the different quantization schemes, weight-activation quantization in fp8, serving performance comparison of quantized models, and activation-aware weight quantization (AWQ) (+ tutorial).
A Quick Quantization Demo
In this section, we provide a quick demonstration of how to use the Friendli Model Optimizer for quantizing large language models. Specifically, we showcase the process of applying FP8 quantization on Meta’s Llama 3.1 8B Instruct Model.
The following section will guide you through the installation of the FMO library and provide instructions for running the quantization example below. Additionally, we describe the command-line parameters used to set up the quantization, allowing users to specify the model, output directory, quantization mode, and other configurations.
FMO Installation and Example Usage
The FMO library can be easily installed using pip, the Python package manager. Once installed, FMO offers a command-line interface (CLI) that allows users to perform quantization with a single command.
You can install the FMO library with the following command:
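A minimal install sketch, assuming the package is published on PyPI under the same name as the FMO GitHub repository:

```bash
# Install the Friendli Model Optimizer CLI
# (package name assumed to match the FMO GitHub repository)
pip install friendli-model-optimizer
```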
After setting the environment variables shown below, you can run a quantization process with the following command:
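A hedged sketch of the general invocation, assuming the CLI entry point is `fmo quantize`; the shell variables are placeholders for your own values:

```bash
# Placeholders for your own model, output location, and quantization settings
export MODEL_NAME_OR_PATH="<Hugging Face model ID or local checkpoint path>"
export OUTPUT_DIR="<directory to store the quantized checkpoint>"
export QUANTIZATION_SCHEME="fp8"   # or "int8"

# Run quantization with the FMO CLI (entry point assumed; see the FMO repository)
fmo quantize \
  --model-name-or-path "$MODEL_NAME_OR_PATH" \
  --output-dir "$OUTPUT_DIR" \
  --mode "$QUANTIZATION_SCHEME"
```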
An example of running FP8 quantization with Meta-Llama-3.1-8B-Instruct is:
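A sketch of such a run follows; the model ID and output path are illustrative:

```bash
# FP8 quantization of Llama 3.1 8B Instruct (model ID and output path are illustrative)
fmo quantize \
  --model-name-or-path "meta-llama/Meta-Llama-3.1-8B-Instruct" \
  --output-dir "./Meta-Llama-3.1-8B-Instruct-fp8" \
  --mode fp8
```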
Command Line Parameters
- `model-name-or-path`: Model name or directory path of the saved model checkpoint.
- `output-dir`: Directory path to save the quantized checkpoint and related configurations.
- `mode`: Quantization mode. You can use `fp8` or `int8`.
- `pedantic-level`: Pedantic level for quantization. Defaults to 1.
- `device`: Device to run the quantization process on. Defaults to `"cuda:0"`.
- `offload`: When enabled, significantly reduces GPU memory usage by offloading model layers onto CPU RAM. Defaults to False.
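As a hedged illustration of how the optional flags combine with the basic command (the values shown are placeholders, not recommendations):

```bash
# FP8 run with an explicit pedantic level, target device, and CPU offloading enabled
fmo quantize \
  --model-name-or-path "meta-llama/Meta-Llama-3.1-8B-Instruct" \
  --output-dir "./Meta-Llama-3.1-8B-Instruct-fp8" \
  --mode fp8 \
  --pedantic-level 1 \
  --device cuda:0 \
  --offload
```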
How to Run an Optimized Model with Friendli Container
Once your optimized model is ready, you can easily launch the model using Friendli Container. Please check out our official documentation to learn more!
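As a rough sketch only: the image name, flags, and secret handling below are assumptions modeled on the Friendli Container quick start, so treat the official documentation as the source of truth.

```bash
# Illustrative only: serve the quantized checkpoint with Friendli Container.
# Image name, flags, and secret handling are assumptions; follow the official docs.
export FRIENDLI_CONTAINER_SECRET="<your Friendli Container secret>"

docker run --gpus '"device=0"' -p 8000:8000 \
  -v ./Meta-Llama-3.1-8B-Instruct-fp8:/model \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  registry.friendli.ai/trial \
    --hf-model-name /model
```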
Faster Inference with Policy Search
To further improve the inference speed of your optimized models, Friendli Container offers an automated Policy Search feature that enables the identification of the most efficient execution policies. Please refer to our guide for Optimizing Inference with Policy Search to learn more.
Conclusion
The Friendli Model Optimizer offers a powerful solution for compressing generative AI models using optimization techniques like FP8 and INT8 quantization. By leveraging advanced quantization strategies, powered by our empirical recipes, FMO allows developers to optimize their models for faster inference, reduced latency, and lower resource consumption.
We will soon come back with part 2 of this blog, which will further explore FMO with experimentation results. For more details, you can visit the Friendli Model Optimizer (FMO) GitHub Repository.
Written by
FriendliAI Tech & Research
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Contact Sales; our experts (not a bot) will reply within one business day.