## What is Quantization?
Quantization is a technique that reduces the precision of a generative AI model's parameters, cutting memory usage and improving inference speed while maintaining response quality.

Friendli Container supports two approaches:

- Online quantization: Quantize your model on the fly at serving time. You don't need to prepare pre-quantized weights in advance; just launch the model with the `--quantization` option, and the system quantizes it dynamically as the container starts.
- Serving a pre-quantized model: Serve a model that has already been quantized beforehand. In this mode, you use model weights that were quantized ahead of time and simply load them during serving.
## Serving a Model with Online Quantization
If you want to serve your own model but need to quantize it (or adjust its precision), Friendli Container offers online quantization, eliminating the need to prepare a quantized model in advance. Once your model is ready, serve it with online quantization by adding the `--quantization` argument when running Friendli Container.

- `--quantization (8bit|4bit|16bit)`: Applies online quantization at the specified precision. Friendli Container automatically detects your hardware and selects a suitable quantization scheme.
  - Use `--quantization 8bit` on NVIDIA Ada, Hopper, and Blackwell GPUs.
  - Use `--quantization 4bit` on NVIDIA Hopper and Blackwell GPUs.
To dequantize a model to 16-bit precision, use `--quantization 16bit`.

### Example: deepseek-ai/DeepSeek-R1 with 4-bit Online Quantization on NVIDIA H200 GPUs
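The command below is a minimal sketch of such a launch, assuming the standard Docker-based invocation pattern for Friendli Container; the image name, secret variable, and every flag except `--quantization` are assumptions to verify against your own environment and the Friendli documentation.

```sh
# Sketch only: image name, env var, and all flags except --quantization
# are assumed from the typical Friendli Container launch pattern.
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  registry.friendli.ai/trial \
  --hf-model-name deepseek-ai/DeepSeek-R1 \
  --num-devices 8 \
  --quantization 4bit
```

Once the container is up, you can send a quick smoke test, assuming the engine exposes an OpenAI-compatible API on the published port:

```sh
# Assumes an OpenAI-compatible chat endpoint; adjust fields as needed.
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'
```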
To serve online-quantized models efficiently, you must run a policy search to find the optimal execution policy. Learn how to run the policy search at Running Policy Search.
## Serving a Pre-Quantized Model
If a model has already been quantized and uploaded to the Hugging Face Hub, Friendli Container supports it with the following options:

- Quantized models using well-known quantization formats:
  - MXFP4
  - Fine-grained FP8 (including DeepSeek-V3 style FP8 quantization)
- A subset of models created by supported third-party providers
- Quantized model checkpoints by FriendliAI
### Example: openai/gpt-oss-120b on NVIDIA B200 GPU
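Because openai/gpt-oss-120b is published with MXFP4-quantized weights, no `--quantization` flag is needed; the checkpoint is loaded as-is. The sketch below reuses the same assumed launch pattern as the previous example (image name, secret variable, and flags other than the model name are assumptions):

```sh
# Sketch only: serves a checkpoint that is already MXFP4-quantized,
# so no --quantization flag is passed; other details are assumed.
docker run --gpus '"device=0"' -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  registry.friendli.ai/trial \
  --hf-model-name openai/gpt-oss-120b
```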
To serve pre-quantized models efficiently, you must run a policy search to find the optimal execution policy. Learn how to run the policy search at Running Policy Search.