What is quantization?
Quantization is a technique that reduces the precision of a generative AI model’s parameters, optimizing memory usage and inference speed while maintaining response quality. Friendli Container supports two approaches:
- Online quantization: Quantize your model on the fly at serving time. You don’t need to prepare any pre-quantized weights in advance — just launch the model with the --quantization option, and the system dynamically quantizes it as the container starts.
- Serving a pre-quantized model: Serve a model that has already been quantized beforehand. In this mode, you use model weights that were quantized in advance and simply load them during serving.
Serving a model with online quantization
If you want to serve your own model but need to quantize it (or adjust its precision), Friendli Container offers online quantization, eliminating the need to prepare a quantized model in advance. Once your model is ready, you can serve it with online quantization by adding the --quantization argument when running Friendli Container.
--quantization (8bit|4bit|16bit): Applies online quantization with the specified precision. It automatically detects your hardware and selects a suitable quantization scheme.
- Use --quantization 8bit for NVIDIA Ada, Hopper, and Blackwell GPUs.
- Use --quantization 4bit for NVIDIA Hopper and Blackwell GPUs.
Example: deepseek-ai/DeepSeek-R1 with 4-bit Online Quantization on NVIDIA H200 GPUs
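A launch for this example could look like the sketch below. Only the --quantization flag and the model name come from this page; the container image name, port, and environment variable are assumptions — substitute the values from your Friendli Container onboarding instructions.

```shell
# Sketch only: image name, port, and HF_TOKEN handling are assumptions,
# not the documented invocation.
docker run --gpus all \
  -e HF_TOKEN="$HF_TOKEN" \
  -p 8000:8000 \
  registry.friendli.ai/container \
  --hf-model-name deepseek-ai/DeepSeek-R1 \
  --quantization 4bit
```

With --quantization 4bit, the weights are quantized on the fly as the container starts, so no pre-quantized checkpoint needs to exist on the Hugging Face Hub.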
Serving a pre-quantized model
If you have already quantized a model and uploaded it to the Hugging Face Hub, Friendli Container supports serving it with the following options:
- Quantized models using well-known quantization formats:
  - MXFP4
  - Fine-grained FP8 (including DeepSeek-V3-style FP8 quantization)
  - a subset of models created by:
- Quantized model checkpoints by FriendliAI
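When the checkpoint on the Hub is already quantized in one of the supported formats, the precision is carried by the weights themselves, so no --quantization flag is needed at launch. A sketch, with the same caveats as above (image name, port, and env var are assumptions, and the repository name is a placeholder, not a real checkpoint):

```shell
# Sketch only: serve a pre-quantized checkpoint from the Hugging Face Hub.
# <org>/<model-fp8> is a placeholder for your pre-quantized repository.
docker run --gpus all \
  -e HF_TOKEN="$HF_TOKEN" \
  -p 8000:8000 \
  registry.friendli.ai/container \
  --hf-model-name <org>/<model-fp8>
```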