
What is Quantization?

Quantization is a technique that reduces the numerical precision of a generative AI model’s parameters (for example, from 16-bit to 8-bit or 4-bit), cutting memory usage and speeding up inference while maintaining response quality.

Friendli Container supports:

  • Online quantization: Quantize your model on the fly at serving time. You don’t need to prepare pre-quantized weights in advance; just launch the model with the --quantization option, and the system quantizes it dynamically as the container starts.
  • Serving a pre-quantized model: Serve a model whose weights were already quantized beforehand; the container simply loads them at serving time.

Serving a Model with Online Quantization

If you want to serve your own model but need to quantize it (or adjust its precision), Friendli Container offers online quantization, eliminating the need to prepare a quantized model in advance. Once your model is ready, you can serve it with online quantization by adding the --quantization argument when running Friendli Container.
  • --quantization (8bit|4bit|16bit): Applies online quantization with the specified precision. It automatically detects your hardware and selects a suitable quantization scheme.
  • Use --quantization 8bit on NVIDIA Ada, Hopper, and Blackwell GPUs.
  • Use --quantization 4bit on NVIDIA Hopper and Blackwell GPUs.
  • Use --quantization 16bit to dequantize a model to 16-bit precision.

Example: deepseek-ai/DeepSeek-R1 with 4-bit Online Quantization on NVIDIA H200 GPUs

# GPU Info: NVIDIA H200 * 4
# Fill in the values of the following variables.
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""  # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""  # GPUs (e.g., '"device=0,1,2,3"')
export HF_HOME="${HF_HOME:-$HOME/.cache/huggingface}"  # Hugging Face cache directory (mounted below)
export POLICY_DIR=""  # Host directory for policy search results (mounted below)

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v $HF_HOME:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name deepseek-ai/DeepSeek-R1 \
    --quantization 4bit \
    --algo-policy-dir /policy \
    --search-policy true
To serve online-quantized models efficiently, you must run a policy search to find the optimal execution policy. Learn how to run the policy search at Running Policy Search.
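
Once the container is up, you can sanity-check the deployment with a test request. The snippet below is a minimal sketch that assumes Friendli Container exposes an OpenAI-compatible chat completions endpoint on the mapped port 8000; the model name and payload fields are illustrative, so adjust them to your setup.

# Assumption: the server exposes an OpenAI-compatible API on the port mapped above (8000).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'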

Serving a Pre-Quantized Model

If a model has already been quantized and uploaded to the Hugging Face Hub, Friendli Container can serve it directly, as in the following example:

Example: openai/gpt-oss-120b on NVIDIA B200 GPU

# GPU Info: NVIDIA B200
# Fill in the values of the following variables.
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""  # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""  # GPUs (e.g., '"device=0"')
export POLICY_DIR=""  # Host directory for policy search results (mounted below)

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
    --hf-model-name openai/gpt-oss-120b \
    --algo-policy-dir /policy \
    --search-policy true
To serve pre-quantized models efficiently, you must run a policy search to find the optimal execution policy. Learn how to run the policy search at Running Policy Search.
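
As with the online-quantization example, you can verify the server once the model finishes loading. This sketch assumes the container exposes a health endpoint and an OpenAI-compatible API on the mapped port 8000; if your image differs, use whatever readiness check your deployment provides.

# Assumption: a /health endpoint signals readiness once the model is loaded.
curl -i http://localhost:8000/health

# Then send a short test request to the OpenAI-compatible endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hi!"}]}'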