## What is Quantization?
Quantization is a technique that reduces the precision of a generative AI model's parameters, cutting memory usage and improving inference speed while maintaining response quality.

Friendli Container supports two approaches:

- Online quantization: Quantize your model on the fly at serving time. You don't need to prepare pre-quantized weights in advance; just launch the model with the `--quantization` option, and the system quantizes it dynamically as the container starts.
- Serving a pre-quantized model: Serve a model that has already been quantized beforehand. In this mode, you use model weights that were quantized ahead of time and simply load them during serving.
## Serving a Model with Online Quantization
If you want to serve your own model but need to quantize it (or adjust its precision), Friendli Container offers online quantization, eliminating the need to prepare a quantized model in advance. Once your model is ready, serve it with online quantization by adding the `--quantization` argument when running Friendli Container.

- `--quantization (8bit|4bit|16bit)`: Applies online quantization at the specified precision. Friendli Container automatically detects your hardware and selects a suitable quantization scheme.
  - Use `--quantization 8bit` on NVIDIA Ada, Hopper, and Blackwell GPUs.
  - Use `--quantization 4bit` on NVIDIA Hopper and Blackwell GPUs.
To dequantize a model to 16-bit precision, use `--quantization 16bit`.

### Example: deepseek-ai/DeepSeek-R1 with 4-bit Online Quantization on NVIDIA H200 GPUs
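The command below is a minimal sketch of such a launch, assuming the standard Docker-based invocation pattern for Friendli Container; the image name, secret variable, and every flag except `--quantization` are assumptions to verify against your own environment and the Friendli documentation.

```sh
# Sketch only: image name, env var, and all flags except --quantization
# are assumed from the typical Friendli Container launch pattern.
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  registry.friendli.ai/trial \
  --hf-model-name deepseek-ai/DeepSeek-R1 \
  --num-devices 8 \
  --quantization 4bit
```

Once the container is up, you can send a quick smoke test, assuming the engine exposes an OpenAI-compatible API on the published port:

```sh
# Assumes an OpenAI-compatible chat endpoint; adjust fields as needed.
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'
```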
To serve online-quantized models efficiently, you must run a policy search to find the optimal execution policy. Learn how to run the policy search at Running Policy Search.
## Serving a Pre-Quantized Model
If a model has already been quantized and uploaded to the Hugging Face Hub, Friendli Container supports it with the following options:

- Quantized models using well-known quantization formats:
  - MXFP4
  - Fine-grained FP8 (including DeepSeek-V3 style FP8 quantization)
- A subset of models created by supported third-party providers
- Quantized model checkpoints by FriendliAI
### Example: openai/gpt-oss-120b on NVIDIA B200 GPU
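Because openai/gpt-oss-120b is published with MXFP4-quantized weights, no `--quantization` flag is needed; the checkpoint is loaded as-is. The sketch below reuses the same assumed launch pattern as the previous example (image name, secret variable, and flags other than the model name are assumptions):

```sh
# Sketch only: serves a checkpoint that is already MXFP4-quantized,
# so no --quantization flag is passed; other details are assumed.
docker run --gpus '"device=0"' -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  registry.friendli.ai/trial \
  --hf-model-name openai/gpt-oss-120b
```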
To serve pre-quantized models efficiently, you must run a policy search to find the optimal execution policy. Learn how to run the policy search at Running Policy Search.