Online Quantization quantizes your model at runtime using FriendliAI’s proprietary method, improving speed and reducing cost with little to no loss in accuracy. Because a quantized model needs less GPU memory, you can select lower-VRAM GPU instances without sacrificing performance. You can configure the precision level with the following options:
  • Off: Serve the model at its original precision.
  • 4-bit: Quantize to 4-bit precision for the largest savings in memory and cost.
  • 8-bit: Quantize to 8-bit precision for a balance between savings and accuracy.
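To give a rough sense of why lower precision allows smaller GPU instances, the sketch below estimates the memory taken by model weights alone at each setting. It is a back-of-the-envelope illustration, not FriendliAI's method: the function name `weight_memory_gib` is hypothetical, the original precision is assumed to be 16-bit, and the estimate ignores the KV cache, activations, and any per-group scale metadata a real quantization scheme stores.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GiB.

    Hypothetical helper for illustration; ignores KV cache, activations,
    and quantization metadata such as scales.
    """
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    params = 8e9  # assumption: a hypothetical 8-billion-parameter model
    options = [
        ("Off (assuming 16-bit original precision)", 16),
        ("8-bit", 8),
        ("4-bit", 4),
    ]
    for label, bits in options:
        print(f"{label}: ~{weight_memory_gib(params, bits):.1f} GiB")
```

Under these assumptions the weights take roughly 14.9 GiB at 16-bit, 7.5 GiB at 8-bit, and 3.7 GiB at 4-bit. Actual savings depend on the model and on how the quantization is implemented.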
Some models (e.g., those that are already quantized) may not be compatible with Online Quantization. Among compatible models, not every target precision is supported; some models may support only 8-bit.
In certain cases, specific GPU instance types may not be available when this option is enabled.
Last modified on June 24, 2026