Online Quantization

Skip the hassle of preparing a quantized model. When you enable Online Quantization, Friendli automatically quantizes your model to the target precision at runtime using a proprietary method, preserving quality while improving speed and cost-efficiency. Two precision levels are currently supported: 4BIT and 8BIT.
Because the quantized model requires less memory, you can select lower-VRAM GPU instances without performance loss.
Some models (e.g., those already quantized) may not be compatible with Online Quantization.
Not all models support every target precision; some may support only 8BIT.
Certain GPU instance types may not be available when this option is enabled.
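For concreteness, here is a minimal sketch of what a deployment request with Online Quantization enabled might look like. The endpoint path and the payload field names (such as the quantization block and its precision key) are illustrative assumptions, not the documented Friendli API; consult the endpoint-creation reference for the exact schema.

```python
# Hypothetical sketch only: the URL path and payload fields are assumptions
# for illustration. Refer to the Friendli endpoint-creation docs for the
# real request schema.
import os
import requests

FRIENDLI_TOKEN = os.environ["FRIENDLI_TOKEN"]  # personal access token

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # example model id (assumption)
    "quantization": {
        "online": True,        # enable Online Quantization at deploy time
        "precision": "8BIT",   # or "4BIT" when the model supports it
    },
}

resp = requests.post(
    "https://api.friendli.ai/<endpoint-creation-path>",  # placeholder path (see docs)
    headers={"Authorization": f"Bearer {FRIENDLI_TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

If the model or the selected precision is not supported (for example, a model that is already quantized), expect the request to be rejected; in that case choose a different precision or disable the option.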