Online Quantization

Skip the hassle of preparing a quantized model. When you enable Online Quantization, your model is automatically quantized to the target precision at runtime using Friendli’s proprietary method, which preserves quality while improving speed and cost-efficiency. Two precision levels are currently supported: 4BIT and 8BIT.
Because quantized weights take up less memory, you can select lower-VRAM GPU instances without performance loss.
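
To make the VRAM savings concrete, here is a rough back-of-the-envelope sketch. It assumes the unquantized model stores weights in 16 bits and uses a hypothetical 8-billion-parameter model. It counts weight storage only and ignores the KV cache, activations, and any per-group scaling metadata a quantization scheme may add. It does not reflect how Friendli's proprietary method works internally; the function name and numbers are illustrative.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Hypothetical model size, chosen only for illustration.
NUM_PARAMS = 8e9

# The 16-bit baseline is an assumption about the original checkpoint.
for label, bits in [("16-bit baseline", 16), ("8BIT", 8), ("4BIT", 4)]:
    print(f"{label}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB for weights")
```

Under these assumptions, each halving of the bit width halves the weight footprint, which is why a quantized model may fit on a smaller GPU instance. Actual memory use at runtime will be higher than these estimates.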

Some models (e.g., those already quantized) may not be compatible with Online Quantization.
Not all models support all target precisions. Some may only support 8BIT.
In certain cases, specific GPU instance types may not be available when this option is enabled.
