- Off: Serve the model at its original precision.
- 4-bit: Quantize to 4-bit precision for the largest savings in memory and cost.
- 8-bit: Quantize to 8-bit precision for a balance between savings and accuracy.
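As a rough illustration of the savings behind these options, the memory taken by model weights scales with the number of bits per weight. The sketch below is a back-of-the-envelope estimate only: the 7-billion-parameter model and the helper `estimate_weight_memory_gib` are hypothetical, 16-bit is assumed as the original precision, and the calculation ignores quantization metadata, activations, and cache memory.

```python
def estimate_weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1024**3


# Hypothetical model size; 16-bit is an assumed original precision.
NUM_PARAMS = 7_000_000_000

estimates = {
    "Off (assumed 16-bit)": estimate_weight_memory_gib(NUM_PARAMS, 16),
    "8-bit": estimate_weight_memory_gib(NUM_PARAMS, 8),
    "4-bit": estimate_weight_memory_gib(NUM_PARAMS, 4),
}

for option, gib in estimates.items():
    print(f"{option}: ~{gib:.1f} GiB")
```

Under these assumptions, 8-bit needs about half the weight memory of a 16-bit original and 4-bit about a quarter. Actual savings depend on the model and the quantization method used.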
Some models (e.g., those already quantized) may not be compatible with Online Quantization.
Not all models support all target precisions. Some may only support 8-bit.
In certain cases, specific GPU instance types may not be available when Online Quantization is enabled.