• July 25, 2025
  • 2 min read

Announcing Online Quantization: Faster, Cheaper Inference with Same Accuracy

We're introducing Online Quantization — a new feature in Dedicated Endpoints that lowers inference costs while maintaining accuracy, with no model prep required.

This feature automatically quantizes your models as they load, so there's no need to store separate quantized versions. With online quantization, Dedicated Endpoints can run your workloads on fewer GPUs, reducing infrastructure costs and accelerating inference while maintaining reliability.

What Is Online Quantization?

Online quantization automatically converts model weights and activations from their original precision (such as FP16) to lower-precision formats like FP8 or FP4 during model loading. Unlike traditional quantization methods that require preprocessing, retraining, or model changes, online quantization adjusts precision on the fly (a minimal sketch follows the list below), delivering a seamless and efficient experience:

  • Zero setup: No calibration data, retraining, or model changes required.
  • Faster inference: Improves both Time-to-First-Token (TTFT) and Time-Per-Output-Token (TPOT).
  • Lower GPU usage: Cut GPU needs by 2-4x, significantly reducing costs.
  • On-the-fly conversion: Quantization happens automatically during model load.
  • Preserved accuracy: Accuracy remains nearly identical to the original.
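To make the mechanism concrete, here is a minimal PyTorch sketch of per-tensor FP8 weight quantization at load time. This is an illustrative assumption of how on-the-fly conversion can work, not Friendli Inference's actual implementation (which also covers activations, kernel fusion, and finer-grained scaling); the checkpoint path is a placeholder.

```python
import torch

def quantize_fp8_per_tensor(weight: torch.Tensor):
    # Map the tensor's dynamic range onto FP8 e4m3 (max representable value ~448).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = weight.abs().max().clamp(min=1e-12) / fp8_max
    q = (weight / scale).to(torch.float8_e4m3fn)
    return q, scale  # dequantize later as q.to(weight.dtype) * scale

# Convert each tensor as the checkpoint streams in, so only the FP8 copy
# (plus one scale per tensor) ever occupies GPU memory.
state_dict = torch.load("checkpoint.pt", map_location="cpu")  # placeholder path
quantized = {
    name: quantize_fp8_per_tensor(w)
    for name, w in state_dict.items()
    if w.is_floating_point()
}
```

Because only the low-precision copy and one scale per tensor reach the GPU, the weight footprint is roughly halved relative to FP16.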

Powered by Friendli Inference, online quantization further optimizes compute efficiency alongside our cutting-edge technologies like iteration batching (a.k.a. continuous batching), multimodal caching, multi-LoRA, speculative decoding, and more.

How Online Quantization Cuts Costs

Quantization reduces numerical precision (e.g., FP16 → FP8), cutting the compute and memory traffic required during inference; a rough model of the effect appears after the list below.

With online quantization now integrated into Dedicated Endpoints:

  • Models run with less GPU memory and compute power per request
  • Endpoint throughput increases, enabling more requests per GPU
  • You can serve the same workload with fewer GPUs, reducing your cloud or data center costs
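As a back-of-the-envelope illustration of why this holds (our simplified model, not measured numbers): token generation is typically memory-bandwidth bound, since every generated token streams the model weights from GPU memory, so Time-Per-Output-Token scales roughly with bytes per parameter.

```python
# Idealized decode-time model (assumes generation is memory-bandwidth bound
# and ignores KV-cache reads, tensor parallelism, and batching effects).
def tpot_ms(num_params_b: float, bytes_per_param: float, hbm_tb_s: float) -> float:
    bytes_per_token = num_params_b * 1e9 * bytes_per_param  # weights read per token
    return bytes_per_token / (hbm_tb_s * 1e12) * 1e3

HBM = 3.35  # approx. NVIDIA H100 SXM HBM3 bandwidth in TB/s
print(f"FP16: {tpot_ms(70, 2, HBM):.1f} ms/token")  # ~41.8 ms
print(f"FP8 : {tpot_ms(70, 1, HBM):.1f} ms/token")  # ~20.9 ms
```

Halving the bytes moved per token roughly halves TPOT, which is the same effect that lets an endpoint serve more requests per GPU.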

Fast Inference. Lower Costs. No Retraining.

Skip the slowdowns of traditional quantization. Our online approach runs automatically at initialization—no retraining, no calibration—delivering minimal accuracy loss with maximum speed.

Accelerate Time-to-First-Token (TTFT) and Time-per-Output-Token (TPOT) while cutting GPU costs instantly. No need to manage multiple model versions or pipelines. Speed meets simplicity with our online quantization technology.

Getting Started

To enable Online Quantization, just turn it on in your Dedicated Endpoint configuration — no model changes or pipeline updates required. It supports all major model formats and GPU types, and integrates directly into your existing AI workflows.

After selecting an eligible model on the endpoint creation page, you can toggle "Online Quantization" on or off under the "Endpoint features" section.

Figure 1: Create endpoint overview.
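If you prefer to automate endpoint creation, the request might look roughly like the sketch below. This is a hypothetical shape: the URL, field names, and the online_quantization flag are illustrative assumptions mirroring the UI toggle, not FriendliAI's published API schema; consult the docs for the real interface.

```python
import os
import requests

# Hypothetical request: the URL and JSON fields below are illustrative
# assumptions, not a documented API schema.
resp = requests.post(
    "https://api.friendli.ai/dedicated/v1/endpoints",  # placeholder URL
    headers={"Authorization": f"Bearer {os.environ['FRIENDLI_TOKEN']}"},
    json={
        "model": "Qwen/Qwen2.5-72B-Instruct",
        "online_quantization": True,  # the "Endpoint features" toggle
    },
    timeout=30,
)
resp.raise_for_status()
```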

Qwen/Qwen2.5-72B-Instruct, for instance, would normally require 4x NVIDIA H100 GPUs. With Online Quantization, it can run on just 2x NVIDIA H100 GPUs, cutting the cost in half.
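The arithmetic behind that sizing, assuming roughly 2 bytes per parameter in FP16 and 1 byte in FP8 (a sketch that ignores exact runtime overheads):

```python
PARAMS_B = 72.7  # Qwen/Qwen2.5-72B-Instruct parameter count (approx.)
H100_GB = 80     # HBM capacity per NVIDIA H100

fp16_weights = PARAMS_B * 2  # ~145 GB of weights at 2 bytes/param
fp8_weights = PARAMS_B * 1   # ~73 GB at 1 byte/param

# 4x H100 (320 GB) leaves ~175 GB for KV cache and activations in FP16;
# at FP8, a similar proportion of headroom fits on just 2x H100 (160 GB).
print(f"FP16 weights: {fp16_weights:.0f} GB on 4x H100 = {4 * H100_GB} GB")
print(f"FP8  weights: {fp8_weights:.0f} GB on 2x H100 = {2 * H100_GB} GB")
```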

Figure 2: Quantization off.

Figure 3: Quantization on.

To see whether online quantization is enabled for an endpoint, simply check the endpoint overview.

Figure 4: Endpoint overview with online quantization enabled.

To learn more about how to configure your Dedicated Endpoints, please refer to our docs.


Written by

FriendliAI Tech & Research



General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Friendli Inference lets you squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the metric that matters, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multimodal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub takes you straight to our model deployment page, which provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Talk to an expert. Our experts (not a bot) will reply within one business day.