  • February 7, 2024
  • 2 min read

Friendli TCache: Optimizing LLM Serving by Reusing Computations


The Friendli Engine has already garnered attention for its impressive performance (e.g., comparison of vLLM and Friendli, quantization performance comparison) in serving large language models (LLMs), but what's under the hood driving this efficiency? In this blog article, we highlight Friendli TCache as a representative example of the many techniques contributing to the engine's exceptional speed and GPU efficiency.

The Challenge: Repetitive Computations

Serving LLMs involves processing large numbers of tokens, and the same computations are often repeated across requests. In a Q&A workload, for example, every question about the same document repeats much of the same computation over that document's tokens. Left unoptimized, these redundant computations strain GPU resources, creating bottlenecks in both GPU compute cycles and GPU memory.

Friendli TCache: A Cache to Reuse Recurring Computations

Friendli TCache tackles this challenge head-on, offering a novel caching mechanism optimized for LLM inference. It intelligently identifies and stores frequently recurring computational results, and the Friendli Engine reuses these cached results instead of recomputing them, significantly reducing the workload on the GPUs.
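FriendliAI has not published TCache's internals, so the following is only a minimal sketch of the general pattern: cache intermediate results keyed by shared token prefixes, so later requests recompute only their uncached suffix. The block-based keying, the DummyModel, and the per-token "state" values are all assumptions for illustration, not TCache's actual design.

    import hashlib

    BLOCK = 128  # granularity (in tokens) at which prefixes are cached

    # Hypothetical sketch only: FriendliAI has not published TCache's
    # internals. This illustrates the general pattern of caching results
    # for shared token prefixes so only the uncached suffix is recomputed.

    class PrefixComputationCache:
        def __init__(self):
            self._store = {}  # prefix hash -> cached per-token state

        @staticmethod
        def _key(token_ids):
            return hashlib.sha256(repr(token_ids).encode()).hexdigest()

        def lookup(self, token_ids):
            # Find the longest block-aligned cached prefix of this request.
            for end in range((len(token_ids) // BLOCK) * BLOCK, 0, -BLOCK):
                state = self._store.get(self._key(token_ids[:end]))
                if state is not None:
                    return end, state
            return 0, None

        def insert(self, token_ids, states):
            # Store states at block boundaries so later requests sharing a
            # prefix (e.g., the same document) can pick them up.
            for end in range(BLOCK, len(token_ids) + 1, BLOCK):
                self._store[self._key(token_ids[:end])] = states[:end]

    class DummyModel:
        # Stand-in for the engine's prefill step; `forward` is a placeholder
        # name, and the "state" is a fake per-token value, not real KV data.
        def forward(self, new_tokens, past_state=None):
            return (past_state or []) + [t * t for t in new_tokens]

    def prefill(model, token_ids, cache):
        start, state = cache.lookup(token_ids)
        # Only tokens after the cached prefix need fresh computation.
        state = model.forward(token_ids[start:], past_state=state)
        cache.insert(token_ids, state)
        return state

    cache, model = PrefixComputationCache(), DummyModel()
    doc = list(range(1000))                 # tokens of a shared document
    prefill(model, doc + [1, 2, 3], cache)  # first question: full prefill
    prefill(model, doc + [4, 5, 6], cache)  # reuses 896 cached doc tokens

Real engines typically track prefixes with a radix tree or block table rather than hashing every candidate prefix, but the effect is the same: the second question about a shared document reuses most of the first question's prefill work.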

The Results: Increased Speed & Resource Savings

The impact of the optimizations in the Friendli Engine, including Friendli TCache, can be seen in the graphs below. Our evaluations ran Meta's Llama 2 70B on four NVIDIA A100 80GB GPUs with a Q&A workload over 40 documents:

Figure: p90 TTFT comparison on a synthetic dataset (Meta's Llama 2 70B on four NVIDIA A100 80GB GPUs).

  • Blazing-Fast Performance: Compared to vLLM, Friendli Engine delivers 11.3x to 23x faster time to first token (TTFT) across input loads ranging from 1N to 5N; vLLM's TTFT increases dramatically under higher loads. (A sketch of how TTFT can be measured follows this list.)

  • Unwavering Stability and Scalability: Unlike vLLM, Friendli Engine maintains consistent TTFT under varying load conditions, ensuring reliable and predictable response times and demonstrating exceptional scalability.

  • Reduced Resource Consumption: By reusing computations, Friendli TCache significantly reduces the GPU resources required for inference, which translates to higher performance and lower operational costs.
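For reference, TTFT is simply the delay between sending a request and receiving the first streamed token. Below is a minimal sketch of how it can be measured, assuming an OpenAI-compatible streaming endpoint; the URL and model name are placeholders, not FriendliAI's actual benchmark harness.

    import time
    import requests  # third-party: pip install requests

    # Hypothetical sketch: measure the delay from sending a request to
    # receiving the first streamed chunk. The endpoint and model name are
    # placeholders for any OpenAI-compatible streaming server.

    API_URL = "http://localhost:8000/v1/chat/completions"  # placeholder

    def measure_ttft(prompt):
        payload = {
            "model": "llama-2-70b",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        }
        start = time.perf_counter()
        with requests.post(API_URL, json=payload, stream=True) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if line:  # first non-empty chunk carries the first token
                    return time.perf_counter() - start
        raise RuntimeError("stream ended before any token arrived")

    print(f"TTFT: {measure_ttft('Summarize this document.'):.3f} s")

Running many such requests concurrently and taking the 90th percentile of the measured values gives a p90 TTFT like the one shown in the figure above.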

Experience the Friendli Advantage!

Friendli offers multiple ways to leverage the power of Friendli Engine, built upon many optimizations including TCache.

Ready to unlock the full potential of your LLMs? Visit Friendli today and explore how Friendli Engine can expedite your generative AI serving experience.


Written by


FriendliAI Tech & Research

