- February 7, 2024
Friendli TCache: Optimizing LLM Serving by Reusing Computations
Friendli Inference has already garnered attention for its impressive performance in serving large language models (LLMs) (e.g., comparison of vLLM and Friendli, quantization performance comparison), but what is under the hood driving this efficiency? In this blog article, we highlight Friendli TCache as a representative example of the many techniques contributing to the engine's exceptional speed and GPU optimization.
The Challenge: Repetitive Computations
Serving LLMs involves processing large numbers of tokens, which often means performing the same computation multiple times. Left unoptimized, these redundant computations strain GPU resources, creating bottlenecks in both GPU compute cycles and GPU memory.
Friendli TCache: A Cache to Reuse Recurring Computations
Friendli TCache tackles this challenge head-on, offering a novel caching mechanism optimized for LLMs. It intelligently identifies and stores frequently used computational results. Friendli Inference then reuses the cached results instead of recomputing them, significantly reducing the workload on the GPUs.
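To make the general idea of reusing recurring computations concrete, here is a minimal sketch in Python. It is not Friendli TCache itself, whose internals are not described in this article. It assumes a simple least-recently-used cache keyed by the exact token sequence, and `ComputationCache` and `expensive_encode` are hypothetical names invented for illustration.

```python
from collections import OrderedDict
from typing import Callable, Sequence


class ComputationCache:
    """Hypothetical LRU cache: maps a token sequence to a previously computed result."""

    def __init__(self, compute: Callable[[tuple], object], capacity: int = 128) -> None:
        self._compute = compute          # the expensive function we want to avoid re-running
        self._capacity = capacity        # maximum number of cached results
        self._store: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, tokens: Sequence[int]) -> object:
        key = tuple(tokens)              # tuples are hashable, lists are not
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._compute(key)      # cache miss: do the work once
        self._store[key] = result
        if len(self._store) > self._capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return result


def expensive_encode(tokens: tuple) -> int:
    """Stand-in for a costly GPU computation (an assumption, purely illustrative)."""
    return sum(t * t for t in tokens)


if __name__ == "__main__":
    cache = ComputationCache(expensive_encode, capacity=2)
    cache.get([1, 2, 3])   # computed
    cache.get([1, 2, 3])   # served from the cache
    print(f"hits={cache.hits} misses={cache.misses}")
```

In a workload where many requests share the same input, such as repeated questions over the same documents, every cache hit skips the expensive call entirely. The eviction policy and the cache key are design choices of this sketch, not statements about how TCache decides what to store.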
The Results: Increased Speed & Resource Savings
The impact of the optimizations in Friendli Inference, including Friendli TCache, can be seen in the graphs below. Our evaluations with Meta’s Llama-2 70B on four NVIDIA A100 80GB GPUs, running a Q&A workload on 40 documents, show the following:
- Blazing-Fast Performance: Compared to vLLM, Friendli Inference delivers 11.3x to 23x faster time to first token (TTFT) across input loads between 1N and 5N. The TTFT for vLLM increases dramatically with higher loads.
- Unwavering Stability and Scalability: Unlike vLLM, Friendli Inference maintains consistent performance (i.e., TTFT) under varying load conditions, ensuring reliable and predictable response times. This also demonstrates Friendli Inference’s exceptional scalability.
- Reduced Resource Consumption: By reusing computations, Friendli TCache significantly reduces the GPU resources required for inference. This translates to higher performance, lower operational costs, and increased cost-efficiency.
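For readers who want to check TTFT on their own deployment, the metric can be sketched as the elapsed time between sending a request and receiving the first streamed token. The snippet below is a minimal illustration, not the benchmark harness used for the results above. `measure_ttft` and `fake_stream` are hypothetical helpers, and `fake_stream` merely simulates a streaming response with `time.sleep`.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # First token arrived: this interval is the time to first token.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, count


def fake_stream(prefill_delay: float, n_tokens: int, per_token_delay: float) -> Iterator[str]:
    """Simulated streaming response (an assumption standing in for a real endpoint)."""
    time.sleep(prefill_delay)            # time spent before the first token is emitted
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_delay)  # gap between subsequent tokens
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, count = measure_ttft(fake_stream(0.05, 5, 0.01))
    print(f"TTFT={ttft:.3f}s total={total:.3f}s tokens={count}")
```

To measure a real service, replace `fake_stream` with an iterator over the tokens returned by your client's streaming API, and repeat the measurement at each load level you care about.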
Experience the Friendli Advantage!
Friendli offers multiple ways to leverage the power of Friendli Inference, built upon many optimizations including TCache:
- Friendli Dedicated Endpoints: Run your custom generative AI models on dedicated GPUs with autopilot convenience.
- Friendli Serverless Endpoints: Start instantly with open-source models through our user-friendly API with the lowest costs in the market.
- Friendli Container: Deploy the engine in your environment (e.g., on-premises or cloud) for complete control.
Ready to unlock the full potential of your LLMs? Visit Friendli today and explore how Friendli Inference can expedite your generative AI serving experience.
Written by
FriendliAI Tech & Research