Friendli TCache: Optimizing LLM Serving by Reusing Computations


The Friendli Engine has already gathered attention for its performance in serving large language models (LLMs) (e.g., comparison of vLLM and Friendli, quantization performance comparison), but what is under the hood that drives this efficiency? In this post, we highlight Friendli TCache as a representative example of the many techniques that contribute to the engine's speed and GPU efficiency.

The Challenge: Repetitive Computations

Serving LLMs involves processing a large number of tokens, and the same computation is often performed multiple times. Left unoptimized, these redundant computations strain GPU resources, creating bottlenecks in both GPU compute cycles and GPU memory.

Friendli TCache: A Cache to Reuse Recurring Computations

Friendli TCache tackles this challenge head-on with a novel caching mechanism optimized for LLMs. It identifies and stores frequently used computational results, and the Friendli Engine reuses these cached results instead of recomputing them, significantly reducing the workload on the GPUs.
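To make the idea of reusing recurring computations concrete, here is a minimal Python sketch. It is not how Friendli TCache is implemented (this post does not describe its internals); it is a generic illustration that assumes the recurring work is a shared token prefix, such as the same document appearing in several requests. The class name `ToyPrefixCache`, the `block_size` parameter, and the stand-in "state" arithmetic are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


class ToyPrefixCache:
    """Toy cache that reuses per-token results for repeated token prefixes.

    Hypothetical illustration only; not the Friendli TCache implementation.
    """

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        # Maps a block-aligned token prefix to the states computed for it.
        self._store: Dict[Tuple[int, ...], List[int]] = {}
        self.computed_tokens = 0  # tokens that actually required computation

    def process(self, tokens: Sequence[int]) -> Tuple[List[int], int]:
        """Return (states, reused), where reused counts tokens served from cache."""
        toks = tuple(tokens)
        states: List[int] = []

        # Look for the longest block-aligned prefix that is already cached.
        n = (len(toks) // self.block_size) * self.block_size
        while n > 0:
            cached = self._store.get(toks[:n])
            if cached is not None:
                states = list(cached)
                break
            n -= self.block_size
        reused = len(states)

        # Compute only the tokens that were not covered by the cache.
        for i in range(reused, len(toks)):
            prev = states[-1] if states else 0
            # Stand-in for expensive per-token work that depends on the prefix.
            states.append((prev * 31 + toks[i]) % 1_000_003)
            self.computed_tokens += 1
            if (i + 1) % self.block_size == 0:
                self._store[toks[: i + 1]] = list(states)

        return states, reused


if __name__ == "__main__":
    cache = ToyPrefixCache(block_size=4)
    document = list(range(10))  # shared context, e.g. one document
    _, reused_first = cache.process(document + [100, 101])
    _, reused_second = cache.process(document + [200, 201, 202])
    print("reused on first request:", reused_first)
    print("reused on second request:", reused_second)
    print("tokens computed in total:", cache.computed_tokens)
```

In this toy, the second request skips the work for the part of the document that falls on a cached block boundary and computes only the remainder. A production system would also need an eviction policy and bounded memory use, which the sketch omits.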

The Results: Increased Speed & Resource Savings

The impact of the optimizations in the Friendli Engine, including Friendli TCache, can be seen in the graphs below. Our evaluation results with Meta’s Llama-2 70B on four NVIDIA A100 80GB GPUs, running a Q&A workload on 40 documents, are as follows:

  • Blazing-Fast Performance: Compared to vLLM, Friendli Engine delivers 11.3x to 23x faster time to first token (TTFT) across input loads between 1N and 5N. The TTFT for vLLM increases dramatically at higher loads.

  • Unwavering Stability and Scalability: Unlike vLLM, Friendli Engine maintains a consistent TTFT under varying load conditions, ensuring reliable and predictable response times. This also demonstrates Friendli Engine’s scalability.

  • Reduced Resource Consumption: By reusing computations, Friendli TCache significantly reduces the GPU resources required for inference. This translates to higher performance and lower operational costs.

Experience the Friendli Advantage!

Friendli offers multiple ways to leverage the power of Friendli Engine, built upon many optimizations including TCache:

Ready to unlock the full potential of your LLMs? Visit Friendli today and explore how Friendli Engine can expedite your generative AI serving experience.

