  • February 7, 2024
  • 2 min read

Friendli TCache: Optimizing LLM Serving by Reusing Computations


The Friendli Engine has already garnered attention for its impressive performance (e.g., comparison of vLLM and Friendli, quantization performance comparison) in serving large language models (LLMs), but what's under the hood driving this efficiency? In this blog article, we highlight Friendli TCache as a representative example of the many techniques contributing to the engine's exceptional speed and GPU efficiency.

The Challenge: Repetitive Computations

Serving LLMs involves processing large numbers of tokens, and the same computations are often repeated across requests. In a Q&A workload, for example, every question about the same document repeats much of the same computation over that document's tokens. Left unoptimized, these redundant computations strain GPU resources, creating bottlenecks in both GPU compute cycles and GPU memory.

Friendli TCache: A Cache to Reuse Recurring Computations

Friendli TCache tackles this challenge head-on, offering a novel caching mechanism optimized for LLM inference. It intelligently identifies and stores frequently recurring computational results, and the Friendli Engine reuses these cached results instead of recomputing them, significantly reducing the workload on the GPUs.
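FriendliAI has not published TCache's internals, so the following is only a minimal sketch of the general pattern: cache intermediate results keyed by shared token prefixes, so later requests recompute only their uncached suffix. The block-based keying, the DummyModel, and the per-token "state" values are all assumptions for illustration, not TCache's actual design.

    import hashlib

    BLOCK = 128  # granularity (in tokens) at which prefixes are cached

    # Hypothetical sketch only: FriendliAI has not published TCache's
    # internals. This illustrates the general pattern of caching results
    # for shared token prefixes so only the uncached suffix is recomputed.

    class PrefixComputationCache:
        def __init__(self):
            self._store = {}  # prefix hash -> cached per-token state

        @staticmethod
        def _key(token_ids):
            return hashlib.sha256(repr(token_ids).encode()).hexdigest()

        def lookup(self, token_ids):
            # Find the longest block-aligned cached prefix of this request.
            for end in range((len(token_ids) // BLOCK) * BLOCK, 0, -BLOCK):
                state = self._store.get(self._key(token_ids[:end]))
                if state is not None:
                    return end, state
            return 0, None

        def insert(self, token_ids, states):
            # Store states at block boundaries so later requests sharing a
            # prefix (e.g., the same document) can pick them up.
            for end in range(BLOCK, len(token_ids) + 1, BLOCK):
                self._store[self._key(token_ids[:end])] = states[:end]

    class DummyModel:
        # Stand-in for the engine's prefill step; `forward` is a placeholder
        # name, and the "state" is a fake per-token value, not real KV data.
        def forward(self, new_tokens, past_state=None):
            return (past_state or []) + [t * t for t in new_tokens]

    def prefill(model, token_ids, cache):
        start, state = cache.lookup(token_ids)
        # Only tokens after the cached prefix need fresh computation.
        state = model.forward(token_ids[start:], past_state=state)
        cache.insert(token_ids, state)
        return state

    cache, model = PrefixComputationCache(), DummyModel()
    doc = list(range(1000))                 # tokens of a shared document
    prefill(model, doc + [1, 2, 3], cache)  # first question: full prefill
    prefill(model, doc + [4, 5, 6], cache)  # reuses 896 cached doc tokens

Real engines typically track prefixes with a radix tree or block table rather than hashing every candidate prefix, but the effect is the same: the second question about a shared document reuses most of the first question's prefill work.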

The Results: Increased Speed & Resource Savings

The impact of the optimizations in the Friendli Engine, including Friendli TCache, can be seen in the graphs below. Our evaluations ran Meta's Llama 2 70B on four NVIDIA A100 80GB GPUs with a Q&A workload over 40 documents:

Figure: p90 TTFT comparison on a synthetic dataset (Meta's Llama 2 70B on four NVIDIA A100 80GB GPUs).

  • Blazing-Fast Performance: Compared to vLLM, Friendli Engine delivers 11.3x to 23x faster time to first token (TTFT) across input loads ranging from 1N to 5N; vLLM's TTFT increases dramatically under higher loads. (A sketch of how TTFT can be measured follows this list.)

  • Unwavering Stability and Scalability: Unlike vLLM, Friendli Engine maintains consistent TTFT under varying load conditions, ensuring reliable and predictable response times and demonstrating exceptional scalability.

  • Reduced Resource Consumption: By reusing computations, Friendli TCache significantly reduces the GPU resources required for inference, which translates to higher performance and lower operational costs.
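For reference, TTFT is simply the delay between sending a request and receiving the first streamed token. Below is a minimal sketch of how it can be measured, assuming an OpenAI-compatible streaming endpoint; the URL and model name are placeholders, not FriendliAI's actual benchmark harness.

    import time
    import requests  # third-party: pip install requests

    # Hypothetical sketch: measure the delay from sending a request to
    # receiving the first streamed chunk. The endpoint and model name are
    # placeholders for any OpenAI-compatible streaming server.

    API_URL = "http://localhost:8000/v1/chat/completions"  # placeholder

    def measure_ttft(prompt):
        payload = {
            "model": "llama-2-70b",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        }
        start = time.perf_counter()
        with requests.post(API_URL, json=payload, stream=True) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if line:  # first non-empty chunk carries the first token
                    return time.perf_counter() - start
        raise RuntimeError("stream ended before any token arrived")

    print(f"TTFT: {measure_ttft('Summarize this document.'):.3f} s")

Running many such requests concurrently and taking the 90th percentile of the measured values gives a p90 TTFT like the one shown in the figure above.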

Experience the Friendli Advantage!

Friendli offers multiple ways to leverage the power of Friendli Engine, built upon many optimizations including TCache.

Ready to unlock the full potential of your LLMs? Visit Friendli today and explore how Friendli Engine can expedite your generative AI serving experience.


Written by


FriendliAI Tech & Research

