  • June 3, 2024
  • 3 min read

NextDay AI instantly saves GPU costs for LLM serving

Summary

An LLM-powered chatbot company immediately cuts its GPU costs by more than 50%.

Introduction to NextDay AI

NextDay AI, a pioneering company in the entertainment technology sector, integrates generative AI and creative expertise to deliver personalized experiences and virtual solutions for consumers and businesses alike. One of their offerings is an innovative AI chatbot platform. This platform allows users to create chatbots with unique personas, including fictional, historical, or celebrity figures, offering highly personalized interactions tailored to diverse customer needs.

Challenges

NextDay AI's commitment to personalization and quality in its chatbot service presents significant technical and financial challenges. Personalization in character-based emotional support chatbots is vital for creating an engaging and fun user experience. It strengthens the emotional connection between the user and the chatbot, ensuring the support is relevant, empathetic, and adaptive to individual needs. However, operating one of the busiest persona-driven chatbot platforms is resource-intensive and costly, primarily because serving the requests requires high-end GPUs. The platform serves multiple LLMs, both custom and open source, and processes roughly 0.5 trillion tokens per month, which required NextDay AI to add tens of H100 GPUs. As demand for its services grew, so did the need for scalable solutions that contain rising operational costs without compromising service quality.

How Friendli solves the problem

To address these challenges, NextDay AI turned to FriendliAI's container service. Friendli Container is powered by Friendli Engine, which uses the following techniques to address the problems NextDay AI faced.

  • Friendli DNN Library: Friendli DNN Library is a set of optimized GPU kernels designed specifically for generative AI. Our library allows Friendli Engine to run inference faster by supporting quantization schemes such as FP8 and AWQ. Quantization reduces model size and memory usage, but running quantized models efficiently is challenging. The library's kernels let Friendli Engine run quantized models quickly, which reduces operational GPU costs.
  • Iteration Batching: Friendli Engine also uses iteration batching, a technology we invented and have further optimized, to handle concurrent generation requests efficiently. Iteration batching addresses the limitations of traditional batching methods: it achieves higher throughput than conventional batching while meeting the same latency requirement, which improves GPU utilization and reduces operational costs.
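
To make the batching idea concrete, here is a toy Python sketch that counts decoding steps under request-level (static) batching and under iteration-level batching. It is an illustration only, not Friendli Engine's implementation: the function names, the example workload, and the assumption that every step generates exactly one token per running request are all hypothetical simplifications.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Request-level batching: each batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        steps += max(batch)  # short requests wait for the longest one
    return steps


def iteration_batching_steps(output_lengths, batch_size):
    """Iteration-level batching: free slots are refilled before every step."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free batch slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One iteration generates one token per running request;
        # requests that have finished leave the batch immediately.
        running = [left - 1 for left in running if left - 1 > 0]
        steps += 1
    return steps


# Hypothetical workload: output lengths (in tokens) of eight queued requests.
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
print("static batching steps:   ", static_batching_steps(lengths, batch_size=4))
print("iteration batching steps:", iteration_batching_steps(lengths, batch_size=4))
```

In this toy workload, iteration-level scheduling finishes in 14 steps versus 16 for static batching, because slots freed by short requests are reused right away. Real gains depend on the request mix, model, and hardware.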

Results

The deployment of the Friendli Container brought remarkable results:

Results of adoption of Friendli Container

  • Increased throughput and reduced cost: NextDay AI's adoption of Friendli Container significantly relieved the financial strain. By leveraging the container's advanced features, NextDay AI saw a 2-3x increase in LLM throughput while maintaining or even improving token-processing latency with fewer GPUs. This boost in service capacity, achieved without additional GPU investment, led to a reduction of at least 50% in operational costs.
  • Easy to use: The Friendli Container simplifies deploying and scaling generative AI models in live chatbot environments. NextDay AI could easily integrate the Friendli Container directly into its existing GPU infrastructure, improving the efficiency and capacity of its chatbot service right away.
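
The link between the throughput gain and the cost reduction can be checked with simple arithmetic. The sketch below assumes, as a simplification not stated in the case study, that the request load is fixed and that cost scales linearly with the number of GPUs; the function name is hypothetical.

```python
def gpu_cost_reduction(throughput_gain):
    """Fraction of GPUs saved at a fixed load when per-GPU throughput
    rises by a factor of `throughput_gain` (assumes cost ~ GPU count)."""
    if throughput_gain <= 0:
        raise ValueError("throughput_gain must be positive")
    return 1.0 - 1.0 / throughput_gain


for gain in (2.0, 3.0):
    saved = gpu_cost_reduction(gain)
    print(f"{gain:.0f}x throughput -> {saved:.0%} fewer GPUs for the same load")
```

Under these assumptions, a 2x gain halves the GPUs needed and a 3x gain saves about two thirds of them, which is consistent with the reported reduction of at least 50%.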

NextDay AI's adoption of FriendliAI's container is a prime example of how advanced generative AI technologies can make LLM inference serving significantly more efficient and reduce the operational costs of generative AI applications. Friendli Container empowers NextDay AI to expand its service offerings, reinforcing its position as a leading player in AI-powered entertainment technology.

If you are a chatbot service provider and this resonates with you, join us in making your LLM inference faster and more affordable without any hassle.

Learn more about Friendli Container here and use the service free for the first 60 days. Sign up →

For inquiries, contact us at sales@friendli.ai.

