• May 15, 2025
  • 4 min read

Expanding Friendli TCache: Flexible Prefix Caching for Faster Multimodal AI Inference


In the world of high-performance AI systems, caching plays a pivotal role in optimizing response times and reducing computational load. FriendliAI pioneered the use of prefix caching in production LLM inference—a technique now widely adopted to store and reuse previously computed hidden states for recurring text sequences.

But what happens when your data isn’t just text?

With the growing ubiquity of multimodal models—those that process and generate across text, images, audio, and more—there’s a pressing need to rethink how we cache not just strings of words, but also embeddings and learned representations of diverse data types.

Today, we highlight a powerful new capability of Friendli TCache: extending prefix caching beyond text to support any multimodal data such as image and video embeddings. This enhancement enables smarter reuse across different modalities and delivers significant performance gains for modern AI applications.

What Is Prefix Caching in AI Inference?

Prefix caching accelerates AI inference for large language models (LLMs) by storing and reusing attention key-value (KV) pairs for frequently used text prefixes. This eliminates redundant computation when prompts are reused or extended—an efficiency gain particularly valuable in chatbots, retrieval-augmented generation (RAG), and autonomous agents.

For example, consider these two requests:

  • Request A
json
[
  {
    "role": "system",
    "content": "You are a highly intelligent and helpful AI assistant trained to provide accurate, thoughtful, and safe responses. Always be clear, concise, and kind. Avoid speculation, and admit when you don’t know something. Use simple language when possible, but adapt to the user's expertise level. Always prioritize user understanding and helpfulness while remaining unbiased and respectful. If asked to generate code, ensure it is syntactically correct, well-commented, and secure. If asked for opinions, clearly distinguish them from factual statements. When a user's request is unclear, ambiguous, or lacks sufficient detail, always ask polite and specific clarifying questions before attempting to answer. Ensure you fully understand the user's intent before proceeding. Prioritize accuracy, user satisfaction, and safe, respectful communication in every response."
  },
  {
    "role": "user",
    "content": "What is the capital of the United States?"
  }
]
  • Request B
json
[
  {
    "role": "system",
    "content": "You are a highly intelligent and helpful AI assistant trained to provide accurate, thoughtful, and safe responses. Always be clear, concise, and kind. Avoid speculation, and admit when you don’t know something. Use simple language when possible, but adapt to the user's expertise level. Always prioritize user understanding and helpfulness while remaining unbiased and respectful. If asked to generate code, ensure it is syntactically correct, well-commented, and secure. If asked for opinions, clearly distinguish them from factual statements. When a user's request is unclear, ambiguous, or lacks sufficient detail, always ask polite and specific clarifying questions before attempting to answer. Ensure you fully understand the user's intent before proceeding. Prioritize accuracy, user satisfaction, and safe, respectful communication in every response."
  },
  {
    "role": "user",
    "content": "What is the capital of Canada?"
  }
]

Both requests share a long prefix: the entire system prompt plus the phrase "What is the capital of ". With prefix caching, the model reuses the KV state computed for that shared prefix and only processes the differing suffixes ("the United States?" vs. "Canada?"), saving both time and compute resources.
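
To make this concrete, here is a minimal, illustrative sketch of a prefix cache (not FriendliAI's actual implementation): KV state is stored under hashes of block-aligned token prefixes, and a new request reuses the longest cached prefix while only its suffix is recomputed. The block size, the PrefixCache class, and the compute_kv stand-in are all hypothetical.

python
import hashlib

BLOCK_SIZE = 16  # tokens per cached block (illustrative)

class PrefixCache:
    """Maps a hash of a token prefix to the KV state computed for that prefix."""

    def __init__(self):
        self._cache = {}  # prefix hash -> list of per-token KV entries

    @staticmethod
    def _prefix_hash(tokens):
        return hashlib.sha256(str(list(tokens)).encode("utf-8")).hexdigest()

    def longest_cached_prefix(self, tokens):
        """Return (num_cached_tokens, kv_state) for the longest block-aligned cached prefix."""
        best_len, best_kv = 0, []
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            kv = self._cache.get(self._prefix_hash(tokens[:end]))
            if kv is None:
                break
            best_len, best_kv = end, kv
        return best_len, best_kv

    def store(self, tokens, kv_state):
        """Cache the KV state for every block-aligned prefix of `tokens`."""
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            self._cache[self._prefix_hash(tokens[:end])] = kv_state[:end]


def prefill(tokens, cache, compute_kv):
    """Reuse cached KV for the shared prefix; compute KV only for the new suffix."""
    cached_len, cached_kv = cache.longest_cached_prefix(tokens)
    suffix_kv = compute_kv(tokens[cached_len:], past_kv=cached_kv)  # only the suffix is recomputed
    full_kv = cached_kv + suffix_kv
    cache.store(tokens, full_kv)
    return full_kv

In the example above, the system prompt and the shared question stem would hit the cache, so Request B only pays for encoding its short differing suffix.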

Benefits of Prefix Caching for AI Acceleration

Prefix caching delivers significant improvements in both performance and cost efficiency for large language model (LLM) inference:

  • ⚡ Reduced Latency: Reusing cached data accelerates inference, enabling faster responses for repeated or similar inputs.
  • 💸 Lower Compute Costs: By avoiding redundant GPU operations, prefix caching reduces processing load—leading to substantial cost savings at scale.
  • ⚖️ Cache-Aware Load Balancing: Routing incoming requests to the GPU nodes where the relevant input data is already cached reduces latency, improves throughput, and maximizes resource utilization (see the sketch below).
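
As a rough illustration of cache-aware routing (a simplified heuristic, not the scheduler FriendliAI ships), the sketch below sends each request to the node whose cached prefixes overlap most with the incoming tokens, falling back to the least-loaded node when nothing matches. The node-state layout and the scoring rule are assumptions for illustration.

python
def shared_prefix_len(a, b):
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(request_tokens, nodes):
    """Pick the node holding the most reusable cached prefix for this request.

    `nodes` maps a node id to {"cached_prefixes": [token lists], "load": float}.
    Heuristic: prefer cache reuse, break ties by sending work to the least-loaded node.
    """
    best_node, best_score = None, float("-inf")
    for node_id, state in nodes.items():
        overlap = max(
            (shared_prefix_len(request_tokens, p) for p in state["cached_prefixes"]),
            default=0,
        )
        score = overlap - state["load"]
        if score > best_score:
            best_node, best_score = node_id, score
    return best_node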

Limitations of Traditional Prefix Caching

Despite its benefits, traditional prefix caching is limited to text data only. It cannot handle non-text inputs such as image embeddings, audio features, or other modality-specific representations.

This leads to significant inefficiencies. For example, if an image is used in multiple prompts, the model must re-compute it each time—introducing unnecessary redundancy and slowing down inference. Traditional prefix caching simply isn’t equipped for the demands of today’s multimodal AI systems.

Introducing Multimodal Friendli TCache

At FriendliAI, we’ve reimagined prefix caching for the multimodal era with Friendli TCache—a flexible, future-forward solution that goes beyond text-only inputs.

Friendli TCache extends its caching capabilities beyond text to include image, video, and other non-text representations. Built with extensibility at its core, it’s ready to support emerging data modalities—ensuring long-term compatibility with evolving AI workloads.

By eliminating redundant computation across modalities, Friendli TCache unlocks unmatched efficiency in real-world applications.
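
One way to picture the idea (a simplified sketch under our own assumptions, not TCache’s internals) is to key cached non-text representations by a content hash of the raw media, so a second request that references the same image never re-runs the vision encoder. Here, encode_image is a hypothetical stand-in for whatever encoder the serving stack uses.

python
import hashlib


class MultimodalCache:
    """Reuses encoder outputs for media the system has already seen."""

    def __init__(self, encode_image):
        self._encode_image = encode_image  # hypothetical vision-encoder callable
        self._embeddings = {}              # content hash -> encoder output

    def get_image_embedding(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._embeddings:
            # First sighting of this image: run the expensive encoder once.
            self._embeddings[key] = self._encode_image(image_bytes)
        # Later requests that reference the same image skip the encoder entirely.
        return self._embeddings[key]

Combined with the text prefix cache, a shared system prompt, a repeated image, and a common question stem can all be reused across requests instead of being recomputed.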

Advantages of Multimodal Friendli TCache

With multimodal caching, Friendli TCache brings substantial performance improvements to production AI workloads.

🌐 Broader Applicability

Friendli TCache supports a wide array of use cases—from image captioning and visual question answering (VQA) to multimodal agents and search. It brings scalable efficiency to any system where repeated visual or structured inputs are part of the context.

💡 Use Cases

  • Medical chatbots that analyze patient scans (e.g., X-rays, CTs, MRIs): The same image is often referenced across multiple questions in a session—caching the image eliminates repeated re-encoding, significantly speeding up the interaction.
  • Retail or e-commerce assistants that repeatedly reference the same product images or catalogs: Caching visual context once allows the model to respond faster and more cost-effectively to multiple customer queries.
  • Video analysis systems used in sports, security, or education: When multiple questions are asked about the same video clip, cached frames eliminate the need to reprocess visual data on each request.
  • Enterprise document QA over slide decks, technical diagrams, or scanned forms: Frequently referenced visuals (e.g., charts, architecture diagrams) can be cached once and reused across many queries or sessions.

In all these cases, Friendli TCache dramatically reduces redundant computation by caching and reusing large static, high-dimensional inputs—unlocking new levels of efficiency and responsiveness.

Conclusion

Multimodal AI workloads are no longer rare—they’re the new norm. From image-based Q&A to video analysis and multimodal agents, AI systems must process diverse data types with speed and efficiency.

Friendli TCache is a flexible, extensible caching solution built for this new era. By enabling the reuse of encoded representations—including images, video frames, and other non-text inputs—it significantly reduces inference time, lowers compute costs, and improves system throughput, marking a pivotal advancement in building scalable, high-performance multimodal AI systems.

Leveraging groundbreaking innovations like Iteration Batching and Friendli TCache, FriendliAI provides fast, cost-efficient inference serving and fine-tuning to accelerate agentic AI and custom generative AI solutions. As the only provider supporting over 360K models on Hugging Face, FriendliAI offers unmatched model coverage for developers. Enjoy the GPU-optimized, blazingly fast Friendli Inference through FriendliAI's Dedicated Endpoints, Serverless Endpoints, and Container solutions—all seamlessly integrated within Friendli Suite. Learn more at https://friendli.ai/.

Try it out now on our Playground and experience the future of AI.


Written by

FriendliAI Tech & Research

