  • August 6, 2024
  • 6 min read

Retrieval Augmented Generation (RAG) with MongoDB and FriendliAI


MongoDB is partnering with FriendliAI to make differentiated AI agent deployment with retrieval-augmented generation (RAG) more accessible. MongoDB's scalable databases, equipped with native vector search capabilities, are ideal for meeting the high volume, low latency, and fresh data needs of RAG and compound generative AI applications.

Did you know that RAG tackles issues such as "hallucinations" and "outdated information" in pre-trained large language models (LLMs)? Unlike traditional methods that require retraining or fine-tuning, RAG operates exclusively at inference time. This makes latency, reliability, and cost-efficiency crucial factors for LLM inferencing in real-time RAG applications.

FriendliAI brings expertise in cost-efficient LLM inferencing, reducing costs by 50-90%. We ensure that deployments of customized RAG applications remain economically sustainable. Our service integrates seamlessly with LangChain and LlamaIndex, enabling the creation of sophisticated RAG pipelines with minimal complexity and at lower cost.

Additionally, Friendli Tools function calling can be used with MongoDB Atlas Vector Search to build AI agents equipped with retriever tools. Combined with MongoDB databases for storing long-term conversational history, these agents can intelligently call retriever tools that use vector searches to access relevant information.
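As a rough sketch of that pattern, the snippet below lets a Friendli-served model decide when to call a retriever tool and then answer with the retrieved data. The base URL, model name, and the `search_catalog` helper (imagined here as a thin wrapper over an Atlas Vector Search query) are illustrative assumptions, not exact FriendliAI or MongoDB APIs.

```python
# Sketch: a tool-calling agent with a retriever tool, via an OpenAI-compatible
# chat completions API. The base_url, model name, and search_catalog() stub are
# illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed Friendli endpoint
    api_key="YOUR_FRIENDLI_TOKEN",
)
MODEL = "meta-llama-3.1-8b-instruct"  # placeholder model name

def search_catalog(query: str) -> list[dict]:
    """Stub retriever: in a real agent this would run an Atlas Vector Search query."""
    return [{"item": "blue denim jacket", "size": "M", "in_stock": 5}]

tools = [{
    "type": "function",
    "function": {
        "name": "search_catalog",
        "description": "Look up current product availability in the catalog",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Do you have the blue denim jacket in size M?"}]
first = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]  # the model chose to call the retriever

# Run the tool, feed the result back, and let the model answer with fresh data.
result = search_catalog(**json.loads(call.function.arguments))
messages += [
    first.choices[0].message,
    {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
]
final = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
print(final.choices[0].message.content)
```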

Jointly, MongoDB and FriendliAI offer solutions for developers looking to leverage open-source or custom models and combine them with proprietary data—all with unparalleled speed, cost-efficiency, and security. In the following sections of this blog, we will delve deeper into one of the most exciting collaborations: RAG.

Retrieval-Augmented Generation (RAG) in Action

Imagine an online clothing retailer using customer service agents to guide shoppers through its catalog. The inventory is frequently updated, with items selling out and being restocked, and seasonal changes introducing new collections. In such a fast-changing environment, the agent needs real-time access to the latest catalog information.

This is where RAG takes action. By accessing an up-to-date database at inference time, the agent can provide accurate and timely information to customers, enhancing their shopping experience.

For example, without RAG, a conversation might go like this:

  • User Query: "Do you have the blue denim jacket in size M?"
  • AI Agent Answer: "We usually carry blue denim jackets in various sizes. Would you like to add it to your cart?"

With RAG, the answer is much more helpful:

  • User Query: "Do you have the blue denim jacket in size M?"
  • AI Agent Answer: “Yes, we have the blue denim jacket in size M in stock. There are 5 items available, and 3 customers are currently viewing this product. It's a best seller from our new Spring/Summer collection. Would you like to add it to your cart?”

RAG improves the efficiency of integrating external data into LLM workflows, making it easier to incorporate time-sensitive and context-specific information without the overhead of frequent retraining. By adopting RAG, organizations can ensure their AI applications remain adaptable to rapidly changing data environments.

Similarly, a personal medical agent could use RAG to access and reference a patient’s medical history, offering more accurate and personalized healthcare recommendations.

For example, without RAG:

  • User Query: "Can I take ibuprofen with my current medication?"
  • AI Agent Answer: "It's generally safe to take ibuprofen, but please consult with your healthcare provider."

With RAG:

  • User Query: "Can I take ibuprofen with my current medication?"
  • AI Agent Answer: “Based on your current medications—such as Lisinopril for hypertension and Metformin for type 2 diabetes—it's generally safe to take ibuprofen. However, if you notice any stomach pain or other side effects, please contact your healthcare provider.”

By leveraging RAG, medical agents can provide personalized advice that considers relevant patient data. Using personal information in training data to fine-tune LLMs can lead to private data leakage and potential security risks. Therefore, RAG stands out as a safer option for such scenarios. The next section will discuss the advantages of RAG and fine-tuning, providing guidance on how to use each solution.

Fine-tuning and Retrieval-Augmented Generation (RAG)

Fine-tuning and RAG provide distinct approaches to enhancing LLM inference with additional data. While both methods improve LLM responses by incorporating extra information, they fundamentally differ in their approach and use cases. Asking "Where does this information go?" is key to distinguishing between RAG and fine-tuning.

RAG implementations consist of 'parametric memory' (LLMs) and 'non-parametric memory' (vector indexes). Non-parametric memory is where the fast-changing data is stored and searched for. A benefit of RAG is its ability to reference up-to-date information from vector indexes during inference. For example, it can retrieve information on what medication a patient is currently on.

Fine-tuning, on the other hand, involves adjusting the model's weights or related parameters. Unlike RAG, fine-tuning does not rely on components like vector indexes. Instead, it alters the model by implicitly incorporating the new data into its parameters. This method can deeply integrate specialized domain knowledge, such as medical texts, into general models, customizing the overall knowledge base and behavior.

In conclusion, fine-tuning is a static process that happens during the training phase, while RAG references new data at inference time. Fine-tuning suits lasting changes to a model's knowledge and behavior, while RAG suits per-request, fast-changing information. The two can also be combined: for example, using RAG to retrieve a patient's current private information and augment the responses of a medical LLM fine-tuned on medical texts yields a capable and efficient medical agent. Careful consideration of what happens during inference and which components need orchestration is essential for effectively utilizing RAG with fine-tuned models.

Components of Retrieval-Augmented Generation (RAG)

The retrieval-augmented generation process of a shopping agent begins by fetching relevant product information from a vector-search database containing the latest catalog. This retrieved data (e.g., "five denim jackets in size M are left in stock") is then augmented into the LLM prompt, guiding the LLM to reference this information when responding to queries.

Two main components are needed to bring RAG agents to life: the retriever and the generator. The retriever uses methods like dense passage retrieval to identify relevant documents by matching a query with a document index. The generator then combines the retrieved content with the user prompt to produce a new, informed response.
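To make the split concrete, here is a minimal sketch of the retriever/generator flow. The `retriever` and `llm` arguments stand in for an Atlas Vector Search retriever and a Friendli-served model, both assumed to follow LangChain's `.invoke()` interface.

```python
# Minimal sketch of the retrieve-augment-generate flow. "retriever" and "llm"
# are placeholders for an Atlas Vector Search retriever and a Friendli-served
# model exposing a LangChain-style .invoke() method.
def answer_query(query: str, retriever, llm) -> str:
    # Retriever: match the query against the document index and fetch relevant passages.
    docs = retriever.invoke(query)
    context = "\n".join(doc.page_content for doc in docs)

    # Generator: augment the user prompt with the retrieved content, then generate.
    prompt = (
        "Use only the catalog information below to answer the customer.\n"
        f"Catalog:\n{context}\n\n"
        f"Customer question: {query}"
    )
    return llm.invoke(prompt).content
```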

The retrieval system finds pertinent information from databases, leveraging vector search for precise augmentation of LLM responses. MongoDB Atlas, with its flexible document model and native vector search, excels in this role. As a developer data platform, it enables easy creation of AI-powered experiences by simply adding vector data to existing collections.
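As a rough illustration of that retrieval step, the snippet below runs a similarity query against an existing Atlas collection using the `$vectorSearch` aggregation stage. The connection string, database, collection, field names, and index name are placeholders you would replace with your own.

```python
# Sketch: similarity search over an existing Atlas collection with $vectorSearch.
# Connection string, names, and the "vector_index" index are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
collection = client["store"]["catalog"]

def vector_search(query_embedding: list[float], k: int = 3) -> list[dict]:
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",        # Atlas Vector Search index on the collection
                "path": "embedding",            # field holding the document embeddings
                "queryVector": query_embedding, # embedding of the user query
                "numCandidates": 100,
                "limit": k,
            }
        },
        {"$project": {"_id": 0, "name": 1, "size": 1, "in_stock": 1}},
    ]
    return list(collection.aggregate(pipeline))
```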

Friendli Dedicated Endpoints offer a customizable solution for the generator component, ensuring fast, reliable, and cost-effective inferencing on fine-tuned models with cost reductions of 50-90%. Our integration with LangChain and LlamaIndex facilitates the deployment of RAG applications, enhancing the overall effectiveness of the RAG system. Explore more LLM inference solutions within the Friendli Suite, including Friendli Container, Friendli Dedicated Endpoints, and Friendli Serverless Endpoints.
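For illustration, here is a minimal call to a Friendli-served model through the LangChain integration; the model name is a placeholder, and the example assumes the `langchain-community` and `friendli-client` packages are installed and that a `FRIENDLI_TOKEN` environment variable is set.

```python
# Minimal sketch: generate with a Friendli endpoint via LangChain.
# Assumes `pip install langchain-community friendli-client` and FRIENDLI_TOKEN
# in the environment; the model name below is a placeholder.
from langchain_community.chat_models.friendli import ChatFriendli

llm = ChatFriendli(model="meta-llama-3-8b-instruct")  # or your Dedicated Endpoint ID
print(llm.invoke("In one sentence, what does Retrieval-Augmented Generation do?").content)
```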

Getting Started with Friendli and MongoDB Atlas

To help you get started, please refer to the Building a RAG Chatbot with Friendli, MongoDB Atlas, and LangChain tutorial, which shows you how to build a chatbot that answers questions about the contents of a PDF document. It involves the following pieces (a condensed sketch follows the list below):

  • MongoDB Atlas database, which indexes a PDF document using embeddings. (Vector Store)
  • MongoDB Atlas Vector Search, which answers user queries by converting each query to an embedding and fetching the corresponding information. (Retrieval Engine)
  • Friendli Suite, which uses a Llama model to generate the context-aware answer. (LLM and Inference)
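The condensed sketch below wires these pieces together with LangChain. The connection string, namespace, index name, and the embedding and chat models are placeholder assumptions for illustration, not the exact code from the tutorial.

```python
# Condensed sketch of the pipeline: Atlas stores embedded PDF chunks, Atlas
# Vector Search retrieves them, and a Friendli-served Llama model generates
# the answer. Names, models, and the connection string are placeholders.
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_community.chat_models.friendli import ChatFriendli
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Vector store over an existing Atlas collection of embedded PDF chunks.
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    "mongodb+srv://<user>:<password>@cluster0.example.mongodb.net",
    namespace="rag_db.pdf_chunks",  # database.collection
    embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    index_name="vector_index",      # Atlas Vector Search index
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

llm = ChatFriendli(model="meta-llama-3-8b-instruct")  # reads FRIENDLI_TOKEN from env

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Retrieve, augment the prompt with the retrieved chunks, then generate.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(chain.invoke("What does the PDF say about return policies?"))
```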

We can also help you design the best AI agents for your organization’s needs. Feel free to contact us here to schedule a collaborative session and explore how FriendliAI and MongoDB can optimize your AI deployment process.


Written by


FriendliAI Tech & Research

