October 27, 2023

RAG AI Agent with LangChain—Query Your Internal Documents

Retrieval-Augmented Generation (RAG) has become a widely adopted technique for grounding Large Language Models (LLMs) in external knowledge. In rapidly changing domains, however, keeping external knowledge sources up to date and synchronized with real-world changes can be a significant challenge. In this article, we explore a practical approach to this challenge, illustrated with an example RAG-powered application built with Friendli Inference and LangChain.

Example Application: Chat Docs

Let’s look at an example application to see how a RAG-based application can keep its data up to date. The application, called “Chat Docs”, lets you find information in a documentation site by conversing with an online chatbot, much as you would with ChatGPT. The chatbot is continuously updated with the latest content of the documentation site, such as https://friendli.ai/docs.

Ensuring Up-to-Date Information Retrieval

To ensure the chatbot always delivers the most current and accurate information, we employ a vector DB responsible for storing document embeddings. This vector DB must be updated in synchronization with any changes made to the source documents. This synchronization process is executed through a CI/CD pipeline.
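The article does not say whether the synchronization rebuilds the whole index or only the parts that changed. As a minimal sketch of one possible approach, the code below compares content hashes stored alongside the index with the current documents and decides which documents to re-embed and which to delete. The names `content_hash` and `plan_sync` are hypothetical and are not part of the Chat Docs code.

```python
from __future__ import annotations

import hashlib


def content_hash(text: str) -> str:
    """Return a stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(
    indexed_hashes: dict[str, str],
    current_docs: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Decide which documents need re-embedding and which should be removed.

    indexed_hashes maps a document path to the hash stored with the index
    (an assumed bookkeeping scheme). current_docs maps a document path to
    the document's current text.
    """
    # New or modified documents: the path is missing or its hash differs.
    to_upsert = sorted(
        path
        for path, text in current_docs.items()
        if indexed_hashes.get(path) != content_hash(text)
    )
    # Documents that are still indexed but no longer exist in the source.
    to_delete = sorted(path for path in indexed_hashes if path not in current_docs)
    return to_upsert, to_delete
```

With a plan like this, only the changed documents need new embeddings, which keeps each synchronization run proportional to the size of the change rather than the size of the documentation.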

Figure: Overall Chat Docs system architecture and workflow.

The figure above depicts the overall system architecture and workflow. The core components of the system are listed below. Each of them can be replaced with an alternative that plays the same role.

Git repo (GitHub): GitHub manages the contents of the documentation.
CI/CD platform (GitHub Actions): GitHub Actions manages the CD pipeline.
Storage service (S3): S3 stores the package of the documentation website.
Vector DB (OpenSearch): OpenSearch stores the embeddings of documents and retrieves the most relevant documents by k-Nearest Neighbor (k-NN) search.
Chatbot server: This simple chatbot server provides the following two REST API endpoints.
- POST /db/sync : Triggers the synchronization process between the vector DB and the documents in the storage service.
- GET /completions : Streams out the answer in response to the question as server-sent events.
LLM inference (Friendli Inference): Friendli Inference, a generative AI serving engine, generates the answer. FriendliAI positions it as faster than other LLM inference services while using fewer GPUs.
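To make the k-NN retrieval step in the list above concrete, here is a minimal brute-force sketch in plain Python. It is illustrative only: OpenSearch typically answers k-NN queries with approximate index structures and a configurable distance metric, whereas this sketch scans every vector and assumes cosine similarity. The function names and the tiny two-dimensional vectors are hypothetical.

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_search(
    query: list[float],
    index: dict[str, list[float]],
    k: int = 2,
) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


if __name__ == "__main__":
    toy_index = {"install": [1.0, 0.0], "pricing": [0.0, 1.0], "quickstart": [0.9, 0.1]}
    print(knn_search([1.0, 0.0], toy_index, k=2))
```

In the real system the vectors would be document embeddings with hundreds or thousands of dimensions, and the search would be delegated to the vector DB rather than computed in the chatbot server.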

Within this system, S3, OpenSearch, and Friendli Inference are all accessed through LangChain integrations. The chatbot server uses the LangChain interface to interact with these components in just a few lines of code. See the article “LangChain Integration with Friendli Inference” to find out how Friendli Inference is integrated with LangChain.

Now let’s look at how it works. The source of our documentation site resides in a GitHub repository named “docs”. When the documentation needs updating, changes are pushed to this repository. GitHub Actions then builds a new package from the source and uploads it to AWS S3. Once the upload to S3 completes, GitHub Actions sends a webhook request (POST /db/sync) to the chatbot server to trigger synchronization. In response, the chatbot server retrieves the updated HTML files from S3 and generates document embeddings from them. The server then updates the OpenSearch index with these embeddings so that the index reflects the latest content of the documentation. Note that the interaction with S3 and OpenSearch goes through the LangChain interface.
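Before HTML files can be embedded, their visible text has to be separated from markup. The article does not show how Chat Docs does this (it relies on LangChain integrations), so the following is only a standard-library sketch of that intermediate step, with a hypothetical helper named `html_to_text`. It drops the contents of `script` and `style` elements and joins the remaining text fragments with single spaces.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text while skipping script and style contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return " ".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<html><body><h1>Quantization</h1><p>Use the convert command.</p></body></html>"
    print(html_to_text(page))
```

The extracted text would then be passed to an embedding model, and the resulting vectors written to the vector DB index.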

Streaming RAG-Powered Outputs

After this synchronization, the chatbot server can retrieve the most up-to-date information from the documentation. When a user asks the chatbot a question (GET /completions), the chatbot server creates an embedding of the question and queries OpenSearch for the most relevant documents. The retrieved documents are inserted into the input prompt, which lets the chatbot generate answers grounded in the current documentation. Generation is fast because the LLM runs on Friendli Inference.
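The last two steps of this flow, building a prompt from the retrieved documents and streaming the answer as server-sent events, can be sketched as follows. The prompt wording and the function names `build_prompt` and `to_sse_events` are assumptions made for illustration; the article does not show the template or server code that Chat Docs actually uses.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def build_prompt(question: str, documents: list[str]) -> str:
    """Combine retrieved documents and the user question into one prompt."""
    # Number each retrieved document so the context stays readable.
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def to_sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each generated text chunk as one server-sent event.

    An event is a group of "data:" lines terminated by a blank line. A chunk
    that contains newlines is split across several "data:" lines, which the
    client joins back together with newline characters.
    """
    for token in tokens:
        data_lines = "".join(f"data: {line}\n" for line in token.split("\n"))
        yield data_lines + "\n"


if __name__ == "__main__":
    prompt = build_prompt("How do I convert a checkpoint?", ["Doc one.", "Doc two."])
    print(prompt)
    for event in to_sse_events(["Use", " the", " CLI."]):
        print(repr(event))
```

A web framework would send these event strings over a response with the `text/event-stream` content type, so the client can render the answer as it is generated.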

Now let’s check how the chatbot server actually answers a question. When I asked “How can I convert a model checkpoint with applying quantization?” before the quantization section was added to the documentation, the chatbot could not give the right answer:

friendli-client does not have built-in support for serving quantized models. You would need to use a third-party library or module to interface with quantization. Please let me know if you have any other questions!

However, when I asked the same question after the quantization section was added, the chatbot answered:

To convert a huggingface model checkpoints with applying quantization using friendli-client, you must use the `friendli checkpoint convert` command with the `--quant-config-file` option. This will allow you to specify the quantize configuration file and enable quantization during the conversion process. Thanks for asking!

To sum up, a RAG-powered application becomes more reliable and informative when its retrieval source is continuously updated through the CI/CD pipeline. Friendli Inference can help your application run LLMs at high speed and low cost. Try out Friendli Inference today!

Written by

FriendliAI Tech & Research
