Chat Docs: A RAG Application with Friendli Engine and LangChain

In today’s ever-evolving landscape of AI, Retrieval-Augmented Generation (RAG) has emerged as a widely adopted technique for anchoring Large Language Models (LLMs) in external knowledge. However, in rapidly changing domains, keeping external knowledge sources up-to-date and synchronized with real-world changes can be a significant challenge. In this article, we explore a practical approach to deal with this challenge, offering insights with an exemplary RAG-powered application built using Friendli Engine and LangChain.

Example Application: Chat Docs

Let’s see how to ensure that a RAG-based application consistently maintains up-to-date data by looking into an example application named “Chat Docs”. This application lets you effortlessly access information from documentation by conversing with an online chatbot, akin to ChatGPT. The chatbot is continuously updated with the latest content from the documentation site, such as https://docs.friendli.ai.

Ensuring Up-to-Date Information Retrieval

To ensure the chatbot always delivers the most current and accurate information, we employ a vector DB responsible for storing document embeddings. This vector DB must be updated in synchronization with any changes made to the source documents. This synchronization process is executed through a CI/CD pipeline.
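For instance, the final step of such a pipeline could simply call a synchronization endpoint on the chatbot server once the new documentation package has been uploaded. Here is a minimal sketch in Python; the server URL is a hypothetical placeholder and the endpoint is described in the next section.

```python
# Hypothetical final CI/CD step: notify the chatbot server that the
# documentation package in storage has changed. The URL is an assumption.
import requests

resp = requests.post("https://chat-docs.example.com/db/sync", timeout=30)
resp.raise_for_status()  # fail the pipeline if synchronization could not be triggered
```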

Figure: Overall Chat Docs system architecture and workflow

The figure above depicts the overall system architecture and workflow. The core components of the system are listed below; note that each component can be substituted with numerous alternatives.

  • Git repo (GitHub): GitHub manages the contents of the documentation.
  • CI/CD platform (GitHub Actions): GitHub Actions manages the CD pipeline.
  • Storage service (S3): S3 stores the package of the documentation website.
  • Vector DB (OpenSearch): OpenSearch stores the embeddings of documents and retrieves the most relevant documents by k-Nearest Neighbor (k-NN) search.
  • Chatbot server: This simple chatbot server provides the following two REST API endpoints (a skeleton is sketched after this list).
    • POST /db/sync : Triggers the synchronization process between the vector DB and the documents in the storage service.
    • GET /completions : Streams out the answer in response to the question as server-sent events.
  • LLM inference (Friendli Engine): Friendli Engine, the fastest generative AI serving engine available, generates the answer faster than other LLM inference services while using fewer GPUs.
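To make the server side concrete, here is a minimal skeleton of the chatbot server with these two endpoints, assuming FastAPI. The two helper functions are placeholders; concrete versions are sketched later in this article.

```python
# Minimal sketch of the chatbot server, assuming FastAPI.
# sync_vector_db() and answer_stream() are placeholders that are
# fleshed out in the sketches later in this article.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def sync_vector_db() -> None:
    """Rebuild the vector DB from the documentation in storage (sketched below)."""

def answer_stream(question: str):
    """Yield server-sent events carrying the generated answer (sketched below)."""
    yield "data: ...\n\n"

@app.post("/db/sync")
def db_sync():
    # Triggered by the CI/CD pipeline after the documentation package changes.
    sync_vector_db()
    return {"status": "ok"}

@app.get("/completions")
def completions(question: str):
    # Stream the RAG answer back to the client as server-sent events.
    return StreamingResponse(answer_stream(question), media_type="text/event-stream")
```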

Within this system, S3, OpenSearch, and Friendli Engine are integrated with the LangChain interface. The chatbot server utilizes the interface to interact with these integrated components using just a few lines of code. See the article “LangChain Integration with PeriFlow” to find out how Friendli Engine is integrated with LangChain.
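As a rough illustration of that wiring, the sketch below constructs the three integrated components through LangChain’s community integrations. The bucket name, prefix, OpenSearch URL, index name, embedding model, and LLM model name are all assumptions, and the `Friendli` class name follows the current langchain_community integration rather than the exact code behind this application.

```python
# A hedged sketch of wiring S3, OpenSearch, and Friendli Engine via LangChain.
# Bucket, prefix, URL, index name, embedding model, and LLM model are assumptions.
from langchain_community.document_loaders import S3DirectoryLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Friendli  # Friendli Engine integration
from langchain_community.vectorstores import OpenSearchVectorSearch

loader = S3DirectoryLoader(bucket="docs-website-bucket", prefix="html/")
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_db = OpenSearchVectorSearch(
    opensearch_url="https://opensearch.example.com:9200",
    index_name="chat-docs",
    embedding_function=embeddings,
)
llm = Friendli(model="mixtral-8x7b-instruct-v0-1")  # reads FRIENDLI_TOKEN from the env
```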

Now let’s look into how it works. The source of our documentation page resides in a GitHub repository named “docs”. When updates are required, modifications are pushed to this repository. GitHub Actions then builds a new package from the source and uploads the changes to AWS S3. As soon as these changes land in S3, GitHub Actions dispatches a webhook (POST /db/sync) to the chatbot server to trigger synchronization. In response, the chatbot server retrieves the updated HTML files from S3 and generates document embeddings from them. The server then updates the OpenSearch index with these embeddings so that it reflects the latest version of the documentation. Note that the interaction with S3 and OpenSearch is achieved through the LangChain interface.
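The synchronization step itself could look roughly like the following, reusing the components constructed in the previous sketch; the chunk sizes and the strategy for evicting stale entries are assumptions.

```python
# Hedged sketch of the POST /db/sync handler body: load the updated HTML pages
# from S3, split them into chunks, embed them, and index them in OpenSearch.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def sync_vector_db() -> None:
    documents = loader.load()  # updated HTML files from the S3 bucket
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(documents)
    # Embed and index the chunks; how stale entries are removed (e.g. by
    # rebuilding the index) is omitted from this sketch.
    vector_db.add_documents(chunks)
```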

Streaming RAG-Powered Outputs

Following the aforementioned synchronization process, the chatbot server becomes adept at retrieving the most up-to-date information from the documentation. When a user asks the chatbot a question (GET /completions), the chatbot server creates an embedding of the question and sends a query to OpenSearch to retrieve the most relevant documents. The retrieved documents are used to construct the input prompt, which in turn enables the chatbot to generate responses that are always up-to-date and highly informative. Generation is fast because the LLM runs efficiently on Friendli Engine.
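A hedged sketch of that flow, again reusing the components above; the prompt template and the number of retrieved documents are assumptions.

```python
# Hedged sketch of the GET /completions handler body: retrieve relevant chunks
# with a k-NN search, build a prompt, and stream tokens as server-sent events.
from typing import Iterator

PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)

def answer_stream(question: str) -> Iterator[str]:
    # similarity_search embeds the question and runs k-NN retrieval in OpenSearch.
    docs = vector_db.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    for chunk in llm.stream(prompt):  # token-level streaming from Friendli Engine
        yield f"data: {chunk}\n\n"    # SSE framing
```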

Now let’s check how the chatbot server actually answers questions. When I asked “How can I convert a model checkpoint with applying quantization?” before the quantization section was added to the documentation, the chatbot said it did not know the answer:

```plain
friendli-client does not have built-in support for serving quantized models. You would need to use a third-party library or module to interface with quantization. Please let me know if you have any other questions!
```

However, when I asked the same question after the section was added, the chatbot answered:

```plain
To convert a huggingface model checkpoints with applying quantization using friendli-client, you must use the `friendli checkpoint convert` command with the `--quant-config-file` option. This will allow you to specify the quantize configuration file and enable quantization during the conversion process. Thanks for asking!
```

To sum up, your RAG-powered application can become more reliable and informative by continuously updating its retrieval source through the CI/CD pipeline. Friendli Engine can help your application run LLMs at high speed and low cost. Try out Friendli Engine today!


