Chat Docs: A RAG Application with PeriFlow and LangChain


Retrieval-Augmented Generation (RAG) has become a widely adopted technique for grounding Large Language Models (LLMs) in external knowledge. In rapidly changing domains, however, keeping the external knowledge source up to date and synchronized with real-world changes is a significant challenge. In this article, we explore a practical approach to this challenge, using an example RAG-powered application built with FriendliAI PeriFlow and LangChain.

Example Application: Chat Docs

Let’s see how a RAG-based application can keep its data up to date by looking at an example application called “Chat Docs”. Chat Docs lets you look up information in the documentation by conversing with an online chatbot, much like ChatGPT. The chatbot is continuously updated with the latest information from the documentation page.

Ensuring Up-to-Date Information Retrieval

To ensure the chatbot always delivers the most current and accurate information, we employ a vector DB responsible for storing document embeddings. This vector DB must be updated in synchronization with any changes made to the source documents. This synchronization process is executed through a CI/CD pipeline.

The figure above depicts the overall system architecture and workflow. The following are the core components of the system. Note that each component can be replaced with one of many alternatives.

  • Git repo (GitHub): GitHub manages the contents of the documentation.
  • CI/CD platform (GitHub Actions): GitHub Actions manages the CI/CD pipeline.
  • Storage service (S3): S3 stores the package of the documentation website.
  • Vector DB (OpenSearch): OpenSearch stores the document embeddings and retrieves the most relevant documents by k-Nearest Neighbor (k-NN) search.
  • Chatbot server: This simple chatbot server provides the following two REST API endpoints.
    • POST /db/sync : Triggers the synchronization process between the vector DB and the documents in the storage service.
    • GET /completions : Streams the answer to a question as server-sent events.
  • LLM inference (PeriFlow): PeriFlow, FriendliAI’s generative AI serving engine, generates the answer. It is built to serve LLMs faster and with fewer GPUs than other LLM inference services.
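
To make the streaming endpoint more concrete, here is a minimal sketch of how generated tokens could be framed as server-sent events. It only illustrates the text/event-stream wire format; the function names `format_sse` and `stream_answer` and the `[DONE]` end marker are assumptions for illustration, not the actual Chat Docs server code.

```python
from typing import Iterable, Iterator, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Frame one message in the text/event-stream wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A payload containing newlines is sent as one "data:" field per line.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def stream_answer(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token, then a hypothetical end-of-stream event."""
    for token in tokens:
        yield format_sse(token)
    yield format_sse("[DONE]", event="end")  # assumed marker, not from the article


if __name__ == "__main__":
    for frame in stream_answer(["Use ", "pf ", "checkpoint ", "convert."]):
        print(frame, end="")
```

A web framework would return these frames in a response with the `text/event-stream` content type, so the browser can render the answer as it is generated.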

Within this system, S3, OpenSearch, and PeriFlow are integrated with the LangChain interface. The chatbot server utilizes the interface to interact with these integrated components using just a few lines of code. See the article “LangChain Integration with PeriFlow” to find out how PeriFlow is integrated with LangChain.

Now let’s look at how it works. The source of our documentation page resides in a GitHub repository named “docs”. When updates are required, modifications are pushed to this repository. GitHub Actions then builds a new package from the source and uploads the changes to AWS S3. As soon as the changes are committed to S3, GitHub Actions dispatches a webhook (POST /db/sync) to the chatbot server, triggering synchronization. In response to this request, the chatbot server retrieves the updated HTML files from S3 and generates document embeddings from them. The server then updates the OpenSearch index with these embeddings, so that the index reflects the latest content of the updated documentation. Note that the interaction with S3 and OpenSearch goes through the LangChain interface.
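
The synchronization step can be sketched without any external services. The example below extracts text from HTML pages and uses a content hash to decide which pages need to be re-embedded and which stale entries should be removed. The hash comparison is an assumption added here as one way to avoid re-embedding unchanged pages; the article does not say whether Chat Docs does this. The names `html_to_text` and `plan_sync` are hypothetical, and plain dictionaries stand in for S3 and the OpenSearch index.

```python
import hashlib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce an HTML page to its visible text, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def plan_sync(pages: dict, indexed_hashes: dict):
    """Compare freshly built pages with what the index currently holds.

    pages: path -> HTML fetched from the storage service.
    indexed_hashes: path -> content hash recorded at the last sync.
    Returns (to_upsert, to_delete); to_upsert maps path -> (text, hash).
    """
    to_upsert = {}
    for path, html in pages.items():
        text = html_to_text(html)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if indexed_hashes.get(path) != digest:
            to_upsert[path] = (text, digest)
    # Pages that disappeared from storage should leave the index too.
    to_delete = sorted(set(indexed_hashes) - set(pages))
    return to_upsert, to_delete
```

In a real deployment, the texts in `to_upsert` would be embedded and written to the vector DB, and the paths in `to_delete` would be removed from it.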

Streaming RAG-Powered Outputs

Once synchronization completes, the chatbot server can retrieve the most current information from the documentation. When a user asks the chatbot a question (GET /completions), the chatbot server creates an embedding of the question and sends a query to OpenSearch to retrieve the most relevant documents. The retrieved documents are used to construct the input prompt, which lets the chatbot generate responses that reflect the latest documentation. Generation is fast because the LLM is served by PeriFlow.
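
The retrieve-then-prompt flow described above can be illustrated with a small, self-contained sketch. Everything here is a stand-in: `embed` is a toy bag-of-words function rather than a real embedding model, `top_k` is a brute-force cosine-similarity search rather than an OpenSearch k-NN query, and the prompt wording in `build_prompt` is an assumption, not the prompt Chat Docs uses.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, docs: list, k: int = 2) -> list:
    """Brute-force k-NN: rank documents by similarity to the question."""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_docs: list) -> str:
    """Place the retrieved documents in front of the question."""
    context = "\n\n".join(context_docs)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


if __name__ == "__main__":
    docs = [
        "convert a checkpoint with quantization",
        "install the client with pip",
        "stream completions as events",
    ]
    question = "how to convert checkpoint with quantization"
    print(build_prompt(question, top_k(question, docs, k=1)))
```

Because the prompt is rebuilt from the index on every request, an answer changes as soon as the index does, without retraining or redeploying the model.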

Now let’s check how the chatbot server actually answers a question. When I asked “How can I convert a model checkpoint with applying quantization?” before the quantization section was added to the documentation, the chatbot could not provide the answer:

periflow-client does not have built-in support for serving quanitzed model. You would need to use a third-party library or module to interface with quantization. Please let me know if you have any other questions!

However, when I asked the same question after the section was added, the chatbot answered:

To convert a huggingface model checkpoints with applying quantization using periflow-client, you must use the `pf checkpoint convert` command with the `--quant-config-file` option. This will allow you to specify the quantize configuration file and enable quantization during the conversion process. Thanks for asking!

To sum up, a RAG-powered application becomes more reliable and informative when its retrieval source is continuously updated through a CI/CD pipeline. FriendliAI’s PeriFlow can help your application run LLMs at high speed and low cost. Try out PeriFlow today!

