May 3, 2024

Building a RAG Chatbot with Friendli, MongoDB Atlas, and LangChain

Recently, LangChain introduced support for Friendli as an LLM inference serving engine. This integration allows you to leverage Friendli Engine’s blazing-fast performance and cost-efficiency for your RAG (Retrieval-Augmented Generation) pipelines.

In this guide, we will build a simple RAG-based chatbot that answers questions about the contents of a PDF document. The tutorial uses Friendli Serverless Endpoints for LLM inference, MongoDB Atlas for the vector store, and OpenAI embeddings to vectorize the document.

Dependencies

First, let’s install the required packages:

bash
pip install langchain langchain-community langchain-mongodb friendli-client pypdf pymongo langchain-openai tiktoken
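
The later steps call two hosted APIs, so it can help to confirm that credentials are available before running anything. The sketch below assumes the LangChain integrations read FRIENDLI_TOKEN and OPENAI_API_KEY from the environment; the helper name missing_env_vars is hypothetical and only illustrates the check.

```python
import os

# Assumed variable names: the integrations used later are expected to read
# these from the environment. Adjust if you pass keys explicitly instead.
REQUIRED_ENV_VARS = ("FRIENDLI_TOKEN", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_ENV_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these before continuing:", ", ".join(missing))
else:
    print("All credentials found.")
```

Passing a plain dict as environ makes the helper easy to try without touching your real environment.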

Setting Up MongoDB Atlas

While you can run MongoDB locally, we will use MongoDB Atlas, a managed service, for this tutorial. Sign up for a MongoDB Cloud account and create a new cluster. After the cluster is set up, create a new database and collection following their guide.

Once the database is set up, check your MongoDB Cloud UI for the database, collection, and index names.

Then, initialize the MongoDB client with the appropriate variables:

python
from pymongo import MongoClient

MONGODB_ATLAS_CLUSTER_URI = "YOUR CLUSTER URI"

client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)

# Fill in your information here
DB_NAME = "my_rag"
COLLECTION_NAME = "test"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "index_name"

MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

Test the connection by running:

python
client.server_info()

Creating a Vector Search Index

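To use MongoDB as a vector store, you need to create a vector search index on the collection. Configure the index as follows. The numDimensions value must match the output size of your embedding model (1536 for the default model used by OpenAIEmbeddings), and path must name the document field that stores the vectors: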

json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
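
To make the "similarity": "cosine" setting concrete, here is a minimal pure-Python sketch of cosine similarity on toy three-dimensional vectors. It is only an illustration of the metric: the vectors are made up, real embeddings have 1536 dimensions, and Atlas reports its own normalized score, so the raw values here are not expected to match what the database returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative only, not real embeddings)
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(query, same_direction))  # close to 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```

Because the metric depends only on direction, scaling a vector does not change its score, which is why the first pair scores near 1.0 even though the magnitudes differ.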

Loading Documents and Embeddings

Now, let’s load a PDF document, split it into chunks, and insert the chunks into MongoDB Atlas together with their embeddings. In our case, we’ll load the BPipe paper from the ICML 2023 conference:

python
from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("https://openreview.net/pdf?id=HVKmLi1iR4")
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)

vector_store = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=MONGODB_COLLECTION,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)
retriever = vector_store.as_retriever()
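
To get a feel for what chunk_size=1000 and chunk_overlap=150 mean, the sketch below splits a string with a plain sliding window. This is a simplified stand-in, not the algorithm RecursiveCharacterTextSplitter uses (which prefers to break on separators such as paragraphs and newlines); the function name sliding_window_chunks is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=150):
    """Split text into fixed-size windows that share chunk_overlap characters.

    Simplified illustration only; the real splitter also respects separators.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])  # [1000, 1000, 800]
```

The overlap means the last 150 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable in one piece.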

Initializing the LLM with Friendli

Next, let’s initialize the LLM with Friendli Serverless Endpoints, using Meta’s Llama 3 70B Instruct model:

python
from langchain_community.chat_models.friendli import ChatFriendli

llm = ChatFriendli(model="meta-llama-3-70b-instruct")

Building the RAG Chain

We have now prepared all the components of our RAG pipeline. Here’s how to ask questions about the PDF file. In our case, we’ll ask what the ‘memory imbalance problem’ is in the context of BPipe.

python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Reuse the vector store created in the previous section
retriever = vector_store.as_retriever()

template = """Use the following pieces of context to answer the question at the end.
If you don’t know the answer, just say that you don’t know, don’t try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say “thanks for asking!” at the end of the answer.

{context}

Question: {question}

Helpful Answer:"""

prompt = PromptTemplate.from_template(template)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is the memory imbalance problem that BPipe solves?")
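
To see roughly what the first two stages of the chain hand to the model, here is a plain-Python sketch that needs no network access. FakeDoc and build_prompt are hypothetical stand-ins introduced for illustration, and the template is shortened; the real chain uses the retriever's Document objects and PromptTemplate.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved LangChain Document."""
    page_content: str


def format_docs(docs):
    # Same joining logic as in the chain: chunks separated by blank lines
    return "\n\n".join(doc.page_content for doc in docs)


# Shortened version of the tutorial's template
TEMPLATE = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}

Helpful Answer:"""


def build_prompt(docs, question):
    """Mimic retrieval formatting followed by template filling."""
    return TEMPLATE.format(context=format_docs(docs), question=question)


docs = [FakeDoc("Chunk one."), FakeDoc("Chunk two.")]
prompt_text = build_prompt(docs, "What is BPipe?")
print(prompt_text)
```

Printing the assembled prompt like this is a quick way to check that the retrieved chunks and the question end up where the template expects them.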

Running rag_chain.invoke(...) returns a response like the following. The RAG-applied model describes the information from the PDF file, even though the paper was not part of the data used to train the original model:

text
The memory imbalance problem refers to a situation where some pipeline stages in a pipeline parallelism setup consume more memory than others, leading to inefficient utilization of GPU memory. This can occur when training large language models (LLMs) using pipeline parallelism, where the model is split into multiple stages, and each stage is processed sequentially on a different GPU.

In a pipeline parallelism setup, each stage typically requires a certain amount of memory to store the model’s parameters, optimizer states, and intermediate activations. However, the memory requirements can vary significantly between stages, depending on the model’s architecture and the specific computations performed at each stage.

The memory imbalance problem arises when some stages require significantly more memory than others, leading to:

1. **Memory underutilization**: GPUs with less memory-intensive stages may not be fully utilized, wasting computational resources.
2. **Memory overflow**: GPUs with more memory-intensive stages may run out of memory, causing the training process to slow down or even fail.

The memory imbalance problem can be exacerbated by the following factors:

* **Model size**: Larger models require more memory, making it more challenging to balance memory usage across pipeline stages.
* **Batch size**: Increasing the batch size can amplify the memory imbalance problem, as more data needs to be stored in memory.
* **Pipeline schedule**: The order in which pipeline stages are executed can affect the memory imbalance, with some schedules leading to more pronounced imbalances than others.

The memory imbalance problem can have significant consequences, including:

* **Slower training times**: Inefficient memory utilization can lead to slower training times, making it more challenging to train large language models.
* **Increased costs**: Underutilized GPUs can result in wasted computational resources, increasing the overall cost of training.

The BPIPE approach, described in the original document, addresses the memory imbalance problem by transferring activations between pipeline stages to balance memory usage and ensure efficient GPU memory utilization.

By following these steps and incorporating the provided code, you’ll be well on your way to implementing RAG in your applications. Remember, this is just a starting point – feel free to experiment and customize the process to suit your specific needs.

Ready to Unleash the Power of Your LLM? Experience Friendli Engine's performance! We offer three options to suit your preferences:

Friendli Container: Deploy the engine on your own infrastructure for ultimate control.
Friendli Dedicated Endpoints: Run any custom generative AI models on dedicated GPU instances in autopilot.
Friendli Serverless Endpoints: No setup required, simply call our APIs and let us handle the rest.

Visit https://friendli.ai/try-friendli/ to begin your journey into the world of high-performance LLM serving with the Friendli Engine!
