- December 18, 2024
- 3 min read
Leveraging Milvus and Friendli Serverless Endpoints for Advanced RAG and Multi-Modal Queries
link to Colab: Leveraging Milvus and Friendli Serverless Endpoints for Advanced RAG and Multi-Modal Queries.ipynb
FriendliAI specializes in generative AI infrastructure, offering solutions that enable organizations to efficiently deploy and manage large language models (LLMs) and other generative AI models with optimized performance and reduced cost. Users can choose between production-ready conventional LLMs accessible through APIs and custom fine-tuned LLMs deployed on the hardware of their choice, whether on the public cloud or on private on-premise clusters.
Milvus is an open-source vector database that stores, indexes, and searches billion-scale unstructured data through high-dimensional vector embeddings. It is perfect for building modern AI applications such as retrieval augmented generation (RAG), semantic search, multimodal search, and recommendation systems.
In this article, we'll explore how to use Milvus with Friendli Serverless Endpoints to perform Retrieval-Augmented Generation (RAG) on particular documents and materials, as well as to execute multi-modal queries that incorporate images and other visual content. This powerful combination allows for more sophisticated and context-aware AI applications.
Understanding RAG and Multi-Modal Models
Retrieval-Augmented Generation (RAG)
RAG is a technique that enhances language models by providing them with relevant information, primarily retrieved from a vector database-powered knowledge base. This approach allows AI models to generate more accurate and contextually appropriate responses by referencing designated external data sources.
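Conceptually, the RAG loop is only a few steps. The sketch below is illustrative only: `retrieve` and `generate` are hypothetical placeholders standing in for the Milvus vector search and the Friendli LLM call implemented later in this article.

```python
def answer_with_rag(question: str, retrieve, generate, top_k: int = 3) -> str:
    """Illustrative RAG loop: retrieve context, augment the prompt, generate."""
    # 1. Retrieve the passages most similar to the question (e.g., a vector database search).
    passages = retrieve(question, top_k)
    context = "\n".join(passages)
    # 2. Augment the prompt with the retrieved context.
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    # 3. Generate an answer grounded in that context (e.g., an LLM chat completion).
    return generate(prompt)
```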
Multi-Modal Models
Multimodal models can process and understand multiple types of input data, such as text, images, and audio. They can analyze and generate responses based on diverse information sources, enabling more comprehensive and nuanced interactions.
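In practice, OpenAI-compatible chat APIs express a multi-modal request by mixing text and image parts within a single message. A minimal sketch of that payload shape (the URL below is a placeholder, not from this tutorial):

```python
# Shape of an OpenAI-style multi-modal chat message: a text part and an image part
# combined in one user turn (placeholder URL for illustration).
multimodal_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What does this diagram show?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
    ],
}
```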
Why Incorporate RAG and Multi-Modal Models Together?
The combination of RAG and multi-modal capabilities significantly improves AI systems by providing the following features simultaneously:
- Accepting more diverse and richer input types of the user’s choice
- Providing up-to-date information
- Enhancing accuracy and relevance of responses
- Enabling context-aware interactions
Hands-On Implementation
Let's dive into the practical implementation of RAG and multi-modal queries using the Milvus vector database and Friendli Serverless Endpoints.
Step 1: Install Prerequisites and Download Milvus Docs
First, we'll install the necessary libraries and download the Milvus documentation that we'll use for our RAG job:
```bash
!pip install --upgrade pymilvus requests tqdm langchain langchain-community langchain-huggingface langchain-openai friendli-client tiktoken
!wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
!rm -rf milvus_docs
!unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs
```
Step 2: Process Documentation Files
Next, we'll read the Milvus documentation files and use a simple splitting strategy, treating each section delimited by a "# " heading marker as an individual chunk:
```python
from glob import glob

text_lines = []

for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as file:
        file_text = file.read()
    text_lines += file_text.split("# ")
```
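As an optional sanity check, you can confirm that the split produced a reasonable number of chunks before embedding them:

```python
# Optional sanity check: how many chunks did we get, and what does one look like?
print(f"{len(text_lines)} chunks loaded")
print(text_lines[0][:200])  # preview the first 200 characters of the first chunk
```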
Step 3: Prepare Embeddings
We'll use the Hugging Face embeddings integration with the lightweight `all-MiniLM-L6-v2` model, which produces 384-dimensional vectors, to create vector representations of our text:
```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding = HuggingFaceEmbeddings(model_name=embeddings_model_name)

test_embedding = embedding.embed_query("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
```
Step 4: Set Up Milvus Client
Now, let's prepare the Milvus client for our RAG implementation. In this simple example, we use Milvus Lite, which runs locally and stores all data in a single local file. You can also consider other Milvus deployment options:
- If you only need a local vector database for small-scale data or prototyping, setting the `uri` to a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
- For larger-scale data and traffic in production, you can set up a Milvus server on Docker or Kubernetes. In this setup, use the server address and port as your `uri`, e.g. `http://localhost:19530`. If you enable the authentication feature on Milvus, set the `token` to `"<your_username>:<your_password>"`; otherwise, there is no need to set the token.
- You can also use fully managed Milvus on Zilliz Cloud. Simply set the `uri` and `token` to the Public Endpoint and API key of your Zilliz Cloud instance.
```python
from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")

collection_name = "my_rag_collection"
```
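For reference, switching to a self-hosted Milvus server or to Zilliz Cloud only changes the client constructor arguments. A hypothetical sketch (the address and token values below are placeholders, not used in this tutorial):

```python
# Hypothetical alternatives to Milvus Lite (placeholder credentials, not used here):
# connect to a self-hosted Milvus server...
server_client = MilvusClient(
    uri="http://localhost:19530",
    token="<your_username>:<your_password>",  # only if authentication is enabled
)
# ...or to a fully managed Zilliz Cloud instance.
cloud_client = MilvusClient(
    uri="<your_zilliz_public_endpoint>",
    token="<your_zilliz_api_key>",
)
```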
Step 5: Create Milvus Collection
We'll create the collection in Milvus, first dropping any existing collection with the same name so we start from a clean state:
```python
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
```
Step 6: Embed and Insert Text into Milvus
Let's embed our text and insert it into the Milvus collection:
```python
from tqdm import tqdm

data = []

for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": embedding.embed_query(line), "text": line})

milvus_client.insert(collection_name=collection_name, data=data)
```
Step 7: Perform RAG Query
Now we can ask a question and search for relevant data within our Milvus database:
```python
question = "How is data stored in milvus?"

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[embedding.embed_query(question)],
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))

context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)
```
Step 8: Create Prompts for RAG
Let's create the system and user prompts for our RAG query:
```python
SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""

USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
```
Step 9: Set Up Friendli Token
Obtain your `FRIENDLI_TOKEN` from the Friendli Suite and set it as an environment variable:
```python
import os

if "FRIENDLI_TOKEN" not in os.environ:
    os.environ["FRIENDLI_TOKEN"] = "flp_FILL_IN_WITH_YOUR_OWN_PERSONAL_ACCESS_TOKEN"
```
Step 10: Execute RAG Query
Now we can execute our RAG query using the Friendli Serverless Endpoints:
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(
    model="meta-llama-3.1-70b-instruct",
    base_url="https://api.friendli.ai/serverless/v1",
    api_key=os.environ["FRIENDLI_TOKEN"],
)

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_PROMPT),
    ("user", USER_PROMPT),
])

output_parser = StrOutputParser()
chain = prompt | llm | output_parser

print(chain.invoke({"input": question}))
```
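If you prefer token-by-token output instead of waiting for the full answer, LCEL chains also expose a `stream()` method. A minimal sketch using the same chain:

```python
# Optional: stream the answer chunk by chunk as the model generates it.
for chunk in chain.stream({"input": question}):
    print(chunk, end="", flush=True)
```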
This produces the answer based on the provided documents:
In Milvus, data is stored in two forms: inserted data and metadata. Inserted data (vector data, scalar data, and collection-specific schema) is stored in persistent storage as incremental logs. Milvus supports multiple object storage backends, including MinIO, AWS S3, Google Cloud Storage (GCS), Azure Blob Storage, Alibaba Cloud OSS, and Tencent Cloud Object Storage (COS). Metadata, on the other hand, is generated within Milvus and is stored in etcd, with each Milvus module having its own metadata.
Step 11: Multi-Modal Queries
For multi-modal queries, we'll use the Llama-3.2-11b-vision model:
```python
from langchain_core.messages import HumanMessage

multimodalllm = ChatOpenAI(
    model="llama-3.2-11b-vision-instruct",
    base_url="https://api.friendli.ai/serverless/beta",
    api_key=os.environ["FRIENDLI_TOKEN"],
)

image_url = "https://milvus.io/docs/v2.4.x/assets/highly-decoupled-architecture.png"

message = HumanMessage(
    content=[
        {"type": "text", "text": "describe what is in this image"},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
)

response = multimodalllm.invoke([message])
print(response.content)
```
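If the image isn't publicly reachable, OpenAI-style APIs generally also accept base64-encoded data URLs. A sketch under that assumption (the local file path is a placeholder):

```python
import base64

# Hypothetical local file; encode it as a base64 data URL instead of a public link.
with open("architecture.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

local_image_message = HumanMessage(
    content=[
        {"type": "text", "text": "describe what is in this image"},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}},
    ],
)
```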
From its response, we can infer that the model correctly understands the image:
The image depicts a flowchart of the components of a system, with the following components:
**Coordinator Service**
* Root
* Query
…
Step 12: Combine RAG and Multi-Modal Capabilities
Finally, let's combine the RAG and multi-modal capabilities:
```python
question = "How is data stored in milvus with respect to this picture?"

USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

message = HumanMessage(
    content=[
        {"type": "text", "text": USER_PROMPT},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
)

response = multimodalllm.invoke([message])
print(response.content)
```
The model generates a correct response grounded in both the image and the documents:
**Step 1: Identify the components involved in storing data in Milvus.**
The components involved in storing data in Milvus include:
* Access Layer
* Message Storage
* Worker Node
**Step 2: Determine how data is stored in Milvus.**
Data is stored in the Access Layer and Message Storage.
**Step 3: Determine where data is stored in Milvus.**
Data is stored in both Access Layer and Message Storage.
**Answer:** Data is stored in both Access Layer and Message Storage.
Conclusion
This tutorial has demonstrated how to leverage Milvus and Friendli Serverless Endpoints to implement advanced RAG and multi-modal queries. By combining these powerful technologies, you can create more sophisticated AI applications that can understand and process diverse types of information, leading to more accurate and context-aware responses.
Written by
FriendliAI Tech & Research