- December 18, 2024
- 3 min read
Leveraging Milvus and Friendli Serverless Endpoints for Advanced RAG and Multi-Modal Queries

Link to Colab: Leveraging Milvus and Friendli Serverless Endpoints for Advanced RAG and Multi-Modal Queries.ipynb
FriendliAI specializes in generative AI infrastructure, offering solutions that enable organizations to efficiently deploy and manage large language models (LLMs) and other generative AI models with optimized performance and reduced cost. Users can choose from production-ready conventional LLMs accessible through APIs, or custom fine-tuned LLMs deployed on hardware of their choice, whether on the public cloud or on private on-premise clusters.
Milvus is an open-source vector database that stores, indexes, and searches billion-scale unstructured data through high-dimensional vector embeddings. It is perfect for building modern AI applications such as retrieval augmented generation (RAG), semantic search, multimodal search, and recommendation systems.
In this article, we'll explore how to use Milvus with Friendli Serverless Endpoints to perform Retrieval-Augmented Generation (RAG) on particular documents and materials, as well as to execute multi-modal queries that incorporate images and other visual content. This powerful combination allows for more sophisticated and context-aware AI applications.
Understanding RAG and Multi-Modal Models
Retrieval-Augmented Generation (RAG)
RAG is a technique that enhances language models by providing them with relevant information, primarily retrieved from a vector database-powered knowledge base. This approach allows AI models to generate more accurate and contextually appropriate responses by referencing designated external data sources.
Multi-Modal Models
Multimodal models can process and understand multiple types of input data, such as text, images, and audio. They can analyze and generate responses based on diverse information sources, enabling more comprehensive and nuanced interactions.
Why Incorporate RAG and Multi-Modal Models Together?
The combination of RAG and multi-modal capabilities significantly improves AI systems by providing the following features simultaneously:
- Accepting more diverse and richer input types of the user’s choice
- Providing up-to-date information
- Enhancing accuracy and relevance of responses
- Enabling context-aware interactions
Hands-On Implementation
Let's dive into the practical implementation of RAG and multi-modal queries using the Milvus vector database and Friendli Serverless Endpoints.
Step 1: Install Prerequisites and Download Milvus Docs
First, we'll install the necessary libraries and download the Milvus documentation that we'll use for our RAG job:
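A representative setup sketch; the package list, docs version, and download URL below are assumptions, so adjust them to match the Colab notebook:

```bash
# Install the client libraries (versions are illustrative)
pip install --upgrade pymilvus friendli-client sentence-transformers tqdm

# Download and unpack the Milvus documentation (2.4.x English archive assumed)
wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs
```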
Step 2: Process Documentation Files
Next, we'll read the Milvus documentation files and use a simple file-splitting strategy to treat each text line as an individual chunk:
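A minimal sketch of this step; the glob path (the FAQ subset of the English docs) is an assumption:

```python
from glob import glob

text_lines = []
for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as f:
        # Simple strategy: every non-empty line becomes one chunk
        text_lines += [line for line in f.read().splitlines() if line.strip()]

print(f"{len(text_lines)} chunks loaded")
```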
Step 3: Prepare Embeddings
We'll use a lightweight Hugging Face sentence-embedding model from the `all-MiniLM-L6` family to create vector representations of our text:
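A sketch using `sentence-transformers` with the `all-MiniLM-L6-v2` checkpoint as one concrete choice:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def emb_text(text: str) -> list[float]:
    return embedding_model.encode(text, normalize_embeddings=True).tolist()

embedding_dim = len(emb_text("This is a test"))
print(embedding_dim)  # 384
```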
Step 4: Set Up Milvus Client
Now, let's prepare the Milvus client for our RAG implementation. In this simple example, we use Milvus Lite, which runs locally and persists all data in a single local file. You can also consider other Milvus deployment options:
- If you only need a local vector database for small-scale data or prototyping, setting the `uri` as a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
- For larger-scale data and traffic in production, you can set up a Milvus server on Docker or Kubernetes. In this setup, use the server address and port as your `uri`, e.g. `http://localhost:19530`. If you enable the authentication feature on Milvus, set the `token` as `"<your_username>:<your_password>"`; otherwise, there is no need to set the token.
- You can also use fully managed Milvus on Zilliz Cloud. Simply set the `uri` and `token` to the Public Endpoint and API key of your Zilliz Cloud instance.
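Sticking with Milvus Lite here, a minimal setup sketch (the file path and collection name are illustrative):

```python
from pymilvus import MilvusClient

# Milvus Lite: all data is persisted in this local file
milvus_client = MilvusClient(uri="./milvus_demo.db")

collection_name = "my_rag_collection"
```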
Step 5: Create Milvus Collection
We'll create a collection in the Milvus client if it doesn't already exist:
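A sketch assuming inner-product similarity over the embeddings from Step 3:

```python
# Create the collection only if it doesn't exist yet
if not milvus_client.has_collection(collection_name):
    milvus_client.create_collection(
        collection_name=collection_name,
        dimension=embedding_dim,      # must match the embedding model
        metric_type="IP",             # inner product similarity
        consistency_level="Strong",
    )
```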
Step 6: Embed and Insert Text into Milvus
Let's embed our text and insert it into the Milvus collection:
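One straightforward way to do this, embedding each chunk and inserting it with a sequential ID:

```python
from tqdm import tqdm

data = [
    {"id": i, "vector": emb_text(line), "text": line}
    for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings"))
]
milvus_client.insert(collection_name=collection_name, data=data)
```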
Step 7: Perform RAG Query
Now we can ask a question and search for relevant data within our Milvus database:
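A sketch with an example question; the question text and the top-3 limit are illustrative:

```python
question = "How is data stored in Milvus?"

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],                # embed the query the same way
    limit=3,                                  # top-3 most similar chunks
    search_params={"metric_type": "IP", "params": {}},
    output_fields=["text"],                   # return the raw text
)
retrieved_lines = [hit["entity"]["text"] for hit in search_res[0]]
```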
Step 8: Create Prompts for RAG
Let's create the system and user prompts for our RAG query:
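A sketch of one common prompt layout, wrapping the retrieved chunks in `<context>` tags (the exact wording is an assumption):

```python
context = "\n".join(retrieved_lines)

SYSTEM_PROMPT = (
    "You are an AI assistant. Answer the question using only the "
    "information provided in the context."
)
USER_PROMPT = f"""<context>
{context}
</context>
<question>
{question}
</question>"""
```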
Step 9: Set Up Friendli Token
Obtain your `FRIENDLI_TOKEN` from the Friendli Suite and set it as an environment variable:
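For example, prompting for the token interactively so it never appears in the notebook:

```python
import os
from getpass import getpass

# Personal access token issued in the Friendli Suite
os.environ["FRIENDLI_TOKEN"] = getpass("FRIENDLI_TOKEN: ")
```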
Step 10: Execute RAG Query
Now we can execute our RAG query using the Friendli Serverless Endpoints:
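A sketch using the `friendli-client` SDK; the model ID below is illustrative, and any chat model available on Friendli Serverless Endpoints can be substituted:

```python
from friendli import Friendli

client = Friendli()  # picks up FRIENDLI_TOKEN from the environment

chat_completion = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",   # illustrative model ID
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(chat_completion.choices[0].message.content)
```

Because the endpoint exposes an OpenAI-compatible chat API, swapping in a different model is a one-line change.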
This produces an answer grounded in the provided documents.
Step 11: Multi-Modal Queries
For multi-modal queries, we'll use the Llama-3.2-11b-vision model:
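A sketch of a vision query using the OpenAI-compatible image message format; the model ID and image URL are assumptions:

```python
IMAGE_URL = "https://example.com/milvus-architecture.png"  # hypothetical image

completion = client.chat.completions.create(
    model="meta-llama-3.2-11b-vision-instruct",  # illustrative model ID
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what this image shows."},
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)
```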
From its response, we can see that the model correctly understands the image.
Step 12: Combine RAG and Multi-Modal Capabilities
Finally, let's combine the RAG and multi-modal capabilities:
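One plausible way to combine the two, assuming we reuse the image description from Step 11 as the retrieval query and then answer with both the image and the retrieved documentation in context:

```python
# Use the model's description of the image as a retrieval query
image_description = completion.choices[0].message.content

res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(image_description)],
    limit=3,
    output_fields=["text"],
)
doc_context = "\n".join(hit["entity"]["text"] for hit in res[0])

# Answer with both the retrieved docs and the image in the prompt
answer = client.chat.completions.create(
    model="meta-llama-3.2-11b-vision-instruct",  # illustrative model ID
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<context>\n{doc_context}\n</context>\n"
                            "Using the context above, explain what the image shows.",
                },
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            ],
        }
    ],
)
print(answer.choices[0].message.content)
```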
The model generates a correct response based on both the image and the documents.
Conclusion
This tutorial has demonstrated how to leverage Milvus and Friendli Serverless Endpoints to implement advanced RAG and multi-modal queries. By combining these powerful technologies, you can create more sophisticated AI applications that can understand and process diverse types of information, leading to more accurate and context-aware responses.
Written by
FriendliAI Tech & Research