- December 18, 2024
- 3 min read
Leveraging Milvus and Friendli Serverless Endpoints for Advanced RAG and Multi-Modal Queries

Link to Colab: Leveraging Milvus and Friendli Serverless Endpoints for Advanced RAG and Multi-Modal Queries.ipynb
FriendliAI specializes in generative AI infrastructure, offering solutions that enable organizations to efficiently deploy and manage large language models (LLMs) and other generative AI models with optimized performance and reduced cost. Users can choose from production-ready conventional LLMs accessible through APIs, or custom fine-tuned LLMs deployed on hardware of their choice, whether on the public cloud or on private on-premise clusters.
Milvus is an open-source vector database that stores, indexes, and searches billion-scale unstructured data through high-dimensional vector embeddings. It is perfect for building modern AI applications such as retrieval augmented generation (RAG), semantic search, multimodal search, and recommendation systems.
In this article, we'll explore how to use Milvus with Friendli Serverless Endpoints to perform Retrieval-Augmented Generation (RAG) on particular documents and materials, as well as to execute multi-modal queries that incorporate images and other visual content. This powerful combination allows for more sophisticated and context-aware AI applications.
Understanding RAG and Multi-Modal Models
Retrieval-Augmented Generation (RAG)
RAG is a technique that enhances language models by providing them with relevant information, primarily retrieved from a vector database-powered knowledge base. This approach allows AI models to generate more accurate and contextually appropriate responses by referencing designated external data sources.
Multi-Modal Models
Multimodal models can process and understand multiple types of input data, such as text, images, and audio. They can analyze and generate responses based on diverse information sources, enabling more comprehensive and nuanced interactions.
Why Incorporate RAG and Multi-Modal Models Together?
The combination of RAG and multi-modal capabilities significantly improves AI systems by providing the following features simultaneously:
- Accepting more diverse and richer input types of the user’s choice
- Providing up-to-date information
- Enhancing accuracy and relevance of responses
- Enabling context-aware interactions
Hands-On Implementation
Let's dive into the practical implementation of RAG and multi-modal queries using the Milvus vector database and Friendli Serverless Endpoints.
Step 1: Install Prerequisites and Download Milvus Docs
First, we'll install the necessary libraries and download the Milvus documentation that we'll use for our RAG job:
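A representative setup sketch; the package list, docs version, and download URL below are assumptions, so adjust them to match the Colab notebook:

```bash
# Install the client libraries (versions are illustrative)
pip install --upgrade pymilvus friendli-client sentence-transformers tqdm

# Download and unpack the Milvus documentation (2.4.x English archive assumed)
wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs
```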
Step 2: Process Documentation Files
Next, we'll read the Milvus documentation files and use a simple file-splitting strategy to treat each text line as an individual chunk:
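A minimal sketch of this step; the glob path (the FAQ subset of the English docs) is an assumption:

```python
from glob import glob

text_lines = []
for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as f:
        # Simple strategy: every non-empty line becomes one chunk
        text_lines += [line for line in f.read().splitlines() if line.strip()]

print(f"{len(text_lines)} chunks loaded")
```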
Step 3: Prepare Embeddings
We'll use a lightweight Hugging Face sentence-embedding model from the `all-MiniLM-L6` family to create vector representations of our text:
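A sketch using `sentence-transformers` with the `all-MiniLM-L6-v2` checkpoint as one concrete choice:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def emb_text(text: str) -> list[float]:
    return embedding_model.encode(text, normalize_embeddings=True).tolist()

embedding_dim = len(emb_text("This is a test"))
print(embedding_dim)  # 384
```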
Step 4: Set Up Milvus Client
Now, let's prepare the Milvus client for our RAG implementation. In this simple example, we use Milvus Lite, which runs locally and persists all data in a single local file. You can also consider other Milvus deployment options:
- If you only need a local vector database for small-scale data or prototyping, setting the `uri` as a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
- For larger-scale data and traffic in production, you can set up a Milvus server on Docker or Kubernetes. In this setup, use the server address and port as your `uri`, e.g. `http://localhost:19530`. If you enable the authentication feature on Milvus, set the `token` as `"<your_username>:<your_password>"`; otherwise, there is no need to set the token.
- You can also use fully managed Milvus on Zilliz Cloud. Simply set the `uri` and `token` to the Public Endpoint and API key of your Zilliz Cloud instance.
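Sticking with Milvus Lite here, a minimal setup sketch (the file path and collection name are illustrative):

```python
from pymilvus import MilvusClient

# Milvus Lite: all data is persisted in this local file
milvus_client = MilvusClient(uri="./milvus_demo.db")

collection_name = "my_rag_collection"
```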
Step 5: Create Milvus Collection
We'll create a collection in the Milvus client if it doesn't already exist:
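A sketch assuming inner-product similarity over the embeddings from Step 3:

```python
# Create the collection only if it doesn't exist yet
if not milvus_client.has_collection(collection_name):
    milvus_client.create_collection(
        collection_name=collection_name,
        dimension=embedding_dim,      # must match the embedding model
        metric_type="IP",             # inner product similarity
        consistency_level="Strong",
    )
```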
Step 6: Embed and Insert Text into Milvus
Let's embed our text and insert it into the Milvus collection:
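One straightforward way to do this, embedding each chunk and inserting it with a sequential ID:

```python
from tqdm import tqdm

data = [
    {"id": i, "vector": emb_text(line), "text": line}
    for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings"))
]
milvus_client.insert(collection_name=collection_name, data=data)
```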
Step 7: Perform RAG Query
Now we can ask a question and search for relevant data within our Milvus database:
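A sketch with an example question; the question text and the top-3 limit are illustrative:

```python
question = "How is data stored in Milvus?"

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],                # embed the query the same way
    limit=3,                                  # top-3 most similar chunks
    search_params={"metric_type": "IP", "params": {}},
    output_fields=["text"],                   # return the raw text
)
retrieved_lines = [hit["entity"]["text"] for hit in search_res[0]]
```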
Step 8: Create Prompts for RAG
Let's create the system and user prompts for our RAG query:
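A sketch of one common prompt layout, wrapping the retrieved chunks in `<context>` tags (the exact wording is an assumption):

```python
context = "\n".join(retrieved_lines)

SYSTEM_PROMPT = (
    "You are an AI assistant. Answer the question using only the "
    "information provided in the context."
)
USER_PROMPT = f"""<context>
{context}
</context>
<question>
{question}
</question>"""
```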
Step 9: Set Up Friendli Token
Obtain your `FRIENDLI_TOKEN` from the Friendli Suite and set it as an environment variable:
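For example, prompting for the token interactively so it never appears in the notebook:

```python
import os
from getpass import getpass

# Personal access token issued in the Friendli Suite
os.environ["FRIENDLI_TOKEN"] = getpass("FRIENDLI_TOKEN: ")
```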
Step 10: Execute RAG Query
Now we can execute our RAG query using the Friendli Serverless Endpoints:
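A sketch using the `friendli-client` SDK; the model ID below is illustrative, and any chat model available on Friendli Serverless Endpoints can be substituted:

```python
from friendli import Friendli

client = Friendli()  # picks up FRIENDLI_TOKEN from the environment

chat_completion = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",   # illustrative model ID
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(chat_completion.choices[0].message.content)
```

Because the endpoint exposes an OpenAI-compatible chat API, swapping in a different model is a one-line change.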
This produces an answer grounded in the provided documents.
Step 11: Multi-Modal Queries
For multi-modal queries, we'll use the Llama-3.2-11b-vision model:
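A sketch of a vision query using the OpenAI-compatible image message format; the model ID and image URL are assumptions:

```python
IMAGE_URL = "https://example.com/milvus-architecture.png"  # hypothetical image

completion = client.chat.completions.create(
    model="meta-llama-3.2-11b-vision-instruct",  # illustrative model ID
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what this image shows."},
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)
```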
From its response, we can see that the model correctly understands the image.
Step 12: Combine RAG and Multi-Modal Capabilities
Finally, let's combine the RAG and multi-modal capabilities:
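One plausible way to combine the two, assuming we reuse the image description from Step 11 as the retrieval query and then answer with both the image and the retrieved documentation in context:

```python
# Use the model's description of the image as a retrieval query
image_description = completion.choices[0].message.content

res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(image_description)],
    limit=3,
    output_fields=["text"],
)
doc_context = "\n".join(hit["entity"]["text"] for hit in res[0])

# Answer with both the retrieved docs and the image in the prompt
answer = client.chat.completions.create(
    model="meta-llama-3.2-11b-vision-instruct",  # illustrative model ID
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<context>\n{doc_context}\n</context>\n"
                            "Using the context above, explain what the image shows.",
                },
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            ],
        }
    ],
)
print(answer.choices[0].message.content)
```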
The model generates a correct response based on both the image and the documents.
Conclusion
This tutorial has demonstrated how to leverage Milvus and Friendli Serverless Endpoints to implement advanced RAG and multi-modal queries. By combining these powerful technologies, you can create more sophisticated AI applications that can understand and process diverse types of information, leading to more accurate and context-aware responses.
Written by
FriendliAI Tech & Research