- April 8, 2024
Building Your RAG Application on LlamaIndex with Friendli Inference: A Step-by-Step Guide
So you're ready to explore Retrieval-Augmented Generation (RAG)? With so many possible components, choosing the right ones can feel overwhelming. This post walks through the steps and code for building a RAG pipeline on LlamaIndex with Friendli Inference, which is designed for fast and cost-effective inference.
1. Setting Up Your Environment:
Before diving in, ensure you have an OpenAI API key and a Friendli Personal Access Token. You can obtain them from:
- OpenAI API key: https://platform.openai.com/api-keys
- Friendli Personal Access Token: https://suite.friendli.ai/default-team/settings/tokens
You can obtain the Python notebook code for this article at arxiv_rereanker.ipynb.
Install the required libraries and export the relevant variables as follows:
```bash
$ pip install llama-index-llms-friendli
$ pip install llama-index
$ export FRIENDLI_TOKEN=[FILL_IN_YOUR_TOKEN]
$ export OPENAI_API_KEY=[FILL_IN_YOUR_TOKEN]
```
Here's a Python script snippet to get you started:
```python
import os
import getpass
import time

from llama_index.core import Settings
from llama_index.llms.friendli import Friendli

llm = Friendli(max_tokens=1024)

# LlamaIndex manages settings as a singleton.
Settings.llm = llm
Settings.chunk_size = 256
```
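If the setup above fails with an authentication error, a first thing to check is whether both variables are visible to the Python process. The helper below is a minimal sketch of such a check; `missing_env_vars` and `REQUIRED_VARS` are names made up for this illustration and are not part of LlamaIndex or the Friendli client.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("FRIENDLI_TOKEN", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or blank in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.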
2. Preparing Your Document:
For this example, we'll use our research paper "Orca: A Distributed Serving System for Transformer-Based Generative Models", which describes our iterative (continuous) batching technique for inference serving. Feel free to substitute any document that better suits your use case.
```bash
$ wget https://www.usenix.org/system/files/osdi22-yu.pdf
```
3. Storing the Document in a Vector Store:
The next step involves parsing the document and saving it to a vector store. Let's first read the PDF file and parse it into chunked documents.
```python
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_files=["osdi22-yu.pdf"])
documents = reader.load_data()
```
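To build some intuition for what a setting like `Settings.chunk_size = 256` controls, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not LlamaIndex's splitter: it counts whitespace-separated words rather than model tokens and ignores sentence boundaries, and the `chunk_words` name and the `overlap` default are assumptions made for this example.

```python
def chunk_words(text, chunk_size=256, overlap=20):
    """Split text into overlapping windows of whitespace-separated words.

    Simplification: real splitters usually count model tokens and try to
    respect sentence boundaries; this sketch does neither.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Smaller chunks tend to give more focused retrieval hits, while larger chunks carry more context per hit; the overlap reduces the chance that a relevant passage is cut in half at a chunk boundary.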
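Here, we'll save the parsed documents to a vector store using LlamaIndex, which uses a simple in-memory dictionary by default. Building the index computes an embedding for each chunk; with LlamaIndex's default settings this calls OpenAI's embedding model, which is why the OpenAI API key is needed. Later in this article (section 4.2), we also demonstrate using ElasticSearch for larger datasets: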
```python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
```
4. Exploring RAG Retrieval Methods:
4.1. Simple RAG:
This basic approach retrieves the chunks whose embeddings have the highest cosine similarity to the query embedding (the top 3 in the example below) and passes them to the LLM as context.
```python
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)
query = "What are the key techniques introduced by Orca?"
response = query_engine.query(query)
response.print_response_stream()

print("\n\n🔗 Sources")
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")
```
As a sample output, you can expect results like:
```
The key techniques introduced by ORCA include the use of a narrow definition of the Attention operation, parallelization strategies such as intra-layer and inter-layer model parallelism, ...
```
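To make the retrieval step in Simple RAG more concrete, the following is a small standalone sketch of cosine-similarity top-k search over toy vectors. It illustrates the idea only and is not how LlamaIndex implements retrieval internally; the function names, chunk ids, and 3-dimensional "embeddings" are invented for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_id, score) pairs for the k most similar vectors, best first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors; real embeddings have hundreds of dimensions or more.
docs = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.1, 0.0], docs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

In this picture, `similarity_top_k` plays the role of `k`: raising it gives the LLM more context to draw on, at the cost of a longer prompt and possibly less relevant chunks.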
4.2. Advanced Usage: Utilizing ElasticSearch for Large-Scale Data:
For extensive datasets, consider switching from the default in-memory storage to ElasticSearch. Here's how to set it up and use it with LlamaIndex:
First, run ElasticSearch using Docker:
```bash
$ docker run -d -p 9200:9200 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    -e "xpack.security.http.ssl.enabled=false" \
    -e "xpack.license.self_generated.type=trial" \
    docker.elastic.co/elasticsearch/elasticsearch:8.12.2
```
You can check that it’s running with:
```bash
$ docker ps | grep elastic
```
Next, install the ElasticSearch-LlamaIndex integration package:
```bash
$ pip install llama-index-vector-stores-elasticsearch
```
We can then create a storage context backed by ElasticSearch and build the index from our documents:
```python
from llama_index.core import StorageContext
from llama_index.vector_stores.elasticsearch import ElasticsearchStore

es = ElasticsearchStore(
    index_name="demo",
    es_url="http://localhost:9200",
)
storage_context = StorageContext.from_defaults(vector_store=es)
index = VectorStoreIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
)
```
You can inspect the mapping of the index named 'demo' to confirm that the document embeddings are stored as a dense vector type:
```bash
$ curl -X GET "localhost:9200/demo/_mapping?pretty"
```
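If you prefer to check the mapping programmatically, a helper along these lines can pull the dense vector fields out of the JSON returned by the `_mapping` endpoint. This is only a sketch: the sample response is a hypothetical, abbreviated body, and the field names (`content`, `embedding`) and the `dims` value in it are assumptions, so your actual output may differ.

```python
import json


def dense_vector_fields(mapping, index_name):
    """Return {field_name: dims} for top-level dense_vector fields in a mapping."""
    properties = (
        mapping.get(index_name, {})
        .get("mappings", {})
        .get("properties", {})
    )
    return {
        name: spec.get("dims")
        for name, spec in properties.items()
        if spec.get("type") == "dense_vector"
    }


# Hypothetical, abbreviated response body; field names and dims are assumed.
sample_response = """
{
  "demo": {
    "mappings": {
      "properties": {
        "content": {"type": "text"},
        "embedding": {"type": "dense_vector", "dims": 1536}
      }
    }
  }
}
"""
print(dense_vector_fields(json.loads(sample_response), "demo"))
```

An empty result would suggest that the index was not created by the vector store, or that the documents were not indexed yet.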
With the index in place, you can run queries in the same way as before:
```python
query_engine = index.as_query_engine(
    streaming=True,
    similarity_top_k=5,
)
response = query_engine.query(query)
response.print_response_stream()

print("\n\n🔗 Sources")
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")
```
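Since the source-printing loop is the same for both query engines, you may want to factor it into a small helper. The sketch below is one possible way to do that; `format_source` and `print_sources` are names invented for this example, and `print_sources` assumes objects shaped like the entries of `response.source_nodes` used above.

```python
def format_source(text, metadata, score, max_chars=1000):
    """Build the per-source summary used by the query snippets in this post."""
    flat = text.strip().replace("\n", " ")[:max_chars]
    return "\n".join([
        "-----",
        f"Text:\t {flat} ...",
        f"Metadata:\t {metadata}",
        f"Score:\t {score:.3f}",
    ])


def print_sources(source_nodes):
    """Print every retrieved node; expects items like response.source_nodes."""
    print("\n\n🔗 Sources")
    for node in source_nodes:
        print(format_source(node.node.get_content(), node.node.metadata, node.score))
```

After streaming a response, calling `print_sources(response.source_nodes)` would then replace the loop in either snippet.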
By following these steps and incorporating the provided code, you'll be well on your way to implementing RAG in your applications. Remember, this is just a starting point – feel free to experiment and customize the process to suit your specific needs.
Written by
FriendliAI Tech & Research