Building Your RAG Application on LlamaIndex with Friendli Engine: A Step-by-Step Guide


So you're ready to delve into the exciting world of Retrieval-Augmented Generation (RAG)? While the possibilities are endless, choosing the right components can feel overwhelming. This blog post will equip you with the knowledge and code to confidently deploy RAG on LlamaIndex with Friendli Engine, known for its blazing-fast performance and cost-effectiveness.

1. Setting Up Your Environment:

Before diving in, ensure you have an OpenAI API key and a Friendli Personal Access Token. You can obtain them from your OpenAI account and the Friendli Suite, respectively.

The Python notebook with the full code for this article is available at arxiv_rereanker.ipynb.

Install the required libraries and export the relevant environment variables:

bash
$ pip install llama-index-llms-friendli
$ pip install llama-index
$ export FRIENDLI_TOKEN=[FILL_IN_YOUR_TOKEN]
$ export OPENAI_API_KEY=[FILL_IN_YOUR_API_KEY]

Here's a Python script snippet to get you started:

python
import os
import getpass

from llama_index.core import Settings
from llama_index.llms.friendli import Friendli

# Prompt for the token if it was not exported as an environment variable.
if "FRIENDLI_TOKEN" not in os.environ:
    os.environ["FRIENDLI_TOKEN"] = getpass.getpass("Friendli Personal Access Token: ")

llm = Friendli(max_tokens=1024)

# LlamaIndex manages its settings as a singleton.
Settings.llm = llm
Settings.chunk_size = 256
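
Note that while Friendli serves the LLM, LlamaIndex falls back to OpenAI embeddings for indexing and retrieval, which is why the OPENAI_API_KEY is needed. If you prefer to pin the embedding model explicitly, here is a minimal sketch (the model name shown is just one possible choice):

python
from llama_index.embeddings.openai import OpenAIEmbedding

# Explicitly choose the embedding model used for indexing and querying.
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")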

2. Preparing Your Document:

For this example, we'll use our research paper "Orca: A Distributed Serving System for Transformer-Based Generative Models", which describes our iterative (continuous) batching technique for inference serving. Feel free to substitute any document that better suits your use case.

bash
$ wget https://www.usenix.org/system/files/osdi22-yu.pdf

3. Storing the Document in a Vector Store:

The next step involves parsing the document and saving it to a vector store. Let's first read the PDF file into documents; LlamaIndex will split them into chunks of Settings.chunk_size tokens when the index is built.

python
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_files=["osdi22-yu.pdf"])
documents = reader.load_data()
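
As a quick sanity check, you can inspect what was loaded. Each PDF page becomes one Document, and chunking into smaller nodes happens later at index-build time:

python
# One Document per PDF page; chunking happens when the index is built.
print(f"Loaded {len(documents)} documents")
print(documents[0].metadata)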

Next, we save the parsed documents to a vector store using LlamaIndex, which defaults to a simple in-memory dictionary. Later in this article, we also demonstrate using ElasticSearch for larger datasets:

python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
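
The in-memory index disappears when the process exits. If you want to reuse it across runs without re-embedding the document, you can persist it to disk and load it back later (a minimal sketch; the persist_dir path is an arbitrary choice):

python
from llama_index.core import StorageContext, load_index_from_storage

# Persist the docstore, vector store, and index metadata to disk.
index.storage_context.persist(persist_dir="./storage")

# Later: rebuild the index object from the persisted files.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)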

4. Exploring RAG Retrieval Methods:

4.1. Simple RAG:

This basic approach retrieves documents based on cosine similarity between the query embedding and document embeddings.

python
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)
query = "What are the key techniques introduced by Orca?"
response = query_engine.query(query)
response.print_response_stream()

print("\n\n🔗 Sources")
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")

As a sample output, you can expect results like:

bash
The key techniques introduced by ORCA include the use of a narrow definition of the Attention operation, parallelization strategies such as intra-layer and inter-layer model parallelism, ...
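
If you want to see which chunks are retrieved before any generation happens, you can also call the retriever on its own (a minimal sketch using the same index and query as above):

python
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve(query)

# Inspect the retrieved chunks and their similarity scores.
for node in nodes:
    print(f"{node.score:.3f}\t{node.node.get_content()[:100]!r}")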

4.2. Advanced Usage: Utilizing ElasticSearch for Large-Scale Data:

For extensive datasets, consider switching from the default in-memory storage to ElasticSearch. Here's how to set it up and use it with LlamaIndex:

First, run ElasticSearch using Docker:

bash
$ docker run -d -p 9200:9200 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    -e "xpack.security.http.ssl.enabled=false" \
    -e "xpack.license.self_generated.type=trial" \
    docker.elastic.co/elasticsearch/elasticsearch:8.12.2

You can check that it’s running with:

bash
$ docker ps | grep elastic

Install the ElasticSearch-LlamaIndex integration package:

bash
$ pip install llama-index-vector-stores-elasticsearch

We can then build the index from our documents on top of an ElasticSearch-backed storage context:

python
from llama_index.core import StorageContext
from llama_index.vector_stores.elasticsearch import ElasticsearchStore

es = ElasticsearchStore(
    index_name="demo",
    es_url="http://localhost:9200",
)
storage_context = StorageContext.from_defaults(vector_store=es)
index = VectorStoreIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
)
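
Since ElasticSearch persists the embeddings, a later session can reattach to the same index without re-ingesting the documents (a minimal sketch, assuming the 'demo' index already exists):

python
# Reconnect to the existing ElasticSearch index in a new session.
es = ElasticsearchStore(index_name="demo", es_url="http://localhost:9200")
index = VectorStoreIndex.from_vector_store(vector_store=es)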

You can check that the index named 'demo' contains the document embeddings, stored as a dense vector type:

bash
$ curl -X GET "localhost:9200/demo/_mapping?pretty"

With ElasticSearch as the backing store, queries run exactly as before:

python
query_engine = index.as_query_engine(
    streaming=True,
    similarity_top_k=5,
)
response = query_engine.query(query)
response.print_response_stream()

print("\n\n🔗 Sources")
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")

By following these steps and incorporating the provided code, you'll be well on your way to implementing RAG in your applications. Remember, this is just a starting point – feel free to experiment and customize the process to suit your specific needs.


