
  • April 8, 2024
  • 2 min read

Building Your RAG Application on LlamaIndex with Friendli Inference: A Step-by-Step Guide


So you're ready to delve into the exciting world of Retrieval-Augmented Generation (RAG)? While the possibilities are endless, choosing the right components can feel overwhelming. This blog post will equip you with the knowledge and code to confidently deploy RAG on LlamaIndex with Friendli Inference, known for its blazing-fast performance and cost-effectiveness.

1. Setting Up Your Environment:

Before diving in, ensure you have an OpenAI API key and a Friendli Personal Access Token. You can obtain them from:

  • OpenAI API key: https://platform.openai.com/api-keys
  • Friendli Personal Access Token: https://friendli.ai/suite/setting/tokens

You can find the Python notebook code for this article at arxiv_rereanker.ipynb.

You can install the required libraries and export the relevant environment variables as follows:
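
```bash
$ pip install llama-index-llms-friendli
$ pip install llama-index
$ export FRIENDLI_TOKEN=[FILL_IN_YOUR_TOKEN]
$ export OPENAI_API_KEY=[FILL_IN_YOUR_TOKEN]
```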

Here's a Python script snippet to get you started:
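
```python
import os
import getpass
import time

from llama_index.core import Settings
from llama_index.llms.friendli import Friendli

# Use Friendli Inference as the LLM backend for LlamaIndex.
llm = Friendli(max_tokens=1024)

# LlamaIndex manages settings as a singleton.
Settings.llm = llm
Settings.chunk_size = 256
```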

2. Preparing Your Document:

For this example, we'll use our research paper "Orca: A Distributed Serving System for Transformer-Based Generative Models", which describes our iterative (continuous) batching technique for inference serving. Feel free to substitute any document relevant to your own use case to enhance model accuracy and reduce biases. First, download the paper:
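
```bash
$ wget https://www.usenix.org/system/files/osdi22-yu.pdf
```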

3. Storing the Document in a Vector Store:

The next step involves parsing the document and saving it to a vector store. Let’s first read the PDF file and parse it into chunked documents:
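
```python
from llama_index.core import SimpleDirectoryReader

# Load the PDF and parse it into LlamaIndex documents.
reader = SimpleDirectoryReader(input_files=["osdi22-yu.pdf"])
documents = reader.load_data()
```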

Here, we'll save the parsed documents to the vector store using LlamaIndex, which uses a simple in-memory dictionary by default. In a later section, we also demonstrate using Elasticsearch for larger datasets:
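
```python
from llama_index.core import VectorStoreIndex

# Embed the documents and build an in-memory vector index.
index = VectorStoreIndex.from_documents(documents)
```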

4. Exploring RAG Retrieval Methods:

4.1. Simple RAG:

This basic approach retrieves documents based on cosine similarity between the query embedding and document embeddings.
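
```python
# Retrieve the top-3 most similar chunks and stream the generated answer.
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)
query = "What are the key techniques introduced by Orca?"
response = query_engine.query(query)
response.print_response_stream()

print("\n\n🔗 Sources")
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")
```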

As a sample output, you can expect results like:
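
```text
The key techniques introduced by ORCA include the use of a narrow definition of the Attention operation, parallelization strategies such as intra-layer and inter-layer model parallelism, ...
```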

4.2. Advanced Usage: Utilizing Elasticsearch for Large-Scale Data:

For extensive datasets, consider switching from the default in-memory storage to Elasticsearch. Here's how to set it up and use it with LlamaIndex:

First, run Elasticsearch using Docker:
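
```bash
# Start a single-node Elasticsearch with security disabled (local testing only).
$ docker run -d -p 9200:9200 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    -e "xpack.security.http.ssl.enabled=false" \
    -e "xpack.license.self_generated.type=trial" \
    docker.elastic.co/elasticsearch/elasticsearch:8.12.2
```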

You can check that it’s running with:
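
```bash
$ docker ps | grep elastic
```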

Install the Elasticsearch integration package for LlamaIndex:
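
```bash
$ pip install llama-index-vector-stores-elasticsearch
```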

We can create an Elasticsearch-backed storage context and build the index from our documents like so:
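
```python
from llama_index.core import StorageContext
from llama_index.vector_stores.elasticsearch import ElasticsearchStore

# Store the document embeddings in the local Elasticsearch "demo" index.
es = ElasticsearchStore(
    index_name="demo",
    es_url="http://localhost:9200",
)
storage_context = StorageContext.from_defaults(vector_store=es)
index = VectorStoreIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
)
```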

You can check that the index named 'demo' stores the document embeddings as a dense vector type:
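
```bash
$ curl -X GET "localhost:9200/demo/_mapping?pretty"
```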

Based on this setup, you can run queries as follows:
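
```python
# Query the Elasticsearch-backed index, retrieving the top-5 chunks.
query_engine = index.as_query_engine(
    streaming=True,
    similarity_top_k=5,
)
response = query_engine.query(query)
response.print_response_stream()

print("\n\n🔗 Sources")
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")
```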

By following these steps and incorporating the provided code, you'll be well on your way to implementing RAG in your applications. Remember, this is just a starting point – feel free to experiment and customize the process to suit your specific needs.


Written by

FriendliAI Tech & Research




