
  • April 8, 2024
  • 2 min read

Building Your RAG Application on LlamaIndex with Friendli Inference: A Step-by-Step Guide


So you're ready to delve into the exciting world of Retrieval-Augmented Generation (RAG)? While the possibilities are endless, choosing the right components can feel overwhelming. This blog post will equip you with the knowledge and code to confidently deploy RAG on LlamaIndex with Friendli Inference, known for its blazing-fast performance and cost-effectiveness.

1. Setting Up Your Environment:

Before diving in, ensure you have an OpenAI API key and a Friendli Personal Access Token. You can obtain them from:

  • OpenAI API key: https://platform.openai.com/api-keys
  • Friendli Personal Access Token: https://friendli.ai/suite/setting/tokens

You can find the Python notebook code for this article at arxiv_rereanker.ipynb.

You can install the required libraries and export the relevant environment variables as follows:
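
```bash
$ pip install llama-index-llms-friendli
$ pip install llama-index
$ export FRIENDLI_TOKEN=[FILL_IN_YOUR_TOKEN]
$ export OPENAI_API_KEY=[FILL_IN_YOUR_TOKEN]
```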

Here's a Python script snippet to get you started:
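
```python
import os
import getpass
import time

from llama_index.core import Settings
from llama_index.llms.friendli import Friendli

# Use Friendli Inference as the LLM backend for LlamaIndex.
llm = Friendli(max_tokens=1024)

# LlamaIndex manages settings as a singleton.
Settings.llm = llm
Settings.chunk_size = 256
```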

2. Preparing Your Document:

For this example, we'll use our research paper "Orca: A Distributed Serving System for Transformer-Based Generative Models", which describes our iterative (continuous) batching technique for inference serving. Feel free to substitute any document relevant to your own use case to enhance model accuracy and reduce biases. First, download the paper:
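
```bash
$ wget https://www.usenix.org/system/files/osdi22-yu.pdf
```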

3. Storing the Document in a Vector Store:

The next step involves parsing the document and saving it to a vector store. Let’s first read the PDF file and parse it into chunked documents:
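
```python
from llama_index.core import SimpleDirectoryReader

# Load the PDF and parse it into LlamaIndex documents.
reader = SimpleDirectoryReader(input_files=["osdi22-yu.pdf"])
documents = reader.load_data()
```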

Here, we'll save the parsed documents to the vector store using LlamaIndex, which uses a simple in-memory dictionary by default. In a later section, we also demonstrate using Elasticsearch for larger datasets:
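
```python
from llama_index.core import VectorStoreIndex

# Embed the documents and build an in-memory vector index.
index = VectorStoreIndex.from_documents(documents)
```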

4. Exploring RAG Retrieval Methods:

4.1. Simple RAG:

This basic approach retrieves documents based on cosine similarity between the query embedding and document embeddings.
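
```python
# Retrieve the top-3 most similar chunks and stream the generated answer.
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)
query = "What are the key techniques introduced by Orca?"
response = query_engine.query(query)
response.print_response_stream()

print("\n\n🔗 Sources")
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")
```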

As a sample output, you can expect results like:
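
```text
The key techniques introduced by ORCA include the use of a narrow definition of the Attention operation, parallelization strategies such as intra-layer and inter-layer model parallelism, ...
```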

4.2. Advanced Usage: Utilizing Elasticsearch for Large-Scale Data:

For extensive datasets, consider switching from the default in-memory storage to Elasticsearch. Here's how to set it up and use it with LlamaIndex:

First, run Elasticsearch using Docker:
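
```bash
# Start a single-node Elasticsearch with security disabled (local testing only).
$ docker run -d -p 9200:9200 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    -e "xpack.security.http.ssl.enabled=false" \
    -e "xpack.license.self_generated.type=trial" \
    docker.elastic.co/elasticsearch/elasticsearch:8.12.2
```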

You can check that it’s running with:
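
```bash
$ docker ps | grep elastic
```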

Install the Elasticsearch integration package for LlamaIndex:
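
```bash
$ pip install llama-index-vector-stores-elasticsearch
```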

We can create an Elasticsearch-backed storage context and build the index from our documents like so:
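
```python
from llama_index.core import StorageContext
from llama_index.vector_stores.elasticsearch import ElasticsearchStore

# Store the document embeddings in the local Elasticsearch "demo" index.
es = ElasticsearchStore(
    index_name="demo",
    es_url="http://localhost:9200",
)
storage_context = StorageContext.from_defaults(vector_store=es)
index = VectorStoreIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
)
```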

You can check that the index named 'demo' stores the document embeddings as a dense vector type:
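
```bash
$ curl -X GET "localhost:9200/demo/_mapping?pretty"
```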

Based on this setup, you can run queries as follows:
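
```python
# Query the Elasticsearch-backed index, retrieving the top-5 chunks.
query_engine = index.as_query_engine(
    streaming=True,
    similarity_top_k=5,
)
response = query_engine.query(query)
response.print_response_stream()

print("\n\n🔗 Sources")
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")
```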

By following these steps and incorporating the provided code, you'll be well on your way to implementing RAG in your applications. Remember, this is just a starting point – feel free to experiment and customize the process to suit your specific needs.


Written by

FriendliAI Tech & Research




