- July 12, 2024
- 4 min read
Showcasing FriendliAI’s Integration with LiteLLM
LiteLLM recently introduced FriendliAI as one of their LLM inference API providers. LiteLLM allows users to utilize over 100 large language models with load balancing, fallbacks, and cost tracking, all in the OpenAI API format. You can leverage FriendliAI’s blazing-fast performance and cost-efficiency alongside LiteLLM’s versatile features.
This blog post will explore how the Friendli Serverless Endpoint can be used with LiteLLM. We will cover basic usage, example code for different response types, and the budget manager provided by LiteLLM. Moreover, stay tuned for a fun experiment comparing the cost-efficiency of FriendliAI and OpenAI models using the budget manager. In this experiment, we could generate approximately ten times more tokens with FriendliAI’s meta-llama-3-70b-instruct model than with OpenAI’s GPT-4o model under the same budget. By the end, you'll be well-equipped to maximize your use of LiteLLM and FriendliAI for your specific needs. So please follow along!
Basic Usages
This section will cover the basic usages of the LiteLLM Python SDK for chat completions with four different response types: default, streaming, asynchronous, and asynchronous streaming. Throughout this blog, we will use FriendliAI’s meta-llama-3-70b-instruct model and ask it “Hello from LiteLLM”.
Before diving in, make sure you have a Friendli Personal Access Token. You can get your token here. You can install the required library and export the relevant environment variable as follows:
```
$ pip install litellm
$ export FRIENDLI_TOKEN=[FILL_IN_YOUR_TOKEN]
```
Default Example Code
This example demonstrates how you can use the LiteLLM Python SDK to generate a response. LiteLLM supports LLM inferences using the ‘completion’ function.
```python
from litellm import completion

response = completion(
    model="friendliai/meta-llama-3-70b-instruct",
    messages=[
        {"role": "user", "content": "Hello from LiteLLM"}
    ],
)
print(response.choices[0].message.content)
```
Streaming Example Code
This example demonstrates how you can use the LiteLLM Python SDK to generate a streaming response. Responses can be streamed by setting the stream argument to True in the completion function.
```python
from litellm import completion

response = completion(
    model="friendliai/meta-llama-3-70b-instruct",
    messages=[
        {"role": "user", "content": "Hello from LiteLLM"}
    ],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
Async Example Code
This example demonstrates how you can use the LiteLLM Python SDK to generate an asynchronous response. Asynchronous chat completions are supported using the ‘acompletion’ function.
```python
from litellm import acompletion
import asyncio


async def test_get_response():
    response = await acompletion(
        model="friendliai/meta-llama-3-70b-instruct",
        messages=[
            {"role": "user", "content": "Hello from LiteLLM"}
        ],
    )
    print(response.choices[0].message.content)


asyncio.run(test_get_response())
```
Async Streaming Example Code
This example demonstrates how you can use the LiteLLM Python SDK to generate an asynchronous streaming response.
```python
from litellm import acompletion
import asyncio


async def test_get_response():
    response = await acompletion(
        model="friendliai/meta-llama-3-70b-instruct",
        messages=[
            {"role": "user", "content": "Hello from LiteLLM"}
        ],
        stream=True,
    )
    async for chunk in response:
        print(chunk.choices[0].delta.content or "", end="", flush=True)


asyncio.run(test_get_response())
```
Results
The chat completion inference result of “Hello from LiteLLM” using FriendliAI’s meta-llama-3-70b-instruct model with LiteLLM is as follows:
Hello from an AI! It's great to meet you, LiteLLM! How's your day going so far?
```
# Result of print(response)
ModelResponse(
    id=None,
    choices=[
        Choices(
            finish_reason='stop',
            index=0,
            message=Message(
                content="Hello from an AI! It's great to meet you, LiteLLM! How's your day going so far?",
                role='assistant'
            )
        )
    ],
    created=1720661080,
    model='friendliai/meta-llama-3-70b-instruct',
    object='chat.completion',
    system_fingerprint=None,
    usage=Usage(
        completion_tokens=25,
        prompt_tokens=16,
        total_tokens=41
    )
)
```
Congratulations on getting the basics under your belt! You have taken the first step in leveraging LiteLLM and FriendliAI for your projects. As the LLM has answered, how's your day going so far? We hope it has been productive and enjoyable. Furthermore, pay attention to the ‘total_tokens’ variable in the response above. We will use this variable to calculate the total number of tokens used in our final experiment.
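For reference, here is a minimal way to read those usage numbers from the response object; the values in the comments mirror the example response printed above:

```python
from litellm import completion

response = completion(
    model="friendliai/meta-llama-3-70b-instruct",
    messages=[{"role": "user", "content": "Hello from LiteLLM"}],
)

# The usage block on the ModelResponse records how many tokens the call consumed.
usage = response.usage
print(usage.prompt_tokens)      # e.g. 16
print(usage.completion_tokens)  # e.g. 25
print(usage.total_tokens)       # e.g. 41, the figure we accumulate in the final experiment
```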
Stay tuned as we delve deeper into more advanced features and exciting experiments in the following sections. Let's continue exploring the full potential of these powerful tools together!
Budget Manager
An interesting feature of LiteLLM is their BudgetManager class. You can manage budgets and track spent costs for each user. Advanced features include storing user budgets in a database and resetting user budgets based on a set duration. You can check out their implementation code here.
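As a rough sketch of those advanced options, a hosted budget that resets monthly could look like the snippet below. Note that the client_type, duration, and save_data arguments here are assumptions based on LiteLLM's documentation, so double-check them against the implementation linked above before relying on them.

```python
from litellm import BudgetManager, completion

# Assumption: client_type="hosted" stores budgets via LiteLLM's hosted API
# instead of a local JSON file.
budget_manager = BudgetManager(
    project_name="test_project",
    client_type="hosted",
)

user = "user_id"
if not budget_manager.is_valid_user(user):
    # Assumption: duration="monthly" resets this user's spend every month.
    budget_manager.create_budget(total_budget=0.001, user=user, duration="monthly")

response = completion(
    model="friendliai/meta-llama-3-70b-instruct",
    messages=[{"role": "user", "content": "Hello from LiteLLM"}],
)
budget_manager.update_cost(completion_obj=response, user=user)
budget_manager.save_data()  # persist the updated budget data
```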
User-based Rate Limiting Code
In this example, we will explore how to use the BudgetManager class to manage and enforce user-specific budgets. This feature is particularly useful for controlling the costs associated with running LLM inferences. The code goes through the process of creating a budget for a user, checking their current usage against the budget, and updating the cost after an inference is made.
Here's the code implementation:
```python
from litellm import BudgetManager, completion

budget_manager = BudgetManager(project_name="test_project")
user = "user_id"

if not budget_manager.is_valid_user(user):
    budget_manager.create_budget(total_budget=0.001, user=user)

if budget_manager.get_current_cost(user=user) <= budget_manager.get_total_budget(user):
    response = completion(
        model="friendliai/meta-llama-3-70b-instruct",
        messages=[
            {"role": "user", "content": "Hello from LiteLLM"}
        ],
    )
    budget_manager.update_cost(completion_obj=response, user=user)
else:
    print("Sorry - no more budget!")
```
```
# user_cost.json
{
  "user_id": {
    "total_budget": 0.001,
    "current_cost": 3.68e-05,
    "model_cost": {
      "friendliai/meta-llama-3-70b-instruct": 3.68e-05
    }
  }
}
```
The Final Budget Manager Experiment
Now that we have finally covered all the basics, let's try something fun! Have you ever wanted to see how many inferences you could make with a strict budget? This experiment can help us understand how much LLMs actually cost. We used the budget manager to see how many inferences could be made to FriendliAI’s meta-llama-3-70b-instruct model with $0.001. Let’s ask the model “Hello from LiteLLM” until we run out of money.
Here's the code implementation. It tracks and updates the total number of inferences and tokens used, and stops when the budget is exceeded, printing a summary:
```python
from litellm import BudgetManager, completion

budget_manager = BudgetManager(project_name="test_project")
user = "user_id"
total_inferences = 0
total_tokens = 0

if not budget_manager.is_valid_user(user):
    budget_manager.create_budget(total_budget=0.001, user=user)

while True:
    if budget_manager.get_current_cost(user=user) <= budget_manager.get_total_budget(user):
        response = completion(
            model="friendliai/meta-llama-3-70b-instruct",
            messages=[
                {"role": "user", "content": "Hello from LiteLLM"}
            ],
        )
        # Record the spend and usage of this inference
        budget_manager.update_cost(completion_obj=response, user=user)
        total_inferences += 1
        total_tokens += response.usage.total_tokens
    else:
        # Budget exceeded: print a summary and stop
        print("Sorry - no more budget!")
        print(f"Total number of successful inferences: {total_inferences}")
        print(f"Total number of used tokens: {total_tokens}")
        print(f"Example of a response is: {response.choices[0].message.content}")
        break
```
FriendliAI’s meta-llama-3-70b-instruct Model Results
In this run, 27 inferences, using a total of 1281 tokens, could be made with $0.001.
```
Sorry - no more budget!
Total number of successful inferences: 27
Total number of used tokens: 1281
Example of a response is: Hello from me! It's nice to meet you, LiteLLM! How are you doing today?
```
```
# user_cost.json
{
  "user_id": {
    "total_budget": 0.001,
    "current_cost": 0.0010248,
    "model_cost": {
      "friendliai/meta-llama-3-70b-instruct": 0.0010248
    }
  }
}
```
OpenAI’s GPT-4o Model Results
Next, we tried running the same experiment with OpenAI’s GPT-4o model. Simply swap the model value with "gpt-4o" in the experiment code. In this run, 6 inferences, using a total of 126 tokens, could be made with $0.001. Under the same budget, we were able to use over 10 times as many tokens with FriendliAI’s meta-llama-3-70b-instruct model compared to OpenAI’s GPT-4o model!
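For reference, the only change to the experiment code is the model argument. A minimal sketch of the swapped call is shown below; it assumes an OPENAI_API_KEY environment variable is set so LiteLLM can route the request to OpenAI. The output of the full experiment follows.

```python
from litellm import completion

# The only change from the FriendliAI experiment is the model string; LiteLLM
# routes "gpt-4o" to OpenAI (requires the OPENAI_API_KEY environment variable).
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello from LiteLLM"}],
)
print(response.choices[0].message.content)
```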
```
Sorry - no more budget!
Total number of successful inferences: 6
Total number of used tokens: 126
Example of a response is: Hello! How can I assist you today?
```
```
# user_cost.json
{
  "user_id": {
    "total_budget": 0.001,
    "current_cost": 0.0011700000000000002,
    "model_cost": {
      "gpt-4o-2024-05-13": 0.0011700000000000002
    }
  }
}
```
Token Cost Comparison
This graph compares the number of tokens generated by FriendliAI and OpenAI models within a $0.001 budget on LiteLLM:
With $0.001, we can generate ~10.17 times more tokens with FriendliAI’s meta-llama-3-70b-instruct model (1281 tokens) compared to OpenAI’s GPT-4o model (126 tokens).
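If you would like to reproduce a similar chart yourself, here is a minimal matplotlib sketch using the token counts measured above:

```python
import matplotlib.pyplot as plt

# Token counts measured in the $0.001 budget experiments above.
models = ["FriendliAI\nmeta-llama-3-70b-instruct", "OpenAI\ngpt-4o"]
tokens = [1281, 126]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(models, tokens)
ax.set_ylabel("Tokens generated with a $0.001 budget")
ax.set_title("Tokens per $0.001 on LiteLLM")
for i, count in enumerate(tokens):
    ax.text(i, count, str(count), ha="center", va="bottom")  # label each bar
plt.tight_layout()
plt.show()
```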
Similarly, we can compare the cost per 1M tokens for FriendliAI and OpenAI models as below:
```python
from litellm import model_cost

print(model_cost["friendliai/meta-llama-3-70b-instruct"]["input_cost_per_token"] * 1000000)   # $0.6 per 1M tokens
print(model_cost["friendliai/meta-llama-3-70b-instruct"]["output_cost_per_token"] * 1000000)  # $0.6 per 1M tokens
print(model_cost["gpt-4o"]["input_cost_per_token"] * 1000000)   # $5 per 1M tokens
print(model_cost["gpt-4o"]["output_cost_per_token"] * 1000000)  # $15 per 1M tokens
```
Conclusion
This tutorial shows basic examples of integrating LiteLLM with Friendli Serverless Endpoints for chat completions. We also demonstrate LiteLLM’s budget manager to limit user inference costs. Combining these learnings, we present a practical experiment that calculates the number of inference requests that could be made under a specific budget.
Remember, this is just a starting point – feel free to experiment and customize the process to suit your specific needs using Friendli Endpoints on LiteLLM’s versatile platform!
Written by
FriendliAI Tech & Research