- September 10, 2024
- 10 min read
A Comparative Analysis of AI API Providers: Based on Llama 3.1 70B
In the world of large language models (LLMs), choosing the right AI API provider can significantly impact your project's efficiency and cost. This blog post will compare the performance metrics of popular API providers for the Llama 3.1 70B model, including:
- Cerebras
- Groq
- FriendliAI
- Together.ai Turbo
- Fireworks
- Lepton AI
- Perplexity
- OctoAI
- Databricks
- Amazon
- Deepinfra
- Azure
We'll analyze their performance across three key metrics: output speed, time to first token (TTFT), and total response time. We'll also delve into pricing considerations and context input limitations. Most of the performance metrics are sourced from Artificial Analysis, which provides performance comparisons of AI models and API providers serving different models.
While one might naively look for a single answer (e.g., one provider is inarguably the best) based on a single metric (e.g., output speed), there are many other factors to consider, such as how each provider performs across different input token lengths and the environments in which each provider excels. Depending on the provider, it may be robust under harsh conditions such as long input prompts or a large number of prompts, or it may only perform particularly well under mild conditions with a small number of adequately sized prompts.
This blog post explains these issues based on the metrics, for readers who wish to take a deep dive into interpreting performance metrics for the inference serving of transformer-based generative AI models.
If you wish to jump straight to the conclusion, click here.
Output Speed
Output speed refers to the number of tokens a model can generate per second during the decoding phase of inference serving. Here's a breakdown of the providers' p50 (median) output speed for Llama 3.1 70B:
- Cerebras (446 tokens/second), Groq (250 tokens/second), and FriendliAI (123 tokens/second) lead the pack, each exceeding 100 tokens/second.
- Together.ai (86 tokens/second) comes next, followed by Fireworks (68 tokens/second), Lepton AI (56 tokens/second), and Perplexity (51 tokens/second).
Below is an in-depth analysis of the output speed, also displaying the p5, p25, p75, and p95 values, where you can see the error bars and the variance in output speed. In this graph, a provider with a long lower tail (e.g., Together.ai or Fireworks) sometimes (i.e., in more than 5% of requests) delivers performance noticeably worse than its median. Therefore, providers with higher lower tails (e.g., FriendliAI) or higher median values (e.g., Cerebras and Groq) are considered better.
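To make these numbers concrete, below is a minimal sketch of how one could measure output speed against an OpenAI-compatible endpoint and inspect its percentile spread. The base URL, API key, and model name are placeholders (not any specific provider's actual values), and each streamed chunk is counted as one token as a rough approximation; exact counts would require the provider's tokenizer.

```python
import time

import numpy as np
from openai import OpenAI  # any OpenAI-compatible client works for this sketch

# Placeholder endpoint, key, and model name -- substitute your provider's values.
client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")
MODEL = "llama-3.1-70b"

def measure_output_speed(prompt: str) -> float:
    """Tokens/second during the decoding phase of one streamed request."""
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    first_token_time = None
    n_tokens = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            n_tokens += 1  # approximation: one streamed chunk ~ one token
            if first_token_time is None:
                first_token_time = time.perf_counter()
    if first_token_time is None or n_tokens < 2:
        return float("nan")
    decode_seconds = time.perf_counter() - first_token_time
    return (n_tokens - 1) / decode_seconds

# Repeat the measurement and inspect the spread, not just the median.
speeds = [measure_output_speed("Explain KV caching in two sentences.") for _ in range(20)]
print({p: round(float(np.percentile(speeds, p)), 1) for p in (5, 25, 50, 75, 95)})
```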
TTFT (Time to First Token)
TTFT measures the initial latency during the prefill phase of inference serving, which is the time it takes for the model to generate the first token after receiving a prompt. Here's how the providers stack up for TTFT:
- Perplexity (0.23 seconds) and FriendliAI (0.24 seconds) boast the lowest TTFT.
- OctoAI (0.3 seconds), Deepinfra (0.3 seconds), and Cerebras (0.35 seconds) follow behind.
- Fireworks (0.4 seconds), Groq (0.45 seconds), and Together.ai (0.5 seconds) have slightly higher TTFT.
Also refer to the graph below for an in-depth analysis of TTFT, displaying the p5, p25, p75, and p95 values. We can see that more than 5% of requests suffer from very high TTFTs in the cases of Fireworks and Groq, despite their median values showing mediocre performance within the list. As with output speed, it is better to have a longer lower tail or a shorter upper tail (e.g., FriendliAI) in order to expect stable, high performance across your inference-serving queries.
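TTFT can be measured with the same kind of streaming setup by timing the gap between sending a request and receiving the first content-bearing chunk, as in the hedged sketch below (again with a placeholder endpoint, key, and model name). Repeating this across many requests and taking percentiles reproduces the kind of tail analysis shown above.

```python
import time

from openai import OpenAI

# Placeholder endpoint, key, and model name -- substitute your provider's values.
client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")

def measure_ttft(prompt: str, model: str = "llama-3.1-70b") -> float:
    """Seconds from sending the request until the first content-bearing chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start  # first visible token arrived
    return float("nan")  # stream ended without any content
```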
Total Response Time
Total response time combines the TTFT with the time needed to generate the desired number of output tokens, where the average time to generate each output token is called the time per output token (TPOT). To calculate the total response time for 100 tokens, add the TTFT to 100 times the TPOT (i.e., TTFT + 100 * TPOT). This metric therefore summarizes the results described above.
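As a quick, purely illustrative example of this formula (the figures below are hypothetical, not measurements from any provider in this post):

```python
def total_response_time(ttft_s: float, tpot_s: float, n_tokens: int = 100) -> float:
    """Total response time = TTFT + n_tokens * TPOT (all values in seconds)."""
    return ttft_s + n_tokens * tpot_s

# Hypothetical numbers: a 0.3 s TTFT and a 10 ms TPOT
# yield roughly 1.3 s for 100 output tokens.
print(total_response_time(ttft_s=0.3, tpot_s=0.010))  # 1.3
```

With that formula in mind, here is how the providers compare on total response time for 100 tokens: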
- Cerebras (574ms), Groq (851ms), and FriendliAI (1041ms) show the best overall performance among the API providers, taking roughly one second or less to generate 100 tokens after processing the input tokens.
- Together.ai (1659ms), Fireworks (1864ms), and Perplexity (2176ms) follow behind, taking roughly two seconds or less to perform the same task.
Refer to the graph below for a more detailed analysis of the total response time. We can observe that while Groq has the second-best median performance on the list, it sometimes suffers from high response times, indicated by the upper tail in the graph. Together.ai and Fireworks show similar patterns.
While Cerebras and Groq deliver better output speed (tokens/second) with their custom-made hardware chips, their total response times are not dramatically different due to their higher TTFTs. TTFT captures how quickly the serving engine processes the input tokens during the prefill phase, before the decoding phase begins. While these chips are tailored to deliver blazing output speeds for particular models by mitigating the memory bottleneck of the decoding phase, there is still room for improvement when the engine faces many short requests, where TTFT becomes the dominant factor in total response time.
Pricing
Pricing is another crucial factor to consider. Here's a comparison of the blended prices (cost per 1 million tokens; 3:1 input:output tokens ratio) for these providers:
- Deepinfra ($0.36), FriendliAI ($0.6), Cerebras ($0.6), and Groq ($0.64) are the most economical options.
- Lepton AI ($0.8), Together.ai ($0.88), OctoAI ($0.9), and Fireworks ($0.9) fall within a similar price range.
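For readers who want to map a provider's raw price sheet onto these blended figures, the blended price at a 3:1 input:output ratio is simply a weighted average of the per-million-token input and output prices. The sketch below uses hypothetical prices, not quotes from any provider listed above.

```python
def blended_price(input_usd_per_m: float, output_usd_per_m: float,
                  input_ratio: float = 3.0, output_ratio: float = 1.0) -> float:
    """Blended $/1M tokens, weighting input and output prices by a 3:1 token mix."""
    return (input_ratio * input_usd_per_m + output_ratio * output_usd_per_m) / (
        input_ratio + output_ratio
    )

# Hypothetical price sheet: $0.50/1M input tokens and $0.90/1M output tokens.
print(blended_price(0.50, 0.90))  # -> 0.6 (blended $/1M tokens)
```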
Among the different providers, Cerebras and Groq are also chip manufacturers that build hardware accelerators dedicated to particular model structures (which are often inflexible). The costs of manufacturing these chips and customizing them for particular models are currently hidden within operational serving costs, but they also have to be considered to interpret these prices in more depth.
Other providers usually use conventional GPUs (i.e., GPGPUs) for inference serving, though they differ in whether they physically own and manage the GPUs (i.e., on-premise) or rent them from cloud environments. While hardware accelerators dedicated to executing particular models (e.g., Llama 3.1 70B) may offer better performance, conventional GPUs provide general availability and sustainability for a wider range of users, especially those who already own such GPUs or who do not wish to risk investing in unconventional hardware for their generative AI inference serving.
Context Window Limitations
Depending on the type of workload a user wishes to serve, it's important to note that some providers (e.g., Cerebras) may limit the length of the context window one can provide. Be sure to check each provider's documentation for the specific details.
Comparison by Context Length
In this section, we'll delve deeper into how output speed, TTFT, and total response time vary with context length. This will help you understand how these metrics can be affected by the length and complexity of your prompts. As one might guess, processing more input tokens places an extra burden on the providers, as it simply means more tokens to process. Thus, one provider (i.e., Cerebras) doesn't offer processing of 10K tokens by default, while others suffer performance degradation (e.g., lower output speed; higher TTFT and response times) at longer context lengths. This is also an important factor to consider, as it indicates how robust each provider's inference serving engine is when processing query inputs of various lengths. Note that the graphs in this section are log-scaled in order to display the differences between providers more distinctly.
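If you want to probe this robustness yourself, one rough approach is to repeat the TTFT measurement from earlier while growing the prompt, which makes the prefill cost visible. The sketch below pads a prompt by repeating a filler word (approximating tokens with words); the endpoint, key, and model name are placeholders, and some providers may reject the longest prompts.

```python
import time

from openai import OpenAI

# Placeholder endpoint, key, and model name -- substitute your provider's values.
client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")

def ttft_for_context(n_words: int, model: str = "llama-3.1-70b") -> float:
    """TTFT (seconds) for a prompt padded to roughly n_words tokens."""
    prompt = "Summarize the following text. " + ("lorem " * n_words)
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=64,  # keep the decode phase short; we only care about prefill here
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

# Sweep short to long contexts and watch how the prefill cost grows.
for n_words in (100, 1_000, 10_000):
    print(n_words, round(ttft_for_context(n_words), 2))
```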