- September 10, 2024
- 10 min read
A Comparative Analysis of AI API Providers: Based on Llama 3.1 70B
In the world of large language models (LLMs), choosing the right AI API provider can significantly impact your project's efficiency and cost. This blog post will compare the performance metrics of popular API providers for the Llama 3.1 70B model, including:
- Cerebras
- Groq
- FriendliAI
- Together.ai Turbo
- Fireworks
- Lepton AI
- Perplexity
- OctoAI
- Databricks
- Amazon
- Deepinfra
- Azure
We'll analyze their performance across three key metrics: output speed, time to first token (TTFT), and total response time. We'll also delve into pricing considerations and context input limitations. Most of the performance metrics are sourced from Artificial Analysis, which provides performance comparisons of AI models and API providers serving different models.
While one might naively look for a single answer (e.g., one provider is inarguably the best) based on a single metric (e.g., output speed), there are many other factors to consider, such as how each provider performs across different input token lengths and the environments in which each provider excels. Depending on the provider, it may be robust under harsh conditions such as long input prompts or a large number of prompts, or it may only perform particularly well under mild conditions with a small number of adequately sized prompts.
This blog post explains these issues based on the metrics, for readers who wish to take a deep dive into interpreting performance metrics for the inference serving of transformer-based generative AI models.
If you wish to jump straight to the conclusion, click here.
Output Speed
Output speed refers to the number of tokens a model can generate per second during the decoding phase of inference serving. Here's a breakdown of the providers' p50 (median) output speed for Llama 3.1 70B:
- Cerebras (446 tokens/second), Groq (250 tokens/second), and FriendliAI (123 tokens/second) lead the pack, each exceeding 100 tokens/second.
- Together.ai (86 tokens/second) comes next, followed by Fireworks (68 tokens/second), Lepton AI (56 tokens/second), and Perplexity (51 tokens/second).
Below is an in-depth analysis of the output speed, also displaying the p5, p25, p75, and p95 values, where you can see the error bars and the variance in output speed. In this graph, a provider with a long lower tail (e.g., Together.ai or Fireworks) sometimes (i.e., in more than 5% of requests) delivers performance noticeably worse than its median. Therefore, providers with higher lower tails (e.g., FriendliAI) or higher median values (e.g., Cerebras and Groq) are considered better.
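To make these numbers concrete, below is a minimal sketch of how one could measure output speed against an OpenAI-compatible endpoint and inspect its percentile spread. The base URL, API key, and model name are placeholders (not any specific provider's actual values), and each streamed chunk is counted as one token as a rough approximation; exact counts would require the provider's tokenizer.

```python
import time

import numpy as np
from openai import OpenAI  # any OpenAI-compatible client works for this sketch

# Placeholder endpoint, key, and model name -- substitute your provider's values.
client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")
MODEL = "llama-3.1-70b"

def measure_output_speed(prompt: str) -> float:
    """Tokens/second during the decoding phase of one streamed request."""
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    first_token_time = None
    n_tokens = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            n_tokens += 1  # approximation: one streamed chunk ~ one token
            if first_token_time is None:
                first_token_time = time.perf_counter()
    if first_token_time is None or n_tokens < 2:
        return float("nan")
    decode_seconds = time.perf_counter() - first_token_time
    return (n_tokens - 1) / decode_seconds

# Repeat the measurement and inspect the spread, not just the median.
speeds = [measure_output_speed("Explain KV caching in two sentences.") for _ in range(20)]
print({p: round(float(np.percentile(speeds, p)), 1) for p in (5, 25, 50, 75, 95)})
```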
TTFT (Time to First Token)
TTFT measures the initial latency during the prefill phase of inference serving, which is the time it takes for the model to generate the first token after receiving a prompt. Here's how the providers stack up for TTFT:
- Perplexity (0.23 seconds) and FriendliAI (0.24 seconds) boast the lowest TTFT.
- OctoAI (0.3 seconds), Deepinfra (0.3 seconds), and Cerebras (0.35 seconds) follow behind.
- Fireworks (0.4 seconds), Groq (0.45 seconds), and Together.ai (0.5 seconds) have slightly higher TTFT.
Also refer to the graph below for an in-depth analysis of TTFT, displaying the p5, p25, p75, and p95 values. We can see that more than 5% of requests suffer from very high TTFTs in the cases of Fireworks and Groq, despite their median values showing mediocre performance within the list. As with output speed, it is better to have a longer lower tail or a shorter upper tail (e.g., FriendliAI) in order to expect stable, high performance across your inference-serving queries.
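TTFT can be measured with the same kind of streaming setup by timing the gap between sending a request and receiving the first content-bearing chunk, as in the hedged sketch below (again with a placeholder endpoint, key, and model name). Repeating this across many requests and taking percentiles reproduces the kind of tail analysis shown above.

```python
import time

from openai import OpenAI

# Placeholder endpoint, key, and model name -- substitute your provider's values.
client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")

def measure_ttft(prompt: str, model: str = "llama-3.1-70b") -> float:
    """Seconds from sending the request until the first content-bearing chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start  # first visible token arrived
    return float("nan")  # stream ended without any content
```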
Total Response Time
Total response time combines the TTFT with the time needed to generate the desired number of output tokens, where the average time to generate each output token is called the time per output token (TPOT). To calculate the total response time for 100 tokens, add the TTFT to 100 times the TPOT (i.e., TTFT + 100 * TPOT). This metric therefore summarizes the results described above.
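As a quick, purely illustrative example of this formula (the figures below are hypothetical, not measurements from any provider in this post):

```python
def total_response_time(ttft_s: float, tpot_s: float, n_tokens: int = 100) -> float:
    """Total response time = TTFT + n_tokens * TPOT (all values in seconds)."""
    return ttft_s + n_tokens * tpot_s

# Hypothetical numbers: a 0.3 s TTFT and a 10 ms TPOT
# yield roughly 1.3 s for 100 output tokens.
print(total_response_time(ttft_s=0.3, tpot_s=0.010))  # 1.3
```

With that formula in mind, here is how the providers compare on total response time for 100 tokens: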
- Cerebras (574ms), Groq (851ms), and FriendliAI (1041ms) show the best overall performance among the API providers, taking roughly one second or less to generate 100 tokens after processing the input tokens.
- Together.ai (1659ms), Fireworks (1864ms), and Perplexity (2176ms) follow behind, taking roughly two seconds or less to perform the same task.
Refer to the graph below for a more detailed analysis of the total response time. We can observe that while Groq has the second-best median performance on the list, it sometimes suffers from high response times, indicated by the upper tail in the graph. Together.ai and Fireworks show similar patterns.
While Cerebras and Groq deliver better output speed (tokens/second) with their custom-made hardware chips, their total response times are not dramatically different due to their higher TTFTs. TTFT captures how quickly the serving engine processes the input tokens during the prefill phase, before the decoding phase begins. While these chips are tailored to deliver blazing output speeds for particular models by mitigating the memory bottleneck of the decoding phase, there is still room for improvement when the engine faces many short requests, where TTFT becomes the dominant factor in total response time.
Pricing
Pricing is another crucial factor to consider. Here's a comparison of the blended prices (cost per 1 million tokens; 3:1 input:output tokens ratio) for these providers:
- Deepinfra ($0.36), FriendliAI ($0.6), Cerebras ($0.6), and Groq ($0.64) are the most economical options.
- Lepton AI ($0.8), Together.ai ($0.88), OctoAI ($0.9), and Fireworks ($0.9) fall within a similar price range.
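For readers who want to map a provider's raw price sheet onto these blended figures, the blended price at a 3:1 input:output ratio is simply a weighted average of the per-million-token input and output prices. The sketch below uses hypothetical prices, not quotes from any provider listed above.

```python
def blended_price(input_usd_per_m: float, output_usd_per_m: float,
                  input_ratio: float = 3.0, output_ratio: float = 1.0) -> float:
    """Blended $/1M tokens, weighting input and output prices by a 3:1 token mix."""
    return (input_ratio * input_usd_per_m + output_ratio * output_usd_per_m) / (
        input_ratio + output_ratio
    )

# Hypothetical price sheet: $0.50/1M input tokens and $0.90/1M output tokens.
print(blended_price(0.50, 0.90))  # -> 0.6 (blended $/1M tokens)
```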
Among the different providers, Cerebras and Groq are also chip manufacturers that build hardware accelerators dedicated to particular model structures (which are often inflexible). The costs of manufacturing these chips and customizing them for particular models are currently hidden within operational serving costs, but they also have to be considered to interpret these prices in more depth.
Other providers usually use conventional GPUs (i.e., GPGPUs) for inference serving, though they differ in whether they physically own and manage the GPUs (i.e., on-premise) or rent them from cloud environments. While hardware accelerators dedicated to executing particular models (e.g., Llama 3.1 70B) may offer better performance, conventional GPUs provide general availability and sustainability for a wider range of users, especially those who already own such GPUs or who do not wish to risk investing in unconventional hardware for their generative AI inference serving.
Context Window Limitations
Depending on the type of workload a user wishes to serve, it's important to note that some providers (e.g., Cerebras) may limit the length of the context window one can provide. Be sure to check each provider's documentation for the specific details.
Comparison by Context Length
In this section, we'll delve deeper into how output speed, TTFT, and total response time vary with context length. This will help you understand how these metrics can be affected by the length and complexity of your prompts. As one might guess, processing more input tokens places an extra burden on the providers, as it simply means more tokens to process. Thus, one provider (i.e., Cerebras) doesn't offer processing of 10K tokens by default, while others suffer performance degradation (e.g., lower output speed; higher TTFT and response times) at longer context lengths. This is also an important factor to consider, as it indicates how robust each provider's inference serving engine is when processing query inputs of various lengths. Note that the graphs in this section are log-scaled in order to display the differences between providers more distinctly.
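If you want to probe this robustness yourself, one rough approach is to repeat the TTFT measurement from earlier while growing the prompt, which makes the prefill cost visible. The sketch below pads a prompt by repeating a filler word (approximating tokens with words); the endpoint, key, and model name are placeholders, and some providers may reject the longest prompts.

```python
import time

from openai import OpenAI

# Placeholder endpoint, key, and model name -- substitute your provider's values.
client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")

def ttft_for_context(n_words: int, model: str = "llama-3.1-70b") -> float:
    """TTFT (seconds) for a prompt padded to roughly n_words tokens."""
    prompt = "Summarize the following text. " + ("lorem " * n_words)
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=64,  # keep the decode phase short; we only care about prefill here
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

# Sweep short to long contexts and watch how the prefill cost grows.
for n_words in (100, 1_000, 10_000):
    print(n_words, round(ttft_for_context(n_words), 2))
```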