  • December 11, 2023
  • 3 min read

Groundbreaking Performance of the Friendli Engine for LLM Serving on an NVIDIA H100 GPU


Empowering the future of generative AI, FriendliAI offers Friendli Engine (formerly known as ‘PeriFlow’), a revolutionary serving engine that accelerates and democratizes the deployment of generative AI models such as LLMs, making them more accessible to everyone. In this analysis, we are excited to share that our engine adapts flexibly to the new NVIDIA H100 GPUs, achieving a 4x throughput improvement over its performance on the previous-generation NVIDIA A100 GPUs.

As generative AI models rapidly grow in size and become more widely used in our daily lives, deploying them demands ever more computational resources. Accelerators are evolving just as rapidly to keep up, and among them, the most commonly used hardware for serving generative AI models is NVIDIA GPUs, backed by a large community of users.

Last year, NVIDIA announced the H100 GPU, based on the new Hopper architecture. As shown in the table below, in addition to an increased clock frequency, the NVIDIA H100 SXM5 80GB GPU provides 22% more streaming multiprocessors (SMs) than the prior generation (the A100 80GB GPU), each of which is 2x faster thanks to the new generation of Tensor Cores. As a result, the new hardware pushes the boundaries of LLM serving performance.

[Table: Specifications of the NVIDIA A100 80GB and H100 SXM5 80GB GPUs, including SM counts and Tensor Core generation]
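As a rough sanity check on these figures, here is a back-of-the-envelope estimate of the raw compute ratio between the two GPUs, using the published SM counts (108 on the A100 80GB, 132 on the H100 SXM5 80GB) and the 2x per-SM Tensor Core speedup cited above. This is a sketch of peak compute only; it ignores clock, memory-bandwidth, and software effects.

```python
# Back-of-the-envelope A100 vs. H100 compute comparison.
a100_sms = 108                  # SMs on the A100 80GB (published spec)
h100_sms = 132                  # SMs on the H100 SXM5 80GB (published spec)

sm_ratio = h100_sms / a100_sms  # ~1.22, i.e. the "22% more SMs" above
tensor_core_speedup = 2.0       # per-SM gain from the new Tensor Core generation

print(f"SMs: +{sm_ratio - 1:.0%}, raw compute: ~{sm_ratio * tensor_core_speedup:.1f}x")
# -> SMs: +22%, raw compute: ~2.4x
```

The measured 4x serving gain reported below exceeds this raw compute ratio, which suggests the gains also come from higher memory bandwidth, the increased clock frequency, and how well the serving engine exploits the new hardware.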

Serving Llama 2 70B Using Friendli Engine on NVIDIA H100 GPUs

Key Metrics

  • Latency: The time it takes for the inference serving engine to generate its full response.
  • Throughput: The total number of output tokens per GPU, per second, that the inference serving engine can generate across all users and requests on the engine (see the computation sketch below).
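To make these definitions concrete, here is a minimal sketch of how both metrics can be computed from per-request benchmark logs. The RequestLog record and its field names are hypothetical illustrations, not Friendli Engine APIs.

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    """Hypothetical per-request benchmark record."""
    arrival_time: float   # seconds: when the request reached the engine
    finish_time: float    # seconds: when the full response was generated
    output_tokens: int    # tokens in the generated response

def summarize(logs: list[RequestLog], num_gpus: int) -> tuple[float, float]:
    # Latency: time to generate the full response, averaged over requests.
    mean_latency = sum(r.finish_time - r.arrival_time for r in logs) / len(logs)
    # Throughput: output tokens per GPU per second across all requests.
    wall_time = max(r.finish_time for r in logs) - min(r.arrival_time for r in logs)
    throughput = sum(r.output_tokens for r in logs) / wall_time / num_gpus
    return mean_latency, throughput
```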

To demonstrate how the Friendli Engine efficiently serves LLMs on the new hardware, we compare the performance of our engine serving Meta’s Llama 2 70B chat model on NVIDIA A100 80GB GPUs and NVIDIA H100 80GB GPUs, using Databricks’ Dolly dataset. We assumed an online serving scenario, where each request arrives at a different time. The first graph shows the average throughput (i.e., higher is better) on the different GPUs: for the same level of latency, our engine on the NVIDIA H100 GPUs achieves 4 times higher throughput than on the NVIDIA A100 GPUs.

[Figure: For the same level of latency, Friendli Engine on NVIDIA H100 GPUs achieves 4x higher throughput than on NVIDIA A100 GPUs]

Let’s look at the results again from the perspective of latency (i.e., lower is better) for generating tokens under various workloads. The workloads are labeled “1N”, “2N”, and “4N”, signifying different numbers of requests per second. Under the low load, our engine on the NVIDIA H100 GPUs achieves 1.5x lower latency than on the A100 GPUs. Under the higher loads, it achieves up to 1.8x lower latency.

[Figure: Under low load, Friendli Engine on NVIDIA H100 GPUs achieves 1.5x lower latency than on A100 GPUs; under higher load, up to 1.8x lower latency]
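For readers who want to reproduce a similar online scenario, a common approach is an open-loop client that issues requests at a target rate without waiting for earlier responses. Below is a minimal sketch assuming Poisson arrivals; the send_request stub is a hypothetical placeholder, not part of the Friendli Engine API.

```python
import asyncio
import random

async def send_request(prompt: str) -> None:
    """Hypothetical stub: replace with a real call to the serving endpoint."""
    ...

async def open_loop_load(prompts: list[str], requests_per_sec: float) -> None:
    # Exponential inter-arrival gaps yield a Poisson arrival process at the
    # target rate; each request is fired without waiting for earlier ones,
    # so every request arrives at its own point in time.
    tasks = []
    for prompt in prompts:
        tasks.append(asyncio.create_task(send_request(prompt)))
        await asyncio.sleep(random.expovariate(requests_per_sec))
    await asyncio.gather(*tasks)

# Sweeping the "1N", "2N", and "4N" load levels for some base rate N:
#   for rate in (N, 2 * N, 4 * N):
#       asyncio.run(open_loop_load(prompts, rate))
```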

Note that the results above are based on the FP16 data type. By leveraging FP8, the new data type supported by the Hopper architecture, we can improve performance even further, which we plan to report on in the near future. Stay tuned for further enhancements to our outstanding LLM serving on the Hopper architecture!
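To give a feel for why FP8 is promising, here is a quick estimate of the weight memory for Llama 2 70B under each data type. This is a rough sketch that counts model weights only, ignoring activations, the KV cache, and runtime overhead.

```python
params = 70e9                    # Llama 2 70B parameter count
gpu_memory_gb = 80               # per-GPU memory on the A100/H100 80GB

for name, bytes_per_param in (("FP16", 2), ("FP8", 1)):
    weight_gb = params * bytes_per_param / 1e9
    min_gpus = -(-weight_gb // gpu_memory_gb)   # ceiling division
    print(f"{name}: ~{weight_gb:.0f} GB of weights, >= {min_gpus:.0f} GPU(s)")
# -> FP16: ~140 GB of weights, >= 2 GPU(s)
#    FP8:  ~70 GB of weights, >= 1 GPU(s)
```

Halving the bytes per weight also roughly halves the memory traffic needed to generate each token, which matters because LLM decoding is typically memory-bandwidth-bound.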

Conclusion

FriendliAI provides Friendli Engine, an optimized serving engine for generative AI models. Our engine successfully leverages the boost in serving performance unlocked by the NVIDIA H100 80GB GPUs, achieving 4 times higher throughput than on the NVIDIA A100 80GB GPUs. In a follow-up article, we will compare our engine against other products to further demonstrate its outstanding performance on the new hardware. Get started today with FriendliAI!


Written by


FriendliAI Tech & Research

