Groundbreaking Performance of the Friendli Engine for LLM Serving on an NVIDIA H100 GPU

Empowering the future of generative AI, FriendliAI offers Friendli Engine (formerly known as ‘PeriFlow’), a revolutionary serving engine that accelerates and democratizes the deployment of generative AI models (e.g., LLMs), making them more accessible to everyone. In this analysis, we are excited to share that our engine flexibly adapts to the new NVIDIA H100 GPUs, achieving a 4x throughput improvement over the previous-generation NVIDIA A100 GPUs.

As generative AI models rapidly grow and become more widely used in our daily lives, deploying them requires ever more computational resources. To keep pace with these requirements, accelerators are also evolving rapidly. Among them, NVIDIA GPUs are the most widely used hardware for serving generative AI models, backed by a large community of users.

Last year, NVIDIA announced the H100 GPU, based on the new Hopper architecture. As shown in the table below, in addition to the increased clock frequency, the NVIDIA H100 SXM5 80GB GPU provides 22% more streaming multiprocessors (SMs) than the prior generation (the A100 80GB GPU), each of which is 2x faster thanks to the new generation of Tensor Cores. As a result, the new hardware raises the bar for LLM serving performance.

Serving Llama 2 70B Using Friendli Engine on NVIDIA H100 GPUs

Key Metrics

  • Latency: The time it takes for the inference serving engine to generate its full response.
  • Throughput: The total number of output tokens, per GPU, per second, that the inference serving engine can generate across all users and requests on the engine.
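To make these two definitions concrete, here is a minimal sketch of how they can be computed from per-request benchmark logs. The `RequestRecord` structure and helper names are hypothetical illustrations, not the engine's actual instrumentation:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    start_s: float      # time the request was sent (seconds)
    end_s: float        # time the full response finished (seconds)
    output_tokens: int  # number of output tokens generated for this request

def latency_s(r: RequestRecord) -> float:
    """End-to-end latency: time to generate the full response."""
    return r.end_s - r.start_s

def throughput_tok_per_gpu_s(records: list[RequestRecord], num_gpus: int) -> float:
    """Output tokens per GPU per second, across all users and requests."""
    total_tokens = sum(r.output_tokens for r in records)
    wall_clock_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    return total_tokens / wall_clock_s / num_gpus

# Example: two overlapping requests served on one GPU.
records = [RequestRecord(0.0, 2.0, 100), RequestRecord(1.0, 3.0, 110)]
print(latency_s(records[0]))                 # 2.0 seconds
print(throughput_tok_per_gpu_s(records, 1))  # 210 tokens / 3 s = 70.0 tok/s/GPU
```

Note that because requests overlap in an online serving scenario, aggregate throughput can exceed what any single request observes, which is why the two metrics must be reported together.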

To demonstrate how the Friendli Engine efficiently serves LLMs on the new hardware, we compare the performance of our engine on Meta’s Llama 2 70B chat model on NVIDIA A100 80GB GPUs and NVIDIA H100 80GB GPUs, using Databricks’ Dolly dataset. We assumed an online serving scenario, where each request arrives at a different time. The first graph shows the average throughput (i.e., higher is better) on the different GPUs. At the same level of latency, our engine on the NVIDIA H100 GPUs achieves 4 times higher throughput than on the NVIDIA A100 GPUs.
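A common way to emulate such an online workload is to draw request inter-arrival times from an exponential distribution (a Poisson arrival process). The post does not specify the arrival distribution used, so the sketch below is an assumption for illustration only:

```python
import random

def poisson_arrival_times(rate_rps: float, num_requests: int, seed: int = 0) -> list[float]:
    """Generate arrival timestamps for a Poisson arrival process.

    rate_rps: average request rate (requests per second), i.e. the load level.
    Inter-arrival gaps are exponentially distributed with mean 1 / rate_rps.
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(num_requests):
        t += rng.expovariate(rate_rps)  # gap until the next request
        arrivals.append(t)
    return arrivals

# Doubling the rate (e.g., a "1N" -> "2N" workload) roughly halves the
# average gap between consecutive requests.
base_load = poisson_arrival_times(rate_rps=1.0, num_requests=1000)
double_load = poisson_arrival_times(rate_rps=2.0, num_requests=1000)
```

Each generated timestamp tells the benchmark driver when to submit the next request, so the serving engine sees requests trickle in asynchronously rather than all at once.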

Let’s see the results again from the perspective of latency (i.e., lower is better) for generating tokens under various workloads. The workloads are labeled “1N”, “2N”, and “4N”, signifying different numbers of requests per second. Under the low load, our engine on the NVIDIA H100 GPUs achieves 1.5x lower latency than on the A100 GPUs. Under the higher loads, our engine achieves up to 1.8x lower latency on the NVIDIA H100 GPUs.

Note that the results are based on the FP16 data type. Leveraging the new data type, FP8, supported by the Hopper architecture, we can improve the performance even further, which we are planning to report on in the near future. Stay tuned for further enhancements on our outstanding LLM serving on the Hopper architecture!

Conclusion

FriendliAI provides Friendli Engine, an optimized serving engine for generative AI models. Our engine successfully leverages the boost in serving performance unlocked by the NVIDIA H100 80GB GPUs, achieving 4 times higher throughput than on the NVIDIA A100 80GB GPUs. In an upcoming article, we will compare our engine to other products to further demonstrate its outstanding performance on the new hardware. Get started today with FriendliAI!
