October 8, 2022

Serve generative AI models like T5 faster than ever with Friendli Inference (32.8x higher throughput for T5-3B)

In our previous blog posts (#1, #2), we showed the performance gains of Friendli Inference (also known as PeriFlow or Orca) on GPT-3, a popular generative AI model. Orca consistently outperformed Triton + FasterTransformer on models of various sizes, from 345M and 1.3B up to 175B parameters. GPT is a decoder-only model. Today, we look at the performance of Friendli Inference on Google’s T5, a Transformer-based encoder-decoder model that is widely used in machine translation.

T5, or Text-To-Text Transfer Transformer, is a neural network model that casts practically any language task into a text-to-text format: both the input and the output are plain text strings. It differs from models such as BERT (encoder-only) or GPT (decoder-only) in that its architecture includes both an encoder and a decoder.
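To make the text-to-text idea concrete, here is a minimal sketch of how different tasks can be cast as plain input strings by prepending a task prefix. The two prefixes follow the convention used by the publicly released T5 checkpoints; the helper name `to_text_to_text` and the task keys are our own illustrative choices, not part of any library.

```python
# Illustrative only: cast different tasks into a single text-to-text input format.
# TASK_PREFIXES keys and the helper name are hypothetical; the prefix strings
# follow the convention used by the released T5 checkpoints.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}


def to_text_to_text(task: str, text: str) -> str:
    """Return the input string an encoder-decoder model such as T5 would receive."""
    if task not in TASK_PREFIXES:
        raise KeyError(f"unknown task: {task}")
    return TASK_PREFIXES[task] + text


if __name__ == "__main__":
    # Every task becomes "prefix + text"; the model's answer is also plain text.
    print(to_text_to_text("translate_en_de", "That is good."))
    print(to_text_to_text("summarize", "Orca serves Transformer models."))
```

The encoder reads the prefixed string once, and the decoder then generates the answer token by token, which is why the number of generated tokens matters for the latency metric used in the evaluation.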

We ran our evaluation on an NVIDIA A10G GPU with a T5 model with 3B parameters. The figure below shows throughput and mean normalized latency. Each request in the trace requires a different processing time, roughly in proportion to the number of tokens it generates, so we report the mean of each request's latency divided by its number of generated tokens.
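The mean normalized latency described above can be sketched as follows. This is a minimal illustration, not the evaluation harness used for the figure: the `Request` record, its field names, and the sample values are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    # Hypothetical record of one served request (field names are assumptions).
    latency_s: float       # end-to-end latency of the request, in seconds
    generated_tokens: int  # number of tokens generated for the request


def mean_normalized_latency(requests: List[Request]) -> float:
    """Mean over requests of (latency / generated tokens), in seconds per token."""
    if not requests:
        raise ValueError("need at least one request")
    per_request = []
    for r in requests:
        if r.generated_tokens <= 0:
            raise ValueError("each request must generate at least one token")
        # Normalize first, then average, so long and short requests weigh equally.
        per_request.append(r.latency_s / r.generated_tokens)
    return sum(per_request) / len(per_request)


if __name__ == "__main__":
    # Made-up sample values, not measurements from the evaluation.
    sample = [Request(latency_s=1.0, generated_tokens=50),
              Request(latency_s=0.5, generated_tokens=10)]
    print(f"{mean_normalized_latency(sample) * 1000:.1f} ms/token")
```

Note that each request is normalized before averaging. Dividing total latency by total generated tokens would instead weight long requests more heavily, which is a different metric.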

Figure: Throughput and mean normalized latency of Orca compared with Triton + FasterTransformer

At the same latency level of 24 ms/token, Orca achieves 32.8x higher throughput than NVIDIA Triton Inference Server with FasterTransformer as its backend engine.
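Comparing two systems "at the same latency level" means reading each system's throughput off its latency-throughput curve at one common latency and taking the ratio. The sketch below shows one way to do that with linear interpolation between measured points. The function name, the units, and all numbers are placeholders chosen for the example; they are not the data behind the figure, and we do not know how the original comparison was computed.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (normalized latency, throughput); units are assumptions


def throughput_at_latency(curve: List[Point], target_latency: float) -> float:
    """Linearly interpolate a system's throughput at a given latency level."""
    pts = sorted(curve)
    if len(pts) < 2:
        raise ValueError("need at least two measured points")
    if not pts[0][0] <= target_latency <= pts[-1][0]:
        raise ValueError("target latency is outside the measured range")
    for (l0, t0), (l1, t1) in zip(pts, pts[1:]):
        if l0 <= target_latency <= l1:
            if l1 == l0:
                return max(t0, t1)
            w = (target_latency - l0) / (l1 - l0)
            return t0 + w * (t1 - t0)
    raise ValueError("no segment contains the target latency")


if __name__ == "__main__":
    # Placeholder curves, NOT measurements: (latency in ms/token, throughput in req/s).
    system_a = [(10.0, 1.0), (20.0, 3.0), (30.0, 4.0)]
    system_b = [(10.0, 0.5), (30.0, 1.5)]
    target = 20.0
    ratio = throughput_at_latency(system_a, target) / throughput_at_latency(system_b, target)
    print(f"throughput ratio at {target} ms/token: {ratio:.1f}x")
```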

Orca provides significantly higher throughput and lower latency than Triton + FasterTransformer. Notably, as the load becomes heavier, Orca yields higher throughput with a relatively small increase in latency.

Regardless of the Transformer model architecture, whether it is a decoder-only or an encoder-decoder model, Orca continues to outperform existing serving systems.

*Orca is a product developed by FriendliAI. We provide Friendli Suite, an end-to-end AI development and serving service, which offers highly optimized engines (e.g., Orca) for Transformer models. For more information, check the link.

Written by FriendliAI Tech & Research
