
  • October 8, 2022
  • 1 min read

Serve generative AI models like T5 faster than ever with Friendli Inference (32.8x faster for T5-3B)


In our previous blog posts (#1, #2), we showed the performance gains of Friendli Inference (aka PeriFlow or Orca) on GPT-3, a popular generative AI model. Orca consistently outperformed Triton + FasterTransformer across model sizes, from 345M and 1.3B all the way up to 175B parameters. GPT is a decoder-only model. Today, we look at the performance of Friendli Inference on Google’s T5, a Transformer-based encoder-decoder model that is widely used in machine translation.

T5, or Text-To-Text Transfer Transformer, is a neural network model that casts practically any language task into a text-to-text format. It differs from encoder-only models such as BERT and decoder-only models such as GPT in that its architecture incorporates both an encoder and a decoder.
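
For readers unfamiliar with the text-to-text framing, here is a minimal sketch using the open-source Hugging Face transformers library (not Friendli Inference itself); the checkpoint name, prompt, and generation settings are illustrative only.

```python
# Illustrative sketch of T5's text-to-text interface via Hugging Face
# transformers; this is NOT the FriendliAI serving stack.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-3b")
model = T5ForConditionalGeneration.from_pretrained("t5-3b")

# Every task is phrased as text in, text out; here, machine translation.
inputs = tokenizer(
    "translate English to German: The house is wonderful.",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```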

We ran our evaluation of a T5 model with 3B parameters on an NVIDIA A10G GPU. The figure below shows throughput and mean normalized latency. Since each request in the trace requires a different processing time, roughly proportional to the number of tokens it generates, we report mean latency normalized by the number of generated tokens of each request.
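
As a concrete illustration of this metric (the sample numbers below are invented, not measurements), the sketch divides each request’s end-to-end latency by its generated token count and averages the results:

```python
# Hedged sketch of mean normalized latency: per-request latency divided by
# the number of generated tokens, averaged over all requests in the trace.
def mean_normalized_latency(latencies_s, generated_tokens):
    per_token = [lat / toks for lat, toks in zip(latencies_s, generated_tokens)]
    return sum(per_token) / len(per_token)

# e.g., three requests: (end-to-end latency in seconds, tokens generated)
lats = [1.2, 0.6, 2.4]
toks = [50, 25, 100]
print(f"{mean_normalized_latency(lats, toks) * 1000:.1f} ms/token")  # 24.0 ms/token
```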

Figure: Throughput and mean normalized latency of Orca versus the Triton + FasterTransformer combination.

At the same latency level of 24 ms/token, Orca delivers 32.8x higher throughput than NVIDIA Triton Inference Server with FasterTransformer as its backend engine.

Orca provides significantly higher throughput and lower latency than Triton + FasterTransformer. Notably, as the load becomes heavier, Orca yields higher throughput with a relatively small increase in latency.

Regardless of the Transformer model architecture, whether it is a decoder-only or an encoder-decoder model, Orca continues to outperform existing serving systems.

*Orca is a product developed by FriendliAI. We provide Friendli Suite, an end-to-end AI development and serving service that offers highly optimized engines (e.g., Orca) for Transformer models. For more information, check the link.


Written by

FriendliAI Tech & Research


General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Our Friendli Inference lets you squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the metric that actually matters, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing
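
As a back-of-the-envelope illustration of that claim (all numbers below are hypothetical, not FriendliAI pricing or benchmarks), tokens per dollar is just sustained throughput divided by hourly GPU cost:

```python
# Hypothetical arithmetic only: tokens per dollar = throughput / hourly cost.
def tokens_per_dollar(tokens_per_sec, dollars_per_hour):
    return tokens_per_sec * 3600 / dollars_per_hour

baseline = tokens_per_dollar(1_000, 2.0)   # 1.8M tokens per dollar
optimized = tokens_per_dollar(3_000, 2.0)  # 5.4M tokens per dollar, same GPU rate
print(optimized / baseline)                # 3.0x more tokens per dollar
```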

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multimodal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub takes you to our model deployment page in one click. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Talk to an expert; our experts (not a bot) will reply within one business day.

