PeriFlow: How Good is it on Small Models?


We showed the dramatic performance gains (and cost savings) of PeriFlow (aka Orca) when running large-scale generative models like GPT 175B, thanks to our patented technologies, in our previous blog post PeriFlow: How to Serve Large-scale Transformer Models. Since then, we have been receiving lots of inquiries about Orca's performance when serving smaller generative models (e.g., models with a few billion parameters) on a single GPU.

Yes, Orca significantly outperforms FasterTransformer for models with hundreds of millions to a few billion parameters! And we at FriendliAI are still working non-stop on optimizing Orca for small models as well as large ones.

Today, we are going to compare Orca against FasterTransformer again, this time with smaller models: GPT 1.3B and GPT 345M.

In both cases, we ran our evaluation on an NVIDIA A10G GPU. The figures below show throughput and mean normalized latency. Since each request in the trace requires a different processing time, which is (roughly) proportional to the number of generated tokens, we report the mean latency normalized by the number of generated tokens of each request.
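As a concrete illustration, here is a minimal sketch of how such a metric can be computed, assuming we have each request's end-to-end latency and its generated-token count (the variable names and example values are hypothetical, not from the actual benchmark):

```python
def mean_normalized_latency(latencies_ms, token_counts):
    """Mean of per-request latency divided by the number of tokens
    that request generated (ms/token)."""
    per_token = [lat / toks for lat, toks in zip(latencies_ms, token_counts)]
    return sum(per_token) / len(per_token)

# Example: three requests with different output lengths.
latencies_ms = [1100.0, 550.0, 2200.0]   # end-to-end latency per request (ms)
token_counts = [100, 50, 200]            # tokens generated per request

print(mean_normalized_latency(latencies_ms, token_counts))  # 11.0 ms/token
```

Normalizing this way keeps a long 200-token response from dominating the average simply because it takes longer end-to-end than a 50-token one.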

In our last blog post, because FasterTransformer does not have its own scheduler, we compared Orca against FasterTransformer paired with a custom scheduler that mimics the batching scheduler of the NVIDIA Triton Inference Server. Note that this time we used the actual NVIDIA Triton Inference Server.

GPT 1.3B with A10G GPU

At the same latency level of 11 ms/token, Orca achieves 55.4x higher throughput than FasterTransformer. GPT-Neo is one example of a Transformer-based generative model of this size.

GPT 345M with A10G GPU

At the latency level of 11 ms/token, Orca achieves 26.1x higher throughput than FasterTransformer. GPT-2 medium (355M) is a similarly sized model.


Here, you can see that Orca provides significantly higher throughput and lower latency than NVIDIA FasterTransformer. As the load becomes heavier, Orca provides higher throughput with a relatively small increase in latency.

Regardless of model size, large or small, Orca continues to outperform existing serving systems. We anticipate these results will help broaden our customer base, from companies running heavy models to those working with relatively small ones.

*The research on Orca was presented at OSDI 2022 on July 12th. You can read the paper here.

**Orca was developed by FriendliAI. We provide the end-to-end AI development platform PeriFlow as our product. For more information, check the link.

