• August 4, 2022
  • 2 min read

Friendli Inference: How Good is it on Small Models?

In our previous blog post, Friendli Inference: How to Serve Large-scale Transformer Models, we showed the dramatic performance gains and cost savings of using Friendli Inference (a.k.a. PeriFlow or Orca) to run large-scale generative models like GPT 175B, thanks to our patented technologies. Since then, we have received many inquiries about Orca's performance when serving smaller generative models (e.g., models with a few billion parameters) on a single GPU.

Yes, Orca significantly outperforms FasterTransformer for models ranging from hundreds of millions to a few billion parameters! And we at FriendliAI are still working non-stop on optimizing Orca for small models as well as large ones.

Today, we are going to compare Orca against FasterTransformer again, but with smaller models this time: GPT 1.3B and GPT 345M.

In both cases, we ran our evaluation on an NVIDIA A10G GPU. The figures below show throughput and mean normalized latency. Since each request in the trace requires a different processing time, roughly proportional to the number of generated tokens, we report the mean latency normalized by the number of generated tokens of each request.
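To make this metric concrete, here is a minimal sketch in Python of how mean normalized latency can be computed from a request trace. This is not our actual benchmark harness, and the field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RequestResult:
    latency_s: float        # end-to-end latency of the request, in seconds
    generated_tokens: int   # number of tokens the model generated

def mean_normalized_latency(results: list[RequestResult]) -> float:
    """Mean of each request's latency divided by its generated-token
    count, reported in ms/token."""
    per_token_ms = [r.latency_s * 1000 / r.generated_tokens for r in results]
    return sum(per_token_ms) / len(per_token_ms)

# Example: two requests with different output lengths.
trace = [RequestResult(latency_s=0.9, generated_tokens=100),
         RequestResult(latency_s=2.2, generated_tokens=200)]
print(f"{mean_normalized_latency(trace):.1f} ms/token")  # -> 10.0 ms/token
```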

In our last blog post, when comparing Orca against FasterTransformer, we implemented a custom scheduler that mimics the batching scheduler of the NVIDIA Triton Inference Server, because FasterTransformer does not have a scheduler of its own. Note that this time we used the actual NVIDIA Triton Inference Server.
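For readers unfamiliar with what such a batching scheduler does, below is a highly simplified, hypothetical sketch of request-level dynamic batching in the spirit of Triton's scheduler (the names and parameters are illustrative, not Triton's actual API). The key point is that the whole batch runs to completion together, whereas Orca schedules at iteration granularity, so finished requests can leave the batch early:

```python
import queue
import time

def dynamic_batching_loop(request_queue: queue.Queue,
                          run_batch,                 # callable: runs one batch to completion
                          max_batch_size: int = 8,
                          batch_timeout_s: float = 0.01) -> None:
    """Request-level batching: collect requests until the batch is full or
    a timeout expires, then run the whole batch; every request then waits
    for the slowest (longest-generating) request in its batch to finish."""
    while True:
        batch = [request_queue.get()]  # block until at least one request arrives
        deadline = time.monotonic() + batch_timeout_s
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        run_batch(batch)
```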

GPT 1.3B with A10G GPU

Figure: mean normalized latency and throughput of Orca vs. FasterTransformer on GPT 1.3B (A10G GPU)

At the same latency level of 11 ms/token, Orca achieves 55.4X higher throughput than FasterTransformer. GPT-Neo, for instance, is a Transformer-based generative model of the same size.

GPT 345M with A10G GPU

Figure: mean normalized latency and throughput of Orca vs. FasterTransformer on GPT 345M (A10G GPU)

At the same latency level of 11 ms/token, Orca achieves 26.1X higher throughput than FasterTransformer. GPT-2 Medium (355M) is an example of a similar-sized model.

Summary

As you can see, Orca provides significantly higher throughput and lower latency than NVIDIA FasterTransformer. As the load becomes heavier, Orca sustains higher throughput with only a relatively small increase in latency.

Regardless of model size, large or small, Orca continues to outperform existing serving systems. We hope these results make Orca helpful to a broader range of users, from companies running heavy models to those working with relatively small ones.

*The research on Orca was presented at OSDI 2022 on July 12th. You can read the paper here.

**Orca was developed by FriendliAI. We provide the end-to-end AI development platform Friendli Suite as our product. For more information, check the link.


Written by

FriendliAI Tech & Research


General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Our Friendli Inference allows you to squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing
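As a back-of-the-envelope illustration (all numbers below are hypothetical, not measured rates), tokens per dollar is simply sustained throughput times seconds per hour, divided by the hourly GPU rate:

```python
def tokens_per_dollar(throughput_tok_per_s: float, hourly_gpu_rate_usd: float) -> float:
    """Tokens generated per dollar of GPU time."""
    return throughput_tok_per_s * 3600 / hourly_gpu_rate_usd

# Hypothetical numbers: same hourly rate, 4x the per-GPU throughput.
baseline  = tokens_per_dollar(1_000, 2.0)   # 1,800,000 tokens per dollar
optimized = tokens_per_dollar(4_000, 2.0)   # 7,200,000 tokens per dollar
print(f"{baseline:,.0f} vs {optimized:,.0f} tokens per dollar")
```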

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub takes you to our model deployment page for a one-click deploy. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for that key issue that is slowing your growth, email contact@friendli.ai or click Contact Sales; our experts (not a bot) will reply within one business day.