July 18, 2022
7 min read

Friendli Inference: How to Serve Large-scale Transformer Models

Transformer models have recently been transforming the landscape in deep learning, particularly in natural language processing, thanks to their excellence in tracking the relations between sequential data, such as words in a sentence. Among some of the popular pre-trained Transformers are PaLM from Google (Chowdhery et al, 2022), Gopher from DeepMind (Rae et al, 2022) and OPT from Facebook (Zhang et al, 2022). On the other hand, these state-of-the-art models can be bulky and resource-hungry, making them expensive to utilize. GPT-3 (Brown et al, 2020), for example, has 175 billion parameters — and serving models of this size can incur high costs due to the massive computational overhead.

Hence we, at FriendliAI, implemented a distributed serving system called Friendli Inference (a.k.a. PeriFlow or Orca) for Transformer-based generative models. Friendli Inference is in production use. We provide Friendli Container, Friendli Dedicated Endpoints (managed cloud service), and Friendli Serverless Endpoints.

Our evaluation on a GPT-3 175B model shows that Friendli Inference (Orca) can significantly outperform NVIDIA FasterTransformer in terms of both latency and throughput: 36.9X throughput improvement at the same level of latency. The work was presented in OSDI ‘22 as well.

System architecture and the overall workflow of Friendli Inference

The above diagram shows the system architecture and the overall workflow of Friendli Inference (Orca). Friendli Inference exposes an endpoint where inference requests arrive and responses are sent out. The endpoint assigns newly arrived requests in the request pool, which manages all requests in the system during their lifetime. Then, the pool is managed by the scheduler. How the scheduler works is explained in more detail in the following section.

Friendli Inference (Orca) was built based on two key technologies — iteration-level scheduling and selective batching. These two techniques have been devised to solve the limitations in existing serving systems.

Iteration-level scheduling

Iteration-level scheduling is a new scheduling mechanism that schedules execution at the granularity of iteration. We define a single run of all layers as an iteration of the model. Existing model serving systems are mostly designed to schedule executions at request granularity. That is, the serving system and the execution engine interact with one another only when (1) the serving system schedules the next batch of requests on an idle engine; or (2) the engine finishes processing requests in the current batch. This can be problematic in the serving of generative models, since different requests in the batch can require different numbers of iterations, resulting in some requests finishing earlier than the others.

Case where the serving system schedules the engine at request granularity

The above illustration shows the case where the serving system schedules the engine at request granularity. Here, although request x₂ finished earlier than x₁, it went through some extra computation (iter 3 & 4) until x₁ was finished. Such behavior limits the efficiency of batched execution. Furthermore, the early-finished requests cannot return to the client promptly, as the engine would return execution results to the serving system only after it finished processing every request in the batch. Similarly, when a new request arrives in the middle of an execution, it must wait until the current batch is completely processed.

All of this extra latency has one cause in common: a blunt scheduling mechanism that operates at the request-level.

By scheduling the system with a finer granularity — iteration, rather than request — the system now has much-needed fluidity.

What do we mean, then, by iteration-level operation? The scheduler basically repeats the following procedure:
(1) selects requests to run next;
(2) invokes the engine to execute one iteration for the selected requests;
(3) receives execution results for the scheduled iteration.

The GIF below shows the process in animation.

Iteration-level operation scheduler

As the scheduler receives a result after every iteration, it can detect the completion of requests, thereby returning finished ones immediately to the client. On top of that, newly arrived requests only have to wait for a single iteration, which significantly reduces the queueing delay.

Selective batching

When using iteration-level scheduling in practice, one major challenge becomes batching. For computational efficiency, it is a must that the engine is able to process any selected set of requests in a batched manner. However, batching becomes tricky under the new scheduling scheme as the requests in the pool can have different characteristics. This inconvenience can be addressed by selectively applying batching to different types of operations, hence the name selective batching.

Selective batching

The above diagram shows a system overview of Friendli Inference (Orca) with four requests. xᵢⱼ is the j-th token of the i-th request. Shaded tokens represent input tokens received from the clients, while unshaded tokens are the ones generated by Friendli Inference (Orca).

Operations in Transformer layers can be divided into two types: Attention and non-Attention. Non-Attention operations such as Linear, Layer-Norm and GeLU operations do not require distinguishing tensor elements of different requests. Therefore, requests can be batched token-wise, rather than request-wise.

For example, in the diagram, when the scheduler decides to run an iteration on batch (x₁,x₂,x₃,x₄), the inputs x₃,x₄ cannot coalesce into a single tensor of shape [B,L,…] as they have different number of tokens (L), 2 and 3 (B: batch size, L: number of tokens processed together). However, the two requests can still compose a tensor of shape [∑L,…]=[5,…] without an explicit batch dimension before being fed into non-Attention operations. So token-wise batching can be achieved with non-Attention operations.

With Attention operations, however, batching — request-wise or token-wise — cannot be performed. This is because the Attention operation requires a notion of requests (i.e., requires a batch dimension) to compute attention only between the tokens of the same request. To refer again to the diagram above, request x₁ and x₂ have different number of total tokens (input + generated), 4 and 2 respectively, which means that the tensors cannot be merged into a single tensor. Unlike previously, they cannot be flattened by ignoring the batch dimension either, since x₁ and x₂ still need to be distinguished request-wise before they can be fed into Attention operations.

So this is where our second solution, selective batching, comes into play. Instead of batching all operations (both Attention & non-Attention) composing the model, it selectively applies batching only to a handful of operations (non-Attention). It splits the batch and processes each request individually for the Attention operation while applying token-wise (instead of request-wise) batching to non-Attention operations. Our findings show that the decision not to batch the execution of the Attention operation only has a small impact on efficiency, too. In conclusion, selective batching allows for higher flexibility in composing requests as a batch.

Selective batching

Input tensors are of the dimension [L,H] where L denotes the number of tokens processed together, and Ha hidden size of the model.

Performance evaluation

Performance evaluation displaying median of normalized latency and throughput

End-to-End Results; GPT-3 175B, 16 GPUs; Latency normalized by output length

We ran our evaluation on Azure ND96asr A100 v4 VMs, each equipped with 8 NVIDIA 40-GB A100 GPUs connected over NVLink. Here, GPT-3 with 175B parameters was used as a representative example of Transformer-based generative models. We compared Orca against FasterTransformer (baseline), an inference engine that supports large scale Transformer models via distributed execution. Because FasterTransformer does not have its own scheduler, we implemented a custom scheduler that mimics the batching scheduler of the NVIDIA Trition inference server.

Friendli Inference (Orca) outperforms FasterTransformer in terms of both latency and throughput. For example, to match a latency of 190ms, FasterTransformer provides a throughput of 0.185 req/s whereas Friendli Inference provides a throughput of 6.81 req/s, which is a 36.9X speedup.

Conclusion

Despite the rising importance of Transformer models, existing serving systems all had one major limitation: low efficiency with early-finished and late-joining requests. Hence we came up with a new scheduling mechanism, iteration-level scheduling. This technique enables the scheduler to interact with the engine at iteration — rather than request — granularity, thereby imbuing the system with more flexibility. However, this new way of scheduling brought certain inconveniences within batching, which we again solved by introducing a technique called selective batching. It applies token-wise batching to all non-Attention operations while individually processing the Attention operations.

And experiments show the effectiveness of our approach: the Friendli Inference (Orca) provides an order of magnitude higher throughput than current state-of-the-art systems at the same level of latency.

With Friendli Inference (Orca), utilizing large-scale AI has been taken to a whole new level. Please contact FriendliAI if you are interested!

*The research on Orca was presented in OSDI 2022, on July 12th. You can read the paper here.

**Orca was developed by FriendliAI. We provide the end-to-end AI development platform Friendli Suite as our product. For more information, check the link.

Written by

FriendliAI Tech & Research

General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: Unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you in here.

How does FriendliAI help my business?

Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for that key issue that is slowing your growth, contact@friendli.ai or click Talk to an expert — our experts (not a bot) will reply within one business day.