- July 18, 2022
- 7 min read
PeriFlow: How to Serve Large-scale Transformer Models
Transformer models have been transforming the deep learning landscape, particularly in natural language processing, thanks to their excellence at tracking relationships in sequential data, such as the words in a sentence. Popular pre-trained Transformers include PaLM from Google (Chowdhery et al., 2022), Gopher from DeepMind (Rae et al., 2022), and OPT from Facebook (Zhang et al., 2022). These state-of-the-art models, however, are bulky and resource-hungry, making them expensive to use. GPT-3 (Brown et al., 2020), for example, has 175 billion parameters; serving a model of this size incurs high costs due to its computational overhead.
Hence, at FriendliAI, we have built a distributed serving system called PeriFlow (aka Orca) for Transformer-based generative models. PeriFlow is in production use: we offer it as PeriFlow Container and as PeriFlow Cloud, FriendliAI’s managed cloud service.
Our evaluation on a GPT-3 175B model shows that Orca significantly outperforms NVIDIA FasterTransformer in both latency and throughput, delivering a 36.9x throughput improvement at the same level of latency. This work was also presented at OSDI ’22.
The diagram above shows the system architecture and overall workflow of Orca. Orca exposes an endpoint where inference requests arrive and responses are sent out. The endpoint places newly arrived requests into the request pool, which manages all requests in the system throughout their lifetime. The scheduler then operates on this pool; how it works is explained in more detail in the following section.
Orca is built on two key techniques: iteration-level scheduling and selective batching. These techniques were devised to overcome the limitations of existing serving systems.
Iteration-level scheduling is a new scheduling mechanism that schedules execution at the granularity of a single iteration, where we define one run of all the model’s layers as an iteration. Existing model serving systems are mostly designed to schedule execution at request granularity: the serving system and the execution engine interact only when (1) the serving system schedules the next batch of requests on an idle engine, or (2) the engine finishes processing every request in the current batch. This is problematic when serving generative models, since different requests in a batch can require different numbers of iterations, so some requests finish earlier than others.
The illustration above shows the case where the serving system schedules the engine at request granularity. Although request x₂ finished earlier than x₁, it went through extra computation (iterations 3 and 4) until x₁ finished. Such behavior limits the efficiency of batched execution. Furthermore, early-finished requests cannot be returned to the client promptly, because the engine hands execution results back to the serving system only after it finishes processing every request in the batch. Similarly, a request that arrives in the middle of an execution must wait until the current batch is completely processed.
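The cost of request-level scheduling can be seen with a back-of-the-envelope calculation. The snippet below is a toy illustration (hypothetical numbers mirroring the x₁/x₂ example, not measurements): under request-level scheduling every request in a batch occupies engine slots until the longest request finishes, while iteration-level scheduling lets each request occupy only the iterations it actually needs.

```python
# Toy comparison of request-level vs. iteration-level batching.
# Numbers mirror the x1/x2 example above; they are illustrative only.
lengths = {"x1": 4, "x2": 2}  # iterations each request needs

# Request-level: the batch runs until the longest request finishes,
# so every request occupies max(lengths) iteration slots.
request_level = max(lengths.values()) * len(lengths)

# Iteration-level: each request occupies only the iterations it needs.
iteration_level = sum(lengths.values())

print(request_level, iteration_level)  # 8 vs. 6: x2 wastes 2 iterations
```

Here x₂ spends two wasted iterations (3 and 4) in the request-level case, exactly the extra computation described above.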
All of this extra latency has one common cause: a blunt scheduling mechanism that operates at the request level.
By scheduling at a finer granularity, per iteration rather than per request, the system gains the much-needed flexibility.
What, then, does iteration-level operation mean in practice? The scheduler repeatedly performs the following procedure:
(1) selects requests to run next;
(2) invokes the engine to execute one iteration for the selected requests;
(3) receives execution results for the scheduled iteration.
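The three-step loop above can be sketched in a few lines of Python. This is a minimal toy model with hypothetical names (`Request`, `run_one_iteration`, `schedule`), not the actual Orca implementation; the engine is stubbed out to emit one token per request per iteration.

```python
# Sketch of an iteration-level scheduling loop (illustrative only).
from collections import deque

class Request:
    def __init__(self, rid, max_tokens):
        self.rid = rid
        self.max_tokens = max_tokens   # iterations this request needs
        self.generated = []            # tokens produced so far

    def finished(self):
        return len(self.generated) >= self.max_tokens

def run_one_iteration(batch):
    # Stand-in for the execution engine: one token per request.
    return {r.rid: f"tok{len(r.generated)}" for r in batch}

def schedule(pool, max_batch_size=4):
    responses = {}
    while pool:
        # (1) select requests to run next
        batch = list(pool)[:max_batch_size]
        # (2) invoke the engine for exactly one iteration
        results = run_one_iteration(batch)
        # (3) receive results; finished requests leave immediately
        for r in batch:
            r.generated.append(results[r.rid])
            if r.finished():
                pool.remove(r)
                responses[r.rid] = r.generated
        # Newly arrived requests could join `pool` here, before the
        # next iteration, instead of waiting for the whole batch.
    return responses

pool = deque([Request("x1", 4), Request("x2", 2)])
out = schedule(pool)  # x2 returns after 2 iterations, x1 after 4
```

Because the scheduler regains control after every iteration, a finished request can be returned immediately and a new arrival can join the very next iteration, which is exactly what request-level scheduling cannot do.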
The GIF below shows the process in animation.