- July 18, 2022
Friendli Engine: How to Serve Large-scale Transformer Models
Transformer models have been reshaping deep learning, particularly natural language processing, thanks to their ability to track relations within sequential data, such as the words in a sentence. Popular pre-trained Transformers include PaLM from Google (Chowdhery et al., 2022), Gopher from DeepMind (Rae et al., 2022), and OPT from Facebook (Zhang et al., 2022). However, these state-of-the-art models can be bulky and resource-hungry, which makes them expensive to use. GPT-3 (Brown et al., 2020), for example, has 175 billion parameters, and serving models of this size can incur high costs due to the massive computational overhead.
To address this, we at FriendliAI built Friendli Engine (a.k.a. PeriFlow or Orca), a distributed serving system for Transformer-based generative models. Friendli Engine is in production use, and we offer it as Friendli Container, Friendli Dedicated Endpoints (a managed cloud service), and Friendli Serverless Endpoints.
Our evaluation on a GPT-3 175B model shows that Friendli Engine (Orca) significantly outperforms NVIDIA FasterTransformer in both latency and throughput, with a 36.9x throughput improvement at the same level of latency. This work was also presented at OSDI ’22.
The above diagram shows the system architecture and the overall workflow of Friendli Engine (Orca). Friendli Engine exposes an endpoint where inference requests arrive and responses are sent out. The endpoint places newly arrived requests in the request pool, which manages all requests in the system during their lifetime. The pool, in turn, is managed by the scheduler, whose operation is explained in more detail in the following section.
Friendli Engine (Orca) is built on two key techniques: iteration-level scheduling and selective batching. Both were devised to address limitations of existing serving systems.
Iteration-level scheduling
Iteration-level scheduling is a new scheduling mechanism that schedules execution at the granularity of iteration. We define a single run of all layers as an iteration of the model. Existing model serving systems are mostly designed to schedule executions at request granularity. That is, the serving system and the execution engine interact with one another only when (1) the serving system schedules the next batch of requests on an idle engine; or (2) the engine finishes processing requests in the current batch. This can be problematic in the serving of generative models, since different requests in the batch can require different numbers of iterations, resulting in some requests finishing earlier than the others.
The above illustration shows the case where the serving system schedules the engine at request granularity. Here, although request x₂ finished earlier than x₁, it went through some extra computation (iter 3 & 4) until x₁ was finished. Such behavior limits the efficiency of batched execution. Furthermore, the early-finished requests cannot return to the client promptly, as the engine would return execution results to the serving system only after it finished processing every request in the batch. Similarly, when a new request arrives in the middle of an execution, it must wait until the current batch is completely processed.
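To make the cost concrete, here is a minimal Python sketch of the scenario above. It is a toy model rather than Friendli Engine code: the function name `request_level_batch` is hypothetical, and the iteration counts (four for x₁, two for x₂) are read off the description above, where x₂ sits through iterations 3 and 4.

```python
def request_level_batch(iterations_needed):
    """Simulate one batch scheduled at request granularity (toy model).

    `iterations_needed` maps a request name to the number of iterations it
    actually needs. The engine keeps running the whole batch until the
    longest request is done, and only then hands results back.
    """
    batch_iterations = max(iterations_needed.values())
    report = {}
    for name, needed in iterations_needed.items():
        report[name] = {
            "needed": needed,
            # Every request occupies its batch slot until the batch ends.
            "executed": batch_iterations,
            # Iterations run after this request had already finished.
            "wasted": batch_iterations - needed,
            # Results leave the engine only when the whole batch is done.
            "returned_at": batch_iterations,
        }
    return report


# Iteration counts taken from the illustration: x1 needs 4, x2 needs 2.
report = request_level_batch({"x1": 4, "x2": 2})
for name, stats in report.items():
    print(name, stats)
```

In this toy model, x₂ needs only two iterations but occupies the batch for four and is returned only after the fourth, which is exactly the wasted computation and delayed response described above.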
All of this extra latency has one cause in common: a blunt scheduling mechanism that operates at the request level.
By scheduling at a finer granularity (the iteration, rather than the request), the system gains much-needed fluidity: it gets a chance to return finished requests and admit newly arrived ones between iterations.
What, then, do we mean by iteration-level operation? The scheduler repeats the following procedure:
(1) selects requests to run next;
(2) invokes the engine to execute one iteration for the selected requests;
(3) receives execution results for the scheduled iteration.
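The three steps above can be sketched as a small loop. This is a simplified illustration under stated assumptions, not the actual Friendli Engine scheduler: `Request`, `toy_engine_step`, and `run_scheduler` are hypothetical names, the first-come-first-served selection and the `max_batch_size` limit are assumptions made for the example, and each request carries a known iteration count, whereas a real generative model stops on an end-of-sequence token or a length limit.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    iterations_needed: int  # assumption: known up front for this toy example
    iterations_done: int = 0

    @property
    def finished(self):
        return self.iterations_done >= self.iterations_needed


def toy_engine_step(batch):
    """Stand-in for the execution engine: run ONE iteration per request."""
    for req in batch:
        req.iterations_done += 1
    return batch


def run_scheduler(arrivals, max_batch_size=2):
    """Iteration-level scheduling loop (toy model).

    `arrivals` maps an iteration number (starting at 1) to the requests
    that reach the request pool just before that iteration.
    Returns a dict: request name -> iteration at which it finished.
    """
    pool = []
    finished_at = {}
    last_arrival = max(arrivals, default=0)
    iteration = 0
    while pool or iteration < last_arrival:
        iteration += 1
        pool.extend(arrivals.get(iteration, []))
        # (1) select requests to run next (first-come, first-served here)
        batch = pool[:max_batch_size]
        if not batch:
            continue
        # (2) invoke the engine to execute one iteration for the selection
        results = toy_engine_step(batch)
        # (3) receive the results; finished requests leave the pool at once
        for req in results:
            if req.finished:
                finished_at[req.name] = iteration
        pool = [req for req in pool if not req.finished]
    return finished_at


# x1 and x2 arrive together; x3 arrives while they are still running.
finished = run_scheduler({
    1: [Request("x1", 4), Request("x2", 2)],
    2: [Request("x3", 2)],
})
print(finished)
```

In this run, x2 leaves the pool as soon as its second iteration completes, and x3 takes the freed slot in iteration 3 instead of waiting for x1 to finish. Under request-level scheduling with the same batch size, x3 could not start until the first batch ended after iteration 4.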
The GIF below shows the process in animation.