Published by FriendliAI and SNU at OSDI 2022

Orca: A Distributed Serving System for Transformer-Based Generative Models

Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have recently attracted huge interest, emphasizing the need for system support for serving models in this family. Because these models generate tokens autoregressively, serving an inference request requires running the model multiple times, with each iteration producing a single output token for the request. However, existing inference serving systems perform poorly on this multi-iteration workload because their inflexible scheduling mechanism cannot change the batch of requests currently being processed: requests that finish earlier than others in a batch cannot return to the client, while newly arrived requests must wait until the current batch completely finishes.
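The scheduling problem described above can be illustrated with a toy latency model. This is a minimal sketch under stated assumptions, not Orca's implementation: the `Request` class, the `batch_level_latencies` function, and the one-token-per-iteration cost model are all hypothetical names and simplifications introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: int        # iteration index at which the request arrives
    output_tokens: int  # number of model iterations the request needs


def batch_level_latencies(batch, late):
    """Latency (in iterations) when the batch cannot change once it starts.

    Assumes every request in `batch` arrives at iteration 0 and the batch
    starts immediately; `late` arrives while the batch is still running.
    """
    # The batch runs until its longest request has produced all its tokens.
    batch_end = max(r.output_tokens for r in batch)
    # Every member is returned only when the whole batch finishes.
    latencies = {r.name: batch_end - r.arrival for r in batch}
    # The late request cannot join; it starts after the batch completes.
    start = max(batch_end, late.arrival)
    latencies[late.name] = start + late.output_tokens - late.arrival
    return latencies


batch = [Request("a", 0, 2), Request("b", 0, 8)]
late = Request("c", 1, 3)
latencies = batch_level_latencies(batch, late)
# Iterations each request spends waiting rather than generating tokens.
wasted = {r.name: latencies[r.name] - r.output_tokens for r in batch + [late]}
print(latencies, wasted)
```

In this toy setting the short request `a` needs only 2 iterations but is held for 8, and the late request `c` waits for the entire batch before it starts, which is the behavior the abstract attributes to inflexible batch scheduling.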


Published by FriendliAI and SNU at ICML 2023

BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models

Pipeline parallelism is a key technique for training large language models within GPU clusters. However, it often leads to a memory imbalance problem, where certain GPUs face high memory pressure while others underutilize their capacity. This imbalance results in suboptimal training performance, even when the overall GPU memory capacity is sufficient for more efficient setups. To address this inefficiency, we propose BPipe, a novel approach for achieving memory balance in pipeline parallelism. BPipe employs an activation balancing method to transfer intermediate activations between GPUs during training, enabling all GPUs to utilize comparable amounts of memory. With balanced memory utilization, BPipe enhances the training efficiency of large language models like GPT-3 by eliminating redundant recomputations or increasing the micro-batch size. Our evaluation conducted on 48 A100 GPUs across six nodes interconnected with HDR InfiniBand shows that BPipe accelerates the training of GPT-3 96B and GPT-3 134B models by 1.25x-2.17x compared to Megatron-LM, a state-of-the-art framework for training large language models.
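The memory imbalance and the idea of balancing activations can be sketched numerically. The abstract does not specify the pipeline schedule or the balancing scheme, so the following rests on two labeled assumptions: a 1F1B-style schedule in which stage i (0-indexed) of p stages holds activations for up to p - i in-flight micro-batches, and a simple pairing of stage i with stage p - 1 - i. The function names are hypothetical, and the pairing is one possible scheme rather than a description of BPipe's actual method.

```python
def in_flight_microbatches(num_stages):
    """Peak number of micro-batches whose activations each stage holds.

    Assumption: a 1F1B-style schedule, where earlier stages run more
    forward passes before their first backward pass.
    """
    return [num_stages - i for i in range(num_stages)]


def balance_by_pairing(loads):
    """Even out loads by pairing stage i with stage (p - 1 - i).

    Assumption: the heavier stage of each pair hands part of its
    activations to the lighter one so both hold about half the total.
    """
    balanced = list(loads)
    i, j = 0, len(loads) - 1
    while i < j:
        total = balanced[i] + balanced[j]
        balanced[i] = (total + 1) // 2  # heavier half stays on stage i
        balanced[j] = total // 2
        i += 1
        j -= 1
    return balanced


loads = in_flight_microbatches(8)
balanced = balance_by_pairing(loads)
print(loads)     # per-stage activation load before balancing
print(balanced)  # per-stage activation load after pairing
```

Under these assumptions the peak per-stage load drops from 8 micro-batches to 5 while the total stays the same, which is the kind of headroom that could be spent on a larger micro-batch size or on skipping recomputation.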


Published by FriendliAI and SNU at Proceedings of VLDB

Hippo: Sharing Computations in Hyper-Parameter Optimization

Hyper-parameter optimization is crucial for pushing the accuracy of a deep learning model to its limits. However, a hyper-parameter optimization job, referred to as a study, involves numerous trials of training a model using different training knobs, and is therefore very computation-heavy, typically taking hours or days to finish. We observe that trials issued from hyper-parameter optimization algorithms often share common hyper-parameter sequence prefixes. Based on this observation, we propose Hippo, a hyper-parameter optimization system that reuses computation across trials to reduce the overall amount of computation significantly. Instead of treating each trial independently as in existing hyper-parameter optimization systems, Hippo breaks down the hyper-parameter sequences into stages and merges common stages to form a tree of stages (a stage tree). Hippo maintains an internal data structure, the search plan, to manage the current status and history of a study, and employs a critical-path-based scheduler to minimize the overall study completion time. Hippo applies not only to single studies but also to multi-study scenarios. Evaluations show that Hippo’s stage-based execution strategy outperforms trial-based methods for several models and hyper-parameter optimization algorithms, reducing end-to-end training time by up to 2.76× (3.53×) and GPU-hours by up to 4.81× (6.77×), for single (multiple) studies.
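The merging of common prefixes into a stage tree can be sketched as a prefix tree (trie) over hyper-parameter sequences. This is a minimal illustration, not Hippo's search plan or scheduler: `StageNode`, `build_stage_tree`, `count_stages`, and the string encoding of each stage's configuration are hypothetical choices made for this example.

```python
class StageNode:
    """One training stage; children are the stages that may follow it."""

    def __init__(self, config=None):
        self.config = config  # None marks the root (no training yet)
        self.children = {}


def build_stage_tree(trials):
    """Merge trials that share a hyper-parameter sequence prefix."""
    root = StageNode()
    for trial in trials:
        node = root
        for config in trial:
            if config not in node.children:
                node.children[config] = StageNode(config)
            node = node.children[config]
    return root


def count_stages(node):
    """Number of stages that must actually be executed in the tree."""
    return sum(1 + count_stages(child) for child in node.children.values())


# Each trial is a sequence of per-stage configurations (assumed encoding).
trials = [
    ("lr=0.1", "lr=0.01"),
    ("lr=0.1", "lr=0.001"),
    ("lr=0.05", "lr=0.01"),
]
tree = build_stage_tree(trials)
independent = sum(len(t) for t in trials)  # stages if trials run separately
shared = count_stages(tree)                # stages after merging prefixes
print(independent, shared)
```

Here the first two trials share the `lr=0.1` stage, so the tree needs 5 stages instead of the 6 that independent trial execution would run. Note that the second `lr=0.01` stage is not merged across trials, because it follows different prefixes and therefore starts from different model states.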
