Research
Published by FriendliAI and SNU at OSDI 2022
Orca: A Distributed Serving System for Transformer-Based Generative Models
Orca introduces iteration-level scheduling and selective batching (together known as continuous batching) for serving large-scale Transformer-based models on generation tasks. Orca significantly outperforms NVIDIA FasterTransformer, delivering a 36.9x throughput improvement at the same level of latency.
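As a rough illustration of iteration-level scheduling, the Python sketch below re-forms the serving batch after every decoding step instead of waiting for the entire batch to finish, so completed requests leave immediately and waiting requests join right away. The `Request` class, `run_iteration` stub, and batch-size limit are illustrative assumptions, not Orca's actual implementation.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    prompt: str
    max_tokens: int
    generated: list = field(default_factory=list)

    def finished(self) -> bool:
        return len(self.generated) >= self.max_tokens

def run_iteration(batch):
    """Stand-in for one forward pass of the model: every request in the
    current batch produces exactly one new token."""
    for req in batch:
        req.generated.append("tok")  # dummy token

def serve(queue: deque, max_batch_size: int = 8):
    """Iteration-level scheduling: rebuild the batch after *every*
    decoding step rather than once per batch of requests."""
    batch = []
    while queue or batch:
        # Admit new requests up to the batch-size limit.
        while queue and len(batch) < max_batch_size:
            batch.append(queue.popleft())
        run_iteration(batch)
        # Return completed requests; keep the rest for the next step.
        done = [r for r in batch if r.finished()]
        batch = [r for r in batch if not r.finished()]
        for r in done:
            print("completed:", r.prompt, len(r.generated), "tokens")

serve(deque([Request("hello", 3), Request("world", 5)]))
```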
Published by FriendliAI and SNU at ICML ‘23
BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models
BPipe balances memory utilization across GPUs and enhances the training efficiency of LLMs by eliminating recomputation or increasing the micro-batch size. BPipe employs an activation balancing method that transfers intermediate activations between GPUs during training, enabling all GPUs to utilize comparable amounts of memory.
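The sketch below illustrates the imbalance that motivates activation balancing: under a 1F1B pipeline schedule, earlier stages keep more micro-batch activations in flight than later ones, so a balancing plan can pair memory-heavy stages with memory-light ones and shift part of the surplus. Both functions are hypothetical simplifications for illustration, not the BPipe algorithm itself.

```python
def inflight_activations(num_stages: int, num_microbatches: int):
    """Peak number of in-flight micro-batch activations per stage under a
    1F1B schedule: stage i holds up to (num_stages - i) activations, so
    early stages need far more memory than later ones."""
    return [min(num_stages - i, num_microbatches) for i in range(num_stages)]

def balance_plan(num_stages: int, num_microbatches: int):
    """Toy balancing plan: pair stage i with its mirror stage
    (num_stages - 1 - i) and move half of the surplus activations from the
    memory-heavy stage to the memory-light one, so paired GPUs hold
    comparable amounts of activation memory."""
    load = inflight_activations(num_stages, num_microbatches)
    plan = []
    for i in range(num_stages // 2):
        j = num_stages - 1 - i
        surplus = (load[i] - load[j]) // 2
        if surplus > 0:
            plan.append((i, j, surplus))  # (from_stage, to_stage, count)
    return plan

print(inflight_activations(8, 32))  # [8, 7, 6, 5, 4, 3, 2, 1]
print(balance_plan(8, 32))          # [(0, 7, 3), (1, 6, 2), (2, 5, 1)]
```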
Published by FriendliAI and SNU in the Proceedings of the VLDB Endowment
Hippo: Sharing Computations in Hyper-Parameter Optimization
Hippo is a hyper-parameter optimization system that reuses computation across trials to significantly reduce the overall amount of computation. Hippo breaks down hyper-parameter sequences into stages and merges common stages to form a stage tree.
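A minimal sketch of the stage-tree idea: each trial is treated as a sequence of stages, and trials that share a prefix of stages are merged so the shared work runs only once. `StageNode` and `build_stage_tree` are hypothetical names for illustration, not Hippo's API.

```python
class StageNode:
    """One training stage in the stage tree, shared by every trial whose
    hyper-parameter sequence is identical up to this point."""
    def __init__(self, stage=None):
        self.stage = stage      # e.g. ("train", "lr=0.1", "epochs=5")
        self.children = {}      # stage -> StageNode
        self.trials = []        # ids of trials passing through this node

def build_stage_tree(trials):
    """Break each trial into its sequence of stages and merge common
    prefixes into a tree, so a shared stage is computed only once."""
    root = StageNode()
    for trial_id, stages in trials.items():
        node = root
        for stage in stages:
            node = node.children.setdefault(stage, StageNode(stage))
            node.trials.append(trial_id)
    return root

# Two trials share their first stage and diverge afterwards, so the
# shared stage would be trained once and its result reused by both.
trials = {
    "trial_a": [("train", "lr=0.1", "epochs=5"), ("train", "lr=0.01", "epochs=5")],
    "trial_b": [("train", "lr=0.1", "epochs=5"), ("train", "lr=0.001", "epochs=5")],
}
root = build_stage_tree(trials)
shared = next(iter(root.children.values()))
print(len(root.children), shared.trials)  # 1 ['trial_a', 'trial_b']
print(len(shared.children))               # 2 (trials diverge at stage two)
```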