Thomas Lord Department of Computer Science
PhD Thesis Defense - Qinyi (Chelsea) Luo
Committee members: Xuehai Qian (co-chair), Viktor Prasanna (co-chair), Ramesh Govindan, Chao Wang, Feng Qian
Title: High-Performance Heterogeneity-Aware Distributed Machine Learning Model Training
Abstract: The increasing size of machine learning models and the ever-growing amount of data mean that training a model can take days or even weeks. To accelerate training, distributed training with parallel stochastic gradient descent has become the go-to approach. This thesis targets four challenges in distributed training: (1) performance degradation caused by the large amount of data transferred among parallel workers, (2) heterogeneous computation and communication capacities across training devices, i.e., the straggler issue, (3) the huge memory consumption of training caused by gigantic model sizes, and (4) automatic selection of parallelization strategies.

The thesis first delves into decentralized training and proposes system support and algorithmic innovations that strengthen tolerance against stragglers in data-parallel training. On the system side, a characteristic unique to decentralized training, the iteration gap, is identified, and a queue-based synchronization mechanism is proposed to efficiently support decentralized training as well as common straggler-mitigation techniques. In experiments, the proposed training protocol, Hop, provides strong tolerance against stragglers and trains much faster than standard decentralized training when stragglers are present. On the algorithm side, a novel communication primitive, randomized partial All-Reduce, is proposed to enable fast synchronization in decentralized data-parallel training. The proposed approach, Prague, achieves a 1.2x speedup over All-Reduce in a straggler-free environment and a 4.4x speedup when stragglers are present.

Next, on the topic of memory optimization for training Deep Neural Networks (DNNs), an adaptive during-training model compression technique, FIITED, is proposed to reduce the memory consumption of training huge recommender models. FIITED adapts to dynamic changes in data and adjusts the dimension of each individual embedding vector continuously during training. Experiments show that FIITED reduces training memory consumption significantly more than other embedding pruning methods while maintaining the trained model's quality.

Finally, for automatic parallelization of training workloads, a novel unified representation of parallelization strategies, incorporating Data Parallelism (DP), Model Parallelism (MP), and Pipeline Parallelism (PP), is proposed, along with a search algorithm that selects superior parallel settings from the vast search space. An ideal stage partition ratio for synchronous pipelines is derived, to the best of my knowledge for the first time, and it is theoretically proven that unbalanced partitions are better than balanced partitions. In addition, by examining the pipeline schedule, a trade-off between memory and performance is uncovered and explored. Experiments show that hybrid parallel strategies generated with these optimizations consistently outperform those without them.
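To make the communication idea concrete, below is a minimal NumPy sketch of a randomized partial All-Reduce: each synchronization step averages gradients only within randomly formed worker groups instead of across all workers, so a slow worker delays only its own group. The group-formation rule, group size, and data here are illustrative assumptions, not the exact algorithm used in Prague.

```python
import numpy as np

def randomized_partial_allreduce(grads, group_size, rng):
    """Average gradients only within randomly formed worker groups.

    `grads` is a list of per-worker gradient arrays. Workers are shuffled
    and split into groups of `group_size`; each group averages internally,
    so no global barrier over all workers is needed. (Illustrative sketch.)
    """
    n = len(grads)
    order = rng.permutation(n)
    averaged = [None] * n
    for start in range(0, n, group_size):
        group = order[start:start + group_size]
        mean = np.mean([grads[i] for i in group], axis=0)
        for i in group:
            averaged[i] = mean
    return averaged

# Toy run: 8 workers with 4-dimensional gradients, synchronized in groups of 2.
rng = np.random.default_rng(0)
grads = [rng.normal(size=4) for _ in range(8)]
synced = randomized_partial_allreduce(grads, group_size=2, rng=rng)
```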
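The embedding-memory idea can likewise be illustrated with a small sketch: each row of an embedding table keeps only a number of leading dimensions proportional to a per-row utility score, so frequently used embeddings stay full-sized while rarely used ones shrink. The utility signal and the linear keep rule below are placeholder assumptions, not FIITED's actual criterion or schedule.

```python
import numpy as np

def trim_embedding_dims(table, utility, min_dim=1):
    """Zero trailing dimensions of low-utility embedding rows (illustrative)."""
    rows, max_dim = table.shape
    keep = np.clip((utility * max_dim).astype(int), min_dim, max_dim)
    for r in range(rows):
        table[r, keep[r]:] = 0.0  # trimmed dimensions need not be stored or updated
    return keep

# Toy run: a 5-row table of dimension 8, with utilities ranging from cold to hot.
rng = np.random.default_rng(0)
table = rng.normal(size=(5, 8))
kept = trim_embedding_dims(table, utility=np.array([0.1, 0.3, 0.5, 0.8, 1.0]))
print(kept)  # [1 2 4 6 8] for this utility vector
```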
Date: May 9, 2024
Time: 11:00 a.m. - 1:00 p.m.
Location: Hughes Aircraft Electrical Engineering Center (EEB) 110
Zoom link: https://usc.zoom.us/j/95741130954?pwd=dkRkblNlNGt0TlkwOU51SlRNS0hPZz09
Audiences: Everyone Is Invited
Contact: CS Events
Event Link: https://usc.zoom.us/j/95741130954?pwd=dkRkblNlNGt0TlkwOU51SlRNS0hPZz09