Thu, Apr 27, 2017 @ 02:15 PM - 04:15 PM
Resource Scheduling in Geo-distributed Computing
Date & Time:
April 27th, Thursday; 2:15-4:15pm
Professor Leana Golubchik (advisor)
Professor Bhaskar Krishnamachari (external member)
Professor Wyatt Lloyd
Professor Minlan Yu
Doctor Ganesh Ananthanarayanan (Microsoft Research)
Driven by growing computing needs and increasing data volumes, cloud service providers deploy multiple datacenters around the world to provide fast computing response. Applications that use such geo-distributed deployments include web search, user behavior analysis, machine learning, and live camera feed processing. Depending on an application's characteristics, its data may be generated, stored, and processed across the geo-distributed sites. Hence, processing data efficiently across these sites has become critical to application performance.
Existing solutions first aggregate all the required data at one location and execute the computation within that site. Such solutions incur a large amount of data transfer across the WAN and prolong application response time because of the significant network delay. An emerging trend is to instead distribute the computation across the sites according to the data distribution, and aggregate only the results afterward. Recent work has shown that this approach improves response time by 3-19X, or reduces WAN bandwidth usage by 250X.
Despite these preliminary gains, the performance of geo-distributed jobs depends heavily on how resources are scheduled. This raises new challenges, because trivial extensions of state-of-the-art scheduling solutions lead to sub-optimal performance.
In this thesis, we first take an initial step toward improving the performance of geo-distributed jobs from the perspective of computation resources. We provide insights into how conventional Shortest Remaining Processing Time (SRPT) scheduling falls short due to the lack of scheduling coordination among the sites, and propose a lightweight heuristic that significantly improves the jobs' response time. We also design a new job scheduling heuristic that coordinates workload demands and resource availability among the sites, and greedily schedules the job that can finish quickly.
The trace-driven simulation studies show that our proposed scheduling heuristics effectively reduce the response time for the geo-distributed jobs by up to 50%.
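To illustrate why uncoordinated SRPT can fall short, the following toy sketch compares two orderings. It is not the heuristic proposed in the thesis. It assumes each site runs one job at a time, that a job finishes only when its slowest site finishes, and it substitutes a simple rule (order jobs by their largest per-site remaining work) for the coordinated scheduler. All function names and job sizes are hypothetical.

```python
def completion_times(jobs, site_orders):
    """jobs: {name: [work at site 0, work at site 1, ...]}.
    site_orders[s] is the order in which site s runs the jobs.
    A job completes when its last site finishes its share."""
    finish = {name: 0.0 for name in jobs}
    for s, order in enumerate(site_orders):
        clock = 0.0
        for name in order:
            clock += jobs[name][s]
            finish[name] = max(finish[name], clock)
    return finish


def per_site_srpt(jobs):
    """Each site independently runs its locally shortest job first."""
    n_sites = len(next(iter(jobs.values())))
    return [sorted(jobs, key=lambda j: (jobs[j][s], j)) for s in range(n_sites)]


def coordinated(jobs):
    """Assumed stand-in rule: all sites share one order, sorted by each
    job's bottleneck (largest per-site) remaining work."""
    n_sites = len(next(iter(jobs.values())))
    order = sorted(jobs, key=lambda j: (max(jobs[j]), j))
    return [order] * n_sites


def mean_response(jobs, site_orders):
    finish = completion_times(jobs, site_orders)
    return sum(finish.values()) / len(finish)


# Hypothetical workload: two jobs with mirrored demands on two sites.
jobs = {"j1": [1, 2], "j2": [2, 1]}
print(mean_response(jobs, per_site_srpt(jobs)))  # each site favors a different job
print(mean_response(jobs, coordinated(jobs)))    # both sites agree on one order
```

In this example, per-site SRPT lets each site favor a different job, so both jobs wait on their slower site. A shared order lets one job finish early without delaying the other.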
Next, we take a step further by addressing the geo-distributed jobs' performance from the perspectives of both computation and network resources. Specifically, we address the scheduling challenges posed by the heterogeneous resource availability across the sites and the mismatched data distribution across the geo-distributed sites. We formulate the task placement decisions as a linear programming (LP) optimization, and allocate the resources to the job that can finish quickly. In addition to response time, our design can also incorporate other performance goals, e.g., fairness and WAN usage, through simple control knobs. The EC2-based deployment of our prototype and the large-scale trace-driven simulations show that our solutions improve response time over the baseline in-place scheduling approach by up to 77%, and over the state-of-the-art geo-distributed analytics solution by up to 55%.
Finally, we expand to a more general setting in which each job has multiple configuration options, and its quality depends on the configuration it uses. We motivate this problem with the scenario of processing live camera feeds across hierarchical clusters. In this setting, we focus on the scheduling problem of jointly deciding job configuration and placement for concurrent jobs, and design an efficient heuristic to maximize the overall quality with the available resources across the geo-distributed sites. Our evaluation based on the Azure deployment of our prototype shows that the proposed solution outperforms the state-of-the-art video analytics scheduler by up to 2.3X, and outperforms the widely deployed Fair Scheduler by up to 15.7X, in terms of the average quality of the concurrent jobs.
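As a rough illustration of task placement posed as a linear program, the sketch below handles a deliberately tiny case. It is not the thesis's formulation. It assumes two sites, all input data at site 0, a single variable x (the fraction of tasks kept at site 0), and a completion time bounded below by three linear terms: compute at site 0, compute at site 1, and the WAN transfer of data for the moved tasks. An LP optimum lies at a vertex, so the sketch simply checks the end points and pairwise intersections. All names and numbers are hypothetical.

```python
from itertools import combinations


def solve_placement(work, slots, data, wan_bw):
    """Minimise t subject to t >= slope * x + intercept for each bound,
    with 0 <= x <= 1, where x is the fraction of tasks kept at site 0."""
    bounds = [
        (work / slots[0], 0.0),                # compute at site 0: x * W / S0
        (-work / slots[1], work / slots[1]),   # compute at site 1: (1 - x) * W / S1
        (-data / wan_bw, data / wan_bw),       # WAN transfer: (1 - x) * D / B
    ]

    def t_of(x):
        # Smallest t that satisfies every bound at this x.
        return max(m * x + c for m, c in bounds)

    # Candidate vertices: the end points plus pairwise intersections in [0, 1].
    candidates = {0.0, 1.0}
    for (m1, c1), (m2, c2) in combinations(bounds, 2):
        if m1 != m2:
            x = (c2 - c1) / (m1 - m2)
            if 0.0 <= x <= 1.0:
                candidates.add(x)

    best_x = min(candidates, key=t_of)
    return best_x, t_of(best_x)


# Hypothetical job: 120 slot-seconds of work, 4 slots at site 0,
# 8 slots at site 1, 30 units of input data, WAN bandwidth 1 unit/s.
x, t = solve_placement(work=120, slots=(4, 8), data=30, wan_bw=1)
print(x, t)
```

With these numbers, running everything in place (x = 1) is bounded by site 0's compute time, while moving every task to site 1 is bounded by the WAN transfer. The LP balances the two. A realistic formulation would have one variable per site and per-link bandwidth constraints, and would typically be handed to an LP solver.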
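The joint decision described above can be made concrete with a small brute-force sketch. This is not the heuristic from the thesis. It assumes each job runs entirely at one site, each configuration is a hypothetical (demand, quality) pair, and each site has a single capacity number. It enumerates every combination, which is only feasible for tiny instances and is the reason an efficient heuristic is needed in practice.

```python
from itertools import product


def best_joint_assignment(jobs, capacity):
    """jobs: {name: [(demand, quality), ...]}; capacity: {site: units}.
    Returns ({name: (site, demand, quality)}, total quality) for the
    feasible combination with the highest total quality, or (None, -1.0)
    if no combination fits."""
    names = list(jobs)
    sites = list(capacity)
    # Every per-job choice pairs a site with one of the job's configurations.
    choices = [[(s, d, q) for s in sites for d, q in jobs[n]] for n in names]

    best, best_quality = None, -1.0
    for combo in product(*choices):
        used = {s: 0 for s in sites}
        for s, d, _ in combo:
            used[s] += d
        if all(used[s] <= capacity[s] for s in sites):
            total = sum(q for _, _, q in combo)
            if total > best_quality:
                best, best_quality = dict(zip(names, combo)), total
    return best, best_quality


# Hypothetical camera jobs, each with a cheap and an expensive configuration.
jobs = {
    "cam1": [(1, 0.5), (4, 0.9)],
    "cam2": [(2, 0.8), (3, 0.9)],
}
capacity = {"edge": 2, "cloud": 4}
assignment, quality = best_joint_assignment(jobs, capacity)
print(assignment, quality)
```

The search space grows as (sites x configurations) to the power of the number of jobs, so exhaustive search does not scale to many concurrent feeds.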
Audiences: Everyone Is Invited
Contact: Lizsl De Leon