Dissertation/Thesis Abstract

Building Efficient Large-Scale Big Data Processing Platforms
by Wang, Jiayin, Ph.D., University of Massachusetts Boston, 2017, 130; 10262281
Abstract (Summary)

In the era of big data, many cluster platforms and resource management schemes are created to satisfy the increasing demands on processing a large volume of data. A general setting of big data processing jobs consists of multiple stages, and each stage represents generally defined data operation such as filtering and sorting. To parallelize the job execution in a cluster, each stage includes a number of identical tasks that can be concurrently launched at multiple servers. Practical clusters often involve hundreds or thousands of servers processing a large batch of jobs. Resource management, that manages cluster resource allocation and job execution, is extremely critical for the system performance.

Generally speaking, there are three main challenges in resource management of the new big data processing systems. First, while there are various pending tasks from different jobs and stages, it is difficult to determine which ones deserve the priority to obtain the resources for execution, considering the tasks' different characteristics such as resource demand and execution time. Second, there exists dependency among the tasks that can be concurrently running. For any two consecutive stages of a job, the output data of the former stage is the input data of the later one. The resource management has to comply with such dependency. The third challenge is the inconsistent performance of the cluster nodes. In practice, run-time performance of every server is varying. The resource management needs to dynamically adjust the resource allocation according to the performance change of each server.

The resource management in the existing platforms and prior work often rely on fixed user-specific configurations, and assumes consistent performance in each node. The performance, however, is not satisfactory under various workloads. This dissertation aims to explore new approaches to improving the efficiency of large-scale big data processing platforms. In particular, the run-time dynamic factors are carefully considered when the system allocates the resources. New algorithms are developed to collect run-time data and predict the characteristics of jobs and the cluster. We further develop resource management schemes that dynamically tune the resource allocation for each stage of every running job in the cluster. New findings and techniques in this dissertation will certainly provide valuable and inspiring insights to other similar problems in the research community.

Indexing (document details)
Advisor: Sheng, Bo
Commitee: Ding, Wei, Sheng, Bo, Simovici, Dan, Zhang, Honggang
School: University of Massachusetts Boston
Department: Computer Science
School Location: United States -- Massachusetts
Source: DAI-B 78/10(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Computer science
Keywords: Big data, Performance, Scheduling, System
Publication Number: 10262281
ISBN: 9781369817027
Copyright © 2019 ProQuest LLC. All rights reserved. Terms and Conditions Privacy Policy Cookie Policy
ProQuest