Dissertation/Thesis Abstract

Large Scale Data Analysis in Parallel R and Its Use in Efficiently Scheduling Batch Jobs in the Cloud
by Lin, Hao, Ph.D., Purdue University, 2018, 109; 10829520
Abstract (Summary)

Large-scale data management and deep data analysis are increasingly important for both enterprise and scientific applications. Statistical languages provide rich functionality and ease of use for data analysis and modeling, and they have large user bases. R is among the most widely used of these languages, but it is limited by a single-threaded execution model and to problem sizes that fit on a single node. We propose a highly parallel R system called RABID (R Analytics for BIg Data) that maintains R compatibility, leverages the MapReduce-like Spark framework, and achieves high performance and scaling across clusters. RABID preserves the R programming model by introducing R-compatible distributed data structures with overloaded functions. Optimizations such as reducing the memory footprint, data pipelining and serialization, and operation merging are used to improve runtime performance. We compare RABID to several other frameworks.
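
As an illustration of the operation merging mentioned above, the following minimal Python sketch fuses a chain of element-wise operations into a single pass over the data. It is not RABID's implementation; the `LazySeq` class and its `map` and `collect` methods are hypothetical names introduced only for this example, and a local list stands in for a distributed data structure.

```python
from typing import Any, Callable, Iterable, List, Optional


class LazySeq:
    """Hypothetical lazy collection that records element-wise operations.

    Assumption for illustration only: each call to map() is recorded rather
    than executed, so a chain of maps costs one traversal instead of one
    traversal (and one intermediate copy) per operation.
    """

    def __init__(self, data: Iterable[Any],
                 ops: Optional[List[Callable[[Any], Any]]] = None) -> None:
        self._data = list(data)
        self._ops = list(ops) if ops else []

    def map(self, fn: Callable[[Any], Any]) -> "LazySeq":
        # Record the operation; no pass over the data happens here.
        return LazySeq(self._data, self._ops + [fn])

    def collect(self) -> List[Any]:
        # Merge all recorded operations into one function and apply it
        # in a single pass over the data.
        def fused(x: Any) -> Any:
            for fn in self._ops:
                x = fn(x)
            return x

        return [fused(x) for x in self._data]


result = LazySeq([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2).collect()
```

In this sketch the two `map` calls produce no intermediate list; the work happens only in `collect`, which is the point at which a real system could also pipeline the data to worker processes.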

In the era of cloud computing, batch data-processing workloads like RABID applications are targeted to run in VMs or containers in a cloud-based data center. Efficient scheduling of data center VMs can reduce the number of physical servers needed and, in turn, reduce the energy and other capital costs for maintaining the virtualized data center. We propose an innovative data-driven approach to achieve efficient proactive VM scheduling. Our approach uses a multi-capacity bin-packing technique that efficiently places VMs onto physical servers. We use time-series analysis to extract not only low-frequency information about future VM workloads but also high-frequency information for VM workload correlations. This approach can also be implemented in RABID and leverage its high performance.
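
To make the idea of multi-capacity bin packing concrete, here is a minimal Python sketch in which each VM has a demand in several resource dimensions and each server has a capacity in the same dimensions. The abstract does not specify the packing algorithm, so a generic first-fit decreasing heuristic stands in for it; the function name `place_vms`, the resource dimensions (CPU cores, GB of memory), and the sample numbers are assumptions made for illustration, and the workload forecasting step is not modeled.

```python
from typing import Dict, List, Sequence, Tuple


def place_vms(demands: Dict[str, Sequence[int]],
              capacity: Sequence[int]) -> List[List[str]]:
    """Place VMs on identical servers with a first-fit decreasing heuristic.

    Assumption: this is a textbook heuristic used as a stand-in, not the
    dissertation's technique. A VM fits on a server only if every resource
    dimension fits at the same time.
    """
    for name, demand in demands.items():
        if any(d > c for d, c in zip(demand, capacity)):
            raise ValueError(f"VM {name!r} exceeds the capacity of one server")

    def size(name: str) -> float:
        # Rank a VM by its most constrained dimension, normalized by capacity.
        return max(d / c for d, c in zip(demands[name], capacity))

    servers: List[Tuple[List[int], List[str]]] = []  # (remaining, VM names)
    for name in sorted(demands, key=size, reverse=True):
        demand = demands[name]
        for remaining, names in servers:
            if all(d <= r for d, r in zip(demand, remaining)):
                for i, d in enumerate(demand):
                    remaining[i] -= d
                names.append(name)
                break
        else:
            # No open server has room in every dimension: open a new one.
            servers.append(([c - d for c, d in zip(capacity, demand)], [name]))
    return [names for _, names in servers]


# Hypothetical demands as (CPU cores, GB of memory) on 16-core, 64 GB servers.
vm_demands = {"a": (10, 16), "b": (8, 32), "c": (4, 48), "d": (6, 16)}
placement = place_vms(vm_demands, capacity=(16, 64))
```

With these sample values, a scheduler that packed on CPU alone would consider `b` and `c` a good pair (12 of 16 cores), but their combined memory demand of 80 GB exceeds 64 GB. Checking every dimension at once is what distinguishes the multi-capacity formulation from single-resource bin packing.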

Indexing (document details)
Advisor: Midkiff, Samuel P.
Committee: Eigenmann, Rudolf; Hu, Charlie; Kulkarni, Milind
School: Purdue University
Department: Electrical and Computer Engineering
School Location: United States -- Indiana
Source: DAI-B 80/01(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Computer Engineering, Computer science
Keywords: Batch job scheduling, Big data, Cloud computing, Distributed systems, Parallel computing
Publication Number: 10829520
ISBN: 9780438328501
Copyright © 2019 ProQuest LLC. All rights reserved.