Large-scale data management and deep data analysis are increasingly important for both enterprise and scientific applications. Statistical languages provide rich functionality and ease of use for data analysis and modeling and have large user bases. R is among the most widely used of these languages, but it is limited by a single-threaded execution model and by problem sizes that must fit on a single node. We propose a highly parallel R system called RABID (R Analytics for BIg Data) that maintains R compatibility, leverages the MapReduce-like Spark framework, and achieves high performance and scaling across clusters. RABID preserves the R programming model by introducing R-compatible distributed data structures with overloaded functions. Optimizations such as reducing the memory footprint, data pipelining and serialization, and operation merging are used to improve runtime performance. We compare RABID to several other frameworks.
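As a rough illustration of what operation merging on a distributed data structure can look like, the following is a minimal Python sketch, not RABID's implementation. The class `LazyChunkedVector` and its methods are hypothetical names introduced here; the chunks stand in for cluster partitions, and everything runs in a single process. Element-wise operations are only recorded when requested, then merged into one function so that each chunk is traversed once.

```python
class LazyChunkedVector:
    """Hypothetical stand-in for a distributed vector.

    Data is split into chunks (standing in for partitions), and element-wise
    operations are recorded instead of being executed immediately.
    """

    def __init__(self, chunks, pending=()):
        self.chunks = chunks            # list of lists, one per partition
        self.pending = tuple(pending)   # element-wise functions not yet applied

    @classmethod
    def from_list(cls, values, num_chunks):
        size = max(1, -(-len(values) // num_chunks))  # ceiling division
        return cls([values[i:i + size] for i in range(0, len(values), size)])

    def map(self, fn):
        # Record the operation only; no data is touched here.
        return LazyChunkedVector(self.chunks, self.pending + (fn,))

    def collect(self):
        # Merge all pending operations into one function, then make a
        # single pass over each chunk instead of one pass per operation.
        def fused(x):
            for fn in self.pending:
                x = fn(x)
            return x

        return [fused(x) for chunk in self.chunks for x in chunk]


v = LazyChunkedVector.from_list([1, 2, 3, 4, 5], num_chunks=2)
result = v.map(lambda x: x + 1).map(lambda x: x * 2).collect()
```

In this sketch the two `map` calls produce no intermediate vector; the work happens once, inside `collect`. A real system would additionally ship the merged function to worker nodes and serialize the chunks, which is outside the scope of this example.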
In the era of cloud computing, batch data-processing workloads like RABID applications are targeted to run in VMs or containers in a cloud-based data center. Efficient scheduling of data center VMs can reduce the number of physical servers needed and, in turn, reduce the energy and other capital costs of maintaining the virtualized data center. We propose an innovative data-driven approach to achieve efficient proactive VM scheduling. Our approach uses a multi-capacity bin-packing technique that efficiently places VMs onto physical servers. We use time-series analysis to extract not only low-frequency information about future VM workloads but also high-frequency information for VM workload correlations. This approach can also be implemented in RABID and leverage its high performance.
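To make the idea of multi-capacity bin packing concrete, here is a minimal Python sketch. It is not the dissertation's algorithm: it swaps in the common first-fit decreasing heuristic, assumes every server has the same capacity, and assumes each VM's demand vector (for example CPU cores and memory in GB) has already been forecast. The function name `place_vms` and the sample numbers are hypothetical. The time-series and correlation analysis mentioned above is not modeled.

```python
def place_vms(vms, server_capacity):
    """Place VMs on identical servers with first-fit decreasing.

    vms: dict mapping VM name -> tuple of resource demands.
    server_capacity: tuple of capacities, same length and order as demands.
    Returns a list of servers, each given as a list of VM names.
    """
    def size(demand):
        # Rank a VM by its most constrained resource, as a fraction of capacity.
        return max(d / c for d, c in zip(demand, server_capacity))

    servers = []  # each entry: {"free": [...], "vms": [...]}
    ordered = sorted(vms.items(), key=lambda kv: size(kv[1]), reverse=True)
    for name, demand in ordered:
        if any(d > c for d, c in zip(demand, server_capacity)):
            raise ValueError(f"VM {name!r} does not fit on an empty server")
        for server in servers:
            # A VM fits only if every resource dimension fits.
            if all(d <= f for d, f in zip(demand, server["free"])):
                target = server
                break
        else:
            # No existing server has room in all dimensions: open a new one.
            target = {"free": list(server_capacity), "vms": []}
            servers.append(target)
        target["free"] = [f - d for f, d in zip(target["free"], demand)]
        target["vms"].append(name)
    return [server["vms"] for server in servers]


# Hypothetical demands as (CPU cores, memory in GB).
demands = {"a": (8, 16), "b": (8, 40), "c": (4, 32), "d": (4, 8)}
placement = place_vms(demands, server_capacity=(16, 64))
```

With these sample numbers the four VMs fit on two servers. The point of the multi-capacity formulation is visible in the fit check: a server with spare memory but no spare cores cannot accept another VM, so each dimension is tested separately rather than being collapsed into a single size.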
|Advisor:||Midkiff, Samuel P.|
|Committee:||Eigenmann, Rudolf; Hu, Charlie; Kulkarni, Milind|
|Department:||Electrical and Computer Engineering|
|School Location:||United States -- Indiana|
|Source:||DAI-B 80/01(E), Dissertation Abstracts International|
|Subjects:||Computer Engineering, Computer science|
|Keywords:||Batch job scheduling, Big data, Cloud computing, Distributed systems, Parallel computing|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved