Large-scale data-centric systems help organizations store, manipulate, and derive value from large volumes of data. They consist of distributed components spread across a scalable number of connected machines and involve complex software/hardware stacks with multiple semantic layers. These systems help organizations solve established problems involving large amounts of data, while catalyzing new, data-driven businesses such as search engines, social networks, and cloud computing and data storage service providers. The complexity, diversity, scale, and rapid evolution of large-scale data-centric systems make it challenging to develop intuition about these systems, gain operational experience, and improve performance. It is an important research problem to develop a method to design and evaluate such systems based on the empirical behavior of the targeted workloads. Using an unprecedented collection of nine industrial workload traces of business-critical large-scale data-centric systems, we develop a workload-driven design and evaluation method for these systems and apply the method to address previously unsolved design problems.
Specifically, the dissertation contributes the following:
1. A conceptual framework of breaking down workloads for large-scale data-centric systems into data access patterns, computation patterns, and load arrival patterns.
2. A workload analysis and synthesis method that uses multi-dimensional, non-parametric statistics to extract insights and produce representative behavior.
3. Case studies of workload analysis for industrial deployments of MapReduce and enterprise network storage systems, two examples of large-scale data-centric systems.
4. Case studies of workload-driven design and evaluation of an energy-efficient MapReduce system and Internet datacenter network transport protocol pathologies, two research topics that require workload-specific insights to address.
Overall, the dissertation develops a more objective and systematic understanding of an emerging and important class of computer systems. The work in this dissertation helps further accelerate the adoption of large-scale data-centric systems to solve real life problems relevant to business, science, and day-to-day consumers.
|Advisor:||Katz, Randy H.|
|Committee:||Larson, Ray R., Paxson, Vern|
|School:||University of California, Berkeley|
|Department:||Electrical Engineering & Computer Sciences|
|School Location:||United States -- California|
|Source:||DAI-B 74/02(E), Dissertation Abstracts International|
|Keywords:||Datacenters, Design, Evaluation, Large-scale data, Mapreduce, Storage, Workload|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved