The ability to do rich analytics on massive sets of unstructured data drives the operation of many organizations today and has given rise to a new class of data-intensive computing systems. Many of these analytics are update-driven: they must constantly integrate new data into the analysis, and a fundamental requirement for efficiency is the ability to maintain state. However, current data-intensive computing systems do not directly support stateful analytics, making programming harder and resulting in inefficient processing.
This dissertation proposes that state become a first-class abstraction in data-intensive computing. It introduces stateful groupwise processing, a programming abstraction that integrates data-parallelism and state, allowing sophisticated, easily parallelizable stateful analytics. The explicit modeling of state abstracts the details of state management, making programming easier, and allows the runtime system to optimize state management. This work investigates the use of stateful groupwise processing in two distinct phases in the data management lifecycle: (i) the extraction of data from its sources and online analysis, and (ii) its storage and follow-on analysis. We propose two complementary architectures that manage data in these two phases.
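As a rough illustration of the idea of stateful groupwise processing, the following sketch groups incoming records by key and hands each group, together with that key's saved state, to a user-supplied update function. It is a minimal single-process sketch, not the dissertation's API: the names `stateful_groupwise` and `count_update`, the dictionary-based state store, and the `(new_state, output)` return convention are all assumptions made for this example.

```python
from collections import defaultdict

def stateful_groupwise(records, state, update):
    """Apply `update` once per key over a batch of (key, value) records.

    `state` is a plain dict standing in for a runtime-managed state store
    (an assumption of this sketch). `update(key, old_state, values)` must
    return a pair (new_state, output).
    """
    # Group the new batch by key, as a data-parallel runtime would.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)

    outputs = {}
    for key, values in groups.items():
        # Only keys present in the new batch are touched; state for all
        # other keys is left as-is, which is what makes updates incremental.
        new_state, out = update(key, state.get(key), values)
        state[key] = new_state
        outputs[key] = out
    return outputs

def count_update(key, old_count, values):
    """Hypothetical update function: a running count per key."""
    total = (old_count or 0) + len(values)
    return total, total

state = {}
first = stateful_groupwise([("a", 1), ("b", 1), ("a", 1)], state, count_update)
second = stateful_groupwise([("a", 1), ("c", 1)], state, count_update)
```

Because each key's state is independent in this sketch, the per-key calls could be distributed across workers without coordination; the second batch updates the count for `a` without reprocessing the first batch.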
This work proposes In-situ MapReduce (iMR), a model and architecture for efficient online analytics. The iMR model combines stateful groupwise processing with windowed processing for analyzing streams of unstructured data. To allow timely analytics, the iMR model supports reduced data fidelity through partial data processing and introduces a novel metric for the systematic characterization of partial data. For efficiency, the iMR architecture moves the data analysis from dedicated compute clusters onto the sources themselves, avoiding costly data migrations.
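To make the combination of windowed processing and partial data more concrete, the sketch below counts records per key in fixed-size time windows and reports, for each window, what fraction of the expected sources contributed data. This is only an illustration under stated assumptions: the function `windowed_counts`, the `(timestamp, source, key)` record layout, and the `source_coverage` value are hypothetical, and `source_coverage` is a simple stand-in, not the metric the dissertation introduces.

```python
from collections import defaultdict

def windowed_counts(events, window_size, expected_sources):
    """Count keys per fixed (tumbling) time window.

    `events` is an iterable of (timestamp, source, key) tuples; this layout
    is an assumption of the sketch. Each window's result also records the
    fraction of expected sources that contributed at least one record.
    """
    counts = defaultdict(lambda: defaultdict(int))
    seen_sources = defaultdict(set)

    for timestamp, source, key in events:
        window = timestamp // window_size  # index of the window for this record
        counts[window][key] += 1
        seen_sources[window].add(source)

    results = {}
    for window in sorted(counts):
        results[window] = {
            "counts": dict(counts[window]),
            # Hypothetical stand-in for a data-fidelity measure.
            "source_coverage": len(seen_sources[window]) / len(expected_sources),
        }
    return results

events = [
    (0, "s1", "GET"),
    (3, "s2", "GET"),
    (7, "s1", "POST"),
    (12, "s1", "GET"),
]
results = windowed_counts(events, window_size=10, expected_sources={"s1", "s2"})
```

Here the second window holds data from only one of the two expected sources, so a consumer can see that its result was computed over partial data and decide whether it is timely enough to use.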
Once data are extracted and stored, a fundamental challenge is how to write rich analytics to gain deeper insights from bulk data. This work introduces Continuous Bulk Processing (CBP), a model and architecture for sophisticated dataflows on bulk data. CBP uses stateful groupwise processing as the building block for expressing analytics, lending itself to incremental and iterative analytics. Further, CBP provides primitives for dataflow control that simplify the composition of sophisticated analytics. Leveraging the explicit modeling of state, CBP executes these dataflows in a scalable, efficient, and fault-tolerant manner.
|Committee:||Cruz, Rene, Deutsch, Alin, Franceschetti, Massimo, Snoeren, Alex, Voelker, Geoffrey M.|
|School:||University of California, San Diego|
|Department:||Computer Science and Engineering|
|School Location:||United States -- California|
|Source:||DAI-B 73/03, Dissertation Abstracts International|
|Keywords:||Computer architectures, Data-intensive analytics, Stateful Data|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved