Dissertation/Thesis Abstract

Provenance framework in support of data quality estimation
by Simmhan, Yogesh L., Ph.D., Indiana University, 2007, 350; 3297094
Abstract (Summary)

Over the past several decades, science has evolved from an empirical and theoretical approach to one that also includes computational simulation and modeling, commonly known as e-Science. Advances in cyberinfrastructure for e-Science have enabled researchers to run complex computational investigations, comprising data access, analysis, and model runs, that execute largely automatically as data-driven workflows.

Provenance is metadata that describes the process by which datasets are generated by the workflows. This data derivation history is essential to understand how a datum was created, verify and validate the experimental results, and determine the quality of the derived data.

This dissertation makes two key contributions to scientific data management. First, it proposes a low-overhead provenance collection framework for scientific workflows. The Karma Provenance Framework is a prototype implementation that collects provenance activities from automatically instrumented services and builds a data provenance model from runtime information. Karma provides a service interface to query for different forms of provenance. The framework has been applied in the LEAD cyberinfrastructure and its performance validated through empirical analysis.
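The idea of building a provenance model from runtime activity can be illustrated with a minimal sketch. It assumes each instrumented service reports which data products it consumed and produced. The class `ProvenanceStore` and its methods `record` and `lineage` are hypothetical names chosen for illustration, and they are not Karma's actual service interface.

```python
class ProvenanceStore:
    """Hypothetical in-memory provenance store (illustration only, not Karma's API)."""

    def __init__(self):
        # Maps a data product ID to (service name, input IDs) of the step that produced it.
        self.produced_by = {}

    def record(self, service, inputs, outputs):
        """Record one service invocation: `inputs` were consumed, `outputs` were produced."""
        for out in outputs:
            self.produced_by[out] = (service, list(inputs))

    def lineage(self, data_id):
        """Return the derivation steps behind `data_id`, walking back toward its sources."""
        steps = []
        seen = set()
        stack = [data_id]
        while stack:
            current = stack.pop()
            # Skip products already visited and source data with no recorded producer.
            if current in seen or current not in self.produced_by:
                continue
            seen.add(current)
            service, inputs = self.produced_by[current]
            steps.append((current, service, inputs))
            stack.extend(inputs)
        return steps


# Example with made-up service and data product names.
store = ProvenanceStore()
store.record("ingest", ["radar_raw"], ["radar_clean"])
store.record("forecast", ["radar_clean", "model_config"], ["forecast_out"])
for product, service, inputs in store.lineage("forecast_out"):
    print(product, "<-", service, inputs)
```

Querying the lineage of a derived product returns one step per service invocation in its derivation history, which is one of the forms of provenance a query interface could expose.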

Second, it defines a data quality model for estimating the subjective quality of derived data for scientific applications. The model uses a holistic set of quality metrics, including provenance, intrinsic metadata, quality of service, and community perception, to estimate a numerical quality score for the data. This enables scientists to select the best-quality dataset for their application from the many that qualify. Experimental studies conducted on a prototype quality broker validate the feasibility and prediction accuracy of the model.
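How several quality metrics might combine into a single numerical score can be sketched as follows. The abstract does not say how the metrics are combined, so the weighted average used here is an assumption. The function names, metric keys, weights, and dataset values are likewise hypothetical.

```python
def quality_score(metrics, weights):
    """Weighted average of metric values in [0, 1].

    Assumption: a weighted average stands in for the dissertation's model,
    whose actual formula is not given in the abstract.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # A metric missing from `metrics` contributes 0 to the score.
    return sum(w * metrics.get(name, 0.0) for name, w in weights.items()) / total


def best_dataset(candidates, weights):
    """Return the name of the candidate dataset with the highest quality score."""
    return max(candidates, key=lambda name: quality_score(candidates[name], weights))


# Made-up candidates, each scored on the four metric families named in the abstract.
candidates = {
    "dataset_a": {"provenance": 0.9, "intrinsic_metadata": 0.5,
                  "quality_of_service": 0.5, "community_perception": 0.5},
    "dataset_b": {"provenance": 0.4, "intrinsic_metadata": 0.8,
                  "quality_of_service": 0.8, "community_perception": 0.8},
}
# A scientist who cares most about provenance weights it more heavily.
weights = {"provenance": 2.0, "intrinsic_metadata": 1.0,
           "quality_of_service": 1.0, "community_perception": 1.0}
print(best_dataset(candidates, weights))
```

Because the weights are supplied by the user, the same candidates can rank differently for different scientists, which reflects the subjective nature of the quality estimate.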

Indexing (document details)
Advisor: Plale, Beth, Gannon, Dennis
Committee: Bramley, Randall, Robertson, Edward
School: Indiana University
Department: Computer Sciences
School Location: United States -- Indiana
Source: DAI-B 69/02, Dissertation Abstracts International
Subjects: Computer science
Keywords: Data quality, E-Science, Grid computing, Metadata management, Provenance, Scientific workflows
Publication Number: 3297094
ISBN: 978-0-549-44242-4
Copyright © 2020 ProQuest LLC. All rights reserved.