Dissertation/Thesis Abstract

Dynamic Documents for Data Analytic Science
by Becker, Gabriel, Ph.D., University of California, Davis, 2014, 189; 3685178
Abstract (Summary)

The need for reproducibility in computational research has been highlighted by a number of recent failures to replicate published data analytic findings. Most efforts to ensure reproducibility involve providing guarantees that reported results can be generated from the data via the reported methods, with a popular avenue being dynamic documents. This insurance is necessary but not sufficient for full validation, as inappropriately chosen methods will simply reproduce questionable results. To fully verify computational research we must replicate analysts' research processes, including: choice of and response to exploratory or intermediate results, identification of potential analysis strategies and statistical methods, selection of a single strategy from among those considered, and finally, the generation of reported results using the chosen method.

We present the concept of comprehensive dynamic documents. These documents represent the full breadth of an analyst's work during computational research, including code and text describing: intermediate and exploratory computations, alternate methods, and even ideas the analyst had which were not fully pursued. Furthermore, additional information can be embedded in the documents such as data provenance, experimental design, or details of the computing system on which the work was originally performed. We also propose computational models for representing, processing, and programmatically operating on such documents within R.

These comprehensive documents act as databases, encompassing both the work that the analyst has performed and the relationships among specific pieces of that work. This allows us to investigate research in a number of ways difficult or impossible to achieve given only a description of the final strategy. We can explore the choice of methods and whether due diligence was performed during an analysis. Secondly, we can compare alternative strategies either side-by-side or interactively. Finally, we can treat these complex documents as data about the research process and analyze them programmatically.

We also present a proof-of-concept set of software tools for working with comprehensive dynamic documents. This includes an R package which implements a framework for comprehensive documents in R, an extension of the IPython Notebook platform which allows users to author and interactively view them, and a caching mechanism which provides the efficiency necessary for interactive, self-updating views of such documents.

Indexing (document details)
Advisor: Temple Lang, Duncan
Commitee: Baines, Paul, Nolan, Deborah
School: University of California, Davis
Department: Statistics
School Location: United States -- California
Source: DAI-B 76/07(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Statistics
Keywords: Dynamic documents, Reproducibility, Reproducible research
Publication Number: 3685178
ISBN: 9781321607970
Copyright © 2019 ProQuest LLC. All rights reserved. Terms and Conditions Privacy Policy Cookie Policy
ProQuest