Dissertation/Thesis Abstract

Identifying Relationships between Scientific Datasets
by Alawini, Abdussalam, Ph.D., Portland State University, 2016, 164 pages; 10127966
Abstract (Summary)

Scientific datasets associated with a research project can proliferate over time as a result of activities such as sharing datasets among collaborators, extending existing datasets with new measurements, and extracting subsets of data for analysis. As such datasets begin to accumulate, it becomes increasingly difficult for a scientist to keep track of their derivation history, which complicates data sharing, provenance tracking, and scientific reproducibility. Understanding what relationships exist between datasets can help scientists recall their original derivation history. For instance, if dataset A is contained in dataset B, then the connection between A and B could be that A was extended to create B.
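The containment example above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the algorithm used in the dissertation: it assumes both datasets share the same column order, treats each dataset as a list of row tuples, and compares rows as multisets. The function names `is_contained` and `row_relationship` and the relationship labels are hypothetical.

```python
from collections import Counter

def is_contained(a, b):
    """Return True if every row of `a` occurs in `b` at least as often.

    Assumption: `a` and `b` are lists of tuples with identical column order.
    """
    need = Counter(a)
    have = Counter(b)
    return all(have[row] >= count for row, count in need.items())

def row_relationship(a, b):
    """Classify how the rows of dataset `a` relate to the rows of dataset `b`.

    The labels below are illustrative, not the dissertation's terminology.
    """
    a_in_b = is_contained(a, b)
    b_in_a = is_contained(b, a)
    if a_in_b and b_in_a:
        return "equal"
    if a_in_b:
        return "A contained in B"
    if b_in_a:
        return "B contained in A"
    if set(a) & set(b):
        return "overlap"
    return "disjoint"

# Hypothetical data: B extends A with one new measurement.
dataset_a = [("site1", 3.2), ("site2", 4.1)]
dataset_b = [("site1", 3.2), ("site2", 4.1), ("site3", 5.0)]
print(row_relationship(dataset_a, dataset_b))
```

A result of "A contained in B" is consistent with A having been extended to create B, but it does not prove that history; B could equally have been filtered to produce A.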

We present a relationship-identification methodology as a solution to this problem. To examine the feasibility of our approach, we articulated a set of relevant relationships, developed algorithms for efficient discovery of these relationships, and organized these algorithms into a new system called ReConnect to assist scientists in relationship discovery. We also evaluated existing alternative approaches that rely on flagging differences between two spreadsheets and found that they were impractical for many relationship-discovery tasks. Additionally, we conducted a user study, which showed that relationships do occur in real-world spreadsheets, and that ReConnect can improve scientists' ability to detect such relationships between datasets.

The promising results of ReConnect's evaluation encouraged us to explore a more automated approach for relationship discovery. In this dissertation, we introduce an automated end-to-end prototype system, ReDiscover, that identifies, from a collection of datasets, the pairs most likely to be related and the relationship between the datasets in each pair. Our experimental results demonstrate the overall effectiveness of ReDiscover in predicting relationships in the dataset collections of an individual scientist or a small group of researchers, as well as the sensitivity of the overall system to the performance of its various components.

Indexing (document details)
Advisor: Maier, David
Committee: Daim, Tugrul; Mitchell, Melanie; Tufte, Kristin
School: Portland State University
Department: Computer Science
School Location: United States -- Oregon
Source: DAI-B 77/10(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Information science, Computer science
Keywords: Conditional random fields, Data extraction, Data profiling, Schema matching, Scientific data management, Support vector machines
Publication Number: 10127966
ISBN: 9781339859989