Scientific datasets associated with a research project can proliferate over time as a result of activities such as sharing datasets among collaborators, extending existing datasets with new measurements, and extracting subsets of data for analysis. As such datasets begin to accumulate, it becomes increasingly difficult for a scientist to keep track of their derivation history, which complicates data sharing, provenance tracking, and scientific reproducibility. Understanding what relationships exist between datasets can help scientists recall their original derivation history. For instance, if dataset A is contained in dataset B, then the connection between A and B could be that A was extended to create B.
We present a relationship-identification methodology as a solution to this problem. To examine the feasibility of our approach, we articulated a set of relevant relationships, developed algorithms for efficient discovery of these relationships, and organized these algorithms into a new system called ReConnect to assist scientists in relationship discovery. We also evaluated existing alternative approaches that rely on flagging differences between two spreadsheets and found that they were impractical for many relationship-discovery tasks. Additionally, we conducted a user study, which showed that relationships do occur in real-world spreadsheets, and that ReConnect can improve scientists' ability to detect such relationships between datasets.
The promising results of ReConnect's evaluation encouraged us to explore a more automated approach for relationship discovery. In this dissertation, we introduce an automated end-to-end prototype system, ReDiscover, that identifies, from a collection of datasets, the pairs that are most likely related, and the relationship between them. Our experimental results demonstrate the overall effectiveness of ReDiscover in predicting relationships in a scientist's or a small group of researchers' collections of datasets, and the sensitivity of the overall system to the performance of its various components.
|Commitee:||Daim, Tugrul, Mitchell, Melanie, Tufte, Kristin|
|School:||Portland State University|
|School Location:||United States -- Oregon|
|Source:||DAI-B 77/10(E), Dissertation Abstracts International|
|Subjects:||Information science, Computer science|
|Keywords:||Conditional random fields, Data extraction, Data profiling, Schema matching, Scientific data management, Support vector machines|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be