Dissertation/Thesis Abstract

Building a Data Washing Machine for Unsupervised Entity Resolution of Unstandardized References Sources
by Al Sarkhi, Awaad Kadhim Abdalhassan, Ph.D., University of Arkansas at Little Rock, 2021, 105; 28410072
Abstract (Summary)

This dissertation describes a first attempt to build a data washing machine: a system that takes dirty data and, through an unsupervised process, outputs clean data. The design described here focuses on two main aspects of the data curation process, token correction and data redundancy. It aims to simplify and automate the preparation of data used to create information products. In this approach, all of these preparation steps are automated, saving the time and effort of the data analysts who ordinarily perform them. The design also reverses the current approach, in which records are first cleaned and standardized as a prerequisite for entity resolution (ER). Here, ER is the first step, using an unsupervised blocking and stop word scheme based on token frequency. A scoring matrix is used to link unstandardized references, and an unsupervised process evaluates the linking results based on cluster quality. The ER process is iterative, starting with a low match threshold. In each iteration, high-quality clusters are kept, and low-quality clusters are reprocessed at an incrementally higher match threshold. A prototype of the process was built using Python and Java and was tested on 18 fully annotated datasets. The datasets differed in two respects. One was the level of character corruption, low versus high. The other was whether all records followed the same layout or the dataset mixed two different record layouts. Datasets with low levels of character corruption, regardless of single or mixed layouts, produced clustering with an average F-measure of 0.91, a precision of 0.96, and a recall of 0.87. In datasets with high levels of corruption, the average F-measure was 0.78, precision 0.74, and recall 0.83. Broken down by layout, datasets with low levels of corruption had an F-measure of 0.94 for single layout and 0.89 for mixed layout, and datasets with high levels of corruption had an F-measure of 0.80 for single layout and 0.77 for mixed layout.
Overall, this dissertation outlines an approach to future research on how unsupervised data quality improvement processes can be incorporated into the basic design, allowing it to address other data quality problems and other types of data.
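
The iterative, threshold-raising loop summarized in the abstract can be illustrated with a small Python sketch. This is not the dissertation's prototype. It assumes that clusters passing a quality check are set aside as finished and that the remaining references are re-clustered at a higher match threshold. The function names (wash, cluster, quality, similarity), the Jaccard token overlap used in place of the scoring matrix, the mean pairwise similarity used as the cluster quality measure, and all parameter defaults are hypothetical choices made for illustration. Blocking is omitted, so every pair of pending references is compared.

```python
from collections import Counter
from itertools import combinations


def tokenize(ref):
    # Split an unstandardized reference into upper-cased tokens.
    return ref.upper().split()


def token_frequencies(refs):
    # Count how many references contain each token.
    freq = Counter()
    for ref in refs:
        freq.update(set(tokenize(ref)))
    return freq


def similarity(a, b, freq, stop_freq):
    # Jaccard overlap of non-stop tokens (an assumed stand-in for the
    # scoring matrix). Tokens seen in stop_freq or more references are
    # treated as stop words and ignored.
    ta = {t for t in tokenize(a) if freq[t] < stop_freq}
    tb = {t for t in tokenize(b) if freq[t] < stop_freq}
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cluster(ids, refs, freq, threshold, stop_freq):
    # Single-link clustering: union any pair scoring at or above threshold.
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(ids, 2):
        if similarity(refs[i], refs[j], freq, stop_freq) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def quality(members, refs, freq, stop_freq):
    # Assumed quality measure: mean pairwise similarity inside the cluster.
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0  # a singleton is trivially consistent
    total = sum(similarity(refs[i], refs[j], freq, stop_freq) for i, j in pairs)
    return total / len(pairs)


def wash(refs, start=0.5, step=0.1, min_quality=0.6, stop_freq=3):
    # Begin with a low match threshold and raise it each iteration.
    freq = token_frequencies(refs)
    pending = list(range(len(refs)))
    accepted = []
    threshold = start
    while pending and threshold <= 1.0:
        retry = []
        for members in cluster(pending, refs, freq, threshold, stop_freq):
            if quality(members, refs, freq, stop_freq) >= min_quality:
                accepted.append(members)   # good enough: keep as final
            else:
                retry.extend(members)      # reprocess at a higher threshold
        pending = retry
        threshold = round(threshold + step, 10)
    accepted.extend([i] for i in pending)  # leftovers become singletons
    return accepted


refs = [
    "JOHN SMITH 12 OAK ST",
    "JOHN SMITH 12 OAK STREET",
    "MARY JONES 9 ELM RD",
    "MARY JONES 9 ELM ROAD",
]
print(wash(refs))  # prints [[0, 1], [2, 3]]
```

Each inner list holds the positions of references judged to describe the same entity. A realistic system would add blocking to avoid comparing every pair, which this sketch skips for brevity.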

Indexing (document details)
Advisor: Talburt, John J.
Committee: Xu, Xiaowei X, Wu, Ningning N., Pullen, Daniel D.
School: University of Arkansas at Little Rock
Department: Information Science
School Location: United States -- Arkansas
Source: DAI-A 82/10(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Information science, Computer science
Keywords: Data quality, Data washing machine, Entity resolution, Scoring matrix, Unstandardized references, Unsupervised blocking
Publication Number: 28410072
ISBN: 9798597096315
Copyright © 2021 ProQuest LLC. All rights reserved.