This dissertation describes a first attempt to build a data washing machine: a system that takes dirty data and, through an unsupervised process, outputs clean data. The design described here focuses on two main aspects of the data curation process, token correction and data redundancy, and aims to simplify and automate the preparation of data used to create information products. In this approach, all of these steps are automated, saving the time and effort of the data analysts who ordinarily perform them. The approach reverses the current practice of cleaning and standardizing records as a prerequisite for entity resolution (ER). Instead, ER is the first step, performed with an unsupervised blocking and stop word scheme based on token frequency, a scoring matrix for linking unstandardized references, and an unsupervised process for evaluating linking results based on cluster quality. The ER process is iterative, starting with a low match threshold. In each iteration, high-quality clusters are kept and low-quality clusters are reprocessed at an incrementally higher match threshold. A prototype of the process was built using Python and Java and was tested on 18 fully annotated datasets. The datasets varied in two ways. One was the level of character corruption, low or high. The other was that some datasets had records with two different record layouts, while in others all records followed the same layout. Datasets with low levels of character corruption, regardless of single or mixed layouts, produced clustering with an average F-measure of 0.91, a precision of 0.96, and a recall of 0.87. In datasets with high levels of corruption, the average F-measure was 0.78, precision 0.74, and recall 0.83. Broken down by layout, datasets with low levels of corruption had an F-measure of 0.94 for single layout and 0.89 for mixed layout, and datasets with high levels of corruption had an F-measure of 0.80 for single layout and 0.77 for mixed layout.
Overall, this dissertation outlines an approach for future research on how unsupervised data quality improvement processes can be incorporated into the basic design, allowing it to address other types of data quality problems and other types of data.
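The iterative ER loop summarized in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's prototype. Several pieces are assumptions made only for illustration: Jaccard token overlap stands in for the scoring matrix, mean pairwise similarity stands in for the cluster quality measure, and the function names (`wash`, `link`, `quality`), parameter values (`start`, `step`, `quality_min`, `stop_freq`), and sample records are all hypothetical. The sketch keeps clusters that pass the quality check and reprocesses the remaining records at a higher threshold.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Token-set overlap; an assumed stand-in for the scoring matrix."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link(ids, toks, threshold):
    """Cluster ids by transitive closure over pairs scoring >= threshold."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for i, j in combinations(ids, 2):
        # Blocking: only pairs sharing at least one token are compared.
        if toks[i] & toks[j] and jaccard(toks[i], toks[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def quality(members, toks):
    """Mean pairwise similarity in a cluster (hypothetical quality metric)."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0  # a singleton is trivially coherent
    return sum(jaccard(toks[i], toks[j]) for i, j in pairs) / len(pairs)


def wash(records, start=0.3, step=0.2, quality_min=0.6, stop_freq=5):
    """Iteratively cluster records, raising the threshold for poor clusters."""
    raw = [set(r.upper().split()) for r in records]
    freq = Counter(t for s in raw for t in s)
    # Stop words: tokens found in stop_freq or more records are ignored.
    toks = {i: {t for t in s if freq[t] < stop_freq} for i, s in enumerate(raw)}

    pending, final, threshold = list(toks), [], start
    while pending and threshold <= 1.0:
        retry = []
        for members in link(pending, toks, threshold):
            if quality(members, toks) >= quality_min:
                final.append(members)   # good cluster: keep as is
            else:
                retry.extend(members)   # poor cluster: reprocess
        pending, threshold = retry, threshold + step
    final.extend([i] for i in pending)  # leftovers become singletons
    return sorted(sorted(c) for c in final)


# Made-up sample references; "AR" occurs in every record and is dropped.
records = [
    "JOHN SMITH 12 OAK ST AR",
    "JON SMITH 12 OAK ST AR",
    "MARY JONES 7 ELM AVE AR",
    "MARY JONES 7 ELM AVENUE AR",
    "JOHN JONES 12 ELM ST AR",
]
clusters = wash(records)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

In this toy run, the first pass at threshold 0.3 over-merges records 0, 1, and 4. That cluster fails the quality check, so its records are reprocessed at 0.5, where record 4 separates from the other two.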
|Advisor:||Talburt, John J.|
|Committee:||Xu, Xiaowei X, Wu, Ningning N., Pullen, Daniel D.|
|School:||University of Arkansas at Little Rock|
|School Location:||United States -- Arkansas|
|Source:||DAI-A 82/10(E), Dissertation Abstracts International|
|Subjects:||Information science, Computer science|
|Keywords:||Data quality, Data washing machine, Entity resolution, Scoring matrix, Unstandardized references, Unsupervised blocking|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved