A graph-based similarity measure is proposed that uses a graph representation of the data to assess the similarity between data records. Both direct and indirect similarity are considered, which together capture the relationship between data records comprehensively. Several data mining techniques and applications, including clustering and entity resolution, are explored.
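As a rough illustration of how direct and indirect similarity might be combined, the sketch below treats each record as a set of attribute values. The choices here are assumptions made for illustration only, not the measure defined in the dissertation: direct similarity is taken to be the Jaccard overlap of attribute values, indirect similarity is the strongest two-hop path through a third record, and `alpha` is a hypothetical discount applied to indirect evidence.

```python
from itertools import combinations


def jaccard(a, b):
    """Direct similarity (assumed): overlap of two records' attribute values."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def direct_similarity(records):
    """Symmetric matrix of pairwise direct similarities."""
    n = len(records)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = jaccard(records[i], records[j])
    return sim


def combined_similarity(records, alpha=0.5):
    """Combine direct similarity with a discounted two-hop (indirect) similarity.

    Indirect similarity between i and j is assumed to be the best product
    direct(i, k) * direct(k, j) over all intermediate records k.
    """
    direct = direct_similarity(records)
    n = len(records)
    combined = [row[:] for row in direct]
    for i, j in combinations(range(n), 2):
        indirect = max(
            (direct[i][k] * direct[k][j] for k in range(n) if k not in (i, j)),
            default=0.0,
        )
        combined[i][j] = combined[j][i] = max(direct[i][j], alpha * indirect)
    return combined


# Records 0 and 2 share no attribute value, but both overlap with record 1.
records = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
similarity = combined_similarity(records)
```

In this toy input, records 0 and 2 have zero direct similarity yet receive a small positive combined score because both are related to record 1.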
First, the problem of clustering is considered for a dataset consisting of non-numeric attributes. The K-medoid clustering algorithm is used, and postprocessing steps are introduced to improve the quality of clustering. A set of validity indices is proposed to assess the quality of the clustering results. To reduce computational complexity, a sampling strategy is introduced, and the effect of sampling on the values of the validity indices and on the clustering result is discussed. The influence of different similarity measures, postprocessing steps, and cluster numbers on the quality of clustering is examined, both analytically and experimentally. Similar enhancements to the fuzzy K-medoid algorithm are provided. The clusters resulting from the proposed algorithm can sometimes be interpreted as groups of objects sharing a common attribute that was not used in the clustering algorithm. A multi-medoid K-medoid algorithm is proposed, which introduces multiple medoids in each cluster to enhance the performance of the K-medoid algorithm. Finally, an optional node move step is introduced to produce better clustering results based on edge-oriented evaluation measures.
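For readers unfamiliar with K-medoid clustering, the following is a minimal sketch of the basic alternating (assign, then update) variant on a precomputed dissimilarity matrix. It omits the postprocessing, sampling, fuzzy, and multi-medoid extensions described above. The function name `k_medoids`, the naive initialization with the first `k` records, and the idea of deriving dissimilarity as one minus a similarity score are all assumptions made for illustration.

```python
def k_medoids(dist, k, max_iter=100):
    """Basic K-medoid clustering on a symmetric dissimilarity matrix.

    dist[i][j] is the dissimilarity between records i and j; for non-numeric
    data it could be, for example, 1 - similarity (an assumption here).
    Returns (medoids, labels).
    """
    n = len(dist)
    medoids = list(range(k))  # naive deterministic start (assumption)

    def assign(current):
        # Each record joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimizes total dissimilarity to its cluster members.
            best = min(members, key=lambda m: sum(dist[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids

    return medoids, assign(medoids)


# Toy usage: two well-separated groups on a line, with absolute difference
# standing in for the dissimilarity between records.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
medoids, labels = k_medoids(dist, k=2)
```

Because the algorithm only reads the dissimilarity matrix, it applies to non-numeric records as long as some pairwise measure is available.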
The entity resolution problem, which is the process of determining whether multiple records refer to the same real-world entity, is studied next. It is an important step in data cleaning and integration. A general entity resolution framework called ERUDITE is presented, which includes data preprocessing (filtering), record matching, and postprocessing (inconsistency elimination, record updating, and equivalent record elimination). Different record matching models are explored for both supervised and unsupervised learning methods. Two record updating algorithms are proposed to significantly improve the entity resolution result. Because the entity resolution result generally contains inconsistent decisions, new inconsistency elimination methods are proposed, and their performance is compared with that of existing methods. Experiments with both unsupervised and supervised learning on two public datasets show that the proposed framework performs well.
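To make the notion of inconsistent decisions concrete: pairwise matching may declare records A and B equivalent and B and C equivalent, yet declare A and C distinct. One common baseline for removing such inconsistencies is to take the transitive closure of the positive matches. The sketch below does this with a union-find structure. It illustrates that generic baseline only, not the new inconsistency elimination methods proposed in the dissertation, and the function name `resolve_entities` is hypothetical.

```python
def resolve_entities(n, matched_pairs):
    """Group n records into entities by transitive closure of pairwise matches.

    matched_pairs is an iterable of (i, j) index pairs that a record matching
    model judged to refer to the same entity.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root index under the smaller one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Records 0-1 and 1-2 were matched, so 0 and 2 end up in the same entity
# even if the pairwise model had rejected the pair (0, 2).
entities = resolve_entities(5, [(0, 1), (1, 2)])
```

Transitive closure always yields a consistent partition, but it can merge too aggressively when a single false match links two otherwise separate groups, which is one reason alternative elimination methods are worth comparing.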
|Committee:||Holder, Larry; Kalyanaraman, Ananth; Sivakumar, Krishnamoorthy|
|School:||Washington State University|
|School Location:||United States -- Washington|
|Source:||DAI-B 73/06, Dissertation Abstracts International|
|Keywords:||Clustering, Data mining, Data similarity, Entity resolution, Similarity measure|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved