Clustering is a popular approach to exploratory data analysis and mining. However, clustering faces difficult challenges due to its ill-posed nature. First, it is well known that off-the-shelf clustering methods may discover different patterns in the same data, because each clustering algorithm has its own bias resulting from the criterion it optimizes. Second, there is no ground truth against which a clustering result can be validated. High-dimensional data pose a further challenge: many clustering algorithms can handle data with low dimensionality, but as the dimensionality of the data increases, these algorithms tend to break down. In this dissertation, we introduce novel clustering ensemble techniques and novel semi-supervised approaches to address these problems.
Clustering ensembles address the challenges that arise from the ill-posed nature of clustering. They can provide more robust and stable solutions by making use of the consensus across multiple clustering results, and they can average out the spurious structures that emerge from the biases of the participating algorithms and from the variance induced by different data samples. We introduce and analyze three new consensus functions for ensembles of subspace clusterings. The ultimate goal of our consensus functions is to provide hard partitions of the data, together with weight vectors that convey information about the subspaces within which the individual clusters exist. We demonstrate the effectiveness of our three techniques by running experiments with several real datasets, including high-dimensional text data, and we investigate the issue of diversity and accuracy in our ensemble techniques.
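To make the idea of consensus across multiple clusterings concrete, the following is a minimal sketch of a generic co-association (evidence accumulation) consensus. It is not one of the three consensus functions introduced in the dissertation, which operate on subspace clusterings and also return weight vectors. The function names `co_association` and `consensus_partition`, the 0.5 threshold, and the toy partitions are all assumptions chosen for illustration.

```python
def co_association(partitions):
    """Return an n x n matrix whose (i, j) entry is the fraction of
    input partitions that place points i and j in the same cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus_partition(partitions, threshold=0.5):
    """Illustrative consensus: link every pair of points that co-occur in
    more than `threshold` of the partitions, then return the connected
    components as a hard partition (labels 0, 1, ... by first appearance)."""
    n = len(partitions[0])
    co = co_association(partitions)
    parent = list(range(n))  # union-find forest over the points

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] > threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


# Three toy clusterings of six points (hypothetical data). The second is the
# first with permuted label names; the third disagrees on points 2 and 3.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 2, 2],
]
consensus = consensus_partition(partitions)
```

In this toy case the disagreement in the third clustering is outvoted, so `consensus` groups the first three points together and the last three together. Because the co-association matrix only records whether two points share a cluster, it is unaffected by how each individual clustering names its clusters.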
We also study scenarios in which limited knowledge about the data, in the form of pair-wise constraints, is available from the user. We develop a methodology to embed such constraints into the ensemble components, so that the desired structure emerges via the consensus clustering. We introduce a mechanism that leverages the ensemble framework to bootstrap informative constraints directly from the data and from the various clusterings, without intervention from the user. We demonstrate the effectiveness of our proposed techniques through experiments on real datasets and comparisons with other state-of-the-art semi-supervised techniques.
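As a small illustration of what pair-wise constraints look like in practice, the sketch below represents them in the usual must-link / cannot-link form and checks a hard partition against them. The abstract does not describe how the dissertation embeds constraints into the ensemble components, so this only shows the constraint representation; the function name `violated_constraints` and the example data are hypothetical.

```python
def violated_constraints(labels, must_link, cannot_link):
    """Return the must-link pairs that the partition splits apart and the
    cannot-link pairs that it places in the same cluster."""
    broken_must = [(i, j) for i, j in must_link if labels[i] != labels[j]]
    broken_cannot = [(i, j) for i, j in cannot_link if labels[i] == labels[j]]
    return broken_must, broken_cannot


# Hypothetical hard partition of six points and user-supplied constraints.
labels = [0, 0, 0, 1, 1, 1]
must_link = [(0, 1), (2, 3)]    # pairs that should share a cluster
cannot_link = [(0, 5), (3, 4)]  # pairs that should be separated

broken_must, broken_cannot = violated_constraints(labels, must_link, cannot_link)
```

Here the partition honors the constraints on the pairs (0, 1) and (0, 5) but violates the must-link on (2, 3) and the cannot-link on (3, 4).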
|School:||George Mason University|
|School Location:||United States -- Virginia|
|Source:||DAI-B 69/07, Dissertation Abstracts International|
|Keywords:||Clustering ensembles, Consensus functions, Subspace clustering, Weighted clustering|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved