Due to recent advances in data collection techniques, massive amounts of data are being collected at an extremely fast pace, and these data are potentially unbounded. Boundless streams of data collected from sensors, equipment, and other data sources are referred to as “data streams”. Various data mining tasks can be performed on data streams in search of interesting patterns. This dissertation studies one particular data mining task, clustering, which can serve as the first step in many knowledge discovery processes. By grouping data streams into homogeneous clusters, data miners can learn about data characteristics, which can then be developed into classification models for new data or predictive models for unknown events. Data stream clustering calls for techniques that require only a single pass over the data and a very short processing time per data point. Moreover, the system will likely have to discard data that have already been viewed, so suitable techniques are needed to update the clustering model incrementally. We propose a novel method called POD-Clus (Probability and Distribution-based Clustering) that meets these requirements. This dissertation covers two paradigms for data stream clustering. In clustering by example, data points collected from the same data source can have different cluster assignments. In clustering by variable, each stream is treated as one unit, and all data points from the same stream must stay in the same cluster. We demonstrate that POD-Clus is applicable to both paradigms. POD-Clus also handles situations in which clusters evolve. Cluster evolution is relevant to data stream clustering because the nature of clusters drawn from boundless streams may change considerably over time. We consider the following types of cluster evolution: cluster appearance, cluster disappearance, cluster splitting, and cluster merging.
The methodologies in this dissertation are grouped into (a) clustering by example without evolution, (b) clustering by example with evolution, (c) clustering by variable without evolution, and (d) clustering by variable with evolution. We conducted experiments on POD-Clus and compared it against recent data stream clustering algorithms. The results show that POD-Clus significantly improves clustering results compared to the competing algorithms.
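The single-pass, incremental-update requirement described in the abstract can be illustrated with a minimal sketch. This is not POD-Clus itself, whose details the abstract does not give. It is a generic example that assumes each cluster is summarized by a running count, mean, and variance (Welford's online update) and that a point joins the nearest cluster if it lies within a fixed distance. The names `ClusterSummary`, `cluster_stream`, and the `radius` parameter are hypothetical and chosen only for illustration.

```python
import math


class ClusterSummary:
    """Running count, mean, and sum of squared deviations for one cluster.

    Hypothetical helper for illustration; not part of POD-Clus.
    """

    def __init__(self, point):
        self.n = 1
        self.mean = list(point)
        self.m2 = [0.0] * len(point)  # per-dimension sum of squared deviations

    def update(self, point):
        # Welford's online update: the point is absorbed into the summary
        # and can be discarded afterwards.
        self.n += 1
        for i, x in enumerate(point):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def variance(self):
        # Sample variance per dimension; zero until two points have been seen.
        return [m / (self.n - 1) if self.n > 1 else 0.0 for m in self.m2]

    def distance(self, point):
        # Euclidean distance from the point to the current cluster mean.
        return math.sqrt(sum((x - m) ** 2 for x, m in zip(point, self.mean)))


def cluster_stream(stream, radius):
    """Assign each point once, in arrival order (assumed threshold rule)."""
    clusters = []
    for point in stream:
        nearest = min(clusters, key=lambda c: c.distance(point), default=None)
        if nearest is not None and nearest.distance(point) <= radius:
            nearest.update(point)
        else:
            # A point far from every summary starts a new cluster.
            clusters.append(ClusterSummary(point))
    return clusters
```

Each point is read once and only the per-cluster summaries are kept, so memory does not grow with the length of the stream. Handling the evolution types listed above (appearance, disappearance, splitting, merging) would need additional logic that this sketch does not attempt.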
|Committee:||Chen, Zhiyuan; Karabatis, George; Schwartz, Stuart; Zhou, Lina|
|School:||University of Maryland, Baltimore County|
|School Location:||United States -- Maryland|
|Source:||DAI-B 70/05, Dissertation Abstracts International|
|Subjects:||Information science, Computer science|
|Keywords:||Cluster evolutions, Clustering perspectives, Data stream clustering, One-pass algorithms, POD-Clus algorithm|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved