Clustering seeks to partition a given set of points into a number of clusters such that points in the same cluster are similar to one another and dissimilar to points in other clusters. By virtue of this goal, relational data are the typical input for clustering. The similarity and dissimilarity relations between the data points are the building blocks of cluster formation, so the task is driven by the notion of similarity between the data points. In practice, similarity is usually measured by the pairwise distances between the data points. Indeed, the objective functions of two widely used clustering algorithms, k-means and fuzzy c-means, are expressed in terms of the pairwise distances between the data points.
The clustering task is complicated by the choice of the distance measure and by the need to estimate the number of clusters. Fuzzy c-means is convenient when there is uncertainty in allocating points in overlapping areas to clusters. The k-means algorithm allocates each point unequivocally to a single cluster, overlooking the similarities between points in overlapping areas. The fuzzy approach allows a point to be a member of as many clusters as necessary; thus it provides better insight into the relations between the points in overlapping areas.
In this thesis we develop a relational framework inspired by the silhouette measure of clustering quality. The framework asserts the relations between the data points by means of logical reasoning with the cluster membership values. The original description of computing silhouettes is limited to crisp partitions. A natural generalization of silhouettes to fuzzy partitions is given within our framework. Moreover, two notions of silhouette emerge within the framework at different levels of granularity, namely the point-wise silhouette and the center-wise silhouette. With this generalization, each silhouette can measure the extent to which a crisp or fuzzy partition has fulfilled the clustering goal at the level of the individual points or the cluster centers. The partitions are evaluated by the silhouette measure in conjunction with point-to-point or center-to-point distances.
With this generalization, the average silhouette value becomes a reasonable device for selecting between crisp and fuzzy partitions of the same data set. Accordingly, one can determine which partition better represents the relations between the data points, in accordance with their pairwise distances. This powerful feature of the generalized silhouettes has exposed a problem with the partitions generated by fuzzy c-means. We have observed that defuzzifying the fuzzy c-means partitions always improves the overall representation of the relations between the data points. This is due to an inconsistency between some of the membership values and the distances between the data points. This inconsistency has been reported by others on a couple of occasions in real-life applications.
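The abstract does not say how the fuzzy c-means partitions are defuzzified. A common choice, assumed here purely for illustration, is the maximum-membership rule: each point is assigned to the cluster in which its membership is largest. The function name `defuzzify` and the layout of the membership matrix `U` (one row per point, one column per cluster) are hypothetical.

```python
import numpy as np

def defuzzify(U):
    """Turn a fuzzy membership matrix into crisp labels.

    Assumption: maximum-membership rule; U has shape (n_points, n_clusters).
    """
    U = np.asarray(U, dtype=float)
    # For each point (row), pick the cluster (column) with the largest membership.
    return U.argmax(axis=1)

# Illustrative memberships for three points and two clusters.
U = np.array([
    [0.90, 0.10],  # clearly in cluster 0
    [0.55, 0.45],  # in an overlapping area, leaning toward cluster 0
    [0.20, 0.80],  # clearly in cluster 1
])
labels = defuzzify(U)
```

The resulting crisp labels can then be scored with a silhouette measure and compared against the score of the original fuzzy partition.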
Finally, we present an experiment that demonstrates a successful application of the generalized silhouette measure to feature selection for highly imbalanced classification. On a real data set, a significant reduction in the number of features led to a significant improvement in classification.
|Committee:||Annexstein, Fred; Berman, Kenneth; Jegga, Anil; Marius Truta, Traian; Ralescu, Dan|
|School:||University of Cincinnati|
|Department:||Computer Science & Engineering|
|School Location:||United States -- Ohio|
|Source:||DAI-B 75/10(E), Dissertation Abstracts International|
|Subjects:||Information science, Computer science|
|Keywords:||Cluster validity, Fuzzy clustering, Intercluster, Intracluster, Relational framework, Silhouettes|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved