High-dimensional data are becoming increasingly pervasive, bringing new problems and opportunities for data analysis. This thesis develops methods for both supervised and unsupervised learning from high-dimensional data.

The first topic is unsupervised metric learning in the context of clustering. We propose the blur ratio criterion; minimizing it yields a transformation (distance metric) under which clusters are well separated and predictable. For the minimization we propose an iterative procedure, Clustering Predictions of Cluster Membership (CPCM), which alternately predicts cluster memberships and clusters these predictions. With linear regression and k-means, this algorithm is guaranteed to converge to a fixed point. The resulting clusters are invariant to linear transformations of the original features, and noise features tend to be eliminated as their weights are driven to zero.

Building on CPCM, we propose CPCM-orth, a method for orthogonal clustering. Orthogonal clustering is the unsupervised analog of faceted classification, a flexible way of organizing information based on multiple independent labelings (facets). Our aim is to generate such facets automatically; as data become higher dimensional, there is more reason and demand for automated faceting. The resulting clusters inherit the properties of CPCM, namely invariance to linear transformations and the tendency to drive the weights of noise features to zero. Orthogonal clustering provides a list of clusterings from which the user can choose. Moreover, orthogonalizing with respect to "extraneous" clusterings leads to improved single-labeling performance compared with traditional clustering algorithms.

The third algorithm we propose is a stepwise version of the Lasso. The Lasso can be solved efficiently using convex optimization and yields sparse, shrunken coefficient vectors.
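The CPCM alternation described above (predict cluster memberships with linear regression, then re-cluster the predictions with k-means until the labels stop changing) can be sketched roughly as follows. This is a minimal illustrative sketch, not the thesis implementation; the function name, random initialization, and iteration cap are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def cpcm(X, k, n_iter=20, seed=0):
    """Sketch of the CPCM loop: alternate between predicting cluster
    memberships with linear regression and re-clustering the predictions
    with k-means, stopping at a fixed point of the labels."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))      # random initial clustering
    for _ in range(n_iter):
        Y = np.eye(k)[labels]                     # one-hot membership matrix
        preds = LinearRegression().fit(X, Y).predict(X)
        new_labels = KMeans(n_clusters=k, n_init=10,
                            random_state=seed).fit_predict(preds)
        if np.array_equal(new_labels, labels):    # fixed point reached
            break
        labels = new_labels
    return labels
```

Because the regression step only sees linear functions of the features, the resulting labels are unaffected by invertible linear transformations of X, which is the invariance property claimed above.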
On the flip side, the Lasso is known to have problems in correlated settings, and prediction-optimal selection of the regularization parameter often yields a model containing a large number of noise variables. We propose an algorithm called stepwise Lasso, which uses the Lasso for variable selection in a stepwise fashion. A Bonferroni threshold and Gram-Schmidt orthogonalization of the predicted vector at each step form the core of the algorithm. Numerical results demonstrate that stepwise Lasso produces models that are sparse yet competitive in their predictive ability.
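One way the stepwise loop could look is sketched below. This is a hedged interpretation of the description above, not the thesis algorithm: the `sqrt(2 log p)` cutoff, the crude z-score, and all names are illustrative assumptions; only the overall pattern (Lasso fit per step, Gram-Schmidt orthogonalization of the predicted vector, a Bonferroni-style stopping test) comes from the abstract:

```python
import numpy as np
from sklearn.linear_model import Lasso

def stepwise_lasso(X, y, alpha=0.1, max_steps=5, z_thresh=None):
    """Sketch of a stepwise-Lasso loop: at each step fit a Lasso to the
    current residual, Gram-Schmidt orthogonalize the predicted vector
    against earlier steps, and keep the step only if it clears a
    Bonferroni-style threshold."""
    n, p = X.shape
    if z_thresh is None:
        z_thresh = np.sqrt(2 * np.log(p))  # Bonferroni-style cutoff over p variables
    residual = y - y.mean()
    basis, selected = [], set()
    for _ in range(max_steps):
        fit = Lasso(alpha=alpha).fit(X, residual)
        yhat = fit.predict(X) - fit.intercept_    # this step's predicted vector
        for q in basis:                           # Gram-Schmidt vs. earlier steps
            yhat -= (yhat @ q) * q
        norm = np.linalg.norm(yhat)
        if norm < 1e-10:                          # Lasso selected nothing new
            break
        q = yhat / norm
        z = abs(residual @ q) / residual.std()    # crude z-score of the step
        if z < z_thresh:                          # fails the Bonferroni test: stop
            break
        basis.append(q)
        selected |= {j for j in range(p) if fit.coef_[j] != 0}
        residual = residual - (residual @ q) * q  # remove the explained part
    return sorted(selected)
```

The orthogonalization keeps each step's contribution uncorrelated with earlier ones, and the threshold is what keeps the final model sparse even when many candidate variables are available.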
Advisor: Foster, Dean P.
School: University of Pennsylvania
School Location: United States -- Pennsylvania
Source: DAI-B 70/02, Dissertation Abstracts International
Keywords: Clustering, Dimension reduction, Variable selection
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved