Dissertation/Thesis Abstract

Statistical methods for high-dimensional data analysis
by Gupta, Abhishek, Ph.D., University of Pennsylvania, 2008, 96; 3346226
Abstract (Summary)

High-dimensional data are becoming increasingly pervasive, and bring new problems and opportunities for data analysis. This thesis develops methods for both supervised and unsupervised learning of high-dimensional data. The first topic we focus on is unsupervised metric learning in the context of clustering. We propose a criterion, the blur ratio, whose minimization yields a transformation (distance metric) that gives well-separated and predictable clusters. For minimization we propose an iterative procedure, Clustering Predictions of Cluster Membership (CPCM), which alternately predicts cluster memberships and clusters these predictions. With linear regression and k-means, this algorithm is guaranteed to converge to a fixed point. The resulting clusters are invariant to linear transformations of the original features, and the procedure tends to eliminate noise features by driving their weights to zero.

Building on CPCM, we propose a method to perform orthogonal clustering. This is the unsupervised analog of faceted classification, a flexible way of organizing information based on multiple independent labelings (facets). Our aim is to generate such facets automatically. With increasingly high-dimensional data, there is more reason and demand for automated faceting. We propose CPCM-orth to achieve this. The resulting clusters inherit the properties of CPCM, namely invariance to linear transformations and a tendency to eliminate noise features by driving their weights to zero. We observe that orthogonal clustering provides a list of clusterings for the user to choose from. Also, orthogonalizing with respect to "extraneous" clusterings leads to improved performance for a single labeling as compared to traditional clustering algorithms.

The third algorithm we propose is a stepwise version of Lasso. Lasso can be solved efficiently using convex optimization and leads to sparse and shrunk coefficient vectors. On the flip side, Lasso is known to have problems in the correlated setting, and choosing the regularization parameter to optimize prediction often leads to a model with a large number of noise variables. We propose an algorithm called stepwise Lasso, which uses the Lasso for variable selection in a stepwise fashion. A Bonferroni threshold and Gram-Schmidt orthogonalization of the predicted vector at every step form the core of this algorithm. Numerical results demonstrate that stepwise Lasso produces models which are sparse yet competitive in their predictive ability.
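
The CPCM alternation described above (predict cluster memberships, then cluster the predictions) can be illustrated with a minimal sketch. This is not the thesis implementation: the farthest-point initialization, the intercept term, the warm-started k-means, and the stopping rule are all assumptions made for illustration, and the blur ratio itself is not computed. The names `farthest_point_init`, `kmeans`, and `cpcm` are hypothetical.

```python
import numpy as np


def farthest_point_init(points, k):
    """Assumed initialization: the first point, then repeatedly the point farthest from those chosen."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    return np.array(centers, dtype=float)


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations from the given starting centers; returns hard labels."""
    centers = np.array(centers, dtype=float)
    labels = None
    for _ in range(max_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(len(centers)):
            if np.any(labels == j):  # an empty cluster keeps its previous center
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def cpcm(X, k, max_iter=50):
    """Alternate between regressing memberships on X and clustering the predictions."""
    n = X.shape[0]
    X1 = np.column_stack([np.ones(n), X])  # design matrix with intercept (assumption)
    labels = kmeans(X, farthest_point_init(X, k))  # starting clustering (assumption)
    B = None
    for _ in range(max_iter):
        Z = np.eye(k)[labels]  # n x k one-hot membership indicators
        B, *_ = np.linalg.lstsq(X1, Z, rcond=None)  # linear regression of memberships on features
        P = X1 @ B  # predicted memberships
        # Warm-start k-means at the current clusters' mean predictions so label identities persist.
        centers = np.array([
            P[labels == j].mean(axis=0) if np.any(labels == j) else P.mean(axis=0)
            for j in range(k)
        ])
        new_labels = kmeans(P, centers)
        if np.array_equal(new_labels, labels):  # fixed point: clustering reproduces itself
            break
        labels = new_labels
    return labels, B


# Toy data: one informative feature separating two groups, plus four noise features.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 5))
X[:, 0] += np.where(truth == 0, -5.0, 5.0)
labels, B = cpcm(X, k=2)
```

In this sketch the rows of `B` after the intercept act as feature weights; on the toy data one would expect the weights on the four noise columns to be small relative to the informative one, in the spirit of the noise-elimination property the abstract describes.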

Indexing (document details)
Advisor: Foster, Dean P.
School: University of Pennsylvania
School Location: United States -- Pennsylvania
Source: DAI-B 70/02, Dissertation Abstracts International
Subjects: Statistics
Keywords: Clustering, Dimension reduction, Variable selection
Publication Number: 3346226
ISBN: 9781109009552