Dissertation/Thesis Abstract

Detection of low rank signals in noise and fast correlation mining with applications to large biological data
by Shabalin, Andrey A., Ph.D., The University of North Carolina at Chapel Hill, 2010, 139; 3418746
Abstract (Summary)

Ongoing technological advances in high-throughput measurement have given biomedical researchers access to a wealth of genomic information. The increasing size and dimensionality of the resulting data sets require new modes of analysis. In this thesis we propose, analyze, and validate several new methods for the analysis of biomedical data. We seek methods that are at once biologically relevant, computationally efficient, and statistically sound.

The thesis is composed of two parts. The first concerns the problem of reconstructing a low-rank signal matrix observed in the presence of noise. In Chapter 1 we consider the general reconstruction problem, with no restrictions on the low-rank signal. We establish a connection with the singular value decomposition. This connection and recent results in random matrix theory are used to develop a new denoising scheme that outperforms existing methods on a wide range of simulated matrices.
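As a rough illustration of the connection between low-rank reconstruction and the singular value decomposition, the sketch below denoises a matrix by plain rank truncation of its SVD. This is a generic baseline, not the denoising scheme developed in the thesis; the function name `truncated_svd_denoise`, the assumption that the rank is known, and the simulated rank-one signal are all illustrative choices.

```python
import numpy as np


def truncated_svd_denoise(Y, rank):
    """Keep only the `rank` leading singular components of Y.

    Hypothetical helper: a generic truncation baseline, assuming the
    rank of the signal is known in advance.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    # Scale the leading left singular vectors by their singular values,
    # then project back with the leading right singular vectors.
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]


# Simulated example: a strong rank-one signal plus Gaussian noise.
rng = np.random.default_rng(0)
u = rng.standard_normal((50, 1))
v = rng.standard_normal((1, 40))
signal = 3.0 * u @ v
noisy = signal + rng.standard_normal((50, 40))

estimate = truncated_svd_denoise(noisy, rank=1)

# Frobenius-norm reconstruction errors before and after denoising.
err_noisy = np.linalg.norm(noisy - signal)
err_est = np.linalg.norm(estimate - signal)
print(f"error of noisy matrix: {err_noisy:.2f}")
print(f"error of rank-1 estimate: {err_est:.2f}")
```

Truncation keeps the leading singular values unchanged and discards the rest; schemes informed by random matrix theory typically also adjust the retained singular values.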

Chapter 2 is devoted to a data mining tool that searches for low-rank signals equal to a sum of raised submatrices. The method, called LAS, searches for large average submatrices, also called biclusters, using an iterative search procedure that seeks to maximize a statistically motivated score function. We perform extensive validation of LAS and other biclustering methods on real datasets and assess the biological relevance of their findings.
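The idea of an iterative search for a large average submatrix can be sketched as follows. This is a simplified illustration, not the LAS implementation: it fixes the submatrix size at k rows by l columns and greedily maximizes the plain submatrix average rather than the statistically motivated score function described in the thesis. The function names, the random restarts, and the planted-block test matrix are assumptions made for the example.

```python
import numpy as np


def alternating_submatrix_search(X, k, l, rng, max_iter=100):
    """Alternate between choosing the best k rows and the best l columns.

    Hypothetical helper: starts from a random column set and stops when
    neither the row set nor the column set changes.
    """
    n_cols = X.shape[1]
    cols = np.sort(rng.choice(n_cols, size=l, replace=False))
    rows = np.array([], dtype=int)
    for _ in range(max_iter):
        # Rows with the largest sums over the current columns.
        new_rows = np.sort(np.argsort(X[:, cols].sum(axis=1))[-k:])
        # Columns with the largest sums over the newly chosen rows.
        new_cols = np.sort(np.argsort(X[new_rows, :].sum(axis=0))[-l:])
        if np.array_equal(new_rows, rows) and np.array_equal(new_cols, cols):
            break
        rows, cols = new_rows, new_cols
    return rows, cols


def best_of_restarts(X, k, l, rng, n_restarts=20):
    """Run the search from several random starts; keep the highest average."""
    best = None
    for _ in range(n_restarts):
        rows, cols = alternating_submatrix_search(X, k, l, rng)
        avg = X[np.ix_(rows, cols)].mean()
        if best is None or avg > best[0]:
            best = (avg, rows, cols)
    return best


# Simulated example: Gaussian noise with one raised 5 x 5 block.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 30))
X[:5, :5] += 10.0

avg, rows, cols = best_of_restarts(X, k=5, l=5, rng=rng)
print("rows:", rows, "cols:", cols, "average:", round(float(avg), 2))
```

Because each alternating step can only keep or increase the submatrix sum, the search converges to a local optimum, which is why several random restarts are used here.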

The second part of the thesis considers the joint analysis of two biological datasets. In Chapter 3 we address the problem of finding associations between single nucleotide polymorphisms (SNPs) and gene expression. The huge number of possible associations requires careful attention to issues of computational efficiency and multiple comparisons. We propose a new method, called FastMap, that exploits the discreteness of SNPs and uses a permutation approach to account for multiple comparisons.
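A permutation approach to multiple comparisons can be illustrated with a generic max-statistic test. The sketch below is not FastMap and does not exploit the discreteness of SNPs; it computes every SNP-gene Pearson correlation directly and compares the largest observed value with its distribution under random relabeling of samples. The function names, the 0/1/2 genotype coding, the simulated data, and the assumption that no SNP or gene is constant across samples are all choices made for the example.

```python
import numpy as np


def max_abs_correlation(expr, snps):
    """Largest |Pearson r| over all (gene, SNP) pairs.

    expr: (n_genes, n_samples) array; snps: (n_snps, n_samples) array.
    Assumes no row is constant, so every norm below is nonzero.
    """
    e = expr - expr.mean(axis=1, keepdims=True)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    s = snps - snps.mean(axis=1, keepdims=True)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    return float(np.abs(e @ s.T).max())


def permutation_pvalue(expr, snps, n_perm, rng):
    """Permutation p-value for the maximum statistic.

    Samples of the expression matrix are permuted jointly, which keeps
    gene-gene correlations intact while breaking any SNP-gene link.
    """
    observed = max_abs_correlation(expr, snps)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(expr.shape[1])
        if max_abs_correlation(expr[:, perm], snps) >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)


# Simulated example: 20 genes, 30 SNPs, 100 samples, one planted association.
rng = np.random.default_rng(2)
snps = rng.integers(0, 3, size=(30, 100))
expr = rng.standard_normal((20, 100))
expr[0] += 2.0 * snps[0]

observed, pvalue = permutation_pvalue(expr, snps, n_perm=200, rng=rng)
print(f"max |r| = {observed:.2f}, permutation p-value = {pvalue:.3f}")
```

Comparing against the maximum over all pairs controls the family-wise error rate across the full set of tests, at the cost of recomputing every correlation for each permutation; that cost is what makes computational efficiency a central concern at genomic scale.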

In Chapter 4 we describe a method for combining gene expression data produced from different measurement platforms. The method, called XPN, estimates and removes the systematic differences between datasets by fitting a simple block-linear model to the available data. The method is validated on real gene expression data.

The methods described in Chapters 2-4 have been implemented and are publicly available online.

Indexing (document details)
Advisor: Nobel, Andrew B.
Committee: Budhiraja, Amarjit; Liu, Yufeng; Marron, J. S.; Perou, Charles M.; Rusyn, Ivan
School: The University of North Carolina at Chapel Hill
Department: Statistics
School Location: United States -- North Carolina
Source: DAI-B 71/09, Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Statistics, Bioinformatics
Keywords: Biclustering, Biological data, Correlation mining, Data sets, Low rank signals, Matrix reconstruction, Noise
Publication Number: 3418746
ISBN: 9781124173009