This thesis concerns methods for robust inference and network analysis for non-Gaussian data, with a focus on the unique challenges presented by RNA-Sequencing (RNA-Seq) and single-cell RNA-Sequencing (scRNA-Seq) applications. Two methods for constructing gene coexpression networks are presented (a robust method for RNA-Seq data and a method for estimating directed networks from scRNA-Seq data), as well as a novel method for testing differences in gene expression.
The first proposed method provides a new way of estimating the correlation of non-Gaussian data, which in turn can be used to infer gene function. The most straightforward way of constructing a coexpression network is to connect gene pairs whose expression levels are highly correlated across different experimental conditions. Usually, this correlation is measured by Pearson's correlation coefficient, which, however, does not directly apply to data generated by RNA-Seq. RNA-Seq data are non-negative integer counts that cannot be properly modeled by a Gaussian distribution; moreover, these counts have mean values proportional to the sequencing depths, so there are no identically distributed “replicates.” Directly normalizing counts by the corresponding sequencing depths and then using Pearson's correlation coefficient can have low efficiency. The proposed method, iCC, is a generalization of Pearson's correlation coefficient that can be directly applied to RNA-Seq data. On simulated data, it shows higher efficiency in distinguishing coexpressed gene pairs from unrelated gene pairs. On a real dataset, iCC generates a coexpression network that appears to agree more closely with experimentally validated networks than those produced by other methods. More generally, iCC can be used for calculating the correlation coefficient for any two series of random variables.
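For reference, the baseline that the paragraph above contrasts with iCC (dividing counts by sequencing depth and then applying Pearson's correlation coefficient) can be sketched as follows. This is only an illustration of that baseline, not of iCC itself; the function name and the toy counts are assumptions made for the example.

```python
import numpy as np

def depth_normalized_pearson(x, y, depths):
    """Baseline only (not iCC): scale counts by depth, then use Pearson's r.

    x, y   : read counts for two genes across the same samples
    depths : sequencing depth (or size factor) of each sample
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    depths = np.asarray(depths, dtype=float)
    # Dividing by depth removes the sample-specific scaling of the means.
    x_norm = x / depths
    y_norm = y / depths
    return float(np.corrcoef(x_norm, y_norm)[0, 1])

# Hypothetical toy counts: gene_b is exactly twice gene_a in every sample.
gene_a = [10, 20, 30, 80]
gene_b = [20, 40, 60, 160]
depths = [1.0, 2.0, 1.0, 4.0]
r = depth_normalized_pearson(gene_a, gene_b, depths)
```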
The second proposed method constructs gene coexpression networks (GCNs) from single-cell RNA-Sequencing data. The algorithm, called LEAP (Lag-based Expression Associations for Pseudotime-series data), uses the estimated pseudotime of the cells to find gene coexpression that involves a time delay, building on traditional time-series analysis techniques. Regular correlation-based GCNs describe only simultaneous gene coexpression. By using the time information that is virtually freely available in scRNA-Seq data, LEAP is able to capture associations that would otherwise be hidden by time lags. The asymmetric associations detected by LEAP are more likely to reflect regulatory relationships, as they describe which gene follows another gene in expression. Applied to a real dataset, LEAP not only identifies more true relationships than a traditional correlation-based network, but also captures directed, and thus potentially regulatory, relationships.
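The idea of a lag-based association can be illustrated with a minimal sketch: given two expression series already ordered by pseudotime, correlate one series with delayed copies of the other and keep the lag with the strongest correlation. This is a simplified illustration under stated assumptions, not the LEAP implementation; the function name, the `max_lag` parameter, and the synthetic series are all hypothetical.

```python
import numpy as np

def max_lagged_correlation(x, y, max_lag):
    """Return (correlation, lag) for the lag at which y best follows x.

    x, y are expression series ordered by pseudotime. For each lag,
    x[t] is paired with y[t + lag], so a strong result at lag > 0
    suggests that y trails x. Assumes len(x) == len(y) > max_lag + 1.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_corr, best_lag = 0.0, 0
    for lag in range(max_lag + 1):
        a = x[: len(x) - lag]   # earlier part of x
        b = y[lag:]             # later part of y
        r = np.corrcoef(a, b)[0, 1]
        if abs(r) > abs(best_corr):
            best_corr, best_lag = r, lag
    return float(best_corr), best_lag

# Synthetic example: y is a copy of x delayed by 3 pseudotime steps.
t = np.arange(60)
x = np.sin(t / 5.0)
y = np.empty_like(x)
y[3:] = x[:-3]
y[:3] = x[0]

forward = max_lagged_correlation(x, y, max_lag=5)   # y following x
reverse = max_lagged_correlation(y, x, max_lag=5)   # x following y
```

Because the score depends on which series is delayed, the forward and reverse results differ, which is what makes the resulting associations directed.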
Finally, the third method is a new way of detecting differentially expressed (DE) genes, which show different average expression levels in different sample groups and thus can be important biological markers. While many methods have been proposed for detecting DE genes and are generally very successful, they need to be further tailored and improved for cancer data. Tumor samples often show highly diverse expression (some even appear as extreme outliers), and this diversity is much larger than that in the control group. The proposed method, DiPhiSeq, can detect not only genes that show different average expression levels, but also genes whose expression varies to different degrees in different groups. These "differentially dispersed" genes can be important clinical markers. DiPhiSeq uses a redescending penalty on the quasi-likelihood function, and thus has superior robustness against outliers and other noise. Simulations and real data analysis demonstrate that DiPhiSeq outperforms existing methods in the presence of outliers and identifies unique sets of genes.
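To illustrate what "redescending" means, the sketch below uses Tukey's biweight, a standard redescending influence function from robust statistics: its influence grows for small residuals, then falls back to exactly zero beyond a cutoff, so extreme outliers stop affecting the fit. The abstract does not say which function DiPhiSeq uses, so the choice of Tukey's biweight here is an assumption made purely for illustration.

```python
import numpy as np

def tukey_biweight_psi(r, c=4.685):
    """Tukey's biweight influence function (illustrative choice only).

    psi(r) = r * (1 - (r/c)^2)^2  for |r| <= c, and 0 otherwise.
    c = 4.685 is the conventional tuning constant for standardized
    residuals under Gaussian errors.
    """
    r = np.asarray(r, dtype=float)
    inside = np.abs(r) <= c
    return np.where(inside, r * (1.0 - (r / c) ** 2) ** 2, 0.0)

# Small residuals keep most of their influence; large ones get none.
influence = tukey_biweight_psi([0.0, 1.0, 3.0, 10.0])
```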
|Committee:||Buechler, Steven, Li, Jun, Liu, Fang|
|School:||University of Notre Dame|
|Department:||Applied and Computational Mathematics and Statistics|
|School Location:||United States -- Indiana|
|Source:||DAI-B 80/06(E), Dissertation Abstracts International|
|Keywords:||Coexpression networks, Differential expression, Gene expression, Network analysis, RNA sequencing, Robust inference|