Dissertation/Thesis Abstract

The author has requested that access to this graduate work be delayed until 2019-08-07. After this date, this graduate work will be available on an open access basis.
Robust Inference and Network Analysis for Non-Gaussian Gene-Expression Data
by Specht, Alicia T., Ph.D., University of Notre Dame, 2017, 126; 13836379
Abstract (Summary)

This thesis develops methods for robust inference and network analysis of non-Gaussian data, focusing on the particular challenges posed by RNA-Sequencing (RNA-Seq) and single-cell RNA-Sequencing (scRNA-Seq) applications. It presents two methods for constructing gene coexpression networks (a robust method for RNA-Seq data and a method for estimating directed networks from scRNA-Seq data) as well as a novel method for testing differences in gene expression.

The first proposed method provides a new way of estimating the correlation of non-Gaussian data, which in turn can be used to infer gene function. The most straightforward way of constructing a coexpression network is to connect gene pairs whose expression levels are highly correlated across different experimental conditions. Usually, this correlation is measured by Pearson's correlation coefficient, which, however, does not directly apply to data generated by RNA-Seq. RNA-Seq data are non-negative integers that cannot be properly modeled by a Gaussian distribution. Moreover, these counts have mean values that are proportional to the sequencing depths, so there are no identically distributed “replicates.” Directly normalizing counts by the corresponding sequencing depths and then using Pearson's correlation coefficient can have low efficiency. The proposed method, iCC, is a generalization of Pearson's correlation coefficient that can be applied directly to RNA-Seq data. On simulated data, it shows higher efficiency in distinguishing coexpressed gene pairs from unrelated gene pairs. On a real dataset, iCC generates a coexpression network that appears to agree more closely with experimentally validated networks than networks produced by other methods. More generally, iCC can be used to calculate the correlation coefficient for any two series of random variables.
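As a point of reference, the baseline that the abstract describes (dividing counts by sequencing depth and then applying Pearson's correlation coefficient) can be sketched as follows. This is only an illustration of that baseline, not of iCC, whose definition the abstract does not give. The function names and the toy counts are hypothetical.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def depth_normalized_correlation(counts_a, counts_b, depths):
    """Baseline only: scale each count by its sample's depth, then correlate."""
    norm_a = [c / d for c, d in zip(counts_a, depths)]
    norm_b = [c / d for c, d in zip(counts_b, depths)]
    return pearson(norm_a, norm_b)


# Hypothetical toy data: two genes measured in four samples whose
# sequencing depths differ (expressed here as relative size factors).
depths = [1.0, 2.0, 4.0, 8.0]
gene_a = [10, 40, 40, 160]
gene_b = [5, 20, 20, 80]

r = depth_normalized_correlation(gene_a, gene_b, depths)
```

In this toy case the two genes are exactly proportional after normalization, so the baseline recovers a correlation of 1. With real counts, sampling noise differs from sample to sample because depths differ, which is the inefficiency the abstract refers to.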

The second proposed method constructs gene coexpression networks (GCNs) from single-cell RNA-Sequencing data. The algorithm, called LEAP (Lag-based Expression Associations for Pseudotime-series data), uses the estimated pseudotime of the cells to find gene coexpression that involves a time delay, building on traditional time-series analysis techniques. Regular correlation-based GCNs describe only simultaneous gene coexpression. By using the time information that is virtually freely available in scRNA-Seq data, LEAP is able to capture associations that are hidden by time lags. The asymmetric associations detected by LEAP are more likely to reflect regulatory relationships, because they describe which gene follows another gene in expression. Applied to a real dataset, LEAP not only identifies more true relationships than a traditional correlation-based network, but also captures directed, and thereby regulatory, relationships.
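The idea of a lagged association along pseudotime can be illustrated with a minimal sketch: order the cells by pseudotime, then correlate one gene with a shifted copy of the other and keep the lag with the strongest correlation. This is a simplified stand-in under stated assumptions, not the LEAP algorithm itself. The function name `max_lagged_correlation`, the `max_lag` parameter, and the toy values are all hypothetical.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_lagged_correlation(expr_a, expr_b, pseudotime, max_lag):
    """Return (lag, r) maximizing |r| when gene B trails gene A by `lag` cells."""
    # Order cells by their estimated pseudotime.
    order = sorted(range(len(pseudotime)), key=lambda i: pseudotime[i])
    a = [expr_a[i] for i in order]
    b = [expr_b[i] for i in order]
    n = len(a)
    best_lag, best_r = 0, pearson(a, b)
    for lag in range(1, max_lag + 1):
        # Compare A at position t with B at position t + lag.
        r = pearson(a[: n - lag], b[lag:])
        if abs(r) > abs(best_r):
            best_lag, best_r = lag, r
    return best_lag, best_r


# Hypothetical toy data: in pseudotime order, gene B repeats gene A
# two cells later. Cells are supplied in reverse pseudotime order.
a_sorted = [0, 1, 4, 2, 7, 3, 8, 5, 6, 1]
b_sorted = [3, 3] + a_sorted[:-2]
pseudotime = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
expr_a = a_sorted[::-1]
expr_b = b_sorted[::-1]

best_lag, best_r = max_lagged_correlation(expr_a, expr_b, pseudotime, max_lag=3)
```

Because the measure is computed with A leading B, swapping the two arguments generally gives a different value, which is the sense in which such associations are asymmetric.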

Finally, the third method is a new way of detecting differentially expressed (DE) genes, which show different average expression levels in different sample groups and can therefore be important biological markers. Although many methods have been proposed for detecting DE genes, and are generally very successful, they need to be further tailored and improved for cancer data. Tumor samples often show highly diverse expression, with some samples appearing as extreme outliers, and this diversity is much larger than that in the control group. The proposed method, DiPhiSeq, can detect not only genes that show different average expression levels, but also genes that show different diversity of expression across groups. These "differentially dispersed" genes can be important clinical markers. DiPhiSeq uses a redescending penalty on the quasi-likelihood function, and thus has superior robustness against outliers and other noise. Simulations and real data analysis demonstrate that DiPhiSeq outperforms existing methods in the presence of outliers and identifies unique sets of genes.
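To illustrate what "redescending" means in general, the sketch below uses Tukey's biweight, a standard redescending function whose weight falls to exactly zero for sufficiently extreme residuals. The abstract does not say which penalty DiPhiSeq uses or how it enters the quasi-likelihood, so the choice of Tukey's biweight, the tuning constant 4.685, the MAD-based scale, and the simple location estimate are all assumptions made for illustration only.

```python
import statistics


def tukey_weight(r, c=4.685):
    """Tukey biweight weight: smooth near zero, exactly 0 once |r| > c."""
    if abs(r) > c:
        return 0.0
    return (1.0 - (r / c) ** 2) ** 2


def robust_location(values, c=4.685, iterations=20):
    """Iteratively reweighted mean using Tukey biweight weights (a sketch)."""
    mu = statistics.median(values)
    # Scale from the median absolute deviation (MAD), held fixed throughout.
    mad = statistics.median(abs(v - mu) for v in values)
    scale = 1.4826 * mad if mad > 0 else 1.0
    for _ in range(iterations):
        weights = [tukey_weight((v - mu) / scale, c) for v in values]
        total = sum(weights)
        if total == 0:
            break  # every point was rejected; keep the current estimate
        mu = sum(w * v for w, v in zip(weights, values)) / total
    return mu


# Hypothetical expression values for one gene, with one extreme sample.
values = [9, 10, 10, 11, 10, 100]
robust = robust_location(values)
plain = sum(values) / len(values)
```

Here the extreme sample receives zero weight, so the robust estimate stays at 10 while the ordinary mean is pulled to 25. A non-redescending loss such as Huber's would still give that sample some influence.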

Indexing (document details)
Advisor: Li, Jun
Committee: Buechler, Steven, Li, Jun, Liu, Fang
School: University of Notre Dame
Department: Applied and Computational Mathematics and Statistics
School Location: United States -- Indiana
Source: DAI-B 80/06(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Statistics
Keywords: Coexpression networks, Differential expression, Gene expression, Network analysis, RNA sequencing, Robust inference
Publication Number: 13836379
ISBN: 9780438835658
Copyright © 2019 ProQuest LLC. All rights reserved.