Motivation: In the quantification of molecular components, large variation can affect, and even mislead, biological conclusions. At the same time, high-throughput experiments often involve only a small number of samples because of cost and time constraints. In such cases, stochastic variation may dominate the outcome of an experiment, because there may not be enough samples to reveal the true biological signal. It is therefore challenging to distinguish changes in phenotype from stochastic variation.
Methods: Because biological molecules are quantified with different technologies, different statistical methods are required. Focusing on three important types of high-throughput experiments, this thesis proposes novel solutions to reduce noise and increase the accuracy of molecular discovery.
i) In large-scale perturbation screens, thousands of mutant strains on hundreds of plates are profiled separately over hundreds of days (or batches). For each mutant strain, only a small number of samples are profiled. The artificial noise consists mainly of additive and multiplicative effects due to plates and batches. We propose a linear mixed-effects modeling framework based on experimental designs with at least two control samples. The controls are used in a normalization and variance estimation procedure that reduces noise in the data and scores the true biological phenotype.
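As a rough illustration of why controls help with additive and multiplicative plate effects, the sketch below centers and scales each plate by its own control wells. It is a deliberate simplification and not the linear mixed-effects framework of the thesis: the function name `control_scores`, its arguments, and the assumption that the controls sit on every plate are all hypothetical.

```python
import numpy as np

def control_scores(values, plate_ids, is_control):
    """Score each well relative to the controls on its own plate.

    Assumed (hypothetical) noise model: y = a_p + b_p * (signal + error),
    where a_p is an additive and b_p a multiplicative effect of plate p.
    Subtracting the control mean removes a_p; dividing by the control
    standard deviation removes b_p. At least two controls per plate are
    needed to estimate that standard deviation.
    """
    values = np.asarray(values, dtype=float)
    plate_ids = np.asarray(plate_ids)
    is_control = np.asarray(is_control, dtype=bool)

    scores = np.empty_like(values)
    for plate in np.unique(plate_ids):
        on_plate = plate_ids == plate
        controls = values[on_plate & is_control]
        if controls.size < 2:
            raise ValueError(f"plate {plate!r} needs at least two controls")
        sd = controls.std(ddof=1)  # sample standard deviation of the controls
        if sd == 0:
            raise ValueError(f"controls on plate {plate!r} do not vary")
        scores[on_plate] = (values[on_plate] - controls.mean()) / sd
    return scores
```

Under the assumed model, two plates that differ only by a shift and a rescaling receive identical scores, which is the property the normalization step is meant to deliver.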
ii) In RNA-seq experiments, fragments from thousands of genes in 4 to 8 samples on a flow cell can be sequenced in one day. The additive and non-additive effects that arise from a large number of plates are typically not present in the data. The gene-wise variance between samples depends on both the expectation and the dispersion of the gene counts. Because of stochastic noise, some gene-wise dispersions are under- or overestimated, which may lead to misinterpretation of the biological phenotype. We propose a shrinkage estimator of dispersion under Negative Binomial models that regularizes the estimates towards a value calculated from information shared across genes.
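The idea of shrinking gene-wise dispersions towards a common value can be sketched as follows. This is a minimal illustration, not the estimator proposed in the thesis: it assumes the Negative Binomial parameterization Var = mu + phi * mu^2, uses a method-of-moments estimate per gene, takes the median across genes as the common target, and uses a fixed, hypothetical `weight` instead of a data-driven one.

```python
import numpy as np

def moment_dispersion(counts):
    """Method-of-moments dispersion per gene (rows = genes, columns = samples).

    Under Var = mu + phi * mu^2, phi = (s^2 - mean) / mean^2.
    Negative estimates are clipped to zero (the Poisson limit).
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = (var - mean) / mean**2
    phi = np.where(mean > 0, phi, 0.0)  # genes with no counts get phi = 0
    return np.clip(phi, 0.0, None)

def shrink_dispersion(phi, weight=0.5):
    """Pull each gene-wise dispersion towards the median across genes.

    weight = 0 keeps the gene-wise estimates; weight = 1 replaces them
    with the common value. The fixed weight is an assumption of this sketch.
    """
    phi = np.asarray(phi, dtype=float)
    target = np.median(phi)
    return weight * target + (1.0 - weight) * phi
```

With only a handful of samples per gene, the raw estimates scatter widely, and the shrunken estimates always lie between the gene-wise value and the common target.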
iii) Lastly, in MS/MS experiments with SWATH acquisition, more than 10,000 spectra can be obtained sequentially in a run of about 120 minutes. The summed intensity across all signals within a narrow m/z bin is used to identify the fragments of each peptide. As a result, interference within the m/z bins goes undetected and introduces misleading ambiguity into protein quantification. The solutions proposed above for perturbation screens and RNA-seq experiments cannot be used for SWATH acquisition because the data have different properties. To remedy this, a new approach is proposed to quantify the homogeneity (the opposite of interference) among the co-elution traces of molecules within the m/z bins. Since the correct signals of a fragment share a homogeneous peak shape, we propose to use the p-value of a one-sided test on the second-order coefficient in a linear quadratic model. This coefficient captures the curvature of the trace in a linear regression procedure, and the p-value represents the strength of the concave pattern across the peaks of a fragment.
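The one-sided test on the second-order coefficient can be illustrated with an ordinary least-squares fit to a single elution trace. This sketch is an assumption-laden simplification: it fits fixed coefficients to one trace, whereas the thesis keywords point to a quadratic model with random coefficients, and the function name `concavity_p_value` is hypothetical.

```python
import numpy as np
from scipy import stats

def concavity_p_value(time, intensity):
    """One-sided p-value for H0: b2 >= 0 against H1: b2 < 0.

    Model: intensity = b0 + b1 * t + b2 * t^2 + error, fitted by ordinary
    least squares. A small p-value indicates a clearly concave (peak-like)
    trace.
    """
    t = np.asarray(time, dtype=float)
    y = np.asarray(intensity, dtype=float)
    t = t - t.mean()  # centering improves conditioning and leaves b2 unchanged

    X = np.column_stack([np.ones_like(t), t, t**2])
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

    resid = y - X @ beta
    dof = len(y) - X.shape[1]          # residual degrees of freedom
    sigma2 = resid @ resid / dof       # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)

    t_stat = beta[2] / np.sqrt(cov[2, 2])
    return stats.t.cdf(t_stat, dof)    # lower tail: evidence for b2 < 0
```

A trace needs at least four points for the residual degrees of freedom to be positive, and a perfectly noise-free quadratic would give a zero residual variance, so the sketch is only meaningful for noisy traces.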
Results: Evaluations on different experiments with each of the three technologies show that the proposed solutions outperform several existing methods.
Committee: Craig, Bruce; Zhang, Hao; Zhu, Michael Yu
School Location: United States -- Indiana
Source: DAI-B 75/07(E), Dissertation Abstracts International
Subjects: Biostatistics, Statistics, Bioinformatics
Keywords: Linear mixed effect models, Linear quadratic regression model with random coefficients, MS/MS with SWATH, Perturbation screen, RNA-seq, Shrinkage estimator of dispersion in negative binomial models
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved