Advances in sequencing and genomic technologies are providing new opportunities to understand the genetic basis of phenotypes such as diseases. Translating the large volumes of heterogeneous, often noisy, data into biological insights presents challenging problems of statistical inference. In this thesis, we focus on three important statistical problems that arise in our efforts to understand the genetic basis of phenotypic variation in humans.
At the molecular level, we focus on the problem of identifying the amino acid residues in a protein that are important for its function. Identifying functional residues is essential to understanding the effect of genetic variation on protein function as well as to understanding protein function itself. We propose computational methods that predict functional residues using evolutionary information as well as from a combination of evolutionary and structural information. We demonstrate that these methods can accurately predict catalytic residues in enzymes. Case studies on well-studied enzymes show that these methods can be useful in guiding future experiments.
At the population level, discovering the link between genetic and phenotypic variation requires an understanding of the genetic structure of human populations. A common form of population structure is that found in admixed groups formed by the intermixing of several ancestral populations, such as African-Americans and Latinos. We describe a Bayesian hidden Markov model of admixture and propose efficient algorithms to infer the fine-scale structure of admixed populations. We show that the fine-scale structure of these populations can be inferred even when the ancestral populations are unknown or extinct. Further, the inference algorithm can run efficiently on genome-scale datasets. This model is well-suited to estimate other parameters of biological interest such as the allele frequencies of ancestral populations which can be used, in turn, to reconstruct extinct populations.
Finally, we address the problem of sharing genomic data while preserving the privacy of individual participants. We analyze the problem of detecting an individual genotype from the summary statistics of single nucleotide polymorphisms (SNPs) released in a study. We derive upper bounds on the power of detection as a function of the study size, number of exposed SNPs and the false positive rate, thereby providing guidelines as to which set of SNPs can be safely exposed.
|Advisor:||Jordan, Michael I.|
|Commitee:||Karp, Richard, Sjolander, Kimmen|
|School:||University of California, Berkeley|
|School Location:||United States -- California|
|Source:||DAI-B 71/09, Dissertation Abstracts International|
|Subjects:||Statistics, Bioinformatics, Computer science|
|Keywords:||Amino acid residues, Functional residue prediction, Genomic privacy, Human genetic variation|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be