Mutations are alterations in DNA that can have critical and permanent functional and evolutionary consequences. Understanding the underlying mutational processes and mechanisms is a cornerstone of genomics research. Modern high-throughput sequencing gives researchers access to an unprecedented abundance of DNA mutation data. This dissertation presents computational methods and analyses of large-scale sequencing data that reveal details of mutational processes in human DNA. Specifically, I focus on 1) single nucleotide variants in human cancer, 2) deletion breakpoints and 3) retroduplications in the human germline. In cancer, I develop a LASSO-based method to identify active mutational processes in tumor samples. It yields sparse, biologically interpretable solutions and can leverage prior knowledge learned from pan-cancer analysis. Furthermore, I propose a generative model that integrates mutational heterogeneity in both nucleotide contexts and genomic locations. By exploiting the fingerprints of mutational processes in both aspects, this framework has the potential to better identify mutational processes and help reveal the underlying biology. Using papillary renal cell carcinoma (pRCC) and data from the Pan-Cancer Analysis of Whole Genomes (PCAWG) as case studies, I showcase the power of these methods in cancer genomics. In the human germline, I jointly analyze the 1000 Genomes Project data with other genomic annotations. I demonstrate how strong selection and mutational mechanisms together shape the distribution of deletions in human genomes. In addition, I develop a method specifically targeting retroduplications in human genomes. Using this method, I obtain the largest human retroduplication variation set from 26 populations. These retroduplications reveal population structure and offer hints about recent human evolution and divergence.
Further analysis of insertion points shows how selection and mutational processes drive the nonrandom distribution of retroduplications in the genome. Finally, to address the explosion of biological data, I optimize the algorithm for a Monte Carlo simulation method for protein surface sampling. The new algorithm lowers the computational complexity to O(n^2) and thus makes it practical to apply the sampling method to large real-world proteins and complexes.
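As an illustration of the general idea behind LASSO-based identification of active mutational processes, the sketch below fits non-negative, L1-penalized exposures of a tumor's mutation catalog to a set of known signatures using coordinate descent. It is not the dissertation's implementation: the function name `fit_exposures`, the penalty value, and the four-context toy signatures are assumptions made for illustration (real signature catalogs typically use 96 trinucleotide contexts).

```python
def fit_exposures(catalog, signatures, lam=1.0, n_iter=200):
    """Non-negative LASSO by cyclic coordinate descent (illustrative sketch).

    Minimizes 0.5 * ||catalog - sum_k e_k * signatures[k]||^2 + lam * sum_k e_k
    subject to e_k >= 0, and returns the list of exposures e_k.
    """
    n_ctx = len(catalog)
    exposures = [0.0] * len(signatures)
    residual = list(catalog)  # catalog minus the current reconstruction
    norms = [sum(v * v for v in sig) for sig in signatures]
    for _ in range(n_iter):
        for k, sig in enumerate(signatures):
            if norms[k] == 0.0:
                continue
            # Correlation of signature k with the residual, adding back
            # signature k's own current contribution.
            rho = sum(sig[i] * residual[i] for i in range(n_ctx))
            rho += norms[k] * exposures[k]
            # Soft-threshold by lam, then clip at zero (non-negativity).
            new_value = max(0.0, (rho - lam) / norms[k])
            delta = new_value - exposures[k]
            if delta != 0.0:
                for i in range(n_ctx):
                    residual[i] -= delta * sig[i]
                exposures[k] = new_value
    return exposures


# Hypothetical toy signatures over four mutation contexts.
SIG_A = [0.7, 0.1, 0.1, 0.1]
SIG_B = [0.1, 0.7, 0.1, 0.1]
SIG_C = [0.1, 0.1, 0.7, 0.1]

# Toy catalog generated by 100 mutations from A and 50 from B; C is inactive.
catalog = [100 * a + 50 * b for a, b in zip(SIG_A, SIG_B)]
exposures = fit_exposures(catalog, [SIG_A, SIG_B, SIG_C], lam=1.0)
```

In this toy setting the L1 penalty slightly shrinks the exposures of the two active signatures and leaves the inactive signature at zero, which is the sparsity property the abstract refers to. Prior knowledge from pan-cancer analysis could, under the same assumptions, be encoded as per-signature penalty weights instead of a single `lam`.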
|Advisor:||Gerstein, Mark B.|
|School Location:||United States -- Connecticut|
|Source:||DAI-B 79/05(E), Dissertation Abstracts International|
|Subjects:||Genetics, Bioinformatics, Computer science|
|Keywords:||Cancer, Mutational Landscape, Mutational Processes, Mutational Signatures, Retroduplication, Topic Model|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved