Early approaches to cancer treatment classified cancers by the site at which they first form, and treated them with drugs and other therapies that target cells very broadly. These therapies often damage healthy cells in the process, which may lead to additional health complications. With the advent of high-throughput sequencing, and the development of computational tools to process the resulting deluge of sequencing data, much progress has been made in functionally annotating the human genome. Many genomes have been sequenced cost-effectively, providing insight into genetic variation among human populations. The methods used to study population variation may also be used to study the basis of genetic disease, including cancer. It has now been demonstrated that there are many molecular subtypes of cancer, each distinguished by which important cellular molecule or DNA sequence has been disrupted. Hence, understanding the genetic basis of cancer is paramount to the development of new, personalized molecular therapies.
Noncoding variants are known to be associated with disease, but they are investigated less often than coding variants because assessing the functional impact of a noncoding mutation is difficult. For rare mutations, burden tests compare observed mutation counts against a background mutation model to discover highly mutated regions, which might be potential drivers of cancer. This approach has been developed for coding regions, where burden tests have successfully identified highly mutated genes. However, it is challenging to apply to noncoding regions because of mutation rate heterogeneity and potential correlations across regions, which give rise to substantial overdispersion in the mutation count data. If not corrected, such overdispersion may suggest artifactual mutational hotspots. We address these issues with the development of a new computational framework called LARVA. LARVA intersects whole genome single nucleotide variant (SNV) calls with a comprehensive set of noncoding regulatory elements, and models these elements' mutation counts with a beta-binomial distribution to handle the overdispersion in a principled fashion. Furthermore, in estimating this distribution and determining the local mutation rate, LARVA incorporates regional genomic features such as replication timing.
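To illustrate why the choice of background distribution matters, the following is a minimal sketch of a beta-binomial burden test and not LARVA's actual implementation. The function names and the parameterization (`n` trials, shape parameters `a` and `b`) are assumptions made for this example. It computes the upper-tail probability of seeing at least `k` mutations in an element, and includes a plain binomial tail for comparison.

```python
from math import exp, lgamma, log


def log_choose(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k, n, a, b):
    """P(X = k) when X ~ BetaBinomial(n, a, b).

    The per-trial mutation probability is itself Beta(a, b) distributed,
    which is what lets the model absorb overdispersion.
    """
    return exp(log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b))


def betabinom_upper_tail(k, n, a, b):
    """P(X >= k): a p-value for observing k or more mutations."""
    return min(1.0, sum(betabinom_pmf(i, n, a, b) for i in range(k, n + 1)))


def binom_upper_tail(k, n, p):
    """P(X >= k) under a plain binomial with fixed rate p (no overdispersion)."""
    total = 0.0
    for i in range(k, n + 1):
        total += exp(log_choose(n, i) + i * log(p) + (n - i) * log(1.0 - p))
    return min(1.0, total)
```

With the same mean rate (for example `p = 0.01` versus `a = 0.5, b = 49.5`), the beta-binomial tail is considerably heavier than the binomial tail, so a count that looks like a hotspot under the binomial model may be unremarkable once overdispersion is accounted for.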
The LARVA framework can be extended in certain ways to facilitate the analysis of its results. By storing information on highly mutated annotations in a relational database, it is possible to quickly extract the most interesting results for further analysis. Furthermore, results from multiple LARVA runs can be combined for a meta-analysis that could involve, for example, finding highly mutated pathways in cancer and other types of genetic disease. Since LARVA's computation consists of many independent units of work, it can benefit from various forms of parallel computation. These forms of computation include distributed computing with a large number of commodity processors, as well as more esoteric types of parallelization, such as general purpose graphics processing unit (GPU) computation.
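The database idea above can be sketched with Python's built-in `sqlite3` module. The table name, column names, and helper functions here are hypothetical and chosen only for illustration; the source does not specify a schema or a database engine.

```python
import sqlite3

# Hypothetical schema: one row per annotation per LARVA run.
SCHEMA = """
CREATE TABLE IF NOT EXISTS annotation_result (
    run_id         TEXT    NOT NULL,
    chrom          TEXT    NOT NULL,
    region_start   INTEGER NOT NULL,
    region_end     INTEGER NOT NULL,
    name           TEXT    NOT NULL,
    mutation_count INTEGER NOT NULL,
    p_value        REAL    NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_result_p ON annotation_result (p_value);
"""


def open_db(path):
    """Open (or create) the results database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_results(conn, run_id, rows):
    """Insert rows of (chrom, start, end, name, mutation_count, p_value)."""
    conn.executemany(
        "INSERT INTO annotation_result VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(run_id, *row) for row in rows],
    )
    conn.commit()


def top_hits(conn, run_id, max_p, limit):
    """Most significant annotations from a single run."""
    cur = conn.execute(
        "SELECT name, mutation_count, p_value FROM annotation_result "
        "WHERE run_id = ? AND p_value <= ? ORDER BY p_value ASC LIMIT ?",
        (run_id, max_p, limit),
    )
    return cur.fetchall()


def recurrent_hits(conn, max_p, min_runs=2):
    """Annotations that pass the p-value cutoff in at least min_runs runs."""
    cur = conn.execute(
        "SELECT name, COUNT(DISTINCT run_id) FROM annotation_result "
        "WHERE p_value <= ? GROUP BY name "
        "HAVING COUNT(DISTINCT run_id) >= ? "
        "ORDER BY COUNT(DISTINCT run_id) DESC, name ASC",
        (max_p, min_runs),
    )
    return cur.fetchall()
```

A query such as `recurrent_hits` is one simple way to combine several runs; a pathway-level meta-analysis would additionally need a table mapping annotations to pathways, which is not shown here.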
We make LARVA available as a free software tool at larva.gersteinlab.org. We demonstrate the effectiveness of LARVA by showing that, on 760 cancer whole genomes, it identifies well-known noncoding drivers such as the TERT promoter. Furthermore, we show that it highlights several novel noncoding regulators that could be potential new noncoding drivers. We also make all of the highly mutated annotations available online.
We also describe the Aggregation and Correlation Toolbox (ACT), a collection of software tools that facilitates the analysis of genomic signal tracks. The aggregation component takes a signal track and a series of genome regions, and creates an aggregate profile of the signal over the given regions. This enables the discovery of consistent signal patterns over related sets of annotations, implying potential connections between the signal and the regions. The correlation component of ACT takes two or more signal tracks and computes all pairwise track correlations. Correlation analyses are useful for finding similarities between various experiments, such as the binding sites of transcription factors as determined by ChIP-seq. The final component of ACT is a saturation tool designed to determine the number of experiments necessary to cover genomic features to saturation. This type of analysis can be illustrated with a ChIP-seq experiment where the inclusion of additional cell lines will reveal more binding sites for a transcription factor of interest: with each new cell line, a smaller fraction of the sites will be newly discovered, and a larger fraction will overlap discovered sites from previously used cell lines. The objective of ACT's saturation tool is to find the point of diminishing returns in the discovery of new sites, which may result in more efficiently planned experiments.
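The three ACT components can be illustrated with a small pure-Python sketch. This is not ACT's code: the function names, the representation of a signal track as a list of per-base values, and the fixed ordering of experiments in the saturation curve are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def aggregate_profile(signal, regions, n_bins):
    """Average a signal over regions after scaling each region to n_bins bins.

    signal  : list of per-position values (assumed representation)
    regions : list of (start, end) half-open intervals, each lying within
              the signal and at least n_bins positions long
    """
    totals = [0.0] * n_bins
    for start, end in regions:
        length = end - start
        for b in range(n_bins):
            lo = start + b * length // n_bins
            hi = start + (b + 1) * length // n_bins
            window = signal[lo:hi]
            totals[b] += sum(window) / len(window)
    return [t / len(regions) for t in totals]


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant tracks."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(tracks):
    """Correlate every pair of named tracks; returns {(name1, name2): r}."""
    return {
        (p, q): pearson(tracks[p], tracks[q])
        for p, q in combinations(sorted(tracks), 2)
    }


def saturation_curve(site_sets):
    """Cumulative number of distinct sites after adding each experiment.

    Uses the experiments in the order given; averaging over several
    orderings would give a smoother curve.
    """
    seen = set()
    curve = []
    for sites in site_sets:
        seen |= set(sites)
        curve.append(len(seen))
    return curve
```

In the saturation curve, the point of diminishing returns shows up as successive values that grow by less and less, which matches the cell-line example in the paragraph above.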
|Advisor:||Gerstein, Mark B.|
|School Location:||United States -- Connecticut|
|Source:||DAI-B 77/06(E), Dissertation Abstracts International|
|Subjects:||Statistics, Bioinformatics, Oncology|
|Keywords:||Cancer, Driver Mutations, Mutation Burden Test|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved