Dissertation/Thesis Abstract

An Exploration of Cancer-Associated Non-Coding Variations in Whole Genome Sequencing Data
by Torcivia, John Paul, Ph.D., The George Washington University, 2020, 183; 28088994
Abstract (Summary)

Genomics has benefited from an explosion in affordable high-throughput technology for whole genome sequencing. The regulatory and functional aspects of non-coding regions may be important contributors to oncogenesis. Across 154 whole genome sequences covering two races and five cancers, namely lung adenocarcinoma (LUAD), breast invasive carcinoma (BRCA), kidney renal papillary cell carcinoma (KIRP), uterine corpus endometrial carcinoma (UCEC), and colon adenocarcinoma (COAD), the majority of cancer-associated mutations are found outside of the coding region (4,432,885 in non-coding versus 1,412,731 in coding regions). A pan-cancer analysis found significantly mutated windows (292 to 3,881 in count), demonstrating that there are significant numbers of large mutated regions in the non-coding genome. Fifty-nine significantly mutated windows were found in all studied races and cancers, including many in centromeric locations. The X chromosome had the largest set of universal windows, which cluster almost exclusively in Xq11, an area linked to chromosomal instability and oncogenesis. The presence of 19 to 114 large consecutive clusters (super windows) provides further evidence that large mutated regions in the genome are influencing cancer development. We investigated the frequency of these single-nucleotide variations (SNVs) in 12 different tissue-independent DNA functional elements. We demonstrated that the overlap of cancer-linked variations with these DNA functional elements is not likely the result of random selection, and most functional elements had significantly more single-nucleotide variations than expected. We identified highly variant functional elements in 5 cancer types, primarily in long non-coding RNAs and transcription factor binding sites, suggesting that some functional elements might have wide-ranging effects on oncogenesis.
Finally, we demonstrated that the ratios of SNVs within DNA functional elements show a level of distinction among cancer types, suggesting that different cancer types can be fingerprinted via these ratios. To build several prediction models from the data generated, we combined a multinomial logistic regression algorithm with one-hot encoding, a cross-entropy distance function for loss calculation, and a stochastic gradient descent function that we created. Three models were generated and trained on, respectively, a binary representation of variation in the 59 universal windows, variation counts in the 59 universal windows, and ratios of variations found in DNA functional elements. These models performed at 53.3%, 33.3%, and 40.0% accuracy on the test set, respectively. Counterintuitively, the model with the lowest performance (variation counts in the universal windows) showed the most promise for improvement through increased data.
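
The modeling approach named above (multinomial logistic regression with one-hot encoded labels, a cross-entropy loss, and stochastic gradient descent) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, learning rate, epoch count, and the toy feature vectors (standing in for, say, ratios of variations across three functional elements) are all assumptions made for illustration only.

```python
import math
import random


def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot vector."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def softmax(scores):
    """Convert raw class scores to probabilities (shifted for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(max(p, 1e-12)) for p, t in zip(probs, target))


def predict_proba(weights, biases, x):
    """Class probabilities for one feature vector x."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def train_sgd(samples, labels, num_classes, lr=0.1, epochs=200, seed=0):
    """Fit weights by stochastic gradient descent, one sample per update.

    lr, epochs, and seed are illustrative defaults, not values from the source.
    """
    rng = random.Random(seed)
    n_features = len(samples[0])
    weights = [[0.0] * n_features for _ in range(num_classes)]
    biases = [0.0] * num_classes
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = samples[i]
            target = one_hot(labels[i], num_classes)
            probs = predict_proba(weights, biases, x)
            for k in range(num_classes):
                # Gradient of cross-entropy w.r.t. score k is (p_k - y_k).
                grad = probs[k] - target[k]
                biases[k] -= lr * grad
                for j in range(n_features):
                    weights[k][j] -= lr * grad * x[j]
    return weights, biases


def predict(weights, biases, x):
    """Return the index of the most probable class."""
    probs = predict_proba(weights, biases, x)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical toy data: three classes, three features per sample.
TOY_X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
         [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
         [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
TOY_Y = [0, 0, 1, 1, 2, 2]

if __name__ == "__main__":
    w, b = train_sgd(TOY_X, TOY_Y, num_classes=3)
    print([predict(w, b, x) for x in TOY_X])
```

In this sketch each cancer type would correspond to one class index, and the three feature sets described above (binary window variation, window variation counts, functional-element ratios) would simply be different choices of input vector.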

Indexing (document details)
Advisor: Mazumder, Raja
Committee: Hu, Valerie; Morizono, Hiroki; Dimri, Goberdhan
School: The George Washington University
Department: Genomics and Bioinformatics
School Location: United States -- District of Columbia
Source: DAI-B 82/3(E), Dissertation Abstracts International
Subjects: Bioinformatics, Genetics
Keywords: Cancer, TCGA, Whole genome sequencing
Publication Number: 28088994
ISBN: 9798664790399
Copyright © 2020 ProQuest LLC. All rights reserved.