Genomics has benefited from an explosion in affordable high-throughput technology for whole genome sequencing. The regulatory and functional aspects of non-coding regions may be an important contributor to oncogenesis. In 154 whole genome sequences covering lung adenocarcinoma (LUAD), breast invasive carcinoma (BRCA), kidney renal papillary cell carcinoma (KIRP), uterine corpus endometrial carcinoma (UCEC), and colon adenocarcinoma (COAD) across two races, the majority of cancer-associated mutations are found outside of the coding region (4,432,885 in non-coding versus 1,412,731 in coding regions). A pan-cancer analysis found significantly mutated windows (292 to 3,881 in count), demonstrating that there are significant numbers of large mutated regions in the non-coding genome. Fifty-nine significantly mutated windows were found in all studied races and cancers, including many in centromeric locations. The X chromosome had the largest set of universal windows, which cluster almost exclusively in Xq11, an area linked to chromosomal instability and oncogenesis. The presence of 19 to 114 large consecutive clusters (super windows) provides further evidence that large mutated regions in the genome are influencing cancer development. We investigated the frequency of these single-nucleotide variations (SNVs) in 12 different tissue-independent DNA functional elements. We demonstrated that the overlap of cancer-linked variations with these DNA functional elements is not likely the result of random selection, and most functional elements had significantly more single-nucleotide variations than expected. We identified highly variant functional elements in 5 cancer types, primarily in long non-coding RNAs and transcription factor binding sites, suggesting that some functional elements might have wide-ranging effects on oncogenesis.
Finally, we demonstrated that the ratios of SNVs within DNA functional elements show a level of distinction, suggesting that different cancer types can be fingerprinted via these ratios. A multinomial logistic regression algorithm was combined with one-hot encoding, a cross-entropy loss function, and a custom stochastic gradient descent function to build several prediction models based on the data generated. Three models were trained on, respectively, a binary representation of variation in the 59 universal windows, variation counts in the 59 universal windows, and ratios of variations found in DNA functional elements. These models achieved 53.3%, 33.3%, and 40.0% accuracy on the test set, respectively. Counterintuitively, the model with the lowest performance (variation counts in the universal windows) showed the most promise for improvement through increased data.
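The modeling approach named above (multinomial logistic regression, one-hot encoded labels, cross-entropy loss, stochastic gradient descent) can be illustrated with a minimal sketch. This is not the dissertation's code: the function names, the learning rate, the epoch count, and the tiny synthetic count matrix `TOY_X` (three hypothetical windows, three hypothetical classes) are all assumptions chosen only to show how the pieces fit together.

```python
import math
import random

def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot vector."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def softmax(scores):
    """Turn raw class scores into probabilities (shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, bias, x):
    """Class probabilities for one feature vector x."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(max(p, 1e-12)) for p, t in zip(probs, target))

def train_sgd(samples, labels, num_classes, lr=0.1, epochs=200, seed=0):
    """Fit weights with per-sample stochastic gradient descent.

    lr, epochs, and seed are illustrative defaults, not values from the source.
    """
    rng = random.Random(seed)
    num_features = len(samples[0])
    weights = [[0.0] * num_features for _ in range(num_classes)]
    bias = [0.0] * num_classes
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = samples[i]
            target = one_hot(labels[i], num_classes)
            probs = predict_proba(weights, bias, x)
            for k in range(num_classes):
                # Gradient of cross-entropy w.r.t. the class-k score.
                grad = probs[k] - target[k]
                bias[k] -= lr * grad
                for j in range(num_features):
                    weights[k][j] -= lr * grad * x[j]
    return weights, bias

def mean_loss(weights, bias, samples, labels, num_classes):
    """Average cross-entropy over a labeled data set."""
    losses = [cross_entropy(predict_proba(weights, bias, x),
                            one_hot(y, num_classes))
              for x, y in zip(samples, labels)]
    return sum(losses) / len(losses)

def predict(weights, bias, x):
    """Index of the most probable class."""
    probs = predict_proba(weights, bias, x)
    return max(range(len(probs)), key=probs.__getitem__)

# Synthetic toy data (an assumption, not real study data): rows are samples,
# columns are variation counts in three hypothetical windows.
TOY_X = [[3, 0, 1], [4, 1, 0], [0, 3, 1], [1, 4, 0], [0, 1, 3], [1, 0, 4]]
TOY_Y = [0, 0, 1, 1, 2, 2]

if __name__ == "__main__":
    w, b = train_sgd(TOY_X, TOY_Y, num_classes=3)
    print("training loss:", mean_loss(w, b, TOY_X, TOY_Y, 3))
    print("predictions:", [predict(w, b, x) for x in TOY_X])
```

The same training loop would accept any of the three feature representations described above (binary window indicators, window counts, or functional-element ratios), since only the contents of the feature vectors change.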
|Committee:||Hu, Valerie; Morizono, Hiroki; Dimri, Goberdhan|
|School:||The George Washington University|
|Department:||Genomics and Bioinformatics|
|School Location:||United States -- District of Columbia|
|Source:||DAI-B 82/3(E), Dissertation Abstracts International|
|Keywords:||Cancer, TCGA, Whole genome sequencing|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved