Genomics has recently entered the realm of Big Data, and the last decade has seen an explosion in genome sequencing and assembly. The age of Big Data has also become synonymous with deep learning, and various deep network architectures have been developed to tackle genome annotation problems. At the same time, exciting new techniques have emerged that allow sequencing of only the portions of RNA being actively translated by ribosomes (ribosome profiling) and sequencing of RNA from individual cells (scRNA-seq).
This thesis takes advantage of recent advances in genomics, describing new methods and algorithms to improve the understanding of translation and genetic encoding biases, as well as algorithms to improve annotation at the genome and single-cell levels. Our algorithm to determine the rates of translation of codons using ribosome profiling data from yeast generated the first measurement of the differential rate of translation of all 61 codons in vivo. We developed several analytic approaches to demonstrate that prokaryotic coding regions have little specific depletion of Shine-Dalgarno motifs. We used highly conserved regions of the 16S rRNAs to develop an algorithm to fix erroneous 16S rRNA 3' end annotations in over twelve thousand prokaryotic organisms in NCBI GenBank. In our foray into gene annotation, we evaluated various DNA k-mer embeddings and developed DeepAnnotator, a deep learning architecture for genome annotation which achieved an F-score of 94%. We then turned to automatic annotation of cell phase in scRNA-seq data, describing Pre-Phaser, which established a general computational approach for precise cell phase assignment using k-nearest neighbors. Finally, to pursue the goal of novel transcript and protein detection, we developed a statistical framework to identify all likely frameshift positions in a genome, as well as a frameshift simulator for ribosome profiling data to verify our algorithm.
|Committee:||Balasubramanian, Niranjan; Patro, Rob; Futcher, Bruce|
|School:||State University of New York at Stony Brook|
|School Location:||United States -- New York|
|Source:||DAI-B 82/7(E), Dissertation Abstracts International|
|Subjects:||Computer science, Systematic biology|
|Keywords:||Bioinformatics, Computational biology, Computer science, Data science, Genomics, Statistics|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved