Biology as a field is being transformed by the increasing availability of data, especially genomic sequencing data. Computational methods that can adapt to and take advantage of this data deluge are essential for exploring the data and generating insights for new hypotheses, helping to unveil biological processes that were previously expensive or even impossible to study.
This dissertation introduces data structures and approaches for scaling data analysis to hundreds of thousands of DNA sequencing datasets: Scaled MinHash sketches, a reduced-space representation of the original datasets that can lower computational requirements for similarity and containment estimation; MHBT and LCA indices, structures for indexing and searching large collections of Scaled MinHash sketches; gather, a new top-down approach for decomposing datasets into a collection of reference components that can be implemented efficiently with Scaled MinHash sketches and MHBT and LCA indices; wort, a distributed system for large-scale sketch computation across heterogeneous systems, from laptops to academic clusters and cloud instances, including prototypes for containment searches across millions of datasets; and explorations of how to facilitate sharing and increase the resilience of sketch collections built from public genomic data.
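As a rough illustration of the ideas named above, the following Python sketch shows one way a Scaled MinHash sketch, similarity and containment estimates, and a greedy gather-style decomposition could look. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the function names, the `scaled` parameter semantics (keep hashes below 2^64 / scaled), the use of BLAKE2b as a stand-in hash function, and the greedy best-overlap loop are all assumptions made for illustration, and details such as canonical k-mers are omitted.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space (assumed for illustration)


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is a stand-in hash here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_minhash(seq, k=21, scaled=1000):
    """Keep only hashes below MAX_HASH / scaled, roughly 1/scaled of all k-mers."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Estimate similarity as |A ∩ B| / |A ∪ B| on the retained hashes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, reference):
    """Estimate the fraction of the query that is present in the reference."""
    return len(query & reference) / len(query) if query else 0.0


def gather(query, references):
    """Greedy decomposition (hypothetical reading of 'gather'): repeatedly
    pick the reference covering the most remaining query hashes, record it,
    and remove the covered hashes before searching again."""
    remaining = set(query)
    results = []
    while remaining:
        best = max(references.items(),
                   key=lambda item: len(remaining & item[1]),
                   default=None)
        if best is None:
            break
        name, sketch = best
        overlap = remaining & sketch
        if not overlap:
            break  # no reference explains what is left of the query
        results.append((name, len(overlap)))
        remaining -= overlap
    return results
```

Because every sketch keeps the hashes below the same threshold, sketches built with the same `k` and `scaled` values can be compared directly with set operations, which is what makes both containment estimation and the subtraction step in the decomposition loop cheap.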
|Advisor:||Brown, C. Titus|
|Committee:||Rubio González, Cindy; Koslicki, David M.; Díaz-Muñoz, Sam|
|School:||University of California, Davis|
|School Location:||United States -- California|
|Source:||DAI-B 82/9(E), Dissertation Abstracts International|
|Subjects:||Bioinformatics, Computer science, Genetics|
|Keywords:||Compositional analysis, Decentralizing, Distributed, Indexing, Searching, Sketches, Genomic data, MinHash Bloom Tree|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved