Dissertation/Thesis Abstract

Decentralizing Indices for Genomic Data
by Irber, Luiz Carlos, Jr., Ph.D., University of California, Davis, 2020, 132; 28094567
Abstract (Summary)

Biology as a field is being transformed by the increasing availability of data, especially genomic sequencing data. Computational methods that can adapt and take advantage of this data deluge are essential for exploring and providing insights for new hypotheses, helping to unveil the biological processes that were previously expensive or even impossible to study.

This dissertation introduces data structures and approaches for scaling data analysis to hundreds of thousands of DNA sequencing datasets: Scaled MinHash sketches, a reduced-space representation of the original datasets that can lower computational requirements for similarity and containment estimation; MHBT and LCA indices, structures for indexing and searching large collections of Scaled MinHash sketches; gather, a new top-down approach for decomposing datasets into a collection of reference components that can be implemented efficiently with Scaled MinHash sketches and MHBT and LCA indices; wort, a distributed system for large-scale sketch computation across heterogeneous systems, from laptops to academic clusters and cloud instances, including prototypes for containment searches across millions of datasets; and explorations of how to facilitate sharing and increase the resilience of sketch collections built from public genomic data.
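
The abstract does not spell out the algorithms, so the following is only a minimal illustrative sketch, not code from the dissertation. It assumes the commonly described form of a Scaled MinHash sketch (keep every k-mer hash that falls below 2^64 / scaled), estimates containment as the fraction of the query's hashes found in a reference, and models gather as a greedy loop that repeatedly picks the reference covering the most still-unexplained query hashes. All function names and parameters (`scaled_minhash`, `containment`, `gather`, `k`, `scaled`) are hypothetical, and BLAKE2b stands in for whatever hash function a real implementation uses; reverse-complement handling is omitted.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space (assumption of this sketch)


def kmers(seq, k):
    """Yield every substring of length k from seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer. BLAKE2b is a stand-in hash here."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_minhash(seq, k=21, scaled=1000):
    """Keep only the hashes below MAX_HASH / scaled (about 1/scaled of them)."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def containment(query, reference):
    """Estimate the fraction of the query that is present in the reference."""
    if not query:
        return 0.0
    return len(query & reference) / len(query)


def gather(query, references):
    """Greedy decomposition: repeatedly take the reference with the largest
    overlap against the hashes that are still unexplained."""
    remaining = set(query)
    results = []
    while remaining:
        name, overlap = max(
            ((n, len(remaining & s)) for n, s in references.items()),
            key=lambda pair: pair[1],
            default=(None, 0),
        )
        if overlap == 0:
            break  # nothing left that any reference can explain
        results.append((name, overlap))
        remaining -= references[name]
    return results
```

Because the threshold depends only on `scaled`, sketches built with the same `k` and `scaled` can be compared directly with set operations, which is what makes the containment estimate and the greedy loop above straightforward to express.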

Indexing (document details)
Advisor: Brown, C. Titus
Committee: Rubio González, Cindy, Koslicki, David M., Díaz-Muñoz, Sam
School: University of California, Davis
Department: Computer Science
School Location: United States -- California
Source: DAI-B 82/9(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Bioinformatics, Computer science, Genetics
Keywords: Compositional analysis, Decentralizing, Distributed, Indexing, Searching, Sketches, Genomic data, MinHash Bloom Tree
Publication Number: 28094567
ISBN: 9798582545620
Copyright © 2021 ProQuest LLC. All rights reserved.