Genome assembly is a critical first step for biological discovery. All current sequencing technologies share the fundamental limitation that segments read from a genome are much shorter than even the smallest genomes. Traditionally, whole-genome shotgun (WGS) sequencing over-samples a single clonal (or inbred) target chromosome with segments from random positions. The amount of over-sampling is known as the coverage. Assembly software then reconstructs the target. So-called next-generation (or second-generation) sequencing has reduced the cost and increased throughput exponentially over first-generation sequencing. Unfortunately, next-generation sequences present their own challenges to genome assembly: (1) they require amplification of source DNA prior to sequencing, leading to artifacts and biased coverage of the genome; (2) they produce relatively short reads of 100–700 bp; (3) the sizeable runtime of most second-generation instruments is prohibitive for applications requiring rapid analysis, with an Illumina HiSeq 2000 instrument requiring 11 days for the sequencing reaction.
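As a rough illustration of the coverage notion above, average coverage is commonly estimated as the total number of sequenced bases divided by the genome length. The sketch below is only an illustration: the function name and the read and genome sizes are hypothetical and are not taken from the dissertation.

```python
def expected_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of times each base is sampled: (N * L) / G."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length


# Hypothetical example: 2 million reads of 100 bp from a 5 Mbp genome.
coverage = expected_coverage(2_000_000, 100, 5_000_000)
print(f"{coverage:.1f}x")
```

This is an average only; as the abstract notes, amplification can bias coverage so that individual regions are sampled far more or less often than this figure.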
Recently, successors to the second-generation instruments (third-generation) have become available. These instruments promise to alleviate many of the downsides of second-generation sequencing and can generate multi-kilobase sequences. The long sequences have the potential to dramatically improve genome and transcriptome assembly. However, the high error rate of these reads is challenging and has limited their use. To address this limitation, we introduce a novel correction algorithm and assembly strategy that uses shorter, high-identity sequences to correct errors in single-molecule sequences. Our approach achieves over 99% read accuracy and produces substantially better assemblies than current sequencing strategies.
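The general idea of correcting a noisy long read with accurate short reads can be sketched with a toy majority vote. This is not the dissertation's algorithm. It assumes the short reads have already been placed at known offsets on the long read and that errors are substitutions only (real single-molecule reads also contain insertions and deletions). The function name and the example sequences are hypothetical.

```python
from collections import Counter


def correct_long_read(long_read: str, aligned_short_reads: list[tuple[int, str]]) -> str:
    """Replace each long-read base with the strict-majority base of the short reads covering it."""
    # One vote counter per long-read position.
    columns = [Counter() for _ in long_read]
    for offset, seq in aligned_short_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < len(long_read):
                columns[pos][base] += 1

    corrected = []
    for original, votes in zip(long_read, columns):
        if votes:
            best, count = votes.most_common(1)[0]
            # Accept the short-read base only on a strict majority; otherwise keep the original.
            corrected.append(best if 2 * count > sum(votes.values()) else original)
        else:
            corrected.append(original)  # no short-read evidence at this position
    return "".join(corrected)


# Hypothetical example: the long read has a substitution error at position 3 (T instead of A).
noisy = "ACGTTCGA"
short_reads = [(0, "ACGAT"), (2, "GATCG"), (3, "ATCGA")]
print(correct_long_read(noisy, short_reads))
```

Positions with no short-read coverage are left unchanged, which is one reason such an approach depends on adequate short-read coverage.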
The availability of cheaper sequencing has made new sequencing targets, such as multiple displacement amplified (MDA) single cells and metagenomes, popular. Current algorithms assume assembly of a single clonal target, an assumption that is violated in these sequencing projects. We developed Bambus 2, a new scaffolder that works for metagenomic and single-cell datasets. It can accurately detect repeats without assumptions about the taxonomic composition of a dataset. It can also identify biological variations present in a sample. We have developed a novel end-to-end analysis pipeline leveraging Bambus 2. Due to its modular nature, it is applicable to clonal, metagenomic, and MDA single-cell targets and allows a user to rapidly go from sequences to assembly, annotation, genes, and taxonomic information. We have incorporated a novel viewer, allowing a user to interactively explore the variation present in a genomic project on a laptop.
Together, these developments make genome assembly applicable to novel targets while utilizing emerging sequencing technologies. As genome assembly is critical for all aspects of bioinformatics, these developments will enable novel biological discovery.
|Committee:||Daume, Hal, III; El-Sayed, Najib M.; Kingsford, Carl; Sussman, Alan|
|School:||University of Maryland, College Park|
|School Location:||United States -- Maryland|
|Source:||DAI-B 73/12(E), Dissertation Abstracts International|
|Subjects:||Bioinformatics, Computer science|
|Keywords:||Genome assembly, Harnessing emerging sequencing, Metagenomic, Single-cell, Single-molecule sequencing|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved