In this thesis we present algorithmic results for computational problems arising in two important areas of computational biology: de novo sequence assembly and de novo motif search.
De novo sequence assembly is generally the problem of reconstructing a genome from its overlapping fragments called reads. One popular approach for solving the problem is to use an overlap graph from the reads and then analyze it. One of the challenges in this approach is to construct and store the overlap graphs efficiently when the number of reads is very large. Constructing and storing the overlap graphs has been a computational bottleneck for many de novo assembly software tools that use the overlap graph approach because it will need Ω(n²) memory and time to store any overlap graphs if we employ any traditional data structure. Here n is the number of reads. For the next generation sequencing data, n is usually ranges from hundreds of millions to billions. As a result, many overlap graph-based software tools are unable to handle this huge amount of data. Fortunately, we propose a data structure for the overlap graphs that provably requires only O(n) space and time. Our experimental results show that our data structure is very efficient in practice. In addition, we develop a De novo assembly software tool, named Large-scale Efficient Assembly Program (LEAP), that utilizes our data structure. LEAP can process efficiently a billion of 454 reads on a desktop computer.
De novo motif search is another important research topic in computational biology. In general terms, de novo motif search is the problem of finding patterns in a set of biological sequences such as DNA or protein sequences. In the literature, motif search has been modeled as various combinatorial problems such as Simple Motif Search (SMS), Planted Motif Search (PMS) - also known as (ℓ, d)-motif search , and Edit-distance-based Motif Search (EMS). Among these, PMS is the most well-known and well-studied problem. There have been many algorithms proposed for solving PMS that usually fall into the two following categories: approximate algorithms and exact algorithms. The difference between the two kinds of algorithms is that exact algorithms will provably find all of the motifs, whereas approximate algorithms may not output all of the motifs. In this thesis, we focus on exact algorithms. All of the known exact algorithms for PMS take exponential time in ℓ and d. We propose novel exact algorithms that improve the best-known exact algorithm in terms of running time and memory. In particular, our algorithm can solve the challenging instances of PMS (21, 8) and (23, 9) that no prior exact algorithm could solve. On a regular desktop, it takes about 4.3 hours to solve the instance (21, 8) and 24 hours for the instance (23, 9). In addition, our algorithms have been incorporated into our online software tool for motif search at http://pms.engr.uconn.edu or at http://motifsearch.com.
|Commitee:||Ammar, Reda, Bi, Jinbo, Mandoiu, Ion, Rajasekaran, Sanguthevar, Wu, Yufeng|
|School:||University of Connecticut|
|Department:||Computer Science and Engineering|
|School Location:||United States -- Connecticut|
|Source:||DAI-B 74/04(E), Dissertation Abstracts International|
|Subjects:||Computer Engineering, Bioinformatics, Computer science|
|Keywords:||Algorithms, DNA sequence assembly, Motif discovery, Motif search|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be