Dissertation/Thesis Abstract

Taxonomic assignment of gene sequences using hidden Markov models
by Huang, Huanhua, M.S., Northern Arizona University, 2014, 54; 1563863
Abstract (Summary)

Our ability to study communities of microorganisms has been vastly improved by the development of high-throughput DNA sequencing technologies. These technologies, however, can only sequence short fragments of organisms' genomes at a time, which introduces many challenges in translating sequencing results into biological insight. The field of bioinformatics has arisen in part to address these problems.

One bioinformatics problem is assigning a genetic sequence to a source organism. It is now common to use high-throughput, short-read sequencing technologies, such as the Illumina MiSeq, to sequence the 16S rRNA gene from a community of microorganisms. Researchers use this information to generate a profile of the different microbial organisms (i.e., the taxonomic composition) present in an environmental sample. There are a number of approaches for assigning taxonomy to genetic sequences, but all suffer from problems with accuracy. The methods that have been most widely used are pairwise alignment methods, like BLAST, UCLUST, and RTAX, and probability-based methods, such as RDP and MOTHUR. These methods can classify microbial sequences with high accuracy when sequences are long (e.g., a thousand bases); however, accuracy decreases as sequences get shorter. Current high-throughput sequencing technologies generate sequences between about 150 and 500 bases in length.

In my thesis I have developed new software for assigning taxonomy to short DNA sequences using profile Hidden Markov Models (HMMs). HMMs have been applied in related areas, such as assigning biological functions to protein sequences, and I hypothesized that they might be useful for achieving high-accuracy taxonomic assignments from 16S rRNA gene sequences. My method builds models of 16S rRNA sequences for different taxonomic groups (kingdom, phylum, class, order, family, genus, and species) using the Greengenes 16S rRNA database. Given a sequence of unknown taxonomic origin, my method searches each kingdom model to determine the most likely kingdom. It then searches all of the phyla within the highest-scoring kingdom to determine the most likely phylum. This iterative process continues until the sequence cannot be assigned at a taxonomic level with a user-defined confidence level, or until a species-level assignment is made that meets the user-defined confidence level.
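
The top-down search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis software: the `score` callable stands in for a profile HMM search, the toy taxonomy and scores are invented, and the confidence measure (the best score's share of the summed sibling scores) is a hypothetical choice, since the abstract does not say how confidence is computed.

```python
from typing import Callable, Dict, List, Tuple

Lineage = Tuple[str, ...]

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def assign_taxonomy(
    sequence: str,
    taxonomy: Dict[Lineage, List[str]],
    score: Callable[[str, Lineage], float],
    min_confidence: float = 0.8,
) -> Lineage:
    """Walk the taxonomy from kingdom downward, one rank at a time.

    `taxonomy` maps a lineage to the names of its child taxa.
    `score` is a hypothetical stand-in for scoring `sequence` against the
    model of one taxon; higher means a better match.
    """
    lineage: Lineage = ()
    for _rank in RANKS:
        children = taxonomy.get(lineage, [])
        if not children:
            break  # no models below this taxon
        scores = {child: score(sequence, lineage + (child,)) for child in children}
        best = max(scores, key=scores.get)
        total = sum(scores.values())
        # Assumed confidence measure: share of the total sibling score.
        confidence = scores[best] / total if total > 0 else 0.0
        if confidence < min_confidence:
            break  # stop at the last rank that met the threshold
        lineage = lineage + (best,)
    return lineage


# Toy data (invented for illustration only).
TOY_TAXONOMY: Dict[Lineage, List[str]] = {
    (): ["Bacteria", "Archaea"],
    ("Bacteria",): ["Firmicutes", "Proteobacteria"],
}

TOY_SCORES: Dict[Lineage, float] = {
    ("Bacteria",): 9.0,
    ("Archaea",): 1.0,
    ("Bacteria", "Firmicutes"): 6.0,
    ("Bacteria", "Proteobacteria"): 4.0,
}


def toy_score(sequence: str, lineage: Lineage) -> float:
    """Return a fixed score per taxon; a real scorer would use the sequence."""
    return TOY_SCORES[lineage]


strict = assign_taxonomy("ACGT", TOY_TAXONOMY, toy_score, min_confidence=0.8)
lenient = assign_taxonomy("ACGT", TOY_TAXONOMY, toy_score, min_confidence=0.5)
```

With the strict threshold the toy sequence is assigned only to a kingdom, because the best phylum holds just 60% of the sibling score. Lowering the threshold lets the assignment continue one rank deeper, which illustrates the trade-off between assignment depth and confidence.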

I next evaluated this method on both artificial and real microbial community data, using both qualitative and quantitative metrics of method performance. The evaluation results showed that in the qualitative analyses (specificity and sensitivity) my method does not perform as well as previously existing methods. However, its accuracy in the quantitative analysis was better than that of some pre-existing methods. This suggests that my current implementation is prone to false positives, but is able to classify more sequences than the other methods.
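
For reference, sensitivity and specificity have standard definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below applies them to the presence or absence of taxa in a community of known composition. How the thesis counted true and false positives is not stated in this abstract, so the set-based framing, the function name, and the example taxa are all assumptions made for illustration.

```python
from typing import Set, Tuple


def sensitivity_specificity(
    expected: Set[str], observed: Set[str], universe: Set[str]
) -> Tuple[float, float]:
    """Compute presence/absence sensitivity and specificity.

    expected: taxa known to be in the sample
    observed: taxa reported by the classifier
    universe: all taxa the classifier could report (assumed to be known)
    """
    tp = len(expected & observed)             # present and reported
    fn = len(expected - observed)             # present but missed
    fp = len(observed - expected)             # reported but absent
    tn = len(universe - expected - observed)  # absent and not reported
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Invented example: three expected taxa, one missed, one false positive.
sens, spec = sensitivity_specificity(
    expected={"A", "B", "C"},
    observed={"A", "B", "D"},
    universe={"A", "B", "C", "D", "E"},
)
```

In this framing, a classifier that reports many taxa absent from the sample accumulates false positives, which lowers its specificity.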

I present my method, my evaluations, and suggestions for next steps that might improve the performance of my HMM-based taxonomic classifier.

Indexing (document details)
Advisor: Caporaso, James G.
Committee: Otte, Dieter; Pearson, Talima
School: Northern Arizona University
Department: Engineering Programs
School Location: United States -- Arizona
Source: MAI 53/05M(E), Masters Abstracts International
Source Type: DISSERTATION
Subjects: Bioinformatics, Computer science
Keywords: Hidden Markov models, High accuracy, Short DNA sequences, rRNA
Publication Number: 1563863
ISBN: 9781321137620