Our ability to study communities of microorganisms has been vastly improved by the development of high-throughput DNA sequences. These technologies however can only sequence short fragments of organism's genomes at a time, which introduces many challenges in translating sequences results to biological insight. The field of bioinformatics has arisen in part to address these problems.
One bioinformatics problem is assigning a genetic sequence to a source organism. It is now common to use high−throughput, short−read sequencing technologies, such as the Illumina MiSeq, to sequence the 16S rRNA gene from a community of microorganisms. Researchers use this information to generate a profile of the different microbial organisms (i.e., the taxonomic composition) present in an environmental sample. There are a number of approaches for assigning taxonomy to genetic sequences, but all suffer from problems with accuracy. The methods that have been most widely used are pairwise alignment methods, like BLAST, UCLUST, and RTAX, and probability-based methods, such as RDP and MOTHUR. These methods can classify microbial sequences with high accuracy when sequences are long (e.g., thousand bases), however accuracy decreases as sequences are shorter. Current high−throughout sequencing technologies generates sequences between about 150 and 500 bases in length.
In my thesis I have developed new software for assigning taxonomy to short DNA sequences using profile Hidden Markov Models (HMMs). HMMs have been applied in related areas, such as assigning biological functions to protein sequences, and I hypothesize that it might be useful for achieving high accuracy taxonomic assignments from 16S rRNA gene sequences. My method builds models of 16S rRNA sequences for different taxonomic groups (kingdom, phylum, class, order, family genus and species) using the Greengenes 16S rRNA database. Given a sequence with unknown taxonomic origin, my method searches each kingdom model to determine the most likely kingdom. It then searches all of the phyla within the highest scoring kingdom to determine the most likely phylum. This iterative process continues until the sequence cannot be assigned at a taxonomic level with a user-defined confidence level, or until a species-level assignment is made that meets the user-defined confidence level.
I next evaluated this method on both artificial and real microbial community data, with both qualitative and quantitative metrics of method performance. The evaluation results showed that in the qualitative analyses (specificity and sensitivity) my method is not as good as the previously existing methods. However, the accuracy in the quantitative analysis was better than some other pre-existing methods. This suggests that my current implementation is sensitive to false positives, but is better at classifying more sequences than the other methods.
I present my method, my evaluations, and suggestions for next steps that might improve the performance of my HMM-based taxonomic classifier.
|Advisor:||Caporaso, James G.|
|Commitee:||Otte, Dieter, Pearson, Talima|
|School:||Northern Arizona University|
|School Location:||United States -- Arizona|
|Source:||MAI 53/05M(E), Masters Abstracts International|
|Subjects:||Bioinformatics, Computer science|
|Keywords:||Hidden Markov models, High accuracy, Short DNA sequences, rRNA|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be