Dissertation/Thesis Abstract

Computational Methods for Exploring the Tandem Repeat Protein Universe
by Newman, Aaron Matthew, Ph.D., University of California, Santa Barbara, 2010, 259 pages; Publication No. 3428004
Abstract (Summary)

Proteins with tandemly repeated sequence architectures constitute an important class of naturally evolved molecules in Earth's biosphere, and range from the polymeric antigenic surface proteins of protozoan parasites to biomaterials like animal and bacterial collagens, insect and spider silks, and mechanical scaffolds in plant cell walls. In the post-genomic era, there are unprecedented opportunities to identify and classify all Tandem Repeat Proteins (TRPs) from the rapidly growing genomic and metagenomic sequence datasets. Unfortunately, current computational methods have major limitations for mining the highly repetitive sequence content of TRPs, relegating many of these important molecules to the 'hypothetical', 'unknown', or misannotated regions of the protein sequence universe. In this dissertation, I present and validate novel computational methods for identifying and classifying the diverse TRP content of Earth's myriad genomes. First, for efficient identification and architecture modeling of tandem repeat (TR) motifs from multi-genomic datasets, I developed XSTREAM, which implements a fast seed-extension strategy to address limitations of previous tools, such as impractical running times on long input sequences and restricted TR pattern sizes. XSTREAM also incorporates several important post-processing algorithms to model TR architectures, remove repeat redundancy, and ultimately identify "fundamental" TR patterns. Second, to classify large collections of TR motifs from multiple genomes without prior knowledge of cluster number or structure, I designed and implemented a general-purpose data clustering strategy called AutoSOME. As demonstrated using a variety of benchmark datasets, including microarray gene expression data, AutoSOME effectively identifies both discrete and fuzzy clusters from large, high-dimensional datasets, and benchmarks favorably against common clustering methods without the need for data filtration or prior knowledge of cluster number.
The utility of these new computational methods for mining TRPs from multiple genomes is demonstrated by a large-scale phylogenetic analysis of proline-rich TRPs targeted to the plant secretory pathway. Thirty-one Pro-rich TRP families representing both known and novel proteins were identified, and analysis of these families yielded new insights into plant evolutionary biology. Taken together, the computational strategies described and validated in this dissertation represent an effective methodological framework for elucidating the naturally evolved TRP sequence universe and, in future work, could be used for assembling Nature's "Parts List," a multi-genomic catalog of TRP families with applications for diverse fields, including evolutionary and experimental biology, biomedicine, and biomimetics.
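
As a rough illustration of what a seed-extension search for tandem repeats can look like, the following Python sketch seeds on a residue that matches the residue one period downstream and extends the run of matches for as long as it holds. It is not XSTREAM's algorithm: it handles only exact repeats, performs no post-processing or redundancy removal, and the function name `find_tandem_repeats` and its parameters are hypothetical, chosen here for illustration only.

```python
def find_tandem_repeats(seq, min_period=1, max_period=10, min_copies=2):
    """Find exact tandem repeats in seq (illustrative sketch, not XSTREAM).

    Returns a list of (start, period, copies, unit) tuples, where `copies`
    counts only complete repeat units.
    """
    hits = []
    n = len(seq)
    for period in range(min_period, max_period + 1):
        i = 0
        while i + period < n:
            # Seed: a residue equal to the residue one period downstream.
            if seq[i] != seq[i + period]:
                i += 1
                continue
            # Extension: advance while the period-shifted match holds.
            j = i
            while j + period < n and seq[j] == seq[j + period]:
                j += 1
            span = (j - i) + period   # residues covered by the repeat
            copies = span // period   # complete units within that span
            if copies >= min_copies:
                hits.append((i, period, copies, seq[i:i + period]))
            i = j + 1                 # resume after the first mismatch
    return hits


if __name__ == "__main__":
    # Toy collagen-like sequence with a repeated "PPG" unit.
    for hit in find_tandem_repeats("MKPPGPPGPPGAQ", min_period=2, max_period=6):
        print(hit)
```

Because every period is scanned independently, a single repeat can be reported at several periods (for example, a homopolymer run appears at period 1 and again at period 2). Collapsing such redundant hits is the kind of post-processing the abstract attributes to XSTREAM and is deliberately left out of this sketch.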

Indexing (document details)
Advisor: Cooper, James B.
Committee: Poole, Stephen; Singh, Ambuj; Waite, Herbert; Yan, Xifeng
School: University of California, Santa Barbara
Department: Biomolecular Science and Engineering
School Location: United States -- California
Source: DAI-B 72/01, Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Molecular biology, Bioinformatics, Computer science
Keywords: Clustering, Gene expression, Microarrays, Plant cell walls, Tandem repeat proteins
Publication Number: 3428004
ISBN: 978-1-124-33242-0