Proteins with tandemly repeated sequence architectures constitute an important class of naturally evolved molecules in Earth's biosphere and range from the polymeric antigenic surface proteins of protozoan parasites to biomaterials like animal and bacterial collagens, insect and spider silks, and mechanical scaffolds in plant cell walls. In the post-genomic era, there are unprecedented opportunities to identify and classify all Tandem Repeat Proteins (TRPs) from the rapidly growing genomic and metagenomic sequence datasets. Unfortunately, current computational methods have major limitations for mining the highly repetitive sequence content of TRPs, relegating many of these important molecules to the 'hypothetical', 'unknown', or misannotated regions of the protein sequence universe. In this dissertation I present and validate novel computational methods for identifying and classifying the diverse TRP content of Earth's myriad genomes. First, for efficient identification and architecture modeling of TR motifs from multi-genomic datasets, I developed XSTREAM, which implements a fast seed-extension strategy to address limitations of previous tools, such as impractical running times on long input sequences and restricted TR pattern sizes. XSTREAM also incorporates several important post-processing algorithms to model TR architectures, remove repeat redundancy, and, ultimately, identify "fundamental" TR patterns. Second, to classify large collections of TR motifs from multiple genomes without prior knowledge of cluster number or structure, I designed and implemented a general-purpose data clustering strategy called AutoSOME. As demonstrated using a variety of benchmark datasets, including microarray gene expression data, AutoSOME effectively identifies both discrete and fuzzy clusters from large, high-dimensional datasets, and benchmarks favorably against common clustering methods without the need for data filtration or prior knowledge of cluster number.
The utility of these new computational methods for mining TRPs from multiple genomes is demonstrated by a large-scale phylogenetic analysis of proline-rich TRPs targeted to the plant secretory pathway. Thirty-one Pro-rich TRP families representing both known and novel proteins were identified, and analysis of these families yielded new insights into plant evolutionary biology. Taken together, the computational strategies described and validated in this dissertation represent an effective methodological framework for elucidating the naturally evolved TRP sequence universe and, in future work, could be used for assembling Nature's "Parts List," a multi-genomic catalog of TRP families with applications for diverse fields, including evolutionary and experimental biology, biomedicine, and biomimetics.
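The abstract mentions a seed-extension strategy for finding tandem repeats. As a rough illustration of that general idea only, the Python sketch below looks for a short exact seed that recurs one period downstream and then extends the match character by character. It is not XSTREAM's actual algorithm: the function name, the parameters (`min_period`, `max_period`, `seed_len`, `min_copies`), and the exact-match rule are all assumptions made for this example, and no mismatch tolerance or post-processing is attempted.

```python
from typing import List, NamedTuple


class TandemRepeat(NamedTuple):
    start: int      # 0-based index where the repeat begins
    period: int     # length of one repeat unit
    copies: float   # number of unit copies (may be fractional)
    unit: str       # first copy of the repeat unit


def find_tandem_repeats(seq: str, min_period: int = 1, max_period: int = 50,
                        seed_len: int = 3,
                        min_copies: float = 3.0) -> List[TandemRepeat]:
    """Hypothetical exact-match seed-and-extend scan (illustrative only)."""
    hits: List[TandemRepeat] = []
    n = len(seq)
    i = 0
    while i < n:
        best = None
        best_span = 0
        for period in range(min_period, max_period + 1):
            if i + period + seed_len > n:
                break  # no room left for a seed one period downstream
            # Seed step: the seed must recur exactly one period later.
            if seq[i:i + seed_len] != seq[i + period:i + period + seed_len]:
                continue
            # Extension step: keep going while seq[j] matches seq[j + period].
            j = i + seed_len
            while j + period < n and seq[j] == seq[j + period]:
                j += 1
            span = (j - i) + period  # total length covered by the repeat
            copies = span / period
            # Strict '>' keeps the shortest period when spans tie.
            if copies >= min_copies and span > best_span:
                best = TandemRepeat(i, period, copies, seq[i:i + period])
                best_span = span
        if best is not None:
            hits.append(best)
            i += best_span  # skip past the reported repeat
        else:
            i += 1
    return hits


if __name__ == "__main__":
    # Toy sequence: four exact copies of "PPG" between short flanks.
    for hit in find_tandem_repeats("MKT" + "PPG" * 4 + "LLV"):
        print(hit)
```

On the toy sequence, the scan reports a single repeat with unit `PPG`, period 3, and four copies starting at index 3. Preferring the shortest period among equally long spans is one simple way to favor the smallest repeating unit; real TRP sequences contain substitutions and indels, which an exact-match scan like this would miss.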
Advisor: Cooper, James B.
Committee: Poole, Stephen; Singh, Ambuj; Waite, Herbert; Yan, Xifeng
School: University of California, Santa Barbara
Department: Biomolecular Science and Engineering
School Location: United States -- California
Source: DAI-B 72/01, Dissertation Abstracts International
Source Type: Dissertation
Subjects: Molecular biology, Bioinformatics, Computer science
Keywords: Clustering, Gene expression, Microarrays, Plant cell walls, Tandem repeat proteins
Publication Number: 3428004
ISBN: 978-1-124-33242-0