Proteins with tandemly repeated sequence architectures constitute an important class of naturally-evolved molecules in Earth's biosphere, and range from the polymeric antigenic surface proteins of protozoan parasites to biomaterials like animal and bacterial collagens, insect and spider silks, and mechanical scaffolds in plant cell walls. In the post-genomic era, there are unprecedented opportunities to identify and classify all Tandem Repeat Proteins (TRPs) from the rapidly growing genomic and metagenomic sequence datasets. Unfortunately, current computational methods have major limitations for mining the highly repetitive sequence content of TRPs, relegating many of these important molecules to the 'hypothetical', 'unknown', or misannotated regions of the protein sequence universe. In this dissertation I present and validate novel computational methods for identifying and classifying the diverse TRP content of Earth's myriad genomes. For efficient identification and architecture modeling of TR motifs from multi-genomic datasets, I developed XSTREAM, which implements a fast seed-extension strategy to address limitations of previous tools, such as impractical running times on long input sequences and restricted TR pattern sizes. XSTREAM also incorporates several important post-processing algorithms to model TR architectures, remove repeat redundancy, and ultimately, identify "fundamental" TR patterns. Second, to classify large collections of TR motifs from multiple genomes without prior knowledge of cluster number or structure, I designed and implemented a general-purpose data clustering strategy called AutoSOME. As demonstrated using a variety of benchmark datasets, including microarray gene expression data, AutoSOME effectively identifies both discrete and fuzzy clusters from large, high-dimensional datasets, and benchmarks favorably against common clustering methods without the need for data filtration or prior knowledge of cluster number. 
The utility of these new computational methods for mining TRPs from multiple genomes is demonstrated by a large-scale phylogenetic analysis of proline-rich TRPs targeted to the plant secretory pathway. Thirty-one Pro-rich TRP families representing both known and novel proteins were identified, and analysis of these families yielded new insights into plant evolutionary biology. Taken together, the computational strategies described and validated in this dissertation represent an effective methodological framework for elucidating the naturally-evolved TRP sequence universe, and in future work, could be used for assembling Nature's "Parts List," a multi-genomic catalog of TRP families with applications for diverse fields, including evolutionary and experimental biology, biomedicine, and biomimetics.
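As an illustration of the general seed-extension idea mentioned in the abstract, the following is a minimal sketch and not XSTREAM's actual algorithm. It assumes exact (non-degenerate) repeats only. The function name `find_tandem_repeats`, its parameters (`seed_len`, `min_copies`, `max_period`), and the example peptide are all hypothetical, chosen for illustration. A short seed that recurs at distance `period` is extended left and right for as long as the sequence keeps matching itself at that offset.

```python
def find_tandem_repeats(seq, seed_len=3, min_copies=2, max_period=50):
    """Return a list of (start, period, copies) for exact tandem repeats.

    Illustrative sketch only: real tools also handle mismatches and indels.
    """
    results = []
    n = len(seq)
    last_seen = {}      # seed string -> most recent start position
    covered_until = {}  # period -> end of the last reported repeat with that period

    for i in range(n - seed_len + 1):
        seed = seq[i:i + seed_len]
        j = last_seen.get(seed)
        last_seen[seed] = i
        if j is None:
            continue
        period = i - j
        if period > max_period:
            continue
        # Skip seeds that fall inside a repeat already reported for this period.
        if i < covered_until.get(period, 0):
            continue

        # Extend left while the sequence matches itself at offset `period`.
        start = j
        while start > 0 and seq[start - 1] == seq[start - 1 + period]:
            start -= 1
        # Extend right under the same self-match condition.
        end = i + seed_len
        while end < n and seq[end] == seq[end - period]:
            end += 1

        copies = (end - start) // period
        if copies >= min_copies:
            results.append((start, period, copies))
            covered_until[period] = end
    return results


# Hypothetical Pro-rich peptide with three copies of the unit "KPPVY".
example = "MKPPVYKPPVYKPPVYKG"
hits = find_tandem_repeats(example)
```

Each hit gives the repeat's start position, its period, and the number of complete copies; the repeat unit itself can be read off as `seq[start:start + period]`. Because only the most recent occurrence of each seed is tracked, the shortest period is tried first, which is one simple way to favor a "fundamental" pattern over its multiples.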
|Advisor:||Cooper, James B.|
|Committee:||Poole, Stephen; Singh, Ambuj; Waite, Herbert; Yan, Xifeng|
|School:||University of California, Santa Barbara|
|Department:||Biomolecular Science and Engineering|
|School Location:||United States -- California|
|Source:||DAI-B 72/01, Dissertation Abstracts International|
|Subjects:||Molecular biology, Bioinformatics, Computer science|
|Keywords:||Clustering, Gene expression, Microarrays, Plant cell walls, Tandem repeat proteins|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved