The popularization of high-throughput biological techniques has produced a significant bottleneck between protein identification and functional annotation. To alleviate this problem, researchers often apply computational methods for protein function recognition; however, existing tools are less effective when the proteins are structurally novel. Structural genomics projects in particular are generating many novel protein structures with little associated functional knowledge, so new function characterization methods that do not rely on strict sequence or structural similarity are needed. Thanks to improvements in sequencing technologies, new proteins are also being discovered at a faster rate. These proteins may have novel biological functions, but existing approaches are ill-equipped to discover them.
In this dissertation, I present several methods for protein function characterization that can be combined into pipelines both for supervised modeling of known functional sites and for unsupervised discovery of potentially novel functional sites. Each pipeline takes advantage of an existing framework called FEATURE, which models functional sites in protein structures. The first method, SeqFEATURE, uses sequence motifs to seed 3D models that are more robust to reductions in sequence identity than other sequence-based methods. The models are also more sensitive than other structure-based methods when tested on proteins with low structural similarity to known proteins. Using SeqFEATURE, I created and validated a large library of 3D functional site models and scanned all structures in the Protein Data Bank with each model, including structures of unknown function from structural genomics projects. The data and models are publicly available.
To identify and characterize potentially novel biological functions, I combine a number of clustering techniques with knowledge-informed approaches. FEATURE generates descriptive vectors of protein microenvironments, which I cluster using k-means to identify environments that recur across different protein structures. Each cluster represents a potential biological site of interest, but is likely to be noisy and therefore difficult to interpret. To select candidate clusters for analysis, I used hierarchical clustering in conjunction with a scoring function that takes into account the functional and internal coherence of sub-clusters. To annotate the resulting candidate clusters, I developed a set of methods for ranking important terms found in the literature and in database records associated with the proteins that make up each cluster. I applied these methods to a novel data set of cysteine-based protein microenvironments, rediscovering known functional sites and sub-classes of functional sites in addition to making several novel predictions.
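The clustering step described above can be illustrated with a minimal sketch. It is not the dissertation's implementation and does not use the FEATURE software: the input vectors are synthetic stand-ins for microenvironment vectors, all function names are hypothetical, and the farthest-first seeding is a choice made here only to keep the run deterministic. The sketch covers plain k-means plus a simple internal-coherence measure (mean distance of members to their centroid); the hierarchical sub-clustering and the functional-coherence part of the scoring function are not reproduced, since the abstract does not specify them.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def farthest_first_centroids(vectors, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    first vector, then repeatedly add the vector farthest from every
    centroid chosen so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, max_iterations=100):
    """Lloyd's k-means; returns (centroids, labels)."""
    centroids = farthest_first_centroids(vectors, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return centroids, labels


def mean_spread(vectors, labels, centroids):
    """Mean member-to-centroid distance per cluster (lower is tighter)."""
    spread = {}
    for c, centroid in enumerate(centroids):
        members = [v for v, lab in zip(vectors, labels) if lab == c]
        if members:
            total = sum(math.sqrt(squared_distance(v, centroid)) for v in members)
            spread[c] = total / len(members)
    return spread


# Synthetic data: three well-separated groups of 20 four-dimensional vectors.
# These are made-up numbers, not real protein microenvironment descriptors.
rng = random.Random(1)
group_centers = [[0, 0, 0, 0], [5, 5, 5, 5], [10, 0, 10, 0]]
vectors = [
    [value + rng.gauss(0, 0.2) for value in center]
    for center in group_centers
    for _ in range(20)
]

centroids, labels = kmeans(vectors, k=3)
spread = mean_spread(vectors, labels, centroids)

# Rank clusters from tightest to loosest as candidates for inspection.
ranked = sorted(spread, key=spread.get)
for cluster_id in ranked:
    size = labels.count(cluster_id)
    print(f"cluster {cluster_id}: {size} members, mean spread {spread[cluster_id]:.3f}")
```

In a real setting the tightness ranking would be only one ingredient: the abstract describes a score that also weighs functional coherence, which would require annotations for the proteins in each cluster.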
This dissertation extends existing frameworks to be relevant in the context of structural genomics. I demonstrate and validate an approach for rapid creation of robust functional site models that can be applied in a high-throughput setting, and define a pipeline by which novel biology can be discovered and characterized. Together, these methods contribute to the computational characterization of protein function, both known and novel.
Advisor: Altman, Russ B.
School Location: United States -- California
Source: DAI-B 70/07, Dissertation Abstracts International
Subjects: Bioinformatics, Computer science
Keywords: Clustering, Machine learning, Protein function, Protein function prediction
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved