Dissertation/Thesis Abstract

Characterization of protein function using automated computational methods
by Wu, Shirley, Ph.D., Stanford University, 2009, 179; 3364515
Abstract (Summary)

The popularization of high-throughput biological techniques has produced a significant bottleneck between protein identification and functional annotation. To alleviate this problem, researchers often apply computational methods for protein function recognition; however, existing tools are less effective when the proteins are structurally novel. Structural genomics projects in particular are generating many novel protein structures with little associated functional knowledge, and so new function characterization methods that do not rely on strict sequence or structural similarity are needed. Thanks to improvements in sequencing technologies, we are also now discovering new proteins at a faster rate. These proteins may contain novel biological functions, but existing approaches are ill-equipped to discover them.

In this dissertation, I present several methods for protein function characterization that can be combined into pipelines both for supervised modeling of known functional sites and for unsupervised discovery of potentially novel functional sites. Each pipeline takes advantage of an existing framework called FEATURE, which models functional sites in protein structures. The first method, SeqFEATURE, uses sequence motifs to seed 3D models that are more robust to reductions in sequence identity than other sequence-based methods. The models are also more sensitive than other structure-based methods when tested on proteins with low structural similarity to known proteins. Using SeqFEATURE, I created and validated a large library of 3D functional site models and scanned all structures in the Protein Data Bank with each model, including structures with unknown function from structural genomics projects. The data and models are publicly available.

To identify and characterize potentially novel biological functions, I combine a number of clustering techniques with knowledge-informed approaches. FEATURE generates descriptive vectors of protein microenvironments, which I cluster using k-means to identify environments that recur across different protein structures. Each cluster represents a potential biological site of interest, but is likely to be noisy and therefore difficult to interpret. To select candidate clusters for analysis, I used hierarchical clustering in conjunction with a scoring function that takes into account the functional and internal coherence of sub-clusters. To annotate resulting candidate clusters, I developed a set of methods for ranking important terms found in the literature and in database records associated with the proteins comprising the cluster. I applied these methods to a novel data set of cysteine-based protein microenvironments, rediscovering known functional sites and sub-classes of functional sites in addition to making several novel predictions.
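The k-means step described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the toy three-component vectors stand in for FEATURE microenvironment vectors, the farthest-first initialization is an assumption chosen only to make the example deterministic, and `cluster_spread` is a hypothetical stand-in for an internal-coherence measure, not the scoring function used in the work. The hierarchical sub-clustering and term-ranking steps are not shown.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def farthest_first_init(vectors, k):
    """Pick k starting centroids: the first vector, then repeatedly the
    vector farthest from all centroids chosen so far (an assumption made
    here for determinism, not taken from the source)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, max_iterations=100):
    """Lloyd's k-means. Returns (centroids, labels)."""
    centroids = farthest_first_init(vectors, k)
    labels = [None] * len(vectors)
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean_vector(members)
    return centroids, labels


def cluster_spread(vectors, labels, centroids):
    """Mean distance to the centroid per cluster (lower means tighter).
    Hypothetical coherence proxy for illustration only."""
    spread = {}
    for c, centroid in enumerate(centroids):
        members = [v for v, lab in zip(vectors, labels) if lab == c]
        if members:
            total = sum(math.sqrt(squared_distance(v, centroid)) for v in members)
            spread[c] = total / len(members)
    return spread


# Toy stand-ins for microenvironment vectors: two well-separated groups.
toy_vectors = [
    [0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.0, 0.2],
    [5.0, 5.1, 4.9], [5.2, 4.9, 5.0], [4.9, 5.0, 5.1],
]
centroids, labels = kmeans(toy_vectors, k=2)
spread = cluster_spread(toy_vectors, labels, centroids)
print(labels)
print(spread)
```

On real data, the number of clusters and the distance measure would need to be chosen to suit the vectors, and a per-cluster tightness measure like the one above could only serve as a rough first filter before any knowledge-informed scoring.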

This dissertation extends existing frameworks to be relevant in the context of structural genomics. I demonstrate and validate an approach for rapid creation of robust functional site models that can be applied in a high-throughput setting, and define a pipeline by which novel biology can be discovered and characterized. The work presented contributes significantly to the characterization of protein function, both known and novel, using computational methods.

Indexing (document details)
Advisor: Altman, Russ B.
School: Stanford University
School Location: United States -- California
Source: DAI-B 70/07, Dissertation Abstracts International
Subjects: Bioinformatics, Computer science
Keywords: Clustering, Machine learning, Protein function, Protein function prediction
Publication Number: 3364515
ISBN: 978-1-109-24308-6
Copyright © 2020 ProQuest LLC. All rights reserved.