With the recent proliferation of large, unlabeled data sets, a particular subclass of semisupervised learning problems has become more prevalent. Known as positive-unlabeled learning (PU learning), this scenario provides only positive labeled examples, usually just a small fraction of the entire dataset, with the remaining examples unknown and thus potentially belonging to either the positive or negative class. Since the vast majority of traditional machine learning classifiers require both positive and negative examples in the training set, a new class of algorithms has been developed to deal with PU learning problems.
A canonical example of this scenario is topic labeling of a large corpus of documents. Once the size of a corpus reaches into the thousands, it becomes largely infeasible to have a curator read even a sizable fraction of the documents, and annotate them with topics. In addition, the entire set of topics may not be known, or may change over time, making it impossible for a curator to annotate which documents are NOT about certain topics. Thus a machine learning algorithm needs to be able to learn from a small set of positive examples, without knowledge of the negative class, and knowing that the unlabeled training examples may contain an arbitrary number of additional but as yet unknown positive examples.
Another example of a PU learning scenario recently garnering attention is the protein function prediction problem (PFP problem). While the number of organisms with fully sequenced genomes continues to grow, the progress of annotating those sequences with the biological functions that they perform lags far behind. Machine learning methods have already been successfully applied to this problem, but with many organisms having a small number of positive annotated training examples, and the lack of availability of almost any labeled negative examples, PU learning algorithms have the potential to make large gains in predictive performance.
The first part of this dissertation motivates the protein function prediction problem, explores previous work, and introduces novel methods that improve upon previously reported benchmarks for a particular type of learning algorithm, known as Gaussian Random Field Label Propagation (GRFLP). In addition, we present improvements to the computational efficiency of the GRFLP algorithm, and a modification to the traditional structure of the PFP learning problem that allows for simultaneous prediction across multiple species.
The second part of the dissertation focuses specifically on the positive-unlabeled aspects of the PFP problem. Two novel algorithms are presented, and rigorously compared to existing PU learning techniques in the context of protein function prediction. Additionally, we take a step back and examine some of the theoretical considerations of the PU scenario in general, and provide an additional novel algorithm applicable in any PU context. This algorithm is tailored for situations in which the labeled positive examples are a small fraction of the set of true positive examples, and where the labeling process may be subject to some type of bias rather than being a random selection of true positives (arguably some of the most difficult PU learning scenarios).
The third and fourth sections return to the PFP problem, examining the power of tertiary structure as a predictor of protein function, as well as presenting two case studies of function prediction performance on novel benchmarks. Lastly, we conclude with several promising avenues of future research into both PU learning in general, and the protein function prediction problem specifically.
|Advisor:||Shasha, Dennis E., Bonneau, Richard A.|
|Commitee:||Bonneau, Richard A., Geiger, Davi, Gunsalus, Kris C., Kedem, Zvi M., Shasha, Dennis E.|
|School:||New York University|
|School Location:||United States -- New York|
|Source:||DAI-B 76/04(E), Dissertation Abstracts International|
|Subjects:||Bioinformatics, Computer science|
|Keywords:||Computational biology, Machine learning, Positive-unlabeled learning, Protein function prediction|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be