Mass spectrometry is a powerful technique in analytical chemistry that was originally designed to determine the composition of small molecules in terms of their constituent elements. In the last several decades, it has begun to be used for much more complex tasks, including the detailed analysis of the amino acid sequence that makes up an unknown protein and even the identification of multiple proteins present in a complex mixture. The latter problem remains largely unsolved and is the principal subject of this dissertation.
The fundamental difficulty in the analysis of mass spectrometry data is ill-posedness: multiple solutions are consistent with the experimental data, and the data are subject to significant amounts of noise. In this work, we have developed application-specific machine learning algorithms that (partially) overcome this ill-posedness. We make use of labeled examples of a single class of peptide fragments and of the unlabeled fragments detected by the instrument. This places the approach within the broader framework of semi-supervised learning.
Recently, there has been considerable interest in classification problems of this type, where the learning algorithm has access only to labeled examples of a single class and to unlabeled data. The motivation for such problems is that in many applications, examples of one of the two classes are easy and inexpensive to obtain, whereas the acquisition of examples of the second class is difficult and labor-intensive. For example, in document classification, positive examples are documents that address a specific subject, while unlabeled documents are abundant. In movie rating, the positive data are the movies chosen by clients, while the unlabeled data are all remaining movies in a collection. In medical imaging, positive (labeled) data correspond to images of tissue affected by a disease, while the remaining available images of the same tissue comprise the unlabeled data. Protein identification using mass spectrometry is another variant of this general problem.
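The single-class setting described above can be illustrated with a minimal sketch. It is not the dissertation's algorithm: it assumes a common baseline in which a classifier is trained to separate the labeled positives from the unlabeled pool, and its scores are then used to rank the unlabeled examples. The one-dimensional Gaussian data, the helper names (`sigmoid`, `train_logistic`), and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(xs, ss, lr=0.1, epochs=200):
    """Fit p(s=1 | x) = sigmoid(w*x + b) by batch gradient descent.

    s = 1 marks a labeled positive, s = 0 marks an unlabeled example.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, s in zip(xs, ss):
            err = sigmoid(w * x + b) - s
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


rng = random.Random(0)

# Synthetic stand-in data (an assumption; not mass spectrometry data).
labeled_pos = [rng.gauss(2.0, 1.0) for _ in range(50)]
hidden_pos = [rng.gauss(2.0, 1.0) for _ in range(100)]   # true positives, label unseen
hidden_neg = [rng.gauss(-2.0, 1.0) for _ in range(100)]  # true negatives, label unseen
unlabeled = hidden_pos + hidden_neg

# Train "labeled vs. unlabeled"; the learner never sees the hidden labels.
xs = labeled_pos + unlabeled
ss = [1.0] * len(labeled_pos) + [0.0] * len(unlabeled)
w, b = train_logistic(xs, ss)

# Score the unlabeled pool; higher scores suggest "more like the positives".
scores = [sigmoid(w * x + b) for x in unlabeled]
mean_pos = sum(scores[:100]) / 100
mean_neg = sum(scores[100:]) / 100
print(f"mean score, hidden positives: {mean_pos:.3f}")
print(f"mean score, hidden negatives: {mean_neg:.3f}")
```

In this toy setting the scores are useful mainly as a ranking of the unlabeled examples rather than as calibrated class probabilities, since the unlabeled pool itself contains positives.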
In this work, we propose application-specific machine learning algorithms to address this problem. The reliable identification of proteins from mixtures using mass spectrometry would provide an important tool in both biomedical research and clinical practice.
|Committee:||LeCun, Yann, Peskin, Charles, Shasha, Dennis, Weston, Jason|
|School:||New York University|
|School Location:||United States -- New York|
|Source:||DAI-B 72/01, Dissertation Abstracts International|
|Keywords:||Mass spectrometry, Protein identification, Ranking, Semi-supervised learning|