COMING SOON! PQDT Open is getting a new home!

ProQuest Open Access Dissertations & Theses will remain freely available as part of a new and enhanced search experience at

Questions? Please refer to this FAQ.

Dissertation/Thesis Abstract

Analysis of mass spectrometry data for protein identification in complex biological mixtures
by Spivak, Marina, Ph.D., New York University, 2010, 180; 3428047
Abstract (Summary)

Mass spectrometry is a powerful technique in analytical chemistry that was originally designed to determine the composition of small molecules in terms of their constituent elements. In the last several decades, it has begun to be used for much more complex tasks, including the detailed analysis of the amino acid sequence that makes up an unknown protein and even the identification of multiple proteins present in a complex mixture. The latter problem is largely unsolved and the principal subject of this dissertation.

The fundamental difficulty in the analysis of mass spectrometry data is that of ill-posedness. There are multiple solutions consistent with the experimental data and the data is subject to significant amounts of noise. In this work, we have developed application-specific machine learning algorithms that (partially) overcome this ill-posedness. We make use of labeled examples of a single class of peptide fragments and of the unlabeled fragments detected by the instrument. This places the approach within the broader framework of semi-supervised learning.

Recently, there has been considerable interest in classification problems of this type, where the learning algorithm only has access to labeled examples of a single class and unlabeled data. The motivation for such problems is that in many applications, examples of one of the two classes are easy and inexpensive to obtain, whereas the acquisition of examples of a second class is difficult and labor-intensive. For example, in document classification, positive examples are documents that address specific subject, while unlabeled documents are abundant. In movie rating, the positive data are the movies chosen by clients, while the unlabeled data are all remaining movies in a collection. In medical imaging, positive (labeled) data correspond to images of tissue affected by a disease, while the remaining available images of the same tissue comprise the unlabeled data. Protein identification using mass spectrometry is another variant of such a general problem.

In this work, we propose application-specific machine learning algorithms to address this problem. The reliable identification of proteins from mixtures using mass spectrometry would provide an important tool in both biomedical research and clinical practice.

Indexing (document details)
Advisor: Greengard, Leslie
Commitee: LeCun, Yann, Peskin, Charles, Shasha, Dennis, Weston, Jason
School: New York University
Department: Computer Science
School Location: United States -- New York
Source: DAI-B 72/01, Dissertation Abstracts International
Subjects: Bioinformatics
Keywords: Mass spectrometry, Protein identification, Ranking, Semi-supervised learning
Publication Number: 3428047
ISBN: 978-1-124-33287-1
Copyright © 2021 ProQuest LLC. All rights reserved. Terms and Conditions Privacy Policy Cookie Policy