In the era of massive data sets, it is difficult for domain scientists to interact directly with their own data. Because the analysis of single examples may yield insights in the research process, it is important to use automated methods to highlight potentially interesting phenomena when hand inspection is not possible. This dissertation examines a particular sub-case of this problem: the use of machine learning to direct an expert’s attention to potentially informative outliers. Outliers warrant study in both a positive and a negative sense. In the positive sense, outliers might be indicative of new scientific phenomena, whose study may pave the way to scientific discovery. In the negative sense, outliers might need to be eliminated or corrected so that work on the original research objectives can move forward.
This dissertation explores two novel manifestations of the anomaly detection problem that are motivated by domain scientists’ need to mine their data sets for single outlying examples. The first is the unsupervised detection of anomalies in large sets of unsynchronized time series data for the purpose of aiding scientific discovery. This work is applied to astrophysics time series data. The second is the detection of label noise in training data in order to improve the supervised learning process. This work is applied to problems in remote sensing, medical text mining, and volcanology.
This dissertation makes four contributions. First, we introduce a method called PCAD for the discovery of local and global outliers in large sets of unsynchronized time series data. Second, we perform a comprehensive review of methods for the detection of label noise in training data, and introduce a new method called PWEM. Third, we introduce an interactive framework, called ICCN, that cleans training sets of label noise with help from a domain expert. Finally, we introduce a semi-supervised learning method, called collaborative learning, that synthesizes ideas from our research on label noise detection and uses them to minimize label noise during training data generation.
|Advisor:||Brodley, Carla E.|
|Committee:||Baise, Laurie; Blumer, Anselm; Khardon, Roni; Wagstaff, Kiri|
|School Location:||United States -- Massachusetts|
|Source:||DAI-B 71/10, Dissertation Abstracts International|
|Keywords:||Data mining, Label noise, Machine learning, Outlier detection|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved