Gene expression microarray datasets often consist of a limited number of samples relative to a large number of expression measurements, usually on the order of thousands of genes. These characteristics pose a challenge to any classification model as they might negatively impact its prediction accuracy. Therefore, dimensionality reduction is a core process prior to any classification task.
This dissertation introduces the iterative feature perturbation method (IFP), an embedded gene selector that iteratively discards non-relevant features. IFP treats a feature as relevant if perturbing it with noise changes the predictive accuracy of the classification model. Perturbing a non-relevant feature leaves the predictive accuracy unchanged.
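The perturbation idea described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not specify the noise distribution, the number of features discarded per iteration, or how accuracy is estimated. The sketch therefore assumes Gaussian noise, one feature dropped per round, and accuracy measured on the training samples. To stay dependency-free it also swaps the linear SVM for a nearest-centroid classifier. The names `fit_centroids`, `accuracy`, `ifp_select`, `n_repeats` and `noise_sd` are hypothetical.

```python
import random


def fit_centroids(X, y):
    """Stand-in classifier (assumption): per-class mean of every feature."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def accuracy(centroids, X, y, active):
    """Fraction of samples whose nearest centroid, measured over the
    active features only, carries the sample's true label."""
    hits = 0
    for x, t in zip(X, y):
        pred = min(
            centroids,
            key=lambda c: sum((x[j] - centroids[c][j]) ** 2 for j in active),
        )
        hits += pred == t
    return hits / len(y)


def ifp_select(X, y, n_keep, n_repeats=5, noise_sd=1.0, seed=0):
    """Iteratively drop the feature whose perturbation changes accuracy least."""
    rng = random.Random(seed)
    active = list(range(len(X[0])))
    # Centroids are computed per feature, so fitting once is enough here;
    # a real classifier would be retrained on the active features each round.
    model = fit_centroids(X, y)
    while len(active) > n_keep:
        baseline = accuracy(model, X, y, active)
        impact = {}
        for j in active:
            total = 0.0
            for _ in range(n_repeats):
                noisy = [row[:] for row in X]  # copy, then perturb column j
                for row in noisy:
                    row[j] += rng.gauss(0.0, noise_sd)
                total += abs(baseline - accuracy(model, noisy, y, active))
            impact[j] = total / n_repeats  # mean absolute accuracy change
        weakest = min(active, key=lambda j: impact[j])
        active.remove(weakest)
    return active
```

Calling `ifp_select(X, y, n_keep=50)` on a list of expression rows `X` and class labels `y` would return the indices of the 50 surviving features. Averaging over several noise draws (`n_repeats`) is a design choice of this sketch to make the impact estimate less dependent on a single random perturbation.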
We apply IFP to four cancer microarray datasets: colon cancer (cancer vs. normal), leukemia (subtype classification), Moffitt colon cancer (prognosis predictor) and lung cancer (prognosis predictor). We compare the results obtained by IFP with those of SVM-RFE and the t-test, using a linear support vector machine as the classifier in all cases. We do so both with the entire original set of features in each dataset and with a preselected set of 200 features (based on p-values) from each dataset. When using the entire set of features, IFP achieves accuracy comparable to (and at some points higher than) that of SVM-RFE on three of the four datasets. The simple t-test feature ranking typically produces the classifiers with the highest accuracy across the four datasets. When using 200 features chosen by the t-test, accuracy improves by up to 3% for both IFP and SVM-RFE across the four datasets. We corroborate these results with an AUC analysis and a statistical analysis using the Friedman/Holm test.
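The t-test preselection step can be sketched as follows. The abstract does not say which t-test variant was used, so this sketch assumes the pooled-variance (Student) two-sample statistic. With that variant every gene has the same degrees of freedom, so ranking by |t| gives the same order as ranking by p-value. The function names are hypothetical, and each class is assumed to have at least two samples.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic (assumed variant)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0  # constant feature: treated as uninformative in this sketch
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def top_k_by_t_test(X, y, k):
    """Return the indices of the k features with the largest |t|."""
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("this sketch handles two-class problems only")
    scores = []
    for j, column in enumerate(zip(*X)):
        a = [v for v, t in zip(column, y) if t == labels[0]]
        b = [v for v, t in zip(column, y) if t == labels[1]]
        scores.append((abs(t_statistic(a, b)), j))
    scores.sort(reverse=True)  # largest |t| first
    return [j for _, j in scores[:k]]
```

With `k=200` this would mirror the 200-feature preselection described above, after which a wrapper or embedded selector runs on the reduced matrix.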
As with the t-test, we also used information gain and ReliefF as filters and compared all three. Results of the AUC analysis show that IFP and SVM-RFE obtain the highest AUC value when applied to the t-test-filtered datasets. This result is additionally corroborated with statistical analysis.
The percentage of overlap between the gene sets selected by any two methods across the four datasets indicates that different sets of genes can and do result in similar accuracies.
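One way to compute such a pairwise overlap is sketched below. The abstract does not define the measure, so the sketch assumes the overlap is the number of shared genes divided by the size of the smaller set. The function names and gene identifiers are hypothetical.

```python
from itertools import combinations


def overlap_percent(genes_a, genes_b):
    """Shared genes as a percentage of the smaller set (assumed definition)."""
    a, b = set(genes_a), set(genes_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


def pairwise_overlap(selections):
    """Map each pair of method names to the overlap of their gene sets."""
    return {
        (m1, m2): overlap_percent(selections[m1], selections[m2])
        for m1, m2 in combinations(sorted(selections), 2)
    }
```

Passing a dictionary such as `{"IFP": [...], "SVM-RFE": [...], "t-test": [...]}` to `pairwise_overlap` yields one percentage per pair of methods.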
We created ensembles of classifiers using the bagging technique with IFP, SVM-RFE and the t-test, and showed that their performance is at least equivalent to that of the non-bagging cases, and better in some cases.
|Advisor:||Hall, Lawrence O.|
|School:||University of South Florida|
|School Location:||United States -- Florida|
|Source:||DAI-B 71/11, Dissertation Abstracts International|
|Subjects:||Bioinformatics, Artificial intelligence, Computer science|
|Keywords:||Data mining, Feature selection, Gene selection, Iterative feature perturbation, Microarray data, SVM-RFE|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved