A computational evolution system (CES) is a knowledge-discovery engine that constructs and evolves classifiers built from a small number of features, both to identify subtle, synergistic relationships among features and to discriminate between groups in high-dimensional data. Previous CESs were designed to analyze only binary (two-class) datasets. In this work, the CES method has been expanded to accommodate multi-class data.
The multi-class CES was compared to three common classification and feature selection methods: random forest, random k-nearest neighbor, and support vector machines. The four classifiers were evaluated on three real RNA sequencing datasets. Performance was evaluated via cross validation to assess classification accuracy, number of features selected, stability of the selected feature sets, and run-time.
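One of the evaluation criteria above, the stability of the selected feature sets, can be illustrated with a short sketch. The abstract does not say which stability measure was used, so the mean pairwise Jaccard index below is an assumption chosen for illustration, and the function names and gene labels are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature sets: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty selections are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def selection_stability(feature_sets):
    """Mean pairwise Jaccard index over the feature sets chosen in each fold.

    Assumption: this is one common stability measure, not necessarily
    the one used in the dissertation.
    """
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical features selected in three cross-validation folds.
folds = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB", "geneD"},
    {"geneA", "geneB", "geneC"},
]
stability = selection_stability(folds)
print(f"stability = {stability:.3f}")
```

A value near 1 means the same features are chosen in nearly every fold, while a value near 0 means the selections barely overlap.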
The three common classification and feature selection methods were originally designed for microarray data, which are fundamentally different from RNA-Seq data. To prepare the RNA-Seq count data for classification, the data were normalized and then transformed with a variance-stabilizing transformation to remove the variance-mean relationship that is commonly observed in RNA-Seq count data.
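The preprocessing idea can be sketched in a few lines. The abstract does not name the normalization or the transformation that was used, so this example makes two simplifying assumptions: samples are scaled to a common library size, and the Anscombe transform, which stabilizes variance for Poisson-distributed counts, stands in for the variance-stabilizing step. Real RNA-Seq counts are usually overdispersed relative to the Poisson model, so this is an illustration of the principle rather than a recommended pipeline. All names and counts are hypothetical.

```python
import math


def scale_to_common_depth(counts):
    """Scale each sample so that its total count equals the mean library size.

    counts: dict mapping a sample name to a list of raw gene counts.
    Assumption: simple library-size scaling, used here only for illustration.
    """
    totals = {sample: sum(values) for sample, values in counts.items()}
    if any(total == 0 for total in totals.values()):
        raise ValueError("every sample needs at least one read")
    target = sum(totals.values()) / len(totals)
    return {
        sample: [c * target / totals[sample] for c in values]
        for sample, values in counts.items()
    }


def anscombe(x):
    """Anscombe transform: 2 * sqrt(x + 3/8).

    For Poisson counts the transformed values have variance close to 1,
    which removes the dependence of the variance on the mean.
    """
    return 2.0 * math.sqrt(x + 3.0 / 8.0)


# Hypothetical raw counts for three genes in two samples.
raw = {"sample1": [10, 0, 90], "sample2": [40, 10, 150]}
scaled = scale_to_common_depth(raw)
transformed = {s: [anscombe(c) for c in values] for s, values in scaled.items()}
```

After scaling, both samples have the same total count, so differences in sequencing depth no longer dominate the comparison. The transform then compresses large counts more strongly than small ones, which is what weakens the variance-mean relationship.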
Compared to the three competing methods, the multi-class CES selected far fewer features. The identified features are potential biomarkers that may be more relevant than the longer lists of features identified by the competing methods. The CES performed best on the dataset with the smallest sample size, indicating that it has a distinct advantage in small-sample settings, where most classification algorithms lose accuracy.
The CES identified numerous potentially important biomarkers in each of the three real datasets that are validated by previous research and worthy of additional investigation. The CES was especially helpful at identifying important features in the rat blood RNA-Seq dataset. Subsequent ontological analysis of these selected features revealed protein folding as an important process in that dataset. The other contribution of this research is the extension of CES to biomarker discovery in multi-class settings. New software algorithms based on CES have already been developed, and the multi-class modifications presented here are directly applicable to, and would also benefit, the newer software.
|Advisor:||Bowyer, John F.|
|Committee:||George, Nysia I., Jennings, Steven F., Kane, Cynthia JM, Moore, Jason H., Patterson, Tucker A.|
|School:||University of Arkansas at Little Rock|
|School Location:||United States -- Arkansas|
|Source:||DAI-B 79/01(E), Dissertation Abstracts International|
|Subjects:||Biology, Biostatistics, Bioinformatics|
|Keywords:||Artificial intelligence, Evolutionary algorithm, Machine learning, Multi-class, RNA sequencing, Toxicology|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved