There has been a notable rise in the availability of non-experimental, observational data sets, including government survey data and “big data” from continuous monitoring. To date, most of these data have spurred correlational studies, but there is interest in being able to draw causal inferences from them. The goal of this study is to propose and evaluate a method for optimally constructing synthetic treatment and control samples for the purpose of drawing causal inferences. A method of balancing data sets to remove bias, using machine learning as a two-sample test, is proposed and validated. The study builds on the balance optimization subset selection (BOSS) problem, a new area of study in operations research. This problem formulation minimizes aggregate imbalance in covariate distributions to reduce bias in the data. The cross-validated area under the receiver operating characteristic curve (AUC) is proposed as a measure of balance between treatment and control groups: an AUC near 0.50 means a classifier cannot tell the two groups apart from their covariates. The proposed approach provides direct and automatic balancing of covariate distributions. In addition, the AUC-based approach is able to detect subtler distributional differences than existing measures, such as simple empirical mean/variance and count-based metrics, so optimizing AUC achieves greater balance. Using five widely used real data sets and seven synthetic data sets, it is shown that optimizing samples with existing methods (chi-square, mean and variance differences, Kolmogorov-Smirnov, and Mahalanobis) yields samples that still contain imbalance detectable by machine learning algorithms. Minimizing covariate imbalance by minimizing the absolute distance of the maximum cross-validated AUCs from 0.50, using evolutionary optimization on M folds, is found to be effective. Particle swarm optimization (PSO) outperforms modified cuckoo swarm (MCS) on this proposed gradient-free, non-linear, noisy cost function.
To compute AUCs, supervised binary classification approaches from the machine learning and credit scoring literature are used.
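As a rough illustration of using a cross-validated AUC as a balance measure, the following Python sketch scores how well a classifier separates a treatment sample from a control sample and reports the distance of the AUC from 0.50. It is not the dissertation's implementation: the difference-of-means classifier, the averaging of fold AUCs, the synthetic Gaussian data, and all function names (`auc`, `cv_auc_imbalance`, `draw`) are assumptions made for this example. A difference-of-means score can only detect mean shifts, so a more flexible classifier would be needed to pick up the subtler distributional differences the abstract describes.

```python
import random


def auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length covariate vectors."""
    dim = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]


def cv_auc_imbalance(treated, control, folds=5, seed=0):
    """Return |mean cross-validated AUC - 0.5| for treated vs. control.

    Assumption: a difference-of-means linear score stands in for the
    supervised binary classifiers mentioned in the abstract.
    """
    rng = random.Random(seed)
    t, c = treated[:], control[:]
    rng.shuffle(t)
    rng.shuffle(c)
    fold_aucs = []
    for k in range(folds):
        # Hold out every folds-th row of each group; train on the rest.
        t_test, c_test = t[k::folds], c[k::folds]
        t_train = [r for i, r in enumerate(t) if i % folds != k]
        c_train = [r for i, r in enumerate(c) if i % folds != k]
        # "Training": direction from the control mean to the treated mean.
        w = [a - b for a, b in zip(mean_vector(t_train), mean_vector(c_train))]
        t_scores = [sum(wi * xi for wi, xi in zip(w, r)) for r in t_test]
        c_scores = [sum(wi * xi for wi, xi in zip(w, r)) for r in c_test]
        fold_aucs.append(auc(t_scores, c_scores))
    return abs(sum(fold_aucs) / folds - 0.5)


def draw(n, shift, rng):
    """Synthetic two-covariate sample; the first covariate is shifted by `shift`."""
    return [[rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]


rng = random.Random(42)
control = draw(300, 0.0, rng)
balanced = draw(300, 0.0, rng)   # same distribution as control
shifted = draw(300, 1.5, rng)    # mean-shifted first covariate

imb_balanced = cv_auc_imbalance(balanced, control)
imb_shifted = cv_auc_imbalance(shifted, control)
print(f"imbalance, balanced sample: {imb_balanced:.3f}")
print(f"imbalance, shifted sample:  {imb_shifted:.3f}")
```

In a subset selection setting, a quantity like the one returned by `cv_auc_imbalance` would serve as the cost to be minimized over candidate control subsets. Because it depends on random fold assignment and has no usable gradient, it fits the gradient-free, noisy cost function that the abstract optimizes with evolutionary methods.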
Advisor: Etemadi, Amir H.; Malalla, Ebrahim
School: The George Washington University
School Location: United States -- District of Columbia
Source: DAI-B 79/12(E), Dissertation Abstracts International
Subjects: Applied Mathematics, Operations research, Artificial intelligence
Keywords: Machine learning, Optimization, Particle swarm optimization
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved