The purpose of this investigation is to study and pursue a user-defined approach in preserving data privacy while maintaining an acceptable level of data utility using machine learning classification techniques as a gauge in the generation of synthetic data sets. This dissertation will deal with data privacy, data utility, machine learning classification, and the generation of synthetic data sets. Hence, data privacy and utility preservation using machine learning classification as a gauge is the central focus of this study. Many organizations that transact in large amounts of data have to comply with state, federal, and international laws to guarantee that the privacy of individuals and other sensitive data is not compromised. Yet at some point during the data privacy process, data loses its utility - a measure of how useful a privatized dataset is to the user of that dataset. Data privacy researchers have documented that attaining an optimal balance between data privacy and utility is an NP-hard challenge, thus an intractable problem. Therefore we propose the classification error gauge (x-CEG) approach, a data utility quantification concept that employs machine learning classification techniques to gauge data utility based on the classification error. In the initial phase of this proposed approach, a data privacy algorithm such as differential privacy, Gaussian noise addition, generalization, and or k-anonymity is applied on a dataset for confidentiality, generating a privatized synthetic data set. The privatized synthetic data set is then passed through a machine learning classifier, after which the classification error is measured. If the classification error is lower or equal to a set threshold, then better utility might be achieved, otherwise, adjustment to the data privacy parameters is made and then the refined synthetic data set is sent to the machine learning classifier; the process repeats until the error threshold is reached. Additionally, this study presents the Comparative x-CEG concept, in which a privatized synthetic data set is passed through a series of classifiers, each of which returns a classification error, and the classifier with the lowest classification error is chosen after parameter adjustments, an indication of better data utility. Preliminary results from this investigation show that fine-tuning parameters in data privacy procedures, for example in the case of differential privacy, and increasing weak learners in the ensemble classifier for instance, might lead to lower classification error, thus better utility. Furthermore, this study explores the application of this approach by employing signal processing techniques in the generation of privatized synthetic data sets and improving data utility. This dissertation presents theoretical and empirical work examining various data privacy and utility methodologies using machine learning classification as a gauge. Similarly this study presents a resourceful approach in the generation of privatized synthetic data sets, and an innovative conceptual framework for the data privacy engineering process.
|Commitee:||El-Sayed, Hoda, Ji, Soo-Yeon, Joseph, Anthony, Josyula, Darsana|
|School:||Bowie State University|
|School Location:||United States -- Maryland|
|Source:||DAI-B 75/08(E), Dissertation Abstracts International|
|Subjects:||Information Technology, Information science, Computer science|
|Keywords:||Data analytics and modeling, Data privacy engineering, Differential privacy, Information assurance, Machine learning, Statistical disclosure control|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be