In this thesis we explore a problem facing contemporary data management in the earth and environmental sciences: effective production of uniform and quality data products which keeps pace with the volume and velocity of continuous data collection. The process of creating a quality data product is non-trivial and this thesis explores in detail what knowledge is required to automate this process for emerging mid-scale efforts and what prior attempts have been made towards this goal.
Furthermore, we propose a model which can be used to develop a mid-scale data product pipeline in the earth and environmental sciences: Keystone. Specifically, by automating Quality Assurance, Quality Control and Data Repair processes, data products can be created at a rate that keeps pace with the production of data itself.
To demonstrate the effectiveness of this model, three software applications that fulfill the key roles suggested by the Keystone model were conceived, implemented and validated individually. These three applications are the NRDC Quality Assurance Application, the Near Real Time Autonomous Quality Control Application (NRAQC), and the Improved Robust and Sparse Fuzzy K Means (iRSFKM) imputation algorithm. Respectively, they provide metadata management and binding through a multi-platform mobile application; automated data quality control with the help of a dynamic web application; and rapid data imputation for data repair. The last of these leverages multi-GPU processing to add significant speed to a high-accuracy algorithm.
The NRDC Quality Assurance application was validated with the aid of a directed user survey disseminated among environmental scientists who are members of the Earth Science Information Partners organization. An analysis of the survey responses indicated that the NRDC Quality Assurance application addresses many significant gaps in metadata binding and creation: many respondents still record metadata with pen and paper and take anywhere from hours to weeks to digitize it.
The efficacy of the NRAQC application was demonstrated through a case study in which nearly a million data points were batch tested according to various user-configured metrics. The NRAQC system consistently flagged the same data points on the same streams over the course of five iterations, with an average testing time of 134.24 seconds per iteration. Specifically, in each iteration, the NRAQC system identified 1946 repeat values, 365 missing values, and 131 out-of-range values.
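The three kinds of flags named above (repeat, missing, and out-of-range values) can be illustrated with a minimal sketch. This is not the NRAQC implementation: the function name `flag_stream`, the flag labels, the range bounds, and the `max_repeats` threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List, Optional, Set


def flag_stream(values: List[Optional[float]], low: float, high: float,
                max_repeats: int = 3) -> List[Set[str]]:
    """Return one set of flags per data point (hypothetical flag names)."""
    flags: List[Set[str]] = []
    run_length = 0      # length of the current run of identical readings
    previous = None     # last non-missing reading seen
    for value in values:
        point: Set[str] = set()
        if value is None or math.isnan(value):
            # A gap in the stream: flag it and reset the repeat counter.
            point.add("missing")
            run_length = 0
            previous = None
        else:
            if value < low or value > high:
                point.add("out_of_range")
            run_length = run_length + 1 if value == previous else 1
            previous = value
            # Flag a reading once it extends a run to max_repeats or longer.
            if run_length >= max_repeats:
                point.add("repeat")
        flags.append(point)
    return flags


def count_flags(flags: List[Set[str]]) -> Counter:
    """Tally how many data points carry each flag."""
    return Counter(label for point in flags for label in point)


if __name__ == "__main__":
    readings = [1.0, 1.0, 1.0, None, 50.0, 2.0]   # made-up sample stream
    print(count_flags(flag_stream(readings, low=0.0, high=10.0)))
```

Because each check depends only on the stream and its configured thresholds, rerunning it over the same data yields the same flags, which is the kind of consistency the case study reports across its five iterations.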
Finally, we demonstrated the effectiveness of our iRSFKM algorithm and implementation through multiple experiments clustering real environmental sensor data. These experiments showed that our improved multi-GPU implementation is significantly faster than sequential implementations, with a 185-fold speedup on eight GPUs. Experiments also indicated a greater than 300-fold increase in throughput with eight GPUs and 95% efficiency with two GPUs compared to one.
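For readers unfamiliar with the terms, speedup and parallel efficiency are conventionally defined as the ratio of baseline to parallel run time, and that ratio divided by the number of workers. The sketch below applies those standard definitions; the timings in it are placeholders chosen so that the arithmetic works out to 95%, not measurements from the thesis, and the function names are illustrative.

```python
def speedup(baseline_seconds: float, parallel_seconds: float) -> float:
    """Standard definition: baseline run time divided by parallel run time."""
    if baseline_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("run times must be positive")
    return baseline_seconds / parallel_seconds


def parallel_efficiency(baseline_seconds: float, parallel_seconds: float,
                        workers: int) -> float:
    """Standard definition: speedup divided by the number of workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(baseline_seconds, parallel_seconds) / workers


if __name__ == "__main__":
    # Placeholder timings (NOT from the thesis): a job that takes 100 s on
    # one GPU and runs 1.9 times faster on two GPUs.
    one_gpu_seconds = 100.0
    two_gpu_seconds = one_gpu_seconds / 1.9
    print(f"{parallel_efficiency(one_gpu_seconds, two_gpu_seconds, 2):.2f}")
```

Under these definitions, 95% efficiency on two GPUs relative to one corresponds to a speedup of about 1.9 out of an ideal 2.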
The overall model itself was validated through a discussion of how effectively these software solutions worked in tandem to produce a final data product in a primarily automated fashion.
|Advisor:||Dascalu, Sergiu M., Harris, Frederick C., Jr.|
|School:||University of Nevada, Reno|
|School Location:||United States -- Nevada|
|Source:||MAI 81/2(E), Masters Abstracts International|
|Subjects:||Computer science, Information science, Environmental science|
|Keywords:||Data imputation, Data management, GPU, Quality control, Quality assurance, Software engineering|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved