Efficient handling of large volumes of data is a necessity for future extreme-scale scientific applications and database systems. To address the growing imbalance between the rate of data production on such systems and the storage capacity and throughput of their I/O subsystems, reducing the handled data volume through compression is a reasonable approach. However, many scientific datasets compress poorly (so-called hard-to-compress datasets) because much of the information they represent is highly entropic. Lossless compression of such datasets typically yields no more than a 20% reduction in size when exact reproduction of the original data is required. Moreover, applying compression to hard-to-compress scientific datasets can hinder end-to-end throughput because of the overhead of data analysis, compression, and reorganization. When this overhead exceeds the end-to-end performance gain obtained from data reduction, the compressor provides no practical benefit to scientific systems.
A difficult problem in improving the reduction efficiency and throughput of lossless compression for scientific data is identifying which information is hard to compress and optimizing the compression techniques accordingly. To address this challenge, we introduce the In Situ Orthogonal Byte Aggregate Reduction Compression (ISOBAR-compress) methodology, a preconditioner for lossless compression that identifies hard-to-compress datasets and optimizes their compression efficiency and throughput. Out of 24 scientific datasets from both the public domain and peta-scale simulations, ISOBAR-compress accurately identified the 19 that were hard to compress. Additionally, ISOBAR-compress improved data reduction by an average of 19% and increased compression and decompression throughput by average speedups of 24.1x and 33.6x, respectively.
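The core idea of such a preconditioner can be illustrated with a simplified sketch. The snippet below (a hypothetical illustration, not the actual ISOBAR-compress algorithm) views a fixed-width numeric array as byte columns ("planes"), estimates each plane's Shannon entropy, and partitions the planes into compressible (low-entropy) and near-random groups, so that only the former are handed to a lossless compressor; the entropy `threshold` of 7.0 bits/byte is an assumed tuning parameter.

```python
import numpy as np

def byte_plane_entropy(data: np.ndarray) -> list:
    """Shannon entropy (bits/byte) of each byte position ("plane")
    of a fixed-width numeric array. Low entropy indicates a plane
    that a lossless compressor can shrink substantially."""
    planes = data.view(np.uint8).reshape(-1, data.itemsize)
    entropies = []
    for i in range(data.itemsize):
        counts = np.bincount(planes[:, i], minlength=256)
        p = counts[counts > 0] / counts.sum()
        entropies.append(float(-(p * np.log2(p)).sum()))
    return entropies

def partition_planes(data: np.ndarray, threshold: float = 7.0):
    """Split byte planes into compressible (entropy below threshold)
    and near-random groups; only the former are worth compressing."""
    ent = byte_plane_entropy(data)
    compressible = [i for i, e in enumerate(ent) if e < threshold]
    skipped = [i for i, e in enumerate(ent) if e >= threshold]
    return compressible, skipped
```

For example, float64 samples of a smooth field share nearly identical sign/exponent bytes (very low entropy) while their low-order mantissa bytes are close to random; a byte-plane analysis separates the two cases cheaply before any compressor runs.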
Dataset preconditioning for lossless compression is also a promising approach for reducing disk and network I/O activity, addressing the limited I/O bandwidth of current analytic frameworks. Hence, we further introduce a hybrid compression-I/O methodology that interleaves I/O activity with data compression to improve end-to-end throughput in addition to reducing dataset size. We evaluate several interleaving strategies, present theoretical models, and assess the efficiency and scalability of the approach through comparative analysis. Applied to the 19 hard-to-compress scientific datasets, the hybrid method demonstrates a 12% to 46% increase in end-to-end throughput. At the reported peak bandwidth of 60 GB/s of uncompressed data for a current, leadership-class parallel I/O system, this translates into an effective gain of 7 to 28 GB/s in aggregate throughput.
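One simple interleaving strategy is to overlap the compression of one chunk with the write of the previous one. The sketch below (a minimal hypothetical illustration, not the dissertation's implementation) uses a bounded queue and a writer thread so that CPU-bound compression of chunk i proceeds while chunk i-1 is being written, hiding part of the I/O latency; the `write` callback stands in for whatever file or network sink the application uses.

```python
import queue
import threading
import zlib

def write_compressed(chunks, write, level=6):
    """Compress chunks and write them through `write`, overlapping
    compression with I/O via a single background writer thread."""
    q = queue.Queue(maxsize=2)  # small buffer keeps memory bounded

    def writer():
        while True:
            buf = q.get()
            if buf is None:   # sentinel: no more chunks
                break
            write(buf)        # I/O overlaps the next compress() call

    t = threading.Thread(target=writer)
    t.start()
    for chunk in chunks:
        q.put(zlib.compress(chunk, level))  # CPU work while writer drains
    q.put(None)
    t.join()
```

Because the queue is FIFO and there is a single writer, chunk order is preserved on the output side, which matters when the compressed stream must be decompressed sequentially.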
Lastly, it is important that scientific applications streamline their end-to-end throughput beyond preconditioning datasets for compression alone. The concept of a preconditioner generalizes to other techniques that optimize performance through data analysis and reorganization. For example, present-day scientific simulations seek to optimize in situ processing performance by inspecting the layout of a generated dataset and then restructuring its content. Typically, these simulations interleave dataset variables in memory during the calculation phase to improve computational performance, but deinterleave the data for subsequent storage and analysis. An efficient preconditioner for data deinterleaving is therefore critical, since common deinterleaving methods deliver poor throughput and energy performance. To address this problem, we present a deinterleaving method that is high performance, energy efficient, and generic to any data type. When evaluated against conventional deinterleaving methods on 105 STREAM standard micro-benchmarks, our method always improved throughput and throughput/watt. In the best case, it improved throughput by up to 26.2x and throughput/watt by up to 7.8x.
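Deinterleaving is the conversion of an array-of-structures layout into a structure-of-arrays layout. The sketch below (an illustrative stand-in, not the dissertation's optimized method) shows the transformation itself: strided slicing extracts each variable's elements from the interleaved buffer, and one contiguous copy per variable produces the deinterleaved outputs.

```python
import numpy as np

def deinterleave(buf: np.ndarray, nvars: int) -> list:
    """Deinterleave [x0, y0, z0, x1, y1, z1, ...] into per-variable
    contiguous arrays. buf[i::nvars] is a zero-copy strided view;
    ascontiguousarray performs one linear copy per variable."""
    assert buf.size % nvars == 0, "buffer must hold whole records"
    return [np.ascontiguousarray(buf[i::nvars]) for i in range(nvars)]
```

The performance cost of the naive approach lies in these strided reads, which touch memory with a stride of `nvars` elements and use caches and memory bandwidth poorly; this is the inefficiency the dissertation's method targets.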
|Advisor:||Samatova, Nagiza F.|
|School:||North Carolina State University|
|School Location:||United States -- North Carolina|
|Source:||DAI-B 76/05(E), Dissertation Abstracts International|
|Keywords:||Data reduction, Dataset preconditioning, High-performance computing, Lossless compression|