Dissertation/Thesis Abstract

Preconditioner-based In Situ Data Reduction for End-to-End Throughput Optimization
by Schendel, Eric Richard, Ph.D., North Carolina State University, 2014, 99; 3647579
Abstract (Summary)

Efficient handling of large volumes of data is a necessity for future extreme-scale scientific applications and database systems. To address the growing imbalance between the rate of data production on such systems and the storage capacity and throughput of their I/O subsystems, reducing the handled data volume by compression is a reasonable approach. However, many scientific datasets compress poorly because of the highly entropic information they contain; these are referred to as hard-to-compress datasets. Lossless compression of such datasets, where exact reproduction of the original data is required, typically yields no more than a 20% reduction in size. Moreover, current applications of compression to hard-to-compress scientific datasets hinder end-to-end throughput because of the time overhead of data analysis, compression, and reorganization. When the overhead of applying compression exceeds the end-to-end performance gain obtained from the data reduction, using a compressor has no practical benefit for scientific systems.
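
The break-even condition in the last sentence can be made concrete with a simple serial model. This is an illustrative assumption, not a model taken from the dissertation: data is first compressed at a fixed throughput, then the reduced output is written at a fixed I/O bandwidth, with no overlap between the two phases. The function and parameter names are hypothetical.

```python
def serial_write_time(size_gb, io_gbps, compress_gbps=None, reduction=0.0):
    """Seconds to write size_gb of data, optionally compressing it first.

    Assumes compression and I/O run one after the other (no overlap).
    `reduction` is the fraction of bytes removed, e.g. 0.2 for 20%.
    """
    if compress_gbps is None:
        return size_gb / io_gbps
    compress_time = size_gb / compress_gbps
    io_time = size_gb * (1.0 - reduction) / io_gbps
    return compress_time + io_time


def break_even_compress_gbps(io_gbps, reduction):
    """Compressor throughput above which serial compression pays off.

    Follows from size/C + (1 - r) * size/B < size/B, i.e. C > B / r.
    """
    if not 0.0 < reduction < 1.0:
        raise ValueError("reduction must be strictly between 0 and 1")
    return io_gbps / reduction


if __name__ == "__main__":
    # With a 20% reduction, the compressor must run at more than
    # five times the I/O bandwidth before it helps in this model.
    print(break_even_compress_gbps(io_gbps=1.0, reduction=0.2))
```

Under these assumptions, a compressor that removes only 20% of the data has to be several times faster than the I/O path to be worthwhile, which is why low compression ratios and high overheads together erase the benefit.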

A difficult problem in lossless compression for improving scientific data reduction efficiency and throughput performance is to identify the hard-to-compress information and then optimize the compression techniques accordingly. To address this challenge, we introduce the In Situ Orthogonal Byte Aggregate Reduction Compression (ISOBAR-compress) methodology, a preconditioner for lossless compression that identifies hard-to-compress datasets and optimizes their compression efficiency and throughput. Out of 24 scientific datasets from both the public domain and petascale simulations, ISOBAR-compress accurately identified the 19 that were hard-to-compress. Additionally, ISOBAR-compress improved data reduction by an average of 19% and increased compression and decompression throughput by average speedups of 24.1 and 33.6, respectively.
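
The abstract does not describe how the preconditioner works internally. As a rough illustration of what byte-level preconditioning could look like, the sketch below splits double-precision values into byte columns, estimates the entropy of each column, and passes only the low-entropy columns to a standard compressor. The entropy test, the 7.0 bits-per-byte threshold, and all function names are assumptions made for this example; this is not the ISOBAR-compress algorithm.

```python
import math
import random
import struct
import zlib
from collections import Counter


def byte_columns(values, width=8):
    """Pack floats as little-endian doubles and split into byte columns."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return [raw[k::width] for k in range(width)]


def entropy_bits(column):
    """Empirical Shannon entropy of a byte column, in bits per byte."""
    n = len(column)
    return -sum(c / n * math.log2(c / n) for c in Counter(column).values())


def precondition_and_compress(values, threshold=7.0):
    """Compress low-entropy byte columns with zlib; keep the rest raw."""
    out = []
    for col in byte_columns(values):
        if entropy_bits(col) < threshold:
            out.append(("zlib", zlib.compress(col)))
        else:
            out.append(("raw", col))  # skip the compressor entirely
    return out


def restore(columns, count, width=8):
    """Invert precondition_and_compress and return the original floats."""
    raw = bytearray(count * width)
    for k, (tag, payload) in enumerate(columns):
        raw[k::width] = zlib.decompress(payload) if tag == "zlib" else payload
    return list(struct.unpack(f"<{count}d", bytes(raw)))


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.uniform(1.0, 2.0) for _ in range(4096)]
    packed = precondition_and_compress(data)
    print([tag for tag, _ in packed])
    assert restore(packed, len(data)) == data  # lossless round trip
```

In this toy input the sign and exponent bytes are highly regular while the low mantissa bytes look random, so skipping the compressor for the random columns saves time without giving up much reduction.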

Dataset preconditioning for lossless compression is also a promising approach for reducing disk and network I/O activity, which addresses the problem of limited I/O bandwidth in current analytic frameworks. Hence, we also introduce a hybrid compression-I/O methodology that interleaves I/O activity with data compression to improve end-to-end throughput while also reducing dataset size. We evaluate several interleaving strategies, present theoretical models, and evaluate the efficiency and scalability of the approach through comparative analysis. The hybrid method, when applied to 19 hard-to-compress scientific datasets, demonstrates a 12% to 46% increase in end-to-end throughput. At the reported peak bandwidth of 60 GB/s of uncompressed data for a current, leadership-class parallel I/O system, this translates into an effective gain of 7 to 28 GB/s in aggregate throughput.
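
The 7 to 28 GB/s figure follows directly from applying the 12% and 46% relative gains to the 60 GB/s peak bandwidth. A minimal check, with a hypothetical helper name:

```python
def throughput_gain_gbps(peak_gbps, relative_gain):
    """Absolute throughput gain for a relative gain over a peak bandwidth."""
    return peak_gbps * relative_gain


PEAK_GBPS = 60.0  # reported peak bandwidth for uncompressed data

low = throughput_gain_gbps(PEAK_GBPS, 0.12)
high = throughput_gain_gbps(PEAK_GBPS, 0.46)

if __name__ == "__main__":
    # Rounded to whole GB/s, these match the range quoted in the abstract.
    print(f"{low:.1f} to {high:.1f} GB/s")
```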

Lastly, it is important that scientific applications streamline their end-to-end throughput performance beyond only preconditioning datasets for compression. The concept of applying a preconditioner generalizes to other techniques that optimize performance through data analysis and reorganization. For example, in present-day scientific simulations, there is a drive to optimize in situ processing performance by inspecting the layout structure of a generated dataset and then restructuring the content. Typically, these simulations interleave dataset variables in memory during their calculation phase to improve computational performance, but deinterleave the data for subsequent storage and analysis. As a result, an efficient preconditioner for data deinterleaving is critical, since common deinterleaving methods provide inefficient throughput and energy performance. To address this problem, we present a deinterleaving method that is high performance, energy efficient, and generic to any data type. When evaluated against conventional deinterleaving methods on 105 STREAM standard micro-benchmarks, our method always improved throughput and throughput/watt. In the best case, our deinterleaving method improved throughput by up to 26.2x and throughput/watt by up to 7.8x.
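
To make the deinterleaving task concrete, the sketch below shows the straightforward element-by-element approach: a buffer holding records of the form (var0, var1, ...) is split into one contiguous buffer per variable. It works on raw bytes so that it is independent of the element type. This corresponds to the conventional kind of method the abstract compares against, not the optimized method it presents, and the function name and signature are hypothetical.

```python
import struct


def deinterleave(buffer, num_vars, elem_size):
    """Split interleaved records into one contiguous bytes object per variable.

    buffer    -- bytes laid out as v0 v1 ... v(n-1) v0 v1 ... (one record each)
    num_vars  -- number of variables per record
    elem_size -- size of each element in bytes (same for every variable)
    """
    record_size = num_vars * elem_size
    if len(buffer) % record_size != 0:
        raise ValueError("buffer length is not a whole number of records")
    out = [bytearray() for _ in range(num_vars)]
    for start in range(0, len(buffer), record_size):
        for v in range(num_vars):
            offset = start + v * elem_size
            out[v] += buffer[offset:offset + elem_size]
    return [bytes(column) for column in out]


if __name__ == "__main__":
    # Three (x, y) records of 32-bit floats, interleaved as x0 y0 x1 y1 x2 y2.
    interleaved = struct.pack("<6f", 1.0, 10.0, 2.0, 20.0, 3.0, 30.0)
    xs, ys = deinterleave(interleaved, num_vars=2, elem_size=4)
    print(struct.unpack("<3f", xs), struct.unpack("<3f", ys))
```

The strided, small-copy access pattern in the inner loop is what makes this naive form slow on large buffers, which is the inefficiency the paragraph above refers to.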

Indexing (document details)
Advisor: Samatova, Nagiza F.
School: North Carolina State University
School Location: United States -- North Carolina
Source: DAI-B 76/05(E), Dissertation Abstracts International
Subjects: Computer science
Keywords: Data reduction, Dataset preconditioning, High-performance computing, Lossless compression
Publication Number: 3647579
ISBN: 978-1-321-40901-7
Copyright © 2021 ProQuest LLC. All rights reserved.