Technology trends suggest that future machines will rely on parallelism to meet increasing performance requirements. To aid in programmer productivity and application performance, many parallel programming models provide communication building blocks called collective communication. These operations, such as Broadcast, Scatter, Gather, and Reduce, abstract common global data movement patterns behind a simple library interface allowing the hardware and runtime system to optimize them for performance and scalability.
We consider the problem of optimizing collective communication in Partitioned Global Address Space (PGAS) languages. Rooted in traditional shared memory programming models, they deliver the benefits of sophisticated distributed data structures using language extensions and one-sided communication. One-sided communication allows one processor to directly read and write memory associated with another. Many popular PGAS language implementations share a common runtime system called GASNet for implementing such communication. To provide a highly scalable platform for our work, we present a new implementation of GASNet for the IBM BlueGene/P, allowing GASNet to scale to tens of thousands of processors.
We demonstrate that PGAS languages are highly scalable and that the one-sided communication within them is an efficient and convenient platform for collective communication. We show how to use one-sided communication to achieve 3× improvements in the latency and throughput of the collectives over standard message passing implementations. Using a 3D FFT as a representative communication bound benchmark, for example, we see a 17% increase in performance on 32,768 cores of the BlueGene/P and a 1.5× improvement on 1024 cores of the CrayXT4. We also show how the automatically tuned collectives can deliver more than an order of magnitude in performance over existing implementations on shared memory platforms.
There is no obvious best algorithm that serves all machines and usage patterns demonstrating the need for tuning and we thus build an automatic tuning system in GASNet that optimizes the collectives for a variety of large scale supercomputers and novel multicore architectures. To understand the large search space, we construct analytic performance models use them to minimize the overhead of autotuning. We demonstrate that autotuning is an effective approach to addressing performance optimizations on complex parallel systems.
|Advisor:||Yelick, Katherine A.|
|Commitee:||Demmel, James W., Papadopoulos, Panayiotis|
|School:||University of California, Berkeley|
|School Location:||United States -- California|
|Source:||DAI-B 71/05, Dissertation Abstracts International|
|Keywords:||Automatic tuning, Collective communication, Distributed computing, Parallel computing, Partitioned global address space, Runtime systems|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be