While modern GPU graph analytics libraries provide usable programming models and good single-node performance, the memory capacity and computation power of a single GPU are still too limited for analyzing large graphs. Scaling graph analytics is challenging, however, because of the characteristics of graph applications: irregular computation, low computation-to-communication ratios, and limited communication bandwidth on multi-GPU platforms. Addressing these challenges while maintaining programmability is yet another difficulty.
In this work, I target the scalability of graph analytics to multiple GPUs, beginning with multiple GPUs within a single node. Compared to GPU clusters, single-node multi-GPU platforms are easier to manage and program, yet still serve as good development environments for multi-GPU graph processing. My work addresses several aspects of multi-GPU graph analytics: the inputs that graph application programmers provide to the multi-GPU framework; how the graph should be distributed across GPUs; the interaction between local computation and remote communication; what and when to communicate; how to combine received and local data; and when the application should stop. I answer these questions by extending the Gunrock graph analytics framework from a single GPU to multiple GPUs, showing that most graph applications scale well in my system. I also show that direction-optimizing breadth-first search (DOBFS) is the most difficult scaling challenge because of its extremely low compute-to-communication ratio.
To address the DOBFS scaling challenge, I demonstrate a DOBFS implementation with an efficient graph representation, local computation, and remote communication, based on the idea of separating high- and low-degree vertices. I particularly target communication costs, using global reduction with bitmasks for high-degree vertices and point-to-point communication for low-degree vertices. This greatly reduces the overall communication cost and results in good DOBFS scaling on large-scale graphs across more than a hundred GPUs on the Sierra early access system (the test bed for the Sierra supercomputer).
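The intuition behind the split can be illustrated with a back-of-envelope communication model. The sketch below is hypothetical (not Gunrock code, and the function names are invented for illustration): it compares the per-iteration bytes exchanged when a frontier is shared as a bitmask (one bit per vertex, suitable for OR-reduction across GPUs) versus as an explicit list of 4-byte vertex IDs sent point-to-point.

```python
# Hypothetical cost model for frontier exchange in a multi-GPU BFS.
# A bitmask costs a fixed |V|/8 bytes per exchange no matter how dense
# the frontier is, while an explicit ID list costs 4 bytes per frontier
# vertex -- so dense frontiers favor bitmask reduction and sparse
# frontiers favor point-to-point ID lists.

def bitmask_bytes(num_vertices: int) -> int:
    # One bit per vertex, rounded up to whole bytes.
    return (num_vertices + 7) // 8

def id_list_bytes(frontier_size: int, id_bytes: int = 4) -> int:
    # One fixed-width ID per frontier vertex.
    return frontier_size * id_bytes

V = 1_000_000
# Dense frontier (typical of high-degree vertices mid-traversal):
# the fixed-size bitmask is far cheaper than half a million IDs.
assert bitmask_bytes(V) < id_list_bytes(V // 2)
# Sparse frontier (early/late iterations, low-degree vertices):
# sending a handful of IDs beats broadcasting the whole bitmask.
assert id_list_bytes(100) < bitmask_bytes(V)
```

The crossover point, frontier size ≈ |V|/32 with 4-byte IDs, is why routing high-degree (dense) traffic through a bitmask reduction and low-degree (sparse) traffic through point-to-point messages cuts total communication volume.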
Next, I revisit the design choices I made for the single-node multi-GPU framework in light of recent hardware and software developments, such as better peer GPU access and unified virtual memory. I analyze nine newly developed complex graph applications for the DARPA HIVE program, implemented in the Gunrock framework, and show that they exhibit a wide range of potential scalability. More importantly, the questions of when and how to communicate are more diverse than in the single-node framework. From this analysis, I conclude that future multi-GPU frameworks, whether single- or multi-node, need to be more flexible: instead of communicating only at iteration boundaries, they should support a more flexible, general communication model. I also propose other research directions for future graph processing, including asynchronous computation and communication, specialized graph representations, and heterogeneous processing.
|Advisor:||Owens, John D.|
|Committee:||Amenta, Nina, Akella, Venkatesh|
|School:||University of California, Davis|
|Department:||Electrical and Computer Engineering|
|School Location:||United States -- California|
|Source:||DAI-B 81/3(E), Dissertation Abstracts International|
|Subjects:||Computer Engineering, Computer science|
|Keywords:||BFS, Breadth-first search, Distributed graph processing, GPU, Multi GPU, Parallel graph processing|