This thesis targets the design of parallelizable algorithms and communication-efficient parallel schedules for numerical linear algebra as well as computations with higher-order tensors. Communication is a growing bottleneck in the execution of most algorithms on parallel computers, which manifests itself as data movement both through the network connecting different processors and through the memory hierarchy of each processor as well as synchronization between processors. We provide a rigorous theoretical model of communication and derive lower bounds as well as algorithms in this model. Our analysis concerns two broad areas of linear algebra and of tensor contractions. We demonstrate the practical quality of the new theoretically-improved algorithms by presenting results which show that our implementations outperform standard libraries and traditional algorithms.
We model the costs associated with local computation, interprocessor communication and synchronization, as well as memory to cache data transfers of a parallel schedule based on the most expensive execution path in the schedule. We introduce a new technique for deriving lower bounds on tradeoffs between these costs and apply them to algorithms in both dense and sparse linear algebra as well as graph algorithms. These lower bounds are attained by what we refer to as 2.5D algorithms, which we give for matrix multiplication, Gaussian elimination, QR factorization, the symmetric eigenvalue problem, and the Floyd-Warshall all-pairs shortest-paths algorithm. 2.5D algorithms achieve lower interprocessor bandwidth cost by exploiting auxiliary memory. Algorithms employing this technique are well known for matrix multiplication, and have been derived in the BSP model for LU and QR factorization, as well as the Floyd-Warshall algorithm. We introduce alternate versions of LU and QR algorithms which have measurable performance improvements over their BSP counterparts, and we give the first evaluations of their performance. We also explore network-topology-aware mapping on torus networks for matrix multiplication and LU, showing how 2.5D algorithms can efficiently exploit collective communication, as well as introducing an adaptation of Cannon's matrix multiplication algorithm that is better suited for torus networks with dimension larger than two. For the symmetric eigenvalue problem, we give the first 2.5D algorithms, additionally solving challenges with memory-bandwidth efficiency that arise for this problem. We also give a new memory-bandwidth efficient algorithm for Krylov subspace methods (repeated multiplication of a vector by a sparse-matrix), which is motivated by the application of our lower bound techniques to this problem.
The latter half of the thesis contains algorithms for higher-order tensors, in particular tensor contractions. The motivating application for this work is the family of coupled-cluster methods, which solve the many-body Schrödinger equation to provide a chemically-accurate model of the electronic structure of molecules and chemical reactions where electron correlation plays a significant role. The numerical computation of these methods is dominated in cost by contraction of antisymmetric tensors. We introduce Cyclops Tensor Framework, which provides an automated mechanism for network-topology-aware decomposition and redistribution of tensor data. It leverages 2.5D matrix multiplication to perform tensor contractions communication-efficiently. The framework is capable of exploiting symmetry and antisymmetry in tensors and utilizes a distributed packed-symmetric storage format. Finally, we consider a theoretically novel technique for exploiting tensor symmetry to lower the number of multiplications necessary to perform a contraction via computing some redundant terms that allow preservation of symmetry and then cancelling them out with low-order cost. We analyze the numerical stability and communication efficiency of this technique and give adaptations to antisymmetric and Hermitian matrices. This technique has promising potential for accelerating coupled-cluster methods both in terms of computation and communication cost, and additionally provides a potential improvement for BLAS routines on complex matrices.
|Commitee:||Head-Gordon, Martin, Yelick, Katherine|
|School:||University of California, Berkeley|
|School Location:||United States -- California|
|Source:||DAI-B 76/08(E), Dissertation Abstracts International|
|Keywords:||Numerical linear algebra, Parallel algorithms, Tensor computations|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be