Dissertation/Thesis Abstract

Hardware Acceleration to Address the Costs of Data Movement
by Valavi, Hossein, Ph.D., Princeton University, 2020, 254; 27548046
Abstract (Summary)

We are living in the era of Big Data, witnessing an exponential increase in the amount of generated data, driven primarily by the proliferation of connected devices and embedded sensors. Machine learning plays a critical role in extracting meaningful information from such data. Furthermore, in a growing number of application domains, resource-constrained hardware platforms are required to run machine-learning applications locally. These resource constraints usually appear in the form of energy, power consumption, throughput, latency, and area, making new design strategies necessary. With the end of Moore's-law scaling, hardware acceleration and parallelization have become key approaches for improving the energy efficiency and performance of computational kernels. Although accelerators provide an opportunity to address resource constraints, they primarily speed up computational operations, not memory-access and data-movement operations. In data-centric workloads, however, data movement becomes the bottleneck, limiting both the leverage of hardware acceleration and the range of kernels for which it is beneficial.

This dissertation focuses on three matters. First, on the technology level and for sparse linear-algebraic computation kernels, we analyze the leverage that 3D-IC technology offers to increase the performance gains from hardware acceleration. This is demonstrated by introducing a three-layer architecture consisting of the "DRAM layer", the "SRAM layer", and the "Computation layer", interfaced through vertical 3D interconnects between adjacent layers. The architecture implements the sparse matrix-vector multiplication (SpMV) kernel, an example of a kernel that suffers from irregular memory-access patterns. We identify the key architectural parameters of this accelerator and analyze their impact and interdependencies with respect to overall computational throughput. We show that improved memory bandwidth, together with the ability to feed data to parallelized (and thus spatially distributed) computational units, is the key enabler for SpMV to benefit from acceleration.

Second, for dense linear-algebraic computation kernels, we demonstrate the first charge-domain in-memory-computing accelerator that integrates dense weight storage and multiplication in order to reduce overall data movement. This is achieved by incorporating the computations inside very compact memory bit cells and by using highly linear and stable interdigitated metal-oxide-metal (MOM) capacitors, laid out on top of the bit cells so that they occupy no additional area. The system achieves high computational accuracy primarily as a consequence of the excellent matching and the process and temperature stability of the MOM capacitors (properties also shown to improve in advanced VLSI technologies, thanks to finer lithographic precision). The prototype is fabricated in 65nm CMOS and tested on several benchmarks (MNIST handwritten-digit recognition, the CIFAR-10 object-recognition task, and the SVHN street-view house-number dataset), showing orders-of-magnitude better energy efficiency and throughput compared with current fully digital state-of-the-art neural-network accelerators.

Third, we examine an important theoretical approach for mitigating the memory-access bottleneck. In particular, we study the problem of matrix factorization (MF), which, in its low-rank form, can drastically reduce the required memory footprint. It also identifies the important structures in the data, thereby creating regularity in memory-access patterns. We revisit the landscape of the MF problem and derive prior results in a modern format, without vectorizing the relevant differentials. We then show that all critical points are either global minima or strict saddles, and for the strict saddles we derive a negative upper bound on the minimum eigenvalue of the Hessian map. Our results apply to both low-rank and general-rank factorization. We then analyze how an invariance property of gradient flow affects which strict saddles can be encountered, and how the manifold constraint affects the negative upper bounds on the minimum eigenvalue of the Hessian map. We also examine two other invariance properties of gradient flow and show how they interact with the choice of initial condition to affect training performance. Finally, we illustrate our findings through an experimental study of an fMRI dataset.

Indexing (document details)
Advisors: Verma, Naveen, Ramadge, Peter J
Committee: Verma, Naveen, Ramadge, Peter J, Wentzlaff, David, Chen, Minjie
School: Princeton University
Department: Electrical Engineering
School Location: United States -- New Jersey
Source: DAI-B 81/8(E), Dissertation Abstracts International
Subjects: Electrical engineering, Computer Engineering, Computer science
Keywords: Charge-domain computing, Deep learning, Hardware accelerators, In-memory computing, Machine learning, Neural networks
Publication Number: 27548046
ISBN: 9781392456163
Copyright © 2021 ProQuest LLC. All rights reserved.