We are living in the era of Big Data, witnessing an exponential increase in the amount of generated data, driven primarily by the proliferation of connected devices and embedded sensors. Machine learning plays a critical role in extracting meaningful information from this data. Furthermore, there is a growing number of application domains in which resource-constrained hardware platforms are required to run machine-learning applications locally. These resource constraints usually appear in the form of energy, power consumption, throughput, latency, and area, making new design strategies necessary. Since the end of the Moore's-law era, hardware acceleration and parallelization have been established as key approaches to improving the energy efficiency and performance of computational kernels. Although accelerators provide an opportunity to address resource constraints, they primarily speed up computational operations, not memory-access and data-movement operations. In data-centric workloads, however, data movement becomes the bottleneck, which limits both the leverage of hardware acceleration and the range of kernels for which it is beneficial.
In this dissertation, the main focus is on three matters. First, on a technological level and for sparse linear-algebra kernels, we analyze the leverage that 3D IC technology offers to increase the performance gains from hardware acceleration. This is demonstrated by introducing a three-layer architecture consisting of a "DRAM layer", an "SRAM layer", and a "Computation layer", interfaced through vertical 3D interconnects between adjacent layers. The architecture implements the sparse matrix-vector multiplication (SpMV) kernel, an example of a kernel that suffers from irregular memory-access patterns. We identify the key architectural parameters of this accelerator and analyze their impact and interdependencies with respect to the overall throughput of the computations. We show that improved memory bandwidth, along with the ability to feed data to parallelized (and thus spatially distributed) computational units, are the key aspects that enable SpMV to benefit from acceleration.
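To illustrate why SpMV stresses the memory system, the sketch below gives a minimal CSR (compressed sparse row) implementation in Python/NumPy. The gather `x[col_idx[j]]` follows the sparsity pattern of the matrix, which is the source of the irregular memory accesses discussed above. This is an illustrative sketch only, not the accelerator's implementation.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A @ x, with A stored in CSR form.

    values  : nonzero entries of A, row by row
    col_idx : column index of each nonzero
    row_ptr : row i's nonzeros occupy values[row_ptr[i]:row_ptr[i+1]]

    The access x[col_idx[j]] jumps around memory according to the
    sparsity pattern -- the irregularity that makes SpMV memory-bound.
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[j] * x[col_idx[j]]
    return y
```

For example, the 2x3 matrix [[1, 0, 2], [0, 3, 0]] is stored as `values=[1, 2, 3]`, `col_idx=[0, 2, 1]`, `row_ptr=[0, 2, 3]`.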
Second, for dense linear-algebra kernels, we demonstrate the first charge-domain in-memory-computing accelerator that integrates dense weight storage and multiplication in order to reduce overall data movement. This is achieved by incorporating the computations inside very compact memory bit cells, using highly linear and stable interdigitated metal-oxide-metal (MOM) capacitors that are laid out on top of the bit cells and thus occupy no additional area. The system achieves high computational accuracy primarily as a consequence of the excellent matching and the process and temperature stability of the MOM capacitors (which are also shown to improve in advanced VLSI technologies, owing to higher lithographic precision). The silicon prototype is fabricated in a 65 nm CMOS process and tested on several benchmarks (MNIST handwritten-digit recognition, CIFAR-10 object recognition, and SVHN street-view house-number recognition), showing orders-of-magnitude better energy efficiency and throughput compared with current fully digital state-of-the-art neural-network accelerators.
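The charge-domain multiply-accumulate can be modeled behaviorally. The sketch below assumes binarized (±1-style) activations and weights, with each bit cell computing an XNOR whose result is then averaged by shorting the capacitors together; both the function name and the binarized setting are illustrative assumptions for this model, not the prototype's exact circuit behavior.

```python
import numpy as np

def charge_domain_mac(x_bits, w_bits):
    """Behavioral model of one charge-domain MAC column.

    Assumption (illustrative): each bit cell drives its capacitor to
    VDD or GND according to XNOR(x, w); shorting the capacitors
    averages the per-cell results, so the shared-node voltage is
    proportional to the bipolar dot product of x and w.
    """
    n = len(x_bits)
    xnor = (x_bits == w_bits).astype(float)  # 1 where bits agree, else 0
    v_out = xnor.mean()                      # capacitive averaging, in [0, 1]
    dot = 2.0 * v_out * n - n                # recover bipolar (+/-1) dot product
    return v_out, dot
```

The key point the model captures is that the multiply and the accumulate both happen in the memory array, so the weights never move.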
Third, we examine an important theoretical approach that can be used to mitigate the memory-access bottleneck. In particular, we study the problem of matrix factorization (MF), which, in its low-rank form, can drastically reduce the required memory footprint. Furthermore, it identifies the important structures in the data, thereby creating regularities in the memory-access patterns. We revisit the landscape of the MF problem and derive prior results in a modern format without vectorizing the relevant differentials. We then show that all critical points are either global minima or strict saddles. For the strict saddles we derive a negative upper bound on the minimum eigenvalue of the Hessian map. Our results apply to both low-rank and general-rank factorization. We then analyze how an invariance property of gradient flow affects the strict saddles that can be encountered, and how the manifold constraint affects the negative upper bounds on the minimum eigenvalue of the Hessian map. We also examine two other invariance properties of gradient flow and show how these interact with the choice of initial condition to affect training performance. Finally, we illustrate our findings through an experimental study of an fMRI dataset.
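To make the memory-footprint argument concrete, the sketch below runs plain gradient descent on the standard MF objective f(X, Y) = ½‖M − XYᵀ‖²_F. A rank-r factorization stores (m + n)·r numbers instead of m·n, and X and Y are read with regular, dense access patterns. This is a minimal illustrative sketch of the generic MF setup, not the dissertation's experimental code; the step size and iteration count are arbitrary choices.

```python
import numpy as np

def factorize(M, r, lr=0.05, steps=2000, seed=0):
    """Gradient descent on f(X, Y) = 0.5 * ||M - X @ Y.T||_F^2.

    Stores (m + n) * r numbers instead of m * n, and accesses X, Y
    with regular dense patterns -- the two benefits noted above.
    """
    rng = np.random.default_rng(seed)
    m, n = M.shape
    X = 0.1 * rng.standard_normal((m, r))  # small init, away from the saddle at 0
    Y = 0.1 * rng.standard_normal((n, r))
    for _ in range(steps):
        R = X @ Y.T - M      # residual
        gX = R @ Y           # gradient w.r.t. X
        gY = R.T @ X         # gradient w.r.t. Y
        X -= lr * gX
        Y -= lr * gY
    return X, Y
```

Note that the origin (X, Y) = (0, 0) is a critical point but not a minimum, which is why the sketch initializes away from it; the landscape results above classify such points as strict saddles.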
|Advisor:||Verma, Naveen, Ramadge, Peter J|
|Committee:||Verma, Naveen, Ramadge, Peter J, Wentzlaff, David, Chen, Minjie|
|School Location:||United States -- New Jersey|
|Source:||DAI-B 81/8(E), Dissertation Abstracts International|
|Subjects:||Electrical engineering, Computer Engineering, Computer science|
|Keywords:||Charge-domain computing, Deep learning, Hardware accelerators, In-memory computing, Machine learning, Neural networks|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved