Exploitation of parallelism has for decades been central to the pursuit of computing performance. This is evident in many facets of processor design: in pipelined execution, superscalar dispatch, pipelined and banked memory subsystems, multithreading, and more recently, in the proliferation of cores within chip multiprocessors (CMPs). As designs have evolved, and the parallelism dividend of each technique has been exhausted, designers have turned to other techniques in search of ever more parallelism.
The recent shift to multi-core designs is a profound one, since available parallelism promises to scale farther than at prior levels, limited by interconnect degree and thermal constraints. This explosion in parallelism necessitates changes in how hardware and software interact. In this dissertation, I focus on hardware aspects of this interaction, providing support for efficient on-chip parallel execution in the face of increasing core counts.
First, I introduce a mechanism for coping with increasing memory latencies in multithreaded processors. While prior designs coped well with instruction latencies in the low tens of cycles, I show that long latencies associated with stalls for main memory access lead to pathological resource hoarding and performance degradation. I demonstrate a reactive solution which more than doubles throughput for two-thread workloads.
Next, I reconsider the design of coherence subsystems for CMPs. I show that implementation of a traditional directory protocol on a CMP fails to take advantage of the latency and bandwidth landscape typical of CMPs. Then, I propose a CMP-specific customization of directory-based coherence, and use it to demonstrate overall speedup, reduced miss latency, and decreased interconnect utilization.
I then focus on improving hardware support for multithreading itself, specifically for thread scheduling, creation, and migration. I approach this from two complementary directions. First, I augment a CMP with support for rapidly transferring register state between execution pipelines and off-core thread storage. I demonstrate performance improvement from accelerated inter-core threading, both by scheduling around long-latency stalls as they occur, and by running a conventional multi-thread scheduler at higher sample rates than would be possible with software alone. Second, I consider a key bottleneck for newly-forked and newly-rescheduled threads: the lack of useful cached working sets, and the inability of conventional hardware to quickly construct those sets. I propose a solution which uses small hardware tables to monitor the behavior of executing threads, prepares working-set summaries on demand, and then uses those summaries to rapidly prefetch working sets when threads are forked or migrated. These techniques as much as double the performance of newly-migrated threads.
|Committee:||Calder, Brad; Esener, Sadik; Sherwood, Tim; Swanson, Steven; Tullsen, Dean|
|School:||University of California, San Diego|
|Department:||Computer Science and Engineering|
|School Location:||United States -- California|
|Source:||DAI-B 71/05, Dissertation Abstracts International|
|Subjects:||Computer Engineering, Computer science|
|Keywords:||CMPs, Chip multiprocessors, Coherence, Multicore, Multithreading, Parallelism, SMT|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved