Advanced bit manipulation operations are not efficiently supported by commodity word-oriented microprocessors, so programmers typically resort to tricks that shorten the long instruction sequences needed to emulate them. Because these operations are relevant to applications that are becoming increasingly important, we propose direct support for them in microprocessors. In particular, we propose fast bit gather (or parallel extract), bit scatter (or parallel deposit) and bit matrix multiply instructions, building on previous work that focused solely on instructions for accelerating general bit permutations.
We show that the bit gather and bit scatter instructions can be implemented efficiently on the fast butterfly and inverse butterfly network datapaths. We define static, dynamic and loop-invariant versions of the instructions; the static versions require a much simpler functional unit than the dynamic or loop-invariant versions. We show how a hardware decoder can be implemented for the dynamic and loop-invariant versions to generate the control signals for the butterfly and inverse butterfly datapaths on the fly. We propose a new advanced bit manipulation functional unit to support bit gather, bit scatter and bit permutation instructions, and then show how this functional unit can be extended to subsume the functionality of the standard shifter unit. This new unit represents an evolution in the design of shifters.
We also consider the bit matrix multiply instruction. This instruction multiplies two n × n bit matrices; it is a powerful bit manipulation primitive and can be used to accelerate parity computation. Bit matrix multiply is currently supported only by supercomputers, and we investigate simpler implementations suitable for commodity microprocessors.
Additionally, we analyze a variety of application kernels drawn from domains including binary compression, image manipulation, communications, random number generation, bioinformatics, integer compression and cryptology. We show that our proposed instructions yield significant speedups over a basic RISC architecture: parallel extract and parallel deposit speed up applications by 2.4× on average, while applications that benefit from bit matrix multiply see even greater speedups.
Advisor: Lee, Ruby B.
School Location: United States -- New Jersey
Source: DAI-B 69/10, Dissertation Abstracts International
Keywords: Bit manipulation, Bit matrix multiply, Instruction sets, Parallel extraction