Tightly-coupled parallel computing is an important tool for problem solving. Structured peer-to-peer network overlays are failure-tolerant and have a low administrative burden. This work seeks to unite the two.
First, I present a completely decentralized algorithm for parallel job scheduling and load balancing in distributed peer-to-peer environments. This algorithm is useful for meta-scheduling across known clusters and scheduling on desktop grids. To accomplish this, I build on previous work to route jobs to appropriate resources, then use the new algorithm to start parallel jobs and balance load across the grid. I also discuss what constitutes useful clusterings for this algorithm as well as inherent scaling limitations. Ultimately, I show that my algorithm performs comparably to one using centralized load balancing with global, up-to-date information. The principal contribution of this work is that the parallel job scheduling is completely decentralized, which previous work does not offer and which enables reliable ad hoc sharing of distributed resources to run parallel computations.
Second, I show how clusters of computers can be found dynamically by using an existing latency prediction technique coupled with a new refinement algorithm. Several latency prediction techniques are compared experimentally. One, based on a tree metric space embedding, is found to be superior to the others. Nevertheless, I show that it is not quite accurate enough. To solve this problem, I present a refinement algorithm for producing quality clusters while still maintaining bounds for the amount of information any given node must store about other nodes. I show that clusters derived this way have scheduler performance comparable to those chosen statically with global knowledge.
Lastly, I discuss previously undiscovered under-specifications in the Content Addressable Network (CAN) structured peer-to-peer system. In high-churn situations, the CAN allows stale information and changes to the overlay structure to create routing problems. I present solutions to these two problems and discuss other issues that may also disrupt a CAN.
|Committee:||Hollingsworth, Jeff, Keleher, Pete, Porter, Adam, Richardson, Derek|
|School:||University of Maryland, College Park|
|School Location:||United States -- Maryland|
|Source:||DAI-B 76/11(E), Dissertation Abstracts International|
|Keywords:||Cluster computing, Distributed systems, High-performance computing, Peer-to-peer|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved