Checkpointing is widely adopted to support fault tolerance and job migration, both essential for large-scale networked multicore systems and cloud computing. This dissertation pursues an effective checkpointing mechanism that handles failures and unavailability events in such systems, thereby reducing the expected job turnaround time, the aggregated file size, and the monetary cost involved. To withstand unavailability or failure of local nodes in networked systems, multi-level checkpointing is indispensable, with checkpoint files kept not only locally but also at remote storage. As the number of nodes in such a system grows, I/O bandwidth to remote storage quickly becomes the bottleneck for multi-level checkpointing.
The first part of this work deals with an effective mechanism, dubbed adaptive incremental checkpointing (AIC), which reduces the checkpoint file size considerably to lower the overhead involved and thus shorten the expected job turnaround time. Given that production multicore systems are often observed to have unused cores available, we design AIC to use separate, otherwise idle cores to carry out delta compression adaptively at desirable points in time. AIC supports multi-level checkpointing effectively, with checkpoint files of execution nodes written to their partner nodes and to remote storage concurrently during job execution. AIC is observed on our implemented testbed to substantially lower the normalized expected turnaround time (by up to 41%) and the aggregated file size (by up to 1,000×) when compared with its static counterpart and a recent multi-level checkpointing scheme with fixed checkpoint intervals.
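As a rough illustration of why incremental checkpoints with delta compression can shrink checkpoint files, the sketch below stores only the blocks of a state image that changed since the previous checkpoint and compresses them. It is a minimal sketch under stated assumptions, not the AIC implementation: the block size, the function names `make_delta` and `apply_delta`, and the use of `zlib` are all hypothetical choices made for this example.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical block granularity, chosen for illustration


def make_delta(prev: bytes, curr: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Record only the blocks of `curr` that differ from `prev`, compressed."""
    changed = {}
    for start in range(0, len(curr), block_size):
        block = curr[start:start + block_size]
        if prev[start:start + block_size] != block:
            changed[start // block_size] = zlib.compress(block)
    # The total length is kept so that a restore can grow or shrink the image.
    return {"length": len(curr), "blocks": changed}


def apply_delta(prev: bytes, delta: dict, block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the newer state from the previous checkpoint plus a delta."""
    length = delta["length"]
    image = bytearray(prev[:length].ljust(length, b"\0"))
    for index, compressed in delta["blocks"].items():
        block = zlib.decompress(compressed)
        start = index * block_size
        image[start:start + len(block)] = block
    return bytes(image)


if __name__ == "__main__":
    previous = bytes(4 * BLOCK_SIZE)   # four zero-filled blocks
    current = bytearray(previous)
    current[5000] = 1                  # dirty a single byte in block 1
    delta = make_delta(previous, bytes(current))
    assert apply_delta(previous, delta) == bytes(current)
    print("changed blocks:", sorted(delta["blocks"]))
```

In a scheme of this kind, the comparison and compression work could be handed to a separate worker process so that it runs on a core the application is not using, which is the kind of resource the abstract describes AIC exploiting.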
The second part presents the design and implementation of our enhanced adaptive incremental checkpointing (EAIC) for multithreaded applications on RaaS clouds under spot instance pricing. The EAIC model takes into account spot instance revocation events, in addition to hardware failures, to quickly and accurately predict the desirable points in time to take checkpoints, so as to markedly reduce the expected job turnaround time and the monetary cost. Experimental results from our testbed, driven by real spot instance price traces from Amazon EC2, show that EAIC lowers both the application turnaround time and the monetary cost markedly (by up to 58% and 59%, respectively) in comparison with its recent checkpointing counterpart.
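To give a feel for how interruption rates drive the choice of checkpoint times, the sketch below uses Young's classical first-order approximation for a fixed checkpoint interval, which is a textbook baseline and not the EAIC model described above. It assumes that hardware failures and spot revocations are independent and exponentially distributed, so their rates add; the function names and the numbers in the example are hypothetical.

```python
import math


def combined_mean_time(*mean_times_s: float) -> float:
    """Mean time to the first of several independent exponential events.

    Under the independence and exponential assumptions, event rates add.
    """
    if not mean_times_s or any(m <= 0 for m in mean_times_s):
        raise ValueError("mean times must be positive")
    return 1.0 / sum(1.0 / m for m in mean_times_s)


def young_interval(checkpoint_cost_s: float, mean_time_to_interrupt_s: float) -> float:
    """Young's approximation: interval = sqrt(2 * checkpoint cost * mean time to interrupt)."""
    if checkpoint_cost_s <= 0 or mean_time_to_interrupt_s <= 0:
        raise ValueError("inputs must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mean_time_to_interrupt_s)


if __name__ == "__main__":
    # Hypothetical numbers: 50 s per checkpoint, hardware failures every
    # 20,000 s on average, spot revocations every 20,000 s on average.
    mean_time = combined_mean_time(20_000.0, 20_000.0)
    print(f"combined mean time to interrupt: {mean_time:.0f} s")
    print(f"suggested interval: {young_interval(50.0, mean_time):.0f} s")
```

The approximation shows the general trade-off: more frequent interruptions or cheaper checkpoints both argue for shorter intervals. An adaptive scheme would re-evaluate this decision during execution rather than fixing one interval in advance.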
|Committee:||Bayoumi, Magdy; Perkins, Dmitri; Wu, Hongyi|
|School:||University of Louisiana at Lafayette|
|School Location:||United States -- Louisiana|
|Source:||DAI-B 75/07(E), Dissertation Abstracts International|
|Keywords:||Adaptive checkpointing, Cloud computing, Delta compression, Fault tolerance, Markov model, Networked multicore systems|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved