Abstract
Checkpointing and rollback recovery are widely used techniques for handling failures in distributed computing systems. If processes do not coordinate during checkpointing, they may take useless checkpoints, that is, checkpoints that cannot be part of any consistent global checkpoint. In this paper, we propose a communication-induced checkpointing algorithm that prevents useless checkpoints by directing processes to take forced checkpoints more efficiently whenever a communication pattern that may lead to a Z-cycle (ZC) is observed. The existence of a ZC involving a checkpoint is known to be necessary and sufficient for that checkpoint to be useless. The basic idea behind our algorithm can be extended to existing model-based checkpointing algorithms to reduce the number of forced checkpoints. We also compare the performance of our algorithm with that of a well-known existing algorithm.
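To make the notion of a Z-cycle concrete, the following is a minimal offline sketch that checks whether a given checkpoint lies on a Z-cycle, using the standard zigzag-path definition. It is not the algorithm proposed in the paper, which works online by taking forced checkpoints. The `Msg` record, the function name `on_z_cycle`, and the interval numbering (interval `x` of a process is the period after its checkpoint `x` and before its checkpoint `x + 1`) are assumptions made for illustration.

```python
from collections import deque
from typing import NamedTuple


class Msg(NamedTuple):
    """Hypothetical message record; interval x follows checkpoint x."""
    sender: int    # sending process id
    send_iv: int   # checkpoint interval in which the message was sent
    receiver: int  # receiving process id
    recv_iv: int   # checkpoint interval in which the message was received


def on_z_cycle(messages, proc, ckpt):
    """Return True if checkpoint `ckpt` of process `proc` lies on a Z-cycle.

    A zigzag path starts with a message sent by `proc` after the checkpoint,
    continues with messages sent by the previous receiver in the same or a
    later interval than the receipt, and ends with a message received by
    `proc` before the checkpoint.
    """
    # First hop: messages sent by `proc` after the checkpoint.
    start = [m for m in messages if m.sender == proc and m.send_iv >= ckpt]
    seen = set(start)
    queue = deque(start)
    while queue:
        m = queue.popleft()
        # Last hop: received by `proc` before the checkpoint.
        if m.receiver == proc and m.recv_iv < ckpt:
            return True
        for nxt in messages:
            # Next hop may be sent before the receipt of `m`, as long as
            # it is sent in the same or a later checkpoint interval.
            if (nxt not in seen and nxt.sender == m.receiver
                    and nxt.send_iv >= m.recv_iv):
                seen.add(nxt)
                queue.append(nxt)
    return False


# Process 1 sends m2 and later receives m1 within the same interval 0,
# while process 0 checkpoints between receiving m2 and sending m1.
cyclic = [Msg(0, 1, 1, 0), Msg(1, 0, 0, 0)]

# Same pattern, but process 1 takes a checkpoint between sending m2 and
# receiving m1, so m1 is now received in interval 1.
broken = [Msg(0, 1, 1, 1), Msg(1, 0, 0, 0)]

useless = on_z_cycle(cyclic, proc=0, ckpt=1)
still_useless = on_z_cycle(broken, proc=0, ckpt=1)
```

In the first pattern, checkpoint 1 of process 0 is on a Z-cycle and is therefore useless. In the second, the extra checkpoint on process 1 separates the send of `m2` from the receipt of `m1`, which breaks the cycle. This is the effect a forced checkpoint is meant to achieve.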
Original language | English |
---|---|
Pages (from-to) | 383-406 |
Number of pages | 24 |
Journal | International Journal of Parallel, Emergent and Distributed Systems |
Volume | 24 |
Issue number | 5 |
State | Published - Oct 2009 |
Bibliographical note
Funding Information: A preliminary version of this paper [16] was presented at the 25th International Conference on Parallel and Distributed Computing and Networking. This material is based in part upon work supported by the US National Science Foundation under Grant No. IIS-0414791. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Keywords
- Checkpointing
- Fault-tolerance
- Rollback recovery
- Useless checkpoints
ASJC Scopus subject areas
- Software
- Computer Networks and Communications