Abstract
Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance, as processes may have to wait a long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm that reduces contention for network stable storage without any synchronization overhead.
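The contention argument in the abstract can be illustrated with a toy queueing sketch. This is not the algorithm proposed in the paper. It assumes a single stable-storage server that handles one checkpoint write at a time in FIFO order, with a fixed write duration; the function name `total_wait` and all parameter values are hypothetical.

```python
def total_wait(request_times, write_time):
    """Total time processes spend queued for one FIFO storage server.

    Assumption: every checkpoint write takes exactly `write_time` units
    and the storage serves a single write at a time.
    """
    storage_free_at = 0.0
    waited = 0.0
    for t in sorted(request_times):
        start = max(t, storage_free_at)  # queue if storage is still busy
        waited += start - t
        storage_free_at = start + write_time
    return waited


n, write_time = 4, 2.0
simultaneous = [0.0] * n                          # all processes checkpoint at once
staggered = [i * write_time for i in range(n)]    # one write per time slot

print(total_wait(simultaneous, write_time))  # 12.0
print(total_wait(staggered, write_time))     # 0.0
```

Under these assumptions, simultaneous requests queue behind one another, while requests spaced at least one write duration apart never wait.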
Original language | English |
---|---|
Pages (from-to) | 3110-3117 |
Number of pages | 8 |
Journal | Information Sciences |
Volume | 178 |
Issue number | 15 |
State | Published - Aug 1 2008 |
Bibliographical note
Funding Information: The authors thank the editors and reviewers for their valuable and constructive comments, which helped greatly in improving the content and presentation of the paper. This material is based in part upon work supported by the US National Science Foundation under Grant No. IIS-0414791 and the US Department of Treasury Award #T0505060. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Department of Treasury.
Keywords
- Checkpoint staggering
- Communication-induced checkpointing
- Distributed checkpointing
- Failure-recovery
- Fault-tolerance
- Rollback recovery
- Staggered checkpointing
- Uncoordinated
ASJC Scopus subject areas
- Theoretical Computer Science
- Software
- Control and Systems Engineering
- Computer Science Applications
- Information Systems and Management
- Artificial Intelligence