A fully informed model-based checkpointing protocol for preventing useless checkpoints

Jiang Wu, D. Manivannan

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently, without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint and hence are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e., a checkpointing algorithm that takes the minimum number of forced checkpoints). Researchers have therefore designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm that takes fewer forced checkpoints than some of the existing checkpointing algorithms in its class.
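To make the notion of a forced checkpoint concrete, the following is a minimal sketch of the classic index-based style of communication-induced checkpointing. It is not the protocol presented in the paper. The class `Process`, its fields, and its method names are hypothetical, chosen only for illustration. Each process numbers its checkpoints, piggybacks its current index on every message, and takes a forced checkpoint before delivering a message that carries a larger index.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """One process in a hypothetical index-based CIC simulation."""
    name: str
    sn: int = 0  # index of the latest local checkpoint
    checkpoints: list = field(default_factory=list)  # (index, kind) pairs

    def __post_init__(self):
        # Every process starts with an initial checkpoint at index 0.
        self.checkpoints.append((self.sn, "initial"))

    def basic_checkpoint(self):
        """Checkpoint taken independently, on the process's own schedule."""
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self):
        """Return the control data piggybacked on an outgoing message."""
        return self.sn

    def receive(self, piggybacked_sn):
        """Take a forced checkpoint before delivery if the sender is ahead."""
        if piggybacked_sn > self.sn:
            self.sn = piggybacked_sn
            self.checkpoints.append((self.sn, "forced"))
            return True
        return False


p, q = Process("P"), Process("Q")
p.basic_checkpoint()          # P moves to index 1 on its own
forced = q.receive(p.send())  # Q is at index 0, so it is forced to index 1
again = q.receive(p.send())   # Q has caught up; no checkpoint is needed
print(forced, again, q.checkpoints)
```

In this simple scheme, checkpoints that share an index form a consistent global checkpoint, so none is useless, but a process may be forced to checkpoint more often than strictly necessary. Heuristics of the kind the abstract mentions aim to piggyback richer information so that fewer of these forced checkpoints are taken.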

Original language: English
Pages (from-to): 485-518
Number of pages: 34
Journal: International Journal of Parallel, Emergent and Distributed Systems
Volume: 28
Issue number: 6
State: Published - Dec 1 2013

Keywords

  • checkpointing
  • communication-induced checkpointing
  • fault tolerance
  • rollback recovery
  • useless checkpoints

ASJC Scopus subject areas

  • Software
  • Computer Networks and Communications
