HOPE: A Hybrid Optimistic checkpointing and selective Pessimistic mEssage logging protocol for large scale distributed systems

Research output: Contribution to journal (Article, peer-reviewed)

16 Scopus citations

Abstract

Future-generation supercomputers will be message-passing distributed systems consisting of hundreds of thousands of processors. As the size of the system grows, the failure rate increases. Hence, for such large-scale systems to succeed and be deployable, scalable checkpointing and recovery protocols need to be implemented. Existing checkpointing and rollback-recovery protocols used for providing fault tolerance in distributed systems do not scale to such large systems. In this paper, we address this important and timely issue and propose a scalable group-based Hybrid Optimistic checkpointing and selective Pessimistic mEssage logging (HOPE) protocol. Performance evaluation indicates that our protocol takes a balanced approach to lowering checkpointing and message-logging overhead and enhances scalability.

Original language: English
Pages (from-to): 1217-1235
Number of pages: 19
Journal: Future Generation Computer Systems
Volume: 28
Issue number: 8
State: Published - Oct 2012

Bibliographical note

Funding Information:
Dr. Manivannan is a recipient of the Faculty CAREER Award from the US National Science Foundation. He is a senior member of the IEEE and a senior member of the ACM.

Funding Information:
This material is based in part upon work supported by the US National Science Foundation under Grant No. IIS-0414791. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors thank the editor and the reviewers for their valuable and constructive comments, which helped greatly in improving the content and presentation of the paper.

Keywords

  • Checkpointing protocols
  • Consistent global checkpoint
  • Failure recovery in distributed systems
  • Fault tolerance
  • Large scale systems
  • Message logging protocols

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications
