A checkpoint of a process involved in a distributed computation is said to be useful if it is part of a consistent global checkpoint. In this paper, we present a quasi-synchronous checkpointing algorithm that makes every checkpoint useful. We also present an efficient asynchronous recovery algorithm based on the checkpointing algorithm. The checkpointing algorithm allows the processes to take checkpoints asynchronously and also forces the processes to take additional checkpoints in order to make every checkpoint useful. The recovery algorithm can handle concurrent failures of multiple processes. It has no domino effect: a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. Messages are logged only selectively, to cope with the various types of message abnormalities that arise due to rollback, which results in low message logging overhead. Unlike some existing algorithms, our algorithm does not use vector timestamps for tracking dependency between checkpoints, which results in low message overhead during failure-free operation. Moreover, a process can asynchronously decide which checkpoints are garbage and delete them from stable storage; garbage checkpoints are checkpoints that are no longer required for the purpose of recovery.
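The abstract does not state the rule that triggers forced checkpoints. As an illustration only, the sketch below assumes a scalar-sequence-number scheme of the kind used in communication-induced checkpointing: every message carries the sender's current checkpoint sequence number, and a receiver that is behind takes a forced checkpoint before delivering the message. The class `Process`, its fields, and the message layout are hypothetical names chosen for this sketch and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process with a scalar checkpoint sequence number (assumed scheme)."""
    pid: int
    sn: int = 0  # sequence number of this process's latest checkpoint
    checkpoints: list = field(default_factory=list)  # (sn, kind) pairs
    delivered: list = field(default_factory=list)    # payloads delivered so far

    def __post_init__(self):
        # Every process starts with an initial checkpoint numbered 0.
        self.checkpoints.append((self.sn, "initial"))

    def basic_checkpoint(self):
        # Taken asynchronously, whenever the process itself decides to.
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self, payload):
        # Piggyback one integer on the message, not a vector timestamp.
        return {"sender": self.pid, "sn": self.sn, "payload": payload}

    def receive(self, msg):
        # If the sender is ahead, take a forced checkpoint BEFORE delivery,
        # so the message is never received "before" a checkpoint numbered
        # msg["sn"] on this process.
        if msg["sn"] > self.sn:
            self.sn = msg["sn"]
            self.checkpoints.append((self.sn, "forced"))
        self.delivered.append(msg["payload"])


if __name__ == "__main__":
    p, q = Process(pid=0), Process(pid=1)
    p.basic_checkpoint()          # p now has a checkpoint numbered 1
    q.receive(p.send("hello"))    # q is behind, so it checkpoints first
    print(q.checkpoints)
```

Under this assumed rule, a message sent after a checkpoint numbered k is always delivered after the receiver has a checkpoint numbered at least k, which is why checkpoints with matching numbers can be combined into a consistent global checkpoint using only one integer per message.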
|Number of pages|34|
|Journal|Journal of Parallel and Distributed Computing|
|State|Published - Dec 1 2002|
Bibliographical note
Funding Information:
D. Manivannan received a B.Sc. in mathematics with special distinction from the University of Madras, Madras, India. He received an MS in mathematics and an MS in computer science from The Ohio State University, Columbus, Ohio, in 1992 and 1993, respectively. He received his Ph.D. in computer science from The Ohio State University in 1997. Manivannan is currently an assistant professor of computer science at the University of Kentucky, Lexington. His research interests include distributed systems, operating systems, mobile computing systems, and interprocess communication in parallel architectures. He is a member of ACM, IEEE, and the IEEE Computer Society. Manivannan is a recipient of the CAREER Award from the National Science Foundation.
This research was supported in part by the National Science Foundation, CAREER Award # CCR-9983584.
Keywords
- Asynchronous recovery
- Communication-induced checkpointing
- Distributed checkpointing
- Multiple failures
- Quasi-synchronous checkpointing
- Vector timestamps
ASJC Scopus subject areas
- Theoretical Computer Science
- Hardware and Architecture
- Computer Networks and Communications
- Artificial Intelligence