CAREER: Design and Implementation of Fault-Tolerant Distributed Computing Systems

Grants and Contracts Details

Description

This project focuses on developing efficient techniques for implementing fault-tolerant distributed systems. Although checkpointing and rollback recovery are well-known techniques for achieving fault tolerance in distributed systems, the intricacies involved in designing efficient checkpointing and recovery protocols have been thoroughly addressed and understood only recently. Based on this theoretical foundation, it is proposed to (i) develop efficient checkpointing and recovery techniques; (ii) implement a simulation testbed for evaluating the performance of both the newly developed and the existing checkpointing and recovery techniques; and (iii) integrate the findings of our research, as well as existing research on fault tolerance, into the graduate curriculum. The important expected outcomes of this work are: (i) an improved understanding of the issues involved in building reliable distributed systems; (ii) improved techniques for fault tolerance in distributed systems, based on hard experimental data as well as a strong theoretical foundation; and (iii) integration of the results of our research into an advanced course in distributed systems, which would help students not only understand the intricacies involved in building reliable distributed systems but also acquire the tools and techniques necessary for building such systems.
Status: Finished
Effective start/end date: 9/1/00 → 8/31/04

Funding

  • National Science Foundation: $219,999.00
