Abstract
Distributed Shared Memory (DSM) offers programmers a shared memory abstraction on top of an underlying network of distributed memory machines. Advances in network technology and in the price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via "checkpointing" techniques that allow applications to "roll back" to a recent checkpoint rather than restarting. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message-passing architectures. This paper describes previous approaches to checkpointing message-passing parallel programs along with extensions to DSM systems.
Original language | English |
---|---|
Pages | 525-540 |
Number of pages | 16 |
State | Published - 1997 |
Event | Proceedings of the 1997 IEEE Aerospace Conference. Part 4 (of 4), Snowmass Village, CO, USA. Duration: Feb 1 1997 → Feb 2 1997 |
Conference
Conference | Proceedings of the 1997 IEEE Aerospace Conference. Part 4 (of 4) |
---|---|
City | Snowmass Village, CO, USA |
Period | 2/1/97 → 2/2/97 |
ASJC Scopus subject areas
- General Engineering