Fault recovery for distributed shared memory systems

William R. Dieter, James E. Lumpp

Research output: Contribution to conference › Paper › peer-review

2 Scopus citations


Distributed Shared Memory (DSM) offers programmers a shared memory abstraction on top of an underlying network of distributed memory machines. Advances in network technology and in the price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via 'checkpointing' techniques that allow applications to 'roll back' to a recent checkpoint rather than restarting. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message-passing architectures. This paper describes previous approaches to checkpointing message-passing parallel programs along with extensions to DSM systems.

Original language: English
Number of pages: 16
State: Published - 1997
Event: Proceedings of the 1997 IEEE Aerospace Conference. Part 4 (of 4) - Snowmass Village, CO, USA
Duration: Feb 1 1997 to Feb 2 1997


Conference: Proceedings of the 1997 IEEE Aerospace Conference. Part 4 (of 4)
City: Snowmass Village, CO, USA

ASJC Scopus subject areas

  • Engineering (all)

