Fault recovery for distributed shared memory systems

William R. Dieter, James E. Lumpp

Research output: Contribution to conference › Paper › peer-review

2 Scopus citations

Abstract

Distributed Shared Memory (DSM) offers programmers a shared memory abstraction on top of an underlying network of distributed memory machines. Advances in network technology and price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via 'checkpointing' techniques that allow applications to 'roll back' to a recent checkpoint rather than restarting. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message-passing architectures. This paper describes previous approaches to checkpointing message-passing parallel programs along with extensions to DSM systems.
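
The checkpoint and roll-back idea mentioned in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's method: the class `CheckpointedCounter`, the function `run`, and the parameters `checkpoint_every` and `fail_at` are hypothetical names chosen for illustration, and the failure is simulated. A real DSM system with relaxed memory consistency would also have to coordinate checkpoints across nodes, which this sketch does not attempt.

```python
import copy


class CheckpointedCounter:
    """Toy application state that can be saved and restored (illustrative only)."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        # The initial state serves as the first checkpoint.
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a full copy of the current state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the most recently saved state.
        self.state = copy.deepcopy(self._checkpoint)

    def advance(self):
        # One unit of "work": increment the step and accumulate it.
        self.state["step"] += 1
        self.state["total"] += self.state["step"]


def run(app, steps, checkpoint_every, fail_at=None):
    """Run until `steps` are done, injecting one simulated failure at `fail_at`.

    Returns the number of steps that had to be redone after the roll-back.
    """
    failed = False
    redone = 0
    while app.state["step"] < steps:
        if fail_at is not None and not failed and app.state["step"] == fail_at:
            failed = True
            lost_at = app.state["step"]
            app.rollback()  # resume from the last checkpoint, not from step 0
            redone += lost_at - app.state["step"]
            continue
        app.advance()
        if app.state["step"] % checkpoint_every == 0:
            app.checkpoint()
    return redone
```

With a checkpoint every 4 steps and a simulated failure at step 7, the run resumes from step 4 and repeats only 3 steps instead of restarting from the beginning; the final state is the same as in a failure-free run.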

Original language: English
Pages: 525-540
Number of pages: 16
State: Published - 1997
Event: Proceedings of the 1997 IEEE Aerospace Conference. Part 4 (of 4) - Snowmass Village, CO, USA
Duration: Feb 1 1997 to Feb 2 1997

ASJC Scopus subject areas

  • General Engineering
