New-sum: A novel online ABFT scheme for general iterative methods

Dingwen Tao, Shuaiwen Leon Song, Sriram Krishnamoorthy, Panruo Wu, Xin Liang, Eddy Z. Zhang, Darren Kerbyson, Zizhong Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

31 Citations (SciVal)

Abstract

Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover from soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from the UFL Sparse Matrix Collection. The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering from various types of soft errors in general iterative methods.
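For readers unfamiliar with checksum-based protection of matrix-vector multiplication, the following is a minimal sketch of the classical idea only. It is not the New-Sum encoding, whose details the abstract does not give. It assumes a small dense matrix stored as Python lists, and the names `column_checksum`, `checked_matvec`, the `fault` parameter, and the tolerance `tol` are all hypothetical, chosen for illustration. The check relies on the identity sum(A x) = (1^T A) x: the column-sum vector is computed once, and each product is then verified against it.

```python
def matvec(A, x):
    """Dense matrix-vector product y = A x using plain Python lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def column_checksum(A):
    """Checksum row c = 1^T A (the sum of each column), computed once up front."""
    return [sum(col) for col in zip(*A)]


def checked_matvec(A, c, x, tol=1e-9, fault=None):
    """Compute y = A x and verify sum(y) against c . x.

    `fault` is a hypothetical (index, delta) pair added to y to simulate
    a soft error in the arithmetic; it is an illustration device only.
    Returns (y, ok), where ok is False when the checksum test fails.
    """
    y = matvec(A, x)
    if fault is not None:
        i, delta = fault
        y[i] += delta
    expected = sum(c_j * x_j for c_j, x_j in zip(c, x))  # equals 1^T A x
    actual = sum(y)
    # Relative tolerance so floating-point round-off is not flagged as an error.
    scale = max(1.0, abs(expected), abs(actual))
    ok = abs(actual - expected) <= tol * scale
    return y, ok


A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x = [1.0, 2.0, 3.0]
c = column_checksum(A)

y_clean, ok_clean = checked_matvec(A, c, x)
y_bad, ok_bad = checked_matvec(A, c, x, fault=(1, 0.5))
print(ok_clean, ok_bad)
```

In an iterative solver, a failed check of this kind would be the trigger for rolling back to the last checkpoint, which is the role the abstract assigns to the checkpoint/rollback scheme.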

Original language: English
Title of host publication: HPDC 2016 - Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
Pages: 43-55
Number of pages: 13
ISBN (Electronic): 9781450343145
DOIs
State: Published - May 31 2016
Event: 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2016 - Kyoto, Japan
Duration: May 31 2016 to Jun 4 2016

Publication series

Name: HPDC 2016 - Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing

Conference

Conference: 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2016
Country/Territory: Japan
City: Kyoto
Period: 5/31/16 to 6/4/16

Bibliographical note

Funding Information:
The authors would like to thank the anonymous reviewers for their insightful comments and valuable suggestions. This work is partially supported by the NSF grants CCF-1305622, ACI-1305624, CCF-1513201, the SZSTI basic research program JCYJ20150630114942313, and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase). This work was also supported in part by the U.S. Department of Energy's (DOE) Office of Science, Office of Advanced Scientific Computing Research, under awards 66905 and 59921. Pacific Northwest National Laboratory is operated by Battelle for DOE under Contract DE-AC05-76RL01830.

Publisher Copyright:
Copyright © 2016 by the Association for Computing Machinery, Inc. (ACM).

Keywords

  • Algorithm-based fault tolerance (ABFT)
  • Checkpoint
  • Checksum
  • Iterative methods
  • Online error detection
  • Resilience
  • Rollback recovery
  • Silent data corruption (SDC)

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software
