Silent Data Corruption Resilient Two-sided Matrix Factorizations

Panruo Wu, Nathan Debardeleben, Qiang Guan, Sean Blanchard, Jieyang Chen, Dingwen Tao, Xin Liang, Kaiming Ouyang, Zizhong Chen

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

This paper presents an algorithm-based fault tolerance (ABFT) method to harden three two-sided matrix factorizations against soft errors: reduction to Hessenberg form, tridiagonal form, and bidiagonal form. These two-sided factorizations are usually prerequisites for computing eigenvalues and eigenvectors and for the singular value decomposition. ABFT has been shown to work on the three main one-sided matrix factorizations (LU, Cholesky, and QR), but extending it to two-sided factorizations is non-trivial because there is no obvious offline, problem-specific way to maintain checksums. We therefore develop an online, algorithm-specific checksum scheme and show how to systematically adapt the two-sided factorization algorithms used in the LAPACK and ScaLAPACK packages to introduce ABFT. The resulting scheme detects and corrects arithmetic errors continuously during the factorization, which allows timely error handling. Detailed analysis and experiments show the cost and the gain in resilience. We demonstrate that our scheme covers a significant portion of the operations of the factorizations. Our checksum scheme achieves high error detection and error correction coverage compared to the state of the art, with low overhead and high scalability.
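As a rough illustration of what an online checksum can look like, the sketch below maintains row and column checksums through a symmetric rank-2 update of the form A ← A − v wᵀ − w vᵀ, the kind of trailing-matrix update that occurs in unblocked tridiagonal reduction. It is not the paper's scheme: the function names, the plain sum checksums, the tolerance, and the single-error assumption are all illustrative choices made here, and the example uses NumPy rather than LAPACK or ScaLAPACK.

```python
import numpy as np

def rank2_update_with_checksums(A, row_sum, col_sum, v, w):
    """Apply A <- A - v w^T - w v^T and update the checksums with the same algebra.

    row_sum holds A @ ones and col_sum holds ones @ A before the update.
    The checksums are updated from v and w only, never recomputed from A,
    so a later comparison can expose an arithmetic error in the update of A.
    """
    A_new = A - np.outer(v, w) - np.outer(w, v)
    # (v w^T) e = v * sum(w) and (w v^T) e = w * sum(v)
    row_new = row_sum - v * w.sum() - w * v.sum()
    # e^T (v w^T) = sum(v) * w and e^T (w v^T) = sum(w) * v
    col_new = col_sum - v.sum() * w - w.sum() * v
    return A_new, row_new, col_new

def detect_and_correct(A, row_sum, col_sum, tol=1e-8):
    """Locate and repair a single corrupted element of A in place.

    Returns None if the checksums agree, the (row, col) index if exactly one
    element was repaired, and raises if the mismatch pattern is not that of
    a single corrupted element. tol is an assumed absolute threshold.
    """
    row_diff = A.sum(axis=1) - row_sum
    col_diff = A.sum(axis=0) - col_sum
    bad_rows = np.flatnonzero(np.abs(row_diff) > tol)
    bad_cols = np.flatnonzero(np.abs(col_diff) > tol)
    if bad_rows.size == 0 and bad_cols.size == 0:
        return None
    if bad_rows.size == 1 and bad_cols.size == 1:
        i, j = int(bad_rows[0]), int(bad_cols[0])
        A[i, j] -= row_diff[i]  # remove the excess measured by the row checksum
        return (i, j)
    raise ValueError("mismatch is not a single-element error; cannot correct")

# Small demonstration with an injected error.
rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
A = A + A.T                      # symmetric, as in tridiagonal reduction
row_sum = A.sum(axis=1)
col_sum = A.sum(axis=0)
v = rng.standard_normal(n)
w = rng.standard_normal(n)

A, row_sum, col_sum = rank2_update_with_checksums(A, row_sum, col_sum, v, w)
expected = A.copy()
A[2, 4] += 5.0                   # simulate a silent corruption of one element
location = detect_and_correct(A, row_sum, col_sum)
```

Because the checksums are carried through each update instead of being derived from the final result, a mismatch can be checked for after every step, which is the sense in which such a check is "online". A production scheme would also need a threshold that accounts for floating-point roundoff at scale and protection for the checksums themselves.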

Original language: English
Pages (from-to): 415-427
Number of pages: 13
Journal: ACM SIGPLAN Notices
Volume: 52
Issue number: 8
State: Published - Jan 26 2017

Bibliographical note

Publisher Copyright:
© 2017 ACM.

Funding

The authors would like to thank the anonymous reviewers for their insightful comments and valuable suggestions. This work is partially supported by NSF grants ACI-1305624 and CCF-1513201, the SZSTI basic research program JCYJ20150630114942313, and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase).

Funders (funder number)
NSFC-Guangdong Joint Fund
SZSTI: JCYJ20150630114942313
National Science Foundation: ACI-1305624, CCF-1513201

    Keywords

    • abft
    • algorithm based fault tolerance
    • eigenvalue decomposition
    • singular value decomposition
    • svd

    ASJC Scopus subject areas

    • General Computer Science
