TY - JOUR
T1 - Silent Data Corruption Resilient Two-sided Matrix Factorizations
AU - Wu, Panruo
AU - Debardeleben, Nathan
AU - Guan, Qiang
AU - Blanchard, Sean
AU - Chen, Jieyang
AU - Tao, Dingwen
AU - Liang, Xin
AU - Ouyang, Kaiming
AU - Chen, Zizhong
N1 - Publisher Copyright:
© 2017 ACM.
PY - 2017/1/26
Y1 - 2017/1/26
N2 - This paper presents an algorithm based fault tolerance method to harden three two-sided matrix factorizations against soft errors: reduction to Hessenberg form, tridiagonal form, and bidiagonal form. These two sided factorizations are usually the prerequisites to computing eigenvalues/eigenvectors and singular value decomposition. Algorithm based fault tolerance has been shown to work on three main one-sided matrix factorizations: LU, Cholesky, and QR, but extending it to cover two sided factorizations is non-trivial because there are no obvious \textit{offline, problem} specific maintenance of checksums. We thus develop an \textit{online, algorithm} specific checksum scheme and show how to systematically adapt the two sided factorization algorithms used in LAPACK and ScaLAPACK packages to introduce the algorithm based fault tolerance. The resulting ABFT scheme can detect and correct arithmetic errors \textit{continuously} during the factorizations that allow timely error handling. Detailed analysis and experiments are conducted to show the cost and the gain in resilience. We demonstrate that our scheme covers a significant portion of the operations of the factorizations. Our checksum scheme achieves high error detection coverage and error correction coverage compared to the state of the art, with low overhead and high scalability.
AB - This paper presents an algorithm based fault tolerance method to harden three two-sided matrix factorizations against soft errors: reduction to Hessenberg form, tridiagonal form, and bidiagonal form. These two sided factorizations are usually the prerequisites to computing eigenvalues/eigenvectors and singular value decomposition. Algorithm based fault tolerance has been shown to work on three main one-sided matrix factorizations: LU, Cholesky, and QR, but extending it to cover two sided factorizations is non-trivial because there are no obvious \textit{offline, problem} specific maintenance of checksums. We thus develop an \textit{online, algorithm} specific checksum scheme and show how to systematically adapt the two sided factorization algorithms used in LAPACK and ScaLAPACK packages to introduce the algorithm based fault tolerance. The resulting ABFT scheme can detect and correct arithmetic errors \textit{continuously} during the factorizations that allow timely error handling. Detailed analysis and experiments are conducted to show the cost and the gain in resilience. We demonstrate that our scheme covers a significant portion of the operations of the factorizations. Our checksum scheme achieves high error detection coverage and error correction coverage compared to the state of the art, with low overhead and high scalability.
KW - abft
KW - algorithm based fault tolerance
KW - eigenvalue decomposition
KW - singular value decomposition
KW - svd
UR - http://www.scopus.com/inward/record.url?scp=85084181579&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85084181579&partnerID=8YFLogxK
U2 - 10.1145/3018743.3018750
DO - 10.1145/3018743.3018750
M3 - Article
AN - SCOPUS:85084181579
SN - 1523-2867
VL - 52
SP - 415
EP - 427
JO - ACM SIGPLAN Notices
JF - ACM SIGPLAN Notices
IS - 8
ER -