Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs

Jieyang Chen, Xin Liang, Zizhong Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

28 Scopus citations

Abstract

Extensive research has been done on developing and optimizing algorithm-based fault tolerance (ABFT) schemes for systolic arrays and general-purpose microprocessors. However, little has been done on developing and optimizing ABFT schemes for heterogeneous systems with GPU accelerators. While existing ABFT schemes can correct computing errors such as 1+1=3, we find that they cannot correct many memory storage errors. In this paper, we first develop a new ABFT scheme for Cholesky decomposition that can correct both computing errors and storage errors at the same time, and then develop several optimization techniques to reduce the fault tolerance overhead of ABFT on heterogeneous systems with GPU accelerators. Experimental results demonstrate that our fault-tolerant Cholesky decomposition is able to correct both computing errors and storage errors in the middle of the computation and can achieve better performance than the state-of-the-art vendor-provided Cholesky decomposition library routine in CULA R18.
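As a rough illustration of the checksum idea that ABFT schemes build on, the sketch below stores row and column sums alongside a small matrix, then uses them to locate and repair a single corrupted element. This is a generic textbook-style example, not the scheme developed in the paper: the function names `checksums` and `detect_and_correct` are hypothetical, the single-error assumption is ours, and nothing here reflects the paper's GPU optimizations or its handling of Cholesky updates.

```python
def checksums(m):
    """Return (row_sums, col_sums) for a list-of-lists matrix."""
    row_sums = [sum(row) for row in m]
    col_sums = [sum(col) for col in zip(*m)]
    return row_sums, col_sums


def detect_and_correct(m, row_sums, col_sums, tol=1e-9):
    """Repair at most one corrupted element of m in place.

    Assumes (hypothetically) that the stored checksums are intact and
    that at most one matrix element is wrong. Returns the (row, col)
    index of the repaired element, or None if everything is consistent.
    """
    new_rows, new_cols = checksums(m)
    bad_rows = [i for i, (a, b) in enumerate(zip(new_rows, row_sums))
                if abs(a - b) > tol]
    bad_cols = [j for j, (a, b) in enumerate(zip(new_cols, col_sums))
                if abs(a - b) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("checksums indicate more than one error")
    i, j = bad_rows[0], bad_cols[0]
    # The row-sum mismatch equals the amount by which the element is off.
    m[i][j] -= new_rows[i] - row_sums[i]
    return i, j


# A small symmetric positive definite matrix, the kind Cholesky factors.
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
row_sums, col_sums = checksums(A)

A[1][2] = 30.0  # simulate a storage error (corrupted value in memory)
located = detect_and_correct(A, row_sums, col_sums)
print(located)   # (1, 2)
print(A[1][2])   # 3.0
```

The intersection of the one mismatching row sum and the one mismatching column sum pinpoints the faulty element, and the size of the mismatch gives the correction. A real ABFT scheme must also keep the checksums valid as the factorization updates the matrix, which this sketch does not attempt.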

Original language: English
Title of host publication: Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
Pages: 993-1002
Number of pages: 10
ISBN (Electronic): 9781509021406
State: Published - Jul 18 2016
Event: 30th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016 - Chicago, United States
Duration: May 23 2016 to May 27 2016

Publication series

Name: Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Conference

Conference: 30th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016
Country/Territory: United States
City: Chicago
Period: 5/23/16 to 5/27/16

Bibliographical note

Publisher Copyright:
© 2016 IEEE.

Funding

This work is partially supported by NSF grants CCF-1305622, ACI-1305624, and CCF-1513201, and by the SZSTI basic research program JCYJ20150630114942313.

Funder: Funder number
SZSTI: JCYJ20150630114942313
National Science Foundation (NSF): ACI-1305624, CCF-1305622, CCF-1513201

    Keywords

    • CULA
    • Cholesky Decomposition
    • Fault Tolerance
    • GPUs
    • MAGMA
    • Offline ABFT
    • Online ABFT

    ASJC Scopus subject areas

    • Computer Networks and Communications
