
Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs

  • Jieyang Chen
  • Hongbo Li
  • Sihuan Li
  • Xin Liang
  • Panruo Wu
  • Dingwen Tao
  • Kaiming Ouyang
  • Yuanlai Liu
  • Kai Zhao
  • Qiang Guan
  • Zizhong Chen

Research output: Conference contribution, peer-reviewed

19 Citations (Scopus)

Abstract

Current algorithm-based fault tolerance (ABFT) approaches for one-sided matrix decomposition on heterogeneous systems with GPUs have the following limitations: (1) they do not provide sufficient protection, as most of them maintain checksums in only one dimension; (2) their checking schemes are inefficient due to redundant checksum verifications; (3) they fail to protect PCIe communication; and (4) their checksum calculation, which is based on a special type of matrix multiplication, is far from efficient. By overcoming these limitations, we design an efficient ABFT approach that provides stronger protection for one-sided matrix decomposition methods on heterogeneous systems. First, we provide full matrix protection by using checksums in two dimensions. Second, we make our checking scheme more efficient by prioritizing checksum verification according to the sensitivity of matrix operations to soft errors. Third, we protect PCIe communication by reordering checksum verifications and decomposition steps. Fourth, we accelerate the checksum calculation by 1.7x by better utilizing GPUs.
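The idea of keeping checksums in two dimensions can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it runs on the CPU, covers no decomposition step, and assumes at most one corrupted element. The helper names (`row_checksums`, `column_checksums`, `locate_and_correct`) and the tolerance `tol` are hypothetical, chosen only for this example. A mismatching row checksum gives the row of the error, a mismatching column checksum gives the column, and the size of the mismatch gives the correction.

```python
def row_checksums(a):
    """Return the sum of each row of matrix a (a list of lists)."""
    return [sum(row) for row in a]


def column_checksums(a):
    """Return the sum of each column of matrix a."""
    return [sum(col) for col in zip(*a)]


def locate_and_correct(a, row_cs, col_cs, tol=1e-9):
    """Detect, locate, and correct one corrupted element of a in place.

    row_cs and col_cs are the checksums stored before the fault.
    Returns the (row, column) of the corrected element, or None if the
    matrix still matches both sets of checksums.
    Assumption for this sketch: at most one element is corrupted.
    """
    bad_rows = [i for i, s in enumerate(row_checksums(a))
                if abs(s - row_cs[i]) > tol]
    bad_cols = [j for j, s in enumerate(column_checksums(a))
                if abs(s - col_cs[j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single corrupted element")
    i, j = bad_rows[0], bad_cols[0]
    # The row checksum mismatch equals the magnitude of the error.
    a[i][j] -= sum(a[i]) - row_cs[i]
    return (i, j)


matrix = [[4.0, 1.0, 2.0],
          [1.0, 5.0, 3.0],
          [2.0, 3.0, 6.0]]
original = [row[:] for row in matrix]

# Checksums are computed while the data is known to be good.
row_cs = row_checksums(matrix)
col_cs = column_checksums(matrix)

matrix[1][2] += 8.0  # simulated soft error in a single element
fixed_at = locate_and_correct(matrix, row_cs, col_cs)
```

With checksums in only one dimension, the same fault would be detected but could not be pinned to a single element, which is one way to read limitation (1) above.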

Original language: English
Title of host publication: Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Pages: 854-865
Number of pages: 12
ISBN (electronic): 9781538683842
DOI
Status: Published - Jul 2 2018
Event: 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018 - Dallas, United States
Duration: Nov 11 2018 to Nov 16 2018

Publication series

Name: Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018

Conference

Conference: 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Country/Territory: United States
City: Dallas
Period: 11/11/18 to 11/16/18

Bibliographical note

Publisher Copyright:
© 2018 IEEE.

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Networks and Communications
  • Hardware and Architecture
  • Theoretical Computer Science
