RAPIDS: Reconciling Availability, Accuracy, and Performance in Managing Geo-Distributed Scientific Data

Lipeng Wan, Jieyang Chen, Xin Liang, Ana Gainaru, Qian Gong, Qing Liu, Ben Whitney, Joy Arulraj, Zhengchun Liu, Ian Foster, Scott Klasky

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

In modern science, big data plays an increasingly important role. Many scientific applications, such as running simulations on supercomputers or conducting experiments on advanced instruments, produce huge amounts of data at unprecedented speed. Analyzing and understanding such big data is key for scientists to make scientific breakthroughs. However, data might become unavailable for scientists to access when outages or maintenance of the storage system occur, which severely hinders scientific discovery. To improve data availability, data duplication and erasure coding (EC) are often used. But as scientific data grows larger, these two methods can cause considerable storage and network overhead. In this paper, we propose RAPIDS, a hybrid approach that combines multigrid-based error-bounded lossy compression with erasure coding to significantly reduce the storage and network overhead required for maintaining high data availability. Our experiments show that RAPIDS reduces the storage overhead by up to 7.5x and network overhead by up to 3x to achieve the same level of availability compared to the regular EC method. We improve RAPIDS by building two models to optimize the fault tolerance configurations and data gathering strategy. We demonstrate that RAPIDS significantly improves performance when running on many CPU cores in parallel or on GPUs.

Original language: English
Title of host publication: HPDC 2023 - Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
Pages: 87-100
Number of pages: 14
ISBN (Electronic): 9798400701559
State: Published - Aug 7 2023
Event: 32nd International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2023 - Orlando, United States
Duration: Jun 16 2023 - Jun 23 2023

Publication series

Name: HPDC 2023 - Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing

Conference

Conference: 32nd International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2023
Country/Territory: United States
City: Orlando
Period: 6/16/23 - 6/23/23

Bibliographical note

Publisher Copyright:
© 2023 ACM.

Funding

This research was supported by the ECP CODAR, Sirius-2, and RAPIDS-2 projects through the Advanced Scientific Computing Research (ASCR) program of the Department of Energy, and the LDRD project through the DRD program of Oak Ridge National Laboratory. This research used resources of the Oak Ridge Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

Funders (funder number, where listed)
Michigan State University-U.S. Department of Energy (MSU-DOE) Plant Research Laboratory
Office of Science Programs: DE-AC05-00OR22725
Advanced Scientific Computing Research
Oak Ridge National Laboratory
Laboratory Directed Research and Development

    Keywords

    • data availability
    • scientific data management

    ASJC Scopus subject areas

    • Information Systems
    • Software
    • Safety, Risk, Reliability and Quality
    • Artificial Intelligence
    • Computer Networks and Communications
    • Computer Science Applications
    • Hardware and Architecture
