Abstract
Introspective sorting is a ubiquitous sorting algorithm which underlies many large scale distributed systems. Hardware-mediated soft errors can result in comparison and memory errors, and thus cause introsort to generate incorrect output, which in turn disrupts systems built upon introsort; hence, it is critical to incorporate fault tolerance capability within introsort. This paper proposes the first theoretically-sound, practical fault tolerant introsort with negligible overhead: FT-iSort. To tolerate comparison errors, we use minimal TMR protection via exploiting the properties of the effects of soft errors on introsort. This algorithm-based selective protection incurs far less overhead than nave TMR protection designed to protect an entire application. To tolerate memory errors that escape DRAM error correcting code, we propose XOR-based re-execution. We incorporate our fault tolerance method into the well-known parallel sorting implementation HykSort, and we find that fault tolerant HykSort incurs negligible overhead and obtains nearly the same scalability as unprotected HykSort.
Original language | English |
---|---|
Title of host publication | Proceedings of SC 2019 |
Subtitle of host publication | The International Conference for High Performance Computing, Networking, Storage and Analysis |
ISBN (Electronic) | 9781450362290 |
DOIs | |
State | Published - Nov 17 2019 |
Event | 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019 - Denver, United States Duration: Nov 17 2019 → Nov 22 2019 |
Publication series
Name | International Conference for High Performance Computing, Networking, Storage and Analysis, SC |
---|---|
ISSN (Print) | 2167-4329 |
ISSN (Electronic) | 2167-4337 |
Conference
Conference | 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019 |
---|---|
Country/Territory | United States |
City | Denver |
Period | 11/17/19 → 11/22/19 |
Bibliographical note
Publisher Copyright:© 2019 ACM.
Keywords
- Algorithm based fault tolerance
- Comparison errors
- Fault tolerant sorting
- Introsort
- Soft errors
ASJC Scopus subject areas
- Computer Networks and Communications
- Computer Science Applications
- Hardware and Architecture
- Software