Abstract
Large-scale cloud datacenters often experience reduced performance and service outages. Due to the inherent complexity, heterogeneity, and multitenant architecture of these datacenters, the applications (i.e., jobs and tasks) running on them are susceptible to various types of failures. In this paper, we first characterize application failures in the Google cluster trace and then propose a prediction model that forecasts the termination status of a task. We then introduce a task scheduler that dynamically reschedules tasks based on the predicted results. This proactive fault-tolerant scheduler improves system reliability and ensures timely execution of applications. Simulation results show that our scheduler substantially reduces the makespan and failure rate of tasks while also balancing load. Moreover, early prediction combined with quick scheduling adjustments improves overall resource utilization and reduces resource wastage.
Original language | English |
---|---|
Title of host publication | Proceedings - 6th IEEE International Conference on Cyber Security and Cloud Computing, CSCloud 2019 and 5th IEEE International Conference on Edge Computing and Scalable Cloud, EdgeCom 2019 |
Editors | Meikang Qiu |
Pages | 1-6 |
Number of pages | 6 |
ISBN (Electronic) | 9781728116600 |
State | Published - Jun 2019 |
Event | 6th IEEE International Conference on Cyber Security and Cloud Computing and 5th IEEE International Conference on Edge Computing and Scalable Cloud, CSCloud/EdgeCom 2019 - Paris, France Duration: Jun 21 2019 → Jun 23 2019 |
Publication series
Name | Proceedings - 6th IEEE International Conference on Cyber Security and Cloud Computing, CSCloud 2019 and 5th IEEE International Conference on Edge Computing and Scalable Cloud, EdgeCom 2019 |
---|---|
Conference
Conference | 6th IEEE International Conference on Cyber Security and Cloud Computing and 5th IEEE International Conference on Edge Computing and Scalable Cloud, CSCloud/EdgeCom 2019 |
---|---|
Country/Territory | France |
City | Paris |
Period | 6/21/19 → 6/23/19 |
Bibliographical note
Publisher Copyright: © 2019 IEEE.
Keywords
- Failure Prediction
- Fault-Tolerance
- Job and Task Scheduler
- Long Short-Term Memory Network
ASJC Scopus subject areas
- Computer Networks and Communications
- Hardware and Architecture
- Safety, Risk, Reliability and Quality