FaCS: Toward a Fault-Tolerant Cloud Scheduler Leveraging Long Short-Term Memory Network

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

3 Scopus citations

Abstract

Large-scale cloud datacenters often experience reduced performance and service outages. Due to the inherent complexity, heterogeneity, and multitenant architecture of these datacenters, applications (i.e., jobs and tasks) running on them are susceptible to various types of failures. In this paper, we first characterize the application failures in the Google cluster trace and then propose a prediction model that forecasts the termination status of a task. We then introduce a task scheduler that dynamically reschedules tasks based on the predicted results. This proactive fault-tolerant scheduler improves system reliability and ensures timely execution of the applications. Simulation results show that our scheduler substantially reduces the makespan and failure rates of tasks while also balancing load. Moreover, early prediction combined with quick scheduling adjustment improves overall resource utilization and reduces resource wastage.

Original language: English
Title of host publication: Proceedings - 6th IEEE International Conference on Cyber Security and Cloud Computing, CSCloud 2019 and 5th IEEE International Conference on Edge Computing and Scalable Cloud, EdgeCom 2019
Editors: Meikang Qiu
Pages: 1-6
Number of pages: 6
ISBN (Electronic): 9781728116600
State: Published - Jun 2019
Event: 6th IEEE International Conference on Cyber Security and Cloud Computing and 5th IEEE International Conference on Edge Computing and Scalable Cloud, CSCloud/EdgeCom 2019 - Paris, France
Duration: Jun 21 2019 to Jun 23 2019

Publication series

Name: Proceedings - 6th IEEE International Conference on Cyber Security and Cloud Computing, CSCloud 2019 and 5th IEEE International Conference on Edge Computing and Scalable Cloud, EdgeCom 2019

Conference

Conference: 6th IEEE International Conference on Cyber Security and Cloud Computing and 5th IEEE International Conference on Edge Computing and Scalable Cloud, CSCloud/EdgeCom 2019
Country/Territory: France
City: Paris
Period: 6/21/19 to 6/23/19

Bibliographical note

Publisher Copyright:
© 2019 IEEE.

Keywords

  • Failure Prediction
  • Fault-Tolerance
  • Job and Task Scheduler
  • Long Short-Term Memory Network

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Safety, Risk, Reliability and Quality
