Predicting Application Failure in Cloud: A Machine Learning Approach

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

54 Scopus citations

Abstract

Despite employing architectures designed for high service reliability and availability, cloud computing systems still experience service outages and performance slowdowns. In addition, large-scale cloud systems experience failures in their hardware and software components, which often result in node and application (e.g., job and task) failures. Therefore, to build a reliable cloud system, it is important to understand and characterize the observed failures. The goal of this work is to identify the key features that correlate with application failures in the cloud and to present a failure prediction model that can correctly predict the outcome of a task or job before it actually finishes, fails, or gets killed. To accomplish this, we perform a failure characterization study of the Google cluster workload trace. Our analysis reveals that failed and killed jobs consume a significant amount of resources. We further explore the potential for failure prediction in cloud applications so that we can reduce resource wastage by better managing the jobs and tasks that ultimately fail or get killed. To this end, we propose a prediction method based on a special type of Recurrent Neural Network (RNN), the Long Short-Term Memory (LSTM) network, to identify application failures in the cloud. The method takes resource usage measurements or performance data for each job and task and predicts its termination status (e.g., failed or finished). Our algorithm predicts task failures with 87% accuracy and achieves a true positive rate of 85% and a false positive rate of 11%.
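
The three figures reported at the end of the abstract (accuracy, true positive rate, false positive rate) all derive from the same confusion matrix. The following is a minimal sketch of that arithmetic, assuming a binary encoding in which 1 means a task failed and 0 means it finished. The function name `failure_metrics`, the `BinaryMetrics` container, and the toy labels are hypothetical and chosen for illustration only; they are not taken from the paper or from the Google trace.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BinaryMetrics:
    accuracy: float
    true_positive_rate: float
    false_positive_rate: float


def failure_metrics(actual: Sequence[int], predicted: Sequence[int]) -> BinaryMetrics:
    """Score binary predictions where 1 = task failed and 0 = task finished."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # failures caught
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # failures missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # finishes recognized
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms

    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present to compute the rates")

    return BinaryMetrics(
        accuracy=(tp + tn) / len(pairs),
        true_positive_rate=tp / positives,   # share of real failures predicted
        false_positive_rate=fp / negatives,  # share of finished tasks flagged
    )


# Toy labels invented for illustration; not data from the paper.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = failure_metrics(actual, predicted)
print(metrics)
```

Reporting the true and false positive rates alongside accuracy matters when one termination status dominates the data, because accuracy alone can look high for a model that rarely predicts the minority class.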

Original language: English
Title of host publication: Proceedings - 2017 IEEE 1st International Conference on Cognitive Computing, ICCC 2017
Editors: Paul P. Maglio, Wu Chou
Pages: 24-31
Number of pages: 8
ISBN (Electronic): 9781538620083
State: Published - Sep 7 2017
Event: 1st IEEE International Conference on Cognitive Computing, ICCC 2017 - Honolulu, United States
Duration: Jun 25 2017 to Jun 30 2017

Publication series

Name: Proceedings - 2017 IEEE 1st International Conference on Cognitive Computing, ICCC 2017

Conference

Conference: 1st IEEE International Conference on Cognitive Computing, ICCC 2017
Country/Territory: United States
City: Honolulu
Period: 6/25/17 to 6/30/17

Bibliographical note

Publisher Copyright:
© 2017 IEEE.

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Experimental and Cognitive Psychology
