Abstract
Despite employing architectures designed for high service reliability and availability, cloud computing systems do experience service outages and performance slowdowns. In addition, large-scale cloud systems experience failures in their hardware and software components, which often result in node and application (e.g., job and task) failures. Therefore, to build a reliable cloud system, it is important to understand and characterize the observed failures. The goal of this work is to identify the key features that correlate with application failures in the cloud and to present a failure prediction model that can correctly predict the outcome of a task or job before it actually finishes, fails, or gets killed. To accomplish this, we perform a failure characterization study of the Google cluster workload trace. Our analysis reveals that failed and killed jobs consume a significant amount of resources. We further explore the potential for failure prediction in cloud applications so that we can reduce the wastage of resources by better managing the jobs and tasks that ultimately fail or get killed. For this, we propose a prediction method based on a special type of Recurrent Neural Network (RNN) named the Long Short-Term Memory (LSTM) network to identify application failures in the cloud. It takes resource usage measurements or performance data for each job and task as input and predicts their termination status (e.g., failed or finished). Our algorithm can predict task failures with 87% accuracy and achieves a true positive rate of 85% and a false positive rate of 11%.
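For readers unfamiliar with the three reported metrics, the following minimal Python sketch shows how accuracy, true positive rate, and false positive rate are conventionally computed for a binary failure predictor. It assumes that "failed" is the positive class; the function name `binary_failure_metrics` and the toy labels are hypothetical and chosen only for illustration. They are not taken from the paper or from the Google cluster trace.

```python
from typing import Dict, Sequence


def binary_failure_metrics(
    actual: Sequence[bool], predicted: Sequence[bool]
) -> Dict[str, float]:
    """Compute accuracy, TPR, and FPR; True means "task failed" (positive class)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    # Confusion-matrix counts.
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    total = tp + tn + fp + fn

    return {
        # Share of all tasks whose outcome was predicted correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of truly failed tasks that were flagged as failures.
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of non-failed tasks that were wrongly flagged as failures.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Toy labels invented for illustration only (not data from the study).
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, False, True]
metrics = binary_failure_metrics(actual, predicted)
print(metrics)
```

Under these definitions, a true positive rate of 85% would mean that 85 of every 100 tasks that actually fail are flagged in advance, while a false positive rate of 11% would mean that 11 of every 100 tasks that do not fail are flagged incorrectly.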
Original language | English |
---|---|
Title of host publication | Proceedings - 2017 IEEE 1st International Conference on Cognitive Computing, ICCC 2017 |
Editors | Paul P. Maglio, Wu Chou |
Pages | 24-31 |
Number of pages | 8 |
ISBN (Electronic) | 9781538620083 |
State | Published - Sep 7 2017 |
Event | 1st IEEE International Conference on Cognitive Computing, ICCC 2017 - Honolulu, United States. Duration: Jun 25 2017 → Jun 30 2017 |
Publication series
Name | Proceedings - 2017 IEEE 1st International Conference on Cognitive Computing, ICCC 2017 |
---|
Conference
Conference | 1st IEEE International Conference on Cognitive Computing, ICCC 2017 |
---|---|
Country/Territory | United States |
City | Honolulu |
Period | 6/25/17 → 6/30/17 |
Bibliographical note
Publisher Copyright: © 2017 IEEE.
ASJC Scopus subject areas
- Computer Networks and Communications
- Computer Science Applications
- Experimental and Cognitive Psychology