Bicocca Open Archive

Denaro, G., El Moussa, N., Heydarov, R., Lomio, F., Pezzè, M., Qiu, K. (2024). Predicting Failures of Autoscaling Distributed Applications. PROCEEDINGS OF THE ACM ON SOFTWARE ENGINEERING, 1, 1960-1981 [10.1145/3660794].

Predicting Failures of Autoscaling Distributed Applications

Denaro, G.; El Moussa, N.; Heydarov, R.; Lomio, F.; Pezzè, M.; Qiu, K.

2024

Abstract

Predicting failures in production environments allows service providers to activate countermeasures that prevent harm to the users of the applications. The most successful current approaches predict failures from error states, which they identify from anomalies in time series of fixed sets of KPI values collected at runtime. They cannot handle time series of KPI sets whose size varies over time. These approaches therefore work with applications that run on statically configured sets of components and computational nodes, and do not scale up to the many popular cloud applications that exploit autoscaling. This paper proposes Preface, a novel approach to predict failures in cloud applications that exploit autoscaling. Preface augments the neural-network-based failure predictors that have been used successfully for statically configured applications with a Rectifier layer. The Rectifier handles KPI sets of highly variable size, such as those collected in cloud autoscaling applications, and reduces those KPIs to a set of rectified-KPIs of fixed size that can be fed to the neural-network predictor. The Preface Rectifier computes the rectified-KPIs as descriptive statistics of the original KPIs, for each logical component of the target application. The descriptive statistics shrink the highly variable sets of KPIs collected at different timestamps to a fixed set of values compatible with the input nodes of the neural-network failure predictor. The neural network can then reveal anomalies that correspond to error states, before they propagate to failures that harm the users of the applications. The experiments on both a commercial application and a widely used academic exemplar confirm that Preface can indeed predict many harmful failures early enough to activate proper countermeasures.
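
The rectification idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rectify`, the choice of statistics (mean, min, max, population standard deviation), the component and KPI names, and the 0.0 filler for components with no running replica are all assumptions made for illustration. The abstract only states that the rectified-KPIs are descriptive statistics computed per logical component.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List, Sequence

# Hypothetical set of descriptive statistics; the abstract does not
# say which statistics Preface actually uses.
STATS: Dict[str, Callable[[Sequence[float]], float]] = {
    "mean": mean,
    "min": min,
    "max": max,
    "std": pstdev,
}


def rectify(
    samples: Dict[str, Dict[str, List[float]]],
    components: List[str],
    kpis: List[str],
) -> List[float]:
    """Reduce per-replica KPI values to a fixed-size vector.

    samples[component][kpi] holds one value per replica running at a
    given timestamp, so its length changes as the application scales.
    The output always has len(components) * len(kpis) * len(STATS)
    entries, in a fixed order.
    """
    vector: List[float] = []
    for component in components:
        for kpi in kpis:
            values = samples.get(component, {}).get(kpi, [])
            for stat in STATS.values():
                # Assumption: use 0.0 when no replica reports this KPI.
                vector.append(float(stat(values)) if values else 0.0)
    return vector


# Two timestamps with different numbers of replicas per component.
components = ["frontend", "cart"]
kpis = ["cpu"]
t1 = {"frontend": {"cpu": [0.2, 0.4]}, "cart": {"cpu": [0.5]}}
t2 = {"frontend": {"cpu": [0.2, 0.4, 0.6, 0.8, 1.0]}, "cart": {"cpu": [0.5, 0.7]}}

v1 = rectify(t1, components, kpis)
v2 = rectify(t2, components, kpis)
print(len(v1), len(v2))
```

Both vectors have the same length even though the second timestamp reports more replicas, which is the property a fixed-input neural network needs.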

Subtype	Journal article - Scientific article
Keywords	Failure Prediction, Fault Localization, Kubernetes
Content language	English
Ahead-of-print or first online publication date	24 Dec 2024
Publication date	2024
Journal	PROCEEDINGS OF THE ACM ON SOFTWARE ENGINEERING
Volume number	1
First page	1960
Last page	1981
Article number	87
Article DOI	https://dx.doi.org/10.1145/3660794
Fulltext	open
Citation	Denaro, G., El Moussa, N., Heydarov, R., Lomio, F., Pezzè, M., Qiu, K. (2024). Predicting Failures of Autoscaling Distributed Applications. PROCEEDINGS OF THE ACM ON SOFTWARE ENGINEERING, 1, 1960-1981 [10.1145/3660794].
Appears in types	01 - Journal article

Files in this item:

File	Size	Format
Denaro-2024-Proc. ACM Softw. Eng-VoR.pdf (open access; Publisher's Version (Version of Record, VoR); licensed under a Creative Commons Attribution 4.0 International License)	1.33 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/550343
