Filippini, F., Lublinsky, B., De Bayser, M., Ardagna, D. (2023). Performance Models for Distributed Deep Learning Training Jobs on Ray. In 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA) (pp.30-35). Institute of Electrical and Electronics Engineers Inc. [10.1109/SEAA60479.2023.00014].

Performance Models for Distributed Deep Learning Training Jobs on Ray

Filippini F.;
2023

Abstract

Deep Learning applications are pervasive today, and efficient strategies are designed to reduce the computational time and resource demand of the training process. The Distributed Deep Learning (DDL) paradigm yields a significant speed-up by partitioning the training into multiple, parallel tasks. The Ray framework supports DDL applications exploiting data parallelism, enhancing scalability with minimal user effort. This work aims to evaluate the performance of DDL training applications by profiling their execution on a Ray cluster and developing Machine Learning-based models to predict the training time when changing the dataset size, the number of parallel workers, and the amount of computational resources. Such performance-prediction models are crucial to forecast computational resource usage and costs in Cloud environments. Experimental results show that our models achieve average prediction errors between 3% and 15% for both interpolation and extrapolation, thus demonstrating their applicability to unforeseen scenarios.
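To make the idea concrete, the kind of performance model described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual model or data: the profiling samples are synthetic, the ground-truth cost formula is assumed for illustration, and a simple log-linear least-squares fit stands in for the paper's Machine Learning-based models.

```python
# Hypothetical sketch (not from the paper): fit a model that predicts DDL
# training time from dataset size, number of workers, and cores per worker,
# using synthetic "profiling" data and a log-linear least-squares fit.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic profiled configurations: (dataset_size_GB, workers, cores_per_worker)
X = rng.uniform([1, 1, 1], [50, 16, 8], size=(200, 3))

# Assumed ground-truth cost: a parallelizable term that shrinks with total
# cores, a serial term that grows with data, and multiplicative noise.
y = (30 * X[:, 0] / (X[:, 1] * X[:, 2]) + 5 * X[:, 0]) \
    * np.exp(rng.normal(0, 0.05, 200))

# Log-linear features approximate the multiplicative scaling behaviour.
F = np.column_stack([np.ones(len(X)), np.log(X)])
coef, *_ = np.linalg.lstsq(F, np.log(y), rcond=None)

def predict(size_gb, workers, cores):
    """Predicted training time for an unseen configuration."""
    f = np.array([1.0, np.log(size_gb), np.log(workers), np.log(cores)])
    return float(np.exp(f @ coef))

# Interpolation-style query inside the profiled range.
print(predict(20, 8, 4))
```

In the same spirit as the paper's evaluation, such a model can then be queried at configurations inside the profiled range (interpolation) or beyond it (extrapolation), e.g. more workers than were ever profiled.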
Type: paper
Keywords: Distributed training; Performance models; Ray
Language: English
Conference: 49th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2023, 06-08 September 2023
Proceedings: 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)
ISBN: 9798350342352
Year: 2023
Pages: 30-35
Access: open
Files in this product:
Filippini et al-2023-SEAA-AAM.pdf

open access

Attachment type: Author's Accepted Manuscript, AAM (Post-print)
License: Publisher-specific open access license
Size: 6.11 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/601085
Citations
  • Scopus 2