Artificial Intelligence (AI) and Deep Learning (DL) algorithms are currently applied to a wide range of products and solutions. DL training jobs are highly resource-demanding and benefit greatly from AI accelerators (e.g., GPUs). However, the effective management of GPU-powered clusters comes with great challenges. Among these, efficient scheduling and resource allocation solutions are crucial to maximize performance and minimize data center operational costs. In this paper we propose ANDREAS, an advanced scheduling solution that tackles these problems jointly, aiming at optimizing DL training runtime workloads and their energy consumption in accelerated clusters. Simulation-based experiments demonstrate that we can achieve a cost reduction between 30% and 62% on average with respect to first-principle methods, while validation on a real cluster shows a worst-case deviation below 13% between actual and predicted costs, proving the effectiveness of the ANDREAS solution in practical scenarios.

Filippini, F., Ardagna, D., Lattuada, M., Amaldi, E., Riedl, M., Materka, K., et al. (2021). ANDREAS: Artificial intelligence traiNing scheDuler for accElerAted resource clusterS. In Proceedings - 2021 International Conference on Future Internet of Things and Cloud, FiCloud 2021 (pp.388-393). Institute of Electrical and Electronics Engineers Inc. [10.1109/FiCloud49777.2021.00063].

ANDREAS: Artificial intelligence traiNing scheDuler for accElerAted resource clusterS

Ciavotta M.;
2021

Yes
paper
Deep Learning; Energy-aware hardware platforms; Scheduling;
English
8th International Conference on Future Internet of Things and Cloud, FiCloud 2021 - 23 August 2021 through 25 August 2021
978-1-6654-2574-2
Filippini, F; Ardagna, D; Lattuada, M; Amaldi, E; Riedl, M; Materka, K; Skrzypek, P; Ciavotta, M; Magugliani, F; Cicala, M

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/395676
Citations
  • Scopus 0