The error rate of current High Performance Computing (HPC) systems is already on the order of one error every few dozen hours. Understanding the reliability behavior of HPC applications will be required for the next generation of supercomputers. Knowing this behavior, one can select efficient mitigation techniques for an application and fine-tune parameters such as checkpoint frequency. In this paper, we investigate the use of a machine learning model to predict the reliability behavior of HPC applications. We inject faults into more than 30 HPC applications running on the Intel Xeon Phi Knights Landing (KNL) and use profiling information to build a predictive model with Support Vector Machines (SVM). We show that the model can predict the Program Vulnerability Factor (PVF) with an average relative error of 7% for certain classes of algorithms, such as linear algebra and sorting. The average relative error across all algorithm classes is 22%. Such a fast and straightforward prediction model can serve as an effective filter for selecting the most unreliable applications for in-depth analysis.
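The two quantities named in the abstract, a per-application PVF and the average relative error of its prediction, can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: PVF is approximated as the fraction of injected faults that were not masked, the outcome labels (`"masked"`, `"sdc"`, `"crash"`) are hypothetical, and the sample values are made up for illustration only. It does not reproduce the paper's SVM model or its data.

```python
from statistics import mean


def estimate_pvf(outcomes):
    """Approximate PVF as the share of injected faults that were not masked.

    Assumption: each outcome is a label such as "masked", "sdc", or "crash".
    These labels are hypothetical and not taken from the paper.
    """
    if not outcomes:
        raise ValueError("need at least one fault-injection outcome")
    harmful = sum(1 for outcome in outcomes if outcome != "masked")
    return harmful / len(outcomes)


def average_relative_error(measured, predicted):
    """Mean of |predicted - measured| / measured over all applications."""
    if not measured or len(measured) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(p - m) / m for m, p in zip(measured, predicted))


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    campaign = ["masked", "sdc", "crash", "masked"]
    print("estimated PVF:", estimate_pvf(campaign))

    measured_pvf = [0.20, 0.50, 0.10]
    predicted_pvf = [0.22, 0.45, 0.12]
    print("average relative error:",
          average_relative_error(measured_pvf, predicted_pvf))
```

In such a setup, the measured values would come from fault-injection campaigns and the predicted values from whatever regression model is trained on the profiling data. The relative error is undefined for an application whose measured PVF is zero, so such cases would need separate handling.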

Predicting the Reliability Behavior of HPC Applications / Oliveira, D.; Moreira, F. B.; Rech, P.; Navaux, P. - (2019), pp. 124-131. (Paper presented at the 30th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2018, held at École Normale Supérieure de Lyon, France, in 2018) [10.1109/CAHPC.2018.8645856].

Predicting the Reliability Behavior of HPC Applications

Rech P.;
2019-01-01

2019
Proceedings - 2018 30th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2018
United States
Institute of Electrical and Electronics Engineers Inc.
978-1-5386-7769-8
Oliveira, D.; Moreira, F. B.; Rech, P.; Navaux, P.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/403746