Predicting the Reliability Behavior of HPC Applications

The error rate of current High Performance Computing (HPC) systems is already on the order of one error every few dozen hours. Understanding the reliability behavior of HPC applications will therefore be required for the next generation of supercomputers. Knowing this behavior, one can select efficient mitigation techniques for an application and fine-tune parameters such as checkpoint frequency. In this paper, we investigate the use of a machine learning model to predict the reliability behavior of HPC applications. We inject faults into more than 30 HPC applications executing on the Intel Xeon Phi Knights Landing (KNL) and use profiling information to build a predictive model with Support Vector Machines (SVM). We show that the model can predict the Program Vulnerability Factor (PVF) with an average relative error of 7% for certain classes of algorithms, such as linear algebra and sorting. The average relative error across all algorithm classes is 22%. Such a fast and straightforward prediction model can be effective as a filter for selecting the most unreliable applications for in-depth analysis.
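The "average relative error" figures quoted above can be read with a small sketch. The abstract does not spell out the exact formula, so the code below assumes the usual definition, the mean of |predicted - measured| / measured over all applications. The function name and the PVF values are hypothetical and are not taken from the paper.

```python
def mean_relative_error(measured, predicted):
    """Mean of |predicted - measured| / measured over paired values.

    Assumed definition; the abstract does not state the paper's exact formula.
    A measured value of zero raises ZeroDivisionError.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - m) / m for m, p in zip(measured, predicted))
    return total / len(measured)


# Hypothetical PVF values in [0, 1], made up for illustration only.
measured_pvf = [0.20, 0.50, 0.40]   # e.g. obtained from fault injection
predicted_pvf = [0.22, 0.45, 0.40]  # e.g. produced by a trained model

error = mean_relative_error(measured_pvf, predicted_pvf)
print(f"average relative error: {error:.1%}")
```

Under this definition, an average relative error of 7% means the predicted PVF deviates from the measured PVF by 7% of the measured value on average, not by 7 percentage points.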

Predicting the Reliability Behavior of HPC Applications / Oliveira, D.; Moreira, F. B.; Rech, P.; Navaux, P. - (2019), pp. 124-131. (Paper presented at the 30th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2018, held at Ecole Normale Supérieure of Lyon, France, in 2018) [10.1109/CAHPC.2018.8645856].

Oliveira, D.; Moreira, F. B.; Rech, P.; Navaux, P.

2019-01-01

Date of publication: 2019
Proceedings title: Proceedings - 2018 30th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2018
Place of publication: United States
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN: 978-1-5386-7769-8
Scopus Identifier: 2-s2.0-85063158531
All authors: Oliveira, D.; Moreira, F. B.; Rech, P.; Navaux, P.
Use this identifier to cite or link to this document: https://hdl.handle.net/11572/403746
