Experimental and Analytical Study of Xeon Phi Reliability

Oliveira, D.; Pilla, L.; De Bardeleben, N.; Blanchard, S.; Quinn, H.; Koren, I.; Navaux, P.; Rech, P.

doi:10.1145/3126908.3126960

We present an in-depth analysis of the effects of transient faults on HPC applications running on Intel Xeon Phi processors, based on radiation experiments and high-level fault injection. Besides measuring realistic error rates for the Xeon Phi, we quantify Silent Data Corruptions (SDCs) by correlating the distribution of corrupted elements in the output with the application's characteristics. We evaluate the benefits of imprecise computing for reducing the programs' error rates. For example, for HotSpot, a 0.5% tolerance in the output value reduces the error rate by 85%. We inject different fault models to analyze the sensitivity of given applications. We show that portions of applications can be graded by criticality. For example, faults occurring in the middle of LUD execution, or in the Sort and Tree portions of CLAMR, are more critical than faults in the remaining portions. Mitigation techniques can then be relaxed or hardened according to the criticality of each portion.
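
The tolerance-based evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that an output element counts as corrupted only when its relative deviation from a fault-free (golden) value exceeds a tolerance. The function name `count_corrupted`, the sample values, and the relative-error criterion are illustrative assumptions, not the authors' code or data.

```python
def count_corrupted(golden, observed, tolerance=0.0):
    """Count output elements whose relative error exceeds `tolerance`.

    Hypothetical helper: the relative-error criterion is an assumption
    made for illustration, not the paper's exact definition.
    """
    if len(golden) != len(observed):
        raise ValueError("outputs must have the same length")
    corrupted = 0
    for g, o in zip(golden, observed):
        if g == o:
            continue  # identical element, never counted as corrupted
        # Relative deviation; fall back to absolute deviation when golden is 0.
        deviation = abs(o - g) / abs(g) if g != 0 else abs(o - g)
        if deviation > tolerance:
            corrupted += 1
    return corrupted


# Made-up values, for illustration only.
golden = [100.0, 200.0, 300.0, 400.0]
observed = [100.0, 200.4, 300.0, 440.0]

strict = count_corrupted(golden, observed)          # any mismatch counts
relaxed = count_corrupted(golden, observed, 0.005)  # 0.5% tolerance
```

Under these assumptions, a run would be classified as an SDC only if at least one element is counted as corrupted, so raising the tolerance can only keep or lower the number of runs flagged as errors.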

Experimental and Analytical Study of Xeon Phi Reliability / Oliveira, D.; Pilla, L.; De Bardeleben, N.; Blanchard, S.; Quinn, H.; Koren, I.; Navaux, P.; Rech, P. - (2017), pp. 1-12. (International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017, Denver, CO, USA, 12-17 November 2017) [10.1145/3126908.3126960].

2017-01-01

Date of publication: 2017
Proceedings title: SC17 International Conference for High Performance Computing, Networking, Storage and Analysis
Place of publication: New York, NY, USA
Publisher: ACM Association for Computing Machinery
ISBN: 9781450351140
Scopus identifier: 2-s2.0-85040175401
WOS identifier: WOS:000458161700028
Appears in types: 04.1 Paper in Proceedings

Files in this item:

File	Size	Format
sc17_3126908.3126960.pdf (open access; Description: Research article; Type: Publisher's version (publisher's layout); License: All rights reserved)	673.13 kB	Adobe PDF