We present an in-depth analysis of the effects of transient faults on HPC applications running on Intel Xeon Phi processors, based on radiation experiments and high-level fault injection. Besides measuring the realistic error rates of the Xeon Phi, we quantify Silent Data Corruptions (SDCs) by correlating the distribution of corrupted elements in the output with the application's characteristics. We evaluate the benefits of imprecise computing for reducing the programs' error rate. For example, for HotSpot, a 0.5% tolerance in the output value reduces the error rate by 85%. We inject different fault models to analyze the sensitivity of given applications. We show that portions of applications can be graded by criticality. For example, faults occurring in the middle of LUD execution, or in the Sort and Tree portions of CLAMR, are more critical than faults in the remaining portions. Mitigation techniques can then be relaxed or hardened based on the criticality of each portion.
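The tolerance idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `count_corrupted` and `is_sdc`, the relative-error definition, and the sample values are all assumptions made for illustration. The sketch compares a fault-free ("golden") output with an observed output and counts only the elements whose relative error exceeds a chosen tolerance.

```python
def count_corrupted(golden, observed, rel_tolerance=0.0):
    """Count output elements whose relative error exceeds rel_tolerance.

    Hypothetical helper: relative error is |observed - golden| / |golden|,
    with a denominator of 1.0 when the golden value is zero (an assumption).
    """
    corrupted = 0
    for g, o in zip(golden, observed):
        if g == o:
            continue  # bit-identical elements are never counted
        denom = abs(g) if g != 0 else 1.0
        if abs(o - g) / denom > rel_tolerance:
            corrupted += 1
    return corrupted


def is_sdc(golden, observed, rel_tolerance=0.0):
    """Treat a run as an SDC if any element is outside the tolerance."""
    return count_corrupted(golden, observed, rel_tolerance) > 0


# Made-up sample values, not data from the paper.
golden = [100.0, 200.0, 300.0]
observed = [100.0, 200.5, 330.0]

strict = count_corrupted(golden, observed)            # exact comparison
relaxed = count_corrupted(golden, observed, 0.005)    # 0.5% tolerance
```

With a tolerance of zero, every mismatching element counts as corrupted. Raising the tolerance to 0.5% drops the element that is off by 0.25% and keeps the one that is off by 10%, which is the mechanism by which an accepted imprecision can lower the measured error rate.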

Experimental and analytical study of Xeon Phi reliability / Oliveira, D.; Pilla, L.; De Bardeleben, N.; Blanchard, S.; Quinn, H.; Koren, I.; Navaux, P.; Rech, P. - (2017), pp. 1-12. (Paper presented at the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017, held in the USA in 2017) [10.1145/3126908.3126960].

Experimental and analytical study of Xeon Phi reliability

Rech P.
2017-01-01

2017
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017
United States
Association for Computing Machinery, Inc
9781450351140
Oliveira, D.; Pilla, L.; De Bardeleben, N.; Blanchard, S.; Quinn, H.; Koren, I.; Navaux, P.; Rech, P.
Use this identifier to cite or link to this document: https://hdl.handle.net/11572/403751