In this paper, we evaluate the criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as far as imprecise computing is concerned, simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on an application's output by correlating the number of corrupted elements with their spatial locality. We also provide the mean relative error (dataset-wise) to evaluate the magnitude of radiation-induced errors. We apply the selected metrics to experimental results obtained in various radiation test campaigns, for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while the Xeon Phi is more reliable when executing particle interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.
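As an illustration of the kind of metrics the abstract names (number of corrupted elements, their spatial locality, and a dataset-wise mean relative error), the following is a minimal sketch and not the authors' implementation. The function name `criticality_metrics`, the `tolerance` parameter, the locality labels ("single", "line", "scattered"), and the choice to average the relative error over every output element are all assumptions made for this example; the paper itself should be consulted for the exact definitions.

```python
import numpy as np


def criticality_metrics(golden, observed, tolerance=0.0):
    """Compare an observed output against a fault-free (golden) output.

    Hypothetical helper: names, locality classes, and the averaging
    convention are assumptions for illustration only.
    """
    golden = np.asarray(golden, dtype=float)
    observed = np.asarray(observed, dtype=float)

    # Relative error per element. Where the golden value is zero we fall
    # back to the absolute error (an assumption, to avoid dividing by zero).
    denom = np.where(golden != 0, np.abs(golden), 1.0)
    rel_err = np.abs(observed - golden) / denom

    # An element counts as corrupted only if its error exceeds the tolerance,
    # which models an application that accepts imprecise results.
    corrupted = rel_err > tolerance
    count = int(corrupted.sum())
    if count == 0:
        return {"corrupted": 0, "locality": "none", "mean_relative_error": 0.0}

    # Spatial locality from the bounding box of the corrupted elements.
    idx = np.argwhere(corrupted)
    spans = idx.max(axis=0) - idx.min(axis=0) + 1
    axes_spanned = int((spans > 1).sum())
    if count == 1:
        locality = "single"
    elif axes_spanned == 1:
        locality = "line"       # confined to one row, column, or similar
    else:
        locality = "scattered"  # spread over more than one dimension

    # Mean relative error over the whole output; elements within tolerance
    # contribute zero (one possible reading of "dataset-wise").
    mean_rel = float(np.where(corrupted, rel_err, 0.0).mean())
    return {"corrupted": count, "locality": locality,
            "mean_relative_error": mean_rel}


if __name__ == "__main__":
    golden = np.ones((4, 4))
    observed = golden.copy()
    observed[1, 0] = 1.5   # two injected errors in the same row
    observed[1, 2] = 2.0
    print(criticality_metrics(golden, observed))
    print(criticality_metrics(golden, observed, tolerance=0.6))
```

Raising `tolerance` shows why a plain mismatch count can overstate criticality: an output that differs from the golden copy may still be acceptable when the application tolerates small relative errors.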

Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators / Oliveira, D. A. G. D.; Pilla, L. L.; Hanzich, M.; Fratin, V.; Fernandes, F.; Lunardi, C.; Cela, J. M.; Navaux, P. O. A.; Carro, L.; Rech, P. - (2017), pp. 577-588. (Paper presented at the 23rd IEEE Symposium on High Performance Computer Architecture, HPCA 2017, held in the USA in 2017) [10.1109/HPCA.2017.41].

Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

Rech P.
2017-01-01

2017
Proceedings - International Symposium on High-Performance Computer Architecture
United States
IEEE Computer Society
978-1-5090-4985-1
Oliveira, D. A. G. D.; Pilla, L. L.; Hanzich, M.; Fratin, V.; Fernandes, F.; Lunardi, C.; Cela, J. M.; Navaux, P. O. A.; Carro, L.; Rech, P.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/403748