In this paper, we evaluate the criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, where imprecise computing is concerned, simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on application output by correlating the number of corrupted elements with their spatial locality. We also provide the dataset-wise mean relative error to evaluate the magnitude of radiation-induced errors. We apply the selected metrics to experimental results obtained in several radiation test campaigns totaling more than 400 hours of beam time per device. The amount of data gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw general reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical on the K40, while the Xeon Phi is more reliable when executing particle interactions solved through Finite Difference Methods. Finally, iterative stencil operations appear to be the most reliable on both architectures.
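The metrics named in the abstract (mismatch detection, number of corrupted elements, spatial locality, and mean relative error) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the helper name `criticality_metrics`, the `tolerance` parameter, the use of touched rows and columns as a rough stand-in for spatial locality, and the choice to average the relative error over every element of a 2D output.

```python
def criticality_metrics(golden, observed, tolerance=0.0):
    """Compare a 2D output with a fault-free reference (hypothetical helper)."""
    mismatch = False      # True if any element differs at all from the reference
    corrupted = []        # (row, col) positions whose relative error exceeds tolerance
    total_rel_err = 0.0
    n_elements = 0
    for i, (g_row, o_row) in enumerate(zip(golden, observed)):
        for j, (g, o) in enumerate(zip(g_row, o_row)):
            if o != g:
                mismatch = True
            # Relative error; fall back to the absolute error when the reference is zero.
            rel_err = abs(o - g) / abs(g) if g != 0 else abs(o - g)
            total_rel_err += rel_err
            n_elements += 1
            if rel_err > tolerance:
                corrupted.append((i, j))
    return {
        "mismatch": mismatch,
        "corrupted_elements": len(corrupted),
        # Rough locality hint (assumption): how many rows and columns are affected.
        "rows_touched": len({i for i, _ in corrupted}),
        "cols_touched": len({j for _, j in corrupted}),
        "mean_relative_error": total_rel_err / n_elements if n_elements else 0.0,
    }


# Toy data: one element of a 2x2 output is off by 25%.
golden = [[1.0, 2.0], [4.0, 8.0]]
observed = [[1.0, 2.0], [5.0, 8.0]]
metrics = criticality_metrics(golden, observed)
relaxed = criticality_metrics(golden, observed, tolerance=0.3)
```

In this toy case both calls report a mismatch, but with a 30% tolerance no element counts as corrupted, which is the kind of distinction plain mismatch detection cannot make.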

Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators / Oliveira, D.A.G.D., Pilla, L.L., Hanzich, M., Fratin, V., Fernandes, F., Lunardi, C., Cela, J.M., Navaux, P.O.A., Carro, L., Rech, P. - (2017), pp. 577-588. (23rd IEEE Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, 04-08 February 2017) [10.1109/HPCA.2017.41].

Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

Rech P.
2017-01-01

Year: 2017
Published in: 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)
Place of publication: New York, USA
Publisher: IEEE Institute of Electrical and Electronics Engineers Inc.
ISBN: 978-1-5090-4985-1
Authors: Oliveira, D. A. G. D.; Pilla, L. L.; Hanzich, M.; Fratin, V.; Fernandes, F.; Lunardi, C.; Cela, J. M.; Navaux, P. O. A.; Carro, L.; Rech, P.
Files in this item:
File: HPCA_Radiation-Induced_Error_Criticality_in_Modern_HPC_Parallel_Accelerators.pdf
Access: open access
Description: HPCA 2017 - conference paper
Type: Publisher's version (Publisher's layout)
License: All rights reserved
Size: 746.58 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/403748
Citations
  • Scopus 36
  • Web of Science (ISI) 25
  • OpenAlex 39