Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High-Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift led to a burst in the GPU's computing capabilities and efficiency, significant improvements in the programming frameworks and performance evaluation tools, and a concern about their hardware reliability. In this paper, we compare and combine high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, extensive architectural-level fault simulations that required more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Our main goal is to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results that can be used to predict the failure rates of workloads running on GPUs. We show that, in most cases, fault simulation-based prediction for silent data corruptions is sufficiently close (differences lower than 5×) to the experimentally measured rates. We also analyze the reliability of some of the main GPU functional units (including mixed-precision and tensor cores). We find that the way GPU resources are instantiated plays a critical role in the overall system reliability and that faults outside the functional units generate most detectable errors.

Demystifying GPU reliability: Comparing and combining beam experiments, fault simulation, and profiling / Santos, F. F. D.; Hari, S. K. S.; Basso, P. M.; Carro, L.; Rech, P. - (2021), pp. 289-298. (Paper presented at the 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, held in the USA in 2021) [10.1109/IPDPS49936.2021.00037].

Demystifying GPU reliability: Comparing and combining beam experiments, fault simulation, and profiling

Rech P.
2021-01-01

Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021
USA
Institute of Electrical and Electronics Engineers Inc.
978-1-6654-4066-0
Santos, F. F. D.; Hari, S. K. S.; Basso, P. M.; Carro, L.; Rech, P.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/346713