Demystifying GPU reliability: Comparing and combining beam experiments, fault simulation, and profiling / dos Santos, Fernando Fernandes; Hari, Siva Kumar Sastry; Basso, Pedro Martins; Carro, Luigi; Rech, Paolo. - (2021), pp. 289-298. (Paper presented at the 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, held as a Virtual Event, 17th–21st May 2021) [10.1109/IPDPS49936.2021.00037].

Demystifying GPU reliability: Comparing and combining beam experiments, fault simulation, and profiling

Rech, Paolo
Last author
2021-01-01

Abstract

Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High-Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift led to a burst in the GPU's computing capabilities and efficiency, significant improvements in the programming frameworks and performance evaluation tools, and a concern about their hardware reliability. In this paper, we compare and combine high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, extensive architectural-level fault simulations that required more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Our main goal is to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results that can be used to predict the failure rates of workloads running on GPUs. We show that, in most cases, fault simulation-based prediction for silent data corruptions is sufficiently close (differences lower than 5×) to the experimentally measured rates. We also analyze the reliability of some of the main GPU functional units (including mixed-precision and tensor cores). We find that the way GPU resources are instantiated plays a critical role in the overall system reliability and that faults outside the functional units generate most detectable errors.
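To make the "differences lower than 5×" criterion concrete, the following is a minimal Python sketch of one way such a comparison could be expressed. It assumes the difference is a symmetric ratio between a predicted and a measured silent data corruption rate; the abstract does not state how the difference is computed, so this is an assumption. The function names, workload names, and rate values are all hypothetical and are not taken from the paper.

```python
def fold_difference(predicted: float, measured: float) -> float:
    """Symmetric ratio (always >= 1) between two positive failure rates."""
    if predicted <= 0 or measured <= 0:
        raise ValueError("rates must be positive")
    return max(predicted / measured, measured / predicted)


def within_factor(predicted: float, measured: float, factor: float = 5.0) -> bool:
    """True when the two rates differ by less than `factor` in either direction."""
    return fold_difference(predicted, measured) < factor


# Hypothetical rates in arbitrary units: (predicted, measured).
# These numbers are illustrative only and do not come from the paper.
rates = {
    "workload_a": (120.0, 300.0),
    "workload_b": (50.0, 400.0),
}

for name, (pred, meas) in rates.items():
    print(name, fold_difference(pred, meas), within_factor(pred, meas))
```

Taking the larger of the two ratios treats over-prediction and under-prediction alike, which is one reasonable reading of a bound stated as a single multiplicative factor.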
2021
2021 IEEE 35th International Parallel and Distributed Processing Symposium Proceedings
Piscataway, NJ
Institute of Electrical and Electronics Engineers Inc.
978-1-6654-4066-0
dos Santos, Fernando Fernandes; Hari, Siva Kumar Sastry; Basso, Pedro Martins; Carro, Luigi; Rech, Paolo
Files in this item:
File: IPDPS-NVIDIA-Demystifying_GPU_Reliability_Comparing_and_Combining_Beam_Experiments_Fault_Simulation_and_Profiling.pdf
Access: Archive managers only
Type: Publisher's version (Publisher's layout)
License: All rights reserved
Size: 329.77 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/346713
Citations
  • Scopus 15