Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High-Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift has led to rapid growth in GPU computing capabilities and efficiency, significant improvements in programming frameworks and performance evaluation tools, and growing concern about GPU hardware reliability. In this paper, we compare and combine high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, extensive architectural-level fault simulations that required more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Our main goal is to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results that can be used to predict the failure rates of workloads running on GPUs. We show that, in most cases, fault-simulation-based predictions of silent data corruption rates are sufficiently close (differences lower than 5×) to the experimentally measured rates. We also analyze the reliability of some of the main GPU functional units (including mixed-precision and tensor cores). We find that the way GPU resources are instantiated plays a critical role in overall system reliability and that faults outside the functional units generate most detectable errors.
Demystifying GPU reliability: Comparing and combining beam experiments, fault simulation, and profiling / Santos, F. F. D.; Hari, S. K. S.; Basso, P. M.; Carro, L.; Rech, P. - (2021), pp. 289-298. (Paper presented at the 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, held in the USA in 2021) [10.1109/IPDPS49936.2021.00037].
Title: | Demystifying GPU reliability: Comparing and combining beam experiments, fault simulation, and profiling | |
Authors: | Santos, F. F. D.; Hari, S. K. S.; Basso, P. M.; Carro, L.; Rech, P. | |
Title of the volume containing the paper: | Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021 | |
Place of publication: | USA | |
Publisher: | Institute of Electrical and Electronics Engineers Inc. | |
Year of publication: | 2021 | |
Scopus identifier: | 2-s2.0-85113562135 | |
ISBN: | 978-1-6654-4066-0 | |
Handle: | http://hdl.handle.net/11572/346713 | |
Appears in categories: | 04.1 Paper in Proceedings |