Demystifying GPU reliability: Comparing and combining beam experiments, fault simulation, and profiling / Santos, F. F. D.; Hari, S. K. S.; Basso, P. M.; Carro, L.; Rech, P. - (2021), pp. 289-298. (Paper presented at the 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, held in the USA in 2021) [10.1109/IPDPS49936.2021.00037].
Demystifying GPU reliability: Comparing and combining beam experiments, fault simulation, and profiling
Rech P.
2021-01-01
Abstract
Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High-Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift led to a surge in GPUs' computing capabilities and efficiency, significant improvements in programming frameworks and performance evaluation tools, and growing concern about their hardware reliability. In this paper, we compare and combine high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, extensive architectural-level fault simulations that required more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Our main goal is to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results that can be used to predict the failure rates of workloads running on GPUs. We show that, in most cases, the fault simulation-based prediction of silent data corruption rates is sufficiently close (differences lower than 5×) to the experimentally measured rates. We also analyze the reliability of some of the main GPU functional units (including mixed-precision and tensor cores). We find that the way GPU resources are instantiated plays a critical role in the overall system reliability and that faults outside the functional units generate most of the detectable errors.
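A minimal sketch of the kind of comparison the abstract describes, under an assumed composition model (the exact model is not given here): a workload's predicted SDC FIT rate is built from beam-derived raw FIT rates per GPU resource, profiled resource utilization, and the SDC probability (AVF) measured by fault injection with SASSIFI/NVBitFI, and the prediction is then checked against the beam-measured rate for agreement within a factor of 5. All resource names and numbers below are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's exact model): combine beam-derived raw FIT,
# profiled utilization, and injection-derived SDC probability (AVF) to predict a
# workload's SDC FIT rate. All names and values are hypothetical placeholders.

raw_fit = {            # raw FIT per resource from beam experiments [failures / 1e9 device-hours]
    "register_file": 120.0,
    "fp32_pipeline": 80.0,
    "tensor_core":   60.0,
}
utilization = {        # fraction of each resource the workload uses (application profiling)
    "register_file": 0.40,
    "fp32_pipeline": 0.75,
    "tensor_core":   0.20,
}
avf_sdc = {            # probability an injected fault becomes an SDC (SASSIFI/NVBitFI)
    "register_file": 0.15,
    "fp32_pipeline": 0.30,
    "tensor_core":   0.25,
}

# Predicted SDC FIT = sum over resources of raw FIT x utilization x AVF.
predicted_sdc_fit = sum(raw_fit[r] * utilization[r] * avf_sdc[r] for r in raw_fit)

measured_sdc_fit = 35.0   # hypothetical beam-measured SDC FIT for the same workload

# Agreement criterion used in the abstract: prediction and measurement within 5x.
ratio = max(predicted_sdc_fit, measured_sdc_fit) / min(predicted_sdc_fit, measured_sdc_fit)
print(f"predicted={predicted_sdc_fit:.1f} FIT, measured={measured_sdc_fit:.1f} FIT, "
      f"within 5x: {ratio < 5}")
```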