Selective Fault Tolerance for Register Files of Graphics Processing Units

The high computing efficiency of graphics processing units (GPUs) makes them attractive for both high-performance computing and safety-critical applications, such as automotive and aerospace systems. In both application domains, reliability is a major concern. This paper provides guidelines for improving the reliability of the GPU register file without jeopardizing the device's computing efficiency. We advance the knowledge of GPU reliability by investigating register file criticality, which is the probability that a fault in a register propagates and affects the computation. We then propose and validate selective fault-tolerance techniques for the GPU register file that can be applied at the hardware or software level. Results show that both implementations are well suited to detecting faults that affect the computation. However, although hardware-implemented techniques are able to detect faults that trigger a crash, software-implemented techniques may not provide sufficient coverage for crashes.
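As an illustration of the criticality notion described in the abstract, the sketch below estimates per-register criticality as the fraction of injected faults that were not masked, then picks the most critical registers to protect. It is a minimal sketch under stated assumptions, not the authors' method: the outcome labels (`masked`, `sdc`, `crash`), the register names, the campaign data, and the `budget` parameter are all hypothetical.

```python
from collections import Counter


def register_criticality(outcomes):
    """Estimate criticality per register from fault-injection outcomes.

    `outcomes` is an iterable of (register, outcome) pairs, where outcome
    is one of the hypothetical labels "masked", "sdc", or "crash".
    Criticality is the fraction of injected faults that were not masked.
    """
    injected = Counter()
    propagated = Counter()
    for reg, outcome in outcomes:
        injected[reg] += 1
        if outcome != "masked":
            propagated[reg] += 1
    return {reg: propagated[reg] / injected[reg] for reg in injected}


def select_registers(criticality, budget):
    """Return the `budget` most critical registers (ties broken by name)."""
    ranked = sorted(criticality, key=lambda r: (-criticality[r], r))
    return ranked[:budget]


# Hypothetical campaign: four injections per register (illustrative data only).
campaign = [
    ("R0", "sdc"), ("R0", "crash"), ("R0", "masked"), ("R0", "sdc"),
    ("R1", "masked"), ("R1", "masked"), ("R1", "masked"), ("R1", "sdc"),
    ("R2", "crash"), ("R2", "crash"), ("R2", "masked"), ("R2", "masked"),
]

crit = register_criticality(campaign)
protected = select_registers(crit, budget=2)
```

With this illustrative data, `R0` and `R2` would be selected for protection; a real selection would depend on the measured criticality and on the overhead each protected register adds.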

Selective Fault Tolerance for Register Files of Graphics Processing Units / Goncalves, M.; Fernandes, F.; Lamb, I.; Rech, P.; Azambuja, J. R.. - In: IEEE TRANSACTIONS ON NUCLEAR SCIENCE. - ISSN 0018-9499. - 66:7(2019), pp. 1449-1456. [10.1109/TNS.2019.2903027]

Goncalves M.; Fernandes F.; Lamb I.; Rech P.; Azambuja J. R.

2019-01-01

Date of publication: 2019
Journal title: IEEE TRANSACTIONS ON NUCLEAR SCIENCE
Issue number and part: 7
DOI: https://dx.doi.org/10.1109/TNS.2019.2903027
Scopus identifier: 2-s2.0-85069465931
WOS identifier: WOS:000476782600014
All authors: Goncalves, M.; Fernandes, F.; Lamb, I.; Rech, P.; Azambuja, J. R.
Appears in types: 03.1 Journal article

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/346729
