Selective Fault Tolerance for Register Files of Graphics Processing Units

Rech P.;
2019-01-01

Abstract

The high computing efficiency of graphics processing units (GPUs) makes them attractive for both high-performance computing and safety-critical applications, such as automotive and aerospace systems. In both application domains, reliability is a major concern. This paper aims to provide guidelines for improving the reliability of GPU register files without jeopardizing the device's computing efficiency. We advance the knowledge of GPU reliability by investigating register file criticality, defined as the probability that a fault in a register propagates and affects the computation. We then propose and validate selective fault-tolerance techniques for GPU register files that can be applied at the hardware or software level. Results show that both implementations are well suited to detecting faults that affect the computation. However, while hardware-implemented techniques are able to detect faults that trigger a crash, software-implemented techniques may not guarantee sufficient coverage for crashes.
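As a rough illustration of the criticality notion defined in the abstract, the sketch below estimates, for a toy four-register program, the fraction of single-bit flips in each register that change the output, and then selects only the most critical registers for protection. This record does not describe the paper's actual fault-injection setup, so everything here is an assumption made for illustration: `toy_kernel`, `estimate_criticality`, `select_for_protection`, the 32-bit register width, the injection point (before execution only), and the 0.5 threshold are all hypothetical. Crashes are not modeled.

```python
import random


def toy_kernel(regs):
    """Toy 'program' over four 32-bit registers (purely illustrative)."""
    r0, r1, r2, r3 = regs
    r2 = r0 + (r1 & 0xFF)  # r2's initial value is overwritten before any use
    return r2              # r3 is never read


def estimate_criticality(kernel, regs, trials=200, seed=0):
    """Per register, the fraction of single-bit flips that change the output."""
    rng = random.Random(seed)
    golden = kernel(list(regs))  # fault-free reference output
    criticality = []
    for i in range(len(regs)):
        corrupted = 0
        for _ in range(trials):
            faulty = list(regs)
            faulty[i] ^= 1 << rng.randrange(32)  # flip one random bit
            if kernel(faulty) != golden:
                corrupted += 1
        criticality.append(corrupted / trials)
    return criticality


def select_for_protection(criticality, threshold=0.5):
    """Indices of registers whose estimated criticality meets the threshold."""
    return [i for i, c in enumerate(criticality) if c >= threshold]


crit = estimate_criticality(toy_kernel, [3, 5, 7, 9])
protected = select_for_protection(crit)
```

In this toy setting, every flip in `r0` reaches the output, flips in `r2` and `r3` never do, and `r1` is only partially critical because the upper bits are masked. A selective scheme would therefore spend its protection budget on `r0` first.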
Year: 2019
Issue: 7
Authors: Goncalves, M.; Fernandes, F.; Lamb, I.; Rech, P.; Azambuja, J. R.
Selective Fault Tolerance for Register Files of Graphics Processing Units / Goncalves, M.; Fernandes, F.; Lamb, I.; Rech, P.; Azambuja, J. R. - In: IEEE TRANSACTIONS ON NUCLEAR SCIENCE. - ISSN 0018-9499. - 66:7(2019), pp. 1449-1456. [10.1109/TNS.2019.2903027]
Use this identifier to cite or link to this document: https://hdl.handle.net/11572/346729