The complexity of both hardware and software makes GPUs reliability evaluation extremely challenging. A low level fault injection on a GPU model, despite being accurate, would take a prohibitively long time (months to years), while software fault injection, despite being quick, cannot access critical resources for GPUs and typically uses synthetic fault models (e.g., single bit-flips) that could result in unrealistic evaluations. This paper proposes to combine the accuracy of Register-Transfer Level (RTL) fault injection with the efficiency of software fault injection. First, on an RTL GPU model (FlexGripPlus), we inject over 1.5 million faults in low-level resources that are unprotected and hidden to the programmer, and characterize their effects on the output of common instructions. We create a pool of possible fault effects on the operation output based on the instruction opcode and input characteristics. We then inject these fault effects, at the application level, using an updated version of a software framework (NVBitFI). Our strategy reduces the fault injection time from the tens of years an RTL evaluation would need to tens of hours, thus allowing, for the first time on GPUs, to track the fault propagation from the hardware to the output of complex applications. Additionally, we provide a more realistic fault model and show that single bit-flip injection would underestimate the error rate of six HPC applications and two convolutional neural networks by up to 48parcent (18parcent on average). The RTL fault models and the injection framework we developed are made available in a public repository to enable third-party evaluations and ease results reproducibility.

Revealing GPUs Vulnerabilities by Combining Register-Transfer and Software-Level Fault Injection / Santos, F. F. D.; Condia, J. E. R.; Carro, L.; Reorda, M. S.; Rech, P.. - (2021), pp. 292-304. ((Intervento presentato al convegno 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021 tenutosi a twn nel 2021 [10.1109/DSN48987.2021.00042].

Revealing GPUs Vulnerabilities by Combining Register-Transfer and Software-Level Fault Injection

Rech P.
2021-01-01

Abstract

The complexity of both hardware and software makes GPUs reliability evaluation extremely challenging. A low level fault injection on a GPU model, despite being accurate, would take a prohibitively long time (months to years), while software fault injection, despite being quick, cannot access critical resources for GPUs and typically uses synthetic fault models (e.g., single bit-flips) that could result in unrealistic evaluations. This paper proposes to combine the accuracy of Register-Transfer Level (RTL) fault injection with the efficiency of software fault injection. First, on an RTL GPU model (FlexGripPlus), we inject over 1.5 million faults in low-level resources that are unprotected and hidden to the programmer, and characterize their effects on the output of common instructions. We create a pool of possible fault effects on the operation output based on the instruction opcode and input characteristics. We then inject these fault effects, at the application level, using an updated version of a software framework (NVBitFI). Our strategy reduces the fault injection time from the tens of years an RTL evaluation would need to tens of hours, thus allowing, for the first time on GPUs, to track the fault propagation from the hardware to the output of complex applications. Additionally, we provide a more realistic fault model and show that single bit-flip injection would underestimate the error rate of six HPC applications and two convolutional neural networks by up to 48parcent (18parcent on average). The RTL fault models and the injection framework we developed are made available in a public repository to enable third-party evaluations and ease results reproducibility.
Proceedings - 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021
usa
Institute of Electrical and Electronics Engineers Inc.
978-1-6654-3572-7
Santos, F. F. D.; Condia, J. E. R.; Carro, L.; Reorda, M. S.; Rech, P.
Revealing GPUs Vulnerabilities by Combining Register-Transfer and Software-Level Fault Injection / Santos, F. F. D.; Condia, J. E. R.; Carro, L.; Reorda, M. S.; Rech, P.. - (2021), pp. 292-304. ((Intervento presentato al convegno 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021 tenutosi a twn nel 2021 [10.1109/DSN48987.2021.00042].
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11572/346633
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 3
  • ???jsp.display-item.citation.isi??? ND
social impact