Revealing GPUs Vulnerabilities by Combining Register-Transfer and Software-Level Fault Injection / dos Santos, Fernando F.; Condia, Josie E. Rodriguez; Carro, Luigi; Reorda, Matteo Sonza; Rech, Paolo. - (2021), pp. 292-304. (Paper presented at the DSN 2021 conference, held as a virtual event, 21-24 June 2021) [10.1109/DSN48987.2021.00042].

Revealing GPUs Vulnerabilities by Combining Register-Transfer and Software-Level Fault Injection

Rech, Paolo (last author)
2021-01-01

Abstract

The complexity of both hardware and software makes GPU reliability evaluation extremely challenging. Low-level fault injection on a GPU model, although accurate, would take a prohibitively long time (months to years), while software fault injection, although quick, cannot access critical GPU resources and typically uses synthetic fault models (e.g., single bit-flips) that could result in unrealistic evaluations. This paper proposes to combine the accuracy of Register-Transfer Level (RTL) fault injection with the efficiency of software fault injection. First, on an RTL GPU model (FlexGripPlus), we inject over 1.5 million faults in low-level resources that are unprotected and hidden from the programmer, and characterize their effects on the output of common instructions. We create a pool of possible fault effects on the operation output based on the instruction opcode and input characteristics. We then inject these fault effects, at the application level, using an updated version of a software framework (NVBitFI). Our strategy reduces the fault injection time from the tens of years an RTL evaluation would need to tens of hours, making it possible, for the first time on GPUs, to track fault propagation from the hardware to the output of complex applications. Additionally, we provide a more realistic fault model and show that single bit-flip injection would underestimate the error rate of six HPC applications and two convolutional neural networks by up to 48% (18% on average). The RTL fault models and the injection framework we developed are made available in a public repository to enable third-party evaluations and ease the reproducibility of results.
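The contrast the abstract draws between a synthetic single bit-flip and an effect drawn from a per-opcode pool can be illustrated with a minimal Python sketch. This is not the authors' framework and does not use the NVBitFI API: the function names, the `EFFECT_POOL` table, and the choice to store effects as relative errors are all assumptions made for illustration, and the numbers in the pool are placeholders rather than data from the paper.

```python
import random
import struct


def flip_bit_f32(value: float, bit: int) -> float:
    """Synthetic fault model: invert one bit of `value` encoded as IEEE 754 float32."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 32)")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return faulty


# Hypothetical pool of fault effects, keyed by instruction opcode.
# Assumption: each effect is a relative error on the instruction output.
# The values are placeholders, NOT measurements from the paper.
EFFECT_POOL = {
    "FADD": [1e-6, 1e-3, 0.5],
    "FMUL": [1e-5, 2.0],
}


def inject_from_pool(opcode: str, value: float, rng: random.Random) -> float:
    """Pool-based fault model: corrupt `value` with an effect sampled for `opcode`."""
    rel_err = rng.choice(EFFECT_POOL[opcode])
    return value * (1.0 + rel_err)


if __name__ == "__main__":
    rng = random.Random(0)
    print(flip_bit_f32(1.0, 31))             # sign bit flipped
    print(inject_from_pool("FADD", 2.0, rng))
```

In this sketch the bit-flip model depends only on the bit position, whereas the pool-based model depends on which instruction produced the value, which is the distinction the abstract relies on.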
2021
51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks Proceedings
Piscataway, NJ, USA
IEEE
978-1-6654-3572-7
dos Santos, Fernando F.; Condia, Josie E. Rodriguez; Carro, Luigi; Reorda, Matteo Sonza; Rech, Paolo
Files in this item:
File: DSN_Revealing_GPUs_Vulnerabilities_by_Combining_Register-Transfer_and_Software-Level_Fault_Injection.pdf
Access: Archive managers only
Type: Publisher's version (Publisher's layout)
License: All rights reserved
Size: 1.34 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/360388
Citations
  • PMC: N/A
  • Scopus: 23
  • Web of Science (ISI): 20
  • OpenAlex: N/A