Graphics Processing Units (GPUs) are increasingly adopted in several domains where reliability is fundamental, such as self-driving cars and autonomous systems. Unfortunately, GPU devices have been shown to have a high error rate, while the constraints imposed by real-time safety-critical applications make traditional (and costly) replication-based hardening solutions inadequate. This work proposes an effective methodology to identify the architectural vulnerable sites in GPUs modules, i.e., the locations that, if corrupted, most affect the correct instructions execution. We first identify, through an innovative method based on Register-Transfer Level (RTL) fault injection experiments, the architectural vulnerabilities of a GPU model. Then, we mitigate the fault impact via selective hardening applied to the flip-flops that have been identified as critical. We evaluate three hardening strategies: Triple Modular Redundancy (TMR), Triple Modular Redundancy against SETs ( Δ TMR), and Dual Interlocked Storage Cells (Dice flip-flops). The results gathered on a publicly available GPU Model (FlexGripPlus) considering functional units, pipeline registers, and warp scheduler controller show that our method can tolerate from 85% to 99% of faults in the pipeline registers, from 50% to 100% of faults in the functional units and up to 10% of faults in the warp scheduler, with a reduced hardware overhead (in the range of 58% to 94% when compared with traditional TMR). Finally, we adapt the methodology to perform a complementary evaluation targeting permanent faults and identify critical sites prone to propagate fault effects across the GPU. We found that a considerable percentage (65% to 98%) of flip-flops that are critical for transient faults are also critical for permanent faults.

An Effective Method to Identify Microarchitectural Vulnerabilities in GPUs / Rodriguez Condia, Esteban; Rech, Paolo; Fernandes dos Santos, Fernando; Carro, Luigi; Sonza Reorda, Matteo. - In: IEEE TRANSACTIONS ON DEVICE AND MATERIALS RELIABILITY. - ISSN 1530-4388. - 22:2(2022), pp. 129-141. [10.1109/TDMR.2022.3166260]

An Effective Method to Identify Microarchitectural Vulnerabilities in GPUs

Rech, Paolo;
2022-01-01

Abstract

Graphics Processing Units (GPUs) are increasingly adopted in several domains where reliability is fundamental, such as self-driving cars and autonomous systems. Unfortunately, GPU devices have been shown to have a high error rate, while the constraints imposed by real-time safety-critical applications make traditional (and costly) replication-based hardening solutions inadequate. This work proposes an effective methodology to identify the architectural vulnerable sites in GPUs modules, i.e., the locations that, if corrupted, most affect the correct instructions execution. We first identify, through an innovative method based on Register-Transfer Level (RTL) fault injection experiments, the architectural vulnerabilities of a GPU model. Then, we mitigate the fault impact via selective hardening applied to the flip-flops that have been identified as critical. We evaluate three hardening strategies: Triple Modular Redundancy (TMR), Triple Modular Redundancy against SETs ( Δ TMR), and Dual Interlocked Storage Cells (Dice flip-flops). The results gathered on a publicly available GPU Model (FlexGripPlus) considering functional units, pipeline registers, and warp scheduler controller show that our method can tolerate from 85% to 99% of faults in the pipeline registers, from 50% to 100% of faults in the functional units and up to 10% of faults in the warp scheduler, with a reduced hardware overhead (in the range of 58% to 94% when compared with traditional TMR). Finally, we adapt the methodology to perform a complementary evaluation targeting permanent faults and identify critical sites prone to propagate fault effects across the GPU. We found that a considerable percentage (65% to 98%) of flip-flops that are critical for transient faults are also critical for permanent faults.
2022
2
Rodriguez Condia, Esteban; Rech, Paolo; Fernandes dos Santos, Fernando; Carro, Luigi; Sonza Reorda, Matteo
An Effective Method to Identify Microarchitectural Vulnerabilities in GPUs / Rodriguez Condia, Esteban; Rech, Paolo; Fernandes dos Santos, Fernando; Carro, Luigi; Sonza Reorda, Matteo. - In: IEEE TRANSACTIONS ON DEVICE AND MATERIALS RELIABILITY. - ISSN 1530-4388. - 22:2(2022), pp. 129-141. [10.1109/TDMR.2022.3166260]
File in questo prodotto:
File Dimensione Formato  
An_Effective_Method_to_Identify_Microarchitectural_Vulnerabilities_in_GPUs.pdf

Solo gestori archivio

Tipologia: Versione editoriale (Publisher’s layout)
Licenza: Tutti i diritti riservati (All rights reserved)
Dimensione 1.6 MB
Formato Adobe PDF
1.6 MB Adobe PDF   Visualizza/Apri
Transactions_on_device_and_materials_reliability.pdf

accesso aperto

Tipologia: Post-print referato (Refereed author’s manuscript)
Licenza: Tutti i diritti riservati (All rights reserved)
Dimensione 1.14 MB
Formato Adobe PDF
1.14 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11572/338432
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 6
  • ???jsp.display-item.citation.isi??? 5
  • OpenAlex ND
social impact