Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units / Guerrero Balaguera, Juan David; Rodriguez Condia, Josie Esteban; Fernandes Dos Santos, Fernando; Sonza Reorda, Matteo; Rech, Paolo. - (2023). (Paper presented at the SC '23 conference, held in Denver, 12th-17th November 2023) [10.1145/3581784.3607086].

Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units

Rech, Paolo
Last author
2023-01-01

Abstract

Modern Graphics Processing Units (GPUs) are expected to operate for many years, which exposes the hardware to aging (i.e., permanent faults arising after the end-of-manufacturing test). Hence, techniques to assess the impact of permanent faults in GPUs are strongly needed, especially in safety-critical domains. This paper presents a method to evaluate permanent faults in the GPU's scheduler and control units, together with the first figures quantifying their effects. We inject 5.83×10^5 permanent faults into the gate-level units of a GPU model. We then map the observed error categories to software errors by instrumenting 13 applications and two convolutional neural networks, injecting more than 1.65×10^5 permanent errors (1,000 errors per application), which reduces evaluation times from several years to hundreds of hours. Our results highlight that faults in GPU parallelism management units impact software execution parameters. Moreover, errors in resource management or instruction codes hang the code, while 45% of errors induce silent data corruption.
2023
SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
New York City
ACM
979-8-4007-0109-2
Guerrero Balaguera, Juan David; Rodriguez Condia, Josie Esteban; Fernandes Dos Santos, Fernando; Sonza Reorda, Matteo; Rech, Paolo
Files in this item:
File: 3581784.3607086.pdf
Access: open access
Type: Publisher's version (publisher's layout)
License: Creative Commons
Size: 1.27 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/403694