Graphics Processing Units (GPUs) are being employed in High Performance Computing (HPC) and in safety-critical applications such as autonomous vehicles. This market shift has led to significant improvements in programming frameworks and performance evaluation tools, and has raised concerns about GPU reliability. Evaluating GPU reliability is extremely challenging because of the parallel nature and high complexity of GPU architectures. We conducted the first cross-layer GPU reliability evaluation to unveil (and mitigate) GPU vulnerabilities. The evaluation compares and combines extensive high-energy neutron beam experiments, massive fault simulation campaigns at both the Register-Transfer Level (RTL) and the software level, and application profiling. Based on this detailed analysis, a novel methodology is proposed to accurately estimate the Failure In Time (FIT) rate of GPU applications. Moreover, using the knowledge obtained from the cross-layer reliability evaluation, two novel hardening solutions for HPC and safety-critical applications are proposed: (1) Reduced Precision Duplication With Comparison (RP-DWC), which executes the redundant copy in reduced precision. RP-DWC delivers fault coverage of up to 86% with execution time and energy consumption overheads of 13% and 24%, respectively. (2) Dedicated software solutions for hardening Convolutional Neural Networks (CNNs), which can correct up to 98% of CNN errors.
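The duplication-with-comparison idea behind RP-DWC can be illustrated with a small CPU-side sketch. This is not the authors' implementation: the function names, the dot-product workload, the use of IEEE 754 half precision for the redundant copy, the relative tolerance, and the single bit-flip fault model are all assumptions made for illustration.

```python
import struct


def to_half(x: float) -> float:
    """Round a float to IEEE 754 half precision and back.

    struct raises OverflowError for magnitudes beyond the half-precision range.
    """
    return struct.unpack("e", struct.pack("e", x))[0]


def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of the 64-bit representation of x (hypothetical fault model)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]


def dot_full(a, b):
    """Main copy: dot product in double precision."""
    return sum(x * y for x, y in zip(a, b))


def dot_half(a, b):
    """Redundant copy: every operand and intermediate rounded to half precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_half(acc + to_half(to_half(x) * to_half(y)))
    return acc


def rp_dwc_dot(a, b, rel_tol=1e-2, fault_bit=None):
    """Run both copies and flag a mismatch larger than the assumed tolerance."""
    full = dot_full(a, b)
    if fault_bit is not None:
        full = flip_bit(full, fault_bit)  # emulate a transient fault in the main copy
    reduced = dot_half(a, b)
    scale = max(abs(full), abs(reduced), 1.0)
    mismatch = abs(full - reduced) > rel_tol * scale
    return full, mismatch


a = [0.5, 1.25, 2.0]
b = [1.0, 2.0, 0.5]
print(rp_dwc_dot(a, b))                # (4.0, False)
print(rp_dwc_dot(a, b, fault_bit=54))  # (64.0, True)
print(rp_dwc_dot(a, b, fault_bit=0)[1])
```

In this sketch a flip in an exponent bit changes the result by far more than the tolerance and is flagged, while a flip in the least significant mantissa bit stays below the tolerance and passes the comparison. The tolerance must absorb the rounding difference between the two precisions, so deviations smaller than it go undetected here.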

Understanding and Improving GPUs' Reliability Combining Beam Experiments with Fault Simulation / Dos Santos, Fernando Fernandes; Carro, Luigi; Rech, Paolo. (2023). Paper presented at the 28th IEEE European Test Symposium (ETS 2023), held in Venice, 22-26 May 2023. [10.1109/ets56758.2023.10174206].

Understanding and Improving GPUs' Reliability Combining Beam Experiments with Fault Simulation

Rech, Paolo
Last author
2023-01-01

2023
2023 IEEE European Test Symposium Proceedings
Piscataway, NJ
IEEE
979-8-3503-3634-4
Dos Santos, Fernando Fernandes; Carro, Luigi; Rech, Paolo
Files in this record:

File: Understanding_and_Improving_GPUs_Reliability_Combining_Beam_Experiments_with_Fault_Simulation.pdf
Access: Archive managers only
Type: Publisher's version (Publisher's layout)
License: All rights reserved
Size: 1.9 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/403699