Understanding and Improving GPUs' Reliability Combining Beam Experiments with Fault Simulation / Dos Santos, Fernando Fernandes; Carro, Luigi; Rech, Paolo. - 2023-:(2023). (Paper presented at the 28th IEEE European Test Symposium: ETS 2023, held in Venice, 22nd-26th May 2023) [10.1109/ets56758.2023.10174206].
Understanding and Improving GPUs' Reliability Combining Beam Experiments with Fault Simulation
Rech, Paolo (last author)
2023-01-01
Abstract
Graphics Processing Units (GPUs) are being employed in High Performance Computing (HPC) and safety-critical applications, such as autonomous vehicles. This market shift has led to significant improvements in programming frameworks and performance evaluation tools, and to concerns about GPU reliability. GPU reliability evaluation is extremely challenging due to the parallel nature and high complexity of GPU architectures. We conducted the first cross-layer GPU reliability evaluation to unveil (and mitigate) GPU vulnerabilities. The evaluation compares and combines extensive high-energy neutron beam experiments, massive fault simulation campaigns at both the Register-Transfer Level (RTL) and the software level, and application profiling. Based on this extensive and detailed analysis, a novel methodology to accurately estimate the Failure In Time (FIT) rate of GPU applications is proposed. Moreover, by employing the knowledge obtained from the cross-layer reliability evaluation, two novel hardening solutions for HPC and safety-critical applications are proposed: (1) Reduced Precision Duplication With Comparison (RP-DWC), which executes a redundant copy in reduced precision and delivers excellent fault coverage, up to 86%, with minimal execution time and energy consumption overheads (13% and 24%, respectively); (2) dedicated software solutions for hardening Convolutional Neural Networks (CNNs) that can correct up to 98% of the CNN errors.

File | Size | Format
---|---|---
Understanding_and_Improving_GPUs_Reliability_Combining_Beam_Experiments_with_Fault_Simulation.pdf (access: archive managers only; type: publisher's version (Publisher's layout); license: all rights reserved) | 1.9 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.