Evaluation and Mitigation of Radiation-induced Soft Errors in Graphics Processing Units / De Oliveira, D. A. G.; Pilla, L. L.; Santini, T.; Rech, P. - In: IEEE TRANSACTIONS ON COMPUTERS. - ISSN 0018-9340. - 2016, 65:3(2016), pp. 791-804. [10.1109/TC.2015.2444855]

Evaluation and Mitigation of Radiation-induced Soft Errors in Graphics Processing Units

Rech P.
2016-01-01

Abstract

Graphics processing units (GPUs) are increasingly attractive for both safety-critical and high-performance computing (HPC) applications. GPU reliability is a primary concern for the automotive and aerospace markets and is also becoming an issue for supercomputers: with the large number of devices in a data center, the probability that at least one device is corrupted is very high. In this paper, we offer novel insights into GPU reliability by evaluating the neutron sensitivity of the memory structures of modern GPUs, highlighting pattern dependence and the occurrence of multiple errors. Additionally, a wide set of parallel codes is exposed to controlled neutron beams to measure the operative error rates of GPUs. From experimental data and algorithm analysis, we derive general insights into the reliability of parallel algorithms and programming approaches. Finally, three hardening strategies (error-correcting code, algorithm-based fault tolerance, and duplication with comparison) are presented and evaluated on GPUs through radiation experiments. We present and compare both the reliability improvement and the overhead imposed by the selected hardening solutions.
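The abstract's remark about large data centers can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes that devices fail independently and with the same per-device probability, and the function name `prob_at_least_one_corrupted` and the numbers used are hypothetical, chosen only to show how quickly the probability grows with the device count.

```python
def prob_at_least_one_corrupted(p_device: float, n_devices: int) -> float:
    """Probability that at least one of n_devices is corrupted.

    Assumption (not from the paper): devices fail independently, each with
    the same probability p_device over the observation window.
    """
    if not 0.0 <= p_device <= 1.0:
        raise ValueError("p_device must be in [0, 1]")
    if n_devices < 0:
        raise ValueError("n_devices must be non-negative")
    # Complement rule: 1 - P(no device is corrupted).
    return 1.0 - (1.0 - p_device) ** n_devices


if __name__ == "__main__":
    # Hypothetical per-device probability, for illustration only.
    p = 1e-4
    for n in (1, 100, 10_000):
        print(f"{n:>6} devices -> {prob_at_least_one_corrupted(p, n):.4f}")
```

Under these assumptions, even a small per-device probability yields a system-level probability close to 1 once the device count is large enough, which is the scaling argument the abstract makes for supercomputers.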
Year: 2016
Issue: 3
Authors: De Oliveira, D. A. G.; Pilla, L. L.; Santini, T.; Rech, P.
Files in this item:
File: TC_Evaluation_and_Mitigation_of_Radiation-Induced_Soft_Errors_in_Graphics_Processing_Units.pdf
Access: Archive managers only
Description: IEEE Transactions on Computers - article
Type: Publisher's version (Publisher's layout)
License: All rights reserved
Size: 1.06 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/403760
Citations
  • PubMed Central N/A
  • Scopus 95
  • Web of Science 83
  • OpenAlex 99