A comprehensive evaluation of the effects of input data on the resilience of GPU applications / Previlon, F. G.; Kalra, C.; Kaeli, D. R.; Rech, P. - (2019), pp. 1-6. (Paper presented at the 32nd IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2019, held in the Netherlands in 2019) [10.1109/DFT.2019.8875269].

A comprehensive evaluation of the effects of input data on the resilience of GPU applications

Rech P.
2019-01-01

Abstract

While GPUs are being aggressively deployed in a growing number of computing domains, their resilience to transient faults remains a subject of concern. To better understand the inherent vulnerability of GPU applications to transient faults, researchers perform extensive fault injection experiments. However, the conclusions drawn from these experiments tend to depend on the specific input used during the experiments. The dependence of program resilience on changes in program input has not been thoroughly studied for GPU workloads. This paper addresses this issue, presenting an extensive analysis of how changes in program input affect GPU reliability. Our work extends and challenges previous studies which reported that input data values do not affect reliability. Our analysis demonstrates that input sizes, as well as biased input values (input with a small set of dominant values), can have a significant impact on application reliability. For the applications studied, we can expect a change of as much as 30% in the probability that a fault causes a failure. Furthermore, we provide guidance on how to predict changes in resilience without repeating exhaustive fault injection experiments.
2019
2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2019
United States
Institute of Electrical and Electronics Engineers Inc.
978-1-7281-2260-1
Previlon, F. G.; Kalra, C.; Kaeli, D. R.; Rech, P.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/403739