In this paper we compare the soft-error sensitivity of parallel applications on modern Graphics Processing Units (GPUs) as measured through architecture-level fault injection and through high-energy particle beam radiation experiments. Fault-injection and beam experiments provide different information and use different transient-fault sensitivity metrics, which are hard to combine. We show how correlating beam and fault-injection data can provide a deeper understanding of GPU behavior in the presence of transient faults. In particular, we demonstrate that commonly used architecture-level fault models (and fast injection tools) can be used to identify critical kernels and to associate some experimentally observed output errors with their causes. Additionally, we show how register-file and instruction-level injections can be used to evaluate the efficiency of ECC in reducing the radiation-induced error rate.

Analyzing the criticality of transient faults-induced SDCs on GPU applications / Dos Santos, F. F.; Rech, P. - (2017), pp. 1-7. (Paper presented at the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2017, held in conjunction with the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017, held in the USA in 2017) [10.1145/3148226.3148228].

Analyzing the criticality of transient faults-induced SDCs on GPU applications

Rech P.
2017-01-01

Proceedings of ScalA 2017: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis
USA
Association for Computing Machinery, Inc
9781450351256
Dos Santos, F. F.; Rech, P.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/346657