Combining architectural fault-injection and neutron beam testing approaches toward better understanding of GPU soft-error resilience / Previlon, F. G.; Egbantan, B.; Tiwari, D.; Rech, P.; Kaeli, D. R. - 2017-:(2017), pp. 898-901. (Paper presented at the 60th IEEE International Midwest Symposium on Circuits and Systems, MWSCAS 2017, held in the USA in 2017) [10.1109/MWSCAS.2017.8053069].
Combining architectural fault-injection and neutron beam testing approaches toward better understanding of GPU soft-error resilience
Rech P.;
2017-01-01
Abstract
Transient faults continue to be a critical concern in a range of computing domains, including high-performance computing (HPC), scientific computing, and the automotive industry. While radiation-induced faults have been well studied and understood in microprocessors, their impact on computations on Graphics Processing Units (GPUs) has received less attention, even though GPUs are now widely used in the HPC and automotive markets. Mitigating the effects of transient faults requires a thorough understanding of the interaction between applications, system software, and the underlying hardware. Developing this understanding is challenging, mainly because of our limited ability to capture and study cross-layer reliability interactions. In this paper, we combine neutron beam testing experiments with architectural fault-injection experiments to gain a deeper understanding of the relationship between the vulnerability of GPUs and the underlying workload characteristics of applications targeted at GPU devices.