CAROL-FI: An efficient fault-injection tool for vulnerability evaluation of modern HPC parallel accelerators / Oliveira, D.; Frattin, V.; Navaux, P.; Koren, I.; Rech, P. - (2017), pp. 295-298. (Paper presented at the 14th ACM International Conference on Computing Frontiers, CF 2017, held at the University of Siena, Italy, in 2017) [10.1145/3075564.3075598].
CAROL-FI: An efficient fault-injection tool for vulnerability evaluation of modern HPC parallel accelerators
Rech P.
2017-01-01
Abstract
Transient faults are a major problem for large-scale HPC systems, and the mitigation of adverse fault effects needs to be highly efficient as we approach exascale. We developed a fault injection tool (CAROL-FI) to identify the potential sources of adverse fault effects. With a deeper understanding of such effects, we provide useful insights for designing efficient mitigation techniques, such as selective hardening of critical portions of the code. We performed a fault injection campaign, injecting more than 67,000 faults into an Intel Xeon Phi executing six representative HPC programs. We show that selective hardening can be successfully applied to DGEMM and Hotspot, while LavaMD and NW may require complete code hardening.