CAROL-FI: An efficient fault-injection tool for vulnerability evaluation of modern HPC parallel accelerators / Oliveira, D.; Frattin, V.; Navaux, P.; Koren, I.; Rech, P. - (2017), pp. 295-298. (Paper presented at the 14th ACM International Conference on Computing Frontiers, CF 2017, held at the University of Siena, Italy, in 2017) [10.1145/3075564.3075598].

CAROL-FI: An efficient fault-injection tool for vulnerability evaluation of modern HPC parallel accelerators

Rech P.
2017-01-01

Abstract

Transient faults are a major problem for large-scale HPC systems, and the mitigation of adverse fault effects needs to be highly efficient as we approach exascale. We developed a fault injection tool (CAROL-FI) to identify the potential sources of adverse fault effects. With a deeper understanding of such effects, we provide useful insights for designing efficient mitigation techniques, such as selective hardening of critical portions of the code. We performed a fault injection campaign injecting more than 67,000 faults into an Intel Xeon Phi executing six representative HPC programs. We show that selective hardening can be successfully applied to DGEMM and Hotspot, while LavaMD and NW may require complete code hardening.
ACM International Conference on Computing Frontiers 2017, CF 2017
USA
Association for Computing Machinery, Inc
9781450344876
Oliveira, D.; Frattin, V.; Navaux, P.; Koren, I.; Rech, P.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/346639