Reduced Precision DWC: An Efficient Hardening Strategy for Mixed-Precision Architectures / Dos Santos, F. F.; Brandalero, M.; Sullivan, M. B.; Basso, P. M.; Hubner, M.; Carro, L.; Rech, P. - In: IEEE TRANSACTIONS ON COMPUTERS. - ISSN 0018-9340. - 71:3(2022), pp. 573-586. [10.1109/TC.2021.3058872]

Reduced Precision DWC: An Efficient Hardening Strategy for Mixed-Precision Architectures

Rech P.
2022-01-01

Abstract

Duplication with Comparison (DWC) is an effective software-level solution to improve the reliability of computing devices. However, it introduces performance and energy consumption overheads that could be unsuitable for high-performance computing or real-time safety-critical applications. In this article, we present Reduced-Precision Duplication with Comparison (RP-DWC) as a means to lower the overhead of DWC by executing the redundant copy in reduced precision. RP-DWC is particularly suitable for modern mixed-precision architectures, such as NVIDIA GPUs, that feature dedicated functional units for computing with programmable accuracy. We discuss the benefits and challenges associated with RP-DWC and show that the intrinsic difference between the mixed-precision copies allows for detecting most, but not all, errors. However, because the undetected faults are those that fall within the difference between precisions, they have a much smaller impact on the application output and, thus, might be tolerated. We investigate the impact of RP-DWC on fault detection, performance, and energy consumption on Volta GPUs. Through fault injection and beam experiments, using three microbenchmarks and four real applications, we show that RP-DWC achieves excellent coverage (up to 86 percent) with minimal overheads (as low as 0.1 percent time and 24 percent energy consumption overhead).
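The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it emulates the full-precision copy with binary64 and the redundant copy with binary32 on the CPU rather than using GPU functional units, and the helper names (`to_f32`, `dot_f64`, `dot_f32`, `flip_bit`, `mismatch`), the input vectors, and the relative tolerance `1e-5` are all assumptions made for illustration.

```python
import struct


def to_f32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_f64(a, b):
    """Main copy: dot product in full (binary64) precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_f32(a, b):
    """Redundant copy: inputs and every intermediate rounded to binary32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(to_f32(x) * to_f32(y)))
    return acc


def flip_bit(x: float, bit: int) -> float:
    """Inject a single-bit fault into the binary64 encoding of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]


def mismatch(full: float, reduced: float, rel_tol: float = 1e-5) -> bool:
    """Flag an error when the copies differ by more than the assumed tolerance.

    Written as a negated <= so that NaN or infinite results are also flagged.
    """
    return not (abs(full - reduced) <= rel_tol * abs(reduced))


# Hypothetical inputs chosen only for this illustration.
a = [1.5, 2.25, 0.75, 3.1]
b = [0.3, 1.7, 2.2, 0.9]

full = dot_f64(a, b)
reduced = dot_f32(a, b)

# Flip one bit of the full-precision result at several positions and
# report whether the reduced-precision comparison notices the fault.
for bit in (62, 51, 30, 5):
    faulty = flip_bit(full, bit)
    print(bit, faulty, mismatch(faulty, reduced))
```

In this sketch, faults in the exponent or the high mantissa bits change the result by far more than the gap between the two precisions and are flagged, while faults in the low mantissa bits stay below the tolerance and go unnoticed. This mirrors the abstract's point that undetected faults are the ones with a small effect on the output. Choosing the tolerance is the key design decision: too tight and fault-free runs raise false alarms, too loose and coverage drops.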
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/346699