Improving Deep Neural Network Reliability via Transient-Fault-Aware Design and Training / Fernandes Dos Santos, F.; Cavagnero, N.; Ciccone, M.; Averta, G.; Kritikakou, A.; Sentieys, O.; Rech, P.; Tommasi, T.. - In: IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING. - ISSN 2168-6750. - 2025, 13:3(2025), pp. 829-840. [10.1109/TETC.2024.3520672]

Improving Deep Neural Network Reliability via Transient-Fault-Aware Design and Training

Rech P.;
2025-01-01

Abstract

Deep Neural Networks (DNNs) have revolutionized several fields, including safety- and mission-critical applications such as autonomous driving and space exploration. However, recent studies have highlighted that transient hardware faults can corrupt the model's output, leading to high misprediction probabilities. Since traditional reliability strategies based on modular hardware, software replication, or matrix multiplication checksums impose a high overhead, there is a pressing need for efficient and effective hardening solutions tailored to DNNs. In this paper, we present several network design choices and a training procedure that increase the robustness of standard deep models, and we thoroughly evaluate these strategies with experimental analyses on vision classification tasks. We name DieHardNet the specialized DNN obtained by applying all our hardening techniques, which combine knowledge from experimental hardware fault characterization and machine learning studies. We conduct extensive ablation studies to quantify the reliability gain of each hardening component in DieHardNet. To validate our approach, we perform over 10,000 instruction-level fault injections and expose DieHardNet, executed on GPUs, to an accelerated neutron beam equivalent to more than 570,000 years of natural radiation. Our evaluation demonstrates that DieHardNet can reduce the critical error rate (i.e., the rate of errors that modify the inference) by up to 100 times compared to the unprotected baseline model, without any increase in inference time.
Fernandes Dos Santos, F.; Cavagnero, N.; Ciccone, M.; Averta, G.; Kritikakou, A.; Sentieys, O.; Rech, P.; Tommasi, T.
Files in this item:

File: Improving_Deep_Neural_Network_Reliability_via_Transient-Fault-Aware_Design_and_Training.pdf
Access: open access
Type: Publisher's version (publisher's layout)
License: All rights reserved
Size: 1.97 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/458770