Improving Deep Neural Network Reliability via Transient-Fault-Aware Design and Training


Deep Neural Networks (DNNs) have revolutionized several fields, including safety- and mission-critical applications such as autonomous driving and space exploration. However, recent studies have highlighted that transient hardware faults can corrupt the model's output, leading to high misprediction probabilities. Since traditional reliability strategies, based on modular hardware, software replication, or matrix-multiplication checksums, impose a high overhead, there is a pressing need for efficient and effective hardening solutions tailored for DNNs. In this paper, we present several network design choices and a training procedure that increase the robustness of standard deep models, and we thoroughly evaluate these strategies with experimental analyses on vision classification tasks. We name DieHardNet the specialized DNN obtained by applying all our hardening techniques, which combine knowledge from experimental hardware-fault characterization and machine learning studies. We conduct extensive ablation studies to quantify the reliability gain of each hardening component in DieHardNet. We perform over 10,000 instruction-level fault injections to validate our approach and expose DieHardNet, executed on GPUs, to an accelerated neutron beam equivalent to more than 570,000 years of natural radiation. Our evaluation demonstrates that DieHardNet can reduce the critical error rate (i.e., the rate of errors that modify the inference) by up to 100 times compared to the unprotected baseline model, without any increase in inference time.
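To make the notion of a critical error concrete, the following is a minimal sketch, not the authors' injector (which the abstract describes as working at instruction level on GPUs). It assumes a transient fault can be modeled as a single bit flip in the float32 encoding of one output logit, and it counts a flip as critical when it changes the top-1 class. The names `flip_bit`, `predicted_class`, and `critical_error_rate` are hypothetical, and treating a NaN logit as critical is an assumption of this sketch.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 float32 encoding flipped."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 32)")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # Flip the chosen bit and reinterpret the result as a float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def predicted_class(logits):
    """Index of the largest logit (top-1 prediction)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def critical_error_rate(logits, bits=range(32)):
    """Fraction of single-bit flips that change the top-1 class.

    Every (logit, bit) pair is tried once. Assumption of this sketch:
    a flip that produces NaN is also counted as critical.
    """
    golden = predicted_class(logits)  # fault-free prediction
    critical = 0
    total = 0
    for idx in range(len(logits)):
        for bit in bits:
            faulty = list(logits)
            faulty[idx] = flip_bit(logits[idx], bit)
            total += 1
            if math.isnan(faulty[idx]) or predicted_class(faulty) != golden:
                critical += 1
    return critical / total


if __name__ == "__main__":
    logits = [2.0, 1.0, -0.5]
    print(f"critical error rate: {critical_error_rate(logits):.3f}")
```

In this toy model most mantissa flips leave the prediction unchanged, while sign and exponent flips are the ones likely to alter it. This matches the abstract's distinction between faults in general and the subset of errors that modify the inference.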

Improving Deep Neural Network Reliability via Transient-Fault-Aware Design and Training / Fernandes Dos Santos, F.; Cavagnero, N.; Ciccone, M.; Averta, G.; Kritikakou, A.; Sentieys, O.; Rech, P.; Tommasi, T.. - In: IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING. - ISSN 2168-6750. - 2025, 13:3(2025), pp. 829-840. [10.1109/TETC.2024.3520672]


Fernandes Dos Santos, F.; Cavagnero, N.; Ciccone, M.; Averta, G.; Kritikakou, A.; Sentieys, O.; Rech, P.; Tommasi, T.

2025-01-01



Date of publication: 2025
Journal title: IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING
Issue number and part: 3
DOI: https://dx.doi.org/10.1109/TETC.2024.3520672
Scopus identifier: 2-s2.0-85215304476
WOS identifier: WOS:001571491400026
All authors: Fernandes Dos Santos, F.; Cavagnero, N.; Ciccone, M.; Averta, G.; Kritikakou, A.; Sentieys, O.; Rech, P.; Tommasi, T.
Appears in types: 03.1 Journal article

Files in this item:

File	Access	Type	License	Size	Format
Improving_Deep_Neural_Network_Reliability_via_Transient-Fault-Aware_Design_and_Training.pdf	Open access	Publisher's layout	All rights reserved	1.97 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/458770
