Evaluating and Mitigating Neutrons Effects on COTS EdgeAI Accelerators

Blower, Sebastian; Rech, Paolo; Cazzaniga, Carlo; Kastriotou, Maria; Frost, Christopher D.

doi:10.1109/TNS.2021.3086686

EdgeAI is an emerging artificial intelligence (AI) accelerator technology, which is capable of delivering improved AI performance at both a lower cost and a lower power level. Because these devices are intended for deployment in large quantities and in safety-critical environments, it is imperative to understand how single-event effects (SEEs) affect the reliability of this new family of devices and to propose efficient hardening solutions. Through neutron beam experiments and fault-injection analysis of a commercial-off-the-shelf (COTS) EdgeAI device, we identify the device's SEE failure modes, separate the error-rate contributions of the device's different resources, and characterize the device's SEE reliability. During this analysis, we discovered that the vast majority of single-bit flips have no appreciable effect on the output. Based on this analysis, we propose a hardening solution that implements triple-modular redundancy (TMR) in the device without changing its physical architecture. We experimentally validate this solution and show that it corrects 96% of the misclassifications (critical errors) with nearly zero overhead.
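
The abstract does not describe how TMR is realized on the device, so the following is only a generic illustration of the majority-voting idea behind TMR, applied to classification labels. The function name `tmr_classify` and the `infer` callable are hypothetical and are not taken from the paper; the sketch assumes that inference can simply be invoked three times on the same input.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def tmr_classify(
    infer: Callable[[Sequence[float]], int],
    x: Sequence[float],
) -> Optional[int]:
    """Run a (hypothetical) inference function three times and vote.

    Returns the label that at least two of the three runs agree on,
    or None when all three runs disagree and no majority exists.
    """
    votes = [infer(x) for _ in range(3)]
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None
```

With this kind of voter, a single corrupted run is outvoted by the two correct ones, while three mutually different labels are reported as an uncorrectable disagreement rather than silently accepted.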

Evaluating and Mitigating Neutrons Effects on COTS EdgeAI Accelerators / Blower, Sebastian; Rech, Paolo; Cazzaniga, Carlo; Kastriotou, Maria; Frost, Christopher D. - In: IEEE TRANSACTIONS ON NUCLEAR SCIENCE. - ISSN 0018-9499. - 68:8(2021), pp. 1719-1726. [10.1109/TNS.2021.3086686]