The high computing power of graphics processing units (GPUs) makes them attractive for safety-critical applications, where reliability is a major concern. This article takes an approximate computing perspective, relaxing application accuracy in order to improve selective fault tolerance techniques. Our approach first assesses the vulnerability of a Kepler GPU to transient effects through a neutron beam experiment. Then, it performs a fault injection campaign to identify the most critical registers and relax the result accuracy. Finally, it uses the acquired data to improve selective fault tolerance techniques in terms of occupation and performance. The results show that relaxing the application accuracy improved the reliability of the GPU register file by 71.6% on average and, compared with selective hardening techniques, reduced the number of replicated registers by an average of 41.4% while maintaining 100% fault coverage.
Improving Selective Fault Tolerance in GPU Register Files by Relaxing Application Accuracy / Goncalves, M. M.; Lamb, I. P.; Rech, P.; Brum, R. M.; Azambuja, J. R. - In: IEEE TRANSACTIONS ON NUCLEAR SCIENCE. - ISSN 0018-9499. - 67:7(2020), pp. 1573-1580. [10.1109/TNS.2020.2982162]
Improving Selective Fault Tolerance in GPU Register Files by Relaxing Application Accuracy
Rech P.
2020-01-01