The increased need for computing capabilities and higher efficiency has pushed industry to bring to market novel architectures of increasing complexity. The variety of codes that need to be executed, combined with the complexity of these architectures, introduces challenges in the reliability evaluation of computing systems and applications. This paper compares the reliability behaviors of six different architectures (an Intel co-processor, three NVIDIA GPUs, an AMD APU, and an embedded ARM) executing eight different codes. To support our evaluation, we present and discuss experimental beam data that cover a total of more than 352,000 years of natural exposure, together with a fault-injection analysis based on a total of more than 120,000 injections. We first quantify both the Silent Data Corruption and the Detected Unrecoverable Error rates. Then, we qualify the observed errors by considering the difference between the corrupted and expected values as well as the portion of the output that has been corrupted. From these analyses, we identify the reliability characteristics that are related to the underlying hardware and those related to the intrinsic behaviors of the executed code. Finally, we discuss the implications of the device- and code-dependent reliability behaviors for approximate computing, and we analyze the benefits, in terms of reduced error rate, of a relaxed output correctness.
Code-Dependent and Architecture-Dependent reliability behaviors / Fratin, V.; Oliveira, D.; Lunardi, C.; Santos, F.; Rodrigues, G.; Rech, P. - (2018), pp. 13-26. (Paper presented at the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018, held in Luxembourg in 2018) [10.1109/DSN.2018.00015].
Code-Dependent and Architecture-Dependent reliability behaviors
Rech P.
2018-01-01