Experimental and Analytical Analysis of Sorting Algorithms Error Criticality for HPC and Large Servers Applications / Lunardi, C.; Quinn, H.; Monroe, L.; Oliveira, D.; Navaux, P.; Rech, P. - In: IEEE TRANSACTIONS ON NUCLEAR SCIENCE. - ISSN 0018-9499. - 64:8(2017), pp. 2169-2178. [10.1109/TNS.2017.2727499]

Experimental and Analytical Analysis of Sorting Algorithms Error Criticality for HPC and Large Servers Applications

Rech P.
2017-01-01

Abstract

In this paper, we investigate neutron-induced errors in three implementations of sorting algorithms (QuickSort, MergeSort, and RadixSort) executed on modern graphics processing units designed for high-performance computing and large server applications. We measure the radiation-induced error rate of the sorting algorithms using the neutron beam available at the Los Alamos Neutron Science Center facility. We also analyze output error criticality by identifying specific output error patterns. We found that radiation can cause wrong elements to appear in the sorted array and values to be misaligned, as well as application crashes and system hangs. Our results show that the criticality of a radiation-induced output error pattern depends on the application. Additionally, we performed an extensive fault-injection campaign to better understand the observed phenomena, using the SASS-assembly Instrumentor Fault Injector developed by NVIDIA, which can inject faults into all of the user-accessible architectural state. Comparing the fault-injection results with the radiation experiment data shows that not all of the output errors observed under radiation can be replicated by fault injection. However, fault injection is useful in identifying possible root causes of the output errors observed in radiation testing. Finally, we use our experimental and analytical study to design efficient, experimentally tuned hardening strategies. We identify the error patterns that are critical to the final application and find the most efficient way to detect them. With an overhead as low as 16% of the execution time, we are able to reduce the output error rate of sort by about one order of magnitude.
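To make the idea of output error patterns concrete, the following is a minimal sketch of a host-side check that compares a sort's output against its input. It is not the hardening strategy evaluated in the paper, whose details are not given in the abstract. The function name `classify_sort_output` and the labels `length_mismatch`, `out_of_order`, and `wrong_elements` are hypothetical names chosen for illustration only.

```python
from collections import Counter


def classify_sort_output(original, output):
    """Return labels describing how `output` deviates from a correct
    ascending sort of `original`. An empty list means no deviation found.

    The labels are illustrative assumptions, not the paper's taxonomy.
    """
    issues = []

    # A changed element count means values were lost or added.
    if len(output) != len(original):
        issues.append("length_mismatch")

    # Order check: every element must be <= its successor.
    if any(output[i] > output[i + 1] for i in range(len(output) - 1)):
        issues.append("out_of_order")

    # Content check: the output must hold the same multiset of values
    # as the input; otherwise a corrupted value has appeared.
    if Counter(output) != Counter(original):
        issues.append("wrong_elements")

    return issues


if __name__ == "__main__":
    data = [5, 3, 9, 1]
    print(classify_sort_output(data, sorted(data)))
    print(classify_sort_output(data, [1, 3, 9, 5]))
    print(classify_sort_output(data, [1, 3, 5, 8]))
```

The order check is a single linear pass, while the content check needs extra memory for the counts. A cheaper check of this kind trades detection coverage for lower overhead, which is the sort of trade-off a tuned hardening strategy has to weigh.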

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/403758