Understanding GPU errors on large-scale HPC systems and the implications for system design and operation / Tiwari, D.; Gupta, S.; Rogers, J.; Maxwell, D.; Rech, P.; Vazhkudai, S.; Oliveira, D.; Londo, D.; Debardeleben, N.; Navaux, P.; Carro, L.; Bland, A. - (2015), pp. 331-342. (Paper presented at the 2015 21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015, held in the USA in 2015) [10.1109/HPCA.2015.7056044].

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

Gupta, S.; Rogers, J.; Rech, P.
2015-01-01

Abstract

Increases in graphics hardware performance and improvements in programmability have enabled GPUs to evolve from graphics-specific accelerators into general-purpose computing devices. Titan, the world's second fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from various domains such as astrophysics, fusion, climate, and combustion use routinely to run large-scale simulations. Unfortunately, while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. Our data was collected from the Titan supercomputer at the Oak Ridge Leadership Computing Facility and a GPU cluster at the Los Alamos National Laboratory. We also present results from our extensive neutron-beam tests, conducted at the Los Alamos Neutron Science Center (LANSCE) and at ISIS (Rutherford Appleton Laboratory, UK), to measure the resilience of different generations of GPUs. We present several findings from our field data and neutron-beam experiments, and discuss the implications of our results for future GPU architects, current and future HPC facilities, and researchers focusing on GPU resilience.
2015
2015 IEEE 21st International Symposium on High Performance Computer Architecture, HPCA 2015
USA
Institute of Electrical and Electronics Engineers Inc.
978-1-4799-8930-0
Tiwari, D.; Gupta, S.; Rogers, J.; Maxwell, D.; Rech, P.; Vazhkudai, S.; Oliveira, D.; Londo, D.; Debardeleben, N.; Navaux, P.; Carro, L.; Bland, A.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/346629