On the Use of Imbalanced Datasets for Learning-Based Vulnerability Detection / Foulefack, Rosmael; Marchetto, Alessandro. - 16107:(2026), pp. 307-324. ( 37th IFIP WG 6.1 International Conference on Testing Software and Systems, ICTSS 2025 Cyprus September 17–19, 2025) [10.1007/978-3-032-05188-2_20].

On the Use of Imbalanced Datasets for Learning-Based Vulnerability Detection

Foulefack, Rosmael; Marchetto, Alessandro
2026-01-01

Abstract

Static code analysis by means of learning-based methods is an essential part of security testing. Effective learning algorithms are crucial for training reliable models that can accurately detect weaknesses and vulnerabilities. For model training, however, it is equally important to use adequate datasets of vulnerable and non-vulnerable source code. Most existing learning-based methods have been evaluated on public datasets of code fragments labeled as vulnerable or non-vulnerable. However, such datasets are known to contain spurious entries and are often imbalanced, i.e., they contain a disproportionately large portion of non-vulnerable code. While the first issue is often addressed by data cleaning during pre-processing, the second is largely ignored. This paper reports a preliminary study of how imbalanced datasets and imbalance-handling techniques affect the performance of learning-based vulnerability detection methods. Our results show that (i) resampling approaches, in particular a combination of over- and undersampling, can generate reliable models and corroborate the results; and (ii) imbalance loss functions can improve performance on highly imbalanced and varied datasets.
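The two families of techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which resampling algorithms or loss functions were evaluated, so the code below assumes plain random over- and undersampling and a class-weighted binary cross-entropy. The names `resample_combined`, `weighted_bce`, `target_per_class`, and `pos_weight`, as well as the toy data, are hypothetical.

```python
import math
import random
from collections import Counter


def resample_combined(samples, labels, target_per_class, seed=0):
    """Bring every class to target_per_class items.

    Classes larger than the target are randomly undersampled (without
    replacement); smaller classes are randomly oversampled (with replacement).
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target_per_class:
            chosen = rng.sample(items, target_per_class)
        else:
            extra = rng.choices(items, k=target_per_class - len(items))
            chosen = items + extra
        out_samples.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_samples, out_labels


def weighted_bce(y_true, p_pred, pos_weight):
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Hypothetical toy dataset: 90 non-vulnerable (0) and 10 vulnerable (1) fragments.
samples = [f"fragment_{i}" for i in range(100)]
labels = [0] * 90 + [1] * 10

balanced_samples, balanced_labels = resample_combined(
    samples, labels, target_per_class=30
)
print(Counter(balanced_labels))  # both classes now have 30 items

# A missed vulnerable sample costs more when pos_weight > 1.
print(weighted_bce([1], [0.2], pos_weight=1.0))
print(weighted_bce([1], [0.2], pos_weight=9.0))
```

In practice, resampling of this kind is applied to the training split only, so that evaluation still reflects the original class distribution.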
2026
Testing Software and Systems
Cham (SW)
Springer Science and Business Media Deutschland GmbH
978-3-032-05187-5
Sector IINF-05/A - Information Processing Systems
Files in this item:
File: 978-3-032-05188-2_20.pdf
Access: Archive managers only
Type: Publisher's version (Publisher's layout)
License: All rights reserved
Size: 535.67 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/468514