Machine Learning Models and Data-Balancing Techniques for Credit Scoring: What Is the Best Combination?

Forecasting the creditworthiness of customers is a central issue in banking. This task requires the analysis of large datasets with many variables, for which machine learning algorithms and feature selection techniques are crucial tools. Moreover, the proportions of “good” and “bad” customers are typically imbalanced, so over- and undersampling techniques should be employed. In the literature, most investigations tackle these three issues individually. Since there is little evidence about their joint performance, this paper tries to fill the gap. We use five machine learning classifiers, each combined with different feature selection techniques and various data-balancing approaches. According to the empirical analysis of a retail credit bank dataset, the best combination is random forests, random forest recursive feature elimination and random oversampling.
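
As an illustration of the data-balancing step named in the abstract, the following is a minimal sketch of random oversampling in plain Python. It is not the authors' implementation: the function name `random_oversample`, the toy data and the seed are assumptions made for this example, and the feature selection and classification steps are not shown.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    is as frequent as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sample with replacement to close the gap to the majority class.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data (made up for illustration): 8 "good" customers (0), 2 "bad" (1).
rows = [[i, i * 2] for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels, seed=42)
balanced_counts = Counter(balanced_labels)
```

In practice, resampling of this kind is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.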

Machine Learning Models and Data-Balancing Techniques for Credit Scoring: What Is the Best Combination? / Hussin Adam Khatir, Ahmed Almustfa; Bee, Marco. - In: RISKS. - ISSN 2227-9091. - 10:9(2022), p. 169. [10.3390/risks10090169]

Hussin Adam Khatir, Ahmed Almustfa; Bee, Marco

2022-01-01

Date of publication	2022
Journal title	RISKS
Issue number and part	9
DOI	https://dx.doi.org/10.3390/risks10090169
Scopus identifier	2-s2.0-85138607476
WOS identifier	WOS:000856916500001
All authors	Hussin Adam Khatir, Ahmed Almustfa; Bee, Marco
Appears in types	03.1 Journal article

Files in this item:

File	Size	Format
HussinBee2022.pdf (open access; type: publisher’s layout; license: Creative Commons)	738.54 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/352420
