Classifying imbalanced data sets using similarity based hierarchical decomposition

Classification is difficult when the data are imbalanced and the classes overlap. In recent years, more research has focused on the classification of imbalanced data, since real-world data are often skewed. Traditional methods classify the class with the most samples (the majority class) more successfully than the other classes (the minority classes). Several methods exist for classifying imbalanced data sets, each with its own advantages and shortcomings. In this study, we propose a new hierarchical decomposition method for imbalanced data sets that differs from previously proposed solutions to the class imbalance problem. Unlike many other solutions, it does not require a data pre-processing step. The new method is based on clustering and outlier detection. The hierarchy is constructed using the similarity of labeled data subsets at each level, with different levels built from different data and feature subsets. Clustering is used to partition the data, while outlier detection is used to detect minority class samples. A comparison with state-of-the-art methods on 20 public imbalanced data sets and 181 synthetic data sets showed that the proposed method achieves better classification performance. It is especially successful when the minority class is sparser than the majority class. It remains accurate even when classes have sub-varieties and the minority and majority classes overlap. Moreover, it also performs well when the class imbalance ratio is low, i.e., when the classes are more imbalanced.
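As a rough illustration of two ideas mentioned in the abstract, the following minimal Python sketch computes a class imbalance ratio and flags outliers in a one-dimensional sample. It is not the method proposed in the paper. The function names `imbalance_ratio` and `flag_outliers`, the definition of the ratio as minority count divided by majority count (chosen so that a lower value means more imbalance, matching the abstract's wording), the z-score threshold of 2.0, and the toy data are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean, pstdev


def imbalance_ratio(labels):
    """Assumed definition: minority count / majority count.

    A value near 1.0 means balanced classes; a value near 0 means
    strongly imbalanced classes.
    """
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


def flag_outliers(values, threshold=2.0):
    """Flag values lying more than `threshold` population standard
    deviations from the mean (a simple z-score rule, used here only
    as a stand-in for a real outlier detector)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing can be an outlier.
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


# Toy data (hypothetical): nine majority samples and one minority sample.
labels = ["majority"] * 9 + ["minority"]
values = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 5.0]

print(imbalance_ratio(labels))   # 1/9, i.e. a low ratio
print(flag_outliers(values))     # only the last sample is flagged
```

In this toy case the single minority sample sits far from the dense majority group, so a plain z-score rule is enough to flag it. The abstract describes a more elaborate scheme that combines clustering and outlier detection across several hierarchy levels, which this sketch does not attempt to reproduce.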

Classifying imbalanced data sets using similarity based hierarchical decomposition / Beyan, C.; Fisher, R. - In: PATTERN RECOGNITION. - ISSN 0031-3203. - 48:5(2015), pp. 1653-1672. [10.1016/j.patcog.2014.10.032]

Beyan, C.; Fisher, R.

2015-01-01

Date of publication: 2015
Journal title: PATTERN RECOGNITION
Issue number and part: 5
DOI: https://dx.doi.org/10.1016/j.patcog.2014.10.032
Scopus identifier: 2-s2.0-84921689324
WOS identifier: WOS:000349504700006
All authors: Beyan, C.; Fisher, R.
Appears in types: 03.1 Journal article

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/304315
