Reliable Brain Tumor Classification Without Metadata: A Step-by-Step Guideline with Duplicate Removal

Saifullah, Mohiuddin; Baldauf, Daniel; Sakumura, Yuichi

doi:10.1109/ictai66417.2025.00180

Accurate brain tumor classification with deep learning can significantly support diagnosis and treatment planning. While publicly available datasets have greatly advanced research, they often lack crucial clinical and demographic metadata, making it difficult to implement proper data splitting and safeguard against information leakage. Access to hospital-sourced datasets remains extremely limited, so the community continues to rely on publicly available imaging datasets. In this work, we propose a practical, step-by-step guideline for using such datasets responsibly. We apply perceptual hashing (pHash) to detect and remove duplicate or near-duplicate images, which can otherwise inflate performance. Using EfficientNetB0 and InceptionV3, we compare model performance before and after de-duplication with stratified splits and nested cross-validation. Despite a slight drop in test accuracy, nested validation reveals a clearer picture of generalizability. We also report detailed metrics including F2-score, MCC, CSI, PR AUC, Cohen's Kappa, and Log Loss. Finally, we evaluate the models using k-fold and nested cross-validation and interpret their predictions with LIME. Our study highlights the importance of addressing data leakage and encourages future researchers to adopt more rigorous validation practices. We also urge data providers to consider leakage risks when sharing medical datasets, especially in sensitive domains like brain tumor analysis.
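The de-duplication step mentioned in the abstract could be sketched roughly as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the abstract does not state the hash size, the distance threshold, or the library used, so the names `phash`, `hamming`, and `deduplicate`, the defaults `hash_size=8` and `max_distance=0`, and the synthetic "scans" are all assumptions made for the example. It also assumes each image has already been loaded as a square grayscale grid (for instance, resized to 32x32).

```python
import math
from typing import Dict, List, Sequence

# Assumption: a grayscale image already resized to a square N x N grid.
Image = Sequence[Sequence[float]]


def phash(image: Image, hash_size: int = 8) -> int:
    """DCT-based perceptual hash (illustrative variant, hash_size**2 bits)."""
    n = len(image)
    # 1-D DCT-II basis for the lowest `hash_size` frequencies.
    basis = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
             for u in range(hash_size)]
    # Separable 2-D DCT, computing only the low-frequency block.
    rows = [[sum(image[x][y] * basis[v][y] for y in range(n))
             for v in range(hash_size)] for x in range(n)]
    coeffs = [[sum(rows[x][v] * basis[u][x] for x in range(n))
               for v in range(hash_size)] for u in range(hash_size)]
    flat = [c for row in coeffs for c in row]
    # Threshold at the median of the AC terms (the DC term is excluded
    # from the median so overall brightness does not dominate it).
    ac = sorted(flat[1:])
    median = ac[len(ac) // 2]
    bits = 0
    for c in flat:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(images: Dict[str, Image], max_distance: int = 0) -> List[str]:
    """Keep the first image of every group whose hashes lie within
    `max_distance` bits of each other (hypothetical threshold)."""
    kept: List[str] = []
    kept_hashes: List[int] = []
    for name in sorted(images):
        h = phash(images[name])
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept.append(name)
            kept_hashes.append(h)
    return kept


# Synthetic stand-ins for scans: two opposite gradients plus an exact copy.
N = 32
scan_a = [[float(x + y) for y in range(N)] for x in range(N)]
scan_b = [[float(2 * (N - 1) - x - y) for y in range(N)] for x in range(N)]
dataset = {
    "scan_001": scan_a,
    "scan_001_dup": [row[:] for row in scan_a],
    "scan_002": scan_b,
}
kept = deduplicate(dataset)
print(kept)
```

In this toy run the exact copy is dropped and the two distinct images are kept. In a workflow like the one the abstract describes, such filtering would run before the stratified split, so that near-identical images cannot land on both sides of a train/test boundary. Raising `max_distance` above zero would also catch near-duplicates, at the cost of possibly merging genuinely different images; a suitable value would have to be checked by inspection on real data.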

Reliable Brain Tumor Classification Without Metadata: A Step-by-Step Guideline with Duplicate Removal / Saifullah, M., Baldauf, D., Sakumura, Y. - (2025), pp. 1237-1244. (37th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2025 Athens, Greece 03-05 November 2025) [10.1109/ictai66417.2025.00180].


2025-01-01



Year of publication: 2025
Proceedings title: 2025 IEEE 37th International Conference on Tools with Artificial Intelligence (ICTAI)
Place of publication: Athens, Greece
Publisher: IEEE Computer Society
ISBN: 9798331549190
Scopus identifier: 2-s2.0-105031892121
Publication type: 04.1 Paper in Proceedings

Files in this record:

File	Type	License	Access	Size	Format
Reliable_Brain_Tumor_Classification_Without_Metadata_A_Step-by-Step_Guideline_with_Duplicate_Removal.pdf	Publisher's layout	All rights reserved	Archive managers only	1.27 MB	Adobe PDF