Reliable Brain Tumor Classification Without Metadata: A Step-by-Step Guideline with Duplicate Removal / Saifullah, M., Baldauf, D., Sakumura, Y. (2025), pp. 1237-1244. (37th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2025, Athens, Greece, 03-05 November 2025) [10.1109/ictai66417.2025.00180].

Reliable Brain Tumor Classification Without Metadata: A Step-by-Step Guideline with Duplicate Removal

Baldauf, Daniel;
2025-01-01

Abstract

Accurate brain tumor classification with deep learning can significantly support diagnosis and treatment planning. Publicly available datasets have greatly advanced research, but they often lack crucial clinical and demographic metadata, which makes it difficult to split the data properly and to guard against information leakage. Because access to hospital-sourced datasets remains extremely limited, the community continues to rely on publicly available imaging datasets. In this work, we propose a practical, step-by-step guideline for using such datasets responsibly. We apply perceptual hashing (pHash) to detect and remove duplicate or near-duplicate images, which can otherwise inflate reported performance. Using EfficientNetB0 and InceptionV3, we compare model performance before and after de-duplication with stratified splits, k-fold cross-validation, and nested cross-validation. Despite a slight drop in test accuracy, nested cross-validation gives a clearer picture of generalizability. We also report detailed metrics, including F2-score, MCC, CSI, PR AUC, Cohen's Kappa, and Log Loss, and we interpret the models' predictions with LIME. Our study highlights the importance of addressing data leakage and encourages future researchers to adopt more rigorous validation practices. We also urge data providers to consider leakage risks when sharing medical datasets, especially in sensitive domains like brain tumor analysis.
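The de-duplication step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the hash size, the distance threshold, or the tooling used, so `hash_size=8`, `factor=4`, `max_distance=4`, and the helper names `phash`, `hamming`, and `deduplicate` are all assumptions made for illustration. The sketch computes a DCT-based perceptual hash with NumPy only and keeps the first image of each group of near-identical hashes; the synthetic arrays stand in for grayscale MRI slices.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of shape (n, n)."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] /= np.sqrt(2.0)
    return m


def phash(image: np.ndarray, hash_size: int = 8, factor: int = 4) -> np.ndarray:
    """Perceptual hash of a 2-D grayscale array as a flat boolean vector.

    hash_size and factor are illustrative defaults, not values from the paper.
    """
    side = hash_size * factor
    # Crude nearest-neighbour downsampling to (side, side).
    rows = np.arange(side) * image.shape[0] // side
    cols = np.arange(side) * image.shape[1] // side
    small = image[np.ix_(rows, cols)].astype(np.float64)
    # 2-D DCT, then keep only the low-frequency corner.
    d = dct_matrix(side)
    low = (d @ small @ d.T)[:hash_size, :hash_size]
    # One bit per coefficient: above or below the median.
    return (low > np.median(low)).ravel()


def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(a != b))


def deduplicate(images, max_distance: int = 4):
    """Return indices of images to keep (first of each near-duplicate group).

    max_distance is an assumed threshold; it should be tuned per dataset.
    """
    kept_idx, kept_hashes = [], []
    for idx, img in enumerate(images):
        h = phash(img)
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept_idx.append(idx)
            kept_hashes.append(h)
    return kept_idx


# Synthetic stand-ins for image slices (hypothetical data).
rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = rng.random((64, 64))
images = [a, a.copy(), a * 0.5, b]  # original, exact copy, dimmed copy, other
kept = deduplicate(images)
print(kept)
```

In this toy input, the exact copy and the uniformly dimmed copy hash identically to the first image and are dropped, while the unrelated image is kept. Running the filter before any train/test split is what prevents a duplicate pair from landing on both sides of the split. The pairwise comparison is quadratic in the number of kept images, which is acceptable for a sketch but may need indexing for larger datasets.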
Year: 2025
Conference: 2025 IEEE 37th International Conference on Tools with Artificial Intelligence (ICTAI)
Location: Athens, Greece
Publisher: IEEE Computer Society
ISBN: 9798331549190
Authors: Saifullah, Mohiuddin; Baldauf, Daniel; Sakumura, Yuichi
Files in this record:
File: Reliable_Brain_Tumor_Classification_Without_Metadata_A_Step-by-Step_Guideline_with_Duplicate_Removal.pdf
Access: Archive managers only
Type: Publisher's version (Publisher's layout)
License: All rights reserved
Size: 1.27 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/488191