Reliable Brain Tumor Classification Without Metadata: A Step-by-Step Guideline with Duplicate Removal / Saifullah, M., Baldauf, D., Sakumura, Y. - (2025), pp. 1237-1244. (37th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2025, Athens, Greece, 03-05 November 2025) [10.1109/ictai66417.2025.00180].
Reliable Brain Tumor Classification Without Metadata: A Step-by-Step Guideline with Duplicate Removal
Baldauf, Daniel
2025-01-01
Abstract
Accurate brain tumor classification with deep learning can significantly support diagnosis and treatment planning. While publicly available datasets have greatly advanced research, they often lack crucial clinical and demographic metadata, making it difficult to implement proper data splitting and safeguard against information leakage. Access to hospital-sourced datasets remains extremely limited, so the community continues to rely on publicly available imaging datasets. In this work, we propose a practical, step-by-step guideline for using such datasets responsibly. We apply perceptual hashing (pHash) to detect and remove duplicate or near-duplicate images, which can otherwise inflate performance. Using EfficientNetB0 and InceptionV3, we compare model performance before and after de-duplication with stratified splits and nested cross-validation. Despite a slight drop in test accuracy, nested validation reveals a clearer picture of generalizability. We also report detailed metrics including F2-score, MCC, CSI, PR AUC, Cohen's Kappa, and Log Loss. Finally, we evaluate the models using k-fold and nested cross-validation and interpret them with LIME. Our study highlights the importance of addressing data leakage and encourages future researchers to adopt more rigorous validation practices. We also urge data providers to consider leakage risks when sharing medical datasets, especially in sensitive domains like brain tumor analysis.

| File | Size | Format |
|---|---|---|
| Reliable_Brain_Tumor_Classification_Without_Metadata_A_Step-by-Step_Guideline_with_Duplicate_Removal.pdf (archive managers only). Type: Publisher's version (Publisher's layout). License: All rights reserved | 1.27 MB | Adobe PDF |
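
The de-duplication step named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the 32x32 input size, the median threshold that skips the DC term, and the Hamming-distance cutoff `max_distance=4` are all assumptions chosen for illustration, and the "scans" are synthetic pixel grids. With real images, the grids would first have to be converted to grayscale and resized (for example with Pillow), or a library such as `imagehash` could be used instead.

```python
import math
import random
from statistics import median


def dct_1d(vec):
    """Unnormalized DCT-II of a sequence of numbers."""
    n = len(vec)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(vec))
        for k in range(n)
    ]


def dct_2d(matrix):
    """Separable 2D DCT: transform the rows, then the columns."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major


def phash(pixels, hash_size=8):
    """64-bit perceptual hash of a square grayscale grid (assumed pre-resized)."""
    freq = dct_2d(pixels)
    # Keep only the low-frequency block in the top-left corner.
    low = [freq[r][c] for r in range(hash_size) for c in range(hash_size)]
    threshold = median(low[1:])  # assumption: skip the DC term when thresholding
    bits = 0
    for value in low:
        bits = (bits << 1) | int(value > threshold)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def deduplicate(images, max_distance=4):
    """Keep the first image of every group of near-identical hashes.

    images: dict mapping a name to a pixel grid; max_distance is a
    hypothetical cutoff, not a value taken from the paper.
    """
    kept = []  # (name, hash) pairs retained so far
    for name, pixels in images.items():
        h = phash(pixels)
        if all(hamming(h, kept_hash) > max_distance for _, kept_hash in kept):
            kept.append((name, h))
    return [name for name, _ in kept]


def synthetic_scan(seed, size=32):
    """Synthetic stand-in for a grayscale scan (not real data)."""
    rng = random.Random(seed)
    return [[rng.randrange(0, 250) for _ in range(size)] for _ in range(size)]


scan_a = synthetic_scan(0)
images = {
    "scan_a": scan_a,
    "scan_b": [[v + 3 for v in row] for row in scan_a],  # brighter copy of scan_a
    "scan_c": synthetic_scan(1),                          # unrelated image
}
kept = deduplicate(images)
```

Running the filter on the whole dataset before any train/test or cross-validation split is what prevents a near-duplicate pair from ending up on both sides of a split.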
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.



