Learning without Labels - Reducing Supervision in Training, Inference, and Evaluation of Deep Neural Networks / Conti, Alessandro. - (2025 Jul 17), pp. 1-195.
Learning without Labels - Reducing Supervision in Training, Inference, and Evaluation of Deep Neural Networks
Conti, Alessandro
2025-07-17
Abstract
This thesis investigates how reliance on supervision can be reduced across the entire deep learning pipeline. In the training phase, we explore unsupervised fine-tuning, focusing on Source-Free Unsupervised Domain Adaptation scenarios in visual tasks such as Facial Expression Recognition and video-based Action Recognition, primarily leveraging self-supervision and self-training. At inference, we address the challenge of removing fixed output vocabularies from Vision Language Models by formalizing the tasks of Vocabulary-free Image Classification and Vocabulary-free Semantic Segmentation and by introducing a family of efficient methods that adapt CLIP to these tasks. We also evaluate Large Multimodal Models under a similarly constrained setting, analyzing their predictions, categorizing their mistakes, and proposing tailored solutions to improve their performance. Finally, we investigate unsupervised evaluation by proposing a framework that uses a Large Language Model and modular tools to automatically generate, execute, and interpret evaluation experiments for Large Multimodal Models without ground-truth labels. By reducing the need for human supervision at every stage of the deep learning pipeline, this thesis contributes toward a more flexible and efficient paradigm for developing and deploying deep neural networks in real-world, data-scarce, and open-ended settings.
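To make the vocabulary-free setting mentioned above concrete, the sketch below scores an image against a candidate vocabulary with CLIP rather than a fixed label set. This is a minimal, hypothetical illustration, not the methods proposed in the thesis: the model checkpoint, the prompt template, the input path, and the `candidate_vocabulary` list (which in the vocabulary-free setting would be generated or retrieved at test time, not hard-coded) are all assumptions.

```python
# Minimal sketch of vocabulary-free image classification with CLIP
# (illustrative only; not the approach developed in the thesis).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# In the vocabulary-free setting this list is NOT fixed a priori: it would be
# produced at test time, e.g. by a captioning model or by retrieval from an
# external text corpus. It is hard-coded here purely for illustration.
candidate_vocabulary = ["golden retriever", "tabby cat", "red fox", "raccoon"]

image = Image.open("example.jpg")  # hypothetical input image
prompts = [f"a photo of a {name}" for name in candidate_vocabulary]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (1, num_candidates): image-text similarity scores.
probs = outputs.logits_per_image.softmax(dim=-1)
prediction = candidate_vocabulary[probs.argmax().item()]
print(prediction)
```

The key difference from standard zero-shot classification is that `candidate_vocabulary` is unconstrained and constructed per input, which is what the formalized Vocabulary-free Image Classification task targets.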
| File | Type | Access | License | Size | Format |
|---|---|---|---|---|---|
| output.pdf | Doctoral thesis | Open access | Creative Commons | 8.06 MB | Adobe PDF |