ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models / Upadhyay, Uddeshya; Karthik, Shyamgopal; Mancini, Massimiliano; Akata, Zeynep. - (2023), pp. 1899-1910. (Paper presented at ICCV, held in Paris, France, 1-6 October 2023) [10.1109/ICCV51070.2023.00182].

ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models

Mancini, Massimiliano;
2023-01-01

Abstract

Large-scale vision-language models (VLMs) like CLIP successfully find correspondences between images and text. Through the standard deterministic mapping process, an image or a text sample is mapped to a single vector in the embedding space. This is problematic: as multiple samples (images or text) can abstract the same concept in the physical world, deterministic embeddings do not reflect the inherent ambiguity in the embedding space. We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained VLMs via inter/intra-modal alignment in a post-hoc manner without needing large-scale datasets or computing. On four challenging datasets, i.e., COCO, Flickr, CUB, and Oxford-flowers, we estimate the multi-modal embedding uncertainties for two VLMs, i.e., CLIP and BLIP, quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods. Furthermore, we propose active learning and model selection as two real-world downstream tasks for VLMs and show that the estimated uncertainty aids both tasks. Lastly, we present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model. Code is available at https://github.com/ExplainableML/ProbVLM
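To illustrate the idea of a post-hoc probabilistic adapter, the sketch below attaches a small trainable head to frozen embeddings and predicts the parameters of a Gaussian over the embedding space, trained with a heteroscedastic negative log-likelihood on cross-modal pairs. This is a minimal, hypothetical simplification: ProbVLM itself predicts the parameters of a generalized Gaussian and combines inter- and intra-modal alignment terms, and all names, sizes, and the plain-Gaussian likelihood here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ProbabilisticAdapter(nn.Module):
    """Illustrative adapter head (not the authors' exact architecture):
    maps a frozen VLM embedding to the mean and per-dimension scale
    of a Gaussian over the embedding space."""
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mu_head = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.log_sigma_head = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, z):
        # Residual mean keeps the frozen embedding as the default estimate.
        mu = z + self.mu_head(z)
        sigma = torch.exp(self.log_sigma_head(z))  # strictly positive scale
        return mu, sigma

def gaussian_nll(mu, sigma, target):
    # Heteroscedastic NLL: a large sigma down-weights the squared error
    # but is penalized by the log(sigma) term, so the head learns to
    # report high uncertainty only for genuinely ambiguous inputs.
    return (((target - mu) ** 2) / (2 * sigma ** 2) + torch.log(sigma)).mean()

# Toy usage: align (frozen) image embeddings with paired text embeddings.
img_emb = torch.randn(4, 512)  # stand-ins for frozen CLIP image outputs
txt_emb = torch.randn(4, 512)  # stand-ins for frozen CLIP text outputs
adapter = ProbabilisticAdapter()
mu, sigma = adapter(img_emb)
loss = gaussian_nll(mu, sigma, txt_emb)  # inter-modal alignment term
loss.backward()  # only the adapter is trained; the VLM stays frozen
```

At test time, the predicted sigma gives a per-sample uncertainty estimate that can be used for the downstream tasks described in the abstract, such as ranking unlabeled samples for active learning.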
2023
2023 IEEE/CVF International Conference on Computer Vision (ICCV)
Piscataway, NJ, USA
IEEE Computer Society
979-8-3503-0718-4
979-8-3503-0719-1
Upadhyay, Uddeshya; Karthik, Shyamgopal; Mancini, Massimiliano; Akata, Zeynep
Files in this item:

Upadhyay_ProbVLM_Probabilistic_Adapter_for_Frozen_Vison-Language_Models_ICCV_2023_paper.pdf
  Access: open access
  Description: the ICCV paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version.
  Type: refereed post-print (refereed author's manuscript)
  License: all rights reserved
  Size: 9.73 MB
  Format: Adobe PDF

ProbVLM_Probabilistic_Adapter_for_Frozen_Vison-Language_Models.pdf
  Access: archive administrators only
  Type: publisher's version (publisher's layout)
  License: all rights reserved
  Size: 10.24 MB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/400793