Vision-by-Language for Training-Free Compositional Image Retrieval / Karthik, Shyamgopal; Roth, Karsten; Mancini, Massimiliano; Akata, Zeynep. - (2024). (12th International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7th May - 11th May 2024).

Vision-by-Language for Training-Free Compositional Image Retrieval

Massimiliano Mancini
2024-01-01

Abstract

Given an image and a target modification (e.g., an image of the Eiffel Tower and the text “without people and at night-time”), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database. While supervised approaches rely on costly annotation of triplets (i.e., query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models on large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the capt...
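
The two steps named in the abstract (caption the reference image, then have an LLM rewrite the caption according to the modification) can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: `captioner` and `rewriter` are placeholder callables where a pre-trained generative VLM and an LLM would go, the bag-of-words `embed` function replaces a learned encoder, and the database holds short text descriptions instead of images. Because the abstract is truncated on this page, the final retrieval-by-similarity step is an assumption about how a rewritten caption could be used.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(reference_image, modification, captioner, rewriter, database):
    """Caption the reference, rewrite the caption, rank database entries."""
    caption = captioner(reference_image)            # step 1: image -> text
    target_caption = rewriter(caption, modification)  # step 2: text -> edited text
    query = embed(target_caption)
    # Assumed step 3: rank candidates by similarity to the rewritten caption.
    return sorted(
        database,
        key=lambda name: cosine(query, embed(database[name])),
        reverse=True,
    )


# Canned placeholders; a real system would call actual models here.
def captioner(image):
    return "the eiffel tower with people during the day"


def rewriter(caption, modification):
    return "the eiffel tower at night"


# Hypothetical database, described in text only for this illustration.
database = {
    "img_day": "the eiffel tower with people during the day",
    "img_night": "the eiffel tower at night",
    "img_cat": "a cat on a sofa",
}

ranking = retrieve(
    "reference.jpg", "without people and at night-time",
    captioner, rewriter, database,
)
print(ranking)
```

Since every intermediate result in this sketch is plain text, the rewritten caption can be read and corrected by a person before retrieval, which is one way to interpret the "human-understandable" property the abstract claims.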
2024
The Twelfth International Conference on Learning Representations
Amherst, Massachusetts, USA
International Conference on Learning Representations, ICLR
Karthik, Shyamgopal; Roth, Karsten; Mancini, Massimiliano; Akata, Zeynep
Files in this item:
File: 5738_Vision_by_Language_for_Tr.pdf
Access: open access
Type: Publisher's version (Publisher’s layout)
License: All rights reserved
Size: 835.7 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/437741