Vision-by-Language for Training-Free Compositional Image Retrieval

IRIS

Given an image and a target modification (e.g., an image of the Eiffel Tower and the text “without people and at night-time”), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database. While supervised approaches rely on costly annotation of triplets (i.e., query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the capt...
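The two steps the abstract names (caption the reference image, then have an LLM recompose the caption according to the modification) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the stub captioner and recomposer, and the bag-of-words `toy_embed` are hypothetical stand-ins, not the authors' code or any library's API. The abstract is truncated before it describes the retrieval step, so the sketch simply assumes a similarity search between the recomposed caption and the database entries.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical toy vocabulary; a real system would use learned embeddings.
VOCAB = ["eiffel", "tower", "people", "night", "day", "beach"]


def toy_embed(text: str) -> List[float]:
    """Bag-of-words stand-in for a pre-trained text or image encoder."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    reference_image: str,
    modification: str,
    caption_fn: Callable[[str], str],
    recompose_fn: Callable[[str, str], str],
    embed_fn: Callable[[str], List[float]],
    database: Sequence[Sequence[float]],
    top_k: int = 1,
) -> List[int]:
    """Training-free composed retrieval: caption, recompose, then search."""
    caption = caption_fn(reference_image)                 # stands in for the generative VLM
    target_caption = recompose_fn(caption, modification)  # stands in for the LLM
    query = embed_fn(target_caption)
    # Assumed retrieval step: rank database entries by similarity to the query.
    ranked = sorted(
        range(len(database)),
        key=lambda i: cosine(query, database[i]),
        reverse=True,
    )
    return ranked[:top_k]


# Stubs with hard-coded outputs, in place of real model calls.
def caption_stub(image: str) -> str:
    return "eiffel tower with people by day"


def recompose_stub(caption: str, modification: str) -> str:
    return "eiffel tower at night"


# Toy database: each image is represented by the embedding of a description.
database = [
    toy_embed("eiffel tower with people by day"),
    toy_embed("eiffel tower at night"),
    toy_embed("beach at night"),
]

result = retrieve(
    "reference.jpg",
    "without people and at night-time",
    caption_stub,
    recompose_stub,
    toy_embed,
    database,
    top_k=2,
)
print(result)
```

Because every intermediate value is plain text (the caption and the recomposed caption), each stage can be inspected or swapped independently, which is one way to read the abstract's "human-understandable" claim.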

Vision-by-Language for Training-Free Compositional Image Retrieval / Karthik, S., Roth, K., Mancini, M., Akata, Z. - (2024). (12th International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7th May - 11th May 2024).


Shyamgopal Karthik (co-first author); Karsten Roth (co-first author); Massimiliano Mancini; Zeynep Akata

2024-01-01



Date of publication: 2024
Proceedings title: The Twelfth International Conference on Learning Representations
Place of publication: Amherst, Massachusetts, USA
Publisher: International Conference on Learning Representations, ICLR
Scopus identifier: 2-s2.0-85198940449
All authors: Karthik, Shyamgopal; Roth, Karsten; Mancini, Massimiliano; Akata, Zeynep
Publication type: 04.1 Paper in Proceedings

Files in this item:

File	Access	Type	License	Size	Format
5738_Vision_by_Language_for_Tr.pdf	Open access	Publisher’s layout	All rights reserved	835.7 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/437741
