Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision

IRIS

This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models to combine complementary information from the two modalities. Recently, impressive progress has been made in developing universal multimodal encoders suitable for virtually any language and vision task. However, current approaches often require them to combine redundant information provided by language and vision. Inspired by real-life communicative contexts, we propose a novel task where each modality is necessary but not sufficient to make a correct prediction. To do so, we first build a dataset of images and corresponding sentences provided by human participants. Second, we evaluate state-of-the-art models and compare their performance against human speakers. We show that, while the task is relatively easy for humans, the best-performing models struggle to achieve similar results.

Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision / Pezzelle, Sandro; Greco, Claudio; Gandolfi, Greta; Gualdoni, Eleonora; Bernardi, Raffaella. - (2020), pp. 2751-2767. (Paper presented at the EMNLP 2020 conference, held online, 16-20 November 2020) [10.18653/v1/2020.findings-emnlp.248].



Date of publication: 2020
Proceedings title: Findings of the Association for Computational Linguistics: EMNLP 2020
Place of publication: Aachen, Germany
Publisher: Association for Computational Linguistics
Scopus identifier: 2-s2.0-85109368081
All authors: Pezzelle, Sandro; Greco, Claudio; Gandolfi, Greta; Gualdoni, Eleonora; Bernardi, Raffaella
Appears in types: 04.1 Paper in Proceedings

Files in this item:

File	Access	Type	License	Size	Format
2020.findings-emnlp.248.pdf	Open access	Publisher's layout	Creative Commons	3.97 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/286795
