FOIL it! Find One mismatch between Image and Language caption.

Shekhar, Ravi; Pezzelle, Sandro; Klimovich, Yauhen; Herbelot, Aurelie; Nabi, Moin; Sangineto, Enver; Bernardi, Raffaella

doi:10.18653/v1/P17-1024

In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MS-COCO dataset, FOIL-COCO, which associates images with both correct and ‘foil’ captions, that is, descriptions of the image that are highly similar to the original ones, but contain one single mistake (‘foil word’). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word detection; c) foil word correction. Humans, in contrast, have near-perfect performance on those tasks. We demonstrate that merely utilising language cues is not enough to model FOIL-COCO and that it challenges the state-of-the-art by requiring a fine-grained understanding of the relation between text and image.

FOIL it! Find One mismatch between Image and Language caption / Shekhar, R., Pezzelle, S., Klimovich, Y., Herbelot, A., Nabi, M., Sangineto, E., Bernardi, R. - Electronic. - (2017), pp. 255-265. (ACL, Vancouver, July 30th - August 4th, 2017) [10.18653/v1/P17-1024].

2017-01-01

Date of publication: 2017
Proceedings title: ACL 2017 The 55th Annual Meeting of the Association for Computational Linguistics: Proceedings of the Conference, Vol. 1 (Long Papers)
Place of publication: Stroudsburg, PA
Publisher: Association for Computational Linguistics
ISBN: 978-194562675-3
Scopus identifier: 2-s2.0-85040908564
WOS identifier: WOS:000493984800024
Appears in types: 04.1 Paper in Proceedings

Files in this item:

File	Size	Format
foil_acl17.pdf (open access; type: publisher's layout; license: all rights reserved)	2.99 MB	Adobe PDF