
Vision and Language Integration: Moving beyond Objects / Shekhar, Ravi; Pezzelle, Sandro; Herbelot, Aurelie Georgette Geraldine; Nabi, Moin; Sangineto, Enver; Bernardi, Raffaella. - ELECTRONIC. - (2017), pp. 1-6. (Paper presented at IWCS 2017, held in Montpellier, France, 19th-22nd September 2017).

Vision and Language Integration: Moving beyond Objects

Ravi Shekhar; Sandro Pezzelle; Aurelie Herbelot; Moin Nabi; Enver Sangineto; Raffaella Bernardi
2017-01-01

Abstract

Recent years have seen an explosion of work on the integration of vision and language data. New tasks like Image Captioning and Visual Question Answering have been proposed, and impressive results have been achieved. There is now a shared desire to gain an in-depth understanding of the strengths and weaknesses of these models. To this end, several datasets have been proposed to challenge the state of the art. These datasets, however, mostly focus on the interpretation of objects (as denoted by nouns in the corresponding captions). In this paper, we reuse a previously proposed methodology to evaluate the ability of current systems to move beyond objects and deal with attributes (as denoted by adjectives), actions (verbs), manner (adverbs) and spatial relations (prepositions). We show that the coarse representations given by current approaches are not informative enough to interpret attributes or actions, whilst spatial relations fare somewhat better, but only in attention models.
2017
IWCS 2017 12th International Conference on Computational Semantics: Short papers
Stroudsburg, USA
ACL
Shekhar, Ravi; Pezzelle, Sandro; Herbelot, Aurelie Georgette Geraldine; Nabi, Moin; Sangineto, Enver; Bernardi, Raffaella
Files in this record:

File: W17-6938-vision.pdf
Access: open access
Description: main article
Type: Publisher's version (Publisher's layout)
License: All rights reserved
Size: 591.98 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/192745
Citations
  • PubMed Central: N/A
  • Scopus: 12
  • Web of Science: N/A
  • OpenAlex: N/A