Vision and Language Integration: Moving beyond Objects / Shekhar, Ravi; Pezzelle, Sandro; Herbelot, Aurelie Georgette Geraldine; Nabi, Moin; Sangineto, Enver; Bernardi, Raffaella. - ELECTRONIC. - (2017), pp. 1-6. (Paper presented at IWCS 2017, held in Montpellier, France, 19th-22nd September 2017).
Vision and Language Integration: Moving beyond Objects
Ravi Shekhar; Sandro Pezzelle; Aurelie Herbelot; Moin Nabi; Enver Sangineto; Raffaella Bernardi
2017-01-01
Abstract
Recent years have seen an explosion of work on the integration of vision and language data. New tasks such as Image Captioning and Visual Question Answering have been proposed, and impressive results have been achieved. There is now a shared desire to gain an in-depth understanding of the strengths and weaknesses of these models. To this end, several datasets have been proposed to challenge the state of the art. These datasets, however, mostly focus on the interpretation of objects (as denoted by nouns in the corresponding captions). In this paper, we reuse a previously proposed methodology to evaluate the ability of current systems to move beyond objects and deal with attributes (as denoted by adjectives), actions (verbs), manner (adverbs) and spatial relations (prepositions). We show that the coarse representations given by current approaches are not informative enough to interpret attributes or actions, whilst spatial relations fare somewhat better, but only in attention models.
File | Description | Type | License | Size | Format
---|---|---|---|---|---
W17-6938-vision.pdf (open access) | main article | Publisher's layout version | All rights reserved | 591.98 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.