Linguistic issues behind visual question answering

Bernardi, Raffaella; Pezzelle, Sandro

doi:10.1111/lnc3.12417

Answering a question that is grounded in an image is a crucial ability that requires understanding the question, the visual context, and their interaction at many linguistic levels: among others, semantics, syntax and pragmatics. As such, visually-grounded questions have long been of interest to theoretical linguists and cognitive scientists. Moreover, they have inspired the first attempts to computationally model natural language understanding, where pioneering systems were faced with the highly challenging task—still unsolved—of jointly dealing with syntax, semantics and inference whilst understanding a visual context. Boosted by impressive advancements in machine learning, the task of answering visually-grounded questions has experienced a renewed interest in recent years, to the point of becoming a research sub-field at the intersection of computational linguistics and computer vision. In this paper, we review current approaches to the problem which encompass the development of datasets, models and frameworks. We conduct our investigation from the perspective of the theoretical linguists; we extract from pioneering computational linguistic work a list of desiderata that we use to review current computational achievements. We acknowledge that impressive progress has been made to reconcile the engineering with the theoretical view. At the same time, we claim that further research is needed to get to a unified approach which jointly encompasses all the underlying linguistic problems. We conclude the paper by sharing our own desiderata for the future.

Linguistic issues behind visual question answering / Bernardi, R., Pezzelle, S. - In: LANGUAGE AND LINGUISTICS COMPASS. - ISSN 1749-818X. - Electronic. - 15:6(2021), pp. 1241701-1241725. [10.1111/lnc3.12417]

2021-01-01

Date of publication	2021
Journal title	LANGUAGE AND LINGUISTICS COMPASS
Issue number and part	6
DOI	https://dx.doi.org/10.1111/lnc3.12417
Scopus identifier	2-s2.0-85109366151
WOS identifier	WOS:000670612600002
All authors	Bernardi, Raffaella; Pezzelle, Sandro
Citation	Linguistic issues behind visual question answering / Bernardi, R., Pezzelle, S. - In: LANGUAGE AND LINGUISTICS COMPASS. - ISSN 1749-818X. - Electronic. - 15:6(2021), pp. 1241701-1241725. [10.1111/lnc3.12417]
Publication type	03.1 Journal article

Files in this item:

File	Size	Format
Language and Linguist Compass - 2021 - Bernardi - Linguistic issues behind visual question answering.pdf (open access; description: main article; type: publisher's version (Publisher’s layout); license: Creative Commons)	1.88 MB	Adobe PDF