Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery

Bazi, Y.; Rahhal, M. M. A.; Mekhalfi, M. L.; Zuair, M. A. A.; Melgani, F.

doi:10.1109/TGRS.2022.3192460

Recently, transformer-based vision-language models have been gaining popularity for joint modeling of visual and textual modalities. In particular, they show impressive results when transferred to several downstream tasks such as zero-shot and few-shot classification. In this article, we propose a visual question answering (VQA) approach for remote sensing images based on these models. The VQA task attempts to provide answers to image-related questions. Although VQA has gained popularity in computer vision, it is not yet widespread in remote sensing. First, we use the contrastive language-image pretraining (CLIP) network to embed the image patches and question words into sequences of visual and textual representations. Then, we learn attention mechanisms to capture the intradependencies within each sequence and the interdependencies between them. Afterward, we generate the final answer by averaging the predictions of two classifiers mounted on top of the resulting contextual representations. In the experiments, we study the performance of the proposed approach on two datasets acquired with Sentinel-2 and aerial sensors. In particular, we demonstrate that our approach can achieve better results with a reduced training set size compared with the recent state of the art.
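The pipeline outlined in the abstract (embed, attend within and across modalities, average two classifiers) can be illustrated with a minimal sketch. This is not the authors' implementation: random vectors stand in for CLIP embeddings, the attention is single-head with no learned projections, the classifier weights are random and untrained, and every name and size (`attention`, `classify`, `dim`, `num_answers`, the token counts) is a hypothetical choice made for illustration.

```python
import math
import random

random.seed(0)

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def random_matrix(rows, cols):
    """Gaussian random matrix as a list of rows."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def classify(tokens, weights):
    """Mean-pool the tokens, apply a linear layer, return class probabilities."""
    dim = len(tokens[0])
    pooled = [sum(t[j] for t in tokens) / len(tokens) for j in range(dim)]
    logits = [sum(pooled[j] * weights[j][c] for j in range(dim))
              for c in range(len(weights[0]))]
    return softmax(logits)

# Assumed sizes: 9 image-patch tokens, 4 question-word tokens, width 16, 3 answers.
dim, num_answers = 16, 3
visual = random_matrix(9, dim)    # placeholder for CLIP patch embeddings
textual = random_matrix(4, dim)   # placeholder for CLIP word embeddings

# Intradependencies: each modality attends to itself.
visual_ctx = attention(visual, visual, visual)
textual_ctx = attention(textual, textual, textual)

# Interdependencies: each modality attends to the other.
visual_cross = attention(visual_ctx, textual_ctx, textual_ctx)
textual_cross = attention(textual_ctx, visual_ctx, visual_ctx)

# Two classifiers, one per stream, with random (untrained) weights.
p_visual = classify(visual_cross, random_matrix(dim, num_answers))
p_textual = classify(textual_cross, random_matrix(dim, num_answers))

# Final answer: average the two predicted distributions and take the argmax.
p_final = [(a + b) / 2 for a, b in zip(p_visual, p_textual)]
answer_index = max(range(num_answers), key=lambda c: p_final[c])
```

Averaging the two probability vectors keeps the result a valid distribution, so the final answer is simply the class on which the two streams jointly place the most mass.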

Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery / Bazi, Y.; Rahhal, M. M. A.; Mekhalfi, M. L.; Zuair, M. A. A.; Melgani, F. - In: IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING. - ISSN 0196-2892. - Electronic. - 60:(2022), pp. 470801101-470801111. [10.1109/TGRS.2022.3192460]

2022-01-01

Date of publication: 2022
Journal title: IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING
DOI: https://dx.doi.org/10.1109/TGRS.2022.3192460
Scopus identifier: 2-s2.0-85135211976
WOS identifier: WOS:000837291200015
Appears in types: 03.1 Journal article

Files in this record:

File	Size	Format
2022_TGRS-VQA-Yakoub.pdf (publisher's layout; all rights reserved; access restricted to archive managers)	4.62 MB	Adobe PDF