
Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery / Bazi, Y.; Rahhal, M. M. A.; Mekhalfi, M. L.; Zuair, M. A. A.; Melgani, F. - In: IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING. - ISSN 0196-2892. - ELECTRONIC. - 60:(2022), pp. 470801101-470801111. [10.1109/TGRS.2022.3192460]

Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery

Bazi Y.; Mekhalfi M. L.; Melgani F.
2022-01-01

Abstract

Recently, transformer-based vision-language models have been gaining popularity for the joint modeling of visual and textual modalities. In particular, they show impressive results when transferred to several downstream tasks such as zero- and few-shot classification. In this article, we propose a visual question answering (VQA) approach for remote sensing images based on these models. The VQA task attempts to provide answers to image-related questions. While VQA has gained popularity in computer vision, it is not yet widespread in remote sensing. First, we use the contrastive language-image pretraining (CLIP) network to embed the image patches and question words into a sequence of visual and textual representations. Then, we learn attention mechanisms to capture the intradependencies and interdependencies within and between these representations. Afterward, we generate the final answer by averaging the predictions of two classifiers mounted on top of the resulting contextual representations. In the experiments, we study the performance of the proposed approach on two datasets acquired with Sentinel-2 and aerial sensors. In particular, we demonstrate that our approach can achieve better results with a reduced training set size compared with the recent state of the art.
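As a rough illustration of the pipeline described in the abstract, the sketch below wires frozen CLIP encoders to an attention-based fusion stage and two averaged classification heads. This is not the authors' released code: the CLIP checkpoint name, the dimensions, and the fusion choice (a single joint self-attention encoder standing in for the paper's intra-/inter-dependency attention stage) are assumptions, and it relies on PyTorch and the HuggingFace transformers CLIP implementation.

```python
# Hypothetical sketch, not the authors' code: a bi-modal VQA head on top of
# frozen CLIP encoders. Requires `torch` and HuggingFace `transformers`.
import torch
import torch.nn as nn
from transformers import CLIPModel


class BiModalVQA(nn.Module):
    def __init__(self, num_answers, clip_name="openai/clip-vit-base-patch32",
                 d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        self.clip.requires_grad_(False)          # keep the pretrained encoders frozen
        v_dim = self.clip.config.vision_config.hidden_size
        t_dim = self.clip.config.text_config.hidden_size
        # Project patch and word embeddings to a common width.
        self.v_proj = nn.Linear(v_dim, d_model)
        self.t_proj = nn.Linear(t_dim, d_model)
        # Self-attention over the concatenated sequence is used here as one
        # simple stand-in for the paper's intra-/inter-dependency attention.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Two classifiers, one per modality; their predictions are averaged.
        self.vis_head = nn.Linear(d_model, num_answers)
        self.txt_head = nn.Linear(d_model, num_answers)

    def forward(self, pixel_values, input_ids, attention_mask):
        # Sequences of visual (patch) and textual (word) representations from CLIP.
        v = self.clip.vision_model(pixel_values=pixel_values).last_hidden_state
        t = self.clip.text_model(input_ids=input_ids,
                                 attention_mask=attention_mask).last_hidden_state
        v, t = self.v_proj(v), self.t_proj(t)
        fused = self.fusion(torch.cat([v, t], dim=1))    # joint contextual tokens
        v_ctx = fused[:, : v.size(1)].mean(dim=1)        # pooled visual context
        t_ctx = fused[:, v.size(1):].mean(dim=1)         # pooled textual context
        # Final answer: average of the two classifiers' predictions.
        return (self.vis_head(v_ctx) + self.txt_head(t_ctx)) / 2
```

A full training setup would map the candidate answers to class indices and minimize a cross-entropy loss over the averaged logits; the widths, head counts, and layer counts above are placeholders rather than the paper's hyperparameters.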
Files in this record:
2022_TGRS-VQA-Yakoub.pdf (Adobe PDF, 4.62 MB): Publisher's version (Publisher's layout), All rights reserved. Access restricted to repository administrators.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/373008
Citations
  • Scopus: 15
  • Web of Science: 11