RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery / Bazi, Yakoub; Bashmal, Laila; Al Rahhal, Mohamad Mahmoud; Ricci, Riccardo; Melgani, Farid. - In: REMOTE SENSING. - ISSN 2072-4292. - 2024, 16:9(2024), pp. 147701-147718. [10.3390/rs16091477]

RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery

Bazi, Yakoub;Ricci, Riccardo;Melgani, Farid
2024-01-01

Abstract

In this paper, we investigate the application of large language models (LLMs) and their extension, large vision-language models (LVLMs), to remote sensing (RS) image analysis. We emphasize their multi-tasking potential, focusing on image captioning and visual question answering (VQA). Specifically, we introduce an improved version of the Large Language and Vision Assistant Model (LLaVA), adapted to RS imagery through a low-rank adaptation approach. To evaluate the model's performance, we create the RS-instructions dataset, a benchmark that integrates four diverse single-task datasets for captioning and VQA. The experimental results confirm the model's effectiveness and mark a step toward efficient multi-task models for RS image analysis.
Bazi, Yakoub; Bashmal, Laila; Al Rahhal, Mohamad Mahmoud; Ricci, Riccardo; Melgani, Farid
Files in this item:
File: 2024_Remote Sensing-RSLLAVA.pdf
Access: open access
Type: Publisher's version (publisher's layout)
License: Creative Commons
Size: 3.63 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/437938