RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery

IRIS

RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery / Bazi, Y., Bashmal, L., Al Rahhal, M.M., Ricci, R., Melgani, F.. - In: REMOTE SENSING. - ISSN 2072-4292. - 16:9(2024), pp. 147701-147718. [10.3390/rs16091477]

Bazi, Yakoub;Bashmal, Laila;Al Rahhal, Mohamad Mahmoud;Ricci, Riccardo;Melgani, Farid

2024-01-01

Abstract

In this paper, we delve into the innovative application of large language models (LLMs) and their extension, large vision-language models (LVLMs), in the field of remote sensing (RS) image analysis. We particularly emphasize their multi-tasking potential with a focus on image captioning and visual question answering (VQA). In particular, we introduce an improved version of the Large Language and Vision Assistant Model (LLaVA), specifically adapted for RS imagery through a low-rank adaptation approach. To evaluate the model performance, we create the RS-instructions dataset, a comprehensive benchmark dataset that integrates four diverse single-task datasets related to captioning and VQA. The experimental results confirm the model’s effectiveness, marking a step forward toward the development of efficient multi-task models for RS image analysis.
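The abstract mentions a low-rank adaptation approach without giving details in this record. As a rough illustration of the general idea only (not the authors' implementation), the sketch below keeps a weight matrix `w0` frozen and adds a trainable low-rank product `B @ A` scaled by `alpha / r`. The sizes `d`, `k`, `r`, the value of `alpha`, and the helper names `matmul` and `lora_update` are all assumptions chosen for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_update(w0, a, b, alpha):
    """Return w0 + (alpha / r) * (b @ a), where r is the number of rows of a."""
    r = len(a)
    delta = matmul(b, a)  # (d x r) @ (r x k) -> (d x k)
    scale = alpha / r
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


# Hypothetical sizes: a 64 x 64 frozen weight adapted with rank 4.
d, k, r = 64, 64, 4
random.seed(0)
w0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]    # frozen
a = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable
b = [[0.0] * r for _ in range(d)]  # trainable, zero at start

# With b at zero, the adapted weight equals the frozen weight.
w = lora_update(w0, a, b, alpha=8.0)

# Once b is non-zero, the adapted weight moves away from w0.
b2 = [row[:] for row in b]
b2[0][0] = 1.0
w2 = lora_update(w0, a, b2, alpha=8.0)

full_params = d * k        # parameters updated by full fine-tuning
lora_params = r * (d + k)  # parameters updated by the low-rank factors
print(full_params, lora_params)
```

Only `a` and `b` would be trained in such a setup, which is why the trainable parameter count scales with `r * (d + k)` rather than `d * k`.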

Field	Value
Date of publication	2024
Journal title	REMOTE SENSING
Issue number and part	9
DOI	https://dx.doi.org/10.3390/rs16091477
Scopus identifier	2-s2.0-85192998027
WOS identifier	WOS:001219825600001
All authors	Bazi, Yakoub; Bashmal, Laila; Al Rahhal, Mohamad Mahmoud; Ricci, Riccardo; Melgani, Farid
Citation	RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery / Bazi, Y., Bashmal, L., Al Rahhal, M.M., Ricci, R., Melgani, F.. - In: REMOTE SENSING. - ISSN 2072-4292. - 16:9(2024), pp. 147701-147718. [10.3390/rs16091477]
Appears in categories	03.1 Journal article

Files in this item:

File	Access	Type	License	Size	Format
2024_Remote Sensing-RSLLAVA.pdf	Open access	Publisher's version (Publisher’s layout)	Creative Commons	3.63 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/437938
