Exploring Synthetic Captions for Remote Sensing Vision-Text Foundational Models / Ricci, R.; Frizzera, A.; Goncalves, W. N.; Marcato Junior, J.; Melgani, F. - (2024), pp. 73-77. (Paper presented at the 2024 IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Symposium, M2GARSS 2024, held in Algeria in 2024) [10.1109/M2GARSS57310.2024.10537354].
Exploring Synthetic Captions for Remote Sensing Vision-Text Foundational Models
Ricci R.; Frizzera A.; Melgani F.
2024-01-01
Abstract
Large multi-modal models, empowered with both textual and visual inputs, have shown tremendous capabilities in a wide range of vision and language tasks. These vision-text models, extensively studied in natural-image settings, have received less attention in the remote sensing (RS) field. Oftentimes, RS research relies on models developed for natural scenes, glossing over the potential improvements offered by systems tailored to the remote sensing scenario. In this paper, we push toward bridging this gap. First, we design a procedure to generate a large-scale RS image-text dataset with synthetic captions. We use it to fine-tune a targeted CLIP model, and we analyze the effect of using only synthetic captions on the model's capabilities. Lastly, we build a benchmark for remote sensing image-text models and evaluate our model along with others recently proposed in the literature. We release the code of our benchmark system and our dataset.
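
The abstract describes fine-tuning a CLIP model on synthetically captioned RS imagery. The sketch below illustrates the general recipe only; it is not the authors' released code. The checkpoint name, the example captions, the blank placeholder images, and the hyperparameters are assumptions, and the Hugging Face transformers CLIP API merely stands in for whatever training stack the paper actually uses.

```python
# Minimal sketch of CLIP-style contrastive fine-tuning on image-caption pairs.
# Illustrative only: NOT the authors' code; captions and images below are
# placeholders standing in for the paper's synthetic RS caption dataset.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# Placeholder batch: blank tiles paired with hypothetical synthetic captions.
captions = [
    "an aerial view of a harbor with several ships at the docks",
    "a satellite image of farmland crossed by a narrow river",
]
images = [Image.new("RGB", (224, 224)) for _ in captions]

model.train()
inputs = processor(text=captions, images=images,
                   return_tensors="pt", padding=True).to(device)
# return_loss=True makes the model compute the symmetric image-text contrastive loss.
loss = model(**inputs, return_loss=True).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice this single step would be wrapped in a DataLoader loop over the full synthetic-caption dataset; the contrastive loss only becomes informative with reasonably large batches of image-text pairs.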