
Exploring Synthetic Captions for Remote Sensing Vision-Text Foundational Models / Ricci, Riccardo; Frizzera, Alberto; Nunes Gonçalves, Wesley; Marcato Junior, José; Melgani, Farid. - (2024), pp. 73-77. (2024 IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Symposium, M2GARSS 2024, Oran, Algeria, 2024) [10.1109/M2GARSS57310.2024.10537354].

Exploring Synthetic Captions for Remote Sensing Vision-Text Foundational Models

Riccardo Ricci; Farid Melgani
2024-01-01

Abstract

Large multi-modal models, empowered with both textual and visual inputs, have shown tremendous capabilities in a wide range of vision and language tasks. These vision-text models, extensively studied on natural images, have received less attention in the remote sensing (RS) field. Oftentimes, RS research relies on models trained for natural scenes, overlooking the potential improvements offered by systems tailored to the RS scenario. In this paper, we push toward bridging this gap. First, we design a procedure to generate a large-scale RS image-text dataset with synthetic captions. We use it to fine-tune a targeted CLIP model and analyze the effect of training on synthetic captions alone on the model's capabilities. Lastly, we build a benchmark for remote sensing image-text models and evaluate our model along with others recently proposed in the literature. We release the code of our benchmark system and our dataset.
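For context, the CLIP fine-tuning mentioned in the abstract follows the standard contrastive (InfoNCE) objective over image-caption pairs. The sketch below is not the authors' code: it assumes the open_clip library and a hypothetical batch of preprocessed RS images with their synthetic captions, and shows what one symmetric contrastive update looks like under those assumptions.

# Minimal sketch of contrastive fine-tuning on RS image-caption pairs.
# Assumes open_clip and a hypothetical dataloader yielding (images, captions);
# neither the model size nor the hyperparameters come from the paper.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

def clip_step(images, captions):
    """One symmetric InfoNCE update on a batch of (image, synthetic caption) pairs."""
    text = tokenizer(captions)
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    txt_emb = F.normalize(model.encode_text(text), dim=-1)
    logits = model.logit_scale.exp() * img_emb @ txt_emb.t()
    labels = torch.arange(images.size(0), device=logits.device)
    # Cross-entropy over image-to-text and text-to-image similarity matrices.
    loss = (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()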
2024
2024 IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS)
New York, USA
Institute of Electrical and Electronics Engineers Inc.
979-8-3503-5858-2
979-8-3503-5859-9
Ricci, Riccardo; Frizzera, Alberto; Nunes Gonçalves, Wesley; Marcato Junior, José; Melgani, Farid
File in this record:
2024-M2GARSS-Riccardo.pdf
Access: archive administrators only
Type: Publisher's version (publisher's layout)
License: All rights reserved
Size: 998.57 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/437982
Citations
  • PMC: not available
  • Scopus: 0
  • Web of Science: not available
  • OpenAlex: not available