Exploring Synthetic Captions for Remote Sensing Vision-Text Foundational Models

Ricci R.; Frizzera A.; Melgani F.
2024-01-01

Abstract

Large multi-modal models, empowered with both textual and visual inputs, have shown tremendous capabilities in a wide range of vision and language tasks. Such vision-text models, extensively studied in natural-image settings, have received less attention in the remote sensing (RS) field. Oftentimes, RS research relies on models developed for natural scenes, overlooking the potential improvements offered by systems tailored to the RS scenario. In this paper, we push toward bridging this gap. First, we design a procedure to generate a large-scale RS image-text dataset with synthetic captions. We use it to fine-tune a targeted CLIP model and analyze the effect of using only synthetic captions on the model's capabilities. Lastly, we build a benchmark for remote sensing image-text models and evaluate our model along with others recently proposed in the literature. We release the code of our benchmark system and our dataset.
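
The abstract describes contrastive (CLIP-style) fine-tuning on RS images paired with synthetic captions. Below is a minimal sketch of what such a fine-tuning step could look like; it assumes the open_clip library, a ViT-B-32 backbone, and a hypothetical batch of preprocessed images with their generated captions. It is an illustration under these assumptions, not the authors' implementation.

# Sketch: contrastive fine-tuning of a pretrained CLIP model on
# remote-sensing image / synthetic-caption pairs (assumes open_clip).
import torch
import torch.nn.functional as F
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained CLIP backbone; the exact variant used in the paper is not specified here.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)

def clip_loss(image_features, text_features, logit_scale):
    # Symmetric InfoNCE loss over the in-batch image/text pairs.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale * image_features @ text_features.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def train_step(images, captions):
    # images: batch of preprocessed RS image tensors;
    # captions: list of synthetic caption strings for the same batch.
    images = images.to(device)
    texts = tokenizer(captions).to(device)
    image_features = model.encode_image(images)
    text_features = model.encode_text(texts)
    loss = clip_loss(image_features, text_features, model.logit_scale.exp())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In such a setup, the text tower only ever sees the generated captions, which is what allows the effect of purely synthetic supervision to be isolated and analyzed.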
2024
2024 IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Symposium, M2GARSS 2024 - Proceedings
New York, USA
Institute of Electrical and Electronics Engineers Inc.
Ricci, R.; Frizzera, A.; Goncalves, W. N.; Marcato Junior, J.; Melgani, F.
Exploring Synthetic Captions for Remote Sensing Vision-Text Foundational Models / Ricci, R.; Frizzera, A.; Goncalves, W. N.; Marcato Junior, J.; Melgani, F. - (2024), pp. 73-77. (Paper presented at the 2024 IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Symposium, M2GARSS 2024, held in Algeria (DZA) in 2024) [10.1109/M2GARSS57310.2024.10537354].

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/437982