Exploring Synthetic Captions for Remote Sensing Vision-Text Foundational Models / Ricci, R.; Frizzera, A.; Goncalves, W. N.; Marcato Junior, J.; Melgani, F. - (2024), pp. 73-77. (Paper presented at the 2024 IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Symposium, M2GARSS 2024, held in Algeria in 2024) [10.1109/M2GARSS57310.2024.10537354].
Exploring Synthetic Captions for Remote Sensing Vision-Text Foundational Models
Ricci R.; Frizzera A.; Melgani F.
2024-01-01
Abstract
Large multi-modal models, empowered with both textual and visual inputs, have shown tremendous capabilities in a wide range of vision and language tasks. These vision-text models, extensively studied in natural-image settings, have received less attention in the remote sensing (RS) field. Oftentimes, RS research relies on models developed for natural scenes, glossing over the potential improvements offered by systems tailored to the remote sensing scenario. In this paper, we push toward bridging this gap. First, we design a procedure to generate a large-scale RS image-text dataset with synthetic captions. We use it to fine-tune a targeted CLIP model, and we analyze the effect of using only synthetic captions on the model's capabilities. Lastly, we build a benchmark for remote sensing image-text models and evaluate our model along with others recently proposed in the literature. We release the code of our benchmark system and our dataset.
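
The abstract describes fine-tuning a CLIP model on synthetically captioned RS imagery. The sketch below illustrates the general recipe only; it is not the authors' released code. The checkpoint name, the example captions, the blank placeholder images, and the hyperparameters are assumptions, and the Hugging Face transformers CLIP API merely stands in for whatever training stack the paper actually uses.

```python
# Minimal sketch of CLIP-style contrastive fine-tuning on image-caption pairs.
# Illustrative only: NOT the authors' code; captions and images below are
# placeholders standing in for the paper's synthetic RS caption dataset.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# Placeholder batch: blank tiles paired with hypothetical synthetic captions.
captions = [
    "an aerial view of a harbor with several ships at the docks",
    "a satellite image of farmland crossed by a narrow river",
]
images = [Image.new("RGB", (224, 224)) for _ in captions]

model.train()
inputs = processor(text=captions, images=images,
                   return_tensors="pt", padding=True).to(device)
# return_loss=True makes the model compute the symmetric image-text contrastive loss.
loss = model(**inputs, return_loss=True).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice this single step would be wrapped in a DataLoader loop over the full synthetic-caption dataset; the contrastive loss only becomes informative with reasonably large batches of image-text pairs.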