Image captioning is a technique that automatically generates natural language descriptions of the contents of an image. On the one hand, information in the form of natural language can enhance accessibility by reducing the expertise required to process, analyze, and exploit remote sensing images, while on the other, it provides a direct and general form of communication. However, image captioning is usually restricted to a single sentence, which barely captures the rich semantic information that typically characterizes remote sensing (RS) images. In this paper, we aim to move one step forward by proposing a captioning system that, mimicking human behavior, adopts dialogue as a tool to explore and dig for information, leading to more detailed and comprehensive descriptions of RS scenes. The system relies on a question–answer scheme fed by a query image and summarizes the dialogue content with ChatGPT. Experiments carried out on two benchmark remote sensing datasets confirm the potential of such an approach in the context of semantic information mining. Strengths and weaknesses are highlighted and discussed, along with some possible future developments.

Machine-to-Machine Visual Dialoguing with ChatGPT for Enriched Textual Image Description / Ricci, Riccardo; Bazi, Yakoub; Melgani, Farid. - In: REMOTE SENSING. - ISSN 2072-4292. - 16:3(2024), pp. 44101-44118. [10.3390/rs16030441]

Machine-to-Machine Visual Dialoguing with ChatGPT for Enriched Textual Image Description

Ricci, Riccardo;Bazi, Yakoub;Melgani, Farid
2024-01-01

Files in this item:
2024_Remote Sensing-ChatGPT.pdf

open access

Type: Publisher’s version (Publisher’s layout)
License: Creative Commons
Size: 5.7 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/437939