Advanced Methods for Remote Sensing Image Captioning / Ricci, Riccardo. - (2025 Apr 11), pp. 1-124.
Advanced Methods for Remote Sensing Image Captioning
Ricci, Riccardo
2025-04-11
Abstract
Someone once said that "an image is worth a thousand words." The saying captures well the amount of semantic information hidden inside these matrices of pixels. Looking at an image instinctively leads us to form hypotheses about which objects it contains, their state, their position, and so on. We thus build, in our heads, a high-level semantic representation of the image and understand its contents. This ability, although instinctive and almost effortless for humans, is very difficult to reproduce in machines. Researchers have worked to replicate human image perception for decades, initially focusing on categorizing images or detecting specific objects, which yields semantic representations tied to fixed concepts. Image captioning (IC), the task of generating natural language descriptions for images, instead lets machines communicate their perception through language and provides a flexible framework to convey diverse semantics. Despite significant advances, IC systems still face challenges in flexibility and reliability, particularly in specialized domains such as remote sensing (RS).

The contributions presented in this thesis are organized into chapters, each addressing a limitation of remote-sensing image captioning (RSIC). Chapter 2 introduces the fundamentals of generative image captioning, while Chapter 3 explores two distinct approaches to enhance robustness and accuracy. First, we propose an ensemble method that leverages the collective knowledge of multiple participants to improve the reliability of image captioning outputs. Second, we fine-tune a pre-trained large vision-language model on an instruction-based multi-task dataset, showing how pre-training on a large vision-language corpus leads to better adaptation to downstream tasks with limited labeled data. We further evaluate how integrating multiple tasks into the same framework influences single-task performance.

Chapter 4 focuses on enhancing the richness and detail of image descriptions, addressing a limitation of current RS image captioning datasets, where complex scenes are often reduced to a single, simple sentence. We propose simulating a visual dialogue between two pre-trained instruction-following models to iteratively elicit more detailed information, and we show, across several metrics, that the resulting descriptions discriminate better between different scenes. Chapter 5 focuses on visual question generation (VQG) for remote-sensing images, which aims at generating natural language questions for a given input image. We introduce a new dataset to overcome the lack of question diversity in existing RS-VQG datasets, and we train a vision-language model to generate questions directly from remote-sensing images. VQG can address a serious limitation of our visual dialogue paradigm, which derives its questions from an initial image description: generating questions directly from the image reduces the risk introduced by an incoherent first description.

Chapter 6 focuses on describing changes between pairs of remote-sensing images. We rely entirely on pre-trained models, eliminating the need for custom model training, and we explore how the instructions provided to these models can steer the description toward the aspects of greatest interest to the user. Chapter 7 introduces our initial exploration of incorporating supplementary geographic information into a remote-sensing image captioning pipeline.
We aim to generate more detailed captions tailored to the specific scene and its geographic features, and we believe that integrating GIS data into vision-language models can enhance their groundedness when solving different tasks. In light of recent advancements in large vision-language models, we see our work on image captioning as reflecting a gradual shift from task-specific models trained to perform a single task to multi-task large vision-language models that can accomplish several tasks at once, framing each as an answer to a different user instruction. Adapting at inference time to multiple tasks and requests, as we show in Chapters 3 and 6, is a fundamental ability that future vision-language models should possess and one that can greatly benefit remote-sensing applications. Finally, we argue that RS vision-language research should move toward the inclusion of additional data sources (e.g., geographic databases) to help vision-language models be more precise and grounded when answering specific image-related queries.
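To make the visual dialogue paradigm of Chapter 4 concrete, here is a minimal Python sketch of the loop between two pre-trained instruction-following models. It is an illustration under assumptions, not the thesis implementation: the ask_vlm and ask_llm callables, the prompts, and the number of turns are hypothetical placeholders standing in for a vision-language model and a question-asking language model.

from typing import Callable, List

def visual_dialogue(
    image,                                  # input remote-sensing image (any representation)
    ask_vlm: Callable[[object, str], str],  # hypothetical: (image, prompt) -> textual answer
    ask_llm: Callable[[str], str],          # hypothetical: text prompt -> generated text
    num_turns: int = 5,
) -> str:
    """Iteratively refine an image description through a simulated dialogue."""
    # Turn 0: the vision-language model produces an initial, possibly shallow description.
    description = ask_vlm(image, "Describe this remote-sensing image.")
    history: List[str] = ["Description: " + description]

    for _ in range(num_turns):
        # The language model reads the dialogue so far and asks for missing details.
        question = ask_llm(
            "Given the dialogue below about an aerial image, ask one question "
            "that would reveal new details.\n" + "\n".join(history)
        )
        # The vision-language model answers the question while looking at the image.
        answer = ask_vlm(image, question)
        history += ["Question: " + question, "Answer: " + answer]

    # Finally, the language model condenses the whole dialogue into a detailed caption.
    return ask_llm(
        "Summarize the following dialogue into one detailed description of the image.\n"
        + "\n".join(history)
    )

In this sketch the questioner never sees the image, mirroring the dialogue setup described above; the VQG models of Chapter 5 would instead generate questions directly from the image.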
File: phd_unitn_Ricci_Riccardo.pdf (open access)
Description: Doctoral thesis
Type: Doctoral Thesis
License: Creative Commons
Size: 2.33 MB
Format: Adobe PDF