
CapERA: Captioning Events in Aerial Videos


Abstract

In this paper, we introduce the CapERA dataset, which extends the Event Recognition in Aerial Videos (ERA) dataset to aerial video captioning. The newly proposed dataset aims to advance visual–language understanding tasks for UAV videos by pairing each video with diverse textual descriptions. To build the dataset, 2864 aerial videos are manually annotated with a caption covering information such as the main event, objects, place, action, numbers, and time. Additional captions are automatically generated from the manual annotations to capture, as far as possible, the variation in how the same video can be described. Furthermore, we propose a captioning model for the CapERA dataset to provide benchmark results for UAV video captioning. The proposed model is based on the encoder–decoder paradigm, with two configurations for encoding the video. The first configuration encodes the video frames independently with an image encoder; a temporal attention module is then added on top to model the temporal dynamics between the frame features. In the second configuration, we encode the input video directly with a video encoder that employs factorized space–time attention to capture the dependencies within and between frames. To generate captions, a language decoder autoregressively produces the caption from the visual tokens. Experimental results under different evaluation criteria show the challenges of generating captions from aerial videos. We expect that the introduction of CapERA will open interesting new research avenues for integrating natural language processing (NLP) with UAV video understanding.
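To make the two-configuration design concrete, below is a minimal PyTorch sketch of the first configuration described in the abstract: frames encoded independently by an image encoder, a temporal attention module on top, and a language decoder that generates the caption autoregressively. All module names, sizes, layer counts, and the greedy decoding loop are illustrative assumptions, not the authors' implementation; a real system would use a pretrained backbone and the paper's vocabulary and training setup.

# Sketch of configuration 1: per-frame encoding + temporal attention + decoder.
# Every hyperparameter here is a placeholder assumption.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Stand-in image encoder; a real model would use a pretrained ViT/CNN."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
        )

    def forward(self, frames):                   # (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.net(frames.flatten(0, 1))   # encode each frame independently
        return feats.view(b, t, -1)              # (B, T, dim)

class CaptionModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=256, max_len=20):
        super().__init__()
        self.encoder = FrameEncoder(dim)
        # temporal attention over the sequence of per-frame features
        self.temporal_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(dim, 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)
        self.max_len = max_len

    def encode(self, frames):
        feats = self.encoder(frames)
        attn_out, _ = self.temporal_attn(feats, feats, feats)
        return attn_out                          # visual tokens for the decoder

    def forward(self, frames, tokens):           # teacher-forced training pass
        memory = self.encode(frames)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(self.embed(tokens), memory, tgt_mask=mask)
        return self.lm_head(out)

    @torch.no_grad()
    def generate(self, frames, bos_id=1, eos_id=2):
        # greedy autoregressive decoding from the visual tokens
        memory = self.encode(frames)
        tokens = torch.full((frames.size(0), 1), bos_id, dtype=torch.long)
        for _ in range(self.max_len):
            out = self.decoder(self.embed(tokens), memory)
            next_tok = self.lm_head(out[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)
            if (next_tok == eos_id).all():
                break
        return tokens

model = CaptionModel()
video = torch.randn(2, 8, 3, 64, 64)             # 2 clips of 8 frames each
caption_ids = model.generate(video)              # (2, <=21) token IDs
print(caption_ids.shape)

The second configuration would swap the FrameEncoder and temporal attention for a single video encoder whose attention is factorized across space and time; the autoregressive decoding interface stays the same.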
Bashmal, L; Bazi, Y; Al Rahhal, M. M; Zuair, M; Melgani, F
CapERA: Captioning Events in Aerial Videos / Bashmal, L; Bazi, Y; Al Rahhal, M. M; Zuair, M; Melgani, F. - In: REMOTE SENSING. - ISSN 2072-4292. - ELECTRONIC. - 15:8(2023), pp. 213901-213916. [10.3390/rs15082139]
Files in this record:
2023_Remote Sensing-CapEra-Yakoub.pdf; open access; publisher's version (publisher's layout); Creative Commons license; Adobe PDF; 2.57 MB

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/400700