Test-Time Zero-Shot Temporal Action Localization

Liberatori, Benedetta; Conti, Alessandro; Rota, Paolo; Wang, Yiming; Ricci, Elisa
2024-01-01

Abstract

Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate, in untrimmed videos, actions that were not seen during training. Existing ZS-TAL methods fine-tune a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally introduces a domain bias into the learned model, which may adversely affect its ability to generalize to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically new perspective, relaxing the requirement for training data. To this end, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM) at test time, operating in three steps. First, a video-level pseudo-label for the action category is computed by aggregating information from the entire video. Then, action localization is performed with a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are used to refine the action region proposals. We validate the effectiveness of T3AL with experiments on the THUMOS14 and ActivityNet-v1.3 datasets. Our results show that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of test-time adaptation.
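For a concrete picture of the three-step pipeline the abstract describes, the sketch below traces it on pre-extracted features. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`video_pseudo_label`, `propose_segments`, `refine_with_captions`), the CLIP-like L2-normalized frame/text embeddings, the smoothing window, and the thresholds are all hypothetical, and the paper's self-supervised test-time adaptation in step two is replaced here by a fixed smooth-and-threshold heuristic.

```python
import numpy as np

def video_pseudo_label(frame_feats, text_feats):
    """Step 1: aggregate the whole video into one embedding and pick
    the closest class prompt as the video-level pseudo-label.
    frame_feats: (T, D) L2-normalized; text_feats: (C, D) L2-normalized."""
    video_feat = frame_feats.mean(axis=0)
    video_feat /= np.linalg.norm(video_feat)
    return int(np.argmax(text_feats @ video_feat))

def propose_segments(frame_feats, class_feat, window=5, threshold=0.5):
    """Step 2 (placeholder): score every frame against the pseudo-label
    class, smooth the scores, and threshold them into contiguous
    (start, end) proposals. The paper instead adapts the representations
    at test time with a self-supervision-inspired objective."""
    scores = frame_feats @ class_feat                              # (T,)
    smoothed = np.convolve(scores, np.ones(window) / window, mode="same")
    # normalize to [0, 1] so a single threshold is meaningful
    smoothed = (smoothed - smoothed.min()) / (np.ptp(smoothed) + 1e-8)
    active = smoothed > threshold
    segments, start = [], None
    for t, on in enumerate(active):
        if on and start is None:
            start = t
        elif not on and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments

def refine_with_captions(segments, caption_feats, class_feat, min_sim=0.2):
    """Step 3: keep a proposal only if the embedded frame captions inside
    it agree, on average, with the pseudo-labelled class (a stand-in for
    the caption-based refinement described in the abstract)."""
    return [(s, e) for s, e in segments
            if caption_feats[s:e].mean(axis=0) @ class_feat >= min_sim]
```

The point of the sketch is the control flow, not the numbers: classify once at video level, localize frames against that single pseudo-label, then use an independent caption signal to prune proposals. In the actual method, step two is where the test-time adaptation happens; the fixed heuristic above only marks its place in the pipeline.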
2024
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
New York
IEEE
ISBN 9798350353006
Test-Time Zero-Shot Temporal Action Localization / Liberatori, Benedetta; Conti, Alessandro; Rota, Paolo; Wang, Yiming; Ricci, Elisa. - (2024), pp. 18720-18729. (Paper presented at the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, held in Seattle on 17 June 2024) [10.1109/cvpr52733.2024.01771].
Files in this record:

Liberatori_Test-Time_Zero-Shot_Temporal_Action_Localization_CVPR_2024_paper.pdf
  Access: open access
  Type: Refereed author's manuscript (post-print)
  License: All rights reserved
  Size: 1.61 MB
  Format: Adobe PDF

Test-Time_Zero-Shot_Temporal_Action_Localization.pdf
  Access: archive administrators only
  Type: Publisher's layout (editorial version)
  License: All rights reserved
  Size: 1.49 MB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this record: https://hdl.handle.net/11572/437792
Citations
  • PMC: ND (not available)
  • Scopus: 2
  • Web of Science (ISI): 1
  • OpenAlex: ND (not available)