Text-Enhanced Zero-Shot Action Recognition: A Training-Free Approach / Bosetti, Massimo; Zhang, Shibingfeng; Liberatori, Benedetta; Zara, Giacomo; Ricci, Elisa; Rota, Paolo. - 15315:(2024), pp. 327-342. (ICPR, Kolkata, 1-5/12/2024) [10.1007/978-3-031-78354-8_21].

Text-Enhanced Zero-Shot Action Recognition: A Training-Free Approach

Bosetti, Massimo; Zara, Giacomo; Ricci, Elisa; Rota, Paolo
2024-01-01

Abstract

Vision-language models (VLMs) have demonstrated remarkable performance across various visual tasks, leveraging joint learning of visual and textual representations. While these models excel in zero-shot image tasks, their application to zero-shot video action recognition (ZS-VAR) remains challenging due to the dynamic and temporal nature of actions. Existing methods for ZS-VAR typically require extensive training on specific datasets, which can be resource-intensive and may introduce domain biases. In this work, we propose Text-Enhanced Action Recognition (TEAR), a simple approach to ZS-VAR that is training-free and does not require the availability of training data or extensive computational resources. Drawing inspiration from recent findings in the vision and language literature, we utilize action descriptors for decomposition and contextual information to enhance zero-shot action recognition. Through experiments on the UCF101, HMDB51, and Kinetics-600 datasets, we showcase the effectiveness and applicability of our proposed approach in addressing the challenges of ZS-VAR. (The code will be released later at https://github.com/MaXDL4Phys/tear).
2024
Lecture Notes in Computer Science (LNCS, volume 15315)
Springer International Publishing AG, Gewerbestrasse 11, Cham, CH-6330, Switzerland
9783031783531
9783031783548
Bosetti, Massimo; Zhang, Shibingfeng; Liberatori, Benedetta; Zara, Giacomo; Ricci, Elisa; Rota, Paolo

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/470970

Citations
  • Scopus: 3
  • OpenAlex: 1