Harnessing Large Language Models for Training-Free Video Anomaly Detection

Zanella, Luca; Menapace, Willi; Mancini, Massimiliano; Wang, Yiming; Ricci, Elisa
Publication date: 2024-01-01

Abstract

Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods are prone to being domain-specific, and thus costly for practical deployment, as any domain change will involve data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.
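The abstract outlines a three-stage, training-free pipeline: per-frame VLM captioning, LLM-based temporal aggregation and anomaly scoring, and cross-modal similarity for caption cleaning and score refinement. The following is a minimal Python sketch of that flow under stated assumptions only: every callable (caption_fn, embed_image_fn, embed_text_fn, summarize_fn, score_fn) and every parameter (window, k_refine) is a hypothetical placeholder standing in for the paper's actual captioner, LLM prompts, and modality-aligned encoder, not the published implementation.

# Hedged sketch of a LAVAD-style training-free pipeline (assumptions, not the paper's code).
from typing import Callable, List, Sequence
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def lavad_style_scores(
    frames: Sequence,                                   # decoded video frames
    caption_fn: Callable[[object], str],                # hypothetical VLM captioner: frame -> caption
    embed_image_fn: Callable[[Sequence], np.ndarray],   # hypothetical image encoder: frames -> (N, d)
    embed_text_fn: Callable[[List[str]], np.ndarray],   # hypothetical text encoder: captions -> (N, d)
    summarize_fn: Callable[[List[str]], str],           # hypothetical LLM prompt: caption window -> summary
    score_fn: Callable[[str], float],                   # hypothetical LLM prompt: summary -> score in [0, 1]
    window: int = 5,                                    # temporal half-window for aggregation
    k_refine: int = 5,                                  # neighbours used to smooth scores
) -> np.ndarray:
    n = len(frames)
    raw_captions = [caption_fn(f) for f in frames]

    # 1) Caption cleaning via cross-modal similarity: replace each frame's caption
    #    with the caption (from the whole video) whose text embedding best matches
    #    that frame's image embedding.
    img_emb = embed_image_fn(frames)
    txt_emb = embed_text_fn(raw_captions)
    best = cosine_sim(img_emb, txt_emb).argmax(axis=1)
    captions = [raw_captions[j] for j in best]

    # 2) Temporal aggregation + anomaly scoring: the LLM summarizes a temporal
    #    window of captions, then rates how anomalous the summarized scene is.
    scores = np.empty(n)
    for i in range(n):
        ctx = captions[max(0, i - window): i + window + 1]
        scores[i] = score_fn(summarize_fn(ctx))

    # 3) Score refinement: smooth each score over the k frames whose image
    #    embeddings are most similar.
    sim = cosine_sim(img_emb, img_emb)
    refined = np.empty(n)
    for i in range(n):
        nn = np.argsort(-sim[i])[:k_refine]
        refined[i] = scores[nn].mean()
    return refined


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end without any model weights.
    rng = np.random.default_rng(0)
    frames = list(range(8))
    demo = lavad_style_scores(
        frames,
        caption_fn=lambda f: f"frame {f}",
        embed_image_fn=lambda fs: rng.normal(size=(len(fs), 16)),
        embed_text_fn=lambda cs: rng.normal(size=(len(cs), 16)),
        summarize_fn=lambda caps: " ".join(caps),
        score_fn=lambda s: 0.0,
    )
    print(demo)

In a real setting, the callables would wrap a pre-trained captioning VLM, a modality-aligned image/text encoder, and two LLM prompts (one for summarizing the caption window, one for emitting an anomaly score), which is what makes the approach training-free.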
Year: 2024
Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Place: Los Alamitos, CA, USA
Publisher: IEEE Computer Society
ISBN: 9798350353006
Authors: Zanella, Luca; Menapace, Willi; Mancini, Massimiliano; Wang, Yiming; Ricci, Elisa
Citation: Harnessing Large Language Models for Training-Free Video Anomaly Detection / Zanella, Luca; Menapace, Willi; Mancini, Massimiliano; Wang, Yiming; Ricci, Elisa. - (2024), pp. 18527-18536. (2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, 17-21 June 2024) [10.1109/cvpr52733.2024.01753].
Files in this product:

File: Zanella_Harnessing_Large_Language_Models_for_Training-free_Video_Anomaly_Detection_CVPR_2024_paper.pdf
Access: Archive administrators only
Type: Other attached material (Other attachments)
License: All rights reserved
Size: 1.47 MB
Format: Adobe PDF

File: Harnessing_Large_Language_Models_for_Training-Free_Video_Anomaly_Detection.pdf
Access: Archive administrators only
Type: Publisher's layout
License: All rights reserved
Size: 1.58 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/437738
Citations
  • PMC: ND
  • Scopus: 38
  • Web of Science (ISI): 23
  • OpenAlex: 34