Language-Guided Video Understanding with Foundation Models / Zanella, Luca. - (2026 Apr 17).
Language-Guided Video Understanding with Foundation Models
Zanella, Luca
2026-04-17
Abstract
Video understanding systems have achieved strong performance on controlled benchmarks, yet their deployment in real-world scenarios remains limited by assumptions about supervision, training data availability, and offline access to complete video sequences. These constraints are particularly restrictive in settings such as surveillance and procedural assistance, where data is scarce and privacy-sensitive, and decisions must be made online. Recent foundation models provide an opportunity to rethink video understanding as a language-guided inference problem. Leveraging this shift, this thesis investigates how Vision-Language Models (VLMs) and Large Language Models (LLMs) can be used to relax key deployment constraints. The proposed methods build on frozen, language-aligned representations and progressively shift task objectives and decision logic to inference time through natural language, rather than encoding them through task-specific training. The first contribution shows that pre-trained VLM representations can be adapted under weak supervision for video anomaly detection and recognition by exploiting the geometric structure of vision-language embeddings. The second contribution eliminates task-specific training by reformulating anomaly detection as an inference-time reasoning problem solved with LLMs. The third contribution extends this paradigm to causal, online settings by introducing a framework for video step grounding that combines Large Multimodal Models with Bayesian filtering. Finally, the thesis addresses the reliability of language model estimates over video and explores whether synthetic videos generated by text-to-video models can improve the temporal understanding of these models without human annotation. By reducing reliance on task-specific data and offline access to complete videos, and by separately addressing the reliability of language model estimates, the proposed methods make video understanding systems more adaptable across tasks and environments and better suited to real-world deployment constraints.