Language-Guided Video Understanding with Foundation Models / Zanella, Luca. - (2026 Apr 17).
Language-Guided Video Understanding with Foundation Models
Zanella, Luca
2026-04-17
Abstract
Video understanding systems have achieved strong performance on controlled benchmarks, yet their deployment in real-world scenarios remains limited by assumptions about supervision, training data availability, and offline access to complete video sequences. These constraints are particularly restrictive in settings such as surveillance and procedural assistance, where data is scarce and privacy-sensitive, and decisions must be made online. Recent foundation models provide an opportunity to rethink video understanding as a language-guided inference problem. Leveraging this shift, this thesis investigates how Vision-Language Models (VLMs) and Large Language Models (LLMs) can be used to relax key deployment constraints. The proposed methods build on frozen, language-aligned representations and progressively shift task objectives and decision logic to inference time through natural language, rather than encoding them through task-specific training. The first contribution shows that pre-trained VLM representations can be adapted under weak supervision for video anomaly detection and recognition by exploiting the geometric structure of vision-language embeddings. The second contribution eliminates task-specific training by reformulating anomaly detection as an inference-time reasoning problem solved with LLMs. The third contribution extends this paradigm to causal, online settings by introducing a framework for video step grounding that combines Large Multimodal Models with Bayesian filtering. Finally, the thesis addresses the reliability of language model estimates over video and explores whether synthetic videos generated by text-to-video models can improve the temporal understanding of these models without human annotation. By reducing reliance on task-specific data and offline access to complete videos, and by separately addressing the reliability of language model estimates, the proposed methods make video understanding systems more adaptable across tasks and environments and better suited to real-world deployment constraints.