Language-Grounded Post-Completion Mistake Detection in Procedural Videos / Loginova, O. - (2026 Apr 30), pp. 1-184.
Language-Grounded Post-Completion Mistake Detection in Procedural Videos
Loginova, Olga
2026-04-30
Abstract
Mistake detection in procedural videos is the task of identifying errors in activities such as cooking, assembly, or repair, and it remains a critical yet underexplored challenge. This thesis focuses on Post-Completion Mistake Detection (PCMD), where a model must verify a full procedure execution and localize deviations from the intended protocol. PCMD is held back by fragmented error taxonomies, scarce and staged datasets, and vision-first models that are complex, computationally demanding, and often domain-specific. This thesis develops a unified, language-centered PCMD framework. First, it establishes the limitations of end-to-end Vision-Language Models (VLMs) for procedural verification. By exposing gaps in temporal reasoning about ongoing and completed actions, failures to understand cause-effect relations in procedural structures, and a tendency towards "blind guessing", the thesis demonstrates that VLMs struggle with fine-grained temporal logic. These diagnostics indicate that reliable mistake detection requires structured, interpretable mechanisms rather than black-box VLM reasoning alone. Second, to address the data bottleneck, the thesis introduces PIE-V, a semi-synthetic pipeline for generating mistake-aware datasets. Using psychology-informed error planning, PIE-V injects semantic mistakes into clean procedures. It delivers controllable, error-rich variants that approximate real-world error scenarios, in contrast to the staged mistakes of current mistake-aware video datasets, and it outperforms freeform LLM-based generation in coherence and perceived realism. Third, the thesis presents ChronoFix, a lightweight, language-grounded PCMD framework. The method grounds video executions into step sequences, compares raw step descriptions, semantic role representations, and action-object abstractions, and verifies the resulting traces with a Hidden Markov Model.
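To make the idea of verifying a step trace with a Hidden Markov Model concrete, the following is a minimal sketch and not the thesis's ChronoFix implementation. It assumes that each grounded step has already been compared with its expected step and bucketed as `match`, `partial`, or `mismatch`, and that the hidden state of each step is either `ok` or `mistake`. The state names, the observation buckets, and all probabilities are hypothetical values chosen by hand for illustration. Viterbi decoding then recovers the most likely state sequence.

```python
import math

# Hypothetical hidden states: whether a step was executed correctly.
STATES = ("ok", "mistake")

# Hypothetical hand-picked parameters, for illustration only.
START = {"ok": 0.9, "mistake": 0.1}
TRANS = {
    "ok": {"ok": 0.85, "mistake": 0.15},
    "mistake": {"ok": 0.6, "mistake": 0.4},
}
EMIT = {
    "ok": {"match": 0.8, "partial": 0.15, "mismatch": 0.05},
    "mistake": {"match": 0.1, "partial": 0.3, "mismatch": 0.6},
}


def viterbi(observations):
    """Return the most likely hidden state sequence for the observations."""
    if not observations:
        return []
    # scores[s] is the best log-probability of any path ending in state s.
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this step.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def flag_mistakes(observations):
    """Return the indices of steps decoded as mistakes."""
    return [i for i, s in enumerate(viterbi(observations)) if s == "mistake"]


if __name__ == "__main__":
    trace = ["match", "match", "mismatch", "match"]
    print(flag_mistakes(trace))
```

Decoding the whole trace at once, rather than thresholding each step independently, lets the transition probabilities express how likely an error is to follow a correct step, which is one reason an explicit sequence model stays easy to inspect.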
Across CaptainCook4D, EgoPER, EgoOops, and auxiliary Assembly101 experiments, the results show that semantic-role normalization improves robustness to noisy VLM grounding and that explicit sequence modeling supports interpretable cross-dataset mistake detection. This work advances the state of the art by (1) providing diagnostic evidence of VLM failures in temporal logic, (2) introducing a scalable pipeline for generating realistic mistakes, and (3) presenting an efficient, structure-first baseline for post-completion mistake detection.

| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| Thesis_Olga_Loginova_38th_cycle_FinalExam.pdf | Open access | Doctoral thesis | All rights reserved | 33.61 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.



