Can Text-to-Video Generation help Video-Language Alignment?

Zanella, Luca; Mancini, Massimiliano; Menapace, Willi; Tulyakov, Sergey; Wang, Yiming; Ricci, Elisa

doi:10.1109/cvpr52734.2025.02244

Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and never associated with a video. While one solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help to overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others. We hypothesize that this issue is linked to noise (semantic and visual) in the generated videos and develop a method, SynViTa, that accounts for it. SynViTa dynamically weights the contribution of each synthetic video based on how similar its target caption is to the real counterpart. Moreover, a semantic consistency loss makes the model focus on fine-grained differences across captions, rather than on differences in video appearance. Experiments show that, on average, SynViTa improves over existing methods on the VideoCon test sets and on the SSv2-Temporal, SSv2-Events, and ATP-Hard benchmarks, making it a promising first step toward using synthetic videos when learning video-language models.
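
The similarity-based weighting mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how similarity is computed, how it is mapped to a weight, or how the weighted terms are combined. The function names `cosine_similarity` and `weighted_synthetic_loss`, the use of cosine similarity over embedding vectors, and the choice that higher similarity yields a higher weight are all assumptions made here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def weighted_synthetic_loss(per_sample_losses, similarities):
    """Weighted mean of per-sample losses for synthetic videos.

    Assumption (not from the source): each similarity lies in [-1, 1]
    and is mapped linearly to a weight in [0, 1], so samples judged
    more similar to their real counterpart contribute more.
    """
    weights = [(s + 1.0) / 2.0 for s in similarities]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every sample was fully down-weighted
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / total
```

In a training loop, a term like this would be added to the loss computed on real videos; how the two are balanced is a further design choice that the abstract leaves open.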

Can Text-to-Video Generation help Video-Language Alignment? / Zanella, L., Mancini, M., Menapace, W., Tulyakov, S., Wang, Y., Ricci, E. - (2025), pp. 24097-24107. (2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, USA, 2025) [10.1109/cvpr52734.2025.02244].

2025-01-01

Date of publication: 2025
Proceedings title: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Place of publication: Los Alamitos, CA, USA
Publisher: IEEE Computer Society
ISBN: 979-8-3315-4364-8
WOS identifier: WOS:001601181100009
Appears in types: 04.1 Paper in Proceedings

Files in this record:

File	Access	Type	License	Size	Format
Zanella_Can_Text-to-Video_Generation_help_Video-Language_Alignment_CVPR_2025_paper.pdf	Open access	Post-print, refereed author's manuscript	All rights reserved	4.41 MB	Adobe PDF
Can_Text-to-Video_Generation_help_Video-Language_Alignment.pdf	Archive managers only	Publisher's layout	All rights reserved	4.23 MB	Adobe PDF