Zhang, P.; Li, J.; Chen, K.; Wang, M.; Xu, L.; Li, H.; Sebe, N.; Kwong, S.; Wang, S. When Video Compression Meets Multimodal Large Language Models: A Unified Paradigm for Cross-Modality Video Compression. IEEE Signal Processing Letters (ISSN 1070-9908), vol. 33, 2026, pp. 1716-1720. doi: 10.1109/LSP.2026.3673193
When Video Compression Meets Multimodal Large Language Models: A Unified Paradigm for Cross-Modality Video Compression
Li, J.; Sebe, N.
2026-01-01
Abstract
Traditional video compression methods perform well at high bitrates but struggle to preserve fine-grained semantic information at low bitrates. Recently, with the rapid development of Multimodal Large Language Models (MLLMs), cross-modal compression techniques have emerged as promising solutions for improving video compression under low-bitrate conditions. In this paper, we propose a unified Cross-Modality Video Compression (CMVC) framework that integrates multimodal representations and video generative models. The encoder disentangles video into spatial and temporal components, which are mapped to compact cross-modal representations using MLLMs. During decoding, different encoding-decoding modes are employed to achieve different levels of video reconstruction quality, including Text-Text-to-Video (TT2V) for semantic preservation and Image-Text-to-Video (IT2V) for perceptual consistency. Additionally, we design an efficient frame interpolation model based on Low-Rank Adaptation (LoRA) to improve perceptual quality. Experimental results demonstrate that TT2V achieves effective semantic reconstruction, while IT2V ensures competitive perceptual consistency. These findings suggest the potential of leveraging multimodal priors to improve video compression, offering promising directions for future research.
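
The abstract does not detail how LoRA is applied inside the frame interpolation model. As a rough illustration only, the sketch below shows the standard Low-Rank Adaptation pattern: a frozen pretrained linear projection augmented with a trainable low-rank residual. All names here (LoRALinear, rank, alpha) are illustrative assumptions, not identifiers from the paper.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x.

    Hypothetical sketch of LoRA fine-tuning for a generative interpolation model;
    not the paper's actual implementation.
    """
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # A is initialized with small random values, B with zeros, so the
        # adapted layer starts out identical to the pretrained one.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus the scaled low-rank residual.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

Because B starts at zero, training begins from the unmodified pretrained behavior, and only the small A/B matrices are updated, which is what makes LoRA-based adaptation of a large interpolation or generation model lightweight.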



