TY - JOUR
T1 - When Video Compression Meets Multimodal Large Language Models
T2 - A Unified Paradigm for Cross-Modality Video Compression
AU - Zhang, Pingping
AU - Li, Jinlong
AU - Chen, Kecheng
AU - Wang, Meng
AU - Xu, Long
AU - Li, Haoliang
AU - Sebe, Nicu
AU - Kwong, Sam
AU - Wang, Shiqi
N1 - Publisher Copyright: © 1994-2012 IEEE.
PY - 2026
Y1 - 2026
N2 - Traditional video compression methods perform well at high bitrates but struggle to preserve fine-grained semantic information at low bitrates. Recently, with the blossoming of Multimodal Large Language Models (MLLMs), cross-modal compression techniques offer prospective solutions for improving video compression under low-bitrate conditions. In this paper, we propose a unified Cross-Modality Video Compression (CMVC) framework that integrates multimodal representations and video generative models. The encoder disentangles video into spatial and temporal components, which are mapped to compact cross-modal representations using MLLMs. During decoding, different encoding-decoding modes are employed to acquire various video reconstruction qualities, including Text-Text-to-Video (TT2V) for semantic preservation and Image-Text-to-Video (IT2V) for perceptual consistency. Additionally, we elaborate on an efficient frame interpolation model using Low-Rank Adaptation (LoRA) to improve the perceptual quality. Experimental results demonstrate that TT2V achieves effective semantic reconstruction, while IT2V ensures competitive perceptual consistency. These findings suggest the potential of leveraging multimodal priors to improve video compression, offering promising future research directions.
AB - Traditional video compression methods perform well at high bitrates but struggle to preserve fine-grained semantic information at low bitrates. Recently, with the blossoming of Multimodal Large Language Models (MLLMs), cross-modal compression techniques offer prospective solutions for improving video compression under low-bitrate conditions. In this paper, we propose a unified Cross-Modality Video Compression (CMVC) framework that integrates multimodal representations and video generative models. The encoder disentangles video into spatial and temporal components, which are mapped to compact cross-modal representations using MLLMs. During decoding, different encoding-decoding modes are employed to acquire various video reconstruction qualities, including Text-Text-to-Video (TT2V) for semantic preservation and Image-Text-to-Video (IT2V) for perceptual consistency. Additionally, we elaborate on an efficient frame interpolation model using Low-Rank Adaptation (LoRA) to improve the perceptual quality. Experimental results demonstrate that TT2V achieves effective semantic reconstruction, while IT2V ensures competitive perceptual consistency. These findings suggest the potential of leveraging multimodal priors to improve video compression, offering promising future research directions.
KW - Video
KW - multimodal large language models
KW - multimodal representations
KW - semantic reconstruction
UR - https://www.scopus.com/pages/publications/105032761891
U2 - 10.1109/LSP.2026.3673193
DO - 10.1109/LSP.2026.3673193
M3 - Article
AN - SCOPUS:105032761891
SN - 1070-9908
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
ER -