
When Video Compression Meets Multimodal Large Language Models: A Unified Paradigm for Cross-Modality Video Compression

  • Pingping Zhang
  • Jinlong Li
  • Kecheng Chen
  • Meng Wang*
  • Long Xu
  • Haoliang Li
  • Nicu Sebe
  • Sam Kwong
  • Shiqi Wang
  • *Corresponding author for this work
  • City University of Hong Kong
  • University of Trento
  • Lingnan University
  • Ningbo University

Research output: Contribution to journal, article, peer-reviewed

Abstract

Traditional video compression methods perform well at high bitrates but struggle to preserve fine-grained semantic information at low bitrates. Recently, with the blossoming of Multimodal Large Language Models (MLLMs), cross-modal compression techniques offer prospective solutions for improving video compression under low-bitrate conditions. In this paper, we propose a unified Cross-Modality Video Compression (CMVC) framework that integrates multimodal representations and video generative models. The encoder disentangles video into spatial and temporal components, which are mapped to compact cross-modal representations using MLLMs. During decoding, different encoding-decoding modes are employed to acquire various video reconstruction qualities, including Text-Text-to-Video (TT2V) for semantic preservation and Image-Text-to-Video (IT2V) for perceptual consistency. Additionally, we elaborate on an efficient frame interpolation model using Low-Rank Adaptation (LoRA) to improve the perceptual quality. Experimental results demonstrate that TT2V achieves effective semantic reconstruction, while IT2V ensures competitive perceptual consistency. These findings suggest the potential of leveraging multimodal priors to improve video compression, offering promising future research directions.

Original language: English
Journal: IEEE Signal Processing Letters
DOI
Publication status: Accepted/In press - 2026
Externally published
