Video Diffusion Models are Training-free Motion Interpreter and Controller (doi:10.21979/N9/HQM313)

Document Description
Citation
Title:	Video Diffusion Models are Training-free Motion Interpreter and Controller
Identification Number:	doi:10.21979/N9/HQM313
Distributor:	DR-NTU (Data)
Date of Distribution:	2024-11-07
Version:	1
Bibliographic Citation:	Xiao, Zeqi; Zhou, Yifan; Yang, Shuai; Pan, Xingang, 2024, "Video Diffusion Models are Training-free Motion Interpreter and Controller", https://doi.org/10.21979/N9/HQM313, DR-NTU (Data), V1
Study Description
Citation
Title:	Video Diffusion Models are Training-free Motion Interpreter and Controller
Identification Number:	doi:10.21979/N9/HQM313
Authoring Entity:	Xiao, Zeqi (Nanyang Technological University)
	Zhou, Yifan (Nanyang Technological University)
	Yang, Shuai (Peking University)
	Pan, Xingang (Nanyang Technological University)
Software used in Production:	Nil
Distributor:	DR-NTU (Data)
Access Authority:	Xiao, Zeqi
Depositor:	Xiao, Zeqi
Date of Deposit:	2024-10-11
Holdings Information:	https://doi.org/10.21979/N9/HQM313
Study Scope
Keywords:	Computer and Information Science, Video diffusion model
Abstract:	Video generation primarily aims to model authentic and customized motion across frames, making understanding and controlling the motion a crucial topic. Most diffusion-based studies on video motion focus on motion customization with training-based paradigms, which, however, demand substantial training resources and necessitate retraining for diverse models. Crucially, these approaches do not explore how video diffusion models encode cross-frame motion information in their features, and so lack interpretability and transparency in their effectiveness. To answer this question, this paper introduces a novel perspective to understand, localize, and manipulate motion-aware features in video diffusion models. Through analysis using Principal Component Analysis (PCA), our work discloses that robust motion-aware features already exist in video diffusion models. We present a new MOtion FeaTure (MOFT) by eliminating content correlation information and filtering motion channels. MOFT provides a distinct set of benefits, including the ability to encode comprehensive motion information with clear interpretability, extraction without the need for training, and generalizability across diverse architectures. Leveraging MOFT, we propose a novel training-free video motion control framework. Our method demonstrates competitive performance in generating natural and faithful motion, providing architecture-agnostic insights and applicability in a variety of downstream tasks.
Kind of Data:	Codes
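Illustrative note: the two steps named in the abstract (eliminating content correlation, then filtering motion channels) can be sketched roughly as follows. This is a minimal sketch and not the authors' implementation. It assumes that content shared across frames can be removed by subtracting the per-location temporal mean, and that motion channels can be chosen by their absolute weight in the first principal component. The function name extract_motion_feature, the parameter num_channels, and the synthetic input are hypothetical.

```python
import numpy as np


def extract_motion_feature(features, num_channels=2):
    """Return a motion-focused feature and the indices of the kept channels.

    features: array of shape (frames, channels, height, width), standing in
    for intermediate activations of a video model (hypothetical input).
    """
    # Step 1 (assumed form of content removal): subtract the temporal mean,
    # so anything identical in every frame cancels out.
    residual = features - features.mean(axis=0, keepdims=True)

    # Step 2: PCA over channels. Each (frame, y, x) location is one sample.
    channels = residual.shape[1]
    samples = residual.transpose(0, 2, 3, 1).reshape(-1, channels)
    samples = samples - samples.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(samples, full_matrices=False)

    # Step 3 (assumed form of channel filtering): keep the channels with the
    # largest absolute weight in the first principal component.
    weights = np.abs(vt[0])
    keep = np.sort(np.argsort(weights)[-num_channels:])
    return residual[:, keep], keep


# Synthetic check: static content in every channel, motion in channels 2 and 5.
rng = np.random.default_rng(0)
frames, channels, height, width = 8, 6, 4, 4
content = rng.normal(size=(1, channels, height, width))
features = np.repeat(content, frames, axis=0)
motion = rng.normal(size=(frames, height, width))
features[:, 2] += 3.0 * motion
features[:, 5] += 2.0 * motion

moft, keep = extract_motion_feature(features, num_channels=2)
```

In this synthetic case the static content cancels in step 1, so only the two channels that carry the shared motion signal have meaningful weight in the first principal component.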
Methodology and Processing
Sources Statement
Data Access
Other Study Description Materials
Related Studies
	Paper: https://xizaoqu.github.io/moft/
	Code: https://github.com/xizaoqu/TrajectoryAttention
Related Publications
Citation
Title:	Xiao, Z., Zhou, Y., Yang, S., & Pan, X. (2024, December). Video diffusion models are training-free motion interpreter and controller. In Proceedings of the 38th International Conference on Neural Information Processing Systems (pp. 76115-76138).
Identification Number:	10.5555/3737916.3740339
Bibliographic Citation:	Xiao, Z., Zhou, Y., Yang, S., & Pan, X. (2024, December). Video diffusion models are training-free motion interpreter and controller. In Proceedings of the 38th International Conference on Neural Information Processing Systems (pp. 76115-76138).
Citation
Title:	Xiao, Z., Zhou, Y., Yang, S., & Pan, X. (2024). Video diffusion models are training-free motion interpreter and controller. Advances in Neural Information Processing Systems, 37, 76115-76138.
Identification Number:	10356/201828
Bibliographic Citation:	Xiao, Z., Zhou, Y., Yang, S., & Pan, X. (2024). Video diffusion models are training-free motion interpreter and controller. Advances in Neural Information Processing Systems, 37, 76115-76138.