International Journal of Computer Vision
https://doi.org/10.1007/s11263-025-02346-1
MoonShot: Towards Controllable Video Generation and Editing with
Motion-Aware Multimodal Conditions
David Junhao Zhang1 · Dongxu Li2 · Hung Le2 · Mike Zheng Shou1 · Caiming Xiong2 · Doyen Sahoo2

1 Show Lab, National University of Singapore, Singapore
2 Salesforce Research, California, USA

Corresponding author: Mike Zheng Shou (mike.zheng.shou@gmail.com)
Communicated by Shengfeng He.

Received: 1 April 2024 / Accepted: 6 January 2025
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025
Abstract
Current video diffusion models (VDMs) mostly rely on text conditions, limiting control over video appearance and geometry.
This study introduces MoonShot, a new model that conditions on both image and text for enhanced control. It features the
Multimodal Video Block (MVB), which integrates a motion-aware dual cross-attention layer for precise appearance and motion
alignment with the provided prompts, and a spatiotemporal attention layer for large motion dynamics. The model can also incorporate
pre-trained image ControlNet modules for geometry conditioning without extra video training. Experiments show that our model
significantly improves visual quality and motion fidelity, and its versatility allows for applications in personalized video
generation, animation, and editing, making it a foundational tool for controllable video creation. More video results can be
found here.
Keywords Video generation · Video diffusion model · Video customization
1 Introduction
Recently, text-to-video diffusion models (VDMs) [4,12,15,21,24,59,69,93] have advanced significantly, enabling the creation
of high-quality, visually appealing videos (Fig. 1). However, most existing VDMs are limited to text-only
conditional control, which is not always sufficient to precisely describe visual content. Specifically, these methods
typically lack control over the visual appearance and geometric structure of the generated videos, leaving
video generation largely reliant on chance or randomness.
Regarding appearance control, it is well acknowledged that text prompts alone cannot precisely describe the
appearance of generated content [38,86]. To address this issue, in the context of text-to-image generation,
efforts have been made to achieve personalized generation [9,35,38,55,86] by fine-tuning diffusion models on
input images. Similarly, for video generation, AnimateDiff relies on customized model weights
to inject conditional visual content, either via LoRA [27]
or DreamBooth tuning [55]. Nonetheless, such an approach
incurs repetitive and tedious fine-tuning for each individual
visual condition, hindering it from efficiently scaling
to wider applications. The IP-Adapter [86] in the image
domain offers a solution for conditioning on both image and
text through the use of dual cross-attention layers.
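To make the dual cross-attention idea concrete, the following is a minimal PyTorch-style sketch of an IP-Adapter-like layer, not the original implementation; the class name, module layout, and the image_scale parameter are illustrative assumptions. It shows how a second cross-attention over image tokens can sit alongside the usual text cross-attention, with the two outputs summed into the hidden states.

import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    # Minimal sketch of an IP-Adapter-style dual cross-attention layer:
    # one cross-attention over text tokens plus a decoupled cross-attention
    # over image tokens, with the two results summed into the hidden states.
    def __init__(self, dim, cond_dim, num_heads=8):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(dim, num_heads, kdim=cond_dim,
                                               vdim=cond_dim, batch_first=True)
        self.attn_image = nn.MultiheadAttention(dim, num_heads, kdim=cond_dim,
                                                vdim=cond_dim, batch_first=True)
        self.image_scale = 1.0  # strength of the image condition

    def forward(self, x, text_tokens, image_tokens):
        # x: (B, N, dim) latent tokens; text/image tokens: (B, L, cond_dim)
        out_text, _ = self.attn_text(x, text_tokens, text_tokens)
        out_image, _ = self.attn_image(x, image_tokens, image_tokens)
        return x + out_text + self.image_scale * out_image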
However, directly incorporating these layers into a video
diffusion model causes the text condition to be replicated
along the temporal dimension, so every frame receives the
same condition. As shown in Fig. 2a, c, this replication makes it
difficult for the generated video to capture motion information
from the prompt, and the result often fails to reflect the
described motion. To address this
challenge, we introduce a motion-aware dual cross-attention
layer, in which a motion-aware module is designed to assign
learnable temporal weights to the repeated conditions, allowing
each frame to have a unique condition. As shown in
Fig. 2d, this approach enables the video to effectively extract
valuable motion information from the prompts, ensuring that the
generated video accurately follows the motion described in the text
while adhering to the appearance control from the image.
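As a rough illustration of this motion-aware weighting (an assumption-laden sketch, not the paper's actual design; the class name and the scale/shift parameterization are hypothetical), the repeated text condition can be modulated by learnable per-frame parameters so that each frame receives a distinct condition before entering the dual cross-attention:

import torch
import torch.nn as nn

class MotionAwareCondition(nn.Module):
    # Illustrative sketch: instead of broadcasting one text condition to all
    # frames identically, learnable per-frame scale/shift parameters give each
    # frame a slightly different version of the condition.
    def __init__(self, num_frames, cond_dim):
        super().__init__()
        self.frame_scale = nn.Parameter(torch.ones(num_frames, 1, cond_dim))
        self.frame_shift = nn.Parameter(torch.zeros(num_frames, 1, cond_dim))

    def forward(self, text_tokens):
        # text_tokens: (B, L, cond_dim) -> per-frame conditions (B, F, L, cond_dim)
        cond = text_tokens.unsqueeze(1)                       # (B, 1, L, cond_dim)
        return cond * self.frame_scale.unsqueeze(0) + self.frame_shift.unsqueeze(0)

Under this sketch, each frame's modulated condition would then be fed to that frame's cross-attention, which is what allows motion cues to vary across time.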
In terms of geometric structure control, although methods
such as ControlNet [89] and T2I-Adapter [45] have been developed
to leverage depth and edge maps as visual conditions for image