International Journal of Computer Vision
https://doi.org/10.1007/s11263-025-02346-1
MoonShot: Towards Controllable Video Generation and Editing with Motion-Aware Multimodal Conditions
David Junhao Zhang¹ · Dongxu Li² · Hung Le² · Mike Zheng Shou¹ · Caiming Xiong² · Doyen Sahoo²
¹ Show Lab, National University of Singapore, Singapore, Singapore
² Salesforce Research, California, USA
Corresponding author: Mike Zheng Shou (mike.zheng.shou@gmail.com)
Communicated by Shengfeng He.
Received: 1 April 2024 / Accepted: 6 January 2025
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025
Abstract
Current video diffusion models (VDMs) mostly rely on text conditions, limiting control over video appearance and geometry. This study introduces a new model, MoonShot, conditioning on both image and text for enhanced control. It features the Multimodal Video Block (MVB), integrating the motion-aware dual cross-attention layer for precise appearance and motion alignment with provided prompts, and the spatiotemporal attention layer for large motion dynamics. It can also incorporate pre-trained Image ControlNet modules for geometry conditioning without extra video training. Experiments show our model significantly improves visual quality and motion fidelity, and its versatility allows for applications in personalized video generation, animation, and editing, making it a foundational tool for controllable video creation. More video results can be found here.
Keywords Video generation · Video diffusion model · Video customization
1 Introduction
Recently, text-to-video diffusion models (VDMs) [4,12,15,21,24,59,69,93] have advanced significantly, allowing the creation of high-quality, visually appealing videos (Fig. 1). However, most existing VDMs are limited to mere text conditional control, which is not always sufficient to precisely describe visual content. Specifically, these methods usually lack control over the visual appearance and geometric structure of the generated videos, rendering video generation largely reliant on chance or randomness.
Regarding appearance control, it is well acknowledged that text prompts are not sufficient to precisely describe the appearance of generations [38,86]. To address this issue, in the context of text-to-image generation, efforts have been made to achieve personalized generation [9,35,38,55,86] by fine-tuning diffusion models on input images. Similarly, for video generation, AnimateDiff relies on customized model weights
to inject conditional visual content, either via LoRA [27] or DreamBooth tuning [55]. Nonetheless, such an approach incurs repetitive and tedious fine-tuning for each individual visual conditional input, hindering it from efficiently scaling to wider applications. The IP-Adapter [86] in the image domain offers a solution for conditioning on both image and text through the use of dual cross-attention layers.
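To make this mechanism concrete, the following is a minimal PyTorch sketch of an IP-Adapter-style dual (decoupled) cross-attention layer, in which latent tokens attend to text and image conditions through separate branches whose outputs are summed. The module names, the use of `nn.MultiheadAttention`, and the `scale` parameter are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """Illustrative sketch of IP-Adapter-style decoupled cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8, scale: float = 1.0):
        super().__init__()
        # Separate cross-attention branches for text and image conditions.
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scale = scale  # relative weight of the image branch

    def forward(self, z, text_tokens, image_tokens):
        # z:            (B, N, D) latent tokens of a single frame
        # text_tokens:  (B, L_t, D) encoded text prompt
        # image_tokens: (B, L_i, D) encoded reference image
        out_text, _ = self.text_attn(z, text_tokens, text_tokens)
        out_image, _ = self.image_attn(z, image_tokens, image_tokens)
        return z + out_text + self.scale * out_image
```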
However, directly incorporating these layers into a video diffusion model leads to a scenario where each text condition is repeatedly applied across the temporal dimension, so every frame is subject to the same text condition. As shown in Fig. 2a, c, this replication makes it difficult for the generated video to capture motion information from the prompt, preventing it from accurately reflecting the described motion. To address this challenge, we introduce a motion-aware dual cross-attention layer, in which a motion-aware module assigns learnable temporal weights to the repeated conditions, allowing each frame to have a unique condition. As shown in Fig. 2d, this approach enables the video to effectively extract valuable motion information from the prompts, ensuring that it accurately follows the motion described by the text and the appearance specified by the image.
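As a rough sketch of this idea, building on the DualCrossAttention example above and again an assumption rather than the authors' exact design, a motion-aware variant can attach a learnable weight to each frame, so the condition that would otherwise be broadcast identically over time becomes frame-specific:

```python
class MotionAwareDualCrossAttention(DualCrossAttention):
    """Hedged sketch: learnable temporal weights over repeated conditions."""

    def __init__(self, dim: int, num_frames: int, num_heads: int = 8, scale: float = 1.0):
        super().__init__(dim, num_heads, scale)
        # One learnable weight per frame, so each frame sees a distinct condition.
        self.temporal_weight = nn.Parameter(torch.ones(num_frames, 1, 1))

    def forward(self, z, text_tokens, image_tokens):
        # z: (B, F, N, D) latent tokens for F frames; conditions are shared (B, L, D).
        frames = []
        for f in range(z.shape[1]):
            w = self.temporal_weight[f]                       # frame-specific weight
            frames.append(super().forward(z[:, f],
                                          text_tokens * w,    # modulated text condition
                                          image_tokens * w))  # modulated image condition
        return torch.stack(frames, dim=1)
```

A per-frame scalar is the simplest choice here; a per-frame vector or a small network over frame indices would serve the same purpose of giving each frame its own condition.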
In terms of geometric structure control, although methods such as ControlNet [89] and T2I-Adapter [45] have been developed to leverage depth and edge maps as visual conditions for image
... Recently, image-to-video (I2V) generation models [4,6,9,12,21,29,30,37,54,60,66,68] have developed rapidly. These models bring images to life, making the visual content more dynamic and vivid. ...
... ated result. Other I2V models [9,29,30,37,54,66,68] support simple control through text input. As shown in Figure 1(a), they adopt a text encoder to inject control information into visual features. ...
... Generally, most I2V generation models [9,21,29,66] support text and image input simultaneously. The input image guides the visual content of the video, while the text indicates the potential motion. ...
Preprint
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter takes the flow to control the base image-to-video diffusion model for generating fine-grained controlled videos. The significant improvement in the Video-Text Camera Motion metrics on VBench indicates that our method achieves precise control over camera motion. We construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.
... The commonly used open-source text-video dataset for video generation [29,30,31,32,33,34,35,36,37,38,39] is WebVid-10M [10]. However, it contains a prominent watermark on videos, requiring additional fine-tuning on image datasets (e.g., Laion [40]) or internal high-quality video datasets to remove the watermark. ...
... To generate long videos in the absence of corresponding datasets, Make-A-Video [29] and NUWA-XL [53] explore coarse-to-fine video generation but struggle to maintain temporal continuity and produce strong motion magnitude. Apart from these explorations of convolution-based architectures [29,30,31,25,23,27,24,32,42,37,34,35,33,38,39], transformer-based methods (e.g., WALT [26], Latte [54], and Snap Video [3]) become more prevalent recently, offering a better trade-off between computational complexity and performance, as well as improved scalability. ...
Preprint
Sora's high-motion intensity and long consistent videos have significantly impacted the field of video generation, attracting unprecedented attention. However, existing publicly available datasets are inadequate for generating Sora-like videos, as they mainly contain short videos with low motion intensity and brief captions. To address these issues, we propose MiraData, a high-quality video dataset that surpasses previous ones in video duration, caption detail, motion strength, and visual quality. We curate MiraData from diverse, manually selected sources and meticulously process the data to obtain semantically consistent clips. GPT-4V is employed to annotate structured captions, providing detailed descriptions from four different perspectives along with a summarized dense caption. To better assess temporal consistency and motion intensity in video generation, we introduce MiraBench, which enhances existing benchmarks by adding 3D consistency and tracking-based motion strength metrics. MiraBench includes 150 evaluation prompts and 17 metrics covering temporal consistency, motion strength, 3D consistency, visual quality, text-video alignment, and distribution similarity. To demonstrate the utility and effectiveness of MiraData, we conduct experiments using our DiT-based video generation model, MiraDiT. The experimental results on MiraBench demonstrate the superiority of MiraData, especially in motion strength.
... Compared to pure T2V synthesis, these methods enabled controllable and precise synthesis driven by multiple conditions. For facilitating the global controls regarding visual appearance, AnimateDiff [4] and MoonShot [28] conditioned the synthesis on both image and text inputs simultaneously. However, these methods overlooked fine-grained controls, limiting their applicability in the medical field. ...
... To model the temporal dynamics, we insert a temporal self-attention layer following the cross-attention layer in each block. This design ensures the feature distribution of the spatial layers will not be altered significantly [28]. Thus, HeartBeat can reuse the rich visual concepts regarding ECHO video patterns preserved in LDM and focus on temporal features integration. ...
Preprint
Full-text available
Echocardiography (ECHO) video is widely used for cardiac examination. In clinical practice, this procedure heavily relies on operator experience, which requires years of training and may benefit from the assistance of deep learning-based systems for enhanced accuracy and efficiency. However, it is challenging since acquiring sufficient customized data (e.g., abnormal cases) for novice training and deep model development is clinically unrealistic. Hence, controllable ECHO video synthesis is highly desirable. In this paper, we propose a novel diffusion-based framework named HeartBeat towards controllable and high-fidelity ECHO video synthesis. Our highlights are three-fold. First, HeartBeat serves as a unified framework that enables perceiving multimodal conditions simultaneously to guide controllable generation. Second, we factorize the multimodal conditions into local and global ones, with two insertion strategies separately providing fine- and coarse-grained controls in a composable and flexible manner. In this way, users can synthesize ECHO videos that conform to their mental imagery by combining multimodal control signals. Third, we propose to decouple the visual concepts and temporal dynamics learning using a two-stage training scheme for simplifying the model training. One more interesting thing is that HeartBeat can easily generalize to mask-guided cardiac MRI synthesis in a few shots, showcasing its scalability to broader applications. Extensive experiments on two public datasets show the efficacy of the proposed HeartBeat.
... With the advancement of video generation models [2,7,8,16,31,35,38,55,60,62,69] and video understanding models [1,12,30,32,37,70], character generation technologies have made significant progress. Some works [17,56,61,64,67] can generate videos with consistent identity based on a reference face or body image, and some [14,15,20,41,42,57,71,72] extend this by incorporating motion control. However, these methods still focus on creating a new video rather than altering an existing one. ...
Preprint
Video body-swapping aims to replace the body in an existing video with a new body from arbitrary sources, which has garnered more attention in recent years. Existing methods treat video body-swapping as a composite of multiple tasks instead of an independent task and typically rely on various models to achieve video body-swapping sequentially. However, these methods fail to achieve end-to-end optimization for the video body-swapping which causes issues such as variations in luminance among frames, disorganized occlusion relationships, and the noticeable separation between bodies and background. In this work, we define video body-swapping as an independent task and propose three critical consistencies: identity consistency, motion consistency, and environment consistency. We introduce an end-to-end model named SwapAnyone, treating video body-swapping as a video inpainting task with reference fidelity and motion control. To improve the ability to maintain environmental harmony, particularly luminance harmony in the resulting video, we introduce a novel EnvHarmony strategy for training our model progressively. Additionally, we provide a dataset named HumanAction-32K covering various videos about human actions. Extensive experiments demonstrate that our method achieves State-Of-The-Art (SOTA) performance among open-source methods while approaching or surpassing closed-source models across multiple dimensions. All code, model weights, and the HumanAction-32K dataset will be open-sourced at https://github.com/PKU-YuanGroup/SwapAnyone.
... For example, AnimateDiff [16] introduced a temporal attention module to improve temporal consistency across frames. Subsequent video generation models [4, 6,7,47,62,63] adopted an alternating approach with 2D spatial and 1D temporal attention, including works like ModelScope, VideoCrafter, Moonshot, and Show-1. ...
Preprint
Full-text available
Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
Preprint
Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of dynamic latent variables, enabling independent control over motion and identity. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a Temporal Transition Module to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.
Article
Full-text available
Diffusion generative models have recently become a powerful technique for creating and modifying high-quality, coherent video content. This survey provides a comprehensive overview of the critical components of diffusion models for video generation, including their applications, architectural design, and temporal dynamics modeling. The paper begins by discussing the core principles and mathematical formulations, then explores various architectural choices and methods for maintaining temporal consistency. A taxonomy of applications is presented, categorizing models based on input modalities such as text prompts, images, videos, and audio signals. Advancements in text-to-video generation are discussed to illustrate the state-of-the-art capabilities and limitations of current approaches. Additionally, the survey summarizes recent developments in training and evaluation practices, including the use of diverse video and image datasets and the adoption of various evaluation metrics to assess model performance. The survey concludes with an examination of ongoing challenges, such as generating longer videos and managing computational costs, and offers insights into potential future directions for the field. By consolidating the latest research and developments, this survey aims to serve as a valuable resource for researchers and practitioners working with video diffusion models. Website: https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models
Preprint
Full-text available
Recently, video generation techniques have advanced rapidly. Given the popularity of video content on social media platforms, these models intensify concerns about the spread of fake information. Therefore, there is a growing demand for detectors capable of distinguishing between fake AI-generated videos and mitigating the potential harm caused by fake information. However, the lack of large-scale datasets from the most advanced video generators poses a barrier to the development of such detectors. To address this gap, we introduce the first AI-generated video detection dataset, GenVideo. It features the following characteristics: (1) a large volume of videos, including over one million AI-generated and real videos collected; (2) a rich diversity of generated content and methodologies, covering a broad spectrum of video categories and generation techniques. We conducted extensive studies of the dataset and proposed two evaluation methods tailored for real-world-like scenarios to assess the detectors' performance: the cross-generator video classification task assesses the generalizability of trained detectors on generators; the degraded video classification task evaluates the robustness of detectors to handle videos that have degraded in quality during dissemination. Moreover, we introduced a plug-and-play module, named Detail Mamba (DeMamba), designed to enhance the detectors by identifying AI-generated videos through the analysis of inconsistencies in temporal and spatial dimensions. Our extensive experiments demonstrate DeMamba's superior generalizability and robustness on GenVideo compared to existing detectors. We believe that the GenVideo dataset and the DeMamba module will significantly advance the field of AI-generated video detection. Our code and dataset will be available at https://github.com/chenhaoxing/DeMamba.
Article
Full-text available
Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed as Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video of strong text-video correlation. After that, we propose a novel expert translation method that employs the latent-based VDMs to further upsample the low-resolution video to high resolution, which can also remove potential artifacts and corruptions from low-resolution videos. Compared to latent VDMs, Show-1 can produce high-quality videos of precise text-video alignment; Compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 G vs. 72 G). Furthermore, our Show-1 model can be readily adapted for motion customization and video stylization applications through simple temporal attention layer finetuning. Our model achieves state-of-the-art performance on standard video generation benchmarks. Code of Show-1 is publicly available and more videos can be found here.
Conference Paper
Full-text available
This paper presents Video-P2P, a novel framework for real-world video editing with cross-attention control. While attention control has proven effective for image editing with pre-trained image generation models, there are currently no large-scale video generation models publicly available. Video-P2P addresses this limitation by adapting an image generation diffusion model to complete various video editing tasks. Specifically, we propose to first tune a Text-to-Set (T2S) model to complete an approximate inversion and then optimize a shared unconditional embedding to achieve accurate video inversion with a small memory cost. For attention control, we introduce a novel decoupled-guidance strategy, which uses different guidance strategies for the source and target prompts. The optimized unconditional embedding for the source prompt improves reconstruction ability, while an initialized unconditional embedding for the target prompt enhances editability. Incorporating the attention maps of these two branches enables detailed editing. These technical designs enable various text-driven editing applications, including word swap, prompt refinement , and attention re-weighting. Video-P2P works well on real-world videos for generating new characters while optimally preserving their original poses and scenes. It significantly outperforms previous approaches.