Article · Publisher preview available

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation


Abstract and Figures

Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. After that, we propose a novel expert translation method that employs latent-based VDMs to further upsample the low-resolution video to high resolution, which also removes potential artifacts and corruptions from the low-resolution video. Compared to latent VDMs, Show-1 can produce high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). Furthermore, our Show-1 model can be readily adapted for motion customization and video stylization applications through simple temporal attention layer finetuning. Our model achieves state-of-the-art performance on standard video generation benchmarks. Code of Show-1 is publicly available, and more videos can be found here.
a The comparison evaluates the CLIP-Text Similarity Score, highlighting how well the text aligns with the video content and the fidelity of motion across various pixel and latent model pairings at different resolutions and compression ratios during the keyframe stage. These keyframe models all utilize an identical latent VDM for the final super-resolution phases. The radius of each point signifies the peak memory usage during the whole inference process. For consistency, all models in this study employ the same T5 text encoder and start with pre-trained weights from LAION, followed by additional training on WebVid using uniform steps to maintain fairness. f = 0 indicates the model operating in pixel space, while f = 2, 4, 8 correspond to different latent compression ratios. The findings reveal that employing a pixel VDM to create low-resolution videos (64×40) at the keyframe stage yields superior outcomes compared to latent VDMs across various resolutions and compression ratios. b Presents the visual outcomes of the keyframes
… 
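To make the two-stage design described above more concrete, here is a minimal, hypothetical sketch of how a pixel-space keyframe stage followed by a latent-space expert translation (super-resolution) stage could be wired together. The function names, frame count, example prompt, and the nearest-neighbor upsampling stand-in are editorial assumptions, not the authors' released code; the 64×40 and 256×160 resolutions are the ones mentioned in the paper.

```python
import numpy as np

# Hypothetical sketch of a hybrid pixel-then-latent text-to-video pipeline in the
# spirit of Show-1. The two stage functions are placeholders, NOT the authors'
# code or any real library API.

def pixel_keyframe_stage(prompt: str, frames: int = 8, h: int = 40, w: int = 64) -> np.ndarray:
    """Placeholder pixel-space VDM: returns low-resolution keyframes (F, H, W, 3)."""
    rng = np.random.default_rng(0)
    return rng.random((frames, h, w, 3)).astype(np.float32)

def latent_expert_translation(low_res: np.ndarray, scale: int = 4) -> np.ndarray:
    """Placeholder latent-space super-resolution ('expert translation'):
    nearest-neighbor upsampling stands in for the latent VDM that upsamples
    the video and cleans up low-resolution artifacts."""
    return low_res.repeat(scale, axis=1).repeat(scale, axis=2)

prompt = "a panda playing guitar on a beach"
keyframes = pixel_keyframe_stage(prompt)        # strong text-video alignment, low cost
video = latent_expert_translation(keyframes)    # higher resolution, modest GPU memory
print(keyframes.shape, "->", video.shape)       # (8, 40, 64, 3) -> (8, 160, 256, 3)
```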
International Journal of Computer Vision
https://doi.org/10.1007/s11263-024-02271-9
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
David Junhao Zhang1 · Jay Zhangjie Wu1 · Jia-Wei Liu1 · Rui Zhao1 · Lingmin Ran1 · Yuchao Gu1 · Difei Gao1 · Mike Zheng Shou1
Received: 29 March 2024 / Accepted: 3 October 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. After that, we propose a novel expert translation method that employs latent-based VDMs to further upsample the low-resolution video to high resolution, which also removes potential artifacts and corruptions from the low-resolution video. Compared to latent VDMs, Show-1 can produce high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). Furthermore, our Show-1 model can be readily adapted for motion customization and video stylization applications through simple temporal attention layer finetuning. Our model achieves state-of-the-art performance on standard video generation benchmarks. Code of Show-1 is publicly available, and more videos can be found here.
Keywords Diffusion model · Video generation · Video customization
Communicated by Yubo Li.

David Junhao Zhang, Jay Zhangjie Wu, and Jia-Wei Liu contributed equally to this work.

✉ Mike Zheng Shou (mike.zheng.shou@gmail.com)

1 Show Lab, National University of Singapore, Singapore, Singapore

1 Introduction

Remarkable progress has been made in developing large-scale pre-trained Text-to-Video Diffusion Models (VDMs), including closed-source ones (e.g., Make-A-Video (Singer et al., 2022), Imagen Video (Ho et al., 2022a), Video LDM (Blattmann et al., 2023a), Gen-2 (Esser et al., 2023)) and open-sourced ones (e.g., VideoCrafter (He et al., 2022), ModelScopeT2V (Wang et al., 2023a)). These VDMs can be classified into two types: (1) Pixel-based VDMs that directly denoise pixel values, including Make-A-Video (Singer et al., 2022), Imagen Video (Ho et al., 2022a), and PYoCo (Ge et al., 2023), and (2) Latent-based VDMs that manipulate the compacted latent space within a variational autoencoder (VAE), like Video LDM (Blattmann et al., 2023a) and MagicVideo (Zhou et al., 2022) (Fig. 1).
However, both types have pros and cons. As indicated by Singer et al. (2022) and Ho et al. (2022a), pixel-based VDMs can generate motion accurately aligned with the textual prompt because they start generating video from a very low resolution, e.g., 64×40 (also demonstrated by Fig. 2). But they typically demand expensive computational costs in terms of time and GPU memory, especially when upscaling the video to high resolution. Latent-based VDMs are more resource-efficient because they work in a reduced-dimension latent space. But it is challenging for such a small latent space (e.g., 8×5 for 64×40 videos) to cover the rich yet necessary visual semantic details described by the textual prompt. Therefore, as shown in Fig. 2, the generated videos are often not well-aligned with the textual prompts. On the other hand, when directly generating relatively high resolution videos (e.g., 256×160) using latent methods, the alignment between text and video could also be relatively …
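As a quick back-of-the-envelope illustration of the latent-size argument above (an editorial sketch, not from the paper), the spatial latent grid for a 64×40 frame shrinks rapidly with the VAE downsampling factor f used in the figure:

```python
# Spatial latent grid for a 64x40 frame at VAE downsampling factor f
# (f = 0 in the figure denotes operating directly in pixel space).
for f in (2, 4, 8):
    w, h = 64 // f, 40 // f
    print(f"f={f}: {w}x{h} latent grid ({w * h} spatial positions)")
# f=8 leaves only an 8x5 grid (40 positions), which the paper argues is too
# coarse to carry the visual semantics described by the text prompt.
```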
... To alleviate this problem, researchers have proposed a series of approaches to increase training and inference efficiency. Show-1 [41] and Lavie [32] adopt a cascaded framework to model temporal relations at low resolution and apply super-resolution to improve the final video resolution. However, the cascaded structure leads to error accumulation and significantly increases the inference time. ...
... Based on this insight, we propose a temporal pyramid: 1) In our method, the frame rate progressively increases as the diffusion process proceeds, as shown in Fig. 1. Unlike previous works [32, 41], which require an additional temporal interpolation network, we adopt a single model to handle different frame rates. To achieve this, we divide the diffusion process into multiple stages, with each stage operating at a different frame rate. ...
Preprint
Full-text available
The development of video diffusion models unveils a significant challenge: the substantial computational demands. To mitigate this challenge, we note that the reverse process of diffusion exhibits an inherent entropy-reducing nature. Given the inter-frame redundancy in video modality, maintaining full frame rates in high-entropy stages is unnecessary. Based on this insight, we propose TPDiff, a unified framework to enhance training and inference efficiency. By dividing diffusion into several stages, our framework progressively increases frame rate along the diffusion process with only the last stage operating on full frame rate, thereby optimizing computational efficiency. To train the multi-stage diffusion model, we introduce a dedicated training framework: stage-wise diffusion. By solving the partitioned probability flow ordinary differential equations (ODE) of diffusion under aligned data and noise, our training strategy is applicable to various diffusion forms and further enhances training efficiency. Comprehensive experimental evaluations validate the generality of our method, demonstrating 50% reduction in training cost and 1.5x improvement in inference efficiency.
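As an illustration of the temporal-pyramid idea described in this abstract, the sketch below assumes a simple frame-rate-halving schedule across stages; the actual stage partitioning in TPDiff may differ, so treat this purely as an editorial example.

```python
# Illustrative (assumed) temporal-pyramid schedule: earlier, high-entropy
# diffusion stages operate on fewer frames, and only the final stage runs at
# the full frame rate. The halving schedule is an assumption, not TPDiff's
# exact design.
def frames_per_stage(full_frames: int = 16, stages: int = 3) -> list[int]:
    return [max(1, full_frames // 2 ** (stages - 1 - s)) for s in range(stages)]

print(frames_per_stage())  # [4, 8, 16]: frame count grows as denoising proceeds
```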
... Early approaches [13,45,51,56] relied on GANs but were limited to single-domain datasets. The field later shifted to diffusion-based methods [11,12,16,24,37,57,69,70,75], mostly leveraging pre-trained image diffusion models. Among them, representative works such as Align-Your-Latents [6], AnimateDiff [18], Stable Video Diffusion [5], Lumiere [3], and Emu Video [17] extend 2D diffusion UNet with temporal layers to model dynamic motion priors. ...
Preprint
Full-text available
Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See https://guoyww.github.io/projects/long-context-video/ for more details.
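A rough way to picture the context-window expansion described above: a single-shot model restricts attention to a block-diagonal pattern over each shot's own tokens, whereas after Long Context Tuning attention spans all shots in the scene. The toy mask construction below is an editorial illustration with assumed shot and token counts, not the authors' implementation.

```python
import numpy as np

# Attention scope before vs. after scene-level context expansion (illustrative).
# With 3 shots of 4 tokens each, a single-shot model attends only within each
# shot (block-diagonal mask); expanding to the whole scene gives a full mask.
shots, tokens_per_shot = 3, 4
n = shots * tokens_per_shot

single_shot_mask = np.kron(np.eye(shots), np.ones((tokens_per_shot, tokens_per_shot)))
scene_level_mask = np.ones((n, n))  # full attention across all shots in the scene

print(int(single_shot_mask.sum()), "->", int(scene_level_mask.sum()))  # 48 -> 144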
... Diffusion-based video generators [5,7,9,39,44,58,74,85,92] are rapidly advancing, enabling the generation of visually rich and dynamic videos from text or visual inputs. Recent progress in video generative models highlights the growing need for user control over object appearance [12, …]. [Figure 2. Qualitative results on static view transport (left) & dynamic camera control (right).] ...
Preprint
Full-text available
We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors. In essence, Reangle-A-Video operates in two stages. (1) Multi-View Motion Learning: An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos. (2) Multi-View Consistent Image-to-Images Translation: The first frame of the input video is warped and inpainted into various camera perspectives under an inference-time cross-view consistency guidance using DUSt3R, generating multi-view consistent starting images. Extensive experiments on static view transport and dynamic camera control show that Reangle-A-Video surpasses existing methods, establishing a new solution for multi-view video generation. We will publicly release our code and data. Project page: https://hyeonho99.github.io/reangle-a-video/
... Diffusion models are probabilistic generative models that have achieved remarkable success in learning complex data distributions across various domains, including images [6,27,31,33], videos [3,5,14,28,43,50,51,54], and 3D objects [19,29,39,45,49]. These models operate through a two-step process: a forward process, which incrementally adds noise to a clean image, and a backward process, which progressively removes the noise to reconstruct the original image. ...
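As a reminder of the standard setup referenced in this citation context (an editorial sketch, not tied to any single cited work), the forward, noise-adding step of a denoising diffusion model can be written numerically as x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps:

```python
import numpy as np

# Standard DDPM forward (noising) step under a linear beta schedule.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))          # a toy "clean image"
betas = np.linspace(1e-4, 0.02, 1000)     # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)       # cumulative signal-retention factor

t = 500
eps = rng.standard_normal(x0.shape)       # eps ~ N(0, I)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
print(xt.shape, round(float(alpha_bar[t]), 3))  # noisy sample and remaining signal at step t
```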
Preprint
Full-text available
Personalized image generation via text prompts has great potential to improve daily life and professional work by facilitating the creation of customized visual content. The aim of image personalization is to create images based on a user-provided subject while maintaining both consistency of the subject and flexibility to accommodate various textual descriptions of that subject. However, current methods face challenges in ensuring fidelity to the text prompt while not overfitting to the training data. In this work, we introduce a novel training pipeline that incorporates an attractor to filter out distractions in training images, allowing the model to focus on learning an effective representation of the personalized subject. Moreover, current evaluation methods struggle due to the lack of a dedicated test set. The evaluation set-up typically relies on the training data of the personalization task to compute text-image and image-image similarity scores, which, while useful, tend to overestimate performance. Although human evaluations are commonly used as an alternative, they often suffer from bias and inconsistency. To address these issues, we curate a diverse and high-quality test set with well-designed prompts. With this new benchmark, automatic evaluation metrics can reliably assess model performance
... Storytelling through text-based video synthesis [12, 14, 22-24, 39, 45, 46, 48] presents a challenge in content creation. Recent advances [1, 3, 4, 6, 7, 9-11, 13, 15, 17-19, 27, 29, 30, 32, 37, 38, 40-44] in diffusion-based models have significantly improved the quality of short video generation. However, these models often struggle [1, 3, 4, 6, 7, 9-11, 17-19, 29, 32, 37, 38, 40, 41, 43, 44] to generate long-form coherent video sequences. ...
Preprint
Generating coherent long-form video sequences from discrete input using only text prompts is a critical task in content creation. While diffusion-based models excel at short video synthesis, long-form storytelling from text remains largely unexplored and challenging due to issues pertaining to temporal coherency, preservation of semantic meaning, and action continuity across the video. We introduce a novel storytelling approach to enable seamless video generation with natural action transitions and structured narratives. We present a bidirectional time-weighted latent blending strategy to ensure temporal consistency between segments of the long-form video being generated. Further, our method extends the Black-Scholes algorithm from prompt mixing for image generation to video generation, enabling controlled motion evolution through structured text conditioning. To further enhance motion continuity, we propose a semantic action representation framework to encode high-level action semantics into the blending process, dynamically adjusting transitions based on action similarity and ensuring smooth yet adaptable motion changes. Latent space blending maintains spatial coherence between objects in a scene, while time-weighted blending enforces bidirectional constraints for temporal consistency. This integrative approach prevents abrupt transitions while ensuring fluid storytelling. Extensive experiments demonstrate significant improvements over baselines, achieving temporally consistent and visually compelling video narratives without any additional training. Our approach bridges the gap between short clips and extended video to establish a new paradigm in GenAI-driven video synthesis from text.
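The time-weighted latent blending described here can be pictured as interpolating the boundary frames of adjacent segments in latent space with a ramping weight. The sketch below is an assumed, simplified one-directional form for illustration, not the paper's exact bidirectional formulation.

```python
import numpy as np

# Illustrative time-weighted latent blending between two adjacent video
# segments: frames in the overlap region are interpolated with weights that
# ramp from segment A to segment B (assumed form, not the paper's code).
def blend_segments(lat_a: np.ndarray, lat_b: np.ndarray, overlap: int) -> np.ndarray:
    """lat_a, lat_b: (frames, dim) latents; blend the last/first `overlap` frames."""
    w = np.linspace(0.0, 1.0, overlap)[:, None]          # 0 -> keep A, 1 -> keep B
    blended = (1.0 - w) * lat_a[-overlap:] + w * lat_b[:overlap]
    return np.concatenate([lat_a[:-overlap], blended, lat_b[overlap:]], axis=0)

a, b = np.zeros((10, 4)), np.ones((10, 4))
print(blend_segments(a, b, overlap=4).shape)  # (16, 4)
```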
... 10) AnimateDiff-V2. 11) Show-1 (Zhang et al., 2023a). 12) Pika (Pik, 2024). ...
Preprint
Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a ~14,000× compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source models Hunyuan Video (13B) and other commercial models such as Sora, Keling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.
... We condition our model on a set of clean reference views I_ref, which in practice we select as the closest training view. Inspired by video [1,3,9,10,13,17,53,65,66,71,84,87] and multi-view diffusion models [24,26,29,30,42,50,51,74], we adapt the self-attention layers into a reference mixing layer to capture cross-view dependencies. We start by concatenating the novel view Ĩ and the reference views I_ref along an additional view dimension and frame-wise encode them into latent space ...
Preprint
Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis task. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution, a single model compatible with both NeRF and 3DGS representations, and it achieves an average 2× improvement in FID score over baselines while maintaining 3D consistency.
Article
Quality assessment, which evaluates the visual quality level of multimedia experiences, has garnered significant attention from researchers and has evolved substantially through dedicated efforts. Before the advent of large models, quality assessment typically relied on small expert models tailored for specific tasks. While these smaller models are effective at handling their designated tasks and predicting quality levels, they often lack explainability and robustness. With the advancement of large models, which align more closely with human cognitive and perceptual processes, many researchers are now leveraging the prior knowledge embedded in these large models for quality assessment tasks. This emergence of quality assessment within the context of large models motivates us to provide a comprehensive review focusing on two key aspects: 1) the assessment of large models, and 2) the role of large models in assessment tasks. We begin by reflecting on the historical development of quality assessment. Subsequently, we move to detailed discussions of related works concerning quality assessment in the era of large models. Finally, we offer insights into the future progression and potential pathways for quality assessment in this new era. We hope this survey will enable a rapid understanding of the development of quality assessment in the era of large models and inspire further advancements in the field.
Preprint
Relational video customization refers to the creation of personalized videos that depict user-specified relations between two subjects, a crucial task for comprehending real-world visual content. While existing methods can personalize subject appearances and motions, they still struggle with complex relational video customization, where precise relational modeling and high generalization across subject categories are essential. The primary challenge arises from the intricate spatial arrangements, layout variations, and nuanced temporal dynamics inherent in relations; consequently, current models tend to overemphasize irrelevant visual details rather than capturing meaningful interactions. To address these challenges, we propose DreamRelation, a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement. First, in Relational Decoupling Learning, we disentangle relations from subject appearances using relation LoRA triplet and hybrid mask training strategy, ensuring better generalization across diverse relationships. Furthermore, we determine the optimal design of relation LoRA triplet by analyzing the distinct roles of the query, key, and value features within MM-DiT's attention mechanism, making DreamRelation the first relational video generation framework with explainable components. Second, in Relational Dynamics Enhancement, we introduce space-time relational contrastive loss, which prioritizes relational dynamics while minimizing the reliance on detailed subject appearances. Extensive experiments demonstrate that DreamRelation outperforms state-of-the-art methods in relational video customization. Code and models will be made publicly available.
Preprint
Recent great advances in video generation models have demonstrated their potential to produce high-quality videos, bringing challenges to effective evaluation. Unlike human evaluation, existing automated evaluation metrics lack high-level semantic understanding and reasoning capabilities for video, thus making them infeasible and unexplainable. To fill this gap, we curate GRADEO-Instruct, a multi-dimensional T2V evaluation instruction tuning dataset, including 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments converted by 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. The models, datasets, and codes will be released soon.
Article
Full-text available
Denoising diffusion probabilistic models are a promising new class of generative models that mark a milestone in high-quality image generation. This paper showcases their ability to sequentially generate video, surpassing prior methods in perceptual and probabilistic forecasting metrics. We propose an autoregressive, end-to-end optimized video diffusion model inspired by recent advances in neural video compression. The model successively generates future frames by correcting a deterministic next-frame prediction using a stochastic residual generated by an inverse diffusion process. We compare this approach against six baselines on four datasets involving natural and simulation-based videos. We find significant improvements in terms of perceptual quality and probabilistic frame forecasting ability for all datasets.
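A minimal sketch of the residual-correction rollout described above, with placeholder functions standing in for the deterministic next-frame predictor and the reverse-diffusion residual sampler (editorial assumptions, not the authors' model):

```python
import numpy as np

# Illustrative autoregressive rollout: a deterministic predictor proposes the
# next frame and a stochastic residual, standing in for the reverse-diffusion
# sample, corrects it.
rng = np.random.default_rng(0)

def predict_next(frame: np.ndarray) -> np.ndarray:
    """Placeholder deterministic next-frame predictor (identity prediction)."""
    return frame.copy()

def sample_residual(shape: tuple) -> np.ndarray:
    """Placeholder for the stochastic residual from the inverse diffusion process."""
    return 0.1 * rng.standard_normal(shape)

frame = rng.random((16, 16, 3))
for _ in range(3):                        # autoregressively roll out three frames
    frame = predict_next(frame) + sample_residual(frame.shape)
print(frame.shape)
```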