International Journal of Computer Vision
https://doi.org/10.1007/s11263-024-02271-9
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video
Generation
David Junhao Zhang · Jay Zhangjie Wu · Jia-Wei Liu · Rui Zhao · Lingmin Ran · Yuchao Gu · Difei Gao · Mike Zheng Shou
Show Lab, National University of Singapore, Singapore
David Junhao Zhang, Jay Zhangjie Wu, and Jia-Wei Liu contributed equally to this work.
Corresponding author: Mike Zheng Shou (mike.zheng.shou@gmail.com)
Communicated by Yubo Li.
Received: 29 March 2024 / Accepted: 3 October 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. After that, we propose a novel expert translation method that employs latent-based VDMs to further upsample the low-resolution video to high resolution, which also removes potential artifacts and corruptions from the low-resolution video. Compared to latent VDMs, Show-1 can produce high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). Furthermore, our Show-1 model can be readily adapted for motion customization and video stylization applications through simple temporal attention layer finetuning. Our model achieves state-of-the-art performance on standard video generation benchmarks. The code of Show-1 is publicly available, and more videos can be found here.
Keywords Diffusion model · Video generation · Video customization
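To make the two-stage design summarized in the abstract concrete, the following is a minimal sketch, under stated assumptions, of how such a hybrid pipeline could be wired together: a pixel-space stage produces a prompt-faithful low-resolution clip, and a second stage lifts it to high resolution. The class names (PixelVDM, LatentUpsampler), the tensor shapes, and the bilinear interpolation standing in for expert translation are illustrative placeholders, not the authors' implementation.

# Sketch only: placeholder modules mimic the shapes of a pixel-based
# low-resolution stage followed by an upsampling (expert-translation) stage.
from dataclasses import dataclass
import torch


@dataclass
class PixelVDM:
    """Stand-in for a pixel-space text-to-video diffusion model."""
    frames: int = 16
    height: int = 40
    width: int = 64

    def generate(self, prompt: str) -> torch.Tensor:
        # A real model would run iterative denoising conditioned on `prompt`;
        # here we only return a tensor with the expected low-resolution shape.
        return torch.rand(self.frames, 3, self.height, self.width)


@dataclass
class LatentUpsampler:
    """Stand-in for a latent-space model used as an expert translator."""
    scale: int = 4

    def translate(self, video: torch.Tensor, prompt: str) -> torch.Tensor:
        # A real expert-translation stage would encode, denoise in latent
        # space, and decode; we only mimic the change in output resolution.
        return torch.nn.functional.interpolate(
            video, scale_factor=self.scale, mode="bilinear", align_corners=False
        )


def hybrid_pipeline(prompt: str) -> torch.Tensor:
    low_res = PixelVDM().generate(prompt)                      # strong text-video alignment
    high_res = LatentUpsampler().translate(low_res, prompt)    # memory-efficient upsampling
    return high_res


if __name__ == "__main__":
    clip = hybrid_pipeline("a panda eating bamboo")
    print(clip.shape)  # torch.Size([16, 3, 160, 256])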
1 Introduction
Remarkable progress has been made in developing large-
scale pre-trained Text-to-Video Diffusion Models (VDMs),
including closed-source ones (e.g., Make-A-Video (Singer
et al., 2022), Imagen Video (Ho et al., 2022a), Video
LDM (Blattmann et al., 2023a), Gen-2 (Esser et al., 2023))
and open-source ones (e.g., VideoCrafter (He et al., 2022),
ModelScopeT2V (Wang et al., 2023a)). These VDMs can be
classified into two types: (1) Pixel-based VDMs that directly
denoise pixel values, including Make-A-Video (Singer et
al., 2022), Imagen Video (Ho et al., 2022a), PYoCo (Ge
et al., 2023), and (2) Latent-based VDMs that operate in the compact latent space of a variational autoencoder (VAE), like Video LDM (Blattmann et al., 2023a) and MagicVideo (Zhou et al., 2022) (Fig. 1).
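As a rough illustration of the difference between the two families, the snippet below contrasts the tensor a pixel-based VDM denoises with the latent tensor a latent-based VDM denoises for the same clip. The 8x spatial VAE downsampling matches the 8 x 5 latent for 64 x 40 videos discussed next; the choice of 4 latent channels is an assumed, commonly used value, not taken from the cited papers.

# Sketch only: compares the shapes seen by pixel-based vs. latent-based VDMs.
import torch

frames, channels, height, width = 16, 3, 40, 64   # a low-resolution clip
video = torch.rand(frames, channels, height, width)

# Pixel-based: the diffusion model denoises the raw video tensor.
pixel_space_shape = video.shape                    # (16, 3, 40, 64)

# Latent-based: an 8x-downsampling VAE (4 latent channels assumed) yields
# an 8 x 5 spatial grid per 64 x 40 frame, so denoising is much cheaper.
vae_factor, latent_channels = 8, 4
latent_shape = (frames, latent_channels, height // vae_factor, width // vae_factor)

print(pixel_space_shape)  # torch.Size([16, 3, 40, 64])
print(latent_shape)       # (16, 4, 5, 8)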
However, both approaches have pros and cons. As indicated by Singer et al. (2022) and Ho et al. (2022a), pixel-based VDMs can generate motion that is accurately aligned with the textual prompt because they start generating video at a very low resolution, e.g., 64 × 40 (also demonstrated in Fig. 2). However, they typically incur high computational costs in terms of time and GPU memory, especially when upscaling the video to high resolution. Latent-based VDMs are more resource-efficient because they work in a reduced-dimension latent space. However, it is challenging for such a small latent space (e.g., 8 × 5 for 64 × 40 videos) to cover the rich yet necessary visual semantic details described by the textual prompt. Therefore, as shown in Fig. 2, the generated videos are often not well aligned with the textual prompts. On the other hand, when directly generating relatively high-resolution videos (e.g., 256 × 160) with latent methods, the alignment between text and video can also be rela-