Jay Zhangjie Wu’s research while affiliated with National University of Singapore and other places


Publications (14)


Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images
  • Chapter

November 2024 · 3 Reads

David Junhao Zhang · Jay Zhangjie Wu · [...] · Mike Zheng Shou

Given text descriptions, our approach generates highly faithful and photorealistic videos.
The comparison in (a) evaluates the CLIP-Text Similarity Score, highlighting how well the text aligns with video content and the fidelity of motion across various pixel and latent model pairings at different resolutions and compression ratios during the keyframe stage. These keyframe models all utilize an identical latent VDM for the final super-resolution phases. Each point's radius signifies the peak memory usage during the whole inference process. For consistency, all models in this study employ the same T5 text encoder and start with pre-trained weights from LAION, followed by additional training on WebVid using uniform steps to maintain fairness. f = 0 indicates a model operating in pixel space, while f = 2, 4, 8 correspond to different latent compression ratios. The findings reveal that employing a pixel VDM to create low-resolution videos (64×40) at the keyframe stage yields superior outcomes compared to latent VDMs across various resolutions and compression ratios. (b) presents the visual outcomes of the keyframes.
Final super-resolution comparisons. We contrast our expert translation against typical SDx4 upsampling that includes temporal layers and visualize the X-T slice of the final outcomes. The findings suggest that our approach is capable of managing the possible corruptions found in low-resolution videos, resulting in improved temporal consistency and quality (notably smoother and with reduced noise in the X-T slice) compared to SDx4 with temporal layers
Overview of Show-1. Pixel-based VDMs produce low-resolution videos with better text-video alignment, and latent-based VDMs then upscale these low-resolution videos into high-resolution videos at low computational cost
UNet block of Show-1. We modify the 2D UNet by inserting temporal convolution and attention layers inside each block. During training, we update the additional temporal layers while keeping spatial layers fixed


Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
  • Article

October 2024 · 32 Reads · 123 Citations

International Journal of Computer Vision

Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. After that, we propose a novel expert translation method that employs latent-based VDMs to further upsample the low-resolution video to high resolution, which also removes potential artifacts and corruptions from the low-resolution video. Compared to latent VDMs, Show-1 produces high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). Furthermore, our Show-1 model can be readily adapted for motion customization and video stylization applications through simple temporal attention layer finetuning. Our model achieves state-of-the-art performance on standard video generation benchmarks. Code for Show-1 is publicly available, and more videos can be found here.
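
To make the two-stage design in the abstract concrete, here is a hedged pseudocode sketch of the flow it describes (low-resolution pixel-space generation followed by latent-space expert translation). The component names (pixel_vdm, latent_sr_vdm, encode, decode) are hypothetical placeholders, not the released Show-1 interface.

```python
# Pseudocode sketch of the two-stage flow described in the abstract. The
# pixel_vdm, latent_sr_vdm, encode, and decode objects are hypothetical
# placeholders, not the released Show-1 API.
def generate_video(prompt, pixel_vdm, latent_sr_vdm, encode, decode):
    # Stage 1: pixel-space diffusion at low resolution gives strong
    # text-video alignment at a small memory cost.
    low_res = pixel_vdm.sample(prompt, resolution=(64, 40))

    # Stage 2: "expert translation" with a latent-space diffusion model
    # upsamples the clip and removes low-resolution artifacts.
    z = encode(low_res)                          # map frames into latent space
    z_hr = latent_sr_vdm.sample(prompt, init=z)  # latent-space super-resolution
    return decode(z_hr)                          # decode high-resolution frames
```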

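The "UNet block of Show-1" caption above describes inserting temporal layers into a 2D UNet block and training only those new layers while the spatial layers stay frozen; the abstract's motion customization via temporal attention finetuning relies on the same pattern. Below is a minimal PyTorch sketch of that pattern, assuming a (B, C, T, H, W) tensor layout and showing only a temporal attention layer (the paper also adds temporal convolutions); none of the module names come from the released code.

```python
# Minimal sketch (not the official Show-1 code): a new temporal attention layer
# is appended to a frozen, pre-trained 2D spatial block. The (B, C, T, H, W)
# layout and module names are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention applied over the time axis only."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                   # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        tokens = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        out = out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
        return x + out                                      # residual connection

class SpatioTemporalBlock(nn.Module):
    """Wraps a pre-trained 2D block; only the new temporal layer is trainable."""
    def __init__(self, spatial_block, dim):
        super().__init__()
        self.spatial = spatial_block                        # spatial layers, frozen
        self.temporal = TemporalAttention(dim)              # added temporal layer
        for p in self.spatial.parameters():
            p.requires_grad = False

    def forward(self, x):                                   # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        frames = self.spatial(frames)                       # per-frame 2D processing
        x = frames.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        return self.temporal(x)
```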



Figure 3: Timeline of the established video-language understanding methods (TVR: Text-video retrieval, VC: video captioning, VQA: video question answering, TF: Transformer, LLM: large language model). From left to right, our legend table follows the order: pre-Transformer (Pre-TF), task-specific Transformer, multi-task Transformer, and LLM-augmented architectures.
Figure 4: Illustration of video-language understanding Transformer-based architectures.
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

June 2024 · 16 Reads

Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and perceive the world around us. There has been a lot of interest in creating video-language understanding systems with human-like senses since a video-language pair can mimic both our linguistic medium and visual environment with temporal dynamics. In this survey, we review the key tasks of these systems and highlight the associated challenges. Based on the challenges, we summarize their methods from model architecture, model training, and data perspectives. We also conduct performance comparison among the methods, and discuss promising directions for future research.





Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task

June 2023 · 4 Reads · 16 Citations

Proceedings of the AAAI Conference on Artificial Intelligence

VQA is an ambitious task aiming to answer any image-related question. However, in reality, it is hard to build such a system once and for all, since users' needs are continuously updated and the system has to implement new functions. Thus, Continual Learning (CL) ability is a must in developing advanced VQA systems. Recently, a pioneering work split a VQA dataset into disjoint answer sets to study this topic. However, CL on VQA involves not only the expansion of label sets (new Answer sets). It is crucial to study how to answer questions when deploying VQA systems to new environments (new Visual scenes) and how to answer questions requiring new functions (new Question types). Thus, we propose CLOVE, a benchmark for Continual Learning On Visual quEstion answering, which contains scene- and function-incremental settings for the two aforementioned CL scenarios. In terms of methodology, the main difference between CL on VQA and CL on classification is that the former additionally involves expanding and preventing forgetting of reasoning mechanisms, while the latter focuses on class representation. Thus, we propose a real-data-free, replay-based method tailored for CL on VQA, named Scene Graph as Prompt for Symbolic Replay. Using a piece of a scene graph as a prompt, it replays pseudo scene graphs to represent past images, along with correlated QA pairs. A unified VQA model is also proposed to utilize the current and replayed data to enhance its QA ability. Finally, experimental results reveal challenges in CLOVE and demonstrate the effectiveness of our method. Code and data are available at https://github.com/showlab/CLVQA.
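
As a rough illustration of the replay scheme in the abstract (pseudo scene graphs and correlated QA pairs replayed alongside the current task's data), here is a hedged Python sketch. The scene_graph_generator, qa_generator, and vqa_model objects are hypothetical components, not the released CLVQA code.

```python
# Hedged sketch of a real-data-free symbolic replay loop, in the spirit of the
# abstract above. All component names are hypothetical, not the CLVQA release.
import random

def train_task(vqa_model, current_data, scene_graph_generator, qa_generator,
               graph_prompts):
    replay_data = []
    for prompt in graph_prompts:                              # pieces of past scene graphs
        pseudo_graph = scene_graph_generator.sample(prompt)   # replay a past scene
        replay_data.extend(qa_generator.sample(pseudo_graph)) # correlated QA pairs

    # Mix replayed symbolic data with the current task's real data so the unified
    # VQA model learns the new task while retaining earlier reasoning skills.
    mixed = list(current_data) + replay_data
    random.shuffle(mixed)
    for batch in mixed:
        vqa_model.training_step(batch)
```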


Citations (8)


... With the advancement of generative models, there is growing interest in leveraging these powerful techniques to synthesize dynamic animations from static images, guided by motion features extracted from various user inputs. These inputs can be sparse, such as text prompts [7,41,71], trajectories [42,55,62,67,78], or camera movements [62,73], or dense, like reference videos [28,63,80]. To achieve controllable dynamics, prior works often incorporate ControlNet [76] into image or video generative models during the decoding stage, utilizing motion features such as Canny edges, depth maps [17], 2D Gaussian maps [67], and optical flow maps [55]. ...

Reference:

PhysAnimator: Physics-Guided Generative Cartoon Animation
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
  • Citing Chapter
  • October 2024

... Recently, diffusion models have emerged as the most popular paradigm in text-to-video generation [1,3,5,9,16,44-46]. Make-A-Video [29] is first trained on labeled images and then on unlabeled videos to address the issue of the lack of paired video-text data. ...

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

International Journal of Computer Vision

... To allow for more detailed control over object motion and camera movements, additional guidance in the form of motion trajectories [6,28,55,74], pose [77], depth [21], and optical flow [17] has been integrated into video diffusion models to produce more controllable videos. These powerful video diffusion models have also been applied to various downstream tasks, such as video editing [16,35], image animation [12,66,71], video understanding [38,54,61], video interpolation [27,70] and 3D reconstruction and generation [8,18,36,43,60]. Nevertheless, these data-driven approaches usually produce artifacts due to a lack of geometric understanding and physical constraints. ...

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
  • Citing Conference Paper
  • June 2024

... Videop2p [198], 2023, Diffusion Model (U-Net); Dreamix [199], 2023, Diffusion Model (U-Net); DynVideo [200], 2023, Diffusion Model (U-Net); Anyv2v [201], 2023, Diffusion Model (U-Net); MagicCrop [202], 2023, Diffusion Model (U-Net); ControlAVideo [203], 2023, Diffusion Model (U-Net); CCedit [204], 2024 ...

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing
  • Citing Conference Paper
  • June 2024

... For OAK evaluations, we use Faster R-CNN [46], a popular two-stage object detector. We initialize the ResNet-50 [27] backbone with the backbone of the final checkpoint of the streaming SSL model, and fine-tune the entire model on OAK with IID training for 10 epochs, following the training configurations of [63]. ...

Label-Efficient Online Continual Object Detection in Streaming Video
  • Citing Conference Paper
  • October 2023
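
The OAK evaluation recipe quoted above (Faster R-CNN with a ResNet-50 backbone initialized from the streaming SSL checkpoint, then fine-tuned end to end for 10 epochs) roughly corresponds to the torchvision sketch below. The checkpoint path, class count, data loader, and state-dict key layout are all assumptions for illustration.

```python
# Sketch of the recipe in the excerpt: initialize Faster R-CNN's ResNet-50
# backbone from a streaming-SSL checkpoint, then fine-tune the whole detector
# on OAK for 10 epochs. Paths, class count, and loader are assumptions.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

def finetune_on_oak(oak_loader, ssl_ckpt="ssl_backbone_final.pth", num_classes=81):
    model = fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None,
                                    num_classes=num_classes,
                                    trainable_backbone_layers=5)
    state = torch.load(ssl_ckpt, map_location="cpu")
    # strict=False tolerates SSL-only keys (e.g. a projection head) with no match.
    model.backbone.body.load_state_dict(state, strict=False)

    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    for _ in range(10):                            # 10 IID epochs, as in the excerpt
        for images, targets in oak_loader:         # hypothetical OAK data loader
            loss_dict = model(images, targets)     # torchvision returns a loss dict
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```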

... However, they require adapting text-to-image models for video generation via frame propagation [16,85] or cross-frame attention [32,79], resulting in subpar temporal consistency compared to those based on VDMs. ...

(Fig. 3 caption from the citing work) Different designs of spatial-temporal modules include: (a) the original spatial module from U-Net; (b) a temporal module added within the spatial module, which hinders the image ControlNet; and (c) a temporal module (temporal attention layer) appended after the spatial module, allowing image ControlNet functionality but failing to produce high-quality, dynamic videos with text-only conditioning. In contrast, our MVB block is conditioned on both image and text in a motion-aware manner and includes a spatiotemporal attention layer, which allows the image ControlNet to generate videos of high quality with dynamic motion.

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
  • Citing Conference Paper
  • October 2023

... IncCLIP [108] enhances continuous visual-language pretraining by generating negative sample texts, thus improving the model's robustness when facing new tasks. SGP [109] uses scene graphs as prompts and integrates them with language models to enhance continual learning in visual question answering tasks. ...

Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task
  • Citing Article
  • June 2023

Proceedings of the AAAI Conference on Artificial Intelligence

... By combining several LoRA modules, each fine-tuned to distinct aspects of a task, the model can more effectively manage complex, multifaceted requirements, using the specific strengths of each LoRA module to enhance the overall performance and adaptability of the model. However, when multiple LoRA modules merge, traditional merging techniques [8][9][10] often fail to capture the precise user intentions from textual prompts, as observed in practices where LoRA modules have been integrated into complex image generation tasks without considering the interactions between various model adaptations. This oversight can lead to images that do not align with user expectations, especially as the number of integrated LoRA modules increases. ...

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models
  • Citing Preprint
  • May 2023
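
The traditional weight-space merging the excerpt refers to amounts to adding each low-rank update back into the base weight, roughly W_merged = W0 + sum_i alpha_i * B_i A_i. A minimal, library-agnostic sketch of that baseline follows; the shapes and scaling factors are illustrative assumptions.

```python
# Minimal sketch of naive LoRA merging in weight space:
#   W_merged = W0 + sum_i alpha_i * (B_i @ A_i)
# This is the baseline the excerpt says can blur user intent as more modules
# are combined. Shapes and scaling factors are illustrative, not from a library.
import torch

def merge_loras(base_weight, lora_pairs, alphas):
    """base_weight: (out, in); lora_pairs: list of (B: (out, r), A: (r, in))."""
    merged = base_weight.clone()
    for (B, A), alpha in zip(lora_pairs, alphas):
        merged += alpha * (B @ A)              # add each low-rank update
    return merged

# Toy usage: merge two rank-4 adapters into one 768x768 projection weight.
W0 = torch.randn(768, 768)
adapters = [(torch.randn(768, 4), torch.randn(4, 768)) for _ in range(2)]
W = merge_loras(W0, adapters, alphas=[0.7, 0.7])
```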