Xiaogang Wang’s research while affiliated with The University of Hong Kong and other places


Publications (478)


Demystify Transformers & Convolutions in Modern Image Deep Networks
  • Article

December 2024 · 11 Reads · 6 Citations · IEEE Transactions on Pattern Analysis and Machine Intelligence

Min Shi · Weiyun Wang · [...] · Jifeng Dai

Vision transformers have gained popularity recently, leading to the development of new vision backbones with improved features and consistent performance gains. However, these advancements are not solely attributable to novel feature transformation designs; certain benefits also arise from advanced network-level and block-level architectures. This paper aims to identify the real gains of popular convolution and attention operators through a detailed study. We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach, known as the “spatial token mixer” (STM). To facilitate an impartial comparison, we introduce a unified architecture to neutralize the impact of divergent network-level and block-level designs. Subsequently, various STMs are integrated into this unified framework for comprehensive comparative analysis. Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs, but performance differences persist among different STMs. Our detailed analysis also reveals various findings about different STMs, including effective receptive fields, invariance, and adversarial robustness tests.
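
As a rough illustration of the unified-architecture idea, the sketch below plugs different spatial token mixers (STMs) into one shared block scaffold so that only the mixer varies between compared designs. The module names, layer choices, and shapes are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class UnifiedBlock(nn.Module):
    """Shared norm/FFN scaffolding with a pluggable spatial token mixer (STM)."""
    def __init__(self, dim, stm: nn.Module, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.stm = stm                      # e.g. self-attention or a depth-wise convolution
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                   # x: (B, N, C) token sequence
        x = x + self.stm(self.norm1(x))     # only the STM differs between compared designs
        x = x + self.ffn(self.norm2(x))
        return x

# Two hypothetical STM choices sharing the same interface:
class AttentionSTM(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class DepthwiseConvSTM(nn.Module):
    def __init__(self, dim, k=7, hw=(14, 14)):
        super().__init__()
        self.hw = hw
        self.conv = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)
    def forward(self, x):
        B, N, C = x.shape
        H, W = self.hw                      # assumes N == H * W
        y = self.conv(x.transpose(1, 2).reshape(B, C, H, W))
        return y.flatten(2).transpose(1, 2)
```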


Figure 4. Cosine similarity of visual features between generation and understanding tasks across different layers. The representations of the image understanding and generation tasks are similar in shallow layers but disentangle in deeper layers.
Figure 5. Attention map visualization of understanding and generation tasks. In the second and fourth rows, we visualize a query token (red) and its attended tokens (blue) in the input image. Each token corresponds to a horizontal rectangular area in the original image due to the 2 × 4 token folding. Darker blue indicates larger attention weights.
Figure 6. Qualitative results of image generation. The images are of size 512 × 512.
Evaluation of text-to-image generation on GenEval [25] benchmark. #A-Params denotes the number of activated parameters during inference. † indicates models with external pretrained diffusion model. Obj.: Object. Attri.: Attribution.
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
  • Preprint
  • File available

December 2024 · 13 Reads

The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding while reducing training complexity. After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models shall be released.
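
The token folding mechanism can be pictured as merging each small neighborhood of patch tokens into a single token before the sequence enters the LLM, shortening the sequence enough to handle high-resolution inputs. The sketch below assumes a simple concatenate-and-project folding with a 2 × 4 window (as in the figure captions above); it is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenFolding(nn.Module):
    """Merge each fh x fw neighborhood of patch tokens into one token,
    reducing the sequence length by a factor of fh * fw."""
    def __init__(self, dim, fold_hw=(2, 4)):
        super().__init__()
        self.fh, self.fw = fold_hw
        self.proj = nn.Linear(dim * self.fh * self.fw, dim)

    def forward(self, x, hw):
        B, N, C = x.shape
        H, W = hw                                            # token grid size, N == H * W
        x = x.reshape(B, H // self.fh, self.fh, W // self.fw, self.fw, C)
        x = x.permute(0, 1, 3, 2, 4, 5)                      # group each fh x fw window together
        x = x.reshape(B, (H // self.fh) * (W // self.fw), self.fh * self.fw * C)
        return self.proj(x)                                  # one folded token per window

# e.g. 1024 patch tokens (32 x 32 grid) -> 128 folded tokens
tokens = torch.randn(1, 32 * 32, 768)
folded = TokenFolding(768)(tokens, hw=(32, 32))
print(folded.shape)  # torch.Size([1, 128, 768])
```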


TCFormer: Visual Recognition via Token Clustering Transformer

July 2024 · 6 Reads · 1 Citation

Transformers are widely used in computer vision and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, a fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens. Through extensive experimentation across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of our TCFormer. The code and models for this work are available at https://github.com/zengwang430521/TCFormer.
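
To illustrate the idea of merging semantically similar tokens into dynamic tokens, the following sketch clusters token features with plain k-means and represents each cluster by its mean. TCFormer's actual clustering procedure differs, so treat this only as a conceptual stand-in.

```python
import torch

def cluster_tokens(x, num_clusters, iters=10):
    """Merge tokens with similar features into dynamic tokens via simple k-means.
    x: (N, C) token features -> (num_clusters, C) merged tokens and (N,) assignments."""
    N, C = x.shape
    centers = x[torch.randperm(N)[:num_clusters]].clone()   # random initial centers
    for _ in range(iters):
        assign = torch.cdist(x, centers).argmin(dim=1)       # nearest center per token
        for k in range(num_clusters):
            mask = assign == k
            if mask.any():
                centers[k] = x[mask].mean(dim=0)             # dynamic token = cluster mean
    return centers, assign

tokens = torch.randn(196, 64)          # e.g. a 14 x 14 grid of tokens
merged, assign = cluster_tokens(tokens, num_clusters=49)
print(merged.shape, assign.shape)      # torch.Size([49, 64]) torch.Size([196])
```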


TCFormer: Visual Recognition via Token Clustering Transformer

July 2024 · 17 Reads · 3 Citations · IEEE Transactions on Pattern Analysis and Machine Intelligence

Transformers are widely used in computer vision and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, a fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens. Through extensive experimentation across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of our TCFormer. The code and models for this work are available at https://github.com/zengwang430521/TCFormer.




Weak Augmentation Guided Relational Self-Supervised Learning

May 2024 · 3 Reads · 62 Citations · IEEE Transactions on Pattern Analysis and Machine Intelligence

Self-supervised learning (SSL), including mainstream contrastive learning, has achieved great success in learning visual representations without data annotations. However, most methods focus on instance-level information (i.e., different augmented views of the same instance should have the same features or cluster into the same class) and pay little attention to the relationships between different instances. In this paper, we introduce a novel SSL paradigm, which we term the relational self-supervised learning (ReSSL) framework, that learns representations by modeling the relationships between different instances. Specifically, our proposed method employs a sharpened distribution of pairwise similarities among different instances as the relation metric, which is then used to match the feature embeddings of different augmentations. To boost performance, we argue that weak augmentations matter for representing a more reliable relation, and we leverage a momentum strategy for practical efficiency. The designed asymmetric predictor head and an InfoNCE warm-up strategy enhance robustness to hyper-parameters and benefit the resulting performance. Experimental results show that our proposed ReSSL substantially outperforms state-of-the-art methods across different network architectures, including various lightweight networks (e.g., EfficientNet and MobileNet).
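
The relation-matching objective can be sketched as follows: similarities of a weakly and a strongly augmented view against a memory bank are turned into distributions, and the weak-view distribution is sharpened with a lower temperature and used as the target for the strong view. The temperature values and the exact loss form below are assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def ressl_relation_loss(z_strong, z_weak, queue, t_s=0.1, t_w=0.04):
    """Relational loss sketch in the spirit of ReSSL.
    z_strong, z_weak: (B, C) embeddings of two augmentations; queue: (K, C) memory bank."""
    z_strong = F.normalize(z_strong, dim=1)
    z_weak = F.normalize(z_weak, dim=1)
    queue = F.normalize(queue, dim=1)
    logits_s = z_strong @ queue.t()                 # (B, K) similarities to the bank
    logits_w = z_weak @ queue.t()
    p_w = F.softmax(logits_w / t_w, dim=1)          # sharper target from the weak view
    log_p_s = F.log_softmax(logits_s / t_s, dim=1)
    return -(p_w * log_p_s).sum(dim=1).mean()       # cross-entropy between relation distributions

loss = ressl_relation_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(4096, 128))
print(loss.item())
```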


Phased Consistency Model

May 2024 · 65 Reads

The consistency model (CM) has recently made significant progress in accelerating the generation of diffusion models. However, its application to high-resolution, text-conditioned image generation in the latent space (a.k.a., LCM) remains unsatisfactory. In this paper, we identify three key flaws in the current design of LCM. We investigate the reasons behind these limitations and propose the Phased Consistency Model (PCM), which generalizes the design space and addresses all identified limitations. Our evaluations demonstrate that PCM significantly outperforms LCM across 1–16 step generation settings. While PCM is specifically designed for multi-step refinement, its 1-step generation results are superior or comparable to those of previous state-of-the-art methods designed specifically for 1-step generation. Furthermore, we show that PCM's methodology is versatile and applicable to video generation, enabling us to train a state-of-the-art few-step text-to-video generator. More details are available at https://g-u-n.github.io/projects/pcm/.
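
A minimal way to picture the "phased" design: the diffusion trajectory is cut into a few sub-trajectories, and consistency is enforced only within each phase, so every timestep is mapped to the boundary of its own phase rather than all the way to the data. The uniform split below is an assumption made purely for illustration.

```python
import torch

def phase_boundaries(num_train_steps=1000, num_phases=4):
    """Cut the timestep range [0, T] into `num_phases` sub-trajectories (assumed uniform)."""
    return torch.linspace(0, num_train_steps, num_phases + 1).long()  # e.g. [0, 250, 500, 750, 1000]

def phase_target(t, edges):
    """Return the lower boundary of the phase containing each timestep t,
    i.e. the point the consistency mapping targets within that phase."""
    idx = torch.bucketize(t, edges, right=True) - 1
    return edges[idx.clamp(min=0)]

edges = phase_boundaries()
print(phase_target(torch.tensor([10, 300, 999]), edges))  # tensor([  0, 250, 750])
```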


Cached Transformers: Improving Transformers with Differentiable Memory Cache

March 2024 · 5 Reads · 5 Citations · Proceedings of the AAAI Conference on Artificial Intelligence

This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing for exploring long-range dependencies. By utilizing a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks, including language modeling, machine translation, ListOps, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and displays the ability to be applied to a broader range of situations.
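
A rough sketch of gated recurrent cached attention: queries attend over the concatenation of a persistent cache and the current tokens, and the cache is then updated by a learned gate. The shapes, the gating form, and the cache summary used below are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class GRCAttention(nn.Module):
    """Attention over [cache; current tokens] with a gated recurrent cache update (sketch)."""
    def __init__(self, dim, heads=8, cache_len=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)
        self.to_cache = nn.Linear(dim, dim)
        self.cache_len = cache_len

    def forward(self, x, cache=None):
        B, N, C = x.shape
        if cache is None:
            cache = x.new_zeros(B, self.cache_len, C)
        kv = torch.cat([cache, x], dim=1)                 # attend to cached and current tokens
        out, _ = self.attn(x, kv, kv)
        # recurrent gated update of the cache with a summary of the current tokens
        summary = self.to_cache(x.mean(dim=1, keepdim=True)).expand(-1, self.cache_len, -1)
        g = torch.sigmoid(self.gate(cache))
        new_cache = g * cache + (1 - g) * summary
        return out, new_cache

layer = GRCAttention(dim=256)
y, cache = layer(torch.randn(2, 50, 256))                 # first step: empty cache
y2, cache = layer(torch.randn(2, 50, 256), cache)         # later steps reuse the cache
```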


RNNPose: 6-DoF Object Pose Estimation Via Recurrent Correspondence Field Estimation and Pose Optimization

January 2024 · 71 Reads · 9 Citations · IEEE Transactions on Pattern Analysis and Machine Intelligence

6-DoF object pose estimation from a monocular image is a challenging problem, where a post-refinement procedure is generally needed for high-precision estimation. In this paper, we propose a framework, dubbed RNNPose, based on a recurrent neural network (RNN) for object pose refinement, which is robust to erroneous initial poses and occlusions. During the recurrent iterations, object pose refinement is formulated as a non-linear least squares problem based on the estimated correspondence field (between a rendered image and the observed image). The problem is then solved by a differentiable Levenberg-Marquardt (LM) algorithm enabling end-to-end training. The correspondence field estimation and pose refinement are conducted alternately in each iteration to improve the object poses. Furthermore, to improve the robustness against occlusion, we introduce a consistency-check mechanism based on the learned descriptors of the 3D model and observed 2D images, which downweights the unreliable correspondences during pose optimization. We evaluate RNNPose on several public datasets, including LINEMOD, Occlusion-LINEMOD, YCB-Video and TLESS. We demonstrate state-of-the-art performance and strong robustness against severe clutter and occlusion in the scenes. Extensive experiments validate the effectiveness of our proposed method. Besides, the extended system based on RNNPose successfully generalizes to multi-instance scenarios and achieves top-tier performance on the TLESS dataset.
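
The core refinement step described above is a damped least-squares update on the 6-DoF pose computed from correspondence-field residuals. A generic differentiable Levenberg-Marquardt step of that kind might look like the following; this is a sketch under the stated assumptions, not the paper's code.

```python
import torch

def lm_step(residuals, jacobian, damping=1e-3):
    """One damped Gauss-Newton / Levenberg-Marquardt update on a 6-DoF pose.
    residuals: (M,) correspondence residuals, jacobian: (M, 6) w.r.t. the pose parameters."""
    JtJ = jacobian.t() @ jacobian                         # (6, 6) approximate Hessian
    Jtr = jacobian.t() @ residuals                        # (6,) gradient term
    H = JtJ + damping * torch.eye(6, dtype=JtJ.dtype)     # damped normal equations
    delta = torch.linalg.solve(H, Jtr)
    return -delta                                         # pose increment that reduces the residuals

r = torch.randn(100)
J = torch.randn(100, 6)
print(lm_step(r, J))
```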


Citations (54)


... The key innovation was putting the transformer architecture to work in computer vision tasks, and the ViT architecture has since been applied in a variety of vision tasks with excellent performance. 5) Dai et al. [5] designed a unified architecture to provide a fair comparison for traditional and modern spatial token mixers. ...

Reference:

UniNeXt: Exploring A Unified Architecture for Vision Recognition
Demystify Transformers & Convolutions in Modern Image Deep Networks
  • Citing Article
  • December 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... InterGen [27] proposes the use of cooperative denoisers in the recently introduced InterHuman dataset. MoMat-MoGen [7] extends the retrieval diffusion model from [63] for human interactions. More recently, methods such as in2IN [39] and InterMask [22] propose further improvements to enhance the generation of human interactions, achieving state-of-the-art performance. ...

Digital Life Project: Autonomous 3D Characters with Social Intelligence
  • Citing Conference Paper
  • June 2024

... They primarily focus on identifying correct answers, such as bounding boxes and objects, without requiring coordinating multiple multimodal abilities. Second, while multimodal tasks in open-world settings [20,26,29] involve complex environments and objectives, they emphasize final task completion, often measured by success rate [18]. This results in a lack of profound analysis over the reasoning process, leading to potentially inaccurate assessments of multimodal reasoning capabilities. ...

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
  • Citing Conference Paper
  • June 2024

... However, the fixed token distribution ignores the semantic information of different areas of the image, which leads to performance degradation. To address this issue, in 2024, Wang Zeng et al. proposed the Token Clustering Transformer (TCFormer) [20]. This method dynamically generates visual tokens based on semantic information, allowing regions with similar semantics to be represented by the same token, even if these regions are not adjacent. ...

TCFormer: Visual Recognition via Token Clustering Transformer
  • Citing Article
  • July 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... Recent advancements in contrastive learning (CL) and masked image modeling (MIM) have achieved significant success. The emergence of momentum contrast [30] and SimCLR [11] has spurred extensive research on CL [10,13,27,68,85,88]. With the advent of vision transformer [17] and the inspiration of masked language modeling [14], concurrent works such as bidirectional encoder representation from image transformers [6], masked autoencoders [31], and Sim-MIM [74] have demonstrated the effectiveness of MIM. ...

Weak Augmentation Guided Relational Self-Supervised Learning
  • Citing Article
  • May 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... Caching is one of the most important tasks that need to be handled by modern-day content delivery networks (CDNs), and its role is expected to grow even further in the future, e.g., in the context of wireless communications (Liu et al. 2016;Paschos et al. 2018) and artificial intelligence. For the latter, it is crucial to cache: trained models for inference requests (Salem et al. 2023;Zhu et al. 2023;Yu et al. 2024), information that can accelerate the training of large models (Lindgren et al. 2021;Zhang et al. 2024), and trained model parts in distributed learning paradigms (Thapa et al. 2022;Tirana et al. 2024). Typically, in a caching network, one needs to decide where to store the contents in order to maximize metrics related to, e.g., network performance or user experience, with cache hit rate being the most predominant one (Paschos, Iosifidis, and Caire 2020). ...

Cached Transformers: Improving Transformers with Differentiable Memory Cache
  • Citing Article
  • March 2024

Proceedings of the AAAI Conference on Artificial Intelligence

... RNNPose [86], published in 2024, presents a recurrent neural network-based framework for 6D object pose refinement that iteratively optimizes poses using a differentiable Levenberg-Marquardt algorithm, leveraging descriptor-based consistency checks for robustness against occlusions and erroneous initial poses, validated on multiple public datasets. ...

RNNPose: 6-DoF Object Pose Estimation Via Recurrent Correspondence Field Estimation and Pose Optimization

IEEE Transactions on Pattern Analysis and Machine Intelligence

... The advent of large language models (LLMs) such as Chat-GPT (Schulman et al., 2022) has profoundly reshaped the trajectory of AGI development, showcasing exceptional zero-shot reasoning capabilities in addressing various NLP tasks via user-defined prompts or language instructions. Traditional vision foundation models typically follow a pretraining and fine-tuning paradigm Chen et al., 2022;Su et al., 2023;Wang et al., 2023b;Tao et al., 2023). While effective, this approach incurs significant marginal costs when adapting to diverse downstream tasks. ...

Siamese Image Modeling for Self-Supervised Vision Representation Learning
  • Citing Conference Paper
  • June 2023

... In contrast, deep learning-based denoising methods, which learn the prior knowledge of images from large-scale data, can improve the quality of denoised images to a certain extent [5,6]. Various deep learning methods have been developed for image denoising [7][8][9]. For instance, Zhang et al. proposed three classic neural network denoising models: denoising convolutional neural networks (DnCNNs) [10], fast and flexible DnCNNs (FFDNets) [11], and convolutional blind denoising networks (CBDNet) [12]. ...

Real-Time Controllable Denoising for Image and Video
  • Citing Conference Paper
  • June 2023

... Please again see Tab. 1 for the off-the-shelf methods used. Please note that the MB removal [18] and flow map estimation [11] require frames adjacent to the current input frame as input. Since we deal with video pass-through mixed reality, we assume adjacent frames are implicitly available. ...

A Simple Baseline for Video Restoration with Grouped Spatial-Temporal Shift
  • Citing Conference Paper
  • June 2023