Xinlei Chen’s research while affiliated with Meta and other places


Publications (59)


Figure 2 Generation-only training vs. Joint training with other data. Training solely on generation data results in inferior performance. Joint training with additional data enables visual generation with only 5k generation data and yields high-quality outputs with 200k generation data.
Figure 3 Impact of different data types on visual generation. The baseline of training on only visual generation data is shown in red; joint training with other data in yellow; joint training with visual understanding data in green; and training with all data in blue. Joint training with additional data improves the baseline, with visual understanding tasks contributing the most to enhancing visual generation.
Figure 8 shows that General, Vision-Centric, and Text&Chart VQA tasks strongly correlate with generation performance, each with a Pearson correlation coefficient (ρ) above 0.85. High-Resolution VQA exhibits moderate correlation, with ρ around 0.7. In contrast, Knowledge VQA tasks, such as MMMU, show weak correlation with generation performance. These findings suggest that generation ability aligns more closely with the model's vision capabilities rather than knowledge-specific tasks.
Figure 13 Examples of MetaMorph (II). We showcase more examples of MetaMorph's capabilities: answering questions and transforming images in one conversation (left), generating images (top-right), and leveraging knowledge in LLMs to generate rare concepts (bottom-right).
Results of training solely on generation data vs. joint training with additional data. These results correspond to Figure 2. Joint training with additional data significantly improves generation performance. At 5,000 samples, the model begins to generate reasonably accurate visual tokens, indicating that visual generation is an ability unlocked through the learning of other tasks.
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
  • Preprint
  • File available

December 2024 · 67 Reads · David Fan · Jiachen Zhu · [...] · Zhuang Liu

In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong "prior" vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.
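As a rough illustration of the mixed objective the abstract describes (not the authors' implementation), the sketch below combines a cross-entropy loss on discrete text tokens with a cosine-distance regression on continuous visual tokens. The function name, tensor shapes, mask, and loss weight are assumptions made for this example.

```python
import torch.nn.functional as F

def vpit_style_loss(text_logits, text_targets, visual_preds, visual_targets,
                    is_visual, visual_weight=1.0):
    """Hypothetical mixed objective: cross-entropy on discrete text tokens plus
    a cosine-distance regression on continuous visual tokens.

    text_logits:    (B, T, V)  language-model logits
    text_targets:   (B, T)     token ids (only used at text positions)
    visual_preds:   (B, T, D)  predicted continuous visual tokens
    visual_targets: (B, T, D)  target visual embeddings
    is_visual:      (B, T)     boolean mask marking visual-token positions
    """
    # Standard next-token cross-entropy on text positions.
    text_loss = F.cross_entropy(text_logits[~is_visual], text_targets[~is_visual])
    # Regression (cosine distance) on visual-token positions.
    cos = F.cosine_similarity(visual_preds[is_visual], visual_targets[is_visual], dim=-1)
    visual_loss = (1.0 - cos).mean()
    return text_loss + visual_weight * visual_loss
```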


Figure 1: Overview of a Pixel Transformer (PiT), which is used to investigate the role of locality. Given an image, we simply treat it as a set of pixels. In addition, we employ randomly initialized and learnable position embeddings without any information about 2D structure, thereby removing the remaining locality inductive bias from previous work (e.g., ViT [23]). These representations are then fed into the Transformer, which performs set operations in its interleaved Self-Attention and MLP blocks (only one of each is shown for clarity). We demonstrate the versatility of PiT with three case studies, covering both discriminative and generative tasks.
Figure 3: Qualitative results for case study #3 (image generation). These 256×256 samples are from PiT-L trained on ImageNet, following the same architecture design and generation protocol as that of DiT [51]. PiT generations include fine features and are of reasonably similar quality to the DiT generations with locality [51].
Figure 6: Mean attention distances in late, middle, and early layers between PiT and ViT. This metric can be interpreted as the receptive field size for Transformers. The distance is normalized by the image size, and sorted based on the distance value for different attention heads from left to right.
Figure 7: Mean attention offsets in late, middle, and early layers between PiT and ViT. This metric measures the deviation of the attention map from the current token location. The offset is normalized with the image size, and sorted based on the distance value for different attention heads from left to right.
Figure 8: Figure-ground segmentation in early layers of PiT. In each row, we show the original image and the attention maps of two selected early layers (from first to fourth). We use the central pixel in the image space as the query and visualize its attention maps. Structures that can capture the foreground of objects have already emerged in these layers, which prepares for the later layers to learn higher-order relationships.
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

June 2024 · 160 Reads

This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias -- locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision.
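For intuition, here is a minimal sketch of pixel-level tokenization as the abstract describes it, in contrast to ViT-style 16x16 patch embedding: every pixel becomes a token, and the learnable position embeddings carry no 2D structure. The module name, sizes, and initialization below are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class PixelTokenizer(nn.Module):
    """Illustrative pixels-as-tokens front end (cf. PiT); sizes are assumptions."""
    def __init__(self, img_size=28, in_chans=3, dim=192):
        super().__init__()
        self.proj = nn.Linear(in_chans, dim)          # per-pixel linear projection
        # Randomly initialized, learnable position embeddings with no 2D prior.
        self.pos = nn.Parameter(torch.randn(img_size * img_size, dim) * 0.02)

    def forward(self, x):                             # x: (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)         # (B, H*W, C): one token per pixel
        return self.proj(tokens) + self.pos[: H * W]  # ready for a vanilla Transformer
```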



Improving Selective Visual Question Answering by Learning from Your Peers

June 2023 · 13 Reads

Despite advances in Visual Question Answering (VQA), the ability of models to assess their own correctness remains underexplored. Recent work has shown that VQA models, out-of-the-box, can have difficulties abstaining from answering when they are wrong. The option to abstain, also called Selective Prediction, is highly relevant when deploying systems to users who must trust the system's output (e.g., VQA assistants for users with visual impairments). For such scenarios, abstention can be especially important as users may provide out-of-distribution (OOD) or adversarial inputs that make incorrect answers more likely. In this work, we explore Selective VQA in both in-distribution (ID) and OOD scenarios, where models are presented with mixtures of ID and OOD data. The goal is to maximize the number of questions answered while minimizing the risk of error on those questions. We propose a simple yet effective Learning from Your Peers (LYP) approach for training multimodal selection functions for making abstention decisions. Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model. It does not require additional manual labels or held-out data and provides a signal for identifying examples that are easy/difficult to generalize to. In our extensive evaluations, we show this benefits a number of models across different architectures and scales. Overall, for ID, we reach 32.92% in the selective prediction metric coverage at 1% risk of error (C@1%) which doubles the previous best coverage of 15.79% on this task. For mixed ID/OOD, using models' softmax confidences for abstention decisions performs very poorly, answering <5% of questions at 1% risk of error even when faced with only 10% OOD examples, but a learned selection function with LYP can increase that to 25.38% C@1%.
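The coverage-at-risk metric (C@1%) quoted above can be computed roughly as follows: answer questions in order of decreasing confidence and report the largest fraction that can be answered while the error rate on answered questions stays at or below 1%. This is an illustrative sketch, not the authors' evaluation script; the function name and inputs are assumptions.

```python
import numpy as np

def coverage_at_risk(confidences, correct, max_risk=0.01):
    """Largest answerable fraction (most confident first) with risk <= max_risk."""
    order = np.argsort(-np.asarray(confidences))          # sort by confidence, descending
    correct = np.asarray(correct, dtype=float)[order]
    errors = np.cumsum(1.0 - correct)                     # cumulative wrong answers
    answered = np.arange(1, len(correct) + 1)
    risk = errors / answered                              # running error rate
    ok = np.where(risk <= max_risk)[0]
    return 0.0 if len(ok) == 0 else (ok[-1] + 1) / len(correct)
```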


Figure 1: Regions are a key concept in adapting general machine learning paradigms to important vision tasks like object detection. Left: from supervised classification to region-based learning in the R-CNN series [29]. Middle: from inter-image contrast to region-level, intra-image contrast as explored in self-supervised pre-training [36]. Right: while being more effective [44], how to use region information in reconstructive pre-training remains under-explored. We aim to close this gap.
R-MAE: Regions Meet Masked Autoencoders

June 2023 · 78 Reads

Vision-specific concepts such as "region" have played a key role in extending general machine learning frameworks to tasks like object detection. Given the success of region-based detectors for supervised learning and the progress of intra-image methods for contrastive learning, we explore the use of regions for reconstructive pre-training. Starting from Masked Autoencoding (MAE) both as a baseline and an inspiration, we propose a parallel pre-text task tailored to address the one-to-many mapping between images and regions. Since such regions can be generated in an unsupervised way, our approach (R-MAE) inherits the wide applicability from MAE, while being more "region-aware". We conduct thorough analyses during the development of R-MAE, and converge on a variant that is both effective and efficient (1.3% overhead over MAE). Moreover, it shows consistent quantitative improvements when generalized to various pre-training data and downstream detection and segmentation benchmarks. Finally, we provide extensive qualitative visualizations to enhance the understanding of R-MAE's behaviour and potential. Code will be made available at https://github.com/facebookresearch/r-mae.
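A hedged sketch of the overall idea: keep the standard MAE reconstruction loss on masked patches and add a parallel reconstruction term for the unsupervised region maps. The exact region branch, masking scheme, and loss weights in R-MAE differ; everything below is illustrative.

```python
import torch.nn.functional as F

def rmae_style_loss(pixel_pred, pixel_target, region_pred, region_target,
                    mask, region_weight=1.0):
    """Illustrative region-aware MAE objective (not the paper's exact losses).

    pixel_pred/pixel_target: (B, L, P) per-patch pixel values
    region_pred/region_target: (B, L)  binary region-membership maps (logits/targets)
    mask: (B, L) 1 for masked patches, 0 for visible ones
    """
    # Standard MAE: mean-squared error, computed on masked patches only.
    pixel_loss = (((pixel_pred - pixel_target) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
    # Parallel pretext task: reconstruct the (unsupervised) region map.
    region_loss = F.binary_cross_entropy_with_logits(region_pred, region_target)
    return pixel_loss + region_weight * region_loss
```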




Figure 1. ConvNeXt V2 model scaling. The ConvNeXt V2 model, which has been pre-trained using our fully convolutional masked autoencoder framework, performs significantly better than the previous version across a wide range of model sizes.
End-to-end IN-1K fine-tuning setting for Tiny model.
ImageNet-1K fine-tuning results with a single 224×224 crop. The improvement over the V1 supervised model is shown in parentheses.
ImageNet-22K intermediate fine-tuning results with a single 224×224 crop. The improvement over the V1 supervised model is shown in parentheses.
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

January 2023 · 662 Reads · 7 Citations

Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
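The Global Response Normalization (GRN) layer is described as an aggregate-normalize-calibrate operation over channels. A minimal channels-last sketch following that description (not the released ConvNeXt V2 code) might look like this.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization sketch: aggregate, normalize, calibrate.
    Assumes channels-last tensors of shape (N, H, W, C)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)      # global L2 per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)    # divisive normalization
        return self.gamma * (x * nx) + self.beta + x            # calibrate + residual
```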


UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding

December 2022 · 6 Reads

Performing 3D dense captioning and visual grounding requires a common and shared understanding of the underlying multimodal relationships. However, despite some previous attempts at connecting these two related tasks with highly task-specific neural modules, it remains understudied how to explicitly capture their shared nature and learn them simultaneously. In this work, we propose UniT3D, a simple yet effective fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning. UniT3D enables learning a strong multimodal representation across the two tasks through a supervised joint pre-training scheme with bidirectional and seq-to-seq objectives. With a generic architecture design, UniT3D allows expanding the pre-training scope to a wider variety of training sources, such as data synthesized from 2D prior knowledge, to benefit 3D vision-language tasks. Extensive experiments and analysis demonstrate that UniT3D obtains significant gains for 3D dense captioning and visual grounding.


EurNet: Efficient Multi-Range Relational Modeling of Spatial Multi-Relational Data

November 2022 · 6 Reads

Modeling spatial relationships in the data remains critical across many different tasks, such as image classification, semantic segmentation and protein structure understanding. Previous works often use a unified solution like relative positional encoding. However, there exist different kinds of spatial relations, including short-range, medium-range and long-range relations, and modeling them separately can better capture the focus of different tasks on the multi-range relations (e.g., short-range relations can be important in instance segmentation, while long-range relations should be upweighted for semantic segmentation). In this work, we introduce EurNet for Efficient multi-range relational modeling. EurNet constructs a multi-relational graph, where each type of edge corresponds to short-, medium- or long-range spatial interactions. In the constructed graph, EurNet adopts a novel modeling layer, called gated relational message passing (GRMP), to propagate multi-relational information across the data. GRMP captures multiple relations within the data with little extra computational cost. We study EurNet in two important domains: image and protein structure modeling. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation verify the gains of EurNet over the previous SoTA FocalNet. On the EC and GO protein function prediction benchmarks, EurNet consistently surpasses the previous SoTA GearNet. Our results demonstrate the strength of EurNet in modeling spatial multi-relational data from various domains. The implementations of EurNet for image modeling are available at https://github.com/hirl-team/EurNet-Image . The implementations for other applied domains/tasks will be released soon.
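A heavily hedged sketch of gated multi-relational message passing in the spirit of GRMP: one aggregation per edge type (short-, medium-, long-range) combined through a learned gate with a residual update. The gating mechanism, aggregation, and shapes here are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GatedRelationalMessagePassing(nn.Module):
    """Illustrative multi-relational message passing with per-relation gating."""
    def __init__(self, dim, num_relations=3):
        super().__init__()
        self.relation_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_relations)])
        self.gate = nn.Linear(dim, num_relations)

    def forward(self, x, adjacency):
        # x: (N, D) node features; adjacency: list of (N, N) matrices, one per relation.
        gates = torch.softmax(self.gate(x), dim=-1)               # per-node relation weights
        out = torch.zeros_like(x)
        for r, (proj, adj) in enumerate(zip(self.relation_proj, adjacency)):
            deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
            msg = proj(adj @ x) / deg                              # mean-aggregate per relation
            out = out + gates[:, r:r + 1] * msg
        return x + out                                             # residual update
```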


Citations (31)


... 3D Question Answering requires a model equipped with a language decoder to answer questions regarding the visual context in the given 3D scene [2,41]. Several works focus on addressing the problems of 3D-language pre-alignment [13,25], designing adapter layers [9,21], or building 3D synthetic data [60]. In contrast, our work focuses on designing an encoding approach that can enhance 3D-language understanding by capturing fine-grained details. ...

Reference:

PerLA: Perceptive 3D Language Assistant
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
  • Citing Conference Paper
  • October 2023

... The use of additional binary classifiers, so-called selectors, was investigated for text QA (Kamath et al., 2020; Desai and Durrett, 2020; Jiang et al., 2021). This approach was extended to VQA by Whitehead et al. (2022) and Dancette et al. (2023). They focus on fine-tuned models and the use of selectors. ...

Improving Selective Visual Question Answering by Learning from Your Peers
  • Citing Conference Paper
  • June 2023

... From a general point of view, we take advantage of depth-wise convolutions and grouped point-wise convolutions, inspired by [17] to reduce the number of convolutional layer parameters. We harness the Global Response Normalization layer [50] to promote feature diversity and increase contrast and selectivity along channels, enabling the model to identify and differentiate more effectively between relevant features. In addition, dense connections [19] are used to encourage feature reuse and better gradient propagation. ...

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
  • Citing Conference Paper
  • June 2023

... As depicted in Fig. 1, the model architecture consists of four main modules: Backbone, Encoder, Decoder, and Head. The process begins with the Backbone module, where a 224×224 image is input and feature extraction is performed using the ConvNeXt V2-T network [7]. This network processes the image through four stages, generating four sets of feature maps at varying scales. ...

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

... Duan et al. [29] observed that AI research tasks have grown increasingly complex as they progress toward practical technology. They range from the lower level of visual navigation, including point navigation [30], object navigation [31], and vision-and-language navigation [32], up to embodied question answering (EQA), including multi-target question answering (MQA) [33] and interactive question answering (IQA) [34]. In this work, we discuss in which research tasks the synthetic datasets mentioned in the above section are utilized. ...

Multi-Target Embodied Question Answering
  • Citing Article
  • April 2019

... Region-wise feature representations have been shown to be very useful in considering contexts for downstream tasks such as semantic segmentation and detection (He et al., 2017; Zhang et al., 2020; Bai et al., 2022). In our proposed geometric region-level foreground-aware contrast, we obtain regions by leveraging off-the-shelf point cloud over-segmentation techniques (Guo et al., 2014). ...

Point-Level Region Contrast for Object Detection Pre-Training
  • Citing Conference Paper
  • June 2022

... MAEs are well-regarded for their ability to capture global context and semantic relationships through self-supervised learning, particularly by predicting masked portions of input data. Unlike contrastive learning methods that prioritize representational learning and perform well in linear evaluation setups, MAEs generally require fine-tuning to achieve optimal performance on downstream tasks, as presented in [20]. These characteristics make MAEs advantageous for pre-training deep learning models and improving their performance in machine learning applications. ...

Masked Autoencoders Are Scalable Vision Learners
  • Citing Conference Paper
  • June 2022

... The perturbation occurs only during training without any additional inference overhead. At first glance, our method bears some resemblance to siamese learning, where different transformed data are fed into an ANN to produce consistent representations (Chen and He, 2021; Wang et al., 2022). However, siamese learning relies on complicated data augmentation strategies, whereas our method is more straightforward and versatile by exploiting the inherent temporal properties of SNNs. ...

On the Importance of Asymmetry for Siamese Representation Learning
  • Citing Conference Paper
  • June 2022

... Transformers in Images [15][16][17][18][19] have recently demonstrated remarkable success in natural image understanding tasks, including classification, detection, and segmentation. Whether pre-trained in a supervised manner on ImageNet or through self-supervised learning, these models have achieved performance that is comparable to, and often surpasses, that of CNN-based pre-trained models with a similar number of parameters. ...

An Empirical Study of Training Self-Supervised Vision Transformers
  • Citing Conference Paper
  • October 2021

... For our object detection experiments, we used the Faster R-CNN implementation available from [46] based on the ResNet-50 backbone presented in [47]. To employ an anomaly detection task in the Faster R-CNN baseline, the architecture is supplemented by a one-class classification module based on the predicted object labels, as shown in Fig. 7. ...

Benchmarking Detection Transfer Learning with Vision Transformers