Kaiming He’s research while affiliated with Meta and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (116)


Exploring Simple Siamese Representation Learning
  • Conference Paper

June 2021 · 82 Reads · 4,276 Citations

Xinlei Chen · Kaiming He

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

April 2021 · 28 Reads · 1 Citation

Haoqi Fan · [...] · Kaiming He

We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart. Code is made available at https://github.com/facebookresearch/SlowFast
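The objective described in this abstract is, at its core, a Siamese-style similarity loss between two clips drawn from the same video. The following is a minimal, illustrative PyTorch sketch of that idea, not the paper's code (which is in the SlowFast repository linked above); the toy encoder, clip shapes, and the plain cosine loss are assumptions for demonstration only.

```python
# Minimal sketch: two clips from the same video form a positive pair and their
# embeddings are pulled together (temporally persistent features).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipEncoder(nn.Module):
    """Toy stand-in for a video backbone (e.g., a 3D CNN) plus projection head."""
    def __init__(self, in_dim=3 * 8 * 32 * 32, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, embed_dim))
    def forward(self, clip):
        return self.net(clip)

def persistence_loss(encoder, clip_a, clip_b):
    """Negative cosine similarity between embeddings of two clips of one video."""
    z_a = F.normalize(encoder(clip_a), dim=1)
    z_b = F.normalize(encoder(clip_b), dim=1)
    return -(z_a * z_b).sum(dim=1).mean()

# Usage with random tensors standing in for two clips sampled from one video
# (batch of 4 clips, 3 channels, 8 frames, 32x32 pixels).
encoder = ClipEncoder()
clip_a = torch.randn(4, 3, 8, 32, 32)
clip_b = torch.randn(4, 3, 8, 32, 32)
loss = persistence_loss(encoder, clip_a, clip_b)
print(loss.item())
```

In the study, the two clips may be sampled far apart in time (up to about 60 seconds), which is what the "long-spanned persistency" observation refers to.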


[Figure 7, from the paper below: Different self-supervised learning frameworks perform differently between R-50 [20] (x-axis) and ViT-B [15] (y-axis); the numbers are ImageNet linear probing accuracy from Table 4.]
An Empirical Study of Training Self-Supervised Visual Transformers
  • Preprint
  • File available

April 2021 · 292 Reads · 2 Citations

This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Visual Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failures, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.
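The figure caption above mentions ImageNet linear probing accuracy, the standard way these self-supervised models are scored. Below is a minimal sketch of that protocol under stated assumptions (a stand-in backbone and random data), not the paper's evaluation code: the pretrained backbone is frozen and only a linear classifier on top of its features is trained.

```python
# Linear probing sketch: frozen backbone, trainable linear classifier.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))  # stand-in for a pretrained ViT/ResNet
for p in backbone.parameters():
    p.requires_grad = False  # the backbone stays frozen during probing

classifier = nn.Linear(256, 10)  # the only trainable part
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)          # random stand-in for ImageNet images
labels = torch.randint(0, 10, (8,))

backbone.eval()
with torch.no_grad():
    features = backbone(images)             # frozen features
logits = classifier(features)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```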


Exploring Simple Siamese Representation Learning

November 2020 · 105 Reads · 5 Citations

Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our "SimSiam" method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning. Code will be made available.
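The abstract's key ingredient, the stop-gradient, is easiest to see in code. Below is a minimal PyTorch sketch of the structure described above (two augmented views, a shared encoder, a predictor on one branch, a detached target on the other); the tiny encoder and predictor are illustrative placeholders, not the paper's architecture.

```python
# SimSiam-style sketch: symmetrized negative cosine loss with stop-gradient.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))      # backbone + projector stand-in
predictor = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 256))

def neg_cosine(p, z):
    # z.detach() is the stop-gradient that the abstract identifies as
    # essential for avoiding collapsing solutions
    return -F.cosine_similarity(p, z.detach(), dim=1).mean()

x1 = torch.randn(8, 3, 32, 32)  # two augmentations of the same batch of images
x2 = torch.randn(8, 3, 32, 32)

z1, z2 = encoder(x1), encoder(x2)
p1, p2 = predictor(z1), predictor(z2)
loss = 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)  # symmetrized loss
loss.backward()
```

Note that no negative pairs, large batches, or momentum encoder appear anywhere in this loss, which is exactly the point the abstract makes.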


Are Labels Necessary for Neural Architecture Search?

October 2020 · 21 Reads · 69 Citations

Lecture Notes in Computer Science

Existing neural network architectures in computer vision—whether designed by humans or by machines—were typically found using both images and their associated labels. In this paper, we ask the question: can we find high-quality neural architectures using only images, but no human-annotated labels? To answer this question, we first define a new setup called Unsupervised Neural Architecture Search (UnNAS). We then conduct two sets of experiments. In sample-based experiments, we train a large number (500) of diverse architectures with either supervised or unsupervised objectives, and find that the architecture rankings produced with and without labels are highly correlated. In search-based experiments, we run a well-established NAS algorithm (DARTS) using various unsupervised objectives, and report that the architectures searched without labels can be competitive to their counterparts searched with labels. Together, these results reveal the potentially surprising finding that labels are not necessary, and the image statistics alone may be sufficient to identify good neural architectures.
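The sample-based experiment boils down to comparing two rankings of the same architectures. Here is a small sketch of that comparison using Spearman rank correlation; the accuracy numbers are invented for illustration, and the specific correlation measure is an assumption rather than a claim about the paper's exact metric.

```python
# Compare how similarly two objectives rank the same set of architectures.
from scipy.stats import spearmanr

# hypothetical accuracies of five architectures under each objective
supervised_acc   = [72.1, 74.3, 69.8, 75.0, 71.2]
unsupervised_acc = [55.4, 58.0, 52.9, 58.7, 54.8]

rho, p_value = spearmanr(supervised_acc, unsupervised_acc)
print(f"Spearman rank correlation: {rho:.3f} (p={p_value:.3g})")
# A rho close to 1 means the label-free ranking orders architectures almost
# the same way as the supervised one, which is the paper's central finding.
```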


Graph Structure of Neural Networks

July 2020 · 222 Reads

Neural networks are often represented as graphs of connections between neurons. However, despite their wide use, there is currently little understanding of the relationship between the graph structure of the neural network and its predictive performance. Here we systematically investigate how the graph structure of neural networks affects their predictive performance. To this end, we develop a novel graph-based representation of neural networks called a relational graph, where layers of neural network computation correspond to rounds of message exchange along the graph structure. Using this representation we show that: (1) a "sweet spot" of relational graphs leads to neural networks with significantly improved predictive performance; (2) a neural network's performance is approximately a smooth function of the clustering coefficient and average path length of its relational graph; (3) our findings are consistent across many different tasks and datasets; (4) the sweet spot can be identified efficiently; (5) top-performing neural networks have graph structures surprisingly similar to those of real biological neural networks. Our work opens new directions for the design of neural architectures and the understanding of neural networks in general.
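The two graph measures the abstract singles out, clustering coefficient and average path length, are standard quantities. A short sketch of computing them with networkx follows; the Watts-Strogatz graph here is only a placeholder standing in for a relational graph, not the paper's construction.

```python
# Compute the (C, L) coordinates the paper relates to predictive performance.
import networkx as nx

g = nx.connected_watts_strogatz_graph(n=64, k=8, p=0.2, seed=0)  # placeholder graph

clustering = nx.average_clustering(g)              # C: clustering coefficient
path_length = nx.average_shortest_path_length(g)   # L: average path length
print(f"clustering coefficient C = {clustering:.3f}, "
      f"average path length L = {path_length:.3f}")
# The paper reports that performance varies smoothly over the (C, L) plane and
# that a "sweet spot" in this plane yields the best-performing networks.
```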


Citations (88)


... As these models are trained to learn visual representations aligned with the language representation, most MLLMs [1,41,46,47,86] adopt these vision encoders to process the image input. However, instance-level contrastive learning suffers from feature suppression, where the model learns only the dominant features in the data while neglecting other valuable features [12,42,44,60,66,69,75,82]. In other words, the model creates so-called simple shortcut features and decision rules that do not consider all available distinguishing features. ...

Reference:

COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Scaling Language-Image Pre-Training via Masking
  • Citing Conference Paper
  • June 2023

... The integration of augmented reality and computer vision holds promise for object detection [11]. A non-hierarchical Vision Transformer (ViT) serves as the primary object detection network [12], while an enhanced multiscale feature fusion method improves small object detection performance [13]. A Polar Transformer (Polar-Former) enhances the accuracy of 3D object detection in a bird's-eye-view system [14]. ...

Exploring Plain Vision Transformer Backbones for Object Detection
  • Citing Chapter
  • November 2022

Lecture Notes in Computer Science

... The attention blocks within SAM, pre-trained with MAE [35], encapsulate a wealth of insights for token embedding analysis, making them pivotal components within the SAM framework. To harness these insights while facilitating memory-efficient training, we choose to preserve the parameters in attention blocks. ...

Masked Autoencoders Are Scalable Vision Learners
  • Citing Conference Paper
  • June 2022
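The excerpt above describes keeping SAM's MAE-pretrained attention blocks fixed to allow memory-efficient training. A generic PyTorch sketch of that parameter-freezing pattern follows; the tiny transformer encoder and the task head are stand-ins, not SAM's actual modules.

```python
# Freeze attention-block parameters; only the remaining parts stay trainable.
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)
head = nn.Linear(256, 10)  # hypothetical task-specific head, left trainable

for name, param in encoder.named_parameters():
    if "self_attn" in name:          # parameters belonging to attention blocks
        param.requires_grad = False  # frozen: no gradients, no optimizer state

trainable = [p for p in list(encoder.parameters()) + list(head.parameters())
             if p.requires_grad]
print(f"trainable tensors: {len(trainable)}")
```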

... Among various neural networks that have been successfully applied in structural engineering [26][27][28][29][30], an auto-encoder is a specialized deep-learning architecture designed to learn a compact representation of data that encodes the most meaningful information [31]. The authors postulate that the learned compact data representation of an auto-encoder architecture will filter out noise, anomalies, redundant information, and other spurious artifacts. ...

Masked Autoencoders As Spatiotemporal Learners
  • Citing Preprint
  • May 2022

... The final Vision Transformer representation z_L consists of (N + 1) tokens of shape D. In high-level perception tasks such as image classification, the most common strategy is to use only the [cls] token output of the final ViT block (z_{L,0}) as the representation of the entire image, which serves as an input to the classifier [14,24,63]. The same approach is used in JEA pretraining, where the invariance objective is imposed on the [cls] representations (typically followed by a projector network [12,16]), while patch tokens are discarded [14,19]. An alternative strategy is to summarize the image representation as the average value of patch tokens, i.e. ...

An Empirical Study of Training Self-Supervised Vision Transformers
  • Citing Conference Paper
  • October 2021
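The excerpt above contrasts two ways of turning a ViT's (N + 1) output tokens into a single image representation. A minimal tensor sketch of the two pooling strategies is shown below; the shapes and the random tensor standing in for z_L are assumptions for illustration.

```python
# Two pooling strategies over ViT output tokens: [cls] token vs. patch mean.
import torch

batch, num_patches, dim = 2, 196, 768
z_L = torch.randn(batch, num_patches + 1, dim)  # [cls] token followed by N patch tokens

cls_repr  = z_L[:, 0]               # strategy 1: use the [cls] token (z_{L,0})
mean_repr = z_L[:, 1:].mean(dim=1)  # strategy 2: average of the N patch tokens

print(cls_repr.shape, mean_repr.shape)  # both (batch, dim)
```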

... For our object detection experiments, we used the Faster R-CNN implementation available from [46] based on the ResNet-50 backbone presented in [47]. To employ an anomaly detection task in the Faster R-CNN baseline, the architecture is supplemented by a one-class classification module based on the predicted object labels, as shown in Fig. 7. ...

Benchmarking Detection Transfer Learning with Vision Transformers

... For both D25 and D50, 70% of the images from the in-distribution datasets were used for training, 20% for validation, and 10% for testing. An EfficientNet network, pretrained on ImageNet [100], was trained using our method called MAPLE (MAhalanobis distance based uncertainty Prediction for reLiablE classification [101]), illustrated in Fig. 6. To address high intra-class variances due to, for instance, different viewpoints from which the images were acquired, we use X-means clustering [102] to break down classes into multiple clusters, each of which contains images clustering together in the feature space of representations learned by the network. ...

Masked Autoencoders Are Scalable Vision Learners

... The key idea of SSL is to design pretext tasks that leverage data itself or its augmentation as label information. Typical pretext tasks include reconstruction and comparison, which allow models to learn useful representations for downstream tasks [35,36]. A typical SSL workflow is to leverage vast unlabeled data for pre-training, followed by supervised fine-tuning [37]. ...

Exploring Simple Siamese Representation Learning
  • Citing Conference Paper
  • June 2021

... To demonstrate this, we pretrain our model on the K400 dataset and pick checkpoints from various epochs. We then finetune these models on the UCF101 dataset to evaluate action recognition performance and report the corresponding accuracy results in Fig. 3. Given that MoCo v2 [9] achieves 84.6% on UCF101 after 50 epochs of heavy training, our method outperforms it using only 5 epochs (85%). This observation implies that our proposed method can be a strong alternative, with shorter pretraining requirements, to large-scale video representation learning methods that rely on longer pretraining. ...

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
  • Citing Conference Paper
  • June 2021

... Similar to the slice-wise transition in medical images, adjacent frames in videos could carry useful continuity information. Feichtenhofer et al. [17] selected multiple video clips within a one-minute timespan as positive pairs and found improvements across various contrastive learning frameworks and downstream tasks. Since objects can differ substantially across the spatial and temporal dimensions, spatial distance and time span alone may not be the best criteria for positive pair selection. ...

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
  • Citing Preprint
  • April 2021