Kaiming He’s research while affiliated with Meta and other places


Publications (116)


Figure 1. Feature map in the res3 block of an ImageNet-trained ResNet-50 [9] applied on a clean image (top) and on its adversarially perturbed counterpart (bottom). The adversarial perturbation was produced using PGD [16] with maximum perturbation ε = 16 (out of 256). In this example, the adversarial image is incorrectly recognized as "space heater"; the true label is "digital clock".
Figure 2. More examples similar to Figure 1. We show feature maps corresponding to clean images (top) and to their adversarially perturbed versions (bottom). The feature maps for each pair of examples are from the same channel of a res3 block in the same ResNet-50 trained on clean images. The attacker has a maximum perturbation ε = 16 in the pixel domain.
Figure 8. CAAD 2018 results of the adversarial defense track. The first-place entry is based on our method. We only show the 5 winning submissions here, out of more than 20 submissions.
Feature Denoising for Improving Adversarial Robustness
  • Preprint
  • File available

December 2018 · 331 Reads

[...] · Kaiming He

Adversarial attacks on image classification systems present challenges to convolutional networks and opportunities for understanding them. This study suggests that adversarial perturbations on images lead to noise in the features constructed by these networks. Motivated by this observation, we develop new network architectures that increase adversarial robustness by performing feature denoising. Specifically, our networks contain blocks that denoise the features using non-local means or other filters; the entire networks are trained end-to-end. When combined with adversarial training, our feature denoising networks substantially improve the state-of-the-art in adversarial robustness in both white-box and black-box attack settings. On ImageNet, under 10-iteration PGD white-box attacks where prior art has 27.9% accuracy, our method achieves 55.7%; even under extreme 2000-iteration PGD white-box attacks, our method secures 42.6% accuracy. A network based on our method was ranked first in the Competition on Adversarial Attacks and Defenses (CAAD) 2018: it achieved 50.6% classification accuracy on a secret, ImageNet-like test dataset against 48 unknown attackers, surpassing the runner-up approach by ~10%. Code and models will be made publicly available.
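
For a concrete picture of such a block, here is a minimal PyTorch sketch of one variant: a non-local means over all spatial positions with softmax (Gaussian) affinities, followed by a 1×1 convolution and a residual connection. Class and variable names are my own, and this is a sketch of the general technique, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalDenoise(nn.Module):
    """Sketch of a feature-denoising block: each spatial position is replaced
    by a weighted mean of all positions (softmax over dot-product affinities),
    then passed through a 1x1 conv and added back residually."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        flat = x.view(n, c, h * w)                            # (N, C, HW)
        # Pairwise dot-product affinities between spatial positions.
        affinity = torch.bmm(flat.transpose(1, 2), flat)      # (N, HW, HW)
        weights = F.softmax(affinity, dim=-1)                 # row-normalized
        denoised = torch.bmm(flat, weights.transpose(1, 2))   # weighted means
        denoised = denoised.view(n, c, h, w)
        return x + self.conv(denoised)                        # residual connection
```

The residual connection matters here: the block only needs to learn a correction on top of the (mostly clean) input features, which also makes it safe to insert into a pretrained backbone.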


Rethinking ImageNet Pre-training

November 2018 · 179 Reads · 3 Citations

We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization. The results are no worse than their ImageNet pre-training counterparts even when using the hyper-parameters of the baseline system (Mask R-CNN) that were optimized for fine-tuning pre-trained models, with the sole exception of increasing the number of training iterations so the randomly initialized models may converge. Training from random initialization is surprisingly robust; our results hold even when: (i) using only 10% of the training data, (ii) for deeper and wider models, and (iii) for multiple tasks and metrics. Experiments show that ImageNet pre-training speeds up convergence early in training, but does not necessarily provide regularization or improve final target task accuracy. To push the envelope we demonstrate 50.9 AP on COCO object detection without using any external data, a result on par with the top COCO 2017 competition results that used ImageNet pre-training. These observations challenge the conventional wisdom of ImageNet pre-training for dependent tasks, and we expect these discoveries will encourage people to rethink the current de facto paradigm of ‘pre-training and fine-tuning’ in computer vision.


Exploring the Limits of Weakly Supervised Pretraining (ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part II)

September 2018 · 159 Reads · 456 Citations

Lecture Notes in Computer Science

State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards “small”. Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.


Group Normalization (ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part XIII)

September 2018 · 60 Reads · 1,357 Citations

Lecture Notes in Computer Science

Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems: BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN’s usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN’s computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained on ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code.
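
The paper itself illustrates those "few lines of code" with TensorFlow-style pseudocode; a rough PyTorch equivalent is sketched below. The function name and the explicit affine parameters are mine; note that PyTorch also ships torch.nn.GroupNorm, which performs the same computation.

```python
import torch

def group_norm(x: torch.Tensor, num_groups: int, gamma: torch.Tensor,
               beta: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Group Normalization: normalize over (C // G, H, W) within each of the
    G channel groups, independently per sample (no batch statistics).
    x: (N, C, H, W); gamma, beta: learnable affine params of shape (1, C, 1, 1)."""
    n, c, h, w = x.shape
    x = x.view(n, num_groups, c // num_groups, h, w)
    mean = x.mean(dim=(2, 3, 4), keepdim=True)
    var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    x = (x - mean) / torch.sqrt(var + eps)
    x = x.view(n, c, h, w)
    return x * gamma + beta
```

Because the statistics are computed per sample rather than per batch, the result is identical whether the batch holds 2 images or 256, which is exactly why GN's accuracy is stable across batch sizes.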


Focal Loss for Dense Object Detection

July 2018 · 4,801 Reads · 10,969 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
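
The reshaped loss is FL(p_t) = −α_t (1 − p_t)^γ log(p_t), where p_t is the model's estimated probability for the ground-truth class: well-classified examples (p_t near 1) are down-weighted by the (1 − p_t)^γ factor. A minimal binary-classification sketch in PyTorch (the per-anchor wiring into RetinaNet is omitted; γ = 2 and α = 0.25 are the defaults reported in the paper):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    logits and targets share a shape; targets are floats in {0, 1}."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

With γ = 0 this reduces to ordinary (α-balanced) cross entropy; increasing γ progressively mutes the flood of easy negatives that dominates dense anchor sampling.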


GLoMo: Unsupervisedly Learned Relational Graphs as Transferable Representations

June 2018 · 129 Reads

Modern deep transfer learning approaches have mainly focused on learning generic feature vectors from one task that are transferable to other tasks, such as word embeddings in language and pretrained convolutional features in vision. However, these approaches usually transfer unary features and largely ignore more structured graphical representations. This work explores the possibility of learning generic latent relational graphs that capture dependencies between pairs of data units (e.g., words or pixels) from large-scale unlabeled data and transferring the graphs to downstream tasks. Our proposed transfer learning framework improves performance on various tasks including question answering, natural language inference, sentiment analysis, and image classification. We also show that the learned graphs are generic enough to be transferred to different embeddings on which the graphs have not been trained (including GloVe embeddings, ELMo embeddings, and task-specific RNN hidden units), or embedding-free units such as image pixels.
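
As a rough illustration of the idea, the sketch below builds a GLoMo-style affinity graph from key/query projections of unit embeddings, using a squared-ReLU, row-normalized affinity in the spirit of the paper's graph predictor. The class and parameter names, the hidden size, and the exact normalization details are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class GraphPredictor(nn.Module):
    """Sketch of a latent relational graph over data units (words, pixels):
    pairwise key/query affinities, squashed to be non-negative and sparse,
    then row-normalized so each unit's incoming weights sum to 1."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.key = nn.Linear(dim, hidden)
        self.query = nn.Linear(dim, hidden)

    def forward(self, units: torch.Tensor) -> torch.Tensor:
        # units: (batch, length, dim) embeddings of the data units
        k, q = self.key(units), self.query(units)
        scores = torch.relu(torch.bmm(q, k.transpose(1, 2))) ** 2   # (B, L, L)
        return scores / scores.sum(dim=-1, keepdim=True).clamp_min(1e-8)

# Transfer: the graph can mix embeddings it was never trained on, e.g.
#   graph = GraphPredictor(dim)(units)      # (B, L, L)
#   mixed = torch.bmm(graph, new_embeddings)  # propagate features along edges
```

The key point the abstract makes is that the graph, not the features, is the transferable object: the same affinity structure can reweight GloVe vectors, ELMo states, or raw pixels.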


Mask R-CNN

June 2018 · 1,707 Reads · 2,790 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron.
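
To make the "parallel mask branch" concrete, here is a minimal PyTorch sketch of such a head: a small FCN applied to per-RoI features (e.g., from RoIAlign) that outputs one binary-mask logit map per class, alongside the usual box/class branch. Layer counts and sizes here are illustrative, not the exact Detectron configuration.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Sketch of a Mask R-CNN-style mask branch: a few convs over RoI
    features, one upsampling step, then a per-class binary mask predictor.
    Masks are trained with a per-pixel sigmoid + BCE on the GT class only,
    decoupling mask prediction from classification."""
    def __init__(self, in_channels: int = 256, num_classes: int = 80):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)  # 14x14 -> 28x28
        self.predict = nn.Conv2d(256, num_classes, 1)  # one mask map per class

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (num_rois, in_channels, 14, 14) pooled per-region features
        x = torch.relu(self.upsample(self.convs(roi_feats)))
        return self.predict(x)  # (num_rois, num_classes, 28, 28) mask logits
```

Because the head runs only on pooled RoI features, it adds little cost on top of Faster R-CNN, which is what keeps the combined system near 5 fps.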

Citations (87)


... Images are typically resized to smaller dimensions before being input to the model for feature extraction. The CLIP model and its variants [24,36] use smaller input image sizes, typically around 224px or 336px, to reduce computational complexity and enhance model training and inference efficiency while maintaining the extraction of major image features. However, this approach reveals shortcomings when describing small objects or details, as the low-resolution input may lead to the loss of crucial detail information, limiting the model's ability to perceive fine-grained aspects of an image. ...

Reference:

EDIR: an expert method for describing image regions based on knowledge distillation and triple fusion
Scaling Language-Image Pre-Training via Masking
  • Citing Conference Paper
  • June 2023

... COCO Object Detection and Instance Segmentation. We employ ViT-Det [32] as our object detector model, which utilizes a Vision Transformer backbone to perform object detection and instance segmentation. Unless otherwise specified, we perform end-to-end fine-tuning on the COCO dataset [34], resizing the images to a resolution of 768 × 768 to expedite the fine-tuning process. ...

Exploring Plain Vision Transformer Backbones for Object Detection
  • Citing Chapter
  • November 2022

Lecture Notes in Computer Science

... Among various neural networks that have been successfully applied in structural engineering [26][27][28][29][30], an auto-encoder is a specialized deep-learning architecture designed to learn a compact representation of data that encodes the most meaningful information [31]. The authors postulate that the learned compact data representation of an auto-encoder architecture will filter out noise, anomalies, redundant information, and other spurious artifacts. ...

Masked Autoencoders As Spatiotemporal Learners
  • Citing Preprint
  • May 2022

... These early approaches extracted visual features from images using CNNs and subsequently generated natural language descriptions utilizing recurrent neural networks, specifically LSTMs. However, with the advancement of deep learning techniques, the introduction of visual language models [6,15] has significantly enhanced the performance of image captioning tasks. These visual language models have achieved significant breakthroughs in their ability to process and integrate visual and linguistic information, enabling the generation of more precise and vivid descriptions through the effective fusion of visual and textual features. ...

An Empirical Study of Training Self-Supervised Vision Transformers
  • Citing Conference Paper
  • October 2021

... For our object detection experiments, we used the Faster R-CNN implementation available from [46] based on the ResNet-50 backbone presented in [47]. To employ an anomaly detection task in the Faster R-CNN baseline, the architecture is supplemented by a one-class classification module based on the predicted object labels, as shown in Fig. 7. ...

Benchmarking Detection Transfer Learning with Vision Transformers

... For both D25 and D50, 70% of the images from the in-distribution datasets were used for training, 20% for validation, and 10% for testing. An EfficientNet network, pretrained on ImageNet [100], was trained using our method called MAPLE (MAhalanobis distance based uncertainty Prediction for reLiablE classification [101]) illustrated in Fig. 6. To address high intraclass variances due to, for instance, different viewpoints from which the images were acquired, we use X-means clustering [102] to break down classes into multiple clusters, each of which contains images clustering together in the feature space of representations learned by the network. ...

Masked Autoencoders Are Scalable Vision Learners

... Wu et al. [26] follow the principle and enhance the existing FeatureNet and MsvNet frameworks using a lightweight network technique inspired by the self-supervised learning method SimSiam [5]. This approach seeks to increase the efficiency of these networks, resulting in what Wu et al. term FeatureNetLite and MsvNet Lite. ...

Exploring Simple Siamese Representation Learning
  • Citing Conference Paper
  • June 2021

... This process utilizes a supervised training objective. Alternatively, strong representations can also be obtained by utilizing the vast amount of available unlabelled videos and employing an unsupervised training objective [23]. Subsequently, transfer learning is employed and the models are fine-tuned on smaller specialized datasets such as UCF101 [65], ActivityNet [32] and HMDB51 [46]. ...

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
  • Citing Conference Paper
  • June 2021

... Similar to the slice-wise transition in medical images, adjacent frames in videos could carry useful continuity information. Feichtenhofer et al. [17] selected multiple video clips within a one-minute timespan as positive pairs and found improvements across various contrastive learning frameworks and downstream tasks. Since objects could have substantial differences across spatial and temporal dimensions, the spatial distance and time span may not be the best criteria for positive pair selection. ...

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
  • Citing Preprint
  • April 2021