Kaiming He’s research while affiliated with Meta and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (116)


Computing Nearest-Neighbor Fields via Propagation-Assisted KD-Trees
  • Conference Paper

June 2012 · 214 Reads · 139 Citations

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Kaiming He · Jian Sun

Matching patches between two images, also known as computing nearest-neighbor fields, has proven to be a useful technique in various computer vision and graphics algorithms. But it is a computationally challenging nearest-neighbor search task, because both the query set and the candidate set are of image size. In this paper, we propose Propagation-Assisted KD-Trees to quickly compute an approximate solution. We develop a novel propagation search method for kd-trees, in which the tree nodes checked by each query are propagated from the nearby queries. This method not only avoids the time-consuming backtracking in traditional tree methods, but is also more accurate. Experiments on public data show that our method is 10-20 times faster than the PatchMatch method [4] at the same accuracy, or reduces its error by 70% at the same running time. Our method is also 2-5 times faster and more accurate than Coherency Sensitive Hashing [22], a recent state-of-the-art method.
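The intuition this abstract leans on is coherence: a good match for one patch is usually a good (shifted) match for its neighboring patch. The sketch below illustrates that propagation step in the style of the PatchMatch baseline [4] over grayscale NumPy arrays. It is not the authors' kd-tree algorithm, it omits PatchMatch's random-search step, and all function and parameter names are illustrative.

```python
import numpy as np

def nn_field(A, B, patch=7, iters=2, seed=0):
    """Approximate NN field from image A to image B via scanline propagation.
    A, B: 2-D float arrays. Returns nnf[y, x] = (by, bx) and its cost."""
    rng = np.random.default_rng(seed)
    h, w = A.shape[0] - patch + 1, A.shape[1] - patch + 1
    hb, wb = B.shape[0] - patch + 1, B.shape[1] - patch + 1

    def dist(ay, ax, by, bx):
        d = A[ay:ay + patch, ax:ax + patch] - B[by:by + patch, bx:bx + patch]
        return float((d * d).sum())

    # Random initialization of the field.
    nnf = np.stack([rng.integers(0, hb, (h, w)),
                    rng.integers(0, wb, (h, w))], axis=-1)
    cost = np.array([[dist(y, x, *nnf[y, x]) for x in range(w)]
                     for y in range(h)])

    for it in range(iters):  # alternate forward/backward scan order
        step = 1 if it % 2 == 0 else -1
        for y in range(h)[::step]:
            for x in range(w)[::step]:
                # Propagation: reuse the shifted match of the scan neighbor.
                for dy, dx in ((0, -step), (-step, 0)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        by = int(np.clip(nnf[ny, nx][0] - dy, 0, hb - 1))
                        bx = int(np.clip(nnf[ny, nx][1] - dx, 0, wb - 1))
                        c = dist(y, x, by, bx)
                        if c < cost[y, x]:
                            nnf[y, x], cost[y, x] = (by, bx), c
    return nnf, cost
```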


A Global Sampling Method for Alpha Matting
  • Conference Paper
  • Full-text available

June 2011 · 447 Reads · 346 Citations

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Kaiming He · Christoph Rhemann · [...] · Jian Sun

Alpha matting refers to the problem of softly extracting the foreground from an image. Given a trimap (specifying known foreground/background and unknown pixels), a straightforward way to compute the alpha value is to sample some known foreground and background colors for each unknown pixel. Existing sampling-based matting methods often collect samples near the unknown pixels only, and they fail if good samples cannot be found nearby. In this paper, we propose a global sampling method that uses all samples available in the image. Our global sample set avoids missing good samples. A simple but effective cost function is defined to tackle the ambiguity in the sample selection process. To handle the computational complexity introduced by the large number of samples, we pose the sampling task as a correspondence problem. The correspondence search is efficiently achieved by generalizing a randomized algorithm previously designed for patch matching [3]. A variety of experiments show that our global sampling method produces both visually and quantitatively high-quality matting results.
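To make the sampling idea concrete, here is a minimal sketch under the standard compositing model I = alpha*F + (1 - alpha)*B: for a candidate (F, B) pair, the least-squares alpha and a color-reconstruction cost follow directly from that equation. The brute-force pair search at the end is what the paper replaces with its randomized correspondence search; the function names are illustrative, and the paper's full cost also includes spatial terms.

```python
import numpy as np

def estimate_alpha(I, F, B, eps=1e-8):
    """Least-squares alpha for pixel color I given candidate foreground F
    and background B colors: alpha = <I - B, F - B> / ||F - B||^2."""
    I, F, B = (np.asarray(v, dtype=float) for v in (I, F, B))
    alpha = np.dot(I - B, F - B) / (np.dot(F - B, F - B) + eps)
    return float(np.clip(alpha, 0.0, 1.0))

def color_cost(I, F, B):
    """How well the pair (F, B) with its alpha reconstructs the color I."""
    I, F, B = (np.asarray(v, dtype=float) for v in (I, F, B))
    a = estimate_alpha(I, F, B)
    return float(np.linalg.norm(I - (a * F + (1 - a) * B))), a

def best_pair(I, fg_samples, bg_samples):
    """Exhaustive global sampling for one unknown pixel: score every
    (F, B) pair and keep the cheapest. The paper avoids this quadratic
    search with a PatchMatch-like randomized algorithm [3]."""
    best = (np.inf, 0.5, None)
    for F in fg_samples:
        for B in bg_samples:
            c, a = color_cost(I, F, B)
            if c < best[0]:
                best = (c, a, (F, B))
    return best  # (cost, alpha, (F, B))
```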


Single Image Haze Removal Using Dark Channel Prior

January 2011 · 21,418 Reads · 955 Citations

In this paper, we propose a simple but effective image prior, the dark channel prior, to remove haze from a single input image. The dark channel prior is a kind of statistic of haze-free outdoor images. It is based on a key observation: most local patches in haze-free outdoor images contain some pixels which have very low intensities in at least one color channel. Using this prior with the haze imaging model, we can directly estimate the thickness of the haze and recover a high-quality haze-free image. Results on a variety of outdoor hazy images demonstrate the power of the proposed prior. Moreover, a high-quality depth map can also be obtained as a by-product of haze removal.
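The prior itself is easy to state as code. A minimal sketch of the estimation steps the abstract describes, assuming an HxWx3 image in [0, 1]; the patch size (15), omega (0.95), and 0.1% fraction follow the published paper, but the helper names are illustrative and the soft-matting refinement of the transmission map is omitted:

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img, patch=15):
    """Per-pixel minimum over RGB, then a local minimum filter over
    patch x patch windows: the 'dark channel' of the image."""
    return minimum_filter(img.min(axis=2), size=patch)

def estimate_atmosphere(img, patch=15, frac=0.001):
    """Atmospheric light A: the brightest color among the 0.1% of
    pixels with the highest dark channel (the most haze-opaque ones)."""
    dark = dark_channel(img, patch)
    n = max(1, int(dark.size * frac))
    cand = img.reshape(-1, 3)[np.argsort(dark.ravel())[-n:]]
    return cand[np.argmax(cand.sum(axis=1))]

def estimate_transmission(img, A, omega=0.95, patch=15):
    """t(x) = 1 - omega * dark_channel(I / A): haze thickness per pixel.
    omega < 1 keeps a trace of haze for distant objects."""
    return 1.0 - omega * dark_channel(img / A, patch)
```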


[Figure captions from the article: Fig. 1, haze removal using a single image (input, output, recovered depth map); Fig. 2, the haze imaging model vs. the constant-albedo model of Fattal [10]; Fig. 4, haze-free database images with their dark channels vs. a hazy image and its dark channel; Fig. 5, statistics over 5,000 dark channels (intensity histogram, cumulative distribution, per-image averages); Fig. 6, transmission maps before and after soft matting, with the corresponding recovered images; 13 further figures not shown.]

Single Image Haze Removal Using Dark Channel Prior

August 2010 · 7,209 Reads · 7,259 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

In this paper, we propose a simple but effective image prior, the dark channel prior, to remove haze from a single input image. The dark channel prior is a kind of statistic of outdoor haze-free images. It is based on a key observation: most local patches in outdoor haze-free images contain some pixels whose intensity is very low in at least one color channel. Using this prior with the haze imaging model, we can directly estimate the thickness of the haze and recover a high-quality haze-free image. Results on a variety of hazy images demonstrate the power of the proposed prior. Moreover, a high-quality depth map can also be obtained as a by-product of haze removal.
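Given the transmission map, the final recovery step inverts the haze imaging model I(x) = J(x) t(x) + A (1 - t(x)). A short sketch complementing the dark channel snippet above; the t0 floor value follows the paper, while the function name is illustrative:

```python
import numpy as np

def recover_radiance(img, A, t, t0=0.1):
    """Invert I = J * t + A * (1 - t) for the scene radiance J:
    J = (I - A) / max(t, t0) + A. Flooring t at t0 avoids amplifying
    noise where the estimated transmission is close to zero."""
    t = np.clip(t, t0, 1.0)[..., None]  # broadcast over the color axis
    return np.clip((img - A) / t + A, 0.0, 1.0)
```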


Fast matting using large kernel matting Laplacian matrices

June 2010 · 766 Reads · 181 Citations

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Image matting is of great importance in both computer vision and graphics applications. Most existing state-of-the-art techniques rely on large sparse matrices such as the matting Laplacian [12]. However, solving these linear systems is often time-consuming, which is unfavorable for user interaction. In this paper, we propose a fast method for high-quality matting. We first derive an efficient algorithm to solve a large kernel matting Laplacian. A large kernel propagates information more quickly and may improve the matte quality. To further reduce running time, we also use adaptive kernel sizes by a kd-tree trimap segmentation technique. A variety of experiments show that our algorithm provides high-quality results and is 5 to 20 times faster than previous methods.
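The linear system in question has a standard form: with L the matting Laplacian, D a diagonal matrix marking known trimap pixels, and d their known alpha values, one solves (L + lambda*D) alpha = lambda*d. The sketch below sets up and solves that system with conjugate gradient. The toy color-affinity Laplacian is only a runnable stand-in for the real matting Laplacian of [12], and this is the generic solver, not the paper's O(1) large-kernel evaluation.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags
from scipy.sparse.linalg import cg

def toy_laplacian(img, sigma=0.1):
    """Stand-in for the matting Laplacian: a 4-neighbour graph Laplacian
    L = D - W with color-affinity weights (the real one uses local
    linear color models over windows). img: HxWx3 floats in [0, 1]."""
    h, w, _ = img.shape
    idx = np.arange(h * w).reshape(h, w)
    rows, cols, vals = [], [], []
    for dy, dx in ((0, 1), (1, 0)):  # right and down neighbours
        a, b = idx[:h - dy, :w - dx].ravel(), idx[dy:, dx:].ravel()
        diff = img[:h - dy, :w - dx] - img[dy:, dx:]
        wgt = np.exp(-(diff ** 2).sum(axis=2).ravel() / (2 * sigma ** 2))
        rows += [a, b]; cols += [b, a]; vals += [wgt, wgt]
    W = coo_matrix((np.concatenate(vals),
                    (np.concatenate(rows), np.concatenate(cols))),
                   shape=(h * w, h * w)).tocsr()
    return diags(np.asarray(W.sum(axis=1)).ravel()) - W

def solve_matte(L, trimap, lam=100.0):
    """Solve (L + lam * D) alpha = lam * d by conjugate gradient.
    trimap: 1 = known foreground, 0 = known background, 0.5 = unknown."""
    known = (trimap.ravel() == 0) | (trimap.ravel() == 1)
    D = diags(known.astype(float))
    d = (trimap.ravel() == 1).astype(float)
    alpha, _ = cg(L + lam * D, lam * d, maxiter=500)
    return np.clip(alpha.reshape(trimap.shape), 0.0, 1.0)
```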



Citations (88)


... As these models are trained to learn visual representations aligned with the language representation, most MLLMs [1,41,46,47,86] adopt these vision encoders to process the image input. However, instance-level contrastive learning suffers from feature suppression, where the model learns only the dominant features in the data while neglecting other valuable features [12,42,44,60,66,69,75,82]. In other words, the model creates so-called simple shortcut features and decision rules that do not consider all available distinguishing features. ...

Reference:

COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Scaling Language-Image Pre-Training via Masking
  • Citing Conference Paper
  • June 2023

... The integration of augmented reality and computer vision holds promise for object detection [11]. A non-hierarchical Vision Transformer (ViT) serves as the primary object detection network [12], while an enhanced multiscale feature fusion method improves small object detection performance [13]. A Polar Transformer (Polar-Former) enhances the accuracy of 3D object detection in a bird's-eye-view system [14]. ...

Exploring Plain Vision Transformer Backbones for Object Detection
  • Citing Chapter
  • November 2022

Lecture Notes in Computer Science

... The attention blocks within SAM, pre-trained with MAE [35], encapsulate a wealth of insights for token embedding analysis, making them pivotal components within the SAM framework. To harness these insights while facilitating memory-efficient training, we choose to preserve the parameters in attention blocks. ...

Masked Autoencoders Are Scalable Vision Learners
  • Citing Conference Paper
  • June 2022

... Among various neural networks that have been successfully applied in structural engineering [26][27][28][29][30], an auto-encoder is a specialized deep-learning architecture designed to learn a compact representation of data that encodes the most meaningful information [31]. The authors postulate that the learned compact data representation of an auto-encoder architecture will filter out noise, anomalies, redundant information, and other spurious artifacts. ...

Masked Autoencoders As Spatiotemporal Learners
  • Citing Preprint
  • May 2022

... The final Vision Transformer representation z_L consists of (N + 1) tokens of shape D. In high-level perception tasks such as image classification, the most common strategy is to use only the [cls] token output of the final ViT block (z_L,0) as the representation of the entire image, which serves as an input to the classifier [14,24,63]. The same approach is used in JEA pretraining, where the invariance objective is imposed on the [cls] representations (typically followed by a projector network [12,16]), while patch tokens are discarded [14,19]. An alternative strategy is to summarize the image representation as the average value of the patch tokens, i.e. ...

An Empirical Study of Training Self-Supervised Vision Transformers
  • Citing Conference Paper
  • October 2021
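The two pooling strategies described in the snippet above are a one-liner each. A minimal sketch with illustrative shapes (ViT-B/16 on a 224x224 input gives N = 196 patch tokens plus [cls]):

```python
import numpy as np

z_L = np.random.randn(197, 768)   # (N + 1) final-block tokens of width D
cls_repr = z_L[0]                 # strategy 1: the [cls] token only
avg_repr = z_L[1:].mean(axis=0)   # strategy 2: mean of the patch tokens
```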

... For our object detection experiments, we used the Faster R-CNN implementation available from [46] based on the ResNet-50 backbone presented in [47]. To employ an anomaly detection task in the Faster R-CNN baseline, the architecture is supplemented by a one-class classification module based on the predicted object labels, as shown in Fig. 7. ...

Benchmarking Detection Transfer Learning with Vision Transformers

... For both D25 and D50, 70% of the images from the in-distribution datasets were used for training, 20% for validation, and 10% for testing. An EfficientNet network, pretrained on ImageNet [100], was trained using our method called MAPLE (MAhalanobis distance based uncertainty Prediction for reLiablE classification [101]), illustrated in Fig. 6. To address high intra-class variances due to, for instance, different viewpoints from which the images were acquired, we use X-means clustering [102] to break down classes into multiple clusters, each of which contains images clustering together in the feature space of representations learned by the network. ...

Masked Autoencoders Are Scalable Vision Learners

... The key idea of SSL is to design pretext tasks that leverage the data itself or its augmentations as label information. Typical pretext tasks include reconstruction and comparison, which allow models to learn useful representations for downstream tasks [35,36]. A typical SSL workflow is to leverage vast unlabeled data for pre-training, followed by supervised fine-tuning [37]. ...

Exploring Simple Siamese Representation Learning
  • Citing Conference Paper
  • June 2021

... To demonstrate this, we pretrain our model on the K400 dataset and pick checkpoints from various epochs. We then finetune these models on the UCF101 dataset to evaluate action recognition performance and report the corresponding accuracy results in Fig. 3. Given that MoCo v2 [9] achieves 84.6% on UCF101 after 50 epochs of heavy training, our method outperforms it using only 5 epochs (85%). This observation implies that our proposed method can be a strong alternative, with shorter pretraining requirements, to large-scale video representation learning methods that rely on longer pretraining. ...

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
  • Citing Conference Paper
  • June 2021

... Similar to the slice-wise transition in medical images, adjacent frames in videos can carry useful continuity information. Feichtenhofer et al. [17] selected multiple video clips within a one-minute timespan as positive pairs and found improvements across various contrastive learning frameworks and downstream tasks. Since objects can differ substantially across the spatial and temporal dimensions, spatial distance and time span may not be the best criteria for positive pair selection. ...

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
  • Citing Preprint
  • April 2021