June 2023 · 27 Reads · 227 Citations
December 2022 · 82 Reads · 2 Citations
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and to contrast more samples per iteration with a similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling behavior of increasing the model size, data size, or training length, and report encouraging results and comparisons. We hope that our work will foster future research on scaling vision-language learning.
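As a rough illustration of the masking idea in the abstract, the sketch below randomly drops a fraction of ViT patch tokens so that the image encoder processes only the visible subset. The function name, tensor shapes, and 50% ratio are illustrative assumptions, not the paper's released code.

```python
import torch

def random_mask_patches(patch_tokens, mask_ratio=0.5):
    """FLIP-style masking sketch: randomly drop a fraction of image patch tokens
    so the image encoder only processes the visible subset.

    patch_tokens: (batch, num_patches, dim) ViT patch embeddings.
    """
    b, n, d = patch_tokens.shape
    n_keep = int(n * (1.0 - mask_ratio))

    # Per-sample random permutation of patch indices; keep the first n_keep.
    ids_keep = torch.rand(b, n, device=patch_tokens.device).argsort(dim=1)[:, :n_keep]
    visible = torch.gather(patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible                               # (batch, n_keep, dim)


# With 50% masking the image tower sees half the patches per step, freeing
# memory/compute for larger batches (i.e., more samples contrasted per iteration).
tokens = torch.randn(8, 196, 768)                # 14x14 patches of a 224px image
print(random_mask_patches(tokens, mask_ratio=0.5).shape)   # torch.Size([8, 98, 768])
```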
November 2022 · 232 Reads · 670 Citations
Lecture Notes in Computer Science
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available (https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet).
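The first observation above, a simple feature pyramid built from the single-scale ViT feature map, can be sketched roughly as follows. The layer choices, channel widths, and strides here are illustrative assumptions rather than the actual ViTDet implementation in Detectron2.

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Sketch of the idea: build multi-scale maps from the single stride-16 ViT
    feature map using up/down-sampling, without an FPN that taps intermediate
    backbone stages. Channel sizes here are illustrative."""

    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        # stride 4: two 2x deconvolutions
        self.up4 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, 2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, 2, stride=2),
        )
        # stride 8: one 2x deconvolution
        self.up8 = nn.ConvTranspose2d(dim, dim // 2, 2, stride=2)
        # stride 32: 2x max-pool
        self.down32 = nn.MaxPool2d(kernel_size=2, stride=2)
        # project every scale to a common channel width
        self.proj = nn.ModuleDict({
            "p2": nn.Conv2d(dim // 4, out_dim, 1),
            "p3": nn.Conv2d(dim // 2, out_dim, 1),
            "p4": nn.Conv2d(dim, out_dim, 1),
            "p5": nn.Conv2d(dim, out_dim, 1),
        })

    def forward(self, x):  # x: (B, dim, H/16, W/16), the last ViT feature map
        feats = {"p2": self.up4(x), "p3": self.up8(x), "p4": x, "p5": self.down32(x)}
        return {k: self.proj[k](v) for k, v in feats.items()}


pyramid = SimpleFeaturePyramid()
x = torch.randn(1, 768, 64, 64)          # e.g., a 1024px input at stride 16
print({k: tuple(v.shape) for k, v in pyramid(x).items()})
```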
June 2022 · 373 Reads · 7,056 Citations
May 2022 · 64 Reads · 3 Citations
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to information redundancy of the data. A high masking ratio leads to a large speedup, e.g., >4x in wall-clock time, or even more. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.
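A minimal sketch of the spacetime-agnostic masking described above: a 3D patch embedding turns a clip into one flat sequence of space-time tokens, and a random 90% of them are discarded before encoding. The module names and clip dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CubeEmbed(nn.Module):
    """Space-time patch ("cube") embedding sketch: a 3D conv turns a clip into
    one flat sequence of spacetime tokens. Dimensions are illustrative."""

    def __init__(self, dim=768, t_patch=2, patch=16):
        super().__init__()
        self.proj = nn.Conv3d(3, dim, kernel_size=(t_patch, patch, patch),
                              stride=(t_patch, patch, patch))

    def forward(self, clip):                     # clip: (B, 3, T, H, W)
        x = self.proj(clip)                      # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)      # (B, T'*H'*W', dim)


def keep_random_subset(tokens, mask_ratio=0.9):
    """Spacetime-agnostic masking: all tokens are treated as one set, and a
    random 10% survive; no structure is imposed on space vs. time."""
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    ids = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return torch.gather(tokens, 1, ids.unsqueeze(-1).expand(-1, -1, d))


embed = CubeEmbed()
clip = torch.randn(1, 3, 16, 224, 224)           # 16-frame, 224px clip
tokens = embed(clip)                             # (1, 8*14*14, 768) = 1568 tokens
visible = keep_random_subset(tokens)             # (1, 156, 768): ~10x fewer tokens to encode,
print(tokens.shape, visible.shape)               # which is where the wall-clock speedup comes from
```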
March 2022 · 134 Reads · 5 Citations
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 box AP on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code will be made available.
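The second observation in the abstract, window attention aided by a few cross-window propagation blocks, might be sketched as below, using plain global attention for the propagation blocks. The block layout, window size, and depth are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

def window_partition(x, win):
    """Split a (B, H, W, C) map into non-overlapping win x win windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)

def window_unpartition(windows, win, H, W):
    """Inverse of window_partition, back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // win) * (W // win))
    x = windows.view(B, H // win, W // win, win, win, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class Block(nn.Module):
    """One ViT block whose attention is either windowed or global (sketch)."""
    def __init__(self, dim=768, heads=12, win=None):
        super().__init__()
        self.win = win
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        t = self.norm(x)
        t = t.reshape(B, H * W, C) if self.win is None else window_partition(t, self.win)
        t, _ = self.attn(t, t, t)              # attention within each window (or globally)
        t = t.reshape(B, H, W, C) if self.win is None else window_unpartition(t, self.win, H, W)
        x = x + t
        return x + self.mlp(x)

# Mostly windowed blocks, with a few global ones spread evenly to propagate
# information across windows; this layout is illustrative, not the paper's.
dim, depth, global_every = 768, 12, 3
blocks = nn.ModuleList([
    Block(dim, win=None if (i + 1) % global_every == 0 else 14) for i in range(depth)
])
x = torch.randn(1, 28, 28, dim)                # stride-16 feature map of a 448px image
for blk in blocks:
    x = blk(x)
print(x.shape)                                 # torch.Size([1, 28, 28, 768])
```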
November 2021 · 291 Reads · 3 Citations
Object detection is a central downstream task used to test if pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consumption, unknown training formulae, etc.) have prevented recent studies from benchmarking detection transfer learning with standard ViT models. In this paper, we present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN. These tools facilitate the primary goal of our study: we compare five ViT initializations, including recent state-of-the-art self-supervised learning methods, supervised initialization, and a strong random initialization baseline. Our results show that recent masking-based unsupervised learning methods may, for the first time, provide convincing transfer learning improvements on COCO, increasing box AP up to 4% (absolute) over supervised and prior self-supervised pre-training methods. Moreover, these masking-based initializations scale better, with the improvement growing as model size increases.
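In spirit, the comparison above boils down to fine-tuning the same detector from different backbone initializations. The sketch below shows that setup with hypothetical checkpoint paths and placeholder builder functions (build_vit_backbone, build_mask_rcnn, fine_tune_on_coco, evaluate_box_and_mask_ap); none of these names come from the paper's code.

```python
import torch
from torch import nn

# Hypothetical sketch of the comparison setup: the same ViT backbone is
# initialized from different pre-trained checkpoints and then fine-tuned
# inside a Mask R-CNN-style detector. The checkpoint paths and the
# build_vit_backbone / build_mask_rcnn helpers below are placeholders,
# not names from the paper or any particular codebase.

INITIALIZATIONS = {
    "random": None,                             # strong random-init baseline
    "supervised_in1k": "ckpts/vit_b_sup.pth",   # supervised IN-1K classification
    "moco_v3": "ckpts/vit_b_mocov3.pth",        # contrastive self-supervised
    "mae": "ckpts/vit_b_mae.pth",               # masking-based self-supervised
}

def load_backbone(vit: nn.Module, ckpt_path):
    """Load pre-trained ViT weights if a checkpoint is given; else keep random init."""
    if ckpt_path is not None:
        state = torch.load(ckpt_path, map_location="cpu")
        missing, unexpected = vit.load_state_dict(state, strict=False)
        print(f"{ckpt_path}: {len(missing)} missing / {len(unexpected)} unexpected keys")
    return vit

# for name, ckpt in INITIALIZATIONS.items():
#     vit = build_vit_backbone()                # same architecture every run
#     detector = build_mask_rcnn(backbone=load_backbone(vit, ckpt))
#     fine_tune_on_coco(detector)               # identical recipe across inits
#     evaluate_box_and_mask_ap(detector, tag=name)

# Minimal check with the random baseline (nothing to load):
toy_backbone = load_backbone(nn.Linear(8, 8), INITIALIZATIONS["random"])
```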
November 2021 · 2,194 Reads · 33 Citations
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
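The two core designs described above, an encoder that sees only the visible patches and a lightweight decoder fed mask tokens, can be condensed into a toy model like the one below. The depths, widths, and 196-patch layout are illustrative assumptions, not the released MAE code.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Toy sketch of MAE's two core designs (sizes/depths are illustrative):
    the encoder runs only on the ~25% visible patches, and a small decoder
    gets the encoded tokens plus shared mask tokens and predicts pixels."""

    def __init__(self, num_patches=196, patch_dim=768, enc_dim=768, dec_dim=512):
        super().__init__()
        self.enc_embed = nn.Linear(patch_dim, enc_dim)
        self.enc_pos = nn.Parameter(torch.zeros(1, num_patches, enc_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, 12, 4 * enc_dim, batch_first=True),
            num_layers=4)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, 8, 4 * dec_dim, batch_first=True),
            num_layers=2)
        self.to_pixels = nn.Linear(dec_dim, patch_dim)

    def forward(self, patches, mask_ratio=0.75):
        B, N, _ = patches.shape
        n_keep = int(N * (1 - mask_ratio))
        ids_shuffle = torch.rand(B, N, device=patches.device).argsort(1)
        ids_restore = ids_shuffle.argsort(1)
        ids_keep = ids_shuffle[:, :n_keep]

        # Encoder: embed + positions, then keep only the visible subset (no mask tokens).
        x = self.enc_embed(patches) + self.enc_pos
        vis = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
        latent = self.encoder(vis)

        # Decoder: append mask tokens, unshuffle to the original order, add positions.
        y = self.enc_to_dec(latent)
        y = torch.cat([y, self.mask_token.expand(B, N - n_keep, -1)], dim=1)
        y = torch.gather(y, 1, ids_restore.unsqueeze(-1).expand(-1, -1, y.shape[-1]))
        pred = self.to_pixels(self.decoder(y + self.dec_pos))

        # Pixel reconstruction loss, computed on masked patches only.
        mask = torch.ones(B, N, device=patches.device)
        mask.scatter_(1, ids_keep, 0.0)
        return (((pred - patches) ** 2).mean(-1) * mask).sum() / mask.sum()


mae = TinyMAE()
patches = torch.randn(2, 196, 768)              # flattened 16x16x3 patches of a 224px image
print(mae(patches).item())
```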
October 2021 · 61 Reads · 1,822 Citations
June 2021 · 78 Reads · 306 Citations
... Shorter training times indicate that the model can converge to a high-performance state more quickly, while shorter testing times suggest faster real-time response speeds. Specifically, the FL-LSTM model introduces an innovative focal loss mechanism in the design of the long short-term memory (LSTM) loss function, which adjusts the standard binary classification loss by placing greater emphasis on fault samples [39]. This significantly mitigates the negative impact of data imbalance on model classification performance. ...
August 2017
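Assuming the August 2017 work cited in the snippet above is the focal loss paper, the mechanism being described is a reweighted binary cross-entropy that emphasizes hard, rare positives. A minimal sketch of binary focal loss with the usual α and γ hyperparameters follows; the FL-LSTM integration itself belongs to the citing work and is not shown.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss sketch: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    Down-weights easy, well-classified examples so rare/hard positives
    (e.g., fault samples in an imbalanced stream) dominate the gradient.
    logits, targets: tensors of the same shape; targets in {0, 1}.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()


# With gamma=2, an easy example at p_t=0.9 is scaled by (1-0.9)^2 = 0.01,
# while a hard one at p_t=0.1 keeps ~81% of its cross-entropy weight.
logits = torch.tensor([3.0, -2.0, 0.2])
targets = torch.tensor([1.0, 0.0, 1.0])
print(binary_focal_loss(logits, targets).item())
```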
... A seminal model in this area is CLIP [39], one of the first works to connect vision and language learning by training on web-scale image-text pair data. Since CLIP's introduction, a number of follow-up works have sought to improve CLIP learning from model, data, learning strategy, and other perspectives [21,22,24,45,53]. Moreover, the superb zero-shot recognition and generalizability of CLIP have been pivotal in driving the development of next-generation VLMs. ...
June 2023
... A key upgrade was the integration of new memory components into the architecture, enabling robust video segmentation support. While the original SAM relied on Vision Transformers (ViTs) [40,41] as its image encoder, SAM 2 adopted a more compact Hiera encoder [42]. This transition not only streamlined the model but also significantly reduced latency for both image and video processing tasks, marking a substantial improvement in its performance and applicability. ...
November 2022
Lecture Notes in Computer Science
... MAEs are well-regarded for their ability to capture global context and semantic relationships through self-supervised learning, particularly by predicting masked portions of input data. Unlike contrastive learning methods that prioritize representational learning and perform well in linear evaluation setups, MAEs generally require fine-tuning to achieve optimal performance on downstream tasks, as presented in [20]. These characteristics make MAEs advantageous for pre-training deep learning models and improving their performance in machine learning applications. ...
June 2022
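The distinction drawn in the snippet above, linear evaluation versus fine-tuning, amounts to whether the pre-trained encoder's parameters stay frozen. A hypothetical sketch, with a toy encoder standing in for a pre-trained backbone:

```python
import torch.nn as nn

def build_classifier(encoder: nn.Module, feat_dim: int, num_classes: int,
                     mode: str = "finetune") -> nn.Module:
    """Linear probe vs. fine-tune sketch: 'linear_probe' freezes the encoder and
    trains only a linear head; 'finetune' leaves every parameter trainable."""
    if mode == "linear_probe":
        for p in encoder.parameters():
            p.requires_grad = False
    return nn.Sequential(encoder, nn.Linear(feat_dim, num_classes))


# Toy encoder standing in for a pre-trained (e.g., MAE) backbone.
toy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())
probe = build_classifier(toy_encoder, 512, 10, mode="linear_probe")
print(sum(p.requires_grad for p in probe.parameters()))   # 2: only the head's weight and bias train
```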
... Among various neural networks that have been successfully applied in structural engineering [26][27][28][29][30], an auto-encoder is a specialized deep-learning architecture designed to learn a compact representation of data that encodes the most meaningful information [31]. The authors postulate that the learned compact data representation of an auto-encoder architecture will filter out noise, anomalies, redundant information, and other spurious artifacts. ...
May 2022
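As a generic illustration of the compact-representation idea in the snippet above, a minimal dense autoencoder might look like the sketch below; the layer sizes and the sensor-window input are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Minimal dense autoencoder sketch: a narrow bottleneck forces the network
    to keep only the most informative structure of the signal, discarding
    noise and redundancy. Layer sizes are illustrative."""

    def __init__(self, in_dim=256, bottleneck=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, bottleneck))
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 64), nn.ReLU(), nn.Linear(64, in_dim))

    def forward(self, x):
        z = self.encoder(x)                    # compact representation
        return self.decoder(z), z


model = AutoEncoder()
x = torch.randn(32, 256)                       # e.g., a window of sensor measurements
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)        # reconstruction objective
print(recon.shape, z.shape, loss.item())
```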
... Our work will not consider the successful line of prior work that studies negative-free contrastive learning methods (Grill et al., 2020; Chen et al., 2021). ...
October 2021
... For our object detection experiments, we used the Faster R-CNN implementation available from [46] based on the ResNet-50 backbone presented in [47]. To employ an anomaly detection task in the Faster R-CNN baseline, the architecture is supplemented by a one-class classification module based on the predicted object labels, as shown in Fig. 7. ...
November 2021
... Self-supervised learning demonstrates superior data efficiency. Self-supervised learning is a game-changing technique for natural language processing (NLP) [34,35]. Many well-known architectures, including BERT [34], GPT-X [35], and MAEs (Masked Autoencoders) [40], are SSL at their core. SSL also plays an important role in AlphaFold, a revolutionary AI-based protein structure predictor [25]. ...
November 2021
... Siamese networks [41], emblematic of SSL, assess two input instances through shared weights. An innovative method, SimSiam [42], increases the quality of the representation of unlabeled data by maximizing the similarity between various augmentations of the same instance. By avoiding the need for negative samples and substantial batch sizes, SimSiam simplifies the training process. ...
June 2021
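The SimSiam objective summarized in the snippet above maximizes the cosine similarity between one view's predictor output and a stop-gradient copy of the other view's projection. A compact sketch, with the encoder, projector, and predictor left abstract:

```python
import torch
import torch.nn.functional as F

def simsiam_loss(p1, z1, p2, z2):
    """SimSiam-style objective sketch: maximize cosine similarity between the
    predictor output of one view and the stop-gradient projection of the other.
    p_i: predictor outputs, z_i: projector outputs for the two augmentations."""
    def neg_cosine(p, z):
        z = z.detach()                       # stop-gradient: the key ingredient
        return -F.cosine_similarity(p, z, dim=-1).mean()
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)


# Usage sketch (encoder/projector/predictor are whatever backbone is used):
# z1, z2 = projector(encoder(aug1(x))), projector(encoder(aug2(x)))
# p1, p2 = predictor(z1), predictor(z2)
# loss = simsiam_loss(p1, z1, p2, z2)
p1, z1, p2, z2 = (torch.randn(4, 128) for _ in range(4))
print(simsiam_loss(p1, z1, p2, z2).item())
```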
... However, the high cost of annotating video data leads to innovations in unsupervised and self-supervised learning. For instance, unlabeled datasets like HowTo100M [162] spur progress in contrastive learning approaches [61,73,81], while multimodal datasets, ...
[Table 2 caption from the citing work: The journey of action recognition (Part 2): methods using alternative modalities, including skeleton-based, depth-based, infrared-based, point cloud-based, and multi-modal approaches (e.g., text or audio); columns detail learning paradigms, data modalities, and publication venues (year).]
Reference: The Journey of Action Recognition
June 2021