June 2021 · 82 Reads · 4,276 Citations
April 2021 · 28 Reads · 1 Citation
We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart. Code is made available at https://github.com/facebookresearch/SlowFast
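A rough sketch of the persistency objective described above: two clips drawn from the same video (possibly far apart in time) are treated as a positive pair under an InfoNCE-style loss. The names below are illustrative placeholders, not the released SlowFast code.

import torch
import torch.nn.functional as F

def persistency_loss(encoder, clip_a, clip_b, temperature=0.1):
    """clip_a, clip_b: (B, C, T, H, W) clips sampled from the same B videos."""
    za = F.normalize(encoder(clip_a), dim=1)   # (B, D) clip embeddings
    zb = F.normalize(encoder(clip_b), dim=1)
    logits = za @ zb.t() / temperature         # (B, B) similarity matrix
    labels = torch.arange(za.size(0), device=za.device)
    # Diagonal entries are positives (same video); off-diagonals act as negatives.
    return F.cross_entropy(logits, labels)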
April 2021 · 292 Reads · 2 Citations
This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Visual Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failures, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.
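For orientation, here is a compressed sketch of a MoCo-v3-style training step with a ViT backbone (query encoder with prediction head, momentum key encoder, symmetrized contrastive loss). It is a schematic reconstruction from the description above, not the authors' recipe; the stabilization details studied in the paper are omitted.

import torch
import torch.nn.functional as F

def ctr(q, k, tau=0.2):
    # Contrastive loss between queries and keys from the other view.
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / tau
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def moco_v3_step(f_q, f_k, predictor, x1, x2, m=0.99):
    """f_q: ViT encoder + projector; f_k: its momentum copy; predictor: MLP head."""
    q1, q2 = predictor(f_q(x1)), predictor(f_q(x2))
    with torch.no_grad():
        k1, k2 = f_k(x1), f_k(x2)
    loss = ctr(q1, k2) + ctr(q2, k1)            # symmetrized loss
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)   # momentum update of f_k
    return loss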
November 2020 · 105 Reads · 5 Citations
Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our "SimSiam" method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning. Code will be made available.
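The role of the stop-gradient can be conveyed in a few lines. The sketch below follows the components named in the abstract (backbone plus projection MLP f, prediction MLP h, negative cosine similarity); it is an illustrative reimplementation, not the released code.

import torch.nn.functional as F

def D(p, z):
    # Negative cosine similarity; detaching z implements the stop-gradient.
    return -F.cosine_similarity(p, z.detach(), dim=1).mean()

def simsiam_loss(f, h, x1, x2):
    """f: backbone + projection MLP; h: prediction MLP; x1, x2: two augmented views."""
    z1, z2 = f(x1), f(x2)
    p1, p2 = h(z1), h(z2)
    # Symmetrized loss; removing .detach() above is what leads to collapse empirically.
    return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)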
October 2020 · 21 Reads · 69 Citations
Lecture Notes in Computer Science
Existing neural network architectures in computer vision—whether designed by humans or by machines—were typically found using both images and their associated labels. In this paper, we ask the question: can we find high-quality neural architectures using only images, but no human-annotated labels? To answer this question, we first define a new setup called Unsupervised Neural Architecture Search (UnNAS). We then conduct two sets of experiments. In sample-based experiments, we train a large number (500) of diverse architectures with either supervised or unsupervised objectives, and find that the architecture rankings produced with and without labels are highly correlated. In search-based experiments, we run a well-established NAS algorithm (DARTS) using various unsupervised objectives, and report that the architectures searched without labels can be competitive to their counterparts searched with labels. Together, these results reveal the potentially surprising finding that labels are not necessary, and the image statistics alone may be sufficient to identify good neural architectures.
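The sample-based experiments boil down to comparing how the same set of architectures ranks under supervised versus unsupervised training. A minimal way to quantify that is a rank correlation, sketched below with made-up placeholder accuracies (the actual study trains 500 architectures):

from scipy.stats import spearmanr

# Hypothetical accuracies of the same five architectures under the two regimes.
supervised_acc   = [72.1, 68.4, 75.3, 70.0, 66.2]   # trained with labels
unsupervised_acc = [61.5, 57.2, 64.8, 59.1, 55.0]   # trained with a pretext objective
rho, p = spearmanr(supervised_acc, unsupervised_acc)
print(f"Spearman rank correlation: {rho:.2f} (p = {p:.3f})")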
July 2020 · 222 Reads
Neural networks are often represented as graphs of connections between neurons. However, despite their wide use, there is currently little understanding of the relationship between the graph structure of a neural network and its predictive performance. Here we systematically investigate how the graph structure of neural networks affects their predictive performance. To this end, we develop a novel graph-based representation of neural networks called relational graph, where layers of neural network computation correspond to rounds of message exchange along the graph structure. Using this representation we show that: (1) a "sweet spot" of relational graphs leads to neural networks with significantly improved predictive performance; (2) a neural network's performance is approximately a smooth function of the clustering coefficient and average path length of its relational graph; (3) our findings are consistent across many different tasks and datasets; (4) the sweet spot can be identified efficiently; (5) top-performing neural networks have graph structure surprisingly similar to those of real biological neural networks. Our work opens new directions for the design of neural architectures and the understanding of neural networks in general.
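One round of "message exchange along the graph structure" can be sketched as follows; this is an illustrative reading of the relational-graph idea, not the authors' implementation, and the single shared weight matrix is a simplification.

import numpy as np

def message_exchange_round(x, adj, W):
    """x: (n, d) node features; adj: (n, n) 0/1 adjacency with self-loops;
    W: (d, d) shared transform (the general formulation allows per-edge transforms)."""
    messages = adj @ x                     # each node aggregates its neighbours' features
    return np.maximum(messages @ W, 0.0)   # linear transform + ReLU

# Stacking several such rounds mimics stacking layers of a network whose
# neuron-to-neuron connectivity is given by `adj`.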
June 2020 · 130 Reads · 82 Citations
June 2020 · 411 Reads · 1,147 Citations
June 2020 · 714 Reads · 13,092 Citations
June 2020 · 175 Reads · 1,681 Citations
... As these models are trained to learn visual representations aligned with the language representation, most MLLMs [1,41,46,47,86] adopt these vision encoders to process the image input. However, instance-level contrastive learning suffers from feature suppression, where the model learns only the dominant features in the data while neglecting other valuable features [12,42,44,60,66,69,75,82]. In other words, the model creates so-called simple shortcut features and decision rules that do not consider all available distinguishing features. ...
June 2023
... The integration of augmented reality and computer vision holds promise for object detection [11]. A non-hierarchical Vision Transformer (ViT) serves as the primary object detection network [12], while an enhanced multiscale feature fusion method improves small object detection performance [13]. A Polar Transformer (Polar-Former) enhances the accuracy of 3D object detection in a bird's-eye-view system [14]. ...
November 2022
Lecture Notes in Computer Science
... The attention blocks within SAM, pre-trained with MAE [35], encapsulate a wealth of insights for token embedding analysis, making them pivotal components within the SAM framework. To harness these insights while facilitating memory-efficient training, we choose to preserve the parameters in attention blocks. ...
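The parameter-freezing choice described in this excerpt amounts to marking the pre-trained attention-block weights as non-trainable. A minimal PyTorch sketch, with the name filter as a placeholder for however the attention blocks are identified in the actual model:

def freeze_attention_blocks(model, keyword="attn"):
    """Keep pre-trained attention-block parameters fixed for memory-efficient training."""
    for name, param in model.named_parameters():
        if keyword in name:                # placeholder filter; adapt to the real module names
            param.requires_grad = False
    # Only the remaining trainable parameters need to be handed to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]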
June 2022
... Among various neural networks that have been successfully applied in structural engineering [26][27][28][29][30], an auto-encoder is a specialized deep-learning architecture designed to learn a compact representation of data that encodes the most meaningful information [31]. The authors postulate that the learned compact data representation of an auto-encoder architecture will filter out noise, anomalies, redundant information, and other spurious artifacts. ...
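A small auto-encoder of the kind described, learning a compact bottleneck representation; the layer sizes below are arbitrary placeholders, and the model would be trained with a reconstruction (e.g. MSE) loss.

import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=256, bottleneck=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):
        z = self.encoder(x)            # compact representation (filters out redundancy/noise)
        return self.decoder(z)         # reconstruction, compared against x during training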
May 2022
... The final Vision Transformer representation z_L consists of (N + 1) tokens of shape D. In high-level perception tasks such as image classification, the most common strategy is to use only the [cls] token output of the final ViT block (z_{L,0}) as the representation of the entire image, which serves as input to the classifier [14,24,63]. The same approach is used in JEA pretraining, where the invariance objective is imposed on the [cls] representations (typically followed by a projector network [12,16]), while patch tokens are discarded [14,19]. An alternative strategy is to summarize the image representation as the average value of the patch tokens, i.e. ...
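The two pooling strategies contrasted in this excerpt, in tensor form (the shapes below assume a 224-px ViT-B/16, i.e. 196 patch tokens plus the [cls] token at index 0):

import torch

z_L = torch.randn(8, 197, 768)          # (batch, N + 1, D) output of the last ViT block
cls_repr  = z_L[:, 0]                   # strategy 1: the [cls] token only
mean_repr = z_L[:, 1:].mean(dim=1)      # strategy 2: average of the N patch tokens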
October 2021
... For our object detection experiments, we used the Faster R-CNN implementation available from [46] based on the ResNet-50 backbone presented in [47]. To employ an anomaly detection task in the Faster R-CNN baseline, the architecture is supplemented by a one-class classification module based on the predicted object labels, as shown in Fig. 7. ...
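For reference, a detector of this kind can be instantiated as follows; torchvision is used here purely as an example implementation (the excerpt cites its own sources for the code and backbone), and the one-class anomaly module itself is not shown.

import torch
import torchvision

# Faster R-CNN with a ResNet-50 FPN backbone (weights="DEFAULT" requires torchvision >= 0.13).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

with torch.no_grad():
    outputs = model([torch.rand(3, 480, 640)])   # one dict per input image
boxes  = outputs[0]["boxes"]
labels = outputs[0]["labels"]    # predicted object labels, consumed by the one-class module downstream
scores = outputs[0]["scores"]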
November 2021
... For both D25 and D50, 70% of the images from the in-distribution datasets were used for training, 20% for validation, and 10% for testing. An EfficientNet network, pretrained on ImageNet [100], was trained using our method called MAPLE (MAhalanobis distance based uncertainty Prediction for reLiablE classification [101]) illustrated in Fig. 6. To address high intra-class variances due to, for instance, different viewpoints from which the images were acquired, we use X-means clustering [102] to break down classes into multiple clusters, each of which contains images clustering together in the feature space of representations learned by the network. ...
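A sketch of the per-cluster Mahalanobis-distance scoring implied by this description; scikit-learn's KMeans stands in for X-means (which additionally chooses the number of clusters), and the whole snippet is an illustrative reconstruction rather than the MAPLE code.

import numpy as np
from sklearn.cluster import KMeans

def fit_clusters(train_feats, n_clusters=8):
    """Cluster in-distribution training features; X-means would pick n_clusters itself."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(train_feats)
    cov = np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(train_feats.shape[1])
    return km.cluster_centers_, np.linalg.inv(cov)

def mahalanobis_score(feat, centers, cov_inv):
    """Distance to the nearest cluster centre; larger values suggest less reliable predictions."""
    diffs = centers - feat                               # (k, d)
    d2 = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)
    return float(np.sqrt(d2.min()))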
November 2021
... The key idea of SSL is to design pretext tasks that leverage data itself or its augmentation as label information. Typical pretext tasks include reconstruction and comparison, which allow models to learn useful representations for downstream tasks [35,36]. A typical SSL workflow is to leverage vast unlabeled data for pre-training, followed by supervised fine-tuning [37]. ...
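The two-stage workflow mentioned here, in schematic form; every name below is a placeholder and the optimizer bookkeeping is elided.

def ssl_workflow(unlabeled_loader, labeled_loader, encoder, pretext_loss, head, task_loss):
    # Stage 1: self-supervised pre-training on unlabeled data with a pretext objective
    # (e.g. reconstructing a corrupted input, or comparing two augmented views).
    for x in unlabeled_loader:
        pretext_loss(encoder, x).backward()
        # ... optimizer step / zero_grad ...

    # Stage 2: supervised fine-tuning of encoder + task head on the labeled set.
    for x, y in labeled_loader:
        task_loss(head(encoder(x)), y).backward()
        # ... optimizer step / zero_grad ...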
June 2021
... To demonstrate this, we pretrain our model on the K400 dataset and pick checkpoints from various epochs. We then finetune these models on the UCF101 dataset to evaluate action recognition performance, and report the corresponding accuracy results in Fig. 3. Given that MoCov2 [9] achieves 84.6% on UCF101 after 50 epochs of heavy training, our method outperforms it using only 5 epochs (85%). This observation implies that our proposed method can be a strong alternative, with shorter pretraining requirements, to large-scale video representation learning methods that rely on longer pretraining. ...
June 2021
... Similar to the slice-wise transition in medical images, adjacent frames in videos can carry useful continuity information. Feichtenhofer et al. [17] selected multiple video clips within a one-minute timespan as positive pairs and found improvements across various contrastive learning frameworks and downstream tasks. Since objects can differ substantially across the spatial and temporal dimensions, spatial distance and time span alone may not be the best criteria for positive pair selection. ...
April 2021