Conference Paper

Exploring Simple Siamese Representation Learning

... A popular class of SSL techniques comprises contrastive methods [11], [12], [13], [14], [15], which contrast positive examples against both positive and negative examples. Another set of approaches, a.k.a. non-contrastive approaches, such as [16], [17], however, show that it is possible to produce similar or superior results to the contrastive approaches without necessarily contrasting against negative examples. Tian et al. [18] provided an analysis of this latter set of approaches and showed how they actually avoid trivial solutions (representation collapse), a challenging problem for contrastive learning techniques. ...
... Frameworks based on contrastive loss such as [11], [12], [13], [14], [15] enforce invariance to data augmentation by contrasting positive examples against both positive and negative examples, with the main downside of requiring a large batch of negative pairs. This motivated approaches to SSL that primarily use only positive pairs (e.g., BYOL [16] and later SimSiam [17]), while avoiding the problem of representation collapse or trivial representations. Most recent approaches are based on enforcing invariance using redundancy reduction through whitening the embedding/latent space [19], [7]. ...
... Rather, we devise a general SSL framework composed of controlled uncertainty injection, a new architecture, and a new loss function, without needing auxiliary information. We provide extensive relevant details on clustering-based baselines [32], [33], [34] and several important loss functions, including triplet loss [35], typical contrastive loss [36], [11], [12], [13], [14], [15], and non-contrastive loss functions [16], [17], [7], [19] in the supplementary materials. ...
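The non-contrastive objective these excerpts refer to (e.g., BYOL [16], SimSiam [17]) is compact enough to sketch. Below is a minimal PyTorch rendition of a SimSiam-style symmetrized negative cosine loss with stop-gradient; the function and variable names are illustrative, not taken from the cited code.

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized negative cosine similarity with stop-gradient.

    p1, p2: predictor outputs for the two augmented views.
    z1, z2: projector outputs for the two augmented views.
    """
    def d(p, z):
        z = z.detach()  # stop-gradient: the target branch receives no gradient
        return -F.cosine_similarity(p, z, dim=-1).mean()
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)
```

Despite using only positive pairs, the stop-gradient on the target branch is what keeps this objective from collapsing to a constant output, as the excerpts above note.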
Preprint
Full-text available
Self-supervised learning (SSL) frameworks consist of a pretext task and a loss function, aiming to learn useful general features from unlabeled data. The basic idea of most SSL baselines revolves around enforcing invariance to a variety of data augmentations via the loss function. However, one main issue is that inattentive or deterministic enforcement of invariance to any kind of data augmentation is generally not only inefficient but also potentially detrimental to performance on downstream tasks. In this work, we investigate the issue from the viewpoint of uncertainty in invariance representation. Uncertainty representation is fairly under-explored in the design of SSL architectures as well as loss functions. We incorporate uncertainty representation in both the loss function and the architecture design, aiming for more data-dependent invariance enforcement. The former takes the form of data-derived uncertainty in the SSL loss function, resulting in a generative-discriminative loss function. The latter is achieved by feeding slightly different distorted versions of samples to an ensemble, aiming to learn better and more robust representations. Specifically, building upon recent methods that use hard and soft whitening (a.k.a. redundancy reduction), we introduce a new approach, GUESS, a pseudo-whitening framework composed of controlled uncertainty injection, a new architecture, and a new loss function. We include detailed results and ablation analysis establishing GUESS as a new baseline.
... Dimensional collapse, however, emerges out of highly correlated dimensions in the representation, where dimensions collapse to a single dimension (or to potentially far fewer than the actual number of dimensions). Complete collapse has been well addressed by techniques such as careful training protocols [6] and asymmetric architectural design and training protocols [7,13]. These in essence inject some variance to avoid having zero variance (complete collapse). ...
... 1. SSL: Enforcing invariance to the representation of augmented views is the driving first principle of most existing SSL approaches. This core idea has been instantiated via a variety of methods, including contrastive approaches [6], non-contrastive approaches [7,13], clustering-based methods [3,4], whitening-based techniques [11,39], etc. Along with this, there have been parallel efforts on improved augmentation protocols [32], sampling strategies [1,34,37], and robustness [29]. ...
... Baselines: For comparison, we contrast our framework against different classes of recent baselines, including contrastive, non-contrastive, clustering-based, and whitening (a.k.a. redundancy reduction) baselines, as well as baselines primarily based on vision transformers. These baselines include SimCLR [6], BYOL [13], SimSiam [7], SwAV [4], Barlow-Twins (BT) [39], Whitening-MSE (W-MSE) with d = 4 [11], and DINO [5]. ...
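The whitening/redundancy-reduction family mentioned above (e.g., Barlow-Twins [39]) admits an equally short sketch. The following PyTorch fragment is a hedged illustration of a Barlow-Twins-style objective, which pushes the cross-correlation matrix of the two views' standardized embeddings toward the identity; the function name and the trade-off weight lam are assumed values, not the cited paper's code.

```python
import torch

def redundancy_reduction_loss(z1, z2, lam=5e-3):
    """Barlow-Twins-style loss: decorrelate embedding dimensions
    while making the two views' embeddings agree."""
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)  # standardize each dimension
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.T @ z2) / n                 # d x d cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy term
    return on_diag + lam * off_diag
```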
Preprint
Full-text available
The success of self-supervised learning (SSL) has been the focus of multiple recent theoretical and empirical studies, including the role of data augmentation (in feature decoupling) as well as complete and dimensional representation collapse. While complete collapse is well studied and addressed, dimensional collapse has only gained attention and been addressed in recent years, mostly using variants of redundancy reduction (a.k.a. whitening) techniques. In this paper, we further explore a complementary approach to whitening via feature decoupling for improved representation learning while avoiding representation collapse. In particular, we perform feature decoupling by early promotion of useful features via careful feature coloring. The coloring technique is developed based on a Bayesian prior of the augmented data, which is inherently encoded for feature decoupling. We show that our proposed framework is complementary to the state-of-the-art techniques, while outperforming both contrastive and recent non-contrastive methods. We also study the different effects of the coloring approach to formulate it as a general complementary technique alongside other baselines.
... It applies to fine-grained and novel objects and could work with local models such as LLaVA [26]. Specifically, inspired by self-supervised representation learning (SSL) in computer vision, such as SimCLR [8], MoCo [17], SwAV [6], SimSiam [9], and BYOL [16], we propose to leverage a VLM to capture the inter-class difference and intra-class commonality and articulate these findings in natural language, as illustrated in Fig. 1. In particular, we prompt the VLM to describe the key difference between two images from different classes, which would preserve the discriminative features and remove the redundant ones, such as the yellow body shared by both species on the left side of Fig. 1 (a). ...
... Methods such as SimCLR [8] and MoCo [17] achieve this goal by learning to discriminate between different samples (negative pairs) and augmentations of the same sample (positive pairs). Chen et al. [9] have found that the key function of negative pairs is to prevent the model from learning collapsing features, where models produce constant or trivial outputs. As a result, subsequent works have relieved the need for negative samples by employing techniques like clustering [5,6], momentum update [16], or stop gradient [9]. ...
... These methods focus on learning the underlying shared representations between the augmentations of the same sample. ...
Preprint
Full-text available
Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representation can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller model. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks. Code is available at: https://github.com/joeyy5588/VRL/tree/main.
... In contrast, the SSL paradigm has garnered much attention in computer vision due to its cost-free annotation [13][14][15][16]. By creating label-agnostic pretext tasks, such as contrastive learning (CL) [17,18] and masked image modelling (MIM) [19], this paradigm can obtain general visual representations. CL strives to bring the representations of semantically similar inputs, such as two augmented views of the same image (i.e. ...
... SSL pre-training methods construct pretext tasks that harness the intrinsic properties of images, thereby effectively facilitating feature extraction and generating robust visual representations [15]. The main SSL pretext tasks include contrastive learning (CL) [17,18] and masked image modelling (MIM) [19]. ...
... SimCLR [17], the first CL method to achieve results comparable to supervised pre-training, utilises a plethora of augmentation strategies for data. MoCo [18] uses a sizeable queue to store negative samples, thereby enabling the intake of more negative samples for CL without the large batch sizes that SimCLR requires. Moreover, MoCo-v3 [34] extends the practice of CL from convolutional neural networks to the ViT architecture. ...
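The queue-and-momentum mechanism attributed to MoCo in this excerpt can be made concrete. The sketch below is a hedged PyTorch illustration, not MoCo's actual code: the function names and momentum value are assumptions, and the queue length is assumed to be divisible by the batch size.

```python
import torch

@torch.no_grad()
def momentum_update(q_encoder, k_encoder, m=0.999):
    """EMA update of the key encoder from the query encoder."""
    for pq, pk in zip(q_encoder.parameters(), k_encoder.parameters()):
        pk.data = m * pk.data + (1.0 - m) * pq.data

@torch.no_grad()
def dequeue_and_enqueue(queue, keys, ptr):
    """FIFO queue of negative keys: the newest batch overwrites the
    oldest entries. queue is (dim, K); keys is (bsz, dim)."""
    bsz = keys.shape[0]
    queue[:, ptr:ptr + bsz] = keys.T
    return (ptr + bsz) % queue.shape[1]  # advance the write pointer
```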
Article
Full-text available
Intelligent sorting is an important prerequisite for the full quantitative consumption and harmless disposal of kitchen waste. The existing object detection method based on an ImageNet pre‐trained model is an effective way of sorting. Owing to significant domain gaps between natural images and kitchen waste images, it is difficult to reflect the characteristics of diverse scales and dense distribution in kitchen waste based on an ImageNet pre‐trained model, leading to poor generalisation. In this article, the authors propose the first pre‐trained model for kitchen waste sorting called KitWaSor, which combines both contrastive learning (CL) and masked image modelling (MIM) through self‐supervised learning (SSL). First, to address the issue of diverse scales, the authors propose a mixed masking strategy by introducing an incomplete masking branch based on the original random masking branch. It prevents the complete loss of small‐scale objects while avoiding excessive leakage of large‐scale object pixels. Second, to address the issue of dense distribution, the authors introduce semantic consistency constraints on the basis of the mixed masking strategy. That is, object semantic reasoning is performed through semantic consistency constraints to compensate for the lack of contextual information. To train KitWaSor, the authors construct the first million‐level kitchen waste dataset across seasonal and regional distributions, named KWD‐Million. Extensive experiments show that KitWaSor achieves state‐of‐the‐art (SOTA) performance on the two most relevant downstream tasks for kitchen waste sorting (i.e. image classification and object detection), demonstrating the effectiveness of the proposed KitWaSor.
... The network is trained using global cosine similarity [12] between the features at different layers of the encoders and the features at the opposite level of the decoder. Specifically, the model is trained in a cross-reconstruction fashion, where the decoder learns to reconstruct the features of the frozen encoder starting from the ones obtained by the trainable encoder, and vice versa, using the following loss: $\mathcal{L} = -\sum_{\ell} \left\langle \frac{\mathrm{sg}(f_E^{\ell})}{\lVert \mathrm{sg}(f_E^{\ell}) \rVert_2}, \frac{f_D^{\ell}}{\lVert f_D^{\ell} \rVert_2} \right\rangle$, where $\mathrm{sg}$ is the stop-gradient operation [24] used to avoid propagating the gradient directly into the encoder, $f_E^{\ell}$ and $f_D^{\ell}$ represent the flattened features of the encoder and decoder respectively at the $\ell$-th layer, and $\langle \cdot, \cdot \rangle$ is the dot product operation. ...
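A compact PyTorch rendition of this cross-reconstruction objective is given below. It is a sketch under the definitions in the excerpt (per-layer flattened features, stop-gradient on the encoder side); the names are illustrative rather than the authors' code.

```python
import torch.nn.functional as F

def cross_reconstruction_loss(enc_feats, dec_feats):
    """Sum of negative global cosine similarities between encoder and
    decoder features at corresponding layers."""
    loss = 0.0
    for f_e, f_d in zip(enc_feats, dec_feats):
        f_e = f_e.flatten(1).detach()  # sg: no gradient flows into the encoder
        f_d = f_d.flatten(1)
        loss = loss - F.cosine_similarity(f_e, f_d, dim=1).mean()
    return loss
```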
... Experimental results (see ablation study reported in Section B of supplementary material) show that the contrastive reconstruction loss outperforms arguably simpler solutions such as SimSiam [24]. ...
... To show the impact of our contrastive approach we implemented a TTT method based on the SimSiam [24] framework. This solution only compares the features at the bottleneck level and is based on a single encoder, followed by a projection head and a predictor. ...
Preprint
Full-text available
The remarkable progress in deep learning (DL) showcases outstanding results in various computer vision tasks. However, adaptation to real-time variations in data distributions remains an important challenge. Test-Time Training (TTT) was proposed as an effective solution to this issue; it increases the generalization ability of trained models by adding an auxiliary task at train time and then using its loss at test time to adapt the model. Inspired by the recent achievements of contrastive representation learning in unsupervised tasks, we propose ReC-TTT, a test-time training technique that can adapt a DL model to new unseen domains by generating discriminative views of the input data. ReC-TTT uses cross-reconstruction as an auxiliary task between a frozen encoder and two trainable encoders, taking advantage of a single shared decoder. At test time, this makes it possible to adapt the encoders to extract features that will be correctly reconstructed by the decoder, which, in this phase, is frozen on the source domain. Experimental results show that ReC-TTT achieves better results than other state-of-the-art techniques in most domain-shift classification challenges.
... The idea is to train a model to learn how to solve a pretext/proxy task while being supervised via a signal from unlabeled data, guided by a loss function corresponding to the given SSL framework. One interpretation of SSL [2,6,19] considers instance-wise SSL frameworks as performing K-means clustering on augmented views from a sample, i.e., assigning the same centroid to the views of the same sample. There have been rich explorations of the two main components of SSL frameworks, i.e., developing better and more oriented pretext tasks as well as developing more effective loss functions. ...
... This resulted in the emergence of different types of baselines, including contrastive [5], non-contrastive [6,12], clustering-based [3,4], and hard/soft whitening (redundancy reduction) [10,30] approaches. Even though the performance of SSL frameworks is very promising, there have been arguments, in justifying their theoretical formulations, about how mutual information contributes to SSL performance. ...
Preprint
Full-text available
Self-supervised learning (SSL) has demonstrated its effectiveness in feature learning from unlabeled data. Regarding this success, there have been some arguments on the role that mutual information plays within the SSL framework. Some works argued for increasing mutual information between representations of augmented views. Others suggest decreasing mutual information between them while increasing task-relevant information. We ponder upon this debate and propose to revisit the core idea of SSL within the framework of partial information decomposition (PID). Thus, with SSL under PID, we propose to replace traditional mutual information with the more general concept of joint mutual information to resolve the argument. Our investigation of instantiating SSL within the PID framework leads to upgrading existing pipelines by considering the components of PID in SSL models for improved representation learning. Accordingly, we propose a general pipeline that can be applied to improve existing baselines. Our pipeline focuses on extracting the unique information component under PID to build upon lower-level supervision for generic feature learning, and on developing higher-level supervisory signals for task-related feature learning. In essence, this could be interpreted as a joint utilization of local and global clustering. Experiments on four baselines and four datasets show the effectiveness and generality of our approach in improving existing SSL frameworks.
... Siamese networks. Siamese networks [3] are a group of versatile models widely employed across a diverse range of applications, including object tracking [1], face verification [26], one-shot learning [17], and self-supervised learning [7,8]. These networks, characterized by weight-sharing and the comparison of entities through two inputs, have attracted significant interest due to their ability to learn meaningful representations and extract knowledge without explicit supervision. ...
... where sg(·) denotes the stop-gradient technique on the output logits. As presented in seminal works for self-supervised learning such as SimSiam [8], the stop-gradient stabilizes the optimization process and prevents the collapse of representations. Thus, we implement knowledge vaporization on the original model by minimizing L_KV for every forgetting data point x ∈ D_f. ...
Preprint
In response to the practical demands of the "right to be forgotten" and the removal of undesired data, machine unlearning emerges as an essential technique to remove the learned knowledge of a fraction of data points from trained models. However, existing methods suffer from limitations such as insufficient methodological support, high computational complexity, and significant memory demands. In this work, we propose the concepts of knowledge vaporization and concentration to selectively erase learned knowledge from specific data points while maintaining representations for the remaining data. Utilizing Siamese networks, we exemplify the proposed concepts and develop an efficient method for machine unlearning. Our proposed Siamese unlearning method does not require additional memory overhead or full access to the remaining dataset. Extensive experiments conducted across multiple unlearning scenarios showcase the superiority of Siamese unlearning over baseline methods, illustrating its ability to effectively remove knowledge from forgetting data, enhance model utility on remaining data, and reduce susceptibility to membership inference attacks.
... Khan and Dai [32] proposed a video transformer with a face UV texture map for deepfake detection to improve detection accuracy. Kim et al. [33] designed a continual learning framework with knowledge distillation and applied it to deepfake detection. In summary, we argue that most of the existing approaches fail to decouple the common and special features across a sequence of tasks. ...
... The learning of the commonality allows the model to focus more on the common features between the datasets, thus improving the generalization to unknown datasets [18]. [A comparison table elided here reports accuracy figures for Head pose [47], Xception+Reg [48], LRNet [49], Xception [46], MTD-Net [40], Schwarcz and Chellappa [50], Yu et al. [51], Video transformer [32], and CoReD [33].] ...
Article
Since different kinds of face forgeries leave similar forgery traces in videos, learning the common features from different kinds of forged faces would achieve promising generalization ability in forgery detection. Therefore, to accurately detect known forgeries while ensuring high generalization ability in detecting unknown forgeries, we propose an intra-inter network (IIN) for face forgery detection (FFD) in videos with continual learning. The proposed IIN mainly consists of three modules, i.e., an intra-module, an inter-module, and a forged trace masking module (FTMM). Specifically, the intra-module is trained for each kind of face forgery by supervised learning to extract special features, while the inter-module is trained by self-supervised learning to extract the common features. As a result, the common and special features of the different forgeries are decoupled by the two feature learning modules, and the decoupled common features can then be utilized to achieve high generalization ability for FFD. Moreover, the FTMM is deployed for contrastive learning to further improve detection accuracy. The experimental results on the FaceForensics++ dataset demonstrate that the proposed IIN outperforms the state-of-the-art methods in FFD. Evaluations on the DFDC and Celeb-DF datasets further demonstrate that the proposed IIN significantly improves the generalization ability for FFD.
... The key idea of SSL is to design pretext tasks that leverage data itself or its augmentation as label information. Typical pretext tasks include reconstruction and comparison, which allow models to learn useful representations for downstream tasks [35,36]. A typical SSL workflow is to leverage vast unlabeled data for pre-training, followed by supervised fine-tuning [37]. ...
Preprint
Full-text available
Structural health monitoring (SHM) has experienced significant advancements in recent decades, accumulating massive monitoring data. Data anomalies inevitably exist in monitoring data, posing significant challenges to their effective utilization. Recently, deep learning has emerged as an efficient and effective approach for anomaly detection in bridge SHM. Despite its progress, many deep learning models require large amounts of labeled data for training. The process of labeling data, however, is labor-intensive, time-consuming, and often impractical for large-scale SHM datasets. To address these challenges, this work explores the use of self-supervised learning (SSL), an emerging paradigm that combines unsupervised pre-training and supervised fine-tuning. The SSL-based framework aims to learn from only a very small quantity of labeled data by fine-tuning, while making the best use of the vast amount of unlabeled SHM data by pre-training. Mainstream SSL methods are compared and validated on the SHM data of two in-service bridges. Comparative analysis demonstrates that SSL techniques boost data anomaly detection performance, achieving increased F1 scores compared to conventional supervised training, especially given a very limited amount of labeled data. This work shows the effectiveness and superiority of SSL techniques on large-scale SHM data, providing an efficient tool for preliminary anomaly detection with scarce label information.
... The stop-gradient (stopgrad) operation effectively prevents collapsing solutions, which occur when the model outputs converge to a constant value in pursuit of representational similarity [39]. ...
Preprint
Legged robots possess inherent advantages in traversing complex 3D terrains. However, previous work on low-cost quadruped robots with egocentric vision systems has been limited by a narrow front-facing view and exteroceptive noise, restricting omnidirectional mobility in such environments. While building a voxel map through a hierarchical structure can refine exteroception processing, it introduces significant computational overhead, noise, and delays. In this paper, we present MOVE, a one-stage end-to-end learning framework capable of multi-skill omnidirectional legged locomotion with a limited field of view in 3D environments, much as a real animal can do. When movement aligns with the robot's line of sight, exteroceptive perception enhances locomotion, enabling extreme climbing and leaping. When vision is obstructed or the direction of movement lies outside the robot's field of view, the robot relies on proprioception for tasks like crawling and climbing stairs. We integrate all these skills into a single neural network by introducing a pseudo-siamese network structure combining supervised and contrastive learning, which helps the robot infer its surroundings beyond its field of view. Experiments in both simulations and real-world scenarios demonstrate the robustness of our method, broadening the operational environments for robotics with egocentric vision.
... Broadly, one learns meaningful representations from unlabeled data, reducing the demand for labeled samples when training (downstream) predictive models. In recent years, there has been a strong focus on self-supervised approaches to representation learning, which learn neural network-based embedding maps from carefully constructed augmentations of unlabeled data, such as image cropping, rotations, color distortion, Gaussian blur, etc. [3,7,9,15]. Contrastive representation learning is a popular form of self-supervised learning where one aims to learn a mapping of the data to a Euclidean space such that semantically similar data, obtained via augmentations, are embedded closer than independent samples [45,19,24]. ...
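The contrastive principle described here is commonly instantiated with an InfoNCE-style objective. The PyTorch sketch below is a simplified illustration in which each sample's two augmented views form a positive pair and the other samples in the batch act as negatives (the full SimCLR NT-Xent loss also uses within-view negatives); the names and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.5):
    """Simplified InfoNCE: match each sample's view in one branch to
    its own view in the other branch, against all other samples."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau  # pairwise cosine similarities as logits
    labels = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```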
Preprint
Full-text available
Contrastive representation learning is a modern paradigm for learning representations of unlabeled data via augmentations -- precisely, contrastive models learn to embed semantically similar pairs of samples (positive pairs) closer than independently drawn samples (negative samples). In spite of its empirical success and widespread use in foundation models, statistical theory for contrastive learning remains less explored. Recent works have developed generalization error bounds for contrastive losses, but the resulting risk certificates are either vacuous (certificates based on Rademacher complexity or f-divergence) or require strong assumptions about samples that are unreasonable in practice. The present paper develops non-vacuous PAC-Bayesian risk certificates for contrastive representation learning, addressing the practical considerations of the popular SimCLR framework. Notably, we take into account that SimCLR reuses positive pairs of augmented data as negative samples for other data, thereby inducing strong dependence and making classical PAC or PAC-Bayesian bounds inapplicable. We further refine existing bounds on the downstream classification loss by incorporating SimCLR-specific factors, including data augmentation and temperature scaling, and derive risk certificates for the contrastive zero-one risk. The resulting bounds for contrastive loss and downstream prediction are much tighter than those of previous risk certificates, as demonstrated by experiments on CIFAR-10.
... Recently, self-supervised learning methods have garnered significant attention by leveraging labels derived directly from the data. Among these, contrastive learning strategies have shown considerable promise in acquiring general visual representations from natural images [16][17][18][19][20][21][22][23]. However, it is important to note that many of these methods tend to be computationally intensive. ...
Article
Full-text available
Recent advancements in convolutional neural networks have improved computer vision applications, including satellite imagery analysis. However, the lack of large labeled datasets and the complexity of remote sensing tasks render supervised learning methods less effective. While ImageNet pre-trained models have been used to address this, the domain difference between natural and satellite images poses significant limitations. These facts motivate us to explore both supervised and self-supervised learning to capture in-domain visual representations from satellite images, address the domain difference with ImageNet, and reduce the need for labeled datasets and computational resources. Furthermore, our research endeavors to identify the effective characteristics that make a dataset a suitable candidate for representation learning in the satellite imagery domain. The importance of choosing the right pre-training dataset cannot be overstated; it directly influences model performance and generalization capabilities. Given the plethora of available datasets in this field, selecting an appropriate one is fraught with difficulty, as each dataset varies in terms of quality, resolution, and relevance to specific tasks. The obtained weights from proper datasets serve as the initial weights for segmentation and object detection models. In terms of self-supervised pre-training, the SimSiam algorithm, employing the ResNet50 backbone, was utilized. Our results underscore that selecting a dataset with high spatial resolution is crucial, as it significantly enhances feature learning and improves model performance in remote sensing applications. This study explores the impact of hierarchical pre-training on dense prediction tasks, initially utilizing public datasets such as ImageNet, followed by in-domain datasets. Our systematic approach yielded significant improvements in performance metrics, enhancing the mean Intersection over Union (mIOU) score and pixel accuracy by 4.06% and 9.62%, respectively, when compared to existing literature. Furthermore, relative to our baseline model, the proposed methodology achieved enhancements of 2.1% in mIOU and 0.88% in pixel accuracy for semantic segmentation tasks on the DeepGlobe Land Cover Classification dataset, surpassing the efficacy of conventional ImageNet pre-trained weights. Additionally, we modified the DeepLabv3 architecture by re-implementing it to facilitate the transfer of previously trained weights to its backbone. To enhance convergence speed, we set the overall output stride to 32. In object detection tasks, including those involving Oil Tank Storage and Airplanes, the mean Average Precision (mAP) metric exhibited enhancements of 2.65% and 1.47%, respectively. This finding indicates that selecting an appropriate dataset for pre-training visual representations can significantly enhance the efficacy of remote sensing image analysis. Consequently, our proposed method emerges as a promising and practical approach for advancing this field.
... There are various distance metrics being proposed trying to keep positive pairs close and negative pairs far away in the embedding space. Some examples include SimCLR [10], MoCo [27], BYOL [4], SimSiam [11], DINO [7]. Self-prediction methods mask out portions of the original image and try to reconstruct the original image. ...
Preprint
Full-text available
We propose a method to improve the generalization ability of skin lesion classification models by combining self-supervised learning (SSL), unsupervised domain adaptation (UDA), and active domain adaptation (ADA). The main steps of the approach include selection of an SSL model pretrained on natural image datasets, subsequent SSL retraining on all available skin lesion datasets, finetuning of the model on source domain data with labels, application of UDA methods on target domain data, and lastly, implementation of ADA methods. The efficacy of the proposed approach is assessed across ten skin lesion datasets spanning different domains, demonstrating its potential for enhancing the performance of skin lesion classification models. This approach holds promise for facilitating the widespread adoption of medical imaging models in clinical settings, thereby amplifying their impact.
... However, in certain datasets and label proportion settings, the OLDFS method does not show a significant gap compared to our method. This may be due to OLDFS's ability to learn task-irrelevant features, which could contribute to enhancing performance in downstream visual recognition tasks [9,30]. ...
Preprint
Self-supervised learning is emerging in fine-grained visual recognition with promising results. However, existing self-supervised learning methods are often susceptible to irrelevant patterns in self-supervised tasks and lack the capability to represent the subtle differences inherent in fine-grained visual recognition (FGVR), resulting in generally poorer performance. To address this, we propose a novel Priority-Perception Self-Supervised Learning framework, denoted as PP-SSL, which can effectively filter out irrelevant feature interference and extract more subtle discriminative features throughout the training process. Specifically, it is composed of two main parts: the Anti-Interference Strategy (AIS) and the Image-Aided Distinction Module (IADM). In AIS, a fine-grained textual description corpus is established, and a knowledge distillation strategy is devised to guide the model in eliminating irrelevant features while enhancing the learning of more discriminative and high-quality features. IADM shows that extracting GradCAM from the original image effectively reveals subtle differences between fine-grained categories. Compared to features extracted from intermediate or output layers, the original image retains more detail, allowing for a deeper exploration of the subtle distinctions among fine-grained classes. Extensive experimental results indicate that PP-SSL significantly outperforms existing methods across various datasets, highlighting its effectiveness in fine-grained recognition tasks. Our code will be made publicly available upon publication.
... In the summarization task, marginal performance differences exist among the SSL models. These discrepancies can be attributed partially to the inherent difficulty of 2-class sentence classification and the collapse problem (Yan et al., 2021;Chen and He, 2021;Gao et al., 2021) associated with BERT sentence representation. These difficulties are worsened by the constraint of having only 100 labelled summaries for training extractive models. ...
Preprint
Semi-supervised learning (SSL) with lightweight models aims to reduce the need for annotated samples and facilitate cost-effective inference. However, the constraint on model parameters, imposed by the scarcity of training labels, limits SSL performance. In this paper, we introduce PS-NET, a novel framework tailored for semi-supervised text mining with lightweight models. PS-NET incorporates online distillation to train lightweight student models by imitating the teacher model. It also integrates an ensemble of student peers that collaboratively instruct each other. Additionally, PS-NET implements a constant adversarial perturbation schema for further self-augmentation through progressive generalization. Our PS-NET, equipped with a 2-layer distilled BERT, exhibits notable performance enhancements over the SOTA lightweight SSL frameworks FLiText and DisCo in SSL text classification with extremely rare labelled data.
... Wu et al. [26] follow this principle and enhance the existing FeatureNet and MsvNet frameworks using a lightweight network technique inspired by the self-supervised learning method SimSiam [5]. This approach seeks to increase the efficiency of these networks, resulting in what Wu et al. term FeatureNetLite and MsvNet Lite. ...
Chapter
Full-text available
Driven by increasing customer demands, manufacturing processes now encompass increasingly intricate workflows. The industry uses computer-aided process planning to manage these complex manufacturing processes effectively. A crucial task here is to analyze product data and determine the required machining features, represented as 3D mesh geometries. However, a notable challenge arises, particularly with custom products, where the interpretation of the 3D mesh geometry varies significantly depending on the available machinery and expert preferences. This study introduces a configurable automated feature recognition framework based on expert knowledge. Experts can use a configurable synthetic data generator to encode their requirements within this framework via the training data. A machine-learning graph classification approach is used to recognize the 3D geometries of machining features in the generated data, according to the user requirements. The system accomplishes this without requiring data conversion into alternative formats, such as voxel or pixel representations, as other approaches must.
... Comparison to an Accel SSL Benchmarking Study: Haresamudram et al. (2022) seek to assess the current state of the accelerometry self-supervised learning field in the context of activity classification. To this end, they benchmark a diverse range of self-supervised methods, including our aforementioned accel-specific methods, Accel SimCLR (Tang et al., 2020) and Augmentation Prediction (Saeed et al., 2019), as well as generalized SSL methods, such as SimSiam (Chen & He, 2021) and BYOL (Grill et al., 2020). Each of their SSL approaches has a unique backbone architecture that corresponds to its original work. ...
Preprint
We present RelCon, a novel self-supervised Relative Contrastive learning approach that uses a learnable distance measure in combination with a softened contrastive loss for training a motion foundation model from wearable sensors. The learnable distance measure captures motif similarity and domain-specific semantic information such as rotation invariance. The learned distance provides a measurement of semantic similarity between a pair of accelerometer time-series segments, which is used to measure the distance between an anchor and various other sampled candidate segments. The self-supervised model is trained on 1 billion segments from 87,376 participants from a large wearables dataset. The model achieves strong performance across multiple downstream tasks, encompassing both classification and regression. To our knowledge, we are the first to show the generalizability of a self-supervised learning model with motion data from wearables across distinct evaluation tasks.
... where ||·||_2 is the ℓ2-norm, h_1 and h_2 are MLP projection heads used only during training, and D(·, ·) calculates the cosine distance. Moreover, D(·, ·) is computed as in [7], where we implement the stop-gradient operation to avoid degenerate solutions. We refer to the pair L_TR and L_Reg as the transitive relation loss (TRL), since it is meant to bridge the similarities between the template and the search region. ...
Preprint
Efficient visual trackers overfit to their training distributions and lack generalization abilities, resulting in them performing well on their respective in-distribution (ID) test sets and not as well on out-of-distribution (OOD) sequences, imposing limitations on their deployment in-the-wild under constrained resources. We introduce SiamABC, a highly efficient Siamese tracker that significantly improves tracking performance, even on OOD sequences. SiamABC takes advantage of new architectural designs in the way it bridges the dynamic variability of the target, and of new losses for training. It also directly addresses OOD tracking generalization by including a fast backward-free dynamic test-time adaptation method that continuously adapts the model according to the dynamic visual changes of the target. Our extensive experiments suggest that SiamABC shows remarkable performance gains on OOD sets while maintaining accurate performance on the ID benchmarks. SiamABC outperforms MixFormerV2-S by 7.6% on the OOD AVisT benchmark while being 3x faster (100 FPS) on a CPU.
... These constructed predictive and generative learning systems are solely focused on ad-hoc pretext problems, and as a result, they lack generality. There are several other methods, such as clustering methods (DeepCluster [20], SeLa [21], SwAV [22]), distillation methods (BYOL [23], SimSiam [24], DINO [25]), and redundancy-reduction methods (Barlow Twins [26], VICReg [27]), that behave as non-contrastive representation learning methods. ...
Article
Full-text available
Histopathological images, characterized by their high resolution and intricate cellular structures, present unique challenges for automated analysis. Traditional supervised learning-based methods often rely on extensive labeled datasets, which are labour-intensive and expensive. In learning representations, self-supervised learning techniques have shown promising outcomes directly from raw image data without manual annotations. In this paper, we propose a novel margin-aware optimized contrastive learning approach to enhance representation learning from histopathological images using a self-supervised approach. The proposed approach optimizes contrastive learning with a margin-based strategy to effectively learn discriminative representations while enforcing a semantic similarity threshold. In the proposed loss function, a margin is used to enforce a certain level of similarity between positive pairs in the embedding space, and a scaling factor is introduced to adjust the sensitivity of the loss, thereby enhancing the discriminative capacity of the learned representations. Our approach demonstrates robust generalization in in- and out-domain settings through comprehensive experimental evaluations conducted on five distinct benchmark histopathological datasets belonging to three cancer types. The results obtained on different experimental settings show that the proposed approach outmatched the state-of-the-art approaches in cross-domain and cross-disease settings.
... The target network is updated using a moving average of the online network's parameters. SimSiam achieves notable performance through self-supervised learning alone, using a stop-gradient operation without requiring negative sample pairs, large batch sizes, or momentum encoders [52]. An alternative method involves clustering [53]. ...
Article
Full-text available
In orthodontics, the manual tracing of cephalometric radiographs is a common practice, where the Sella Turcica (ST) serves as a reference point. The radiologist often manually traces the outline of the sella using manual tools (e.g., calipers on radiographs). Perhaps the inherent complexity and variability in the shapes of sella and the lack of advanced assessment tools make the classification of sella challenging, as it requires extensive training, skills, time, and manpower to detect subtle changes that often may not be apparent. Moreover, existing semi-supervised learning (SSL) methods face key limitations such as shift invariance, inadequate feature representation, overfitting on small datasets, and a lack of generalization to unseen variations in ST morphology. Medical imaging data are often unlabeled, limiting the training of automated classification systems for ST morphology. To address these limitations, a novel semi-supervised deep subspace embedding (SSLDSE) framework is proposed. This approach integrates real-time stochastic augmentation to significantly expand the training dataset and introduce natural variability in the ST morphology, overcoming the constraints of small and non-representative datasets. Non-linear features are extracted and mapped to a non-linear subspace using Kullback–Leibler divergence, which ensures that the model remains consistent despite image transformations, thus resolving issues related to shift invariance. Additionally, fine-tuning the Inception-ResNet-v2 network on these enriched features reduces retraining costs when new unlabeled data becomes available. t-distributed stochastic neighbor embedding (t-SNE) is employed for effective feature representation through manifold learning, capturing complex patterns that previous methods might miss. Finally, a zero-shot classifier is utilized to accurately categorize the ST, addressing the challenge of classifying new or unseen variations. Further, the proposed SSLDSE framework is evaluated through comparative analysis with the existing methods (Active SSL, GAN SSL, Contrastive SSL, Modified Inception-ResNet-v2) for ST classification using various evaluation metrics. The SSLDSE and the existing methods are trained on our dataset (sourced from PGI Chandigarh, India), and a blind test is conducted on the benchmark dataset (IEEE ISBI 2015). The proposed method improves classification accuracy by 15% compared to state-of-the-art models and reduces retraining costs.
... This approach uses positive and negative pairs to refine model accuracy, enhancing similarities within pairs of similar samples and expanding differences among dissimilar ones [32]. Pioneering models such as MoCo [33], SimCLR [34], BYOL [10], SwAV [35], and SimSiam [36] have demonstrated significant advancements over traditional supervised learning methods and have found applications in diverse fields like agriculture, biology, and aviation [37]. ...
Article
Full-text available
Image recognition models often struggle with class imbalance, which can impede their performance. To overcome this issue, researchers have extensively used resampling methods, traditionally focused on tabular datasets. In contrast to the original method, which generates data at the data level, this paper introduces a novel strategy that combines contrastive learning with the Synthetic Minority Oversampling Technique based on Rough Set Theory (SMOTE-RSB) specifically tailored for imbalanced image datasets. Our method leverages contrastive learning to refine representation learning and balance features, thus effectively mitigating the challenges of imbalanced image classification. We begin by extracting features using a pre-trained contrastive learning encoder. Subsequently, SMOTE-RSB is applied to these features to augment underrepresented classes and reduce irrelevant features. We evaluated our approach on several modified benchmark datasets, including CIFAR-10, SVHN, and ImageNet-LT, achieving notable improvements: an F1 score of 72.43% and a Gmean of 82.53% on the CIFAR-10 long-tailed dataset, F1 scores up to 79.57% and Gmean of 88.20% on various SVHN datasets, and a Top-1 accuracy of 68.67% on ImageNet-LT. Both qualitative and quantitative results confirm the effectiveness of our method in managing imbalances in image datasets. Additional ablation studies exploring various contrastive learning models and oversampling techniques highlight the flexibility and efficiency of our approach across different settings, underscoring its significant potential for enhancing imbalanced image classification.
... Furthermore, the Siamese network is trained using a contrastive loss function, which encourages the distance between similar examples to be small and dissimilar examples to be large [106]: ...
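The equation itself is elided in this excerpt, but the loss it describes is the classic pairwise contrastive loss for Siamese networks. The PyTorch sketch below follows the standard margin-based formulation; the names and margin value are ours, not the cited paper's.

```python
import torch.nn.functional as F

def contrastive_pair_loss(e1, e2, y, margin=1.0):
    """Classic Siamese contrastive loss: y = 1 for similar pairs,
    y = 0 for dissimilar ones. Similar pairs are pulled together;
    dissimilar pairs are pushed apart until the margin is reached."""
    d = F.pairwise_distance(e1, e2)
    return (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()
```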
Article
Full-text available
Deep learning (DL) has become a core component of modern artificial intelligence (AI), driving significant advancements across diverse fields by facilitating the analysis of complex systems, from protein folding in biology to molecular discovery in chemistry and particle interactions in physics. However, the field of deep learning is constantly evolving, with recent innovations in both architectures and applications. Therefore, this paper provides a comprehensive review of recent DL advances, covering the evolution and applications of foundational models like convolutional neural networks (CNNs) and Recurrent Neural Networks (RNNs), as well as recent architectures such as transformers, generative adversarial networks (GANs), capsule networks, and graph neural networks (GNNs). Additionally, the paper discusses novel training techniques, including self-supervised learning, federated learning, and deep reinforcement learning, which further enhance the capabilities of deep learning models. By synthesizing recent developments and identifying current challenges, this paper provides insights into the state of the art and future directions of DL research, offering valuable guidance for both researchers and industry experts.
... Contrastive Learning. Among the numerous self-supervised learning approaches in computer vision that focus on learning from unlabeled data, contrastive learning has emerged in recent years [2,12,23,26,36,45]. Its fundamental principle involves learning representations through instance discrimination, which involves attracting similar samples while optionally repelling dissimilar ones. ...
... Skin biopsies from 1787 individuals (stratified by age and sex, see Table 1) were retrieved from the archives at the Department of Pathology, Zealand University Hospital, and digital images were produced for analysis using computer vision (CV) [89]. There were 919 males and 868 females with skin biopsies. ...
Preprint
As global life expectancy increases, so does the burden of chronic diseases, yet individuals exhibit considerable variability in the rate at which they age. Identifying biomarkers that distinguish fast from slow ageing is crucial for understanding the biology of ageing, enabling early disease detection, and improving prevention strategies. Using contrastive deep learning, we show that skin biopsy images alone are sufficient to determine an individual's age. We then use visual features in histopathology slides of the skin biopsies to construct a novel biomarker of ageing. By linking with comprehensive health registers in Denmark, we demonstrate that visual features in histopathology slides of skin biopsies predict mortality and the prevalence of chronic age-related diseases. Our work highlights how routinely collected health data can provide additional value when used together with deep learning, by creating a new biomarker for ageing which can be actively used to determine mortality over time.
Preprint
Full-text available
Masked Image Modeling (MIM) has emerged as a popular method for Self-Supervised Learning (SSL) of visual representations. However, for high-level perception tasks, MIM-pretrained models offer lower out-of-the-box representation quality than Joint-Embedding Architectures (JEA) - another prominent SSL paradigm. To understand this performance gap, we analyze the information flow in Vision Transformers (ViT) learned by both approaches. We reveal that whereas JEAs construct their representation on a selected set of relevant image fragments, MIM models aggregate nearly the whole image content. Moreover, we demonstrate that MIM-trained ViTs retain valuable information within their patch tokens, which is not effectively captured by the global [cls] token representations. Therefore, selective aggregation of relevant patch tokens, without any fine-tuning, results in consistently higher-quality MIM representations. To our knowledge, we are the first to highlight the lack of effective representation aggregation as an emergent issue of MIM and propose directions to address it, contributing to future advances in Self-Supervised Learning.
Preprint
Full-text available
Data augmentation is widely used to enhance generalization in visual classification tasks. However, traditional methods struggle when source and target domains differ, as in domain adaptation, due to their inability to address domain gaps. This paper introduces GenMix, a generalizable prompt-guided generative data augmentation approach that enhances both in-domain and cross-domain image classification. Our technique leverages image editing to generate augmented images based on custom conditional prompts, designed specifically for each problem type. By blending portions of the input image with its edited generative counterpart and incorporating fractal patterns, our approach mitigates unrealistic images and label ambiguity, improving the performance and adversarial robustness of the resulting models. Efficacy of our method is established with extensive experiments on eight public datasets for general and fine-grained classification, in both in-domain and cross-domain settings. Additionally, we demonstrate performance improvements for self-supervised learning, learning with data scarcity, and adversarial robustness. As compared to the existing state-of-the-art methods, our technique achieves stronger performance across the board.
Preprint
Full-text available
In the era of foundation models, CLIP has emerged as a powerful tool for aligning text and visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at extracting rich visual features due to their specialized training paradigm. Yet, these SSL models require an additional supervised linear probing step, which relies on fully labeled data which is often expensive and difficult to obtain at scale. In this paper, we propose a label-free prompt-tuning method that leverages the rich visual features of self-supervised learning models (DINO) and the broad textual knowledge of large language models (LLMs) to largely enhance CLIP-based image classification performance using unlabeled images. Our approach unfolds in three key steps: (1) We generate robust textual feature embeddings that more accurately represent object classes by leveraging class-specific descriptions from LLMs, enabling more effective zero-shot classification compared to CLIP's default name-specific prompts. (2) These textual embeddings are then used to produce pseudo-labels to train an alignment module that integrates the complementary strengths of LLM description-based textual embeddings and DINO's visual features. (3) Finally, we prompt-tune CLIP's vision encoder through DINO-assisted supervision using the trained alignment module. This three-step process allows us to harness the best of visual and textual foundation models, resulting in a powerful and efficient approach that surpasses state-of-the-art label-free classification methods. Notably, our framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6% over the state-of-the-art LaFter across 11 diverse image classification datasets.
Article
Pretrained language models have shown strong effectiveness in code-related tasks, such as code retrieval, code generation, code summarization, and code completion tasks. In this paper, we propose COde assistaNt viA retrieval-augmeNted language model (CONAN), which aims to build a code assistant by mimicking the knowledge-seeking behaviors of humans during coding. Specifically, it consists of a code structure aware retriever (CONAN-R) and a dual-view code representation-based retrieval-augmented generation model (CONAN-G). CONAN-R pretrains CodeT5 using Code-Documentation Alignment and Masked Entity Prediction tasks to make language models code structure-aware and learn effective representations for code snippets and documentation. Then CONAN-G designs a dual-view code representation mechanism for implementing a retrieval-augmented code generation model. CONAN-G regards the code documentation descriptions as prompts, which help language models better understand the code semantics. Our experiments show that CONAN achieves convincing performance on different code generation tasks and significantly outperforms previous retrieval augmented code generation models. Our further analyses show that CONAN learns tailored representations for both code snippets and documentation by aligning code-documentation data pairs and capturing structural semantics by masking and predicting entities in the code data. Additionally, the retrieved code snippets and documentation provide necessary information from both program language and natural language to assist the code generation process. CONAN can also be used as an assistant for Large Language Models (LLMs), providing LLMs with external knowledge in shorter code document lengths to improve their effectiveness on various code tasks. It shows the ability of CONAN to extract necessary information and help filter out the noise from retrieved code documents.
Conference Paper
Full-text available
Combining clustering and representation learning is one of the most promising approaches for unsupervised learning of deep neural networks. However, doing so naively leads to ill-posed learning problems with degenerate solutions. In this paper, we propose a novel and principled learning formulation that addresses these issues. The method is obtained by maximizing the information between labels and input data indices. We show that this criterion extends standard cross-entropy minimization to an optimal transport problem, which we solve efficiently for millions of input images and thousands of labels using a fast variant of the Sinkhorn-Knopp algorithm. The resulting method is able to self-label visual data so as to train highly competitive image representations without manual labels. Our method achieves state-of-the-art representation learning performance for AlexNet and ResNet-50 on SVHN, CIFAR-10, CIFAR-100 and ImageNet and yields the first self-supervised AlexNet that outperforms the supervised Pascal VOC detection baseline. Code and models are available.
Conference Paper
Full-text available
The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object’s appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.
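The core operation of such a fully-convolutional Siamese tracker is a dense cross-correlation between the exemplar's embedding and the search region's embedding. The sketch below shows only that step; `phi`, the crop sizes, and the function name are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def siamese_score_map(phi, exemplar, search):
    """Fully-convolutional Siamese matching.

    phi:      a convolutional embedding network (stand-in here).
    exemplar: (1, 3, 127, 127) crop of the target from the first frame.
    search:   (1, 3, 255, 255) region of the current frame.
    Returns a 2-D score map; its peak gives the target's displacement.
    """
    z = phi(exemplar)        # (1, C, Hz, Wz)
    x = phi(search)          # (1, C, Hx, Wx), spatially larger than z
    # Use the exemplar embedding as a correlation kernel over the search map.
    score = F.conv2d(x, z)   # (1, 1, Hx-Hz+1, Wx-Wz+1)
    return score[0, 0]
```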
Conference Paper
Full-text available
Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on CIFAR-10 and CIFAR-100 datasets where we demonstrate new state-of-the-art results below 4% and 19%, respectively. Our source code is available at https://github.com/loshchil/SGDR.
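As a rough sketch of the schedule described above (cosine annealing with warm restarts), the helper below computes the learning rate for a given epoch; the hyperparameter values are placeholders, not the paper's tuned settings. PyTorch ships a comparable built-in, `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`.

```python
import math

def sgdr_lr(epoch, eta_min=0.0, eta_max=0.05, T_0=10, T_mult=2):
    """Learning rate under warm restarts (SGDR-style).

    Within a run of length T_i the rate follows a half cosine from eta_max
    down to eta_min, then jumps back up at the restart:
        eta = eta_min + 0.5*(eta_max - eta_min)*(1 + cos(pi * T_cur / T_i))
    T_0 is the first run's length; each subsequent run is T_mult times longer.
    """
    t, T_i = epoch, T_0
    while t >= T_i:          # locate the current restart period
        t -= T_i
        T_i *= T_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_i))
```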
Article
Full-text available
The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object's appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 video object detection dataset. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in the VOT2015 benchmark.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Article
Full-text available
The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset have become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three-year history of the challenge, and proposes directions for future improvement and extension.
Conference Paper
Full-text available
Dimensionality reduction involves mapping a set of high dimensional input points onto a low dimensional manifold so that "similar" points in input space are mapped to nearby points on the manifold. We present a method - called Dimensionality Reduction by Learning an Invariant Mapping (DrLIM) - for learning a globally coherent nonlinear function that maps the data evenly to the output manifold. The learning relies solely on neighborhood relationships and does not require any distance measure in the input space. The method can learn mappings that are invariant to certain transformations of the inputs, as is demonstrated with a number of experiments. Comparisons are made to other techniques, in particular LLE.
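The loss DrLIM optimizes over neighbour/non-neighbour pairs can be sketched as below; note that the paper encodes pair labels with the opposite convention (Y = 0 for similar pairs), and the margin value here is an arbitrary placeholder.

```python
import torch
import torch.nn.functional as F

def drlim_loss(out1, out2, is_similar, margin=1.0):
    """Contrastive loss in the style of DrLIM, hinged on Euclidean distance.

    out1, out2:  (B, D) embeddings of the two inputs in each pair.
    is_similar:  (B,) float tensor, 1 for neighbour pairs, 0 otherwise.
    Similar pairs are pulled together; dissimilar pairs are pushed apart
    until they are at least `margin` away, after which they exert no force.
    """
    d = F.pairwise_distance(out1, out2)                 # (B,)
    pull = is_similar * d.pow(2)                        # attract neighbours
    push = (1 - is_similar) * F.relu(margin - d).pow(2) # repel non-neighbours
    return 0.5 * (pull + push).mean()
```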
Chapter
Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a “dog” can be seen, heard, and felt). We investigate the classic hypothesis that a powerful representation is one that models view-invariant factors. We study this hypothesis under the framework of multiview contrastive learning, where we learn a representation that aims to maximize mutual information between different views of the same scene but is otherwise compact. Our approach scales to any number of views, and is view-agnostic. We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics. Code is available at: http://github.com/HobbitLong/CMC/.
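A minimal sketch of the two-view case of such a contrastive objective, as an InfoNCE-style loss; it assumes matching rows of the two batches are views of the same scene, and the temperature value is a placeholder.

```python
import torch
import torch.nn.functional as F

def multiview_nce(z1, z2, tau=0.07):
    """Contrastive loss between two views of the same batch of scenes.

    z1, z2: (B, D) embeddings of view 1 and view 2; row i of each batch
    comes from the same scene. Each z1[i] must pick out z2[i] among all
    B candidates, which lower-bounds the mutual information between views.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                        # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)           # positives on diagonal
```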
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. This system enables us to train visual recognition models on internet-scale data with high efficiency.
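The linear scaling rule plus warmup that the abstract describes can be sketched as a simple schedule; the reference batch of 256 and the 5-epoch warmup follow the paper's setup, while the function name and base rate are illustrative.

```python
def scaled_lr(base_lr, batch_size, base_batch=256, epoch=0, warmup_epochs=5):
    """Linear scaling rule with gradual warmup.

    The target rate grows linearly with minibatch size
    (lr = base_lr * batch_size / base_batch) and is ramped up linearly
    from base_lr over the first few epochs to avoid early instability.
    """
    target = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:   # linear ramp from base_lr up to target
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target
```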
Article
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code will be made available.
Article
We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time. We demonstrate the usefulness of our method on applications in supervised image recognition, generative modelling, and deep reinforcement learning.
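The reparameterization itself is one line per output unit, w = g * v / ||v||. A minimal sketch for a linear layer follows; the function is illustrative, and PyTorch exposes the same idea as `torch.nn.utils.weight_norm`.

```python
import torch

def weight_norm_forward(v, g, x):
    """Weight-normalized linear layer: w = g * v / ||v|| per output row.

    v: (out_features, in_features) direction parameter.
    g: (out_features,) length parameter.
    x: (batch, in_features) input.
    Decoupling length from direction reconditions the optimization
    without any dependence on minibatch statistics.
    """
    w = g[:, None] * v / v.norm(dim=1, keepdim=True)
    return x @ w.t()
```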
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
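The residual reformulation amounts to letting a stack of layers learn F(x) and emitting F(x) + x. A minimal PyTorch sketch of such a block follows; layer sizes are placeholders, not the paper's exact architecture.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: the inner layers learn F(x) and the block
    outputs F(x) + x, so the identity mapping is trivially representable."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)   # the residual (shortcut) connection
```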
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
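For a (batch, features) activation, the normalization step described above is a few lines. This sketch covers training-time statistics only and omits the running averages used at inference, so it is illustrative rather than a faithful layer implementation.

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalization for a (B, D) activation, training mode only.

    Each feature is standardized with the minibatch mean and variance,
    then rescaled by learned gamma and shifted by learned beta, so the
    layer keeps its expressive power while the distribution of its
    inputs is stabilized across training steps.
    """
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta
```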
Conference Paper
In modern face recognition, the conventional pipeline consists of four stages: detect => align => represent => classify. We revisit both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face representation from a nine-layer deep neural network. This deep network involves more than 120 million parameters using several locally connected layers without weight sharing, rather than the standard convolutional layers. Thus we trained it on the largest facial dataset to date, an identity labeled dataset of four million facial images belonging to more than 4,000 identities, where each identity has an average of over a thousand samples. The learned representations coupling the accurate model-based alignment with the large facial database generalize remarkably well to faces in unconstrained environments, even with a simple classifier. Our method reaches an accuracy of 97.25% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 25%, closely approaching human-level performance.
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Article
Optimal transportation distances are a fundamental family of parameterized distances for histograms. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost is prohibitive whenever the histograms' dimension exceeds a few hundred. We propose in this work a new family of optimal transportation distances that look at transportation problems from a maximum-entropy perspective. We smooth the classical optimal transportation problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn-Knopp's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transportation solvers. We also report improved performance over classical optimal transportation distances on the MNIST benchmark problem.
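A compact NumPy sketch of the Sinkhorn iteration the abstract describes, computing the entropically smoothed transport cost between two histograms; `lam` and the iteration count are illustrative choices, not values from the paper.

```python
import numpy as np

def sinkhorn_distance(r, c, M, lam=10.0, n_iters=200):
    """Entropically smoothed optimal transport (Sinkhorn-style distance).

    r: (n,) source histogram, c: (m,) target histogram, M: (n, m) cost
    matrix. Alternating row/column scaling of K = exp(-lam * M) converges
    to the regularized transport plan P, and <P, M> is the smoothed cost,
    far cheaper than solving the exact linear program.
    """
    K = np.exp(-lam * M)
    u = np.ones_like(r)
    for _ in range(n_iters):
        u = r / (K @ (c / (K.T @ u)))   # Sinkhorn fixed-point update
    v = c / (K.T @ u)
    P = u[:, None] * K * v[None, :]     # regularized transport plan
    return float((P * M).sum())
```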
Article
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
Article
This paper describes an algorithm for verification of signatures
Deep clustering for unsupervised learning of visual features
  • Mathilde Caron
  • Piotr Bojanowski
  • Armand Joulin
  • Matthijs Douze
A simple framework for contrastive learning of visual representations
  • Ting Chen
  • Simon Kornblith
  • Mohammad Norouzi
  • Geoffrey Hinton
Learning deep representations by mutual information estimation and maximization
  • R Devon Hjelm
  • Alex Fedorov
  • Samuel Lavoie-Marchildon
  • Karan Grewal
  • Adam Trischler
  • Yoshua Bengio
Unsupervised learning of visual features by contrasting cluster assignments
  • Mathilde Caron
  • Ishan Misra
  • Julien Mairal
  • Priya Goyal
  • Piotr Bojanowski
  • Armand Joulin
Backpropagation applied to handwritten zip code recognition
  • Yann LeCun
  • Bernhard Boser
  • John S Denker
  • Donnie Henderson
  • Richard E Howard
  • Wayne Hubbard
  • Lawrence D Jackel
Faster R-CNN: Towards real-time object detection with region proposal networks
  • Shaoqing Ren
  • Kaiming He
  • Ross Girshick
  • Jian Sun
Learning representations by maximizing mutual information across views
  • Philip Bachman
  • Devon Hjelm
  • William Buchwalter
Siamese neural networks for one-shot image recognition
  • Gregory Koch
  • Richard Zemel
  • Ruslan Salakhutdinov
Bootstrap your own latent: A new approach to self-supervised learning
  • Jean-Bastien Grill
Representation learning with contrastive predictive coding
  • Aäron van den Oord
Improved baselines with momentum contrastive learning
  • Xinlei Chen
Large batch training of convolutional networks
  • Yang You
Data-efficient image recognition with contrastive predictive coding
  • Olivier J Hénaff