
Contrastive self-supervised learning: review, progress, challenges and future research directions


International Journal of Multimedia Information Retrieval (2022) 11:461–488
https://doi.org/10.1007/s13735-022-00245-6
TRENDS AND SURVEYS
Contrastive self-supervised learning: review, progress, challenges and future research directions
Pranjal Kumar¹ · Piyush Rawat² · Siddhartha Chauhan¹
Received: 30 May 2022 / Revised: 7 July 2022 / Accepted: 14 July 2022 / Published online: 5 August 2022
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022
Pranjal Kumar (corresponding author): pranjal@nith.ac.in · Piyush Rawat: psh.rawat@gmail.com · Siddhartha Chauhan: sid@nith.ac.in
¹ NIT Hamirpur, Hamirpur, Himachal Pradesh 177005, India
² Department of Systemics, School of Computer Science, University of Petroleum and Energy Studies, Dehradun, Uttarakhand 248007, India
Abstract
In the last decade, deep supervised learning has had tremendous success. However, its flaws, such as its dependence on manual and costly annotation of large datasets and its exposure to adversarial attacks, have prompted researchers to look for alternative models. Incorporating contrastive learning (CL) into self-supervised learning (SSL) has turned out to be an effective alternative. In this paper, a comprehensive review of CL methodology is provided in terms of its approaches, encoding techniques and loss functions. It discusses the applications of CL in various domains such as Natural Language Processing (NLP), Computer Vision, and speech and text recognition and prediction. The paper presents an overview of and background on SSL to introduce the basic ideas and concepts. A comparative study of the works that use CL methods for various downstream tasks in each domain is performed. Finally, it discusses the limitations of current methods, as well as the additional techniques and future directions needed to make meaningful progress in this area.

Keywords Contrastive learning · Self-supervised learning · Unsupervised learning · Data augmentation · Survey
1 Introduction
Deep learning has advanced to the point where it is now an essential part of nearly all intelligent systems. Using the abundance of data available today, deep neural networks (DNNs) have become a compelling approach for a wide range of computer vision (CV) tasks, including object detection, image classification [1–3], image segmentation [4, 5], activity recognition, etc., and natural language processing (NLP) tasks such as sentiment analysis [6], pre-trained language models [7–10], question answering [11–14], etc. It is possible, however, that the labour-intensive process of manually annotating millions of data samples has exhausted the supervised approach to learning features. Most modern computer vision systems (that are supervised) attempt to learn some form of image representation in order to discover a pattern between data points and their respective annotations in large datasets. It has been suggested that providing visual explanations for decisions made by models can help to make them more transparent and understandable [15].
On the other hand, supervised learning has hit a snag. It is prone to generalization errors, spurious correlations, and adversarial attacks because of its reliance on time-consuming and expensive manual labelling. We expect neural networks to learn more quickly with fewer labels, samples, and trials. Because it is data efficient and generalizable, this alternative paradigm has been adopted by many current models and has received significant attention in the research community. In traditional supervised learning methods, the amount of available annotated training data is extremely important. A dearth of annotations has forced researchers to develop new methods for making use of the vast amount of data already available. With the help of self-supervised methods, deep learning progresses without expensive annotations and learns feature representations where the data serve as supervision. Autoencoders and their extensions, Deep InfoMax, and Contrastive Coding, among other self-supervised learning models, will be thoroughly examined ...
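Before those models are examined in detail, a minimal sketch may help fix ideas. The snippet below (not taken from the paper) implements an NT-Xent-style contrastive objective over two augmented views of a batch, the basic loss underlying contrastive coding approaches; the embedding dimension, batch size and temperature are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent: z1[i] and z2[i] are embeddings of two augmented views of the
    same sample; every other embedding in the batch acts as a negative."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2n, d), unit length
    sim = z @ z.t() / temperature                         # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                     # never contrast a view with itself
    # The positive for row i is its counterpart in the other view block.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Toy usage with random "embeddings" standing in for encoder outputs.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent_loss(z1, z2).item())
```

In practice the two views would come from a shared encoder applied to two random augmentations of the same inputs.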
... In the initial pre-training step (Fig. 1A, top), we train a DCNN using a self-supervised contrastive learning algorithm [44,45]. During this step, the synaptic weight parameters of the network are optimized to minimize an objective function. ...
Preprint
Full-text available
Recent advances in self-supervised learning have attracted significant attention from both machine learning and neuroscience. This is primarily because self-supervised methods do not require annotated supervisory information, making them applicable to training artificial networks without relying on large amounts of curated data, and potentially offering insights into how the brain adapts to its environment in an unsupervised manner. Although several previous studies have elucidated the correspondence between neural representations in deep convolutional neural networks (DCNNs) and biological systems, the extent to which unsupervised or self-supervised learning can explain the human-like acquisition of categorically structured information remains less explored. In this study, we investigate the correspondence between the internal representations of DCNNs trained using a self-supervised contrastive learning algorithm and human semantics and recognition. To this end, we employ a few-shot learning evaluation procedure, which measures the ability of DCNNs to recognize novel concepts from limited exposure, to examine the inter-categorical structure of the learned representations. Two comparative approaches are used to relate the few-shot learning outcomes to human semantics and recognition, with results suggesting that the representations acquired through contrastive learning are well aligned with human cognition. These findings underscore the potential of self-supervised contrastive learning frameworks to model learning mechanisms similar to those of the human brain, particularly in scenarios where explicit supervision is unavailable, such as in human infants prior to language acquisition.
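For context on the few-shot evaluation procedure mentioned above, the sketch below shows one common instantiation, nearest-prototype classification on frozen embeddings; it is an illustrative assumption rather than the authors' exact protocol, and the episode sizes and random "embeddings" are placeholders.

```python
import torch
import torch.nn.functional as F

def prototype_accuracy(support: torch.Tensor, support_y: torch.Tensor,
                       query: torch.Tensor, query_y: torch.Tensor) -> float:
    """Nearest-prototype few-shot evaluation on frozen embeddings: each class
    prototype is the mean support embedding of that class, and every query is
    assigned to its most similar prototype."""
    classes = support_y.unique()
    protos = torch.stack([support[support_y == c].mean(0) for c in classes])
    sims = F.normalize(query, dim=1) @ F.normalize(protos, dim=1).t()
    pred = classes[sims.argmax(1)]
    return (pred == query_y).float().mean().item()

# Toy 5-way 5-shot episode with random vectors in place of DCNN features.
way, shot, q_per = 5, 5, 15
support_y = torch.arange(way).repeat_interleave(shot)
query_y = torch.arange(way).repeat_interleave(q_per)
print(prototype_accuracy(torch.randn(way * shot, 64), support_y,
                         torch.randn(way * q_per, 64), query_y))
```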
... Kumar P. incorporated contrastive learning into SSL to investigate the conditions of its application in different domains. The findings indicated that this method has positive significance for complementary contrastive learning [13]. To address the issue of information overload, Wu S. et al. presented a recommendation model based on GNN. ...
... These gaps are critical for ongoing research and the successful implementation of practical applications. First, models based on self-supervised contrastive learning, especially those involving large-scale data augmentation and complex sampling strategies, can be computationally intensive and require substantial hardware resources [76]. Second, contrastive learning methods are sensitive to data quality. ...
Article
Full-text available
Class imbalance remains a formidable challenge in machine learning, particularly affecting fields that depend on accurate classification across skewed datasets, such as medical imaging and software defect prediction. Traditional approaches often fail to adequately address the underrepresentation of minority classes, leading to models that perform well on majority classes but poorly on critical minority classes. Self-supervised contrastive learning has emerged as a promising approach to this issue, enabling the use of unlabeled data to build robust and generalizable models. This paper reviews the advancements in self-supervised contrastive learning for imbalanced classification, focusing on methodologies that enhance model performance through innovative contrastive loss functions and data augmentation strategies. By pulling similar instances closer and pushing dissimilar ones apart, these techniques help mitigate the biases inherent in imbalanced datasets. We critically analyze the effectiveness of these methods in diverse scenarios and propose future research directions aimed at refining these approaches for broader application in real-world settings. This review serves as a guide for researchers exploring the potential of contrastive learning to address class imbalance, highlighting recent successes and identifying crucial gaps that need to be addressed.
... Contrastive learning methods traditionally rely on predefined similarity measures such as cosine similarity or Euclidean distance [34]. However, these metrics do not capture the global structure of learned representations. ...
Preprint
Self-supervised learning has revolutionized representation learning by eliminating the need for labeled data. Contrastive learning methods, such as SimCLR, maximize the agreement between augmented views of an image but lack explicit regularization to enforce a globally structured latent space. This limitation often leads to suboptimal generalization. We propose SinSim, a novel extension of SimCLR that integrates Sinkhorn regularization from optimal transport theory to enhance representation structure. The Sinkhorn loss, an entropy-regularized Wasserstein distance, encourages a well-dispersed and geometry-aware feature space, preserving discriminative power. Empirical evaluations on various datasets demonstrate that SinSim outperforms SimCLR and achieves competitive performance against prominent self-supervised methods such as VICReg and Barlow Twins. UMAP visualizations further reveal improved class separability and structured feature distributions. These results indicate that integrating optimal transport regularization into contrastive learning provides a principled and effective mechanism for learning robust, well-structured representations. Our findings open new directions for applying transport-based constraints in self-supervised learning frameworks.
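As a rough illustration of the entropy-regularized Wasserstein (Sinkhorn) term that SinSim reportedly adds to the SimCLR objective, here is a log-domain Sinkhorn-Knopp sketch between two batches of embeddings; the squared-Euclidean cost, epsilon and iteration count are assumed choices, and the actual SinSim formulation may differ.

```python
import math
import torch

def sinkhorn_cost(x: torch.Tensor, y: torch.Tensor, eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Entropy-regularized optimal-transport cost between two uniformly weighted
    batches of embeddings, via log-domain Sinkhorn-Knopp iterations."""
    cost = torch.cdist(x, y, p=2) ** 2                    # (n, m) squared Euclidean costs
    n, m = cost.shape
    log_a = torch.full((n,), -math.log(n))                # uniform source weights (log)
    log_b = torch.full((m,), -math.log(m))                # uniform target weights (log)
    f, g = torch.zeros(n), torch.zeros(m)                 # dual potentials
    for _ in range(iters):                                # alternating dual updates
        f = -eps * torch.logsumexp((g - cost) / eps + log_b, dim=1)
        g = -eps * torch.logsumexp((f.unsqueeze(1) - cost) / eps + log_a.unsqueeze(1), dim=0)
    plan = torch.exp((f.unsqueeze(1) + g.unsqueeze(0) - cost) / eps
                     + log_a.unsqueeze(1) + log_b.unsqueeze(0))
    return (plan * cost).sum()                            # transport cost under the entropic plan

# Toy usage: in a SinSim-like setup this term would be added, with some weight,
# to a SimCLR-style contrastive loss between the two views.
x, y = torch.randn(16, 64), torch.randn(16, 64)
print(sinkhorn_cost(x, y).item())
```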
Conference Paper
Full-text available
Contrastive self-supervised learning has become a prominent technique in representation learning. The main step in these methods is to contrast semantically similar and dissimilar pairs of samples. However, in the domain of Natural Language Processing (NLP), the augmentation methods used to create similar pairs that satisfy the assumptions of contrastive learning (CL) are challenging. This is because even simply modifying a word in the input might change the semantic meaning of the sentence and hence violate the distributional hypothesis. In this review paper, we formalize the contrastive learning framework, emphasize the considerations that need to be addressed in the data transformation step, and review the state-of-the-art methods and evaluations for contrastive representation learning in NLP. Finally, we describe some challenges and potential directions for learning better text representations using contrastive methods.
Article
In this paper, we propose an online clustering method called Contrastive Clustering (CC) which explicitly performs the instance- and cluster-level contrastive learning. To be specific, for a given dataset, the positive and negative instance pairs are constructed through data augmentations and then projected into a feature space. Therein, the instance- and cluster-level contrastive learning are respectively conducted in the row and column space by maximizing the similarities of positive pairs while minimizing those of negative ones. Our key observation is that the rows of the feature matrix could be regarded as soft labels of instances, and accordingly the columns could be further regarded as cluster representations. By simultaneously optimizing the instance- and cluster-level contrastive loss, the model jointly learns representations and cluster assignments in an end-to-end manner. Besides, the proposed method could timely compute the cluster assignment for each individual, even when the data is presented in streams. Extensive experimental results show that CC remarkably outperforms 17 competitive clustering methods on six challenging image benchmarks. In particular, CC achieves an NMI of 0.705 (0.431) on the CIFAR-10 (CIFAR-100) dataset, which is an up to 19% (39%) performance improvement compared with the best baseline. The code is available at https://github.com/XLearning-SCU/2021-AAAI-CC.
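The row/column idea described above can be sketched compactly: the same pairwise contrastive loss is applied to instance features (rows) and to transposed soft cluster assignments (columns). This is a simplified illustration; it omits CC's cluster-entropy regularizer, and the head dimensions and temperatures are placeholder values rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def pairwise_contrast(a: torch.Tensor, b: torch.Tensor, tau: float) -> torch.Tensor:
    """Contrast matched rows of `a` and `b`: a[i]/b[i] form the positive pair,
    every other row in the concatenated batch acts as a negative."""
    n = a.size(0)
    z = F.normalize(torch.cat([a, b]), dim=1)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Two augmented views of a batch: instance features (n, d) and soft cluster
# assignments (n, k) from the two heads; shapes and values are toy stand-ins.
n, d, k = 32, 128, 10
h1, h2 = torch.randn(n, d), torch.randn(n, d)
p1, p2 = torch.softmax(torch.randn(n, k), dim=1), torch.softmax(torch.randn(n, k), dim=1)

instance_loss = pairwise_contrast(h1, h2, tau=0.5)          # contrast rows (instances)
cluster_loss = pairwise_contrast(p1.t(), p2.t(), tau=1.0)   # contrast columns (clusters)
print((instance_loss + cluster_loss).item())
```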
Article
Graph classification is a widely studied problem and has broad applications. In many real-world problems, the number of labeled graphs available for training classification models is limited, which renders these models prone to overfitting. To address this problem, we propose two approaches based on contrastive self-supervised learning (CSSL) to alleviate overfitting. In the first approach, we use CSSL to pretrain graph encoders on widely-available unlabeled graphs without relying on human-provided labels, then finetune the pretrained encoders on labeled graphs. In the second approach, we develop a regularizer based on CSSL, and solve the supervised classification task and the unsupervised CSSL task simultaneously. To perform CSSL on graphs, given a collection of original graphs, we perform data augmentation to create augmented graphs out of the original graphs. An augmented graph is created by consecutively applying a sequence of graph alteration operations. A contrastive loss is defined to learn graph encoders by judging whether two augmented graphs are from the same original graph. Experiments on various graph classification datasets demonstrate the effectiveness of our proposed methods. The code is available at https://github.com/UCSD-AI4H/GraphSSL.
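The augment-and-contrast recipe on graphs can be illustrated with a toy edge-dropping augmentation and a stand-in one-layer mean-aggregation encoder; this is not the authors' implementation (their code is at the linked repository), and every shape and operation here is a placeholder.

```python
import torch
import torch.nn.functional as F

def drop_edges(adj: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """One example graph alteration operation: randomly remove a fraction of edges."""
    return adj * (torch.rand_like(adj) > p).float()

def toy_graph_encoder(adj: torch.Tensor, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Stand-in for a GNN encoder: one mean-aggregation layer plus mean pooling."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    h = torch.relu(((adj @ x) / deg) @ w)        # aggregate neighbour features, project
    return h.mean(dim=0)                         # graph-level embedding

# Each original graph yields two augmented views; views of the same graph are
# positives, all other views in the batch are negatives (NT-Xent as usual).
graphs = [(torch.randint(0, 2, (10, 10)).float(), torch.randn(10, 16)) for _ in range(8)]
w = torch.randn(16, 32)
z1 = torch.stack([toy_graph_encoder(drop_edges(a), x, w) for a, x in graphs])
z2 = torch.stack([toy_graph_encoder(drop_edges(a), x, w) for a, x in graphs])

z = F.normalize(torch.cat([z1, z2]), dim=1)
sim = z @ z.t() / 0.5
sim.fill_diagonal_(float("-inf"))
targets = torch.cat([torch.arange(8, 16), torch.arange(8)])
print(F.cross_entropy(sim, targets).item())
```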
Article
A steady momentum of innovations and breakthroughs has convincingly pushed the limits of unsupervised image representation learning. Compared to static 2D images, video has one more dimension (time). The inherent supervision existing in such a sequential structure offers a fertile ground for building unsupervised learning models. In this paper, we compose a trilogy exploring the basic and generic supervision in the sequence from spatial, spatiotemporal and sequential perspectives. We materialize the supervisory signals by determining whether a pair of samples is from one frame or from one video, and whether a triplet of samples is in the correct temporal order. We uniquely regard these signals as the foundation in contrastive learning and derive a particular form named Sequence Contrastive Learning (SeCo). SeCo shows superior results under the linear protocol on action recognition (Kinetics), untrimmed activity recognition (ActivityNet) and object tracking (OTB-100). More remarkably, SeCo demonstrates considerable improvements over recent unsupervised pre-training techniques, and surpasses fully supervised ImageNet pre-training by 2.96% and 6.47% in accuracy on the action recognition task on UCF101 and HMDB51, respectively. Source code is available at https://github.com/YihengZhang-CV/SeCo-Sequence-Contrastive-Learning.
Article
One significant factor we expect video representation learning to capture, especially in contrast with image representation learning, is object motion. However, we found that in current mainstream video datasets, some action categories are highly related to the scene where the action happens, making the model tend to degrade to a solution where only the scene information is encoded. For example, a trained model may predict a video as playing football simply because it sees the field, neglecting that the subject is dancing as a cheerleader on the field. This is against our original intention for video representation learning and may introduce a scene bias on a different dataset that cannot be ignored. In order to tackle this problem, we propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to the motion information. Specifically, we construct a positive clip and a negative clip for each video. Compared to the original video, the positive clip is motion-untouched but scene-broken, while the negative clip is motion-broken but scene-untouched, via Spatial Local Disturbance and Temporal Local Disturbance. Our objective is to pull the positive closer to, and push the negative farther from, the original clip in the latent space. In this way, the impact of the scene is weakened while the temporal sensitivity of the network is further enhanced. We conduct experiments on two tasks with various backbones and different pre-training datasets, and find that our method surpasses the SOTA methods with remarkable 8.1% and 8.8% improvements on the action recognition task on the UCF101 and HMDB51 datasets, respectively, using the same backbone.
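A hedged sketch of the pull/push objective described above, with one motion-preserving positive and one motion-broken negative per original clip; the Spatial/Temporal Local Disturbance operations and the video backbone are omitted, and the exact loss form is an assumption rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def dsm_style_loss(original: torch.Tensor, positive: torch.Tensor,
                   negative: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Per clip: pull the motion-preserving (scene-disturbed) view towards the
    original clip and push the motion-broken (scene-preserving) view away."""
    o = F.normalize(original, dim=1)
    p = F.normalize(positive, dim=1)
    n = F.normalize(negative, dim=1)
    pos = (o * p).sum(dim=1) / tau                  # similarity to the positive clip
    neg = (o * n).sum(dim=1) / tau                  # similarity to the negative clip
    logits = torch.stack([pos, neg], dim=1)         # 2-way choice per original clip
    return F.cross_entropy(logits, torch.zeros(o.size(0), dtype=torch.long))

# Toy clip embeddings standing in for features from a video backbone.
feats = lambda: torch.randn(4, 256)
print(dsm_style_loss(feats(), feats(), feats()).item())
```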
Conference Paper
InfoNCE is a widely used contrastive training loss. It aims to estimate the mutual information between a pair of variables by discriminating between each positive pair and its associated K negative pairs. It is proved that when the sample labels are clean, the lower bound of mutual information estimation is tighter when more negative samples are incorporated, which usually yields better model performance. However, in practice the labels often contain noise, and incorporating too many noisy negative samples into model training may be suboptimal. In this paper, we study how many negative samples are optimal for InfoNCE in different scenarios via a semi-quantitative theoretical framework. More specifically, we first propose a probabilistic model to analyze the influence of the negative sampling ratio K on training sample informativeness. Then, we design a training effectiveness function to measure the overall influence of training samples based on their informativeness. We estimate the optimal negative sampling ratio using the K value that maximizes the training effectiveness function. Based on our framework, we further propose an adaptive negative sampling method that can dynamically adjust the negative sampling ratio to improve InfoNCE-based model training. Extensive experiments in three different tasks show our framework can accurately predict the optimal negative sampling ratio, and various models can benefit from our adaptive negative sampling method.
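To make explicit where the negative sampling ratio K enters the loss being analysed, here is a generic InfoNCE sketch with an explicit bank of K negatives per query; the similarity measure, temperature and the value of K are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(query: torch.Tensor, positive: torch.Tensor,
             negatives: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE with an explicit bank of K negatives per query.
    query, positive: (n, d); negatives: (n, K, d)."""
    q = F.normalize(query, dim=-1)
    pos = (q * F.normalize(positive, dim=-1)).sum(-1, keepdim=True)          # (n, 1)
    neg = torch.einsum("nd,nkd->nk", q, F.normalize(negatives, dim=-1))      # (n, K)
    logits = torch.cat([pos, neg], dim=1) / tau        # positive is class 0 of K + 1
    return F.cross_entropy(logits, torch.zeros(q.size(0), dtype=torch.long))

# The negative sampling ratio K is the knob the paper's framework tunes;
# here it is just a toy value.
n, d, K = 32, 128, 16
print(info_nce(torch.randn(n, d), torch.randn(n, d), torch.randn(n, K, d)).item())
```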
Article
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34,000 hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitudes of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.