International Journal of Multimedia Information Retrieval (2022) 11:461–488
https://doi.org/10.1007/s13735-022-00245-6
TRENDS AND SURVEYS
Contrastive self-supervised learning: review, progress, challenges
and future research directions
Pranjal Kumar¹ · Piyush Rawat² · Siddhartha Chauhan¹
¹ NIT Hamirpur, Hamirpur, Himachal Pradesh 177005, India
² Department of Systemics, School of Computer Science, University of Petroleum and Energy Studies, Dehradun, Uttarakhand 248007, India
Corresponding author: Pranjal Kumar (pranjal@nith.ac.in); co-authors: Piyush Rawat (psh.rawat@gmail.com), Siddhartha Chauhan (sid@nith.ac.in)
Received: 30 May 2022 / Revised: 7 July 2022 / Accepted: 14 July 2022 / Published online: 5 August 2022
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022
Abstract
In the last decade, deep supervised learning has achieved tremendous success. However, its flaws, such as its dependence on costly manual annotation of large datasets and its vulnerability to adversarial attacks, have prompted researchers to look for alternative models. Incorporating contrastive learning (CL) into self-supervised learning (SSL) has turned out to be an effective alternative. In this paper, a comprehensive review of CL methodology is provided in terms of its approaches, encoding techniques, and loss functions. It discusses the applications of CL in various domains such as natural language processing (NLP), computer vision, and speech and text recognition and prediction. The paper presents an overview of and background on SSL to establish the introductory ideas and concepts. A comparative study of the works that use CL methods for various downstream tasks in each domain is performed. Finally, it discusses the limitations of current methods, as well as the need for additional techniques and future directions, in order to make meaningful progress in this area.
Keywords Contrastive learning · Self-supervised learning · Unsupervised learning · Data augmentation · Survey
1 Introduction
Deep learning has advanced to the point where it is now an essential part of nearly all intelligent systems. Using the abundance of data available today, deep neural networks (DNNs) have become a compelling approach for a wide range of computer vision (CV) tasks, including object detection, image classification [1–3], image segmentation [4,5], and activity recognition, as well as natural language processing (NLP) tasks such as sentiment analysis [6], pre-trained language models [7–10], and question answering [11–14]. However, the labour-intensive process of manually annotating millions of data samples may have pushed the supervised approach to feature learning to its limits.
Most modern computer vision systems are supervised: they attempt to learn some form of image representation in order to discover patterns between data points and their respective annotations in large datasets. It has been suggested that providing visual explanations for the decisions made by such models can help to make them more transparent and understandable [15].
On the other hand, supervised learning has hit a snag. Because of its reliance on time-consuming and expensive manual labelling, it is prone to generalization errors, spurious correlations, and adversarial attacks. We expect neural networks to learn more quickly from fewer labels, samples, and trials. As an alternative that has received significant attention in the research community, this paradigm has been adopted by many current models because it is data efficient and generalizable. In traditional supervised learning methods, the amount of available annotated training data is extremely important. A dearth of annotations has forced researchers to develop new methods for making use of the vast amount of data already available. With the help of self-supervised methods, deep learning can progress without expensive annotations, learning feature representations in which the data themselves serve as supervision. Autoencoders and their extensions, Deep InfoMax, and contrastive coding, among other self-supervised learning models, will be thoroughly examined.
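To make the contrastive idea behind these models concrete, the sketch below shows a minimal InfoNCE-style contrastive loss, in which the embeddings of two augmented views of the same input form a positive pair and all other samples in the batch act as negatives. This is an illustrative, hypothetical sketch assuming PyTorch; the function name, temperature value, and batch handling are simplifications rather than the exact formulation of any method reviewed in this survey.

import torch
import torch.nn.functional as F

def info_nce_loss(z_i, z_j, temperature=0.1):
    # z_i, z_j: (batch, dim) embeddings of two augmented views of the same inputs.
    # Normalise so that dot products become cosine similarities.
    z_i = F.normalize(z_i, dim=1)
    z_j = F.normalize(z_j, dim=1)
    # Pairwise similarities between every view-1 and every view-2 embedding.
    logits = z_i @ z_j.t() / temperature
    # Row k's positive is column k; all other columns act as negatives.
    targets = torch.arange(z_i.size(0), device=z_i.device)
    return F.cross_entropy(logits, targets)

# Random tensors stand in for encoder outputs in this illustration.
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))

Minimizing such a loss pulls the embeddings of two views of the same sample together while pushing them away from the other samples in the batch, which is the core mechanism shared by the contrastive methods discussed in the following sections.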