Content uploaded by Manisha Saini
Author content
All content in this area was uploaded by Manisha Saini on Jul 21, 2023
Content may be subject to copyright.
THIS ARTICLE HAS BEEN ACCEPTED FOR PUBLICATION IN A FUTURE ISSUE OF THIS JOURNAL, BUT HAS NOT BEEN FULLY EDITED. CONTENT MAY CHANGE PRIOR TO
FINAL PUBLICATION. CITATION INFORMATION: DOI 10.1007/S10462-023-10557-6, ARTIFICIAL INTELLIGENCE REVIEW
Tackling class imbalance in computer vision: A
contemporary review
Manisha Saini* and Seba Susan
Delhi Technological University, Delhi, India
*Email: manisha.saini44@gmail.com
Abstract
Class imbalance is a key issue affecting the performance of computer vision applications such as medical
image analysis, object detection and recognition, image segmentation, scene understanding, and many
others. Class imbalance refers to the situation in which the samples of the majority classes far
outnumber those of the minority classes. The model may then become biased towards the majority classes
while neglecting the minority classes, adversely affecting the classification performance. In this paper, an
extensive literature survey is conducted to discuss in depth the class imbalance issues
affecting various classification tasks in computer vision. The study analyzes the performance of several
contemporary machine learning algorithms, such as the chi-square support vector machine and gradient
boosted decision trees, and deep learning models, such as deep pre-trained convolutional networks,
generative adversarial networks and vision transformers, for effective learning from imbalanced computer
vision datasets. Most of these models perform data-level manipulation (data augmentation),
cost-sensitive learning (loss functions), or a combination of the two. The survey also summarizes
novel deep learning frameworks customized to mitigate the effect of class imbalance, and covers
recent advancements in the field such as Explainable AI. Popular benchmark imbalanced datasets in
computer vision and performance evaluation metrics are also examined. Finally, the survey highlights
research gaps in contemporary literature, which should contribute towards future artificial vision
models that can learn effectively from imbalanced datasets.
Keywords: Class imbalance, Computer vision, Data-level manipulation, Cost-sensitive learning, Deep
learning
1. Introduction
Owing to the rapid growth of visual data from multiple sources, cases of class imbalance in
real-world computer vision datasets are rising sharply. Class imbalance is the
situation where the classes present in a dataset are unevenly distributed. The problem arises when
one or more classes contain far more samples than the others, so that the learning process gives
priority to the classes with more samples [1]. Generally, when one class is much more prevalent than
the others, the algorithm will prioritize minimizing errors for the more prevalent class at the
expense of the less prevalent classes. This can degrade the performance on the less prevalent
classes, which brings down the overall classification performance [2]. Even if a model exhibits high
classification accuracy, the results may be misleading because of inaccurate predictions for the
underrepresented classes. The topic has been extensively researched in the data mining community [3],
but less thoroughly in the computer vision field. The handling of class imbalance in image datasets
remains one of the persistent challenges in the field of computer vision. It is crucial to address
this issue to ensure fair and accurate classification performance in computer vision tasks.
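To make the point about misleading accuracy concrete, the following minimal sketch evaluates a trivial classifier that always predicts the majority class. The test set of 950 majority and 50 minority samples is a hypothetical example chosen purely for illustration; it is not taken from any of the cited studies.

```python
# Hypothetical test set (illustrative numbers only):
# 950 majority samples (label 0) and 50 minority samples (label 1).
y_true = [0] * 950 + [1] * 50

# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of all samples that were predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the samples of class `positive` that were predicted correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total if total else 0.0


print("accuracy:", accuracy(y_true, y_pred))                    # 0.95
print("majority recall:", recall(y_true, y_pred, positive=0))   # 1.0
print("minority recall:", recall(y_true, y_pred, positive=1))   # 0.0
```

The overall accuracy looks high even though not a single minority sample is recognized, which is why per-class measures are usually reported alongside accuracy for imbalanced datasets.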
In this section, the paper delves deeper into the issue of class imbalance, examining it from various
perspectives. It begins with section 1.1 which sheds light on the generic problem of class imbalance that is
prevalent in real-world applications. Next, section 1.2 explores the potential causes of class imbalance,
aiming to understand the factors contributing to this issue. Moving on to section 1.3, the study takes a
closer look at the specific challenges posed by class imbalance in both binary and multi-class datasets. In
section 1.4, this study delves into a discussion of the traditional methods in data mining that have been
employed to tackle this problem, and that can be easily adapted to the computer vision domain. In section
1.5, the focus is directed towards class imbalance in the domain of computer vision while section 1.6
gives an overview of some popular solutions for mitigating the class imbalance in computer vision. A
summary of other surveys, contributions of the current survey, and the organization of the paper are
outlined in sections 1.7, 1.8 and 1.9, respectively. By understanding these details, the reader can better
appreciate the significance of addressing class imbalance to ensure fair and accurate performance in
computer vision applications.
1.1 Some real-world instances of class imbalance
The class imbalance problem is prevalent in a variety of real-world applications including diagnostic
procedures in the biomedical domain, credit card fraud detection, surveillance, biometric verification etc.
In most cases, it is the minority class that is of greater interest, so a lot is at stake in finding
effective solutions for learning from such skewed datasets. Alam et al. worked on imbalanced credit
card datasets [150]. The major finding of their study is that over-sampling techniques such as the
synthetic minority oversampling technique (SMOTE) [177] perform better than under-sampling techniques
applied at the data level. Their experimental results indicated that Gradient Boosted Decision Trees
work well in comparison to other machine learning classifiers for imbalanced datasets. Yang et al. (2020)
emphasized feature engineering, selecting appropriate features to improve the early
prediction of ovarian cancer [153]. Their experimental analysis indicated that the decision tree in
combination with SVM-SMOTE gives better overall performance than other traditional
classifiers. The decision tree classifier has already proved to be a reliable machine learning strategy in
various domains of data mining [154][155].
In the biomedical domain, class imbalance occurs when certain disease conditions are rare, as is
the case with certain cancer subtypes, as compared to healthy individuals [4]. The model may not
accurately identify individuals with the disease in this case, which can have serious consequences in
healthcare settings. In the case of credit card fraud detection, class imbalance occurs when fraudulent
transactions are much less common than legitimate transactions [5]. A model trained on this kind
of dataset may prioritize minimizing the errors for the majority class (legitimate transactions) at
the expense of the minority class (fraudulent transactions), which ultimately leads to poor
performance on the minority class.
Similarly, in surveillance and biometric applications, class imbalance can occur when certain events
or individuals are much less common than others [6]. Class imbalance in computer vision may increase
the numbers of false negatives and false positives, degrading the system performance. For example,
a surveillance system that is trained to detect criminal activity may perform poorly on the minority class
(activities suspected to be criminal) because it has very few examples to learn from. This can lead
to false negatives, where the system fails to detect criminal activity that is actually occurring. The
foreground-background imbalance is a typical instance of class imbalance in video surveillance
applications. An attention-loss driven anomaly detection method is introduced in [7] that mitigates
the foreground-background imbalance by assigning distinct weights to the foreground and the
background. Another popular classification task in video surveillance is the detection of small objects such
as weapons and purses, whose samples in the training set are in the minority as compared to the background
clutter; many objects in the background may resemble the small objects under consideration, leading to a
rise in the number of false positives. An improved approach for detecting small objects using binary
classifiers was proposed in [8]. The initial stage selects candidate regions from the input frame,
after which a binarization technique based on a convolutional neural network classifier is applied.
The number of false positives was found to be significantly reduced by this approach.
1.2 Factors contributing to class imbalance
Imbalanced datasets in the real world may arise from some common factors, illustrated below for the
example of biomedical image datasets:
(a) Improper data collection, which may occur due to a lack of technology or of proper equipment,
leading to improper data acquisition. This is generally observed in the biomedical domain, where
equipment failure or a lack of modern diagnostic procedures is possible.
(b) The considerable manual effort required to collect a labeled and annotated dataset. The limited
availability of medical professionals/experts for large-scale data collection and annotation has
left several biomedical datasets with an imbalanced class distribution profile.
(c) An actual lack of samples in certain classes, which are termed minority classes. This may be
observed for images of cancerous tissues, which occur in a minority as compared to healthy tissues.
(d) Class overlapping, where there might be more than two groups (such as various subtypes of
cancers). Noise in the class labels and unclear separation of the borders/boundaries between classes
can also have a significant impact on the model performance. Mistakenly assigned labels can further
aggravate the imbalance. In this situation, clearly defined strategies are needed for labeling the
classes appropriately so that the class boundaries do not overlap.
1.3 Binary versus multi-class imbalanced datasets
Imbalanced datasets found in literature are broadly categorized into two groups, i.e. binary and multi-class.
While a binary imbalanced dataset has two classes (one majority and one minority), a
multi-class imbalanced dataset has multiple classes (there could be multiple majority or
minority classes). Different types of solutions are proposed in literature for binary and multi-class
imbalanced datasets, though works on multi-class datasets are relatively rare. Fig. 1 shows
instances of balanced and imbalanced class distributions of image samples in a biomedical dataset
for both the binary and multi-class scenarios. In comparison to the binary case, there are more
difficulties and challenges involved in addressing multi-class imbalance problems.
Various combinations can occur in multi-class imbalanced datasets, including:
(1) more minority classes than majority classes in the dataset, and (2)
fewer minority classes than majority classes. Moreover, in the case of a multi-class
imbalanced dataset, it is difficult to set a proper definition for deciding which are the majority or minority
categories solely based upon the sample distribution in each class [9].
Fig. 1. Visual representation of imbalanced and balanced class distributions for binary- and
multi-class datasets
Long-tailed class imbalance [67] is a problem in visual recognition tasks where the system is biased
towards the dominant classes and performs poorly on the tail classes. Balanced group softmax (BAGS)
is used in [68] to mitigate long-tailed class imbalance by ensuring that the head and tail classes
are both trained equally well. The authors also emphasized that
the reason for the poor performance on long-tailed data is that the classifier becomes imbalanced
because of insufficient training on the few-shot classes. Fig. 2 illustrates a long-tailed data distribution,
showing the contrast between the dominant or majority classes and the tail classes, which are the
minority classes.
Fig. 2. Illustration of long-tailed data distribution in an imbalanced dataset.
1.4 An overview of traditional strategies used for imbalanced learning in data mining
To address class imbalance issues in data mining, various researchers have proposed solutions at the data,
algorithm and hybrid levels [10][11]. These are known as the traditional approaches and use machine
learning for the classification. Data-level approaches aim to balance the class distributions by resampling
[3], which can take three forms: (i) oversampling the under-represented minority class, (ii) undersampling
the over-represented majority class, and (iii) hybrid approaches involving both undersampling and
oversampling procedures. Data-level manipulations help to create a balanced dataset with roughly
equal numbers of samples in each class, which helps to remove bias towards a particular class. Applying
data-level approaches to the training dataset can act as a regularization technique and reduce the chance of
overfitting when the number of samples is increased (oversampling); decreasing the number of samples
(undersampling), however, can lead to under-fitting [12]. Hybrid sampling
techniques that undersample the majority class and oversample the minority class have been found
to yield improved performance [13]. Such data-level manipulations can be termed intelligent data
pruning techniques [14] that retain only those samples that help in effective classification. Data pruning
thus has two objectives: (i) to maintain a balanced class distribution, and (ii) to retain the "significant"
samples that maintain the diversity in the data.
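The two basic resampling forms described above can be sketched as follows. This is a minimal illustration of plain random oversampling and random undersampling, not the specific method of any cited work; the function names, the toy file names and the 9-to-3 class split are assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of every class so that each class has as many
    samples as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_samples, out_labels = [], []
    for cls in counts:
        pool = [s for s, l in zip(samples, labels) if l == cls]
        out_samples.extend(rng.sample(pool, target))
        out_labels.extend([cls] * target)
    return out_samples, out_labels


# Toy dataset (hypothetical): 9 majority and 3 minority image identifiers.
samples = [f"img_{i}" for i in range(12)]
labels = ["majority"] * 9 + ["minority"] * 3

_, over_labels = random_oversample(samples, labels)
_, under_labels = random_undersample(samples, labels)
print(Counter(over_labels))   # both classes now have 9 samples
print(Counter(under_labels))  # both classes now have 3 samples
```

Random oversampling only repeats existing minority samples, whereas undersampling discards majority samples; a hybrid scheme would apply both with an intermediate target size.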
Fig. 3 depicts a taxonomy of various popular approaches used at the data, algorithm and hybrid levels
for handling imbalanced datasets. As illustrated in Fig. 3, data augmentation using image transformations,
synthetic image generation using generative adversarial networks (GAN) [179], SMOTE and its variants,
and over- and under-sampling algorithms are popular data-level approaches that are applied to balance
binary and multi-class imbalanced datasets. Algorithm-level approaches, on the other hand,
aim to modify the learning algorithm such that it pays more attention to the under-represented
class. An example of an algorithm-level approach is cost-sensitive learning, which assigns a higher weight to
the loss emanating from the minority class and a lower weight to the loss emanating from the
majority class [15]. The design of optimal loss functions for mitigating class imbalance in computer vision is
a significant research direction. Applying class weights is another algorithmic approach to address
class imbalance in computer vision: samples of the majority and minority classes are assigned different
weights during the training process to account for the disproportionate representation of the
classes. By adjusting the weights, the algorithm can give more importance to the minority class and help
mitigate the effects of class imbalance. Hybrid methods mainly focus on merging the data-level and
algorithm-level approaches to create effective combinations. There exist several hybrid approaches, such
as RUSBoost, which is a hybrid of resampling and boosting [17]; more generically these are known as
ensemble approaches [18]. Ensemble classifiers, such as bagging and boosting classifiers, are created by
combining strong and weak classifiers, which makes way for efficient learning. A recent work
by Chen et al. [19] introduces a novel combination called balance cascade that first samples the
imbalanced data using an Up-Down sampling strategy, which upsamples and then downsamples the
minority class; the balanced dataset is then trained on a "growing" multi-channel cascade forest,
hence the name balance cascade. The channels that do not give good performance are terminated.
Fig. 3. Taxonomy of imbalanced learning solutions at the data, algorithm and hybrid levels
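As an illustration of the class-weight idea described above, the sketch below derives per-class weights from the class frequencies. The inverse-frequency rule used here (weight = N / (K * n_c), with N samples, K classes and n_c samples in class c) is one common heuristic and is an assumption of this example; the text does not prescribe a particular weighting rule. The class names and counts are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency.

    weight_c = N / (K * n_c), so a perfectly balanced dataset gives every
    class a weight of 1.0 and rarer classes receive larger weights.
    """
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {cls: n_total / (n_classes * n) for cls, n in counts.items()}


# Hypothetical biomedical dataset: 90 healthy and 10 cancerous samples.
labels = ["healthy"] * 90 + ["cancerous"] * 10
weights = inverse_frequency_weights(labels)
print(weights["cancerous"])  # 5.0
print(weights["healthy"])    # approximately 0.556
```

During training, each sample's loss would be multiplied by the weight of its class, so that errors on the minority class contribute more to the total loss.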
1.5 Class imbalance in computer vision: An overview of the problem
Object detection, localization, segmentation and classification are important tasks in the field of computer
vision, and are used in a wide range of applications, from image and video analysis to robotics and
autonomous vehicles. Class imbalance is a common problem in the computer vision domain,
particularly in tasks associated with classification, object detection, and semantic segmentation
[20]. In image classification tasks, class imbalance occurs when some classes have many more
samples than others. Imbalanced datasets are characteristically found in various application domains
in computer vision [163]; some of these research areas are mentioned below for reference:
(1) Biomedical domain: Imbalanced datasets are common in the biomedical domain [116]. In the case of
diagnostic systems for detecting rare diseases, the number of samples for the rare disease may be much
smaller than the number of samples for more common disease types or the healthy samples. Examples in
the biomedical domain are: (a) the ECG Heartbeat classification dataset, which contains ECG readings
taken from individuals and is used to evaluate the performance of systems in detecting abnormalities in
heart rhythms; (b) cancer detection from MRI/CT scan images, where the cancerous
samples are fewer in number than the non-cancerous samples.
(2) Biometrics recognition: The biometric domain covers facial, fingerprint and iris patterns. Behavioral
biometrics such as gait recognition are also well researched [174]. The class profile of most
real-world biometric datasets is imbalanced [148]. For example, in a face recognition system,
the number of samples for certain individuals may be much smaller than for the other
individuals present in the dataset.
(3) Defect detection in industrial processes: Another common imbalanced dataset is the one used for
defect detection in fabrics or in surface inspection in industrial processes [105, 106]. Here, the number of
image patches containing the defect is far smaller than the number of defect-free image patches.
(4) Pedestrian detection in traffic surveillance: This is a commonly found imbalanced dataset useful in
self-driving cars and video and traffic surveillance [115]. The dataset is imbalanced because the number
of negative samples (non-pedestrian images) is much greater than the number of positive samples (i.e.
pedestrian images).
(5) Object recognition: Effective classification of the objects in a scene is tremendously affected by the
distribution of object classes in the training set [175]. For example, in a dataset of objects, the class
"object1" may have many more samples than the class "object2". This can lead to a model that performs
well on the majority class (object1), but poorly on the minority class (object2).
(6) Human activity recognition: Likewise, in human activity datasets, the less frequent activities have
fewer samples, resulting in an uneven class distribution [176]. Human activity recognition from videos is
challenging when the intra-class variation is high and the inter-class variation is low. For instance, sports
activities such as jumping, running and walking have overlapping contexts; this scenario is critically
affected when a few of the classes are inadequately represented in the learning process.
(7) Agriculture: Some popular computer vision experiments related to agriculture are plant disease
classification, weeds classification etc. [110][111]. The class imbalance issue is prevalent in this field due
to the common occurrences of certain types of plant diseases as opposed to the rarely occurring ones.
(8) Object detection and image segmentation: Object detection and semantic segmentation tasks are also
affected by class imbalance. For example, in a dataset of traffic scenes, the "pedestrian" class may be rarer
than the "car" class. This can lead to a model that has difficulty in detecting pedestrians, even if it
performs well on other classes. In object detection, the main challenge involves identifying the
location and class of numerous objects present in a given image; the problem is complicated when the size
of the object is small. One major instance of class imbalance in computer vision is the smaller foreground
(objects) area as compared to the larger (and cluttered) background area. As the object occupies a smaller
portion of the image, there are many more background pixels than foreground pixels. This
class imbalance can affect the performance of algorithms, particularly if the model is biased towards the
background class. Fig. 4 depicts an object detection case where class imbalance occurs in a
real-world example consisting of two categories: minority (foreground) and majority (background). The
number of background pixels is significantly higher in this case.
Fig. 4. Illustration of class imbalance in the form of smaller foreground area versus larger
background area for the object detection task.
Object detection and segmentation tasks can be adversely affected by class imbalance, particularly
when there are a large number of classes with few instances each. There could be multiple
reasons for class imbalance, such as the relative rarity of certain objects in the dataset, or the difficulty of
collecting and labeling data for certain classes. Object detection, localization, and classification can be
used together to identify and classify objects in an image or video. For example, a
self-driving car uses object detection to identify the pedestrians, cars, and other objects in its
environment, and then uses object classification to determine the type of each object. The
impact of class imbalance on the perceptive capacity of self-driving cars is discussed in [21],
along with a comprehensive overview of the application of deep learning techniques in the context of
self-driving cars.
1.6 Class imbalance in computer vision: Popular solutions adapted from data mining to
computer vision
It is generally acknowledged that the traditional machine learning solutions discussed previously are
applicable to the imbalanced image datasets available in computer vision and deep learning; however,
few works embed class imbalance solutions into deep neural network architectures, as per
the survey in [22]. Two popular and effective solutions have been adapted from data mining studies into
computer vision: (i) cost-sensitive learning and (ii) data augmentation, which is equivalent to
oversampling of the minority class. Many researchers have tried different combinations of loss
functions and data augmentation techniques for different computer vision tasks [142]. Data augmentation
of the minority class is the simplest and most widely tried remedy for mitigating class imbalance issues in
deep learning [69]. The data is augmented synthetically through various image transformations such as
scaling, rotation and flipping, and in some cases a Generative Adversarial Network (GAN) is used to
generate fake or synthetic images [78]. Fig. 5 illustrates data augmentation on image samples of the
UC Merced Land Use Dataset. Simple image transformations such as rotation, scaling, horizontal
flipping, vertical flipping, shearing, and zooming are applied to create a whole set of synthetic images
that closely resemble the natural images in the dataset.
Fig. 5. Illustration of applying data augmentation on the image samples of the UC Merced Land Use
Dataset
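The flipping and rotation transformations mentioned above can be sketched on a tiny grid of pixel values. This is an illustrative example only: images are represented as nested Python lists rather than with an imaging library, the helper names are hypothetical, and only three of the listed transformations are shown. The `augment_minority` helper applies the transformations to the minority class alone.

```python
def horizontal_flip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vertical_flip(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rotate90(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def augment(img):
    """Return the transformed copies of one image."""
    return [horizontal_flip(img), vertical_flip(img), rotate90(img)]


def augment_minority(images, labels, minority_label):
    """Append transformed copies of minority-class images only."""
    out_images, out_labels = list(images), list(labels)
    for img, lab in zip(images, labels):
        if lab == minority_label:
            for new_img in augment(img):
                out_images.append(new_img)
                out_labels.append(lab)
    return out_images, out_labels


# A 2x3 toy "image" of pixel intensities.
img = [[1, 2, 3],
       [4, 5, 6]]
print(horizontal_flip(img))  # [[3, 2, 1], [6, 5, 4]]
print(vertical_flip(img))    # [[4, 5, 6], [1, 2, 3]]
print(rotate90(img))         # [[4, 1], [5, 2], [6, 3]]

# Three majority images and one minority image (hypothetical labels).
images = [img, img, img, img]
labels = ["majority", "majority", "majority", "minority"]
aug_images, aug_labels = augment_minority(images, labels, "minority")
print(aug_labels.count("minority"))  # 4
```

Each transformed copy keeps the label of its source image, so the minority class grows without collecting new data.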
Most researchers have focused on simply comparing the performance of different pre-trained
deep convolutional neural networks (CNNs) for learning from imbalanced datasets; data augmentation is
applied in some cases to increase the samples in the minority class for improving the classification
performance [23][24][86]. Saini and Susan [69] applied data augmentation only on the minority class of
the BreakHis breast cancer dataset using image transformations such as shear, zoom, horizontal flip and
resizing. From the experimental analysis, they found that data augmentation applied only on
minority classes proves to be effective for imbalanced datasets. A detailed discussion of the different loss
functions used in cost-sensitive learning to mitigate class imbalance is given in the later sections of this
paper. The loss function is also the fitness function of the neural network; its term pertaining to the
minority class is weighted by a higher value (w>1), as shown in (1).
$$\mathrm{Loss} = \mathrm{majority}_{\mathrm{loss}} + \mathrm{minority}_{\mathrm{loss}} \cdot w \qquad (1)$$
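A minimal sketch of the weighting in (1) is given below. The source does not specify the base loss, so binary cross-entropy is assumed here; label 1 is taken to denote the minority class, and the function name and sample probabilities are hypothetical.

```python
import math


def weighted_binary_cross_entropy(y_true, p_pred, w=1.0, eps=1e-12):
    """Compute majority_loss + w * minority_loss as in (1).

    y_true: labels, 1 = minority class, 0 = majority class (assumption).
    p_pred: predicted probabilities of the minority class.
    w:      weight applied to the minority-class loss (w > 1 emphasizes it).
    """
    majority_loss = 0.0
    minority_loss = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        if y == 1:
            minority_loss += -math.log(p)
        else:
            majority_loss += -math.log(1.0 - p)
    return majority_loss + w * minority_loss


# Three majority samples and one poorly predicted minority sample.
y_true = [0, 0, 0, 1]
p_pred = [0.1, 0.2, 0.1, 0.3]

unweighted = weighted_binary_cross_entropy(y_true, p_pred, w=1.0)
weighted = weighted_binary_cross_entropy(y_true, p_pred, w=5.0)
print(unweighted, weighted)
```

With w = 5, the error on the single minority sample contributes five times as much to the total loss, so reducing it becomes more rewarding for the optimizer than further reducing the already small majority errors.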
Table 1 presents some updates on recent progress in imbalanced learning for vision-related
classification tasks. The application, the dataset and the remedy for class imbalance are listed for each work.
Table 1. Some recently introduced computer vision applications involving imbalanced datasets
| Author & year | Computer vision application | Imbalanced dataset | Solution proposed: data augmentation vs cost-sensitive learning |
|---|---|---|---|
| Sarafianos et al. (2018) [25] | Visual attribute classification challenges | Highly imbalanced datasets: (1) PETA, (2) WIDER | Introduced a loss function that handles instance- and class-level class imbalance, and achieved results on the PETA and WIDER attribute datasets by using attention mechanisms. The weighted variant of the focal loss works better for handling imbalance and also focuses on hard examples. |
| Sambasivam et al. (2021) [26] | Cassava leaf disease detection and classification | The authors created their own dataset, which consists of fine-grained cassava leaf disease categories with 10,000 labeled images (this dataset has a high imbalance ratio) | Class-weight, Synthetic Minority Oversampling Technique (SMOTE) and focal loss techniques helped to improve the performance of the model to a great extent. |
| Yeung et al. (2022) [27] | Biomedical application | Five publicly available imbalanced datasets: (1) CVC-ClinicDB, (2) Digital Retinal Images for Vessel Extraction (DRIVE), (3) Breast Ultrasound 2017 (BUS2017), (4) Brain Tumor Segmentation 2020 (BraTS20), (5) Kidney Tumor Segmentation 2019 (KiTS19) | Proposed the unified focal loss for the class imbalance scenario. Evaluated performance on five different image segmentation datasets (3D binary as well as 3D multi-class), and found that their loss function works better than other state-of-the-art loss functions using DSC (Dice similarity coefficient) and IoU (Intersection over Union) scores. |
| Carranza-García et al. (2021) [28] | Autonomous driving | Waymo Open Dataset | Designed an ensemble model which combines numerous learning strategies, and addresses the class imbalance problem by using a reduced focal loss that can significantly improve the detection of tough objects in two-stage detectors. |
| Dong et al. (2017) [29] | Shop domain (recognizing detailed facial or clothing attributes in images of people) | Popular large-scale imbalanced datasets: (1) CelebA facial attribute dataset, (2) X-Domain clothing attribute dataset | Formulated the novel idea of batch incremental hard sample mining of minority attribute classes, and developed a deep learning approach to reduce the impact of the majority class using class rectification loss (CRL). |
| Ngo and Yoon (2020) [30] | Facial expression recognition | AffectNet database | Formulated a novel loss function named weighted-cluster loss. |
| Yang et al. (2022) [31] | Satellite remote sensing | Three remote sensing image datasets: (1) DOTA-v1.5, (2) TGRS-HRRSD, (3) RSOD | Proposed a variant of YOLO named RS-YOLOX, and used the Varifocal Loss function to balance the number of samples (positive as well as negative). Also used Efficient Channel Attention (ECA) and Adaptively Spatial Feature Fusion (ASFF) to improve the feature learning ability and the performance of the model in identifying even the smallest targets. |
| Alaba et al. (2022) [32] | Fish species recognition | SEAMAP 21 (highly imbalanced reef fish dataset) | Proposed a deep framework involving a MobileNetv3 deep neural network, and proposed the class-aware loss function, which gives more weightage to the minority class (species having a smaller number of image samples) than to the majority class. |
| Liu et al. (2017) [33] | Traffic surveillance systems / transport systems | MIO-TCD classification challenge dataset | Proposed a deep learning based method consisting of two phases. First phase: data augmentation at the data level is applied along with balanced sampling. Second phase: an ensemble approach is designed using different convolutional neural network models with varied architectures. |
| Susan and Kaushik (2022) [34] | Face classification | Labeled Faces in the Wild (LFW) | Proposed localized metric learning for extremely large multi-class imbalanced face databases. Extracted Histogram of Gradient (HOG) features from images, which are passed as input for metric learning. |
| Alam et al. (2022) [35] | Biomedical: skin lesion cancer detection | MNIST: HAM10000 dataset | The RegNetY-320 deep learning model outperforms the AlexNet and InceptionV3 deep learning models. RegNetY-320 is a variant of ResNet that uses the RegNet module, which is based on the basic ResNet building block, in order to tackle large-scale images. |
Table 1 showcases a variety of computer vision applications, ranging from biomedical and satellite image
classification to plant disease and fish species recognition, to highlight the extent of the problem. One
notable observation from Table 1 is the exploration of efficient loss functions for deep models, which has
emerged as the most effective deep learning solution to date for imbalanced learning, followed by ensemble
learning and the application of SMOTE variants. The focal loss function appears to be a popular choice.
New loss functions introduced over the last three years include the class-aware loss
function and the weighted-cluster loss. Distance metric learning is another imbalance
treatment strategy, though works incorporating metric learning for vision datasets are relatively few.
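Since the focal loss recurs throughout Table 1, a minimal sketch of its commonly cited binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), is given below. This is the standard formulation from the object detection literature rather than any of the specific variants listed in the table; the default values gamma = 2 and alpha = 0.25 are commonly quoted settings and are treated here as assumptions.

```python
import math


def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss.

    y_true: labels in {0, 1}; p_pred: predicted probabilities of class 1.
    gamma:  focusing parameter; larger values down-weight easy examples more.
    alpha:  balancing weight for class 1 (class 0 receives 1 - alpha).
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


easy = focal_loss([1], [0.9])  # well-classified example
hard = focal_loss([1], [0.1])  # badly misclassified example
print(easy, hard)
```

The factor (1 - p_t)^gamma shrinks the contribution of well-classified examples, so the many easy majority samples no longer dominate the loss; with gamma = 0 the expression reduces to an alpha-weighted cross-entropy.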
1.7 Other surveys
The existing state-of-the-art survey papers mostly focus on narrow
areas associated with imbalanced datasets, or examine a particular
technique, approach, application or task in depth. An overall perspective on, and common solutions
for, working with imbalanced datasets in the computer vision field are missing, which has
motivated us to write this survey. Table 2 summarizes the recently
published state-of-the-art survey papers and illustrates how the present work is distinguished
from the existing research works.
Table 2. Summary of related survey papers published recently in the field of computer vision
Reference | Imbalanced dataset | Machine Learning approaches | Deep Learning approaches | Data augmentation | GAN | Vision Transformer | Image classification | Segmentation | Object detection
[162] | β | β | β | β | β | β | β | β | β
[164] | β | β | β | β | β | β | β | β | β
[159] | β | β | β | β | β | β | β | β | β
[161] | β | β | β | β | β | β | β | β | β
[42] | β | β | β | β | β | β | β | β | β
[22] | β | β | β | β | β | β | β | β | β
[165] | β | β | β | β | β | β | β | β | β
[166] | β | β | β | β | β | β | β | β | β
This Survey | β | β | β | β | β | β | β | β | β
As represented in Table 2, most of the surveys cover only a few aspects, as observed from the number of
ticks in each row. Sampath et al. published a survey in the field of generative adversarial networks (GAN)
[162]. It includes an elaborate discussion of the latest advancements associated with GAN and of various
GAN-based architectures for addressing imbalance problems in image datasets. The survey was conducted
in a very systematic way, but its main focus was on how GAN can mitigate the effect of class imbalance.
Xu et al. conducted a comprehensive survey of the application of data augmentation techniques for deep
learning in the computer vision domain [164]. However, the discussion of the challenges associated with
class imbalance is lacking, and solutions to mitigate the effect of class imbalance are not emphasized in
detail, as the main focus of the paper was on augmentation techniques and how they could improve
performance, including specific challenges such as domain shift and intensity transformations for
occlusion. Most surveys cover a few aspects that are not covered in other state-of-the-art surveys, and
are specific to one application domain. Shamshad et al. conducted a survey on transformers, which are
currently trending in the computer vision domain, focusing on medical imaging [159]. They surveyed the
application of transformers in classification, detection, segmentation, reconstruction, synthesis,
registration, clinical report generation, and various other tasks associated with medical images, without
putting separate emphasis on class imbalance scenarios and solutions for mitigating class imbalance.
Castiglioni et al. also conducted a survey on the biomedical domain; they discussed the pros and cons,
along with recommendations for selecting machine learning or deep learning approaches for medical
imaging [161]. Other surveys have mostly focused on machine learning approaches to address class
imbalance problems. Kaur et al. conducted an extensive survey of the challenges and solutions associated
with machine learning, covering various applications such as the marketing sector, information security
and bioinformatics in text and image processing-based applications [42]. Johnson et al. conducted an
extensive and structured survey covering machine learning and deep learning approaches in computer
vision, which motivated us to conduct this survey on imbalanced datasets [22]. A few approaches, such as
generative models and vision transformers, are not detailed in that survey. Zaidi et al. presented a survey
on object detection, recent developments in deep learning-based object detectors, and the associated
performance metrics [165]. That survey covers object detection in depth, but not the approaches
associated with class imbalance in object detection in particular. Chen et al. combined the discussion of
ensemble learning and deep convolutional neural networks (CNNs) in order to tackle the class imbalance
problem more effectively [166]. Data augmentation, transformers and GAN are not part of that survey.
The current survey covers all the aspects listed in Table 2, including imbalanced datasets, deep learning,
machine learning, data augmentation, GAN, vision transformers, image classification, segmentation and
object detection.
1.8 Chief contributions of this survey
Unlike the state-of-the-art surveys, the analysis conducted in this study covers numerous real-world
application domains, ranging from biomedical to satellite images, along with an in-depth description of
various techniques proposed in recent times in the field of computer vision and deep learning for
imbalanced datasets. Instead of narrowing down on specific domains or techniques, it includes
recognition, segmentation, classification and object detection applications that use data-level,
algorithmic and hybrid approaches to mitigate the effect of the class imbalance problem. A thorough
review of the existing literature, including recent advancements, is provided to offer a comprehensive
understanding of this research field. Emphasis is placed on insights into the future trends associated
with the class imbalance problem in computer vision. The time complexity involved in deep learning
models is also discussed. The survey includes recent advancements and new developments in this field
such as Transformers, GAN and Explainable AI. Additionally, certain popular benchmark imbalanced
datasets are identified and illustrated, along with a discussion of their characteristics and attributes.
The study also discusses the challenges and constraints faced when applying these methods to real-world
scenarios, and provides insights into the potential impact and practical implications of the mentioned
techniques. This survey highlights trends, challenges and future directions in the area of class
imbalance in computer vision, which will enable readers to stay updated with the latest research
developments and challenges in this evolving and promising domain that might impact real-world
applications. By addressing these aspects, the main aim is to provide a valuable resource for
researchers and practitioners in understanding and addressing class imbalance in computer vision tasks.
An overview of the appropriate evaluation metrics used on popular benchmark imbalanced datasets is
included in the study to promote unbiased and fair decisions or predictions in classification
experiments. This detailed analysis was missing in the previous state-of-the-art surveys.
This section has discussed the existing state-of-the-art survey works in depth and has
distinguished the proposed survey from them. Extensive searches were conducted using numerous
combinations of keywords/search strings, including "deep learning and class imbalance", "machine
learning and class imbalance", "GAN and class imbalance" and "transformers and class imbalance", in
order to gather peer-reviewed research articles from various sources such as conferences, journals, book
chapters and reports, as inspired by previous works in the literature. The search spanned prominent
databases such as IEEE, Scopus, ScienceDirect, SpringerLink, ACM Digital Library and Web of Science,
from January 2005 to January 2023 (with a focus on the advancements of the last five years). It yielded
roughly 1000 or more relevant articles, which were further filtered to extract 180 relevant articles
emphasizing different application domains without any repetition. Fig. 6 (top) shows a pie chart
depicting the statistics of the papers mentioned in this survey, and Fig. 6 (bottom) shows a pie chart of
the application-wise categories included in this survey. Further, Fig. 7 presents a timeline of curated
techniques developed for imbalanced image classification between 2005 and the present. As observed, the
earlier works relied on sampling strategies based on SMOTE and its variants to balance the distribution,
while after 2010 the attention shifted to cost-sensitive learning and to experimenting with different
loss functions for imbalanced learning. Ever since deep learning resurged in 2012, there has been an
increased interest in addressing class imbalance in deep learning, though deep learning-specific
solutions are more difficult to find.
However, there is a research gap in the literature, which motivated us to compile and analyze various
deep learning-based approaches for imbalanced datasets in the computer vision domain. There are limited
studies on imbalanced datasets in computer vision using deep learning that consider the numerous
applications in various domains. New advanced approaches such as GAN and vision transformers have been
shown to contribute towards mitigating the impact of class imbalance. Also, the performance evaluation
in several experiments is flawed, since the models are judged on the basis of inappropriate evaluation
metrics. Our survey aims to address all these issues, and to identify generic solutions for mitigating
class imbalance across a broad spectrum of computer vision applications.
Fig. 6. (a, top) Pie chart depicting the statistics of the papers mentioned in this survey.
(b, bottom) Pie chart of application-wise categories included in this survey.
Fig. 7. Timeline of curated techniques popularly used for imbalanced image classification between
2005 and present.
The significant contributions of the paper are briefly highlighted as follows: (1) The survey provides a
detailed analysis of various machine learning and deep learning methods for addressing class imbalance
situations, along with a summary of related works for binary and multi-class imbalanced datasets.
(2) The time complexity of AI models is discussed. (3) Research challenges, insights and limitations
associated with learning from imbalanced datasets are discussed in depth. (4) Performance evaluation and
details of some popular imbalanced datasets in computer vision are included. (5) The scope of the paper
is not restricted to a discussion of imbalanced datasets; it also discusses in depth the open challenges,
application domains and associated solutions in the computer vision domain, and provides future
research insights and directions to address this challenging problem in a structured fashion.
1.9 Organization of the paper
The paper is organized as follows. Section 2 discusses machine learning techniques adapted from the data
mining field that have been applied successfully for handling class imbalance in computer vision.
Section 3 expands upon the current crop of deep learning techniques that incorporate some mechanism for
mitigating class imbalance. Deep learning techniques incorporating data augmentation and loss functions,
which are two of the most popular remedies for class imbalance to date, are covered in a more curated and
detailed way in Section 4. Section 5 focuses on popular imbalanced datasets and their performance
evaluation metrics for numerous challenging real-world problems. The open research problems and future
research directions are outlined in Section 6. The paper is concluded in Section 7.
2. Machine learning solutions for class imbalance in computer vision
Machine learning is a subfield of Artificial Intelligence that aims to learn from data with the help of
algorithms, and is able to execute tasks with human-like intelligence without being explicitly
programmed [36][155][156]. In machine learning, tasks are subdivided into different modules. First,
features are extracted from the data, which comes under the category of feature engineering; the
extracted features are then passed to a machine learning classifier. Machine learning models do not
require large computational resources such as cloud graphics processing units (GPUs); the data can be
easily loaded and processed, and models can be easily trained using central processing units (CPUs).
They can work well even with smaller datasets, train quickly and give promising results. Machine
learning can be classified into: (1) supervised, (2) unsupervised and (3) reinforcement learning [37].
In supervised learning the data is labeled, in unsupervised learning the dataset is not labeled, while
reinforcement learning is based upon maximizing rewards and minimizing punishments. The class imbalance
issue affects all three types of learning alike; therefore, some mechanism needs to be incorporated to
counter its effect.
There are efficient machine learning and pattern recognition approaches in the literature to tackle
the class imbalance problem. Data augmentation is one of the most widely used techniques to increase
the number and diversity of samples, and thereby improve performance on an imbalanced dataset. The aim
is to enlarge the sample set by synthesizing numerous samples from the minority class, so that
ultimately the model is not biased towards the majority class. Popular data augmentation techniques in
the literature include image transformations such as rotation, scaling, translation, flipping, hue and
saturation changes, and random erasing. The choice of data augmentation techniques can vary with the
dataset and with the characteristics of the classes present in it. Finding the right data augmentation
techniques for a specific dataset requires multiple trials of combinations of image transformations.
Even GAN and the Synthetic Minority Over-sampling Technique (SMOTE) can be regarded as forms of data
augmentation. SMOTE is a popular technique that generates synthetic samples for the minority class by
interpolating between the feature vectors of neighboring samples. SMOTE thus enhances the diversity of
the minority class samples instead of merely duplicating the existing samples. Another technique is
ADASYN [178], an extension of SMOTE that introduces adaptivity into the synthetic sample generation
process. In ADASYN, different weights are assigned to the minority class samples, so that more synthetic
samples are generated for the minority class samples that are harder to classify, focusing on the more
challenging regions of the feature space. The Generative Adversarial Network (GAN) approach is often used
nowadays to address class imbalance at the data level in computer vision tasks by generating realistic
fake samples. The quality of the generated samples is crucial to ensure that the images appear realistic
and are representative of the minority class.
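The interpolation step that SMOTE relies on can be illustrated with a minimal sketch. The function name `smote_sample`, its parameters and the toy feature vectors below are hypothetical and chosen only for illustration; the sketch uses the Python standard library and is not the reference implementation of SMOTE.

```python
import math
import random


def smote_sample(minority, k=3, n_new=5, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D feature vectors of four minority class samples.
minority_features = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sample(minority_features, k=2, n_new=3)
```

Each synthetic point lies on the line segment joining two existing minority samples, which is why the new samples add diversity without simply duplicating the originals.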
Lemaître et al. (2017) created an open-source Python toolbox named imbalanced-learn [38], which
includes several machine learning algorithms to tackle class imbalance. The toolbox makes available
tools for undersampling, oversampling, and combinations of both, along with ensemble learning methods.
Data mining researchers seeking quick-fix solutions to class imbalance often make use of
imbalanced-learn. However, its application to computer vision datasets is not so easy, due to the scale
of the data and the two-dimensional spatial layout of images, which renders simple interpolation-based
resampling strategies ineffective in some cases. SMOTE and its variants like Borderline-SMOTE and
ADASYN have therefore seen only limited application in computer vision, for the resampling of
hyperspectral or satellite images [39] or biomedical images such as histopathological or X-ray images
[4][40]. In such a scenario, many researchers have confined their efforts to finding the optimal
combination of image features and machine learning classifiers to mitigate the effect of class
imbalance. A novel approach for multi-class imbalanced datasets is introduced in [41], in which a
bag-of-visual-words dictionary is constructed from deep features extracted from the pre-trained ResNet
network, and the resulting representation is passed to a non-linear Chi² SVM classifier for the
multi-class classification task. The authors validated their approach on the Graz-02 and TF-Flowers
datasets, and stated that ResNet deep features with the Chi² SVM classifier are a well-matched
combination of features and classifier for imbalanced datasets. Overall, SVM is considered an effective
classifier for imbalanced datasets; it is also used as the base classifier for ensemble learning of
imbalanced data [42].
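As an illustration of the kind of kernel a Chi² SVM operates on, the following minimal sketch computes the commonly used exponential chi-square kernel between two histograms. The exact kernel form and parameters used in [41] are not stated here, so the formula, the `gamma` parameter and the toy histograms are assumptions made for illustration.

```python
import math


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel between two non-negative histograms:
    k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    distance = 0.0
    for xi, yi in zip(x, y):
        if xi + yi > 0:  # skip empty bins to avoid division by zero
            distance += (xi - yi) ** 2 / (xi + yi)
    return math.exp(-gamma * distance)


# Hypothetical normalised bag-of-visual-words histograms of two images.
hist_a = [0.5, 0.3, 0.2, 0.0]
hist_b = [0.4, 0.4, 0.1, 0.1]
similarity = chi2_kernel(hist_a, hist_b)
```

The kernel equals 1 for identical histograms and decreases towards 0 as the histograms diverge, which makes it a natural similarity measure for histogram features such as bag-of-visual-words.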
Cost-sensitive methods, however, are amply used in computer vision since they do not involve data-level
manipulation. Cost-sensitive learning uses a cost matrix [16] and pays attention to the cost associated
with each type of misclassification. These algorithms adjust the misclassification costs or decision
thresholds to account for the imbalanced class distribution. Several combinations of features and
classifiers have been tested for cost-sensitive learning, where the misclassification cost for the
minority class is given more weight. Zhang et al. (2016) [43] tried cost-sensitive learning on the
combination of wavelet entropy features with three different classifiers, namely SVM, K-Nearest
Neighbor (kNN) and decision trees, and found that kNN performed best. In contemporary times,
cost-sensitive learning is integrated into deep networks; this study includes a detailed discussion on
this topic in subsequent sections. It can therefore be concluded that the resampling strategies popular
in data mining are generally not advisable in computer vision. In the absence of any class imbalance
correction, SVM is found to be an effective machine learning classifier for learning directly from
imbalanced datasets. Cost-sensitive learning, however, is an effective strategy for imbalanced learning
that has been adapted successfully to deep learning models.
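The role of the cost matrix can be sketched as follows: instead of predicting the most probable class, the classifier predicts the class with the lowest expected misclassification cost. The function name, the cost values and the probabilities below are hypothetical and serve only as an illustration of the idea.

```python
def min_expected_cost_class(probs, cost):
    """Return the class index with the lowest expected misclassification cost.

    probs[j]   : estimated probability that the true class is j
    cost[j][k] : cost of predicting class k when the true class is j
    """
    n_classes = len(probs)
    expected = [
        sum(probs[j] * cost[j][k] for j in range(n_classes))
        for k in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


# Class 0 = majority, class 1 = minority. Missing a minority sample
# (true 1, predicted 0) is assumed to cost five times a false alarm.
cost_matrix = [[0.0, 1.0],
               [5.0, 0.0]]
probs = [0.7, 0.3]
plain_decision = max(range(2), key=probs.__getitem__)
cost_sensitive_decision = min_expected_cost_class(probs, cost_matrix)
```

In this toy case the most probable class is the majority class, but the expected cost of predicting it (0.3 x 5 = 1.5) exceeds that of predicting the minority class (0.7 x 1 = 0.7), so the cost-sensitive decision shifts towards the minority class.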
Focal loss was introduced specifically to address class imbalance in the presence of a large number
of easy examples (majority class) and a small number of difficult examples (minority class). It has two
crucial parameters: the focusing parameter (gamma) and the balancing parameter (alpha). The focusing
parameter controls the degree of down-weighting of well-classified examples, and the balancing parameter
adjusts the balance between the positive (minority) and negative (majority) classes during training.
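A minimal sketch of the binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), shows how the two parameters act. The default values alpha = 0.25 and gamma = 2 are commonly quoted settings and are used here only as assumptions; suitable values depend on the dataset.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one example.

    p : predicted probability of the positive (minority) class
    y : true label, 1 for positive and 0 for negative
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)   # well-classified positive example
hard = focal_loss(0.10, 1)   # badly misclassified positive example
```

With gamma = 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy; larger values of gamma shrink the contribution of well-classified examples such as `easy` relative to hard ones such as `hard`.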
Hybrid approaches combine data-level and algorithm-level approaches. Ensemble methods like
bagging and boosting are helpful for dealing with class imbalance in computer vision tasks. These
methods combine multiple models to improve performance and handle imbalanced classes effectively.
Bagging involves training several classifiers independently on different subsets of the training data,
where each classifier is trained on a random sample drawn with replacement. This gives minority class
samples repeated chances to be included in the training subsets. By combining the predictions of all
classifiers, bagging reduces variance and makes predictions more stable. Boosting, on the other hand,
trains classifiers sequentially and focuses more on the examples misclassified by previous models. It
adapts by assigning higher weights to misclassified samples, which in imbalanced settings are often
minority class samples. Boosting combines weak learners into a strong learner that performs well on
imbalanced datasets. Boosting algorithms like AdaBoost, Gradient Boosting and XGBoost can handle class
imbalance by assigning higher weights to minority class samples during each iteration. An effective
approach is to construct weak and strong classifiers using different subsets of the imbalanced data,
which exemplifies how fusing resampling techniques with the learning process helps to address the
challenge of imbalanced data. The hybridization of data-level manipulations and learning strategies has
become an active research trend in current times.
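The re-weighting performed by boosting can be illustrated with one round of the classical AdaBoost update. This is a simplified sketch assuming binary labels and sample weights that sum to one; the function name and the toy labels are hypothetical.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: compute the learner weight from the weighted
    error, then up-weight misclassified samples and renormalise."""
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    error = min(max(error, 1e-10), 1 - 1e-10)  # keep the log finite
    learner_weight = 0.5 * math.log((1 - error) / error)
    updated = [
        w * math.exp(learner_weight if t != p else -learner_weight)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return [w / total for w in updated], learner_weight


y_true = [0, 0, 0, 0, 1]   # one minority sample among five
y_pred = [0, 0, 0, 0, 0]   # weak learner that predicts the majority class only
weights = [0.2] * 5
new_weights, learner_weight = adaboost_reweight(weights, y_true, y_pred)
```

After this round the misclassified minority sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.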
3. Deep learning solutions for class imbalance in computer vision
The Deep Convolutional Neural Network, a term that came into use due to the many hidden layers in the
neural network architecture, is widely used nowadays in numerous applications in various domains. The
usage of deep learning has increased tremendously in applications ranging from computer vision, speech
recognition, video processing and activity recognition to natural language processing. The popularity
of deep learning has grown to a great extent due to the easy availability of computing resources such as
graphics processing units (GPUs), as well as the widespread use of distributed and parallel systems,
which make it easier to train on huge datasets. To understand how class imbalance solutions may be
incorporated into deep learning, it is necessary to start with the deep neural network architecture and
the computation of feature maps, and then move on to the topic of transfer learning using pre-trained
networks.
This section delves further into the different deep learning approaches aimed at addressing class
imbalance in computer vision. Section 3.1 provides a comprehensive explanation of the process of
training a deep neural network from scratch. Section 3.2 explores the utilization of pre-trained
networks for tackling class imbalance in computer vision tasks. Section 3.3 discusses new developments
in the field of computer vision, such as vision transformers and explainable AI, that address class
imbalance and offer promising avenues for improving the fairness, generalizability and interpretability
of models in imbalanced settings. Lastly, Section 3.4 provides a comprehensive analysis of these
approaches from the perspective of time complexity and practical utility. Considering the computational
demands and real-world applicability, this section offers valuable insights into the feasibility and
scalability of the discussed methods for handling class imbalance in computer vision tasks.
3.1 Training a deep neural network from scratch
Deep learning and machine learning differ in the complexity of the model and in the feature extraction
and learning process. In machine learning, features are extracted separately and then passed on to train
the classifier model; thus there are separate feature engineering and classification steps. In deep
learning, however, feature extraction and classification are performed together in a single step, as
shown in Fig. 8.
Fig. 8. Illustration of feature extraction and classification in machine learning and deep learning
using a facial dataset as an example
Deep learning therefore learns features directly from the data in an end-to-end fashion, and it scores
over machine learning in terms of the high accuracies achieved for various computer vision tasks [44].
Also, deep neural network architectures are much more complex and have multiple hidden layers, with some
of the latest deep models like DenseNet [45] having more than 100 layers. Srinivas et al. reported that
VGG-16 gives better accuracy than the Inception-v3 and ResNet50 pre-trained networks for the
classification of brain MRI images into benign and malignant tumor classes [151]. Fig. 9 illustrates the
multiple layers generally found in deep learning models (convolution, pooling, fully-connected layers
and the softmax or classification layer) for an input satellite image, along with the feature map
representations extracted from intermediate layers.
Fig. 9. Deep learning model comprising multiple layers with satellite image as input
Each layer in a CNN typically consists of a set of filters, which are used to detect specific patterns
or features in the input data. As the data passes through the layers of the network, the filters are
applied to the data, and the resulting feature maps are combined and processed by the next layer in the
network. This is illustrated in Fig. 10, which visualizes the feature representations at five different
layers of a deep learning model for some sample images from the MNIST numeral recognition database. One
of the key features of CNNs is their ability to learn hierarchical representations of the data, which
allows them to automatically learn and extract useful features without the need for manual feature
engineering.
Fig. 10. Visualization of feature maps computed for MNIST numeral images at subsequent layers
of a five-layer deep learning model (C1: convolutional layer, M2: max-pooling, C3: convolutional
layer, M4: max-pooling, C5: convolutional layer)
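How a filter produces a feature map, and how max-pooling condenses it, can be sketched with a small standard-library example. The 5x5 image and the 2x2 edge filter are hypothetical toy values; in a real CNN the filter weights are learned during training.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max-pooling with a size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]          # hypothetical 2x2 filter
feature_map = conv2d(image, vertical_edge)  # 4x4 response map
pooled = max_pool2d(feature_map)            # 2x2 map after pooling
```

The filter responds only where the intensity changes from left to right, so the feature map is non-zero along the edge, and pooling keeps that response while halving the spatial resolution.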
Deep CNNs have been widely used in a variety of applications, including image and video analysis,
natural language processing, and speech recognition. Training a deep CNN from scratch is a
computationally intensive process and requires adequate training with labeled samples for effective
performance. In this scenario, the presence of class imbalance in the image dataset hinders the
effective working of the classifier. Just as in machine learning, the two most popular solutions for
class imbalance in deep CNNs trained from scratch are (i) data augmentation, where the number of samples
is synthetically increased [137][138], and (ii) cost-sensitive learning, wherein the loss contribution
of the minority class is given a boost during network optimization [95][139][143]. It is notable that
deep CNNs trained from scratch are not suitable for small datasets [140]; large pre-trained networks are
advisable in this case, which is the topic of discussion in the next subsection.
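One simple way of boosting the minority class loss is a class-weighted cross-entropy. The sketch below assumes inverse-frequency class weights, which is only one of several possible heuristics; the function names and the toy label counts are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Class weights n_samples / (n_classes * count_c), so that rare
    classes receive proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def weighted_cross_entropy(probs, labels, class_weights):
    """Mean of the per-sample losses -w_y * log p(y) over a batch."""
    losses = [-class_weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


train_labels = [0] * 90 + [1] * 10        # 9:1 class imbalance
weights = inverse_frequency_weights(train_labels)
batch_probs = [[0.8, 0.2], [0.6, 0.4]]    # hypothetical softmax outputs
batch_labels = [0, 1]
loss = weighted_cross_entropy(batch_probs, batch_labels, weights)
```

With this weighting, an error on the minority class contributes nine times as much per sample as an error on the majority class, which counteracts the 9:1 imbalance during optimization.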
3.2 Transfer learning using pre-trained networks
An alternative to CNNs trained from scratch are pre-trained networks, which are already trained on a
very large dataset and can be fine-tuned for a specific task using a smaller dataset [47]. Even with the
wide availability of resources, the time taken to train deep neural networks from scratch is very high,
and large labeled datasets are required for effective learning. This led to the development of
pre-trained models, which have an advantage when dealing with smaller datasets; the time taken to train
the network is then significantly lower, since the learning process is bootstrapped for the smaller
dataset using transfer learning [48]. Several researchers have shown that pre-trained networks with
fine-tuning outperform deep CNNs trained from scratch [136].
Popular pre-trained networks in the computer vision domain include: (1) VGG (Visual Geometry
Group) [49], a pre-trained convolutional neural network (CNN) developed by researchers at the
University of Oxford. It was trained on the ImageNet dataset and is commonly used as a starting point
for image classification tasks. (2) ResNet (Residual Network) [50], another popular pre-trained CNN,
developed by researchers at Microsoft. It is notable for its use of residual connections, which allow
the network to learn more complex patterns and achieve better performance on tasks such as image
classification. Pre-trained networks can be very useful in a variety of scenarios, such as when training
data is limited or when a model needs to be quickly deployed for a new task. By starting with a
pre-trained network, it is often possible to achieve good performance with relatively little additional
training, and such networks can serve as strong baselines for comparison with other models. There are
many different types of pre-trained networks available, and the most appropriate type depends on the
specific task and dataset. The popular pre-trained networks utilized for various vision-related tasks
are discussed next.
(1) Pre-trained networks for image classification: These are networks that have been trained on large
image datasets for the task of image classification. Examples include VGG [49], ResNet [50], Inception
[51], DenseNet [45], MobileNet [52], EfficientNet [53], InceptionResNet [54] etc. A network pre-trained
on a large dataset can then be fine-tuned, or used as a starting point for training on a new task.
Transfer learning can be useful when training data is limited or when it is desirable to leverage the
knowledge learned on a related task. When the dataset is limited, it is generally preferable to train
using pre-trained networks, since the pre-trained weights help the model to learn better from the
smaller dataset and make accurate predictions. The development of deep learning architectures that
support imbalanced datasets is an emerging research area. One example is VGGIN-Net [24][86], which
integrates the lower layers of VGG16 and an Inception block in its architecture. The result is improved
classification as compared to the baseline deep models.
(2) Pre-trained networks for object detection: These are pre-trained networks that have been trained to
detect and classify objects in images. They are typically based on CNNs and may also include additional
components such as region proposal networks (RPNs) and anchor boxes. Examples are Faster RCNN [55],
RetinaNet [56], EfficientDet [57] etc. An extensive comparative analysis of pre-trained networks for
object detection, segmentation and classification tasks in diabetic retinopathy screening is performed
in [23], wherein all three tasks can be converged into a single problem for an application. From the
analysis the authors inferred that, in the scenario of class imbalance, (1) the EfficientDet-D0 and SSD
(MobileNetV1) pre-trained models are most suited for performing object detection, (2) PSPNet (with focal
loss) works effectively for the segmentation task, and (3) the DenseNet121 pre-trained model performs
effective classification.
(3) Pre-trained networks for image segmentation: Similarly, for image segmentation there are pre-trained
networks available such as DeepLabv3 [58], PSPNet [59] etc. A few years ago, segmentation was considered
an unsupervised image processing task achieved through clustering or region growing and merging
techniques [132][133]. Present-day image segmentation, however, uses deep learning in the form of fully
convolutional pixel-labeling networks, semantic (pixel-level) and instance-level (object-level)
segmentation, recurrent neural networks and attention-based models, generative adversarial networks etc.
[134]. The application of deep learning thus reduces segmentation to an end-to-end classification task;
hence class imbalance affects the image segmentation task as well. There are very few works that deal
with class imbalance in image segmentation. One of the rare works is that of Milletari et al. [135], who
proposed a fully convolutional neural network called V-Net with an objective function based on the Dice
coefficient. The new objective function was successful in mitigating the strong class imbalance between
foreground and background pixels in the 3D medical image segmentation task.
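The effect of a Dice-based objective on foreground/background imbalance can be illustrated with a minimal sketch of a soft Dice loss. The squared-sum denominator and the smoothing constant `eps` follow one common formulation and are assumptions here, as are the toy mask and predictions.

```python
def soft_dice(pred, target, eps=1e-7):
    """Soft Dice coefficient between predicted foreground probabilities
    and a binary ground-truth mask (both flattened to 1-D lists)."""
    intersection = sum(p * g for p, g in zip(pred, target))
    denominator = sum(p * p for p in pred) + sum(g * g for g in target)
    return (2.0 * intersection + eps) / (denominator + eps)


def dice_loss(pred, target):
    """Loss that decreases as the overlap with the foreground grows."""
    return 1.0 - soft_dice(pred, target)


# 10 pixels, only 2 of them foreground (strong imbalance).
mask = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
all_background = [0.0] * 10              # 80% pixel accuracy, no overlap
good_prediction = [0.0] * 8 + [0.9, 0.8]
loss_trivial = dice_loss(all_background, mask)
loss_good = dice_loss(good_prediction, mask)
```

A prediction that labels every pixel as background is correct on 80% of the pixels yet receives a Dice loss close to 1, because the measure depends only on the overlap with the foreground and not on the abundant background pixels.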
(4) Generative models for data augmentation: These are networks that have been trained to generate new
data similar to a given training dataset, for various applications associated with computer vision
[160]. Examples include GANs [60] and Variational Autoencoders (VAEs) [61]. GAN has proved to be one of
the most effective techniques to synthetically increase the minority class samples at the data level,
which helps to deal with imbalanced scenarios. Goodfellow et al. proposed the GAN, which has two
components: a generator (G) and a discriminator (D) [60]. The generator produces fake image samples, and
the role of the discriminator is to determine whether the images it receives are fake or real. In GAN,
the generator G and the discriminator D continually compete with one another by following the min-max
strategy represented in the equation below.
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{2}$$
Tanaka et al. conducted an extensive survey of different GAN architectures; the authors reported that
data augmentation with GAN-generated synthetic images is more effective than traditional approaches for
imbalance treatment [73]. Some applications of GANs for imbalanced datasets are summarized in Table 5 in
Section 4. Fig. 11 depicts a generative adversarial network with its generator and discriminator
components. The generator takes random noise as input and produces fake images; the fake and real images
are then passed to the discriminator, which distinguishes between them.
Fig. 11. Generative adversarial network having generator and discriminator components
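As a numerical illustration of the value function in Eq. (2), the sketch below estimates V(D, G) from discriminator outputs on a batch of real and generated samples. The function name `gan_value` and the toy probabilities are assumptions made for illustration; no networks are trained here.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well: value close to 0
confident = gan_value([0.9, 0.95], [0.05, 0.1])
# A fully fooled discriminator outputs 0.5 everywhere: value is -2 log 2
fooled = gan_value([0.5, 0.5], [0.5, 0.5])
print(confident, fooled)
```

The discriminator tries to push this value up towards 0, while the generator tries to push it down by making D(G(z)) large.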
In Table 3, a detailed discussion related to various popular state-of-the-art pre-trained networks and their
key factors is presented.
Table 3. Discussion of state-of-the-art pre-trained networks and key factors
| References | Pre-trained networks | Key factors |
|---|---|---|
| [49] | VGG | The VGG architecture stacks multiple convolutional layers with small (3x3) filters and pooling layers. VGG has approximately 138 million parameters, which makes the network bulky. |
| [51] | Inception | The Inception network is built from Inception modules that apply 1x1, 3x3, and 5x5 convolutions and pooling in parallel, followed by concatenation. This allows the model to capture features at multiple levels of abstraction and enables it to learn both local and global information. |
| [50] | ResNet | The ResNet architecture is built from multiple residual blocks with skip connections, which help the gradient flow smoothly during backpropagation. |
| [45] | DenseNet | The DenseNet architecture has dense blocks, where each block contains multiple layers with the same number of feature maps. This dense block structure encourages feature reuse and strengthens feature propagation across layers. In DenseNet, the output feature maps of a layer are concatenated with the incoming feature maps rather than summed. |
| [55] | Faster RCNN | Faster RCNN performs detection on multiple region proposals and is based on the Region Proposal Network (RPN). The RPN generates region proposals, which are potential bounding box locations likely to contain objects of interest. |
| [167] | YOLO | YOLO (You Only Look Once) is one of the most popular object detection algorithms and has multiple variants; recent variants include YOLOv8 and YOLO-NAS. Many object detection algorithms use sliding windows or region proposal methods. YOLO instead divides the input image into a grid and predicts bounding boxes and class probabilities directly from the grid cells, which makes it significantly faster than many other object detection algorithms. |
| [168] | U-Net | U-Net is the most widely used pre-trained network for segmentation. Its distinctive feature is a U-shaped architecture consisting of an encoder and a decoder, with skip connections between corresponding encoder and decoder layers. These connections help to preserve spatial information and enable the decoder to access lower-level features. |
| [169] | Swin Transformer | The Swin Transformer combines the strengths of vision transformers and convolutional neural networks (CNNs). Its hierarchical structure and window-based self-attention mechanism enable efficient processing of images at different scales. |
3.3 New developments: Transformers and Explainable AI
Transformers [62], as an alternative to CNNs, and Explainable AI (XAI) [63] have brought about a
revolution in recent times. Transformers have been widely adapted for computer vision applications such
as object detection, classification, and image generation; they rely on self-attention modules instead
of convolution layers [62][159]. Transformers in the computer vision domain can be categorized into
Vision Transformer (ViT), 3D Transformer, Two-Stream Transformer, Detection Transformer (DETR), etc.
[64].
Transformers can be adapted to address class imbalance by using an adaptive weight-based sampling
strategy [141] that assigns larger weights to the minority classes, giving their samples a higher
probability of being drawn during training. The imbalanced learning strategies are thus merged into the
pre-training stage itself. Another approach is to train a transformer-based model with a loss function
designed for class imbalance, such as focal loss, which focuses learning on hard examples [142]. Recent
literature highlights various application domains where transformers can alleviate the effect of
imbalanced datasets. Bai et al. (2022) applied the Vision Transformer (ViT) to a small, imbalanced
medical capsule endoscopy dataset [65]. ViT uses self-attention, which enables it to capture long-range
information effectively. The authors emphasized three-step pooling in order to reduce the spatial
dimension by approximately three times. Kaselimi et al. (2022) proposed ForestViT, which achieved
superior performance compared with other well-known state-of-the-art deep learning models on imbalanced
datasets [66]. The superiority was more noticeable for the minority classes.
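The two strategies mentioned above, weight-based sampling and focal loss, can be illustrated in generic form. The sketch below is not the method of [141] or [142]: the function names, the inverse-frequency weighting rule, and the default `gamma = 2.0` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def sampling_weights(labels):
    """Per-sample weights proportional to 1 / class frequency,
    normalised so that they sum to 1."""
    counts = Counter(labels)
    raw = [1.0 / counts[y] for y in labels]
    total = sum(raw)
    return [w / total for w in raw]

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample, given the predicted probability
    of its true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

labels = ["major"] * 8 + ["minor"] * 2
w = sampling_weights(labels)
# Each class receives the same total sampling probability
print(sum(w[:8]), sum(w[8:]))

# gamma = 0 reduces focal loss to cross-entropy; with gamma = 2 an easy
# example (p = 0.9) is down-weighted far more than a hard one (p = 0.1)
print(focal_loss(0.9, gamma=0.0), focal_loss(0.9), focal_loss(0.1))
```

Weights of this form could be passed to a weighted sampler, for example `random.choices(range(len(labels)), weights=w, k=batch_size)` with a hypothetical `batch_size`, so that minority samples are drawn more often during training.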
Moreover, there is a rising need for explainable deep learning models, as opposed to treating deep
learning models as black boxes. When a model works as a black box, it is difficult to understand what
happens between receiving the input and making the final prediction, and hence difficult to understand
how the model helps to tackle imbalance challenges [144]. Explainable AI models provide clear
justification and explanation of their decisions and of how the individual components combine to produce
the outcome. These models are designed to be transparent and interpretable, so that a user can
understand how the model makes its decisions and can trust its outcomes [145]. SHapley Additive
exPlanations (SHAP) tools are now increasingly being used [173] to analyze the decisions of deep
convolutional networks in order to assess the explainability and interpretability of the classification
outcomes on imbalanced datasets.
3.4 Time Complexity of AI models
The time complexity of an algorithm is the amount of time and computational resources required to
execute the algorithm [158]. The time complexity of AI (Artificial Intelligence) models, specifically DL
(Deep Learning) models, can vary significantly depending on numerous factors, including the architecture
of the model, the size of the input data, the complexity of the task, and the hardware used for
computation. The training time complexity of a DL model is influenced by the number of trainable
parameters and by the size of the training dataset. As the number of parameters increases, the time
required for model optimization through backpropagation and gradient descent also increases. Likewise,
larger training datasets generally require more time for processing and for iterating through the
training steps. Time complexity is therefore computed as a measure of the overall complexity of an AI/DL
model.
The inference time complexity refers to the time required to make predictions on test samples using a
trained AI/DL model. It again depends on factors such as the model architecture, the input data size,
and the computational requirements of the underlying operations. Complex models with more layers and
costly operations such as convolutions generally have higher inference time complexity than less complex
models. For example, Conv2D and Conv2DTranspose layers, if present in the architecture, can be costlier
to train and evaluate than a simple dense layer, since their filters are applied at every spatial
position of the input. The input data size also affects the inference time, as larger inputs require
more computations. Overall, the model architecture plays a crucial role in determining both the
performance and the inference time. Different AI/DL architectures differ in complexity and therefore
have different time complexities. CNN architectures used in computer vision tasks can be computationally
intensive due to their multiple layers and convolutions. Vision transformers contain complex attention
mechanisms that also contribute to a higher time complexity. A factor that is often ignored is the
choice of hardware acceleration, which affects the overall time complexity of AI/DL models. Specialized
hardware such as GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units) can accelerate model
training and inference to an extent determined by its parallel processing capabilities, so hardware
accelerators tailored for AI/DL workloads help to reduce the overall computation time. In the case of a
GAN, training a generator and a discriminator takes much more training and pre-processing time than
training a very deep dense neural network, which adds to the complexity. The complexity also depends on
the width (fewer layers but more filters per layer) and the depth (fewer filters per layer but more
layers) of the network.
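To make the dependence on architecture concrete, the sketch below counts trainable parameters and multiply-accumulate operations (MACs) for one convolutional layer and one dense layer. The helper names and the layer shapes are illustrative assumptions; stride 1 and 'same' padding are assumed for the convolution.

```python
def conv2d_cost(h, w, c_in, c_out, k):
    """Parameters and MACs of a k x k Conv2D layer (stride 1, 'same'
    padding) applied to an h x w x c_in input."""
    params = k * k * c_in * c_out + c_out   # weights + biases
    macs = h * w * k * k * c_in * c_out     # one k*k*c_in dot product per output value
    return params, macs

def dense_cost(n_in, n_out):
    """Parameters and MACs of a fully connected layer."""
    params = n_in * n_out + n_out           # weights + biases
    macs = n_in * n_out
    return params, macs

print(conv2d_cost(224, 224, 3, 64, 3))   # (1792, 86704128)
print(dense_cost(4096, 1000))            # (4097000, 4096000)
```

In this example the convolutional layer holds far fewer parameters than the dense layer but needs roughly 21 times as many MACs, which shows that parameter count and computational cost are separate quantities and that either can dominate depending on the layer shapes.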
4. Extensive survey of contemporary literature on data augmentation and loss functions used
in deep learning frameworks involving imbalanced datasets
This study has reviewed in depth various data augmentation procedures, loss functions and deep learning
architectures that have been introduced over the past five years to tackle class imbalance. The majority
of these works were published in the last two years. Table 4 presents a detailed analysis of
applications in which data augmentation is used with deep learning models to mitigate the effect of
class imbalance across different computer vision tasks, including classification, segmentation, object
detection, and recognition. Table 5 presents an exclusive study on data augmentation using GANs, a
trending line of research on fake image generation that has been shown to increase the efficiency of
models to a great extent in numerous applications.
Table 4. A summary of data augmentation procedures used along with deep learning models to mitigate the effect of class imbalance (from past five years data)
| References | Task | Imbalanced dataset | Data augmentation details | Contribution |
|---|---|---|---|---|
| [71] | Classification - Alzheimer's detection | Open Access Series of Imaging Studies (OASIS) | The data augmentation operations applied included rotation and cropping (from the right, bottom, left, corner, and top), with parameters (90°, 270°, 180°). | Proposed transfer learning for Magnetic Resonance Imaging (MRI) and 3D MRI views of the brain along with image augmentation in order to avoid overfitting. The analysis found that the main view of the brain with augmentation gave higher performance than the 3D view of the brain. |
| [72] | Character recognition | Dataset taken from the digital collection of Southeast Asian palm leaf manuscripts. | Applied different image processing operations including noise, background, brightness adjustment and affine transformations. The affine transformations include horizontal and vertical translations, rotations, zoom, or compositions of two or more transformations. | Emphasized that robust data augmentation operations can improve the overall performance of the CNN-based ancient Sundanese model. |
| [33] | Classification - traffic surveillance | MIO vision traffic camera dataset (MIO-TCD) | Applied random rotations, cropping, flips and shifts. For extremely imbalanced data, an oversampling strategy with random shuffle is proposed. | Proposed an ensemble deep learning model using different CNN architectures along with balanced sampling. Adopted maximum majority voting for the classification of images. |
| [74] | Classification - Biomedical images | Skin melanoma diagnosis; histopathological images; Magnetic resonance imaging (MRI) scan analysis | Applied classical image transformation operations (zoom, rotate, crop, histogram-based methods) as well as style transfer and generative adversarial networks. Created a method of data augmentation using image style transfer. | Compared different data augmentation operations for image classification. Proposed a method of data augmentation using image style transfer to generate new image samples of high perceptual quality, which are further used to train the network in order to improve the training efficiency. |
| [75] | Segmentation | IPPN dataset | Introduced a method of image augmentation for segmentation tasks which takes image-mask pairs and transforms them to capture numerous scenes. | Compared the proposed augmentation with and without basic augmentation operations and found a relative increase in F1-score using the DeepLabV3 model in comparison to basic augmentations. |
| [76] | Object detection | Collected labeled dataset consisting of individual cells. | Rotated crop images by 90°, which augmented underrepresented cell counts by roughly four times, and removed crop images containing only RBCs. | Used the Faster Region-based Convolutional Neural Network (Faster R-CNN) for object detection. |
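Rotation-based oversampling of the kind listed in Table 4 (for example, rotating crops in steps of 90° so that an underrepresented class grows roughly fourfold [76]) can be sketched in a few lines. The helper names `rot90` and `rotation_augment` and the tiny 2x2 "image" are assumptions for illustration; a real pipeline would operate on image arrays.

```python
def rot90(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def rotation_augment(img):
    """Return the image at 0, 90, 180 and 270 degrees."""
    views = [img]
    for _ in range(3):
        views.append(rot90(views[-1]))
    return views

patch = [[1, 2],
         [3, 4]]
views = rotation_augment(patch)
print(len(views))   # 4
print(views[1])     # [[3, 1], [4, 2]]
```

Applying `rotation_augment` only to minority class images is one simple way to raise their count relative to the majority classes.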
Table 5. A summary of GANs used for data augmentation in various computer vision applications (from past five years data)
| References | Task | Imbalanced dataset | GAN type | Details of data augmentation using GAN |
|---|---|---|---|---|
| [77] | Defect spot welds detection | Industrial spot welding defects images dataset | Balancing GAN with gradient penalty (BAGAN-GP) | Extensive experiments showed that the proposed approach can generate spot weld defect images efficiently and improve classification performance for industrial inspection with annotation-lacking or class-imbalanced datasets. This work provides a valuable reference for industry defect image analysis based on deep learning. |
| [78] | Breast cancer classification | Breast cancer histopathology dataset (BREAKHIS) | Deep Convolutional Generative Adversarial Network (DCGAN) | DCGAN is used at the data level and applied to the samples of the minority class. The balanced binary dataset obtained after applying DCGAN is passed to the proposed modified VGG16 architecture. |
| [79] | Mammogram classification | Digital Database for Screening Mammography (DDSM) | Class conditional GAN with mask infilling (ciGAN) | Proposed the novel approach ciGAN using a multi-scale generator architecture. The generator uses a cascading refinement network to generate multi-scale features. |
| [80] | X-ray image classification | Large-scale dataset of X-ray images | Deep Convolutional GAN (DCGAN) and Cycle-GAN | Proposed the use of DCGAN to generate new X-ray image samples of threat objects. Cycle-GAN was adapted to translate camera images of threat objects into X-ray images. Various Region Based Convolutional Neural Network (R-CNN) models were then trained along with numerous augmentation approaches. |
| [81] | Digit recognition | MNIST; E-MNIST; SVHN; CIFAR-10 | Multiple Fake Class Generative Adversarial Networks (MFC-GAN) | Proposed a novel GAN approach called MFC-GAN, which uses multiple fake classes to ensure fine-grained generation and classification of minority class instances. Augmentation using MFC-GAN achieved superior performance in comparison to other traditional augmentation or oversampling techniques. |
| [82] | Facial emotion classification | Facial Expression Recognition Database (FER 2013); Static Facial Expressions in the Wild (SFEW); Japanese Female Facial Expression (JAFFE) | Cycle-consistent adversarial networks (CycleGAN); GAN | Performed augmentation using cycle-consistent adversarial networks (CycleGAN) with a CNN as the classifier and a least-squares loss as the adversarial loss to prevent the vanishing gradient issue. |
| [83] | Weather classification | Multi-class Weather Image (MWI); Multi-class weather dataset (MWD); self-made laboratory dataset | GAN; Deep Convolutional Adversarial networks (DCGANs) | Proposed an ensemble approach (using KNN, SVM, RF and AdaBoost) incorporating advanced GANs, along with an efficient data cleaning technique (Edited Nearest Neighbor (ENN) rule) to remove outlier images generated by the GANs, for dealing with the class imbalance scenario; tested on various weather classification datasets. |
| [84] | Plant disease classification | Prepared a small tomato plant disease image dataset captured with a camera under varying lighting, temperature, season and humidity, and at different places. | Introduced a U-Net integrated in CycleGAN in order to improve the perceptual quality of generated image samples. | Generated synthetic fake samples at the data level using the proposed GAN approach. The improved approach learns better with respect to the data sample distributions, which helps to further reduce the bias induced by class imbalance. |
| [85] | Medical image semantic segmentation | LiTS-2017 (liver lesion segmentation); MDA231, PhCHeLa (microscopic cell segmentation); BraTS-2017 (brain tumor segmentation) | Conditional Generative Adversarial Network; Recurrent conditional GAN | Proposed a conditional generative refinement network for biomedical image segmentation, which consists of: (1) a generative network, which segments pixel labels; (2) a discriminative network, which classifies the segmented output as real or fake; (3) a refinement network, which is further trained on the prediction of false positive and false negative masks produced by the generator. |
Fig. 12. (a) Class rectification loss (b) Dice Loss for imbalanced data learning.
Fig. 12 illustrates the computation and update process of two popular loss functions used for
imbalanced datasets. Fig. 12 (a) illustrates the Class Rectification Loss (CRL) based regularization
approach for imbalanced data learning, where the imbalanced dataset is provided in batches to the CNN
along with the class rectification loss. Each training batch is profiled to find the majority and
minority classes. Fig. 12 (b) shows the Dice loss computation and update in the learning process. Yeung
et al., in a recent work, emphasized the importance of using the correct focal loss function while
training a deep learning model, as it can affect the overall performance and convergence of the model
[70]. Table 6 presents a discussion of the different loss functions used in deep learning models for
imbalanced datasets. A variety of loss functions are covered, including focal loss, Dice loss, CRL,
etc., for a variety of computer vision tasks.
Table 6. A summary of loss functions used to mitigate the effect of class imbalance (from past five years data)
| References | Task | Dataset | Loss | Analysis |
|---|---|---|---|---|
| [87] | Classification | Constructed imbalanced datasets from popular balanced datasets: CIFAR; Fashion MNIST; Tiny ImageNet | Complement cross entropy (CCE) loss | Proposed the complement cross entropy (CCE) loss for imbalanced classification, which emphasizes suppressing the probabilities of incorrect classes to help deep learning models learn discriminative information. |
| [88] | Human attribute analysis | Face attribute dataset CelebA; pedestrian attribute dataset RAP | Dynamic Curriculum Learning (DCL) | Proposed the Dynamic Curriculum Learning approach, which adaptively adjusts the sampling strategy in order to generalize better. |
| [89] | Object detection | BDD100K (highly imbalanced driving database) | Weighted Cross Entropy Loss | Proposed a weighted variant of the original cross entropy loss, named weighted Cross Entropy Loss, that assigns suitable weights to each object class present in the dataset. |
| [90] | Deep regression tracking | Five benchmark datasets: OTB-2013, OTB-2015, Temple-128, UAV-123 and VOT-2016 | Shrinkage Loss | Proposed the shrinkage loss to balance the training data by penalizing the importance of easy training samples, and applied residual connections to integrate multiple convolutional layers and their output response maps in order to facilitate faster convergence of regression networks. |
| [91] | Lung nodule classification | LIDC/IDRI dataset | Focal loss | Used a 15-layer 2D deep convolutional neural network named LdcNet along with the focal loss function to classify lung CT scans into nodule or non-nodule categories. |
| [46] | Medical image segmentation | Multimodal brain tumor segmentation dataset (BraTS2018) | Focal Dice loss | Dice loss enables foreground-background separation under class imbalance; focal Dice loss additionally reduces the contribution of easy examples during the learning process. |
| [92] | Imbalanced regression | Imbalanced Human Mesh Recovery (IIHMR); Age and Depth Estimation | Balanced Mean Square Error (MSE) loss | Proposed the Balanced MSE loss, which was found effective for high-dimensional imbalanced regression. |
| [93] | Visual classification | CIFAR-10; CIFAR-100; Tiny ImageNet; iNaturalist 2018 | Influence-Balanced Loss | Proposed the Influence-Balanced Loss, which rebalances training by reducing the significance of samples that cause an overfitted decision boundary. |
| [94] | Medical image segmentation | Digital Retinal Images for Vessel Extraction (DRIVE); CVC-ClinicDB; Brain Tumor Segmentation 2020 (BraTS20); Breast ultrasound 2017 (BUS2017); Kidney Tumor Segmentation 2019 (KiTS19) | Unified Focal loss | Proposed the Unified Focal loss for 2D and 3D multi-class medical image segmentation on imbalanced datasets. Results show significant improvement over other loss functions in most cases. |
| [95] | Lung nodule classification | LIDC-IDRI dataset | Class-Weighted loss | Proposed a loss function based on class weights, in which the loss associated with each class is weighted by the ratio of the total population to the class population. |
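The class-weighting rule described for [95] in Table 6 (weight = total population / class population) can be written down directly. The sketch below is a generic illustration, not the loss from [95] or [89]: the function names, the toy labels, and the normalisation by the sum of weights are assumptions.

```python
import math
from collections import Counter

def class_weights(labels):
    """Weight of each class = total number of samples / samples in that class."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / n for c, n in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).
    probs[i] maps each class name to its predicted probability."""
    num = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)   # assumed normalisation
    return num / den

labels = ["nodule"] * 2 + ["non-nodule"] * 8
w = class_weights(labels)
print(w)  # {'nodule': 5.0, 'non-nodule': 1.25}

# A hypothetical classifier that always favours the majority class
probs = [{"nodule": 0.1, "non-nodule": 0.9}] * len(labels)
uniform = {c: 1.0 for c in w}
print(weighted_cross_entropy(probs, labels, uniform))  # plain cross-entropy
print(weighted_cross_entropy(probs, labels, w))        # larger: minority errors count more
```

With these weights each class contributes the same total weight to the loss, so a classifier that ignores the minority class is penalised more heavily than under plain cross-entropy.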
Sivapuram et al. (2023) [170] proposed a novel learning strategy called VISAL to address the challenges
of class imbalance in classification tasks. The aim of VISAL is to improve generalization on samples
from the minority class by combining angular and Euclidean margins within the cross-entropy (CE)
learning strategy. The researchers evaluated their approach on various imbalanced datasets, including
CIFAR, COVIDx, IMDB reviews, and Tiny ImageNet. One of the key advantages of VISAL is its seamless
integration into existing deep neural network (DNN) models: the idea is built into a well-designed CE
loss function, making it easy to incorporate into different DNN frameworks. This flexibility enables it
to tackle class imbalance in high-level vision applications, including image segmentation and object
detection. A deep learning model for imbalanced medical data from IoT systems was proposed in [171] by
combining a decision tree model with BiLSTM deep learning and a data balancing strategy. Table 7
presents a discussion of various deep learning frameworks that are customized to mitigate the effect of
class imbalance. A scrutiny of this table reveals some unique ideas in the form of iterative learning
mechanisms, weighted ensemble formation, smart combination of classifiers, novel architectures, etc.,
that were proposed as alternatives to the more popularly used data augmentation and cost-sensitive
learning. Table 7 covers some of the latest research works, including different variants of vision
transformers, enhanced ensemble-based deep learning approaches, and advanced deep learning
architectures. These works pave the way for the intelligent design of future models that can be
generalized to any type of imbalanced dataset and can achieve optimal performance in any field of
computer vision.
Table 7. A summary of novel deep learning frameworks customized to mitigate the effect of class imbalance (from past five years data)
| Field of computer vision | Classification task | Deep Learning framework | Reference |
|---|---|---|---|
| Biomedical | Histopathological image classification | Proposed a combination of different pre-trained networks using coalition game theory, the Choquet fuzzy integral, and information theory. The pre-trained networks used for the combination are VGG16, VGG19, Inception V3, Xception and InceptionResnetV2. The fuzzy ensemble considers subsets of classifiers and hence is effective for imbalanced classification. | [96] |
| Biomedical | Lung cancer detection from CT scan | DFD-Net has been proposed for lung cancer detection from CT scan images. A retraining strategy fine-tunes only some layers of the network, which ensures improved detection of the minority class. | [97] |
| Biomedical | Heart disease detection | Proposed a combination of GAN and LSTM, named GAN-LSTM, for imbalanced data by synthetically generating fake samples for detection. | [98] |
| Biomedical | Alzheimer's disease detection from brain MRI images | 12-layer CNN architecture trained on the OASIS dataset. Data pre-processing steps: 1) image resizing, and 2) image denoising. The customized CNN outperformed pre-trained CNNs for the classification of the imbalanced data. | [99] |
| Biomedical | Cervical cancer detection | Proposed a Token-to-Token Vision Transformer (T2T-ViT) in combination with SMOTE-Tomek Links in order to balance the dataset along with the weights of the images and number of image samples. | [100] |
| Surveillance for crime detection | Weapon detection | Emphasized the small regions of interest in the image using the YOLOV4 object detection model. The YOLOV4 model is used along with Partial-ResNet with multi-scale fusion of fine semantic features of small objects, in order to gather more information about the minority class. | [101] |
| Surveillance for crime detection | Pose/posture tracking | Proposed the Sensor-independent Parallel dEep ConvoluTional leaRning nEtwork (SPECTRE), a deep learning network using convolutional networks to classify various postures at the workplace. SPECTRE uses explainable AI to identify the patterns that are significant for learning. | [102] |
| Surveillance for crime detection | Human activity recognition | Proposed an approach using LSTM infused with 1-D ConvNet, named the joint diverse temporal learning framework. The joint learning of the two temporal models, using fuzzy temporal windows and multiple sensors, ensures improved classification of the minority classes. | [103] |
| Surveillance for crime detection | Event detection | Proposed an enhanced ensemble-based deep learning framework in which SVM classifiers are used as weak learners in order to address the class imbalance issues. Weight coefficients of the classifiers in the ensemble are adjusted according to classification performance. | [104] |
| Industrial processing | Fabric defect detection | Proposed a novel deep CNN named Mobile-Unet for learning effectively from imbalanced defect/non-defect data. Depth-wise separable convolution with a median frequency balancing loss function helps to mitigate the class imbalance. | [105] |
| Industrial processing | Surface defect inspection | Proposed a deep ensemble framework with dynamic fluctuation adaptation. Weight adjustment is performed after every iteration so that the sub-models are fine-tuned on new data characteristics. | [106] |
| Industrial processing | E-commerce image classification | Proposed a deep learning and computer vision-based system for segregating, detecting and removing offensive and non-compliant images from an e-commerce catalog. The authors started with limited samples to fine-tune Resnet50 and Inception-V3 deep learning models, and then improved the training set iteratively; they also used YOLOV3 and Faster R-CNN object detection models in the proposed system. | [107] |
| Industrial processing | Material identification | Proposed a Vision Transformer (ViT) for classification as well as detection of construction materials. It was observed that the model can be easily generalized to other imbalanced datasets. | [108] |
| Agriculture | Weeds classification | Proposed a deep learning based approach for weed classification using a voting method that combines multiple deep CNNs: VGG, NASNet, ResNet, MobileNet, and InceptionResNet. In this approach the score vector is adapted for the voting mechanism, with priority weights determined in order to identify the DNN models that contribute more to the scoring. | [109] |
| Agriculture | Defect detection on tomatoes | The ResNet50 model with fine-tuning of all the layers was found to be an effective solution for the imbalanced dataset, which is biased towards the healthy class. | [110] |
| Agriculture | Leaf disease identification | Proposed a lightweight convolutional neural network, RegNet, for the small imbalanced apple leaf diseases dataset. The results show that RegNet, using the Adam optimizer with a learning rate of 0.0001, is effective in comparison to other state-of-the-art deep learning models including Vision Transformer. | [111] |
| Agriculture | Classification of biotic stress in coffee plantation | Proposed a multi-task system using convolutional neural networks, which uses the ResNet50 model along with a data augmentation technique on leaf images for severity identification and classification of biotic stress in coffee plantations. All layers of ResNet50 were made trainable for effective separation of the imbalanced classes. | [112] |
| Smart mobility | Surveillance of urban road traffic | Proposed a novel approach for surveillance and traffic monitoring on an imbalanced aerial image dataset by using YOLOv4 object detection models along with basic augmentation such as saturation, hue, flip, rotation, brightness, zoom, shearing, and crop. The proposed approach is able to effectively recognize the different traffic components in aerial images taken from different angles and altitudes. | [113] |
| Smart mobility | Vehicle detection/classification | A deep learning architecture was proposed for surveillance images by combining bagging and convolutional neural networks in a single framework. | [114] |
| Smart mobility | Pedestrian detection | Proposed a hybrid patch-based deep learning approach using an EfficientNet-B0 based classifier and a false positive reduction algorithm. | [115] |
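Several frameworks in Table 7 weight the members of an ensemble according to their performance (for example [104], [106] and [109]). A generic weighted soft-voting step of this kind might look as follows. This is a sketch under stated assumptions and not the scheme of any cited work: the function name, the toy probabilities, and the choice of weights are hypothetical.

```python
def weighted_soft_vote(prob_lists, weights):
    """Combine per-classifier class-probability vectors using
    normalised classifier weights; return (predicted class, scores)."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    combined = [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]
    return combined.index(max(combined)), combined

# Three hypothetical classifiers scoring one sample over [majority, minority]
probs = [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]]
# Hypothetical weights, e.g. each model's validation recall on the minority class
weights = [0.2, 0.1, 0.9]
label, combined = weighted_soft_vote(probs, weights)
print(label)  # 1
```

With equal weights the three classifiers would tie at 0.5 per class; weighting by minority class performance tips the decision towards the classifier that handles the minority class best.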
5. An analysis of popular imbalanced datasets and performance evaluation metrics
In this section, Section 5.1 provides a comprehensive list of popular benchmark datasets with inherent
class imbalance that are widely employed by researchers across diverse fields. Section 5.2 is dedicated
to discussing the evaluation metrics and their applicability in different scenarios. Together, these
subsections contribute to an understanding of class imbalance challenges in different research areas,
provide a list of popular and relevant datasets, and guide researchers in selecting appropriate
evaluation metrics for effective assessment.
5.1 Popular imbalanced datasets in computer vision
In an imbalanced dataset, the minority classes are generally of greatest interest and need to be
handled carefully, especially in the biomedical sector where the risk of misdiagnosis is high. This
section reviews several popular computer vision datasets belonging to different
application domains. These benchmark datasets can serve as starting points for evaluating how well
algorithms and techniques handle imbalanced data in computer vision tasks.
Datasets in computer vision can be broadly categorized, on the basis of their class distribution and
their size, into the following groups:
1. Datasets having an adequate number of samples in each class. While training the
model, the probability associated with each class is equal.
2. Datasets having an adequate number of samples overall, in which instances
of some classes are rarer than those of other classes; such datasets are termed uneven datasets.
3. Small datasets having few samples overall; the number of samples per class in such
datasets may be equal or unequal.
It is very difficult for a model to learn from datasets belonging to categories 2 and 3. Uneven or
inadequate class populations arise when samples are unavailable at the acquisition stage, since
privacy concerns or restrictions make it difficult to collect samples for every class or category.
The most common reason for class imbalance is possibly the small number of expert professionals
available to acquire appropriate and correct samples. Other
reasons include the reliance on manual data collection for labeled datasets and the scarcity of samples
in particular classes. Both class imbalance and dataset size can have a tremendous
impact on the overall performance of CNNs, which form the backbone of deep neural networks [146].
In machine learning, however, classifiers such as XGBoost are known to perform well even in the presence
of class imbalance [147].
Some popular imbalanced datasets in computer vision are listed below.
(1) CI-MNIST (Correlated and Imbalanced MNIST) dataset [117] consists of imbalanced handwritten
digits; it is a variant of the MNIST dataset. MNIST is a popular image classification dataset of
grayscale images of handwritten digits (0-9), with 60,000 training images and 10,000 test images.
MNIST is not inherently imbalanced; researchers often create imbalanced versions by artificially
manipulating the number of samples present in each class.
(2) ISIC 2018 Melanoma Detection Challenge and Dataset [118], in which images of benign lesions
far outnumber those of malignant melanoma.
(3) Mammographic Masses (DDSM) consists of mammographic images in which benign masses have more
samples than malignant masses [119].
(4) PASCAL VOC [149] consists of annotated images for object detection, segmentation and
classification tasks, where some classes are more common than others because they appear in
more images.
(5) COVID-19 X-ray image classification [120] and COVID-19 lung CT image segmentation [121] are
imbalanced datasets having more images of COVID-19 abnormalities than normal
images.
(6) CelebA is an imbalanced celebrity image dataset in which a few classes have a disproportionate
number of samples in comparison to the majority classes [122]. RAP [123] is a highly imbalanced
dataset used for pedestrian attribute recognition with an imbalance ratio of 1:1800.
(7) Popular imbalanced diabetic retinopathy datasets are: Kaggle DRD Dataset [124], DDR Dataset [125]
and Indian Diabetic Retinopathy image (IDRiD) Dataset [126].
(8) The Intel and MobileODT cervical cancer screening dataset [127], available on Kaggle, is a smaller
imbalanced dataset consisting of images of different cervix types.
(9) BreakHis Histopathological dataset [128] and Breast-Histopathological-Images dataset [129] are two
very popular datasets of breast cancer histopathological images. In BreakHis, the
benign samples are fewer than the malignant samples, whereas in the
Breast-Histopathological-Images dataset the IDC -ve class is in the majority
in comparison to the IDC +ve class.
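As noted for MNIST in item (1), imbalanced variants of a balanced benchmark are often built by subsampling selected classes. The following is a minimal sketch of that idea, not the construction used for CI-MNIST or any surveyed work. The helper names `make_imbalanced` and `imbalance_ratio`, the toy label vector and the 5% retention fraction are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def make_imbalanced(labels, keep_fraction, seed=0):
    """Return sorted indices of a subsample in which class c keeps roughly
    keep_fraction.get(c, 1.0) of its original samples (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    kept = []
    for label, indices in by_class.items():
        fraction = keep_fraction.get(label, 1.0)
        # Keep at least one sample so that no class disappears entirely.
        n_keep = max(1, round(fraction * len(indices)))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy stand-in for a balanced 10-class label vector (100 samples per digit).
labels = [digit for digit in range(10) for _ in range(100)]

# Assumption: digits 0 and 1 are turned into minority classes at 5% of their size.
subset = make_imbalanced(labels, keep_fraction={0: 0.05, 1: 0.05})
subset_labels = [labels[i] for i in subset]

print(Counter(subset_labels))
print(imbalance_ratio(subset_labels))
```

In this toy setting the two chosen digits retain 5 samples each against 100 for the others, so the largest-to-smallest class ratio is 20. The same index list can be used to slice the image array of a real dataset.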
5.2 Performance metrics for evaluating imbalanced datasets
Standard performance evaluation metrics are commonly used for imbalanced datasets in
real-world challenging computer vision problems [130] [131]. Some of these performance evaluation
metrics are discussed as follows:
(1) Precision measures the proportion of true positive predictions among all positive predictions
made by the model.
(2) Recall measures the proportion of true positive predictions among all actual positive samples
in the dataset.
(3) F1-score is the harmonic mean of precision and recall, and is a good metric for imbalanced
datasets.
(4) Area under the precision-recall curve (AUPRC) measures the performance of the
model over a range of different classification thresholds, and is a good metric to use when the positive
class is rare.
(5) Classification accuracy is the most commonly used performance evaluation metric for
classification tasks. It is simply the percentage of test examples that are correctly classified by the model.
Accuracy should not be the only evaluation measure for an imbalanced dataset, as it can be
misleading: a model that always predicts the majority class attains high accuracy while missing
every minority sample.
(6) AUC-ROC (area under the receiver operating characteristic curve) is a measure of the model's
ability to distinguish between positive and negative examples. It is often used for binary
classification tasks.
(7) G-mean is the geometric mean of sensitivity (recall) and specificity, and is also a good metric
to use when the classes are imbalanced and both false positives and false negatives are important
to minimize.
(8) Training time: The time required to train a model can be an important factor in certain
situations, such as when working with large datasets or when model training needs to be completed
within a certain time frame.
(9) Inference time: The time required to make predictions with the model can also be an important
factor, particularly for applications where real-time performance is required.
(10) Matthews Correlation Coefficient (MCC) is a statistical measure for binary classification that
balances the impact of false negatives and false positives, which is useful in applications
involving errors with different costs.
(11) Cohen's kappa coefficient is a statistical measure of the level of agreement between two
annotators, corrected for the agreement expected by chance; for a classifier, the two annotators
are the predicted labels and the true labels.
(12) Index Balanced Accuracy (IBA) gives maximum weight to minority classes and penalizes
classifiers that perform well on the majority class but not on the minority class.
(13) Intersection over Union (IoU) and Dice score are the performance evaluation metrics for
segmentation and object detection. IoU measures the overlap between the ground-truth and predicted
regions, and is computed as the ratio of the area of the intersection between the two regions to
the area of their union. A few variants of IoU, such as frequency-weighted IoU, mean IoU and
per-class IoU, can be used for an imbalanced dataset. The Dice coefficient is twice the area of
overlap divided by the total number of pixels in both regions. Variants of the Dice score, such as
weighted Dice score and generalized Dice score, can be used for imbalanced datasets.
It is equally important to note that no single metric can provide a complete picture of a model's
performance, and it is often necessary to consider multiple metrics in order to get a comprehensive
understanding of a model's strengths and weaknesses. There are many different ways to compare the
performance of deep learning models, and the most appropriate method will depend on the specific
goals and requirements of the task at hand, as illustrated in Table 8.
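The count-based metrics discussed above can all be derived from the four entries of a binary confusion matrix. The sketch below uses their standard definitions, with IBA written in the unweighted form (1 + Dominance) * Gmean^2. The function name `imbalance_metrics`, the convention of returning 0 when a denominator vanishes, and the example counts are assumptions made for illustration and are not taken from the surveyed works.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Count-based metrics from a binary confusion matrix (positive = minority class)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumption: a ratio with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    g_mean = math.sqrt(recall * specificity)

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's kappa: observed agreement p_o against chance agreement p_e.
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0

    # Index Balanced Accuracy with Dominance = TPrate - TNrate.
    dominance = recall - specificity
    iba = (1 + dominance) * g_mean ** 2

    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "g_mean": g_mean,
            "mcc": mcc, "kappa": kappa, "iba": iba}


# Hypothetical test set: 10 positive (minority) and 990 negative samples.
always_negative = imbalance_metrics(tp=0, fp=0, fn=10, tn=990)
detector = imbalance_metrics(tp=8, fp=20, fn=2, tn=970)

for name, result in [("always negative", always_negative), ("detector", detector)]:
    print(name, {key: round(value, 3) for key, value in result.items()})
```

The first classifier never predicts the minority class: its accuracy is 0.99, yet its recall, F1-score, G-mean, MCC, kappa and IBA are all 0. The second classifier has a slightly lower accuracy but recovers 8 of the 10 minority samples, which only the imbalance-aware metrics reward.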
Table 8. Various performance metrics used for the evaluation of imbalanced datasets.
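Evaluation Parameters
Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
Precision: $\frac{TP}{TP + FP}$
Recall: $\frac{TP}{TP + FN}$
F1-Score: $\frac{2 \times (Precision \times Recall)}{Precision + Recall}$
Matthews Correlation Coefficient (MCC): $\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
Cohen Kappa: $\frac{p_0 - p_e}{1 - p_e}$, where $p_0$ is the observed agreement and $p_e$ is the agreement expected by chance
Geometric Mean: $TP_{rate} = \frac{TP}{TP + FN}$, $TN_{rate} = \frac{TN}{TN + FP}$, $G_{mean} = \sqrt{TP_{rate} \times TN_{rate}}$
Index Balanced Accuracy (IBA): $Dominance = TP_{rate} - TN_{rate}$, $IBA = (1 + Dominance) \times G_{mean}^{2}$
Intersection over Union (IoU): $\frac{\text{Area of Intersection}}{\text{Area of Union}}$
Dice Score: $\frac{2 \times \text{Area of Intersection}}{\text{Total area of both regions}}$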
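The two overlap metrics for segmentation can be illustrated on a pair of small binary masks. This is a minimal sketch in plain Python. The function name `iou_and_dice`, the toy masks and the convention that two empty masks score 1.0 are assumptions made for illustration.

```python
def iou_and_dice(pred, target):
    """IoU and Dice score for two binary masks given as equally sized nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]

    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    pred_area = sum(1 for p in flat_pred if p)
    target_area = sum(1 for t in flat_target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0, 1.0

    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


# Toy masks: each region covers 4 pixels and the two regions share 2 pixels.
pred = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0]]
target = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]

iou, dice = iou_and_dice(pred, target)
print(iou, dice)
```

Here the intersection is 2 pixels and the union is 6 pixels, so IoU is 1/3 and the Dice score is 0.5. For an imbalanced multi-class mask, the same function can be applied per class to a one-vs-rest binarization and the per-class scores averaged, which corresponds to the mean IoU variant mentioned in the text.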
6. Discussion and future insights on the trends associated with the class imbalance problem
This study has examined in depth the various data augmentation procedures, loss functions and
deep learning architectures widely used to tackle class imbalance in computer vision applications.
Different tasks of computer vision are covered in this study, including classification, segmentation,
object detection and recognition. An exclusive study on data augmentation using GANs, a trending
research area in synthetic image generation, is also included; this technique was found to enhance
the overall efficiency of the models to a great extent in various real-world applications. The
various deep learning frameworks that are customized to mitigate the effect of class imbalance are
presented, including vision transformers and ensemble transformers, which are a current research
trend. A scrutiny of all the Tables presented reveals some unique ideas in the form of iterative
learning, weighted ensembles, appropriate loss functions, smart combinations of classifiers and
data augmentation, novel architectural layers, etc. for
mitigating the effect of class imbalance. This work paves the way for the intelligent design of future
models that can generalize on imbalanced datasets in real-world applications. The limitations, open
issues, potential insights, future trends and directions that may shape the future of addressing class
imbalance in the computer vision domain are further highlighted in this section.
6.1 More challenging application areas and customized solutions
Computer vision models are increasingly being deployed in real-world scenarios such as
medical imaging and automated diagnostic procedures, autonomous driving and surveillance, multimodal
biometric recognition and many more, where a few classes may be rare (minority) in comparison to
others. Such imbalanced class representation degrades the performance on the
underrepresented classes. Therefore, new application domains in computer vision need to
be identified and customized imbalanced learning procedures need to be adopted to suit the classification
task at hand.
6.2 Active learning with hyperparameter tuning for few-shot learning
Further, the integration of active learning with the automated selection of models and appropriate
hyperparameters may play a more significant role in addressing class imbalance when the class
definition is ambiguous and the majority and minority samples overlap to a large extent. By
selectively choosing the most informative samples for annotation, active learning can help to mitigate the
impact of class imbalance by focusing on underrepresented or more difficult samples, which
can improve the overall performance of models while reducing the annotation effort. This would help to
address the cases of few-shot or zero-shot learning, which deal with test samples having new class
definitions; semi-supervised techniques are found to be more successful for this case than supervised
learning [172].
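The idea of choosing the most informative samples for annotation can be sketched with entropy-based uncertainty sampling, which is one common selection criterion and not necessarily the one used in the surveyed works. The function names, the toy probability vectors and the annotation budget below are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predicted_probs, budget):
    """Indices of the `budget` unlabeled samples with the highest predictive entropy."""
    ranked = sorted(range(len(predicted_probs)),
                    key=lambda i: entropy(predicted_probs[i]),
                    reverse=True)
    return ranked[:budget]


# Hypothetical model outputs on four unlabeled samples (binary task).
pool = [
    [0.98, 0.02],  # confident prediction
    [0.55, 0.45],  # uncertain
    [0.70, 0.30],
    [0.50, 0.50],  # maximally uncertain
]
chosen = select_for_annotation(pool, budget=2)
print(chosen)  # [3, 1]
```

Samples near the decision boundary, where majority and minority classes overlap, receive the highest entropy and are therefore sent to the annotator first.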
6.3 Generalizability of transfer learning to unseen classes
Transfer learning and domain adaptation, wherein pre-trained networks are fine-tuned on target
domains, remain a crucial factor in handling class imbalance. The transfer of knowledge
should be such that the fine-tuned model is able to identify new unseen classes. In the coming
years, more specialized approaches, packages and open-source frameworks are expected to be designed
and developed to tackle class imbalance in computer vision; the overall goal will
be to improve the performance of models and make them generalize better when learning
representations from the minority classes in real-world applications.
6.4 Interpretability and explainability of learning frameworks
Modifying existing learning algorithms to make them more explainable and interpretable is
ongoing research. Explainable deep learning is still not fully achievable, and the problem is further
complicated by class imbalance. As AI technologies become more pervasive, ethical considerations
surrounding class imbalance in computer vision will gain significance. Bias and fairness concerns in
algorithmic decision-making may drive the development of more comprehensive frameworks for
addressing class imbalance to ensure equitable outcomes across various demographic groups and
applications.
6.5 Effect of adversarial learning on the minority classes
Adding adversarial samples (such as in GAN) can exacerbate the challenges of addressing class
imbalance. Adversarial samples are deliberately crafted input patterns that can be created for any class,
including both majority and minority classes [157]. However, the impact of adversarial samples can be
particularly severe for minority classes in imbalanced datasets. When a computer vision model is trained
on imbalanced data, it may already have difficulty learning representations for the minority classes due to
the limited number of examples available. Adversarial samples can further aggravate this problem by
introducing additional sources of misclassification, making it even more challenging for the model to
accurately classify instances from the minority classes. To tackle these vulnerabilities, researchers have
been actively investigating adversarial samples and working on methods to mitigate their impact. Overall,
these efforts will help in the creation of more equitable and reliable computer vision models in terms of
class imbalance and adversarial samples.
6.6 Limitations of current research
The limitations of contemporary research on imbalanced learning are summarized below.
(i) Despite the advancements made in the field of machine learning and deep learning, the
systems are still data-driven. Changing the dataset brings about large changes in
performance. Therefore, the current crop of balancing strategies does not have
generalization ability.
(ii) Data augmentation may in some cases (when minority samples are too few) lead to repetitive
samples, which may in turn lead to overfitting of the model.
(iii) Designing loss functions for cost-sensitive learning is effective; however, they are usually
customized for the task at hand and the imbalance ratio of the dataset.
(iv) There are relatively few works on extremely imbalanced datasets. The handful of
samples in the minority class would bring down the system performance.
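Limitation (iii) can be made concrete with a simple cost-sensitive loss whose class weights are derived from the class frequencies of the training set, so the loss has to be re-derived whenever the imbalance ratio changes. The following is a minimal sketch using inverse-frequency weights and a weighted cross-entropy. It is one common scheme and not the loss of any particular surveyed work; the function names and the toy 90:10 label distribution are assumptions made for illustration.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight of class c = N / (K * n_c): rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the sum of the weights."""
    weighted_sum = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return weighted_sum / sum(weights[y] for y in labels)


# Hypothetical binary training set with a 90:10 class distribution.
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)

# A model biased towards the majority class: it outputs p(class 0) = 0.9 everywhere.
probs = [[0.9, 0.1] for _ in labels]

uniform = {0: 1.0, 1: 1.0}
plain_loss = weighted_cross_entropy(probs, labels, uniform)
cost_sensitive_loss = weighted_cross_entropy(probs, labels, weights)
print(weights)
print(plain_loss, cost_sensitive_loss)
```

With these weights both classes contribute equally to the loss, so the majority-biased model is penalized much more heavily than under the unweighted cross-entropy. A different dataset with another imbalance ratio yields different weights, which is the dependence that limitation (iii) points out.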
7. Conclusion
This survey includes a comprehensive analysis of machine learning and deep learning methods for
addressing the challenges of class imbalance in computer vision, with an increased focus on deep learning
solutions that form the state-of-the-art today. Class imbalance is prevalent in real-world image datasets
such as biomedical datasets, where the healthy or control samples heavily outnumber the
diseased samples. Addressing the class imbalance issue is therefore a crucial step towards improving
performance. Very few surveys related to data augmentation, loss functions and deep learning
frameworks in the context of class imbalance have been conducted in a structured manner.
From this study, the two most popular solutions for class imbalance in deep learning frameworks
in numerous applications appear to be (i) pre-trained CNNs with data augmentation, which is a form of
data-level manipulation, and (ii) cost-sensitive learning for both CNNs trained from scratch and
pre-trained CNNs, which aims to improve the learning process without manipulating the input data.
Fine-tuning with data augmentation is one of the best available solutions to prevent overfitting when
the dataset is small in size. Most researchers have experimented with different combinations of loss
functions and data augmentation. As per the findings, the choice of the loss function, system
components, architectural blocks or layers, and the balancing strategies applied at the data level play
a crucial role in how effectively the system is able to combat class imbalance. The study has also
reviewed some deep neural network frameworks that have been designed keeping class imbalance in mind,
though such works are rarer to find. The pre-trained network ResNet50 was found to mitigate class
imbalance better than all other pre-trained networks even in the absence of data augmentation or
cost-sensitive learning, emphasizing the role of residual connections in improving the classification of
minority classes. The YOLO object detection models were also found helpful for imbalanced learning
since these models identify the points of interest in an image. When integrated with data augmentation
techniques, this method helps to mitigate class imbalance, as shown by the surveyed works.
This study would be useful for the creation of new deep learning frameworks that can learn
effectively from imbalanced datasets, and may also inspire the development of hybrid machine learning
and deep learning approaches involving manipulations at both the data level and the algorithm level.
Overall, these techniques can be used in various domains or applications to alleviate the problems or
challenges associated with class-imbalanced datasets. The challenges and constraints faced when
applying these methods to real-world scenarios are also discussed. This survey highlights trends,
challenges and future directions in the field of class imbalance in computer vision and deep learning,
which will enable readers to stay updated with the latest research developments and challenges involved
in this evolving and promising domain, which might impact real-world applications. The paper reviews a
large variety of imbalanced datasets used in various applications of computer vision, and multiple
complementary performance metrics such as F1-score, IBA, MCC and Geometric mean, which would help to
identify the preferred deep/non-deep learning methodology for various computer vision applications
having class imbalance situations. Lastly, research that evaluates the use of deep learning to address
class imbalance in non-image data is limited and needs to be expanded upon in future works. The current
study has limitations as it is focused on the computer vision domain. Future studies should integrate the
interdisciplinary works of various domains linking computer vision with other fields of Artificial
intelligence such as Internet of Things (IoT), Robotics, and Natural language processing (NLP),