Procedia Computer Science 245 (2024) 282–289
1877-0509 © 2024 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the 9th International Conference on Computer Science
and Computational Intelligence 2024
10.1016/j.procs.2024.10.253
9th International Conference on Computer Science and Computational Intelligence 2024 (ICCSCI 2024)
The Impact of Augmentation and SMOTE Implementation on
the Classification Models Performance: A Case Study on
Student Academic Performance Dataset
Albert Verasius Dian Sano a,*, Faqir M. Bhatti b, Eka Miranda c, Mediana Aryuni c, Alfi Yusrotis Zakiyyah d, Charles Bernando c
a Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta 11480, Indonesia
b Riphah Institute of Computing and Applied Sciences, Riphah International University, Raiwind, Lahore, Pakistan
c Information Systems Department, School of Information Systems, Bina Nusantara University, Jakarta 11480, Indonesia
d Mathematics Department, School of Computer Science, Bina Nusantara University, Jakarta 11480, Indonesia
* Corresponding author. E-mail address: avds@binus.ac.id
Abstract
This study aims to determine the impact of data augmentation and the Synthetic Minority Over-sampling Technique (SMOTE) on the performance of classification models trained on a small and imbalanced dataset of student academic performance. The study design involved a comprehensive experiment comparing four scenarios: 1) classification models without data augmentation or SMOTE, 2) models with data augmentation, 3) models with SMOTE, and 4) models with both data augmentation and SMOTE. Each model's performance was measured using standard evaluation metrics: accuracy, precision, recall, and F1-score.
To test the validity of the results, three classification algorithms were implemented and evaluated for each scenario, namely Random Forest, XGBoost, and AdaBoost. The findings highlight the significant impact of data augmentation and SMOTE on the improvement of classification model performance, particularly on a small and imbalanced dataset. Results showed that applying both techniques together produced the largest increase in the evaluation metrics compared to applying either technique separately.
The originality of this study lies in its comprehensive comparison of the effectiveness of data augmentation and SMOTE, as well as its use of a student academic performance dataset, a real-world case in the context of artificial intelligence. These findings give valuable insight to researchers and practitioners in choosing appropriate techniques to handle small and
imbalanced class datasets. This study is expected to make an important contribution to the more effective development of
classification methodology in various domains.
© 2024 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the 9th International Conference on Computer Science and Computational Intelligence 2024
Keywords: Data augmentation; Imbalanced class; Small dataset; SMOTE
1. Introduction
In the current digital era, the use of Artificial Intelligence (AI) techniques in data analysis is growing rapidly across many fields [1][2][3], including classification and prediction. In this context, handling a small and imbalanced class dataset is a significant challenge. In this study, we explore the impact of implementing data augmentation and the Synthetic Minority Over-sampling Technique (SMOTE) on the performance of classification models using a student academic performance dataset.
The main objective of this study is to answer the following research questions: 1) What is the impact of implementing data augmentation and SMOTE on the performance of classification models? 2) Does the combined use of both techniques bring about a more significant performance increase than using either technique separately? Both questions are posed in the context of a small and imbalanced student academic performance dataset. In addition to reflecting the main objective of this study, the research questions also direct the subsequent discussion of applying data augmentation and SMOTE to small and imbalanced class datasets.
Small and imbalanced class datasets are a common problem in many fields, particularly in medicine and finance, as well as in academic research [4][5][6]. Handling this problem appropriately has important implications for the accuracy and reliability of classification models, which can in turn improve decision making and AI applications in various fields.
This study is expected to make an important contribution by presenting a better understanding of the performance differences between using data augmentation alone, SMOTE alone, and both techniques combined when handling a small and imbalanced dataset of student academic performance.
The background of this study is the need to appropriately handle small and imbalanced datasets and to enhance the performance of classification models trained on them. Along with the ever-growing interest in applying AI techniques in education, this study has direct relevance to the needs and practice of this field.
This paper is organized as follows. Following the introduction, we describe our experimental design and methods, including the dataset and the implemented techniques. We then present the results and discussion covering the four experimental scenarios, followed by a discussion of the findings. The paper closes with conclusions and directions for future research.
The topic of this study is motivated by the need for an effective approach to handling small and imbalanced student academic performance datasets. By understanding the impact of these techniques on performance, we aim to provide practical guidance to researchers and practitioners in selecting appropriate approaches for their data analysis.
2. Methods
2.1. Data collection
The data were collected through an online questionnaire administered from January 6, 2024, to February 28, 2024, and answered by 252 students from various departments, such as information systems, management, law, and marketing. The dataset has sixteen attributes, consisting of fifteen independent attributes and one class attribute, each derived from a statement in the questionnaire. The learning method addressed by the questionnaire is Case-Based Learning (CBL). The fifteen independent attributes are listed in Table 1. The 16th statement or attribute,
which is also the class attribute, is "How are the students' grades during the course with CBL learning?". The possible values of this class attribute are "Stable" or "Increase". In other words, this value represents the students' academic performance under the CBL method.
Table 1. Dataset attributes.
Questionnaire statement/attribute | Mean (1-5) | Standard Dev
CBL improves your understanding of the material studied | 4.37 | 0.676
CBL facilitates student self-learning | 4.3 | 0.735
Hypotheses can be formulated from CBL for a specific problem | 4.29 | 0.699
In CBL, incorporating prior knowledge into the current problem's context is feasible | 4.33 | 0.72
In CBL, the information collected can be assessed in relation to the problem | 4.31 | 0.694
CBL learning methods promote the enhancement of decision-making skills | 4.37 | 0.701
CBL enhances information processing skills | 4.44 | 0.651
CBL makes students learn to critically analyze information submitted by other members for discussion | 4.45 | 0.669
CBL trains you to communicate ideas to the group effectively | 4.3 | 0.762
CBL provides an opportunity to improve leadership skills | 4.15 | 0.878
CBL enables you to convey your thoughts clearly in group discussions | 4.25 | 0.792
CBL enables students to engage without needing constant guidance | 3.95 | 0.934
In CBL, you develop an appreciation for the viewpoints of others within the group | 4.44 | 0.698
Students can recognize their ethical and moral responsibilities towards fellow group members | 4.24 | 0.796
Participating in CBL groups helps you recognize personal limitations | 4.21 | 0.78
2.2. Data pre-processing
There are three main activities in this stage:
1. Likert-scale conversion. Responses to each statement were converted using a Likert scale ranging from one to five, where one indicates "strongly disagree" and five indicates "strongly agree". Likert conversion of questionnaire responses is common practice in many domains [7][8]. Mean and Standard Deviation (SD) are reported in Table 1 because they provide a summary of the responses to the questionnaire. The use of the mean and SD for Likert-scale data is supported by several studies and statistical practices, despite the common view that Likert-scale data are ordinal [9][10]. The mean indicates the average response to each statement; a higher mean describes greater agreement or a more positive response. The standard deviation indicates the variability of the responses; a higher standard deviation means more variability in the responses. Taking the first statement as an example, "CBL improves your understanding of the material studied" has a mean of 4.37 and an SD of 0.676. The high mean indicates that respondents generally
agree or strongly agree that CBL improves their understanding. The relatively low SD suggests that most
respondents have similar opinions on that statement.
2. Data augmentation. Data augmentation, in short, is a technique for creating synthetic samples from an existing dataset [11][12][13]. It is generally used with classification models. The data augmentation technique applied in this study is random oversampling; a brief code sketch illustrating both oversampling techniques is given after this list.
3. SMOTE implementation. SMOTE, briefly, is the process of oversampling the minority class of a dataset by constructing synthetic examples [14][15]. Several extended variants of SMOTE have been developed for particular applications, such as DeepSMOTE, SASMOTE, and GSMOTE [14][16][17]. In this study, we implemented the basic SMOTE technique, which works based on k-NN with the default k = 5, using a Python library. The key formula for generating a new synthetic sample is shown in the equation below.

x_{new} = x_i + \lambda \times (x_{neigh} - x_i), \quad \lambda \in [0, 1] \qquad (1)
The SMOTE algorithm is briefly described as follows:
1. Choose a minority sample. For example, pick a minority-class sample x_i.
2. Find the nearest neighbors. Find the k nearest neighbors of x_i and select one of them, x_neigh.
3. Create a new sample. Calculate the difference between x_neigh and x_i, multiply the difference by a random number between 0 and 1, and add the result to x_i, as shown in equation (1).
4. Repeat steps 1 to 3 to create as many synthetic samples as needed.
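As an illustration, the following minimal Python sketch shows how both oversampling steps could be applied and how equation (1) generates a single synthetic point. It assumes the imbalanced-learn (imblearn) library; the variable names, class counts, and random data are illustrative placeholders rather than the actual study data.

```python
# Illustrative sketch: random oversampling (augmentation) and SMOTE with imbalanced-learn.
# The feature matrix and labels below are placeholders, not the real questionnaire data.
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE

rng = np.random.default_rng(42)
X = rng.integers(1, 6, size=(252, 15))      # 252 responses x 15 Likert-scale attributes
y = np.array([0] * 180 + [1] * 72)          # assumed labels: 0 = "Stable", 1 = "Increase"

# Random oversampling: duplicate minority-class rows until the classes are balanced.
X_aug, y_aug = RandomOverSampler(random_state=42).fit_resample(X, y)

# SMOTE: interpolate between a minority sample and one of its k = 5 nearest neighbours.
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)

# Equation (1) by hand for a single synthetic point:
x_i, x_neigh = X[0].astype(float), X[1].astype(float)
lam = rng.random()                          # random number in [0, 1]
x_new = x_i + lam * (x_neigh - x_i)         # new synthetic sample
```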
We skip elaborating on other standard pre-processing steps such as data cleaning, as the data were already clean, with no missing values and no redundancy.
2.3. Data splitting
We used the hold-out method, splitting the dataset into training and testing sets at a ratio of 80:20, respectively, as this is the most commonly practiced splitting ratio. This approach is widely accepted and has been supported by numerous studies and reviews as a standard practice, ensuring sufficient data for training while reserving enough for testing to validate model performance. It is also commonly employed for its simplicity and effectiveness in preventing overfitting and ensuring robust model evaluation [18][19]. In our experiment we also tried k-fold cross-validation with k = 10; both splitting techniques are widely practiced because of their simplicity and effectiveness in evaluating model performance [19]. However, since the two techniques produced nearly identical evaluation results, we adopted the hold-out method and omitted k-fold cross-validation to avoid multiplying the number of scenarios and to keep the focus on the impact of data augmentation and SMOTE.
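A minimal sketch of the 80:20 hold-out split, assuming scikit-learn, is given below; the stratification option is our assumption and is not stated in the study.

```python
# Illustrative sketch of the 80:20 hold-out split with scikit-learn.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,     # 20% of the data reserved for testing
    stratify=y,         # assumption: preserve the class ratio in both splits
    random_state=42,
)
```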
2.4. Classification model implementation
We chose three commonly practiced classification algorithms for this experiment: Random Forest, XGBoost, and AdaBoost. These algorithms were chosen because they are ensemble methods, meaning that they combine multiple models that work together to improve overall performance. The three algorithms are briefly described as follows:
1. Random Forest. This is an ensemble algorithm that combines many decision trees to increase accuracy and reduce overfitting. It uses bootstrap sampling to generate subsets of the training data. Each tree gives a prediction, and the final prediction is determined by the majority vote of all trees.
2. XGBoost. This algorithm builds models iteratively, where each new model improves on the previous ones by correcting their errors. It uses regularization techniques to prevent overfitting and improve generalization.
3. AdaBoost. This is a simple boosting algorithm that combines several weak learners, typically shallow decision trees, into a robust model. It gives more weight to samples misclassified in the previous iteration, thereby focusing on errors, and adaptively adjusts the model to improve classification performance.
These three algorithms are widely used and have proven very effective in machine learning [20][21][22]. Each algorithm was evaluated under four scenarios: 1) without applying data augmentation or SMOTE, 2) with data augmentation applied, 3) with SMOTE applied, and 4) with both data augmentation and SMOTE applied.
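The sketch below shows how the three classifiers could be instantiated and trained. It assumes scikit-learn and the xgboost Python package, uses default hyperparameters (the study does not report its exact settings), and reuses the train/test split from the sketch in Section 2.3.

```python
# Illustrative sketch: the three ensemble classifiers used in the experiment.
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

# The same loop is repeated for each of the four training-set variants
# (baseline, augmented, SMOTE-resampled, augmented + SMOTE).
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)          # swap in the resampled training data per scenario
    predictions[name] = model.predict(X_test)
```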
2.5. Model performance evaluation
We use the standard evaluation metrics for classification in this step, i.e., accuracy, precision, recall, and F1-score. The equations used to compute these metrics are listed below.
Accuracy = +
+++ (2)
TP stands for True Positive. It is the count of positive values identified accurately as positive.
TN (True Negative) represents the count of negative values identified accurately as negative.
FP (False Positive) represents the count of negative values identified inaccurately as positive.
FN (False Negative) represents the count of positive values identified inaccurately as negative.
Precision = 
 + (3)
Recall = 
 + (4)
F1-score = 2 ∗ ()
 +  (5)
3. Results and Discussion
As stated earlier, we had four scenarios in this experiment, and each scenario was evaluated using the three classification algorithms.
Scenario 1: This scenario used the original dataset, which is small and imbalanced, without applying data augmentation or SMOTE. The evaluation findings are shown in Table 2.
Table 2. Findings of classification model performance using scenario 1.
Algorithm | Accuracy | Precision | Recall | F1-score
Random Forest | 46.939% | 48.105% | 46.939% | 46.806%
XGBoost | 53.061% | 54.736% | 53.061% | 52.787%
AdaBoost | 55.102% | 56.560% | 55.102% | 54.990%
The results in Table 2 show relatively low scores for accuracy, precision, recall, and F1-score, which is a consequence of the small and imbalanced dataset. AdaBoost outperforms the other two algorithms. Overall, however, the results are relatively low, ranging between 46.939% and 56.560%. We use these results as the baseline for measuring the impact of data augmentation and/or SMOTE on the dataset.
Scenario 2: This scenario applied data augmentation to the dataset without SMOTE. The evaluation findings are shown in Table 3.
Table 3. Findings of classification model performance using scenario 2.
Algorithm | Accuracy | Precision | Recall | F1-score
Random Forest | 89.000% | 89.027% | 89.000% | 89.004%
XGBoost | 89.000% | 89.027% | 89.000% | 89.004%
AdaBoost | 72.000% | 72.361% | 72.000% | 72.000%
Examining Table 3 and comparing it to the baseline in Table 2, we see a significant impact from the use of data augmentation. The scores are considerably higher, ranging between 72.000% and 89.027%. In this scenario, AdaBoost underperforms the other two algorithms, while Random Forest and XGBoost produce identical results.
Scenario 3: This scenario applied SMOTE to the dataset without data augmentation. The evaluation findings are shown in Table 4.
Table 4. Findings of classification model performance using scenario 3.
Algorithm | Accuracy | Precision | Recall | F1-score
Random Forest | 51.020% | 53.459% | 51.020% | 49.990%
XGBoost | 48.980% | 50.890% | 48.980% | 48.209%
AdaBoost | 57.143% | 59.039% | 57.143% | 56.893%
Examining Table 4 and comparing it to the baseline in Table 2, we find a small increase in the evaluation scores for Random Forest and AdaBoost and a small decrease for XGBoost. In general, the scores for the three algorithms are relatively low and similar to the baseline, ranging between 48.209% and 59.039%. We can infer that the impact of SMOTE alone is small. Although it yields a small improvement for Random Forest and AdaBoost, it is worth noting that it does not always have a positive impact, as shown by the XGBoost results.
Scenario 4: This scenario applied both data augmentation and SMOTE to the dataset. The evaluation findings are shown in Table 5.
Table 5. Findings of classification model performance using scenario 4.
Algorithm | Accuracy | Precision | Recall | F1-score
Random Forest | 89.000% | 89.004% | 89.000% | 88.993%
XGBoost | 90.500% | 90.791% | 90.500% | 90.459%
AdaBoost | 73.000% | 73.097% | 73.000% | 72.863%
Examining Table 5 and comparing it to the baseline in Table 2, we find a significant increase in the evaluation scores. Comparing Table 5 to Table 3, however, shows only a small increase in the scores for XGBoost and AdaBoost, while Random Forest produces essentially the same results.
From these four scenarios we find that applying data augmentation and SMOTE together on the small and imbalanced dataset has a significant impact on the classification evaluation scores in terms of accuracy, precision, recall, and F1-score. In detail, data augmentation contributes most of the improvement, while SMOTE contributes very little.
4. Conclusion
The original small and imbalanced dataset produces relatively poor performance in the evaluated classification models, with scores ranging from about 46.939% to 56.560% for accuracy, precision, recall, and F1-score.
Data augmentation has a very significant impact on model performance over the small and imbalanced dataset. SMOTE, on the other hand, does not have as large an impact as data augmentation.
The combined use of data augmentation and SMOTE gives the best results; however, the most significant contribution comes from data augmentation. We conclude that data augmentation is
much more effective than SMOTE at improving classification model performance in the context of a small and imbalanced dataset. For future research, considering the relatively small impact of SMOTE in this experiment, SMOTE variants such as Borderline-SMOTE and SMOTE-ENN are worth experimenting with; a brief sketch of how they could be applied is given below.
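The sketch below illustrates how these variants could be substituted for basic SMOTE, assuming the imbalanced-learn library; the parameters shown are library defaults, not tuned settings.

```python
# Illustrative sketch: SMOTE variants suggested for future work (imbalanced-learn).
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.combine import SMOTEENN

X_bl, y_bl = BorderlineSMOTE(k_neighbors=5, random_state=42).fit_resample(X_train, y_train)
X_enn, y_enn = SMOTEENN(random_state=42).fit_resample(X_train, y_train)
```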
Acknowledgements
This work is supported by Bina Nusantara University as part of Bina Nusantara University's BINUS International Research Applied grant entitled "Learning Analytics Tools Based on Machine Learning in Predicting Student Academic Performance in Case-Based Learning (CBL)", contract number 069/VRRTT/III/2024, contract date March 18, 2024.
References
[1] B. Burger, D. K. Kanbach, S. Kraus, M. Breier, and V. Corvello, "On the use of AI-based tools like ChatGPT to support management research," Eur. J. Innov. Manag., vol. 26, no. 7, pp. 233–241, 2023, doi: 10.1108/EJIM-02-2023-0156.
[2] H. Crompton and D. Burke, "Artificial intelligence in higher education: the state of the field," Int. J. Educ. Technol. High. Educ., vol. 20, no. 1, 2023, doi: 10.1186/s41239-023-00392-8.
[3] F. Ramzan, C. Sartori, S. Consoli, and D. Reforgiato Recupero, "Generative Adversarial Networks for Synthetic Data Generation in Finance: Evaluating Statistical Similarities and Quality Assessment," AI, vol. 5, no. 2, pp. 667–685, May 2024, doi: 10.3390/ai5020035.
[4] J. M. Johnson and T. M. Khoshgoftaar, "Survey on deep learning with class imbalance," J. Big Data, vol. 6, no. 1, 2019, doi: 10.1186/s40537-019-0192-5.
[5] L. Dube and T. Verster, "Enhancing classification performance in imbalanced datasets: A comparative analysis of machine learning models," Data Sci. Financ. Econ., vol. 3, no. 4, pp. 354–379, 2023.
[6] M. S. Kraiem, F. Sánchez-Hernández, and M. N. Moreno-García, "Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties. An approach based on association models," Appl. Sci., vol. 11, no. 18, 2021, doi: 10.3390/app11188546.
[7] J. C. Westland, "Information loss and bias in likert survey responses," PLoS One, vol. 17, no. 7, pp. 1–17, 2022, doi: 10.1371/journal.pone.0271949.
[8] A. T. Jebb, V. Ng, and L. Tay, "A Review of Key Likert Scale Development Advances: 1995–2019," Front. Psychol., vol. 12, pp. 1–14, 2021, doi: 10.3389/fpsyg.2021.637547.
[9] A. D. Averin, A. A. Yakushev, O. A. Maloshitskaya, S. A. Surby, O. I. Koifman, and I. P. Beletskaya, "Synthesis of porphyrin-diazacrown ether and porphyrin-cryptand conjugates for fluorescence detection of copper(II) ions," Russ. Chem. Bull., vol. 66, no. 8, pp. 1456–1466, 2017, doi: 10.1007/s11172-017-1908-3.
[10] J. C. F. de Winter and D. Dodou, "Five-point likert items: T test versus Mann-Whitney-Wilcoxon," Pract. Assessment, Res. Eval., vol. 15, no. 11, 2010, doi: 10.7275/bj1p-ts64.
[11] C. Shorten, T. M. Khoshgoftaar, and B. Furht, "Text Data Augmentation for Deep Learning," J. Big Data, vol. 8, no. 1, 2021, doi: 10.1186/s40537-021-00492-0.
[12] C. Shorten and T. M. Khoshgoftaar, "A survey on Image Data Augmentation for Deep Learning," J. Big Data, vol. 6, no. 1, 2019, doi: 10.1186/s40537-019-0197-0.
[13] A. Mumuni and F. Mumuni, "Data augmentation: A comprehensive survey of modern approaches," Array, vol. 16, p. 100258, 2022, doi: 10.1016/j.array.2022.100258.
[14] T. Kosolwattana, C. Liu, R. Hu, S. Han, H. Chen, and Y. Lin, "A self-inspected adaptive SMOTE algorithm (SASMOTE) for highly imbalanced data classification in healthcare," BioData Min., vol. 16, no. 1, pp. 1–14, 2023, doi: 10.1186/s13040-023-00330-4.
[15] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002. [Online]. Available: https://arxiv.org/pdf/1106.1813.pdf
[16] D. Dablain, B. Krawczyk, and N. V. Chawla, "DeepSMOTE: Fusing Deep Learning and SMOTE for Imbalanced Data," IEEE Trans. Neural Networks Learn. Syst., vol. 34, no. 9, pp. 6390–6404, 2023, doi: 10.1109/TNNLS.2021.3136503.
[17] J. Fonseca and F. Bacao, "Geometric SMOTE for imbalanced datasets with nominal and continuous features," Expert Syst. Appl., vol. 234, p. 121053, 2023, doi: 10.1016/j.eswa.2023.121053.
[18] L. Alzubaidi et al., "Review of deep learning: concepts, CNN architectures, challenges, applications, future directions," J. Big Data, vol. 8, no. 1, 2021.
[19] E. Kee, J. J. Chong, Z. J. Choong, and M. Lau, "A Comparative Analysis of Cross-Validation Techniques for a Smart and Lean Pick-and-Place Solution with Deep Learning," Electronics, vol. 12, no. 11, 2023, doi: 10.3390/electronics12112371.
[20] Z. Jin, J. Shang, Q. Zhu, C. Ling, W. Xie, and B. Qiang, "RFRSF: Employee Turnover Prediction Based on Random Forests and Survival Analysis," Lect. Notes Comput. Sci., vol. 12343 LNCS, pp. 503–515, 2020, doi: 10.1007/978-3-030-62008-0_35.
[21] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., pp. 785–794, 2016, doi: 10.1145/2939672.2939785.
[22] Y. Freund and R. E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997, doi: 10.1006/jcss.1997.1504.