SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: An Enhancement Strategy to Handle Imbalance in Data Level
Fajri Koto
Faculty of Computer Science
University of Indonesia
Depok, Jawa Barat, Indonesia 16423
Email: fajri91@ui.ac.id
Abstract—An imbalanced dataset is often an obstacle in supervised learning. Imbalance is the case in which the examples in the training data belonging to one class heavily outnumber the examples in the other class. Applying a classifier to such a dataset typically results in the classifier failing to learn the minority class. The Synthetic Minority Oversampling Technique (SMOTE) is a well known oversampling method that tackles imbalance at the data level. SMOTE creates synthetic examples between two close vectors that lie together. Our study considers three improvements of SMOTE, which we call SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE, in order to cover cases that are not handled by SMOTE. To investigate the proposed methods, our experiments were conducted on eighteen different datasets. The results show that our proposed SMOTE variants give some improvements in B-Acc and F1-Score.
I. INTRODUCTION
To achieve optimum performance, a classifier requires a balanced distribution of the dataset. However, the case in which the examples in the training data belonging to one class heavily outnumber the examples in the other class is often faced in the real world. It is commonly caused by the difficulty or expense of constructing datasets, for instance biomedical data such as rare diseases and abnormal prognoses, or data obtained from very difficult or costly experiments. Applying a classifier to an imbalanced dataset causes the classifier to fail to learn the minority class because of majority-class generalization, whereas in fact the minority class is often the important subject of investigation.
Some attempts have been made to tackle the imbalance. According to [1], the approaches can be roughly divided into two categories: 1) data-level re-balancing and 2) modified learning algorithms. Two well known data-level re-balancing techniques, Random Under Sampling (RUS) and Random Over Sampling (ROS), have been introduced as standard non-heuristic re-sampling techniques. RUS randomly eliminates majority-class examples, while ROS achieves balance by generating random replications of minority-class examples [2]. In the second category, [3] and [4] modify the decision tree and the original SVM respectively to increase their sensitivity to the minority class. Ensemble learning approaches were also introduced in AdaBoost [6] and AdaCost [7].
Further studies related to under- and over-sampling have also been carried out. [8] states that under-sampling can establish a reasonable baseline for algorithmic comparison on imbalanced-dataset problems, and argues that it is a better approach than over-sampling. However, we consider a certain condition, an extremely imbalanced case, in which under-sampling may discard much potentially useful data. For instance, with a dataset of 1000 examples of which 990 are positive and 10 are negative, under-sampling no longer beats over-sampling.
We believe a further study of over-sampling is necessary. In this paper we address it by improving a well known over-sampling technique, the Synthetic Minority Oversampling Technique (SMOTE) [10]. SMOTE generates new examples of the minority class by interpolating between two examples of the minority class that lie together. Thus, the over-fitting problem, which causes the decision boundaries of the minority class to spread further into the majority-class space, can be avoided. In a further study, Borderline-SMOTE was also introduced [9]; technically, it aims to emphasize the boundary between the majority- and minority-class spaces.
In this study, we consider three cases which have not been covered by SMOTE and Borderline-SMOTE. First, when the distribution of minority examples is very dense and close together, no new variation of minority examples is achieved, because the interpolation is done only along the line connecting two minority examples. Second, the Euclidean measure used to find the nearest neighbor considers only the distance between two vectors, whereas the similarity between two vectors can also be characterized by their angle or direction. Third, to produce new examples, SMOTE synthesizes all attributes of the dataset, although not all of them represent the boundary between the minority- and majority-class spaces. We argue that synthesizing only certain attributes which are considered significant
can yield better examples.
In the next sections further explanation of these cases is given. First, in Section 2 we provide an overview of SMOTE. Our proposed methods, including further explanation and technical details, are discussed in Section 3. The experimental setup and results are given in Section 4, and finally conclusions are drawn in Section 5.
II. THE OVERVIEW OF SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE
SMOTE is an over-sampling approach in which the minority class is over-sampled by creating "synthetic" examples rather than by over-sampling with replacement. The minority class is over-sampled by taking each minority-class sample and introducing synthetic examples along the line segments joining any/all of the k minority-class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen. [10] proposed SMOTE using the Euclidean distance (Eq. 1) to find the closest neighbors of minority examples.
d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2}    (1)
In general, SMOTE is applied based on the procedure below:
• Determine the number of neighbors k and the amount of SMOTE N.
• Randomly select N minority samples and put them into A.
• For each element a_i in A, find its k nearest neighbors by calculating the Euclidean distance, then randomly select one nearest neighbor v and compute a synthetic example between a_i and v.
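The following is a minimal Python/NumPy sketch of this procedure; the function name, its parameters, and the brute-force neighbor search are illustrative choices of ours rather than details given in the paper:

```python
import numpy as np

def smote(minority, n_synthetic, k=5, seed=None):
    """Create n_synthetic samples by interpolating between minority
    examples and a randomly chosen one of their k nearest neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))                # pick a minority sample a_i at random
        a = minority[i]
        dist = np.linalg.norm(minority - a, axis=1)    # Euclidean distance, Eq. (1)
        dist[i] = np.inf                               # exclude a_i itself
        neighbors = np.argsort(dist)[:k]               # its k nearest minority neighbors
        v = minority[rng.choice(neighbors)]            # choose one neighbor at random
        gap = rng.random()                             # interpolate on the segment a_i -> v
        synthetic.append(a + gap * (v - a))
    return np.array(synthetic)
```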
III. THE PROPOSED SMOTE ENHANCEMENT
In this section, we introduce SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE as strategies to enhance SMOTE, in order to cover the cases described in the previous section.
A. SMOTE-Out
Applying SMOTE to minority examples with a dense distribution may cause SMOTE to create meaningless synthetic examples. Assume in Fig. 1 that the minority examples are represented by cross marks; the dashed triangle then marks the lines on which synthetic examples are created when SMOTE is applied to a minority example. A problem may arise if two vectors lie very close together, resulting in a very short line between them. We propose SMOTE-Out as a strategy to handle this by creating synthetic examples outside the dashed lines.

Fig. 1. The difference between SMOTE and SMOTE-Out
Fig. 2. Illustration of the SMOTE-Out procedure

The SMOTE-Out procedure is illustrated in Fig. 2 and Alg. 1. To create a synthetic example in the circle area (see Fig. 1), SMOTE-Out uses the nearest majority example as a direction in which to go off the track. SMOTE-Out may raise a question regarding how it avoids over-fitting. We tackle this issue by using the nearest minority example to draw the synthetic point back in.
Now suppose we have a vector u as a minority example and v as its nearest majority-class neighbor. To get the outside vector of u with respect to v, we find the vector dif1 = u − v that represents the difference between u and v. Suppose the outside vector of u, called u′, has the constraint ‖u′ − v‖ > ‖u − v‖, in order to keep u′ away from the majority-class space. Mathematically, it can simply be calculated as u′ = u + rand(0, a) · dif1. In this study we use a = 0.3 to minimize the possibility of over-fitting. The next step is finding a vector w close to u′. Suppose x is the nearest minority neighbor of u; then w is simply calculated by applying SMOTE between x and u′. The vector w is calculated as w = x + rand(0, a) · dif2, where dif2 = u′ − x, and a = 0.5 in this study.
Algorithm 1 SMOTE-Out of u
Input: data u, dataset majority, dataset minority
  major_neighbor[] = getNearestNeighbors(u, majority)
  minor_neighbor[] = getNearestNeighbors(u, minority)
  k = size(major_neighbor) = size(minor_neighbor)
  v = major_neighbor[random(1 to k)]
  dif1 = u − v
  u′ = u + random(0 to 0.3) · dif1
  x = minor_neighbor[random(1 to k)]
  dif2 = u′ − x
  w = x + random(0 to 0.5) · dif2
Algorithm 2 Finding the nearest neighbor of u in SMOTE-Cosine
Input: data u, dataset minority
Output: data neighbor
  A = [ ]  // to keep the Euclidean scores
  B = [ ]  // to keep the cosine scores
  minority = minority without data u
  k = size(minority)
  for i = 1 to k do
    A[i] = euclidean(u, minority[i])
    B[i] = cosine(u, minority[i])
  A = sortByAscending(A)
  B = sortByDescending(B)
  neighbor = voteResult(A, B)
Based on the description above, assume u is the circle center, a is the distance between u and x, and b is the length of the fraction of dif1. Then the circle area in Fig. 1 follows these two cases: 1) if a ≥ b, the circle radius is the distance between u and x; 2) if a < b, the synthetic example will be created in a circle area with radius equal to b.
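The procedure of Alg. 1 can be sketched as follows in Python/NumPy. The helper `nearest_neighbors` is our own illustrative stand-in for getNearestNeighbors, and the bounds 0.3 and 0.5 are the values used in this study; the rest is an assumption-laden sketch, not the authors' implementation:

```python
import numpy as np

def nearest_neighbors(u, data, k=5):
    """Indices of the k points in `data` closest to u (Euclidean), excluding u itself."""
    dist = np.linalg.norm(data - u, axis=1)
    order = np.argsort(dist)
    return order[dist[order] > 0][:k]

def smote_out(u, minority, majority, k=5, seed=None):
    """One SMOTE-Out synthetic sample for the minority example u (Alg. 1)."""
    rng = np.random.default_rng(seed)
    # step outward, away from the nearest majority example v:  u' = u + rand(0, 0.3) * (u - v)
    v = majority[rng.choice(nearest_neighbors(u, majority, k))]
    u_out = u + rng.uniform(0, 0.3) * (u - v)
    # draw the point back toward a nearby minority example x:  w = x + rand(0, 0.5) * (u' - x)
    x = minority[rng.choice(nearest_neighbors(u, minority, k))]
    return x + rng.uniform(0, 0.5) * (u_out - x)
```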
B. SMOTE-Cosine
As stated in Section 1, the nearest neighbor of a minority sample may be found better by considering both the distance and the direction of the vectors. We investigate this by proposing SMOTE-Cosine, in which we incorporate the cosine similarity (Eq. 2) and the Euclidean distance to obtain the nearest neighbor. The details are described in Alg. 2.
sim(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}    (2)
In Alg. 2, we apply a voting mechanism to incorporate the results of the Euclidean distance and the cosine similarity. Voting is done by assigning a higher weight to a higher rank. We then simply add both weights correspondingly and sort the result to determine the nearest neighbor.
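One possible reading of this voting step is sketched below in Python/NumPy. The paper does not specify the numerical weights, so the linear rank weights used here are an assumption, and `minority` is assumed not to contain u itself:

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two vectors, Eq. (2)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def voted_neighbors(u, minority, k=5):
    """Rank minority samples by combining Euclidean and cosine rankings (Alg. 2)."""
    minority = np.asarray(minority, dtype=float)
    euc = np.linalg.norm(minority - u, axis=1)            # smaller is better
    cos = np.array([cosine_sim(u, m) for m in minority])  # larger is better
    n = len(minority)
    w_euc = np.empty(n)
    w_cos = np.empty(n)
    w_euc[np.argsort(euc)] = np.arange(n, 0, -1)   # closest sample gets the largest weight
    w_cos[np.argsort(-cos)] = np.arange(n, 0, -1)  # most similar sample gets the largest weight
    votes = w_euc + w_cos                          # add both weights correspondingly
    return np.argsort(-votes)[:k]                  # indices of the k best-voted neighbors
```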
C. Selected-SMOTE
Selected-SMOTE aims to emphasize the dimensions of significant attributes, which is done by synthesizing only certain attributes chosen through feature selection.

Fig. 3. Selected-SMOTE illustration
Algorithm 3 Selected-SMOTE procedure applied to every minority example
Input: dataset data
  attr[] is the list of attributes in the dataset
  significantAttr[] = featureSelection(data)
  minority[] = getMinorityData(data)
  for i = 1 to size(minority) do
    u = minority[i]
    neighbor[] = getNearestNeighbors(u, minority)
    k = size(neighbor)
    v = neighbor[random(1 to k)]
    dif = v − u
    for m = 1 to sumOfAttr(u) do
      if significantAttr contains attr[m] then
        u′[m] = u[m] + random(0 to 1) · dif[m]
      else
        u′[m] = u[m]
In Fig. 3 we illustrate the basic idea of why this is able to enhance SMOTE performance. Suppose in Fig. 3 that b is the border line between the majority and minority spaces, which consist of only two attributes, x and y. Assume b = y and p is a minority example. When applying SMOTE to vector p, both attributes x and y will be synthesized, whereas synthesizing attribute y is unnecessary, since we only need to emphasize the variation of attribute x, given the border line b. The complete procedure of Selected-SMOTE is described in Alg. 3.
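A minimal Python/NumPy sketch of Alg. 3 follows. The feature-selection step itself is outside the algorithm, so `significant_idx` (the indices of the significant attributes) is assumed to be supplied by whatever feature-selection method is used:

```python
import numpy as np

def selected_smote(minority, significant_idx, k=5, seed=None):
    """One synthetic sample per minority example, synthesizing only the
    attributes flagged as significant (Alg. 3); the rest are copied unchanged."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for i, u in enumerate(minority):
        dist = np.linalg.norm(minority - u, axis=1)
        dist[i] = np.inf                                 # exclude u itself
        v = minority[rng.choice(np.argsort(dist)[:k])]   # random nearest minority neighbor
        dif = v - u
        u_new = u.copy()
        for m in significant_idx:                        # synthesize significant attributes only
            u_new[m] = u[m] + rng.uniform(0, 1) * dif[m]
        synthetic.append(u_new)
    return np.array(synthetic)
```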
IV. EXPERIMENT
A. Experimental Set-up
To investigate the performance of the proposed techniques, we use eighteen different datasets from UCI. As a preliminary study, the experiments were conducted on binary datasets with varying imbalance ratios, from 1:3 to 1:12, and different numbers of attributes. All datasets are summarized in Table I.
TABLE I
EIGHTEEN DATASETS WITH VARIOUS IMBALANCED PROPORTIONS
Dataset #Attributes #Positive #Negative Proportion
Arrhythmia (ART) 279 44 245 1:5.7
Breast Cancer Wisconsin (BCW) 10 100 458 1:4.58
Indian Liver (IL) 10 20 100 1:5
Dermatology (DER) 33 30 112 1:3.73
Yeast (YEA) 8 50 463 1:9.26
Fertility (FER) 10 12 88 1:7.33
Climate Model Simulation (CMS) 18 46 494 1:10.74
Glass Identification (GI) 10 25 163 1:6.52
Ionosphere (ION) 34 60 225 1:3.75
Statlog-Landsat Satellite (SLS) 36 300 1072 1:3.57
Credit card (CC) 15 25 296 1:11.84
Car Evaluation (CE) 6 384 1210 1:3.15
Hill-valley (HV) 100 50 311 1:6.22
Inflammations(INF) 6 14 70 1:5
Carcinoma tissue (CT) 10 21 85 1:4.04
Congressional Voting (CV) 16 50 267 1:5.34
Magic Gamma Telescope (MGT) 11 100 1000 1:10
White Wine Quality (WWQ) 11 500 3838 1:7.77
TABLE II
TWO-CLASS CONFUSION MATRIX
Predicted Positive Predicted Negative
Actual Positive TP FN
Actual Negative FP TN
In the experiment stage, we divide each dataset with a 3:7 ratio to construct the testing and training data. We apply LIBSVM [11] with a linear kernel as the classifier, and B-Acc and F-Measure as evaluation metrics. Evaluation of classifiers induced from imbalanced datasets needs special attention because, despite a high accuracy, a classifier may not meet the user's requirement of recognizing the minority class. The evaluation formulas are provided in Eq. 6 and Eq. 7 and are calculated from the two-class confusion matrix in Table II.
Precision = \frac{TP}{TP + FP}    (3)

Recall = Sensitivity = \frac{TP}{TP + FN}    (4)

Specificity = \frac{TN}{TN + FP}    (5)

F\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (6)

B\text{-}Acc = 0.5 \cdot (Specificity + Sensitivity)    (7)
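For completeness, the two scores can be computed directly from the confusion-matrix counts, as in this plain Python sketch of Eqs. 3-7 (the function name is ours):

```python
def b_acc_and_f_measure(tp, fn, fp, tn):
    """B-Acc and F-Measure from the two-class confusion matrix (Table II)."""
    precision = tp / (tp + fp)                                  # Eq. (3)
    recall = tp / (tp + fn)                                     # Eq. (4), equals sensitivity
    specificity = tn / (tn + fp)                                # Eq. (5)
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (6)
    b_acc = 0.5 * (specificity + recall)                        # Eq. (7)
    return b_acc, f_measure
```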
B. Experiment Results
In this experiment SMOTE is applied in five variations: SMOTE, SMOTE-Out, a combination of SMOTE and SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE. For each variation, we run the experiment five times and calculate the average B-Acc and F-Measure, in order to average out the randomization in SMOTE. For all datasets, SMOTE is applied according to the corresponding class proportion to achieve balance. For instance, in the Arrhythmia dataset, 570% synthetic samples of the minority examples will be created to re-balance the ratio of 1:5.7.
In Table III we present all of our experiment results for the eighteen datasets. For each proposed approach, we compare the results with standard SMOTE by counting the number of datasets that are better than, equal to, or worse than standard SMOTE. For our first proposed method, SMOTE-Out, only 3 and 2 datasets have worse B-Acc and F-Measure scores respectively, while 10 and 12 datasets have better scores, and 5 and 4 have the same scores. Incorporating SMOTE and SMOTE-Out by applying 50% of each reveals an even better result: our experiment shows that 12 and 13 datasets have better scores, and only 2 and 2 datasets have worse B-Acc and F-Measure respectively.
The results of SMOTE-Cosine, in contrast, show that only 8 datasets have better scores than SMOTE, and 7 of them give worse scores. This may indicate that our SMOTE-Cosine is not good enough to improve the standard performance. However, we argue that there might be another voting or incorporation mechanism which would give a better result, since the similarity of two vectors should also be determined by considering the vectors' direction.
Similar to SMOTE-Out, our third proposed method shows that 11 and 10 datasets have better B-Acc and F-Measure, and only 4 and 5 datasets give worse scores than SMOTE. This indicates that our idea of synthesizing only certain attributes based on feature selection can indeed give some improvements over SMOTE.
V. CONCLUSION
In this paper we present three improvements of SMOTE, namely SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE, in order to cover cases which are not handled by SMOTE.
TABLE III
EXPERIMENT RESULTS
No Dataset  SMOTE  SMOTE-Out  Combine Both  SMOTE-Cosine  Selected-SMOTE
            B-Acc FM  B-Acc FM  B-Acc FM  B-Acc FM  B-Acc FM
1 ART 74.62 58.97 76.00 61.29 75.12 61.88 72.66 60.97 77.26 63.42
2 BCW 97.61 96.32 97.61 96.32 97.75 96.98 98.04 98.29 97.68 96.65
3 IL 79.67 57.52 80.00 58.08 80.00 58.08 76.33 51.75 79.33 56.79
4 DER 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
5 YEA 83.38 56.28 83.45 56.52 83.38 56.24 85.68 66.15 83.60 57.11
6 FER 52.50 27.02 52.50 27.87 55.09 33.35 54.35 31.37 52.87 27.56
7 CMS 88.83 66.43 90.04 73.73 90.53 68.78 84.60 49.25 89.21 65.29
8 GI 76.35 66.10 77.60 66.96 77.40 66.05 83.65 70.19 77.81 67.88
9 ION 80.56 95.11 78.89 94.71 81.11 95.24 76.11 94.06 82.22 95.51
10 SLS 99.44 99.85 99.44 99.85 99.44 99.85 99.22 99.78 99.11 99.75
11 CC 40.81 40.03 41.11 40.59 45.58 45.02 46.11 45.53 43.19 42.38
12 CE 75.84 56.95 75.62 56.73 76.39 57.55 73.47 54.65 76.39 57.51
13 HV 98.40 91.54 99.15 95.07 98.94 94.13 91.37 81.15 99.15 95.18
14 INF 76.19 50.00 76.19 50.00 76.19 50.00 76.19 50.00 76.19 50.00
15 CT 69.84 48.31 81.48 68.64 67.69 50.20 60.28 49.13 70.71 50.75
16 CV 92.22 79.13 92.59 80.53 92.47 80.06 92.47 80.06 91.98 78.25
17 MGT 75.10 45.17 74.93 46.26 74.53 46.66 75.73 44.80 75.10 45.17
18 WWQ 61.67 27.94 62.19 28.36 62.05 28.24 64.94 35.71 61.59 27.84
Better than SMOTE 10 12 12 13 8 8 11 10
Equals to SMOTE 5 4 4 3 3 3 3 3
Worse than SMOTE 3 2 2 2 7 7 4 5
Our experiment results reveal that SMOTE-Out, the incorporation of SMOTE and SMOTE-Out, and Selected-SMOTE are able to boost the standard performance. The SMOTE-Out results indicate that new variations of minority samples can be achieved by synthesizing samples outside the line connecting two vectors. Similar to SMOTE-Out, our hypothesis regarding Selected-SMOTE also agrees with the experiment results: the approach is able to enrich the variation of minority examples because it emphasizes the dimensions of the significant attributes only. In future work, we will further investigate various ways of incorporating SMOTE with existing advanced SMOTE variants. We will also investigate other ways to improve the incorporation mechanism of SMOTE-Cosine, as we still argue that the nearest neighbor is better calculated by considering both the direction of and the distance between two vectors.
REFERENCES
[1] Z. F. Ye and B. L. Lu, "Learning Imbalanced Data Sets with a
Min-Max Modular Support Vector Machine”. In Proceedings
of International Joint Conference on Neural Network, Orlando,
Florida, USA, 2000.
[2] G. E. Batista, R. C. Prati and M. C. Monard, ”A study of the
behavior of several methods for balancing machine learning
training data”. In ACM SIGKDD Explorations Newsletter, 2000.
[3] C. Cardie and N. Howe, ”Improving minority class prediction
using case-specific feature weights”. In Proceedings of ICML,
pp. 57-65, 1997.
[4] K. Veropoulos, C. Campbell and N. Cristianini, ”Controlling
the sensitivity of support vector machines”. In Proceedings of
International joint conference on artificial intelligence, pp. 55-
60, 1999.
[5] Z. H. Zhou and X. Y. Liu, "Training cost-sensitive neural net-
works with methods addressing the class imbalance problem”.
In Proceedings of Knowledge and Data Engineering, IEEE, pp.
63-77, 2006.
[6] R. E. Schapire, ”A brief introduction to boosting”. In Proceed-
ings of Ijcai, pp. 1401-1406, 1999.
[7] W. Fan, S. J. Stolfo, J. Zhang and P.K. Chan, ”AdaCost:
misclassification cost-sensitive boosting”. In Proceedings of
ICML, pp. 97-105, 1999.
[8] C. Drummond and R. C. Holte, "C4.5, class imbalance, and
cost sensitivity: why under-sampling beats over-sampling”. In
Workshop on Learning from Imbalanced Datasets II, 2003.
[9] H. Han, W. Y. Wang and B. H. Mao, ”Borderline-SMOTE: A
new over-sampling method in imbalanced data sets learning”.
In Advances in intelligent computing, pp. 878-887, 2005.
[10] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer,
"SMOTE: synthetic minority over-sampling technique". In Journal
of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[11] C. C. Chang and C. J. Lin, ”LIBSVM: a library for support
vector machines”. In ACM Transactions on Intelligent Systems
and Technology (TIST), pp. 27, 2011.