SMOTE-Out, SMOTE-Cosine, and
Selected-SMOTE: An Enhancement Strategy
to Handle Imbalance in Data Level
Fajri Koto
Faculty of Computer Science
University of Indonesia
Depok, Jawa Barat, Indonesia 16423
Email: fajri91@ui.ac.id
Abstract—An imbalanced dataset often becomes an obstacle in a supervised learning process. Imbalance is the case in which the examples in the training data belonging to one class heavily outnumber the examples in the other class. Applying a classifier to such a dataset results in the failure of the classifier to learn the minority class. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known over-sampling method that tackles imbalance at the data level. SMOTE creates synthetic examples between two close vectors that lie together. Our study considers three improvements of SMOTE, which we call SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE, in order to cover cases that are not yet handled by SMOTE. To investigate the proposed methods, our experiments were conducted with eighteen different datasets. The results show that our proposed SMOTE variants give some improvements in B-ACC and F1-Score.
I. INTRODUCTION
To achieve optimum performance, a classifier requires a balanced class distribution in the dataset. However, the case in which the examples in the training data belonging to one class heavily outnumber the examples in the other class is often faced in the real world. It is commonly caused by the difficulty or the expensive cost of constructing datasets, for instance, biomedical data such as rare diseases and abnormal prognoses, or data obtained from very difficult or expensive experiments. Applying a classifier to an imbalanced dataset causes the classifier to fail to learn the minority class because of majority-class generalization, whereas in fact the minority class is often the important subject of investigation.
Several attempts have been made to tackle imbalance. According to [1], the approaches can be roughly divided into two categories: 1) data-level re-balancing and 2) modified learning algorithms. Two well-known re-balancing techniques at the data level, Random Under Sampling (RUS) and Random Over Sampling (ROS), have been introduced as standard non-heuristic re-sampling techniques. RUS randomly eliminates majority-class examples, while ROS achieves balance by generating random replications of minority-class examples [2]. In the second category, [3] and [4] modify the decision tree and the original SVM, respectively, to increase their sensitivity to the minority class. Ensemble learning approaches were also introduced in AdaBoost [6] and AdaCost [7].
Further studies related to under- and over-sampling have also been done. [8] states that under-sampling can establish a reasonable baseline for algorithmic comparison on imbalanced dataset problems, and argues that it is a better approach than over-sampling. However, we consider a certain condition, an extremely imbalanced case, in which under-sampling may discard much potentially useful data. For instance, a dataset of 1000 examples in which 990 are positive and only 10 are negative will cause under-sampling to no longer beat over-sampling.
We therefore consider a further study of over-sampling necessary. In this paper we address it by improving a well-known over-sampling technique, the Synthetic Minority Oversampling Technique (SMOTE) [10]. SMOTE generates new examples of the minority class by interpolating between two examples of the minority class that lie close together. Thus, the over-fitting problem, which causes the decision boundary of the minority class to spread further into the majority-class space, can be avoided. In a further study, Borderline-SMOTE was also introduced [9]. Technically, it aims to emphasize the boundary between the majority and minority class space.
In this study, we consider three cases that have not been covered by SMOTE and Borderline-SMOTE. First, when the distribution of minority examples is very dense and close together, no new variation of minority examples is achieved, because the interpolation is only done along the line connecting two minority examples. Second, the Euclidean approach used to find the nearest neighbor only considers the distance between two vectors, whereas the similarity between two vectors can also be considered through their angle or direction. Third, to produce new examples, SMOTE synthesizes all attributes of the dataset, although not all of them represent the boundary between the minority and majority class space. We argue that synthesizing only certain attributes considered significant can yield better examples.
The next sections explain these cases further. In Section 2 we provide an overview of SMOTE. Our proposed methods, including further explanation and technical details, are discussed in Section 3. The experimental setup and results are given in Section 4, and finally, conclusions are drawn in Section 5.
II. THE OVERVIEW OF SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE
SMOTE is an over-sampling approach in which the minority class is over-sampled by creating "synthetic" examples rather than by over-sampling with replacement. The minority class is over-sampled by taking each minority-class sample and introducing synthetic examples along the line segments joining any or all of its k minority-class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen. [10] proposed SMOTE using the Euclidean distance (Eq. 1) to find the closest neighbors of minority examples.
d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} \quad (1)
In general, SMOTE is applied based on the procedure below; a small illustrative sketch follows the list.
• Determine the number of neighbors k and the amount of SMOTE N.
• Randomly select N minority samples and put them into A.
• For each element a_i in A, find its k nearest neighbors by calculating the Euclidean distance, then randomly select one nearest neighbor v and compute a synthetic example between a_i and v.
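To make the bulleted procedure concrete, the following is a minimal sketch of standard SMOTE in Python with NumPy. The function name, the brute-force neighbor search, and the default k = 5 are our own illustrative choices rather than the reference implementation of [10].

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Create n_synthetic examples by interpolating between minority samples
    and randomly chosen members of their k nearest minority neighbors."""
    rng = rng or np.random.default_rng()
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_synthetic):
        a = minority[rng.integers(len(minority))]
        # Euclidean distance (Eq. 1) from a to every minority sample
        dist = np.linalg.norm(minority - a, axis=1)
        neighbors = np.argsort(dist)[1:k + 1]      # skip the sample itself
        v = minority[rng.choice(neighbors)]
        gap = rng.random()                          # random position on the segment a-v
        synthetic.append(a + gap * (v - a))
    return np.array(synthetic)
```

For example, a call such as smote(minority_X, n_synthetic=200) would return 200 interpolated samples to be appended to the minority class.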
III. THE PROPOSED SMOTE ENHANCEMENT
In this section, we introduce SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE as strategies to enhance SMOTE, in order to cover the cases described in the previous section.
A. SMOTE-Out
Applying SMOTE to minority examples with a dense distribution may cause SMOTE to create meaningless synthetic examples. Assume that in Fig. 1 the minority examples are represented by cross marks; the dashed triangle then shows the lines along which synthetic examples are created when SMOTE is applied to a minority example. A problem may arise when two vectors lie very close together, resulting in a very short line between them. We propose SMOTE-Out as a strategy to handle this by creating synthetic examples outside the dashed lines.
The SMOTE-Out procedure is illustrated in Fig. 2 and Alg. 1. To create a synthetic example in the circled space (see Fig. 1), SMOTE-Out uses the nearest majority example as the direction in which to go off the track. SMOTE-Out may raise a question regarding how it avoids over-fitting. We tackle this issue by using the nearest minority example to draw the synthetic point back in.
Fig. 1. The difference between SMOTE and SMOTE-Out
Fig. 2. Illustration of the SMOTE-Out procedure
Now suppose we have a vector u as a minority example and v as its nearest majority-class neighbor. To get the outside vector of u with respect to v, we first find the vector dif1 = u − v that represents the difference between u and v. The outside vector of u, called u', must satisfy the constraint ‖u' − v‖ > ‖u − v‖, in order to keep u' away from the majority-class space. Mathematically, it can simply be calculated as u' = u + rand(0, a) * dif1. In this study we use a = 0.3 to minimize the possibility of over-fitting. The next step is finding a vector w that is close to u'. Suppose x is the nearest minority neighbor of u; then w is simply calculated by applying SMOTE between x and u', i.e. w = x + rand(0, a) * dif2, where dif2 = u' − x and a = 0.5 in this study.
Algorithm 1 SMOTE-Out of u
Input: data u, dataset majority, dataset minority
  majorneighbor[] = getNrstNeighbor(u, majority)
  minorneighbor[] = getNrstNeighbor(u, minority)
  k = size(majorneighbor) = size(minorneighbor)
  v = majorneighbor[random(1 to k)]
  dif1 = u - v
  u' = u + random(0 to 0.3) * dif1
  x = minorneighbor[random(1 to k)]
  dif2 = u' - x
  w = x + random(0 to 0.5) * dif2
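A possible NumPy rendering of Alg. 1 is sketched below. The helper nearest() and the function name smote_out are hypothetical, the neighbor count k is an assumption, and the random bounds 0.3 and 0.5 follow the values of a stated above.

```python
import numpy as np

def smote_out(u, majority, minority, k=5, rng=None):
    """Create one synthetic sample outside the segment between u and its
    minority neighbors, stepping away from the nearest majority examples."""
    rng = rng or np.random.default_rng()
    u = np.asarray(u, dtype=float)

    def nearest(data, point, k):
        data = np.asarray(data, dtype=float)
        order = np.argsort(np.linalg.norm(data - point, axis=1))
        return data[order[:k]]

    major_nb = nearest(majority, u, k)
    minor_nb = nearest([m for m in minority if not np.array_equal(m, u)], u, k)

    v = major_nb[rng.integers(len(major_nb))]   # a nearby majority example
    dif1 = u - v
    u_out = u + rng.uniform(0, 0.3) * dif1      # push u away from the majority space

    x = minor_nb[rng.integers(len(minor_nb))]   # a nearby minority example
    dif2 = u_out - x
    return x + rng.uniform(0, 0.5) * dif2       # pull the point back toward the minority
```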
Algorithm 2 Finding the nearest neighbor of u in SMOTE-Cosine
Input: data u, dataset minority
Output: data neighbor
  A = [] to keep Euclidean scores
  B = [] to keep cosine scores
  minority = minority without data u
  k = size(minority)
  for i = 1 to k do
    A[i] = euclidean(u, minority[i])
    B[i] = cosine(u, minority[i])
  A = sortByAscending(A)
  B = sortByDescending(B)
  neighbor = voteResult(A, B)
Based on the description above, assume u is a circle center, a is the distance between u and x, and b is the fraction of dif1. Then the circle area in Fig. 1 follows two cases: 1) if a ≥ b, the circle radius is the distance between u and x; 2) if a < b, the synthetic example is created in a circle area with radius equal to b.
B. SMOTE-Cosine
As mentioned in Section 1, the nearest neighbor of a minority sample may be found better by considering both the distance and the direction of the vectors. We investigate this by proposing SMOTE-Cosine, in which we incorporate the cosine similarity of Eq. 2 together with the Euclidean distance formula to obtain the nearest neighbor. The details are described in Alg. 2.
sim(u, v) = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}} \quad (2)
In Alg. 2, we apply a voting mechanism to incorporate the results of the Euclidean distance and the cosine similarity. Voting is done by assigning a higher weight to a higher rank. We then simply add both weights correspondingly and sort the result to determine the nearest neighbor.
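As one simple realization of this voting, the sketch below (our illustration, not the paper's implementation) converts the Euclidean and cosine scores of Alg. 2 into ranks, adds the two ranks per candidate, and returns the candidates with the best combined rank; other weighting schemes are equally possible.

```python
import numpy as np

def cosine_euclidean_neighbors(u, minority, k=5):
    """Rank minority samples by the sum of their Euclidean-distance rank
    (ascending) and cosine-similarity rank (descending, Eq. 2)."""
    u = np.asarray(u, dtype=float)
    cand = np.asarray([m for m in minority if not np.array_equal(m, u)], dtype=float)

    eucl = np.linalg.norm(cand - u, axis=1)
    cos = cand @ u / (np.linalg.norm(cand, axis=1) * np.linalg.norm(u))

    eucl_rank = np.argsort(np.argsort(eucl))    # rank 0 = closest
    cos_rank = np.argsort(np.argsort(-cos))     # rank 0 = most similar direction

    combined = eucl_rank + cos_rank             # lower total = better neighbor
    return cand[np.argsort(combined)[:k]]
```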
C. Selected-SMOTE
Selected-SMOTE aims to emphasize the dimensions of the significant attributes, which is done by synthesizing only certain attributes chosen through feature selection.
Fig. 3. Selected-SMOTE illustration
Algorithm 3 Selected-SMOTE procedure applied to every minority example
Input: dataset data
  attr[] is the list of attributes in the dataset
  significantAttr[] = featureSelection(data)
  minority[] = getMinorityData(data)
  for i = 1 to size(minority) do
    u = minority[i]
    neighbor[] = getNrstNeighbor(u, minority)
    k = size(neighbor)
    v = neighbor[random(1 to k)]
    dif = v - u
    for m = 1 to sumOfAttr(u) do
      if significantAttr contains attr[m] then
        u'[m] = u[m] + random(0 to 1) * dif[m]
      else
        u'[m] = u[m]
In Fig. 3 we illustrate the basic idea of why this is able to enhance SMOTE performance. Suppose in Fig. 3 that b is the border line between the majority and minority space, which involves only two attributes, x and y. Assume b = y and that p is a minority example. When applying SMOTE to vector p, both attributes x and y are synthesized. However, synthesizing attribute y is unnecessary, since we only need to emphasize the variation of attribute x, given the border line b. The complete procedure of Selected-SMOTE is described in Alg. 3.
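A compact Python sketch of Alg. 3 is given below. The set of significant attribute indices is assumed to come from an external feature-selection step, which the algorithm leaves unspecified; all function and variable names are ours.

```python
import numpy as np

def selected_smote(minority, significant_idx, k=5, rng=None):
    """For every minority sample, synthesize a new example in which only
    the significant attributes are interpolated; the others are copied."""
    rng = rng or np.random.default_rng()
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for u in minority:
        dist = np.linalg.norm(minority - u, axis=1)
        neighbors = np.argsort(dist)[1:k + 1]       # k nearest minority neighbors
        v = minority[rng.choice(neighbors)]
        dif = v - u
        new = u.copy()
        for m in significant_idx:                   # interpolate significant attributes only
            new[m] = u[m] + rng.random() * dif[m]
        synthetic.append(new)
    return np.array(synthetic)
```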
IV. EXPERIMENT
A. Experimental Set-up
To investigate the performance of the proposed techniques, we use eighteen different datasets from UCI. As a preliminary study, the experiments were conducted on binary datasets with imbalance ratios varying from 1:3 to 1:12 and with different numbers of attributes. All datasets are summarized in Table I.
TABLE I
EIGHTEEN DATASETS WITH VARIOUS IMBALANCED PROPORTIONS
Dataset Attribute #positive #negative Proportion
Arrhythmia (ART) 279 44 245 1:5.7
Breast Cancer Wisconsin (BCW) 10 100 458 1:4.58
Indian Liver (IL) 10 20 100 1:5
Dermatology (DER) 33 30 112 1:3.73
Yeast (YEA) 8 50 463 1:9.26
Fertility (FER) 10 12 88 1:7.33
Climate Model Simulation (CMS) 18 46 494 1:10.74
Glass Identification (GI) 10 25 163 1:6.52
Ionosphere (ION) 34 60 225 1:3.75
Statlog-Landsat Satellite (SLS) 36 300 1072 1:3.57
Credit card (CC) 15 25 296 1:11.84
Car Evaluation (CE) 6 384 1210 1:3.15
Hill-valley (HV) 100 50 311 1:6.22
Inflammations (INF) 6 14 70 1:5
Carcinoma tissue (CT) 10 21 85 1:4.04
Congressional Voting (CV) 16 50 267 1:5.34
Magic Gamma Telescope (MGT) 11 100 1000 1:10
White Wine Quality (WWQ) 11 500 3838 1:7.77
TABLE II
TWO-CLASS CONFUSION MATRIX
Predicted Positive Predicted Negative
Actual Positive TP FN
Actual Negative FP TN
In the experiment stage, we divide each dataset in a 3:7 ratio to construct the testing and training data. We apply LIBSVM [11] with a linear kernel as the classifier, and B-ACC and F-Measure as the evaluation metrics. The evaluation of classifiers induced from imbalanced datasets needs special attention, because despite a high accuracy, a classifier may not meet the user's requirement of recognizing the minority class. The evaluation formulas are provided in Eq. 6 and Eq. 7 and are calculated based on the two-class confusion matrix in Table II.
Precision = \frac{TP}{TP + FP} \quad (3)
Recall = Sensitivity = \frac{TP}{TP + FN} \quad (4)
Specificity = \frac{TN}{TN + FP} \quad (5)
F\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \quad (6)
B\text{-}Acc = 0.5 \cdot (Specificity + Sensitivity) \quad (7)
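These formulas map directly onto the confusion-matrix counts of Table II. The small helper below (our own illustration, with made-up example counts) computes B-Acc and F-Measure from TP, FN, FP, and TN.

```python
def evaluate(tp, fn, fp, tn):
    """Return (B-Acc, F-Measure) computed from the two-class confusion matrix."""
    precision = tp / (tp + fp)                                   # Eq. 3
    recall = tp / (tp + fn)                                      # Eq. 4 (sensitivity)
    specificity = tn / (tn + fp)                                 # Eq. 5
    f_measure = 2 * precision * recall / (precision + recall)    # Eq. 6
    b_acc = 0.5 * (specificity + recall)                         # Eq. 7
    return b_acc, f_measure

# Hypothetical counts for illustration only
print(evaluate(tp=40, fn=10, fp=20, tn=430))
```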
B. Experiment Result
In this experiment, SMOTE is applied in five variations: SMOTE, SMOTE-Out, a combination of SMOTE and SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE. For each variation, we conduct the experiment five times and calculate the average B-Acc and F-Measure scores, in order to generalize over the randomization in SMOTE. For all datasets, the amount of SMOTE is set according to the corresponding class proportion, so as to achieve balance. For instance, in the Arrhythmia dataset, synthetic samples amounting to 570% of the minority examples are created to re-balance the 1:5.7 ratio.
In Table III we present all of our experiment results for the eighteen datasets. For each proposed approach, we compare the results with standard SMOTE by counting the number of datasets that score better than, equal to, or worse than standard SMOTE. For our first proposed method, SMOTE-Out, only 3 and 2 datasets have worse B-ACC and F-Measure scores, while 10 and 12 datasets have better scores, and 5 and 4 have the same scores. Incorporating SMOTE and SMOTE-Out by applying 50% of each reveals an even better result: our experiment shows 12 and 13 datasets with better scores, and only 2 and 2 datasets with worse B-ACC and F-Measure, respectively.
The results of SMOTE-Cosine show that only 8 datasets score better than SMOTE, while 7 of them score worse. This may indicate that our SMOTE-Cosine is not good enough to improve on the standard performance. However, we argue that there might be another voting or incorporation mechanism that gives a better result, since the similarity of two vectors should also be determined by considering the vectors' direction.
Similar to SMOTE-Out, our third proposed method shows that 11 and 10 datasets have better B-ACC and F-Measure, and only 4 and 5 datasets score worse than SMOTE. This indicates that our idea of synthesizing only certain attributes selected by feature selection can indeed give some improvements over SMOTE.
V. CONCLUSION
In this paper we presented three improvements of SMOTE: SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE, in order to cover cases that are not yet handled by SMOTE.
TABLE III
EXPERIMENT RESULTS
No Dataset SMOTE SMOTE-Out Combine Both SMOTE-Cosine Selected-SMOTE
BACC FM BACC FM BACC FM BACC FM BACC FM
1 ART 74.62 58.97 76.00 61.29 75.12 61.88 72.66 60.97 77.26 63.42
2 BCW 97.61 96.32 97.61 96.32 97.75 96.98 98.04 98.29 97.68 96.65
3 IL 79.67 57.52 80.00 58.08 80.00 58.08 76.33 51.75 79.33 56.79
4 DER 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
5 YEA 83.38 56.28 83.45 56.52 83.38 56.24 85.68 66.15 83.60 57.11
6 FER 52.50 27.02 52.50 27.87 55.09 33.35 54.35 31.37 52.87 27.56
7 CMS 88.83 66.43 90.04 73.73 90.53 68.78 84.60 49.25 89.21 65.29
8 GI 76.35 66.10 77.60 66.96 77.40 66.05 83.65 70.19 77.81 67.88
9 ION 80.56 95.11 78.89 94.71 81.11 95.24 76.11 94.06 82.22 95.51
10 SLS 99.44 99.85 99.44 99.85 99.44 99.85 99.22 99.78 99.11 99.75
11 CC 40.81 40.03 41.11 40.59 45.58 45.02 46.11 45.53 43.19 42.38
12 CE 75.84 56.95 75.62 56.73 76.39 57.55 73.47 54.65 76.39 57.51
13 HV 98.40 91.54 99.15 95.07 98.94 94.13 91.37 81.15 99.15 95.18
14 INF 76.19 50.00 76.19 50.00 76.19 50.00 76.19 50.00 76.19 50.00
15 CT 69.84 48.31 81.48 68.64 67.69 50.20 60.28 49.13 70.71 50.75
16 CV 92.22 79.13 92.59 80.53 92.47 80.06 92.47 80.06 91.98 78.25
17 MGT 75.10 45.17 74.93 46.26 74.53 46.66 75.73 44.80 75.10 45.17
18 WWQ 61.67 27.94 62.19 28.36 62.05 28.24 64.94 35.71 61.59 27.84
Better than SMOTE – – 10 12 12 13 8 8 11 10
Equal to SMOTE – – 5 4 4 3 3 3 3 3
Worse than SMOTE – – 3 2 2 2 7 7 4 5
Our experiment results reveal that SMOTE-Out, the incorporation of SMOTE and SMOTE-Out, and Selected-SMOTE are able to boost the standard performance. The SMOTE-Out result indicates that new variations of minority samples can be obtained better by synthesizing samples outside the line connecting two vectors. Similar to SMOTE-Out, our hypothesis regarding Selected-SMOTE also agrees with the experiment results: the approach is able to enrich the variation of minority examples better because it emphasizes only the dimensions of the significant attributes. In future work, we will further investigate various ways of incorporating SMOTE with existing advanced SMOTE variants. We will also investigate other ways to improve the incorporation mechanism of SMOTE-Cosine, as we still argue that the nearest neighbor should be better calculated by considering both the direction and the distance between two vectors.
REFERENCES
[1] Z. F. Ye and B. L. Lu, "Learning Imbalanced Data Sets with a Min-Max Modular Support Vector Machine". In Proceedings of the International Joint Conference on Neural Networks, Orlando, Florida, USA, 2000.
[2] G. E. Batista, R. C. Prati and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data". In ACM SIGKDD Explorations Newsletter, 2000.
[3] C. Cardie and N. Howe, "Improving minority class prediction using case-specific feature weights". In Proceedings of ICML, pp. 57-65, 1997.
[4] K. Veropoulos, C. Campbell and N. Cristianini, "Controlling the sensitivity of support vector machines". In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 55-60, 1999.
[5] Z. H. Zhou and X. Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem". In IEEE Transactions on Knowledge and Data Engineering, pp. 63-77, 2006.
[6] R. E. Schapire, "A brief introduction to boosting". In Proceedings of IJCAI, pp. 1401-1406, 1999.
[7] W. Fan, S. J. Stolfo, J. Zhang and P. K. Chan, "AdaCost: misclassification cost-sensitive boosting". In Proceedings of ICML, pp. 97-105, 1999.
[8] C. Drummond and R. C. Holte, "C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling". In Workshop on Learning from Imbalanced Datasets II, 2003.
[9] H. Han, W. Y. Wang and B. H. Mao, "Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning". In Advances in Intelligent Computing, pp. 878-887, 2005.
[10] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique". In Journal of Artificial Intelligence Research 16, pp. 321-357, 2002.
[11] C. C. Chang and C. J. Lin, "LIBSVM: a library for support vector machines". In ACM Transactions on Intelligent Systems and Technology (TIST), article 27, 2011.