SMOTE-Out, SMOTE-Cosine, and
Selected-SMOTE: An Enhancement Strategy
to Handle Imbalance in Data Level
Fajri Koto
Faculty of Computer Science
University of Indonesia
Depok, Jawa Barat, Indonesia 16423
Email: fajri91@ui.ac.id
Abstract—The imbalanced dataset is often an obstacle in supervised learning. Imbalance is the case in which the examples in the training data belonging to one class heavily outnumber the examples in the other class. Applying a classifier to such a dataset results in the classifier failing to learn the minority class. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known oversampling method that tackles imbalance at the data level. SMOTE creates synthetic examples between two close vectors that lie together. Our study considers three improvements of SMOTE, which we call SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE, in order to cover cases that are not already handled by SMOTE. To investigate the proposed methods, our experiments were conducted with eighteen different datasets. The results show that our proposed SMOTE variants give some improvements in B-ACC and F1-Score.
I. INTRODUCTION
To achieve optimum performance, a classifier requires a balanced class distribution in the dataset. However, the case in which the examples in the training data belonging to one class heavily outnumber the examples in the other class is often faced in the real world. It is commonly caused by the difficulty or the high cost of constructing datasets, for instance biomedical data such as rare diseases and abnormal prognoses, or data obtained from very difficult or expensive experiments. Applying a classifier to an imbalanced dataset causes the classifier to fail to learn the minority class because of majority-class generalization, whereas in fact the minority class is often the important subject of investigation.
Several attempts have been made to tackle the imbalance. According to [1], the approaches can be roughly divided into two categories: 1) data-level re-balancing and 2) modified learning algorithms. Two well-known re-balancing techniques at the data level, Random Under Sampling (RUS) and Random Over Sampling (ROS), have been introduced as standard non-heuristic re-sampling techniques. RUS randomly eliminates majority-class examples, while ROS achieves balance by generating random replications of minority-class examples [2]. In the second category, [3] and [4] modify the decision tree and the original SVM, respectively, to increase their sensitivity to the minority class. Ensemble learning approaches were also introduced in AdaBoost [6] and AdaCost [7].
Further studies on under- and over-sampling have also been carried out. [8] states that under-sampling can establish a reasonable baseline for algorithmic comparison in imbalanced dataset problems, and argues that it is a better approach than over-sampling. However, we consider a certain condition, an extremely imbalanced case, in which under-sampling may discard much potentially useful data. For instance, a dataset of 1000 examples in which 990 are positive and only 10 are negative will cause under-sampling to no longer beat over-sampling.
We therefore consider a further study of over-sampling to be necessary. In this paper we address it by improving a well-known over-sampling technique, the Synthetic Minority Oversampling Technique (SMOTE) [10]. SMOTE generates new examples of the minority class by interpolating between two minority-class examples that lie close together. Thus, the over-fitting problem, which causes the decision boundary of the minority class to spread further into the majority-class space, can be avoided. In a further study, SMOTE-Borderline was also introduced [9]; technically, it aims to emphasize the boundary between the majority- and minority-class spaces.
In this study, we consider three cases which have not been covered by SMOTE and SMOTE-Borderline. First, when the distribution of minority examples is very dense and close together, no new variation of minority examples is achieved, because the interpolation is done only along the line connecting two minority examples. Second, the Euclidean approach used to find the nearest neighbor considers only the distance between two vectors, whereas the similarity between two vectors can also be judged by their angle or direction. Third, to produce new examples, SMOTE synthesizes all attributes of the dataset, whereas not all of them represent the boundary between the minority- and majority-class spaces. We argue that synthesizing only certain attributes which are considered significant
can yield better examples.
In the next sections, further explanation of these cases is given. First, Section 2 provides an overview of SMOTE. Our proposed methods, including further explanation and technical details, are discussed in Section 3. The experimental setup and results are given in Section 4, and finally, conclusions are drawn in Section 5.
II. THE OVERVIEW OF SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE
SMOTE is an over-sampling approach in which the minority class is over-sampled by creating ”synthetic” examples rather than by over-sampling with replacement. The minority class is over-sampled by taking each minority-class sample and introducing synthetic examples along the line segments joining any or all of its k minority-class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen. [10] proposed SMOTE utilizing the Euclidean distance (Eq. 1) to find the closest neighbors of minority examples.
$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}$   (1)
In general, SMOTE is applied based on the procedure below (a minimal code sketch follows the list):
- Determine the number of neighbors k and the amount of SMOTE N.
- Randomly select N minority samples and put them into A.
- For each element a_i in A, find its k nearest neighbors by calculating the Euclidean distance, then randomly select one nearest neighbor v and compute a synthetic example between a_i and v.
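As a rough illustration of this procedure, the Python/NumPy sketch below generates synthetic minority examples; it is our own minimal reconstruction, not the original SMOTE implementation, and the function and parameter names (smote, n_synthetic, k) are ours.

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Minimal SMOTE sketch: interpolate between a randomly chosen minority
    sample and one of its k nearest minority neighbours (Euclidean distance)."""
    rng = rng or np.random.default_rng()
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_synthetic):
        a = minority[rng.integers(len(minority))]
        # k nearest minority neighbours of a (index 0 is a itself, so skip it)
        dists = np.linalg.norm(minority - a, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        v = minority[rng.choice(neighbours)]
        # the synthetic example lies on the segment between a and v
        gap = rng.random()
        synthetic.append(a + gap * (v - a))
    return np.array(synthetic)
```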
III. THE PROPOSED SMOTE ENHANCEMENT
In this section, we introduce SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE as strategies to enhance SMOTE, in order to cover the cases described in the previous section.
A. SMOTE-Out
Applying SMOTE to minority examples with a dense distribution may cause SMOTE to create meaningless synthetic examples. Assume in Fig. 1 that the minority examples are represented by cross marks; the dashed triangle then marks the lines along which synthetic examples are created when SMOTE is applied to a minority example. A problem arises if two vectors lie very close together, resulting in a very short line between them. We propose SMOTE-Out as a strategy to handle this by creating synthetic examples in the area outside the dashed lines.
The SMOTE-Out procedure is illustrated in Fig. 2 and described in Alg. 1. To create a synthetic example in the circled space (see Fig. 1), SMOTE-Out uses the nearest majority example as a direction to go off the track. SMOTE-Out may raise a question regarding how it avoids over-fitting; we tackle this issue by using the nearest minority example to draw the synthetic point back in.
Fig. 1. The difference between SMOTE and SMOTE-Out
Fig. 2. Illustration of SMOTE-Out procedure
Now suppose we have a vector u as a minority example and v as its nearest majority-class neighbor. To get the outside vector of u with respect to v, we can find the vector $dif_1 = u - v$ that represents the difference between u and v. Suppose the outside vector of u, called u', has the constraint $\|u' - v\| > \|u - v\|$, in order to keep u' at a distance from the majority-class space. Mathematically, it can simply be calculated as $u' = u + rand(0, a) \cdot dif_1$. In this study we use a = 0.3 to minimize the possibility of over-fitting. The next step is to find a vector w that is close to u'. Suppose x is the nearest minority neighbor of u; then w is simply calculated by applying SMOTE between x and u'. Vector w is calculated by the formula $w = x + rand(0, a) \cdot dif_2$, where $dif_2 = u' - x$ and a = 0.5 in this study.
Algorithm 1 SMOTE-Out of u
Input: data u, dataset majority, dataset minority
  majorneighbor[] = getNrstNeighbor(u, majority)
  minorneighbor[] = getNrstNeighbor(u, minority)
  k = size(majorneighbor) = size(minorneighbor)
  v = majorneighbor[random(1 to k)]
  dif1 = u - v
  u' = u + random(0 to 0.3) * dif1
  x = minorneighbor[random(1 to k)]
  dif2 = u' - x
  w = x + random(0 to 0.5) * dif2
Algorithm 2 Finding the nearest neighbor of u in SMOTE-Cosine
Input: data u, dataset minority
Output: data neighbor
  A = [] to keep Euclidean scores
  B = [] to keep cosine scores
  minority = minority without data u
  k = size(minority)
  for i = 1 to k do
    A[i] = euclidean(u, minority[i])
    B[i] = cosine(u, minority[i])
  A = sortByAscending(A)
  B = sortByDescending(B)
  neighbor = voteResult(A, B)
Based on the description above, assume u is the circle center, a is the distance between u and x, and b is the length of the fraction of dif1. Then the circle area in Fig. 1 follows two cases: 1) if a ≥ b, the circle radius is the distance between u and x; 2) if a < b, the synthetic example will be created in a circle area with radius equal to b.
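A minimal Python sketch of one SMOTE-Out synthesis step, following Alg. 1 and the formulas above, is given below. It is our own reconstruction, not the authors' implementation: the function and parameter names are ours, and we assume the nearest-neighbor lookups are done by Euclidean distance.

```python
import numpy as np

def smote_out_sample(u, minority, majority, a_out=0.3, a_in=0.5, k=5, rng=None):
    """Sketch of one SMOTE-Out step (Alg. 1); a_out and a_in follow the
    a = 0.3 and a = 0.5 values used in the paper."""
    rng = rng or np.random.default_rng()
    u = np.asarray(u, dtype=float)
    minority = np.asarray(minority, dtype=float)
    majority = np.asarray(majority, dtype=float)

    def k_nearest(data, point, k):
        d = np.linalg.norm(data - point, axis=1)
        return data[np.argsort(d)[:k]]

    # step outward from u, away from a randomly chosen near majority neighbour v
    v = k_nearest(majority, u, k)[rng.integers(k)]
    dif1 = u - v
    u_out = u + rng.uniform(0, a_out) * dif1
    # pull back toward a near minority neighbour x to limit over-fitting
    x = k_nearest(minority, u, k + 1)[1:][rng.integers(k)]  # index 0 is u itself
    dif2 = u_out - x
    return x + rng.uniform(0, a_in) * dif2
```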
B. SMOTE-Cosine
As noted in Section 1, the nearest neighbor of a minority sample may be found better by considering both the distance and the direction of the vectors. We investigate this problem with SMOTE-Cosine, in which we incorporate the cosine similarity in Eq. 2 together with the Euclidean distance formula to obtain the nearest neighbor. The details are described in Alg. 2.
$sim(u, v) = \frac{\sum_{i} u_i v_i}{\sqrt{\sum_{i} u_i^2}\,\sqrt{\sum_{i} v_i^2}}$   (2)
In Alg. 2, we apply a voting mechanism to incorporate the results of the Euclidean distance and the cosine similarity. Voting is done by assigning a higher weight to a higher rank; we then simply add both weights for each candidate and sort the result to determine the nearest neighbor, as in the sketch below.
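The following Python sketch shows one plausible reading of this rank-based voting; the exact weighting scheme is not spelled out in the paper, so the weight "n minus rank" is our assumption, and the function name is ours.

```python
import numpy as np

def cosine_euclidean_neighbour(u, minority):
    """Sketch of the voting in Alg. 2: each candidate gets a weight from its
    Euclidean rank (small distance = high weight) plus a weight from its cosine
    rank (high similarity = high weight); the candidate with the largest combined
    weight is returned. Assumes u itself was already removed from minority."""
    u = np.asarray(u, dtype=float)
    cand = np.asarray(minority, dtype=float)
    eucl = np.linalg.norm(cand - u, axis=1)
    cos = cand @ u / (np.linalg.norm(cand, axis=1) * np.linalg.norm(u) + 1e-12)
    n = len(cand)
    # argsort-of-argsort gives each element's rank (0 = best)
    w_eucl = n - np.argsort(np.argsort(eucl))   # smaller distance -> higher weight
    w_cos = n - np.argsort(np.argsort(-cos))    # larger similarity -> higher weight
    return cand[np.argmax(w_eucl + w_cos)]
```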
C. Selected-SMOTE
Selected-SMOTE aims to emphasize the dimensions of significant attributes, which is done by synthesizing only certain attributes chosen through feature selection.
Fig. 3. Selected-SMOTE illustration
Algorithm 3 Selected-SMOTE procedure applied to every minority example
Input: dataset data
  attr[] is the list of attributes in the dataset
  significantAttr[] = featureSelection(data)
  minority[] = getMinorityData(data)
  for i = 1 to size(minority) do
    u = minority[i]
    neighbor[] = getNrstNeighbor(u, minority)
    k = size(neighbor)
    v = neighbor[random(1 to k)]
    dif = v - u
    for m = 1 to sumOfAttr(u) do
      if significantAttr contains attr[m] then
        u'[m] = u[m] + random(0 to 1) * dif[m]
      else
        u'[m] = u[m]
In Fig. 3 we illustrate the basic idea of why this is able to enhance SMOTE performance. Suppose in Fig. 3 that b is the border line between the majority and minority spaces, which consist of only two attributes, x and y. Assume b = y and that p is a minority example. In applying SMOTE to vector p, both attributes x and y will be synthesized, whereas synthesizing attribute y is unnecessary, since border line b means we only need to emphasize the variation of attribute x. The complete procedure of Selected-SMOTE is described in Alg. 3, and a minimal code sketch follows.
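Below is a minimal Python sketch of Alg. 3 for a single minority example; the feature-selection step itself is assumed to be external and to supply the indices of the significant attributes (significant_idx). The names are ours, not from the original code.

```python
import numpy as np

def selected_smote_sample(u, minority, significant_idx, k=5, rng=None):
    """Sketch of Alg. 3 for one minority example u: only the attribute indices
    in significant_idx are interpolated; all other attributes are copied from u."""
    rng = rng or np.random.default_rng()
    u = np.asarray(u, dtype=float)
    minority = np.asarray(minority, dtype=float)
    d = np.linalg.norm(minority - u, axis=1)
    neighbours = minority[np.argsort(d)[1:k + 1]]      # index 0 is u itself
    v = neighbours[rng.integers(len(neighbours))]
    dif = v - u
    u_new = u.copy()
    sel = np.asarray(significant_idx)
    # synthesize only the significant attributes, as in the inner loop of Alg. 3
    u_new[sel] = u[sel] + rng.uniform(0, 1, size=sel.size) * dif[sel]
    return u_new
```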
IV. EXPERIMENT
A. Experimental Set-up
To investigate the performance of the proposed techniques, we use eighteen different datasets from UCI. As a preliminary study, the experiments were conducted on binary datasets with imbalance ratios varying from 1:3 to 1:12 and with different numbers of attributes.
TABLE I
EIGHTEEN DATASETS WITH VARIOUS IMBALANCED PROPORTIONS
Dataset Attribute #positive #negative Proportion
Arrhythmia (ART) 279 44 245 1:5.7
Breast Cancer Wisconsin (BCW) 10 100 458 1:4.58
India liver (IL) 10 20 100 1:5
Dermatology (DER) 33 30 112 1:3.73
Yeast (YEA) 8 50 463 1:9.26
Fertility (FER) 10 12 88 1:7.33
Climate Model Simulation (CMS) 18 46 494 1:10.74
Glass Identification (GI) 10 25 163 1:6.52
Ionosphere (ION) 34 60 225 1:3.75
Statlog-Landsat Satellite (SLS) 36 300 1072 1:3.57
Credit card (CC) 15 25 296 1:11.84
Car Evaluation (CE) 6 384 1210 1:3.15
Hill-valley (HV) 100 50 311 1:6.22
Inflammations(INF) 6 14 70 1:5
Carcinoma tissue (CT) 10 21 85 1:4.04
Congressional Voting (CV) 16 50 267 1:5.34
Magic Gamma Telescope (MGT) 11 100 1000 1:10
White Wine Quality (WWQ) 11 500 3838 1:7.77
TABLE II
TWO-CLASS CONFUSION MATRIX
Predicted Positive Predicted Negative
Actual Positive TP FN
Actual Negative FP TN
All datasets are summarized in Table I.
In the experiment stage, we divide each dataset with a 3:7 ratio to construct the testing and training data. We apply LIBSVM [11] with a linear kernel as the classifier, and B-ACC and F-Measure as the evaluation metrics. Evaluation of classifiers induced from imbalanced datasets needs special attention because, despite a high accuracy, a classifier may not meet the user's requirement of recognizing the minority class. The evaluation formulas are provided in Eq. 6 and Eq. 7 and are calculated based on the two-class confusion matrix in Table II.
$Precision = \frac{TP}{TP + FP}$   (3)

$Recall = Sensitivity = \frac{TP}{TP + FN}$   (4)

$Specificity = \frac{TN}{TN + FP}$   (5)

$F\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$   (6)

$B\text{-}Acc = 0.5\,(Specificity + Sensitivity)$   (7)
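For reference, a small Python helper that computes both metrics directly from the confusion-matrix counts of Table II is sketched below; it is a minimal sketch without the zero-division guards a robust implementation would add.

```python
def b_acc_and_f_measure(tp, fn, fp, tn):
    """Compute B-ACC and F-Measure from a two-class confusion matrix,
    following Eqs. 3-7."""
    precision = tp / (tp + fp)                                   # Eq. 3
    recall = tp / (tp + fn)                                      # Eq. 4 (= sensitivity)
    specificity = tn / (tn + fp)                                 # Eq. 5
    f_measure = 2 * precision * recall / (precision + recall)    # Eq. 6
    b_acc = 0.5 * (specificity + recall)                         # Eq. 7
    return b_acc, f_measure
```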
B. Experiment Result
In this experiment SMOTE is applied in five variations: SMOTE, SMOTE-Out, the combination of SMOTE and SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE. For each variation, we also run the experiment five times and report the average B-ACC and F-Measure scores, in order to generalize over the randomization in SMOTE. For all datasets, SMOTE is applied according to the corresponding class proportion to achieve balance. For instance, in the Arrhythmia dataset, 570% synthetic samples of the minority examples will be created to re-balance the ratio of 1:5.7.
In Table III we present all of our experiment results for the eighteen datasets. For each proposed approach, we compare the results with standard SMOTE by counting the number of datasets that are better than, equal to, or worse than standard SMOTE. For our first proposed method, SMOTE-Out, only 3 and 2 datasets have worse B-ACC and F-Measure scores, while 10 and 12 datasets have better scores, and 5 and 4 have the same scores. Incorporating SMOTE and SMOTE-Out, by applying 50% of each, reveals an even better result: our experiment shows that 12 and 13 datasets have better scores, and only 2 and 2 datasets have worse B-ACC and F-Measure, respectively.
The results of SMOTE-Cosine, on the other hand, show that only 8 datasets have better scores than SMOTE, while 7 of them give worse scores. This may indicate that our SMOTE-Cosine is not good enough to improve the standard performance. However, we argue that there might be another voting or incorporation mechanism which would give a better result, since the similarity of two vectors should also be determined by considering the vectors' direction.
Similar to SMOTE-Out, our third proposed method shows that 11 and 10 datasets have better B-ACC and F-Measure, and only 4 and 5 datasets give worse scores than SMOTE. This indicates that our idea of synthesizing only certain attributes, chosen by feature selection, can indeed give some improvement over SMOTE.
V. CONCLUSION
In this paper we present three improvements of SMOTE: SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE, in order to cover cases which are not already handled by SMOTE. Our experiment results reveal
TABLE III
EXPERIMENT RESULTS
No Dataset SMOTE SMOTE-OUT Combine Both SMOTE-Cosine Selected-SMOTE
BACC FM BACC FM BACC FM BACC FM BACC FM
1 ART 74.62 58.97 76.00 61.29 75.12 61.88 72.66 60.97 77.26 63.42
2 BCW 97.61 96.32 97.61 96.32 97.75 96.98 98.04 98.29 97.68 96.65
3 IL 79.67 57.52 80.00 58.08 80.00 58.08 76.33 51.75 79.33 56.79
4 DER 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
5 YEA 83.38 56.28 83.45 56.52 83.38 56.24 85.68 66.15 83.60 57.11
6 FER 52.50 27.02 52.50 27.87 55.09 33.35 54.35 31.37 52.87 27.56
7 CMS 88.83 66.43 90.04 73.73 90.53 68.78 84.60 49.25 89.21 65.29
8 GI 76.35 66.10 77.60 66.96 77.40 66.05 83.65 70.19 77.81 67.88
9 ION 80.56 95.11 78.89 94.71 81.11 95.24 76.11 94.06 82.22 95.51
10 SLS 99.44 99.85 99.44 99.85 99.44 99.85 99.22 99.78 99.11 99.75
11 CC 40.81 40.03 41.11 40.59 45.58 45.02 46.11 45.53 43.19 42.38
12 CE 75.84 56.95 75.62 56.73 76.39 57.55 73.47 54.65 76.39 57.51
13 HV 98.40 91.54 99.15 95.07 98.94 94.13 91.37 81.15 99.15 95.18
14 INF 76.19 50.00 76.19 50.00 76.19 50.00 76.19 50.00 76.19 50.00
15 CT 69.84 48.31 81.48 68.64 67.69 50.20 60.28 49.13 70.71 50.75
16 CV 92.22 79.13 92.59 80.53 92.47 80.06 92.47 80.06 91.98 78.25
17 MGT 75.10 45.17 74.93 46.26 74.53 46.66 75.73 44.80 75.10 45.17
18 WWQ 61.67 27.94 62.19 28.36 62.05 28.24 64.94 35.71 61.59 27.84
Better than SMOTE 10 12 12 13 8 8 11 10
Equals to SMOTE 5 4 4 3 3 3 3 3
Worse than SMOTE 3 2 2 2 7 7 4 5
that SMOTE-Out, the incorporation of SMOTE and SMOTE-Out, and Selected-SMOTE are able to boost the standard performance. The SMOTE-Out result indicates that new variation of minority samples can be achieved better by synthesizing samples outside the line connecting two vectors. Similar to SMOTE-Out, our hypothesis regarding Selected-SMOTE also agrees with the experiment result: the proposed approach is able to enrich the variation of minority examples better because it emphasizes only the dimensions of significant attributes. In our future work, we will further investigate various approaches to incorporating SMOTE with existing advanced SMOTE variants. We will also investigate other ways to improve the incorporation mechanism of SMOTE-Cosine; we still argue that the nearest neighbor is better calculated by considering both the direction of and the distance between two vectors.
REFERENCES
[1] Z. F. Ye and B. L. Lu, ”Learning Imbalanced Data Sets with a
Min-Max Modular Support Vector Machine”. In Proceedings
of International Joint Conference on Neural Network, Orlando,
Florida, USA, 2000.
[2] G. E. Batista, R. C. Prati and M. C. Monard, ”A study of the
behavior of several methods for balancing machine learning
training data”. In ACM SIGKDD Explorations Newsletter, 2000.
[3] C. Cardie and N. Howe, ”Improving minority class prediction
using case-specific feature weights”. In Proceedings of ICML,
pp. 57-65, 1997.
[4] K. Veropoulos, C. Campbell and N. Cristianini, ”Controlling
the sensitivity of support vector machines”. In Proceedings of
International joint conference on artificial intelligence, pp. 55-
60, 1999.
[5] Z. H. Zhou and X. Y. Liu, ”Training cost-sensitive neural networks with methods addressing the class imbalance problem”. In IEEE Transactions on Knowledge and Data Engineering, pp. 63-77, 2006.
[6] R. E. Schapire, ”A brief introduction to boosting”. In Proceed-
ings of Ijcai, pp. 1401-1406, 1999.
[7] W. Fan, S. J. Stolfo, J. Zhang and P.K. Chan, ”AdaCost:
misclassification cost-sensitive boosting”. In Proceedings of
ICML, pp. 97-105, 1999.
[8] C. Drummond and R.C. Holte, ”C4. 5, class imbalance, and
cost sensitivity: why under-sampling beats over-sampling”. In
Workshop on Learning from Imbalanced Datasets II, 2003.
[9] H. Han, W. Y. Wang and B. H. Mao, ”Borderline-SMOTE: A
new over-sampling method in imbalanced data sets learning”.
In Advances in intelligent computing, pp. 878-887, 2005.
[10] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, ”SMOTE: synthetic minority over-sampling technique”. In Journal of Artificial Intelligence Research 16, pp. 321-357, 2002.
[11] C. C. Chang and C. J. Lin, ”LIBSVM: a library for support
vector machines”. In ACM Transactions on Intelligent Systems
and Technology (TIST), pp. 27, 2011.