ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning

Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li
Abstract—This paper presents a novel adaptive synthetic
(ADASYN) sampling approach for learning from imbalanced
data sets. The essential idea of ADASYN is to use a weighted
distribution for different minority class examples according to
their level of difficulty in learning, where more synthetic data
is generated for minority class examples that are harder to
learn compared to those minority examples that are easier to
learn. As a result, the ADASYN approach improves learning
with respect to the data distributions in two ways: (1) reducing
the bias introduced by the class imbalance, and (2) adaptively
shifting the classification decision boundary toward the difficult
examples. Simulation analyses on several machine learning data
sets show the effectiveness of this method across five evaluation
metrics.
I. INTRODUCTION
LEARNING from imbalanced data sets is a relatively new
challenge for many of today’s data mining applications.
From applications in Web mining to text categorization to
biomedical data analysis [1], this challenge manifests itself
in two common forms: minority interests and rare instances.
Minority interests arise in domains where rare objects (minor-
ity class samples) are of great interest, and it is the objective
of the machine learning algorithm to identify these minority
class examples as accurately as possible. For instance, in
financial engineering, it is important to detect fraudulent credit
card activities in a large pool of transactions [2] [3]. Rare
instances, on the other hand, concern situations
where data representing a particular event is limited compared
to other distributions [4] [5], such as the detection of oil
spills from satellite images [6]. One should note that many
imbalanced learning problems are caused by a combination of
these two factors. For instance, in biomedical data analysis, the
data samples for different kinds of cancers are normally very
limited (rare instances) compared to normal non-cancerous
cases; therefore, the ratio of the minority class to the majority
class can be significant (at a ratio of 1 to 1000 or even
more [4][7][8]). On the other hand, it is essential to predict
the presence of cancers, or further classify different types of
cancers, as accurately as possible for earlier and proper treatment
(minority interests).
Haibo He, Yang Bai, and Edwardo A. Garcia are with the Department of
Electrical and Computer Engineering, Stevens Institute of Technology, Hobo-
ken, New Jersey 07030, USA (email: {hhe, ybai1, egarcia}@stevens.edu).
Shutao Li is with the College of Electrical and Information Engineering,
Hunan University, Changsha, 410082, China.(Email: shutao li@hnu.cn)
This work was supported in part by the Center for Intelligent Networked
Systems (iNetS) at Stevens Institute of Technology and the Excellent Youth
Foundation of Hunan Province (Grant No. 06JJ1010).
Generally speaking, imbalanced learning occurs whenever
some types of data distribution significantly dominate the
instance space compared to other data distributions. In this
paper, we focus on the two-class classification problem for
imbalanced data sets, a topic of major focus in recent research.
Recently, theoretical
analysis and practical applications for this problem have
attracted growing attention from both academia and industry.
This is reflected by the establishment of several major work-
shops and special issue conferences, including the American
Association for Artificial Intelligence workshop on Learning
from Imbalanced Data Sets (AAAI’00) [9], the International
Conference on Machine Learning workshop on Learning from
Imbalanced Data Sets (ICML’03) [10], and the Association
for Computing Machinery (ACM) Special Interest Group on
Knowledge Discovery and Data Mining explorations (ACM
SIGKDD Explorations’04) [11].
The state-of-the-art research methodologies to handle imbal-
anced learning problems can be categorized into the following
five major directions:
(1) Sampling strategies. This method aims to develop var-
ious oversampling and/or undersampling techniques to com-
pensate for imbalanced distributions in the original data sets.
For instance, in [12] the cost curves technique was used to
study the interaction of both oversampling and undersampling
with decision tree based learning algorithms. Sampling tech-
niques with the integration of probabilistic estimates, pruning,
and data preprocessing were studied for decision tree learning
in [13]. Additionally, in [14], “JOUS-Boost” was proposed
to handle imbalanced data learning by integrating adaptive
boosting with jittering sampling techniques.
(2) Synthetic data generation. This approach aims to overcome
imbalance in the original data sets by artificially generating
data samples. The SMOTE algorithm [15] generates an
arbitrary number of synthetic minority examples to shift the
classifier learning bias toward the minority class. SMOTE-
Boost, an extension work based on this idea, was proposed
in [16], in which the synthetic procedure was integrated with
adaptive boosting techniques to change the method of updating
weights to better compensate for skewed distributions. In order
to ensure optimal classification accuracy for both the minority and
majority classes, the DataBoost-IM algorithm was proposed in [17],
where synthetic data examples are generated for both minority
and majority classes through the use of “seed” samples.
978-1-4244-1821-3/08/$25.00 © 2008 IEEE
(3) Cost-sensitive learning. Instead of creating balanced
data distributions by sampling strategies or synthetic data
generation methods, cost-sensitive learning takes a different
approach to address this issue: it uses a cost matrix for
different types of errors or instances to facilitate learning from
imbalanced data sets. That is to say, cost-sensitive learning
does not modify the imbalanced data distribution directly;
instead, it targets this problem by using different cost-matrices
that describe the cost for misclassifying any particular data
sample. A theoretical analysis on optimal cost-sensitive learn-
ing for binary classification problems was studied in [18].
In [19] instead of using misclassification costs, an instance-
weighting method was used to induce cost-sensitive trees
and demonstrated better performance. In [20], Metacost, a
general cost-sensitive learning framework was proposed. By
wrapping a cost-minimizing procedure, Metacost can make
any arbitrary classifier cost-sensitive according to different
requirements. In [21], cost-sensitive neural network models
were investigated for imbalanced classification problems. A
threshold-moving technique was used in this method to adjust
the output threshold toward inexpensive classes, such that
high-cost (expensive) samples are unlikely to be misclassified.
(4) Active learning. Active learning techniques are conven-
tionally used to solve problems related to unlabeled training
data. Recently, various approaches to active learning from
imbalanced data sets have been proposed in the literature [1] [22]
[23] [24]. In particular, an active learning method based on
support vector machines (SVM) was proposed in [23] [24].
Instead of searching the entire training data space, this method
can effectively select informative instances from a random
set of training populations, therefore significantly reducing
the computational cost when dealing with large imbalanced
data sets. In [25], active learning was used to study the class
imbalance problems of word sense disambiguation (WSD)
applications. Various strategies including max-confidence and
min-error were investigated as the stopping criteria for the
proposed active learning methods.
(5) Kernel-based methods. Kernel-based methods have also
been used to study the imbalanced learning problem. By
integrating the regularized orthogonal weighted least squares
(ROWLS) estimator, a kernel classifier construction algorithm
based on orthogonal forward selection (OFS) was proposed in
[26] to optimize the model generalization for learning from
two-class imbalanced data sets. In [27], a kernel-boundary-
alignment (KBA) algorithm based on the idea of modifying
the kernel matrix according to the imbalanced data distribution
was proposed to solve this problem. Theoretical analyses in
addition to empirical studies were used to demonstrate the
effectiveness of this method.
In this paper, we propose an adaptive synthetic (ADASYN)
sampling approach to address this problem. ADASYN is
based on the idea of adaptively generating minority data
samples according to their distributions: more synthetic data
is generated for minority class samples that are harder to learn
compared to those minority samples that are easier to learn.
The ADASYN method can not only reduce the learning bias
introduced by the original imbalanced data distribution, but can
also adaptively shift the decision boundary to focus on those
difficult-to-learn samples.
The remainder of this paper is organized as follows. Section
II presents the ADASYN algorithm in detail, and discusses the
major advantages of this method compared to conventional
synthetic approaches for imbalanced learning problems. In
Section III, we test the performance of ADASYN on various
machine learning test benches. Various evaluation metrics are
used to assess the performance of this method against existing
methods. Finally, a conclusion is presented in Section IV.
II. ADASYN ALGORITHM
Motivated by the success of recent synthetic approaches
including SMOTE [15], SMOTEBoost [16], and DataBoost-
IM [17], we propose an adaptive method to facilitate learning
from imbalanced data sets. The objective here is two-fold:
reducing the bias and adaptively learning. The proposed
algorithm for the two-class classification problem is described
in [Algorithm ADASYN]:
[Algorithm - ADASYN]
Input
(1) Training data set $D_{tr}$ with $m$ samples $\{x_i, y_i\}$, $i = 1, ..., m$,
where $x_i$ is an instance in the $n$-dimensional feature
space $X$ and $y_i \in Y = \{1, -1\}$ is the class identity label
associated with $x_i$. Define $m_s$ and $m_l$ as the number of minority
class examples and the number of majority class examples,
respectively. Therefore, $m_s \leq m_l$ and $m_s + m_l = m$.
Procedure
(1) Calculate the degree of class imbalance:
$d = m_s / m_l$ (1)
where $d \in (0, 1]$.
(2) If $d < d_{th}$ then ($d_{th}$ is a preset threshold for the maximum
tolerated degree of class imbalance ratio):
(a) Calculate the number of synthetic data examples that
need to be generated for the minority class:
$G = (m_l - m_s) \times \beta$ (2)
where $\beta \in [0, 1]$ is a parameter used to specify the desired
balance level after generation of the synthetic data. $\beta = 1$
means a fully balanced data set is created after the generalization
process.
(b) For each example $x_i \in$ minority class, find $K$ nearest
neighbors based on the Euclidean distance in $n$-dimensional
space, and calculate the ratio $r_i$ defined as:
$r_i = \Delta_i / K, \quad i = 1, ..., m_s$ (3)
where $\Delta_i$ is the number of examples in the $K$ nearest
neighbors of $x_i$ that belong to the majority class, therefore
$r_i \in [0, 1]$;
(c) Normalize $r_i$ according to $\hat{r}_i = r_i / \sum_{i=1}^{m_s} r_i$, so that $\hat{r}_i$ is
a density distribution ($\sum_i \hat{r}_i = 1$)
(d) Calculate the number of synthetic data examples that
need to be generated for each minority example $x_i$:
$g_i = \hat{r}_i \times G$ (4)
where $G$ is the total number of synthetic data examples that
need to be generated for the minority class as defined in
Equation (2).
(e) For each minority class data example $x_i$, generate $g_i$
synthetic data examples according to the following steps:
Do the Loop from 1 to $g_i$:
(i) Randomly choose one minority data example, $x_{zi}$,
from the $K$ nearest neighbors for data $x_i$.
(ii) Generate the synthetic data example:
$s_i = x_i + (x_{zi} - x_i) \times \lambda$ (5)
where $(x_{zi} - x_i)$ is the difference vector in $n$-dimensional
spaces, and $\lambda$ is a random number: $\lambda \in [0, 1]$.
End Loop
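The procedure above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' implementation. The function name `adasyn` and its arguments are hypothetical, and two details that the algorithm leaves open are filled in as labeled assumptions: $g_i$ is rounded to the nearest integer, and the partner $x_{zi}$ is drawn from the $K$ nearest minority neighbors of $x_i$.

```python
import math
import random


def adasyn(X, y, beta=1.0, K=5, d_th=0.75, minority=1, seed=0):
    """Return a list of synthetic minority points for a two-class data set.

    X is a list of equal-length feature lists and y the list of labels.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    m_s = len(minority_idx)
    m_l = len(y) - m_s

    # Step (1): degree of class imbalance, Equation (1).
    d = m_s / m_l
    if d >= d_th:
        return []  # imbalance is within the tolerated level

    # Step (2a): total number of synthetic examples, Equation (2).
    G = (m_l - m_s) * beta

    def k_nearest(i, candidates):
        # Indices of the K candidates closest to X[i] (Euclidean distance).
        others = [j for j in candidates if j != i]
        others.sort(key=lambda j: math.dist(X[i], X[j]))
        return others[:K]

    # Step (2b): r_i = Delta_i / K, with neighbors taken from the whole set.
    r = []
    for i in minority_idx:
        neighbours = k_nearest(i, range(len(X)))
        delta = sum(1 for j in neighbours if y[j] != minority)
        r.append(delta / K)

    total = sum(r)
    if total == 0:
        return []  # assumption: nothing to do if no r_i is positive

    synthetic = []
    for i, r_i in zip(minority_idx, r):
        # Steps (2c)-(2d): normalize r_i and allocate g_i, Equation (4).
        g_i = int(round(r_i / total * G))  # assumption: round to nearest
        # Assumption: partners are the K nearest *minority* neighbors.
        partners = k_nearest(i, minority_idx)
        if not partners:
            continue
        for _ in range(g_i):
            # Step (2e): interpolate toward a random partner, Equation (5).
            z = rng.choice(partners)
            lam = rng.random()
            synthetic.append([a + (b - a) * lam for a, b in zip(X[i], X[z])])
    return synthetic
```

Calling `adasyn(X, y, beta=1.0, K=5)` returns only the new points, so they would be appended to the minority class before training. Because of the rounding assumption, the number of points returned can differ slightly from $G$.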
The key idea of the ADASYN algorithm is to use a density
distribution $\hat{r}_i$ as a criterion to automatically decide the
number of synthetic samples that need to be generated for
each minority data example. Physically, $\hat{r}_i$ is a measurement
of the distribution of weights for different minority class
examples according to their level of difficulty in learning.
The resulting data set after ADASYN will not only provide a
balanced representation of the data distribution (according to
the desired balance level defined by the $\beta$ coefficient), but it
will also force the learning algorithm to focus on those
difficult-to-learn examples. This is a major difference compared to the
SMOTE [15] algorithm, in which equal numbers of synthetic
samples are generated for each minority data example. Our
objective here is similar to those in SMOTEBoost [16] and
DataBoost-IM [17] algorithms: providing different weights for
different minority examples to compensate for the skewed
distributions. However, the approach used in ADASYN is
more efficient since both SMOTEBoost and DataBoost-IM
rely on the evaluation of hypothesis performance to update
the distribution function, whereas our algorithm adaptively
updates the distribution based on the data distribution char-
acteristics. Hence, there is no hypothesis evaluation required
for generating synthetic data samples in our algorithm.
Fig. 1 shows the classification error performance for different
$\beta$ coefficients for an artificial two-class imbalanced data
set. The training data set includes 50 minority class examples
and 200 majority class examples, and the testing data set
includes 200 examples. All data examples are generated by
multidimensional Gaussian distributions with different mean
and covariance matrix parameters. These results are based
on the average of 100 runs with a decision tree as the base
classifier. In Fig. 1, $\beta = 0$ corresponds to the classification
error based on the original imbalanced data set, while $\beta = 1$
represents a fully balanced data set generated by the ADASYN
algorithm. Fig. 1 shows that the ADASYN algorithm can
improve the classification performance by reducing the bias
introduced in the original imbalanced data sets. Furthermore,
it also demonstrates the tendency in error reduction as the balance
level is increased by ADASYN.
[Figure: plot titled “ADASYN algorithm for different $\beta$ coefficients”; x-axis: different $\beta$ coefficients; y-axis: error performance; annotations: “$\beta = 0$ corresponds to the original imbalanced data” and “$\beta = 1$ corresponds to fully balanced data after ADASYN algorithm”.]
Fig. 1. ADASYN algorithm for imbalanced learning
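As a rough worked example of Equations (1) and (2) for the artificial training set described above (50 minority and 200 majority examples), the arithmetic below shows how $\beta$ sets the number of synthetic examples. The intermediate value $\beta = 0.5$ is chosen here purely for illustration and is not taken from the reported experiments.

```latex
\begin{align*}
d &= m_s / m_l = 50 / 200 = 0.25 \\
G &= (m_l - m_s) \times \beta = 150\,\beta \\
\beta = 0   &:\quad G = 0   \quad\Rightarrow\quad 50 \text{ minority vs. } 200 \text{ majority examples} \\
\beta = 0.5 &:\quad G = 75  \quad\Rightarrow\quad 125 \text{ minority vs. } 200 \text{ majority examples} \\
\beta = 1   &:\quad G = 150 \quad\Rightarrow\quad 200 \text{ minority vs. } 200 \text{ majority examples}
\end{align*}
```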
III. SIMULATION ANALYSIS AND DISCUSSIONS
A. Data set analysis
We test our algorithm on various real-world machine learn-
ing data sets as summarized in Table 1. All these data sets are
available from the UCI Machine Learning Repository [28].
In addition, since our interest here is to test the learning
capabilities on two-class imbalanced problems, we made
modifications to several of the original data sets following
similar experiments reported in the literature [17] [29]. A
brief description of these modifications follows.
TABLE I
DATA SET CHARACTERISTICS USED IN THIS PAPER.
Data set name     # total examples   # minority examples   # majority examples   # attributes
Vehicle           846                199                   647                   18
Diabetes (PID)    768                268                   500                   8
Vowel             990                90                    900                   10
Ionosphere        351                126                   225                   34
Abalone           731                42                    689                   7
Vehicle dataset. This data set is used to classify a given
silhouette as one of four types of vehicles [30]. This dataset
has a total of 846 data examples and 4 classes (opel, saab,
bus and van). Each example is represented by 18 attributes. We
choose “Van” as the minority class and collapse the remaining
classes into one majority class. This gives us an imbalanced
two-class dataset, with 199 minority class examples and 647
majority class examples.
2008 International Joint Conference on Neural Networks (IJCNN 2008)
Pima Indian Diabetes dataset. This is a two-class data set
and is used to predict positive diabetes cases. It includes a
total of 768 cases with 8 attributes. We use the positive cases
as the minority class, which gives us 268 minority class cases
and 500 majority class cases.
Vowel recognition dataset. This is a speech recognition
dataset used to classify different vowels. The original dataset
includes 990 examples and 11 classes. Each example is represented
by 10 attributes. Since each vowel in the original data
set has 90 examples, we choose the first vowel as the minority
class and collapse the rest to be the majority class, which gives
90 and 900 minority and majority examples, respectively.
Ionosphere dataset. This data set includes 351 examples
with 2 classes (good radar returns versus bad radar returns).
Each example is represented by 34 numeric attributes. We
choose the “bad radar” instances as the minority class and the “good
radar” instances as the majority class, which gives us 126
minority class examples and 225 majority class examples.
Abalone dataset. This data set is used to predict the age
of abalone from physical measurements. The original data set
includes 4177 examples and 29 classes, and each example
is represented by 8 attributes. We choose class “18” as the
minority class and class “9” as the majority class as suggested
in [17]. In addition, we also removed the discrete feature
(feature “sex”) in our current simulation. This gives us 42
minority class examples and 689 majority class examples, each
represented by 7 numerical attributes.
B. Evaluation metrics for imbalanced data sets
Instead of using the overall classification accuracy as a
single evaluation criterion, we use a set of assessment metrics
related to receiver operating characteristics (ROC) graphs [31]
to evaluate the performance of the ADASYN algorithm. We use
ROC-based evaluation metrics because under the imbalanced
learning condition, traditional overall classification accuracy
may not be able to provide a comprehensive assessment
of the observed learning algorithm [17] [31] [32] [33] [6]
[34] [16]. Let $\{p, n\}$ be the positive and negative testing
examples and $\{Y, N\}$ be the classification results given by
a learning algorithm for positive and negative predictions. A
representation of classification performance can be formulated
by a confusion matrix (contingency table) as illustrated in
Fig. 2. We follow the suggestions of [15] [34] and use the
minority class as the positive class and the majority class as the
negative class.
Fig. 2. Confusion matrix for performance evaluation
Based on Fig. 2, the evaluation metrics used to assess
learning from imbalanced data sets are defined as:
Overall Accuracy (OA):
$OA = \frac{TP + TN}{TP + FP + FN + TN}$ (6)
Precision:
$Precision = \frac{TP}{TP + FP}$ (7)
Recall:
$Recall = \frac{TP}{TP + FN}$ (8)
F-Measure:
$F\text{-}Measure = \frac{(1 + \beta^2) \cdot recall \cdot precision}{\beta^2 \cdot recall + precision}$ (9)
where $\beta$ is a coefficient to adjust the relative importance of
precision versus recall (usually $\beta = 1$).
G-mean:
$G\text{-}mean = \sqrt{PositiveAccuracy \times NegativeAccuracy} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$ (10)
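Equations (6) to (10) translate directly into code. The sketch below assumes the usual reading of the confusion matrix entries (TP, FP, FN and TN as the true positive, false positive, false negative and true negative counts, with the minority class as positive). The function name `imbalance_metrics` is hypothetical, and the counts in the usage note are made up for illustration. No guard against empty classes is included, so a zero denominator raises an error.

```python
import math


def imbalance_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute the five metrics of Equations (6)-(10) from confusion counts."""
    precision = tp / (tp + fp)            # Equation (7)
    recall = tp / (tp + fn)               # Equation (8), positive accuracy
    negative_accuracy = tn / (tn + fp)    # second factor in Equation (10)
    return {
        "OA": (tp + tn) / (tp + fp + fn + tn),                      # Equation (6)
        "Precision": precision,
        "Recall": recall,
        # Equation (9), written with the same denominator as in the text.
        "F_measure": (1 + beta ** 2) * recall * precision
                     / (beta ** 2 * recall + precision),
        "G_mean": math.sqrt(recall * negative_accuracy),            # Equation (10)
    }
```

For example, `imbalance_metrics(tp=40, fp=10, fn=10, tn=140)` gives an overall accuracy of 0.9 while precision, recall and F-measure are all 0.8, which shows how overall accuracy can look better than the minority-class performance it hides.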
C. Simulation analyses
We use the decision tree as the base learning model in our
current study. According to the assessment metrics presented
in Section III-B, Table 2 illustrates the performance of the
ADASYN algorithm compared to the SMOTE algorithm. As
a reference, we also give the performance of the decision tree
learning based on the original imbalanced data sets. These
results are based on the average of 100 runs. At each run, we
randomly select half of the minority class and majority class
examples as the training data, and use the remaining half for
testing. For both SMOTE and ADASYN, we set the
number of nearest neighbors $K = 5$. Other parameters include
$N = 200$ for SMOTE according to [15], and $\beta = 1$ and $d_{th} = 0.75$
for ADASYN.
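To see what these settings imply, the sketch below applies Equations (1) and (2) to the class counts listed in Table I. It uses the full data set counts for simplicity, which is an assumption made here for illustration. Since each run trains on a random half of each class, the degree of imbalance per run stays about the same while the number of synthetic examples is roughly halved. The names `DATA_SETS` and `imbalance_summary` are hypothetical.

```python
# (name, minority count, majority count) copied from Table I.
DATA_SETS = [
    ("Vehicle", 199, 647),
    ("Diabetes (PID)", 268, 500),
    ("Vowel", 90, 900),
    ("Ionosphere", 126, 225),
    ("Abalone", 42, 689),
]

D_TH = 0.75   # threshold d_th used for ADASYN in this section
BETA = 1.0    # fully balanced target


def imbalance_summary(m_s, m_l, beta=BETA, d_th=D_TH):
    """Return (d, G): degree of imbalance and number of synthetic examples."""
    d = m_s / m_l                                  # Equation (1)
    G = (m_l - m_s) * beta if d < d_th else 0      # Equation (2), only if d < d_th
    return d, G


if __name__ == "__main__":
    for name, m_s, m_l in DATA_SETS:
        d, G = imbalance_summary(m_s, m_l)
        print(f"{name}: d = {d:.3f}, G = {G:.0f}")
```

Under these assumptions every data set in Table I falls below the threshold, so ADASYN generates synthetic examples for all five.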
For each data set, the best performance in each category
is highlighted. In addition, the total winning times for each
method across different evaluation metrics are also shown in
Table 2. Based on these simulation results, the ADASYN
algorithm can achieve competitive results on these five test
benches. As far as the overall winning times are concerned,
ADASYN outperforms the other methods, with 15 wins out of 25
comparisons against 8 for the decision tree and 2 for SMOTE. Furthermore,
the ADASYN algorithm also provides the best performance in
terms of G-mean for all data sets. This means our algorithm
provides improved accuracy for both minority and majority
classes and does not sacrifice one class in preference for
another. This is one of the advantages of our method in handling
imbalanced learning problems.
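The winning times can be recounted from the per-data-set values reported in Table 2. The sketch below is only a bookkeeping check: the numbers are copied from the table, and the names `TABLE_II` and `winning_times` are hypothetical.

```python
METHODS = ("Decision tree", "SMOTE", "ADASYN")
# Columns: OA, Precision, Recall, F-measure, G-mean (values from Table 2).
TABLE_II = {
    "Vehicle": {
        "Decision tree": [0.9220, 0.8454, 0.8199, 0.8308, 0.8834],
        "SMOTE": [0.9239, 0.8236, 0.8638, 0.8418, 0.9018],
        "ADASYN": [0.9257, 0.8067, 0.9015, 0.8505, 0.9168],
    },
    "Pima Indian Diabetes": {
        "Decision tree": [0.6831, 0.5460, 0.5500, 0.5469, 0.6430],
        "SMOTE": [0.6557, 0.5049, 0.6201, 0.5556, 0.6454],
        "ADASYN": [0.6837, 0.5412, 0.6097, 0.5726, 0.6625],
    },
    "Vowel recognition": {
        "Decision tree": [0.9760, 0.8710, 0.8700, 0.8681, 0.9256],
        "SMOTE": [0.9753, 0.8365, 0.9147, 0.8717, 0.9470],
        "ADASYN": [0.9678, 0.7603, 0.9560, 0.8453, 0.9622],
    },
    "Ionosphere": {
        "Decision tree": [0.8617, 0.8403, 0.7698, 0.8003, 0.8371],
        "SMOTE": [0.8646, 0.8211, 0.8032, 0.8101, 0.8489],
        "ADASYN": [0.8686, 0.8298, 0.8095, 0.8162, 0.8530],
    },
    "Abalone": {
        "Decision tree": [0.9307, 0.3877, 0.2929, 0.3249, 0.5227],
        "SMOTE": [0.9121, 0.2876, 0.3414, 0.3060, 0.5588],
        "ADASYN": [0.8659, 0.2073, 0.4538, 0.2805, 0.6291],
    },
}


def winning_times(table):
    """Count, per metric column, how often each method has the highest value."""
    wins = {method: [0] * 5 for method in METHODS}
    for rows in table.values():
        for k in range(5):
            best = max(METHODS, key=lambda method: rows[method][k])
            wins[best][k] += 1
    return wins
```

Summing each method's list gives its overall number of wins across the 25 data set and metric combinations.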
TABLE II
EVALUATION METRICS AND PERFORMANCE COMPARISON
Dataset                Methods         OA       Precision   Recall   F-measure   G-mean
Vehicle                Decision tree   0.9220   0.8454      0.8199   0.8308      0.8834
                       SMOTE           0.9239   0.8236      0.8638   0.8418      0.9018
                       ADASYN          0.9257   0.8067      0.9015   0.8505      0.9168
Pima Indian Diabetes   Decision tree   0.6831   0.5460      0.5500   0.5469      0.6430
                       SMOTE           0.6557   0.5049      0.6201   0.5556      0.6454
                       ADASYN          0.6837   0.5412      0.6097   0.5726      0.6625
Vowel recognition      Decision tree   0.9760   0.8710      0.8700   0.8681      0.9256
                       SMOTE           0.9753   0.8365      0.9147   0.8717      0.9470
                       ADASYN          0.9678   0.7603      0.9560   0.8453      0.9622
Ionosphere             Decision tree   0.8617   0.8403      0.7698   0.8003      0.8371
                       SMOTE           0.8646   0.8211      0.8032   0.8101      0.8489
                       ADASYN          0.8686   0.8298      0.8095   0.8162      0.8530
Abalone                Decision tree   0.9307   0.3877      0.2929   0.3249      0.5227
                       SMOTE           0.9121   0.2876      0.3414   0.3060      0.5588
                       ADASYN          0.8659   0.2073      0.4538   0.2805      0.6291
Winning times          Decision tree   2        5           0        1           0
                       SMOTE           0        0           1        1           0
                       ADASYN          3        0           4        3           5
There is another interesting observation that merits further
discussion. From Table 2 one can see that there are situations
in which learning from the original data set can actually achieve
better performance for certain assessment criteria, such as
precision. This raises an important question:
generally speaking, to what extent can imbalanced learning
methods such as adjusting the class balance help the learning
capabilities? This is a fundamental and critical question in
this domain. In fact, the importance of this question was
previously raised by F. Provost in the invited paper for the
AAAI’2000 Workshop on Imbalanced Data Sets [1]:
“Isn’t the best research strategy to concentrate on how
machine learning algorithms can deal most effectively with
whatever data they are given?”
Based on our simulation results, we believe that this
fundamental question should be investigated in more depth
both theoretically and empirically in the research community
to correctly understand the essence of imbalanced learning
problems.
D. Discussions
As a new learning method, ADASYN can be further extended
to handle imbalanced learning in different scenarios,
therefore potentially benefiting a wide range of real-world
applications for learning from imbalanced data sets. We give a
brief discussion on possible future research directions in this
section.
First of all, in our current study, we compared the
ADASYN algorithm to a single decision tree and the SMOTE
algorithm [15] for performance assessment. This is mainly
because all of these methods are single-model based learning
algorithms. Statistically speaking, ensemble based learning
algorithms can improve the accuracy and robustness of learning
performance; thus, as a future research direction, the ADASYN
algorithm can be extended for integration with ensemble
based learning algorithms. To do this, one will need to use
a bootstrap sampling technique to sample the original training
data sets, and then embed ADASYN in each sampled set to
train a hypothesis. Finally, a weighted combination voting rule
similar to AdaBoost.M1 [35] [36] can be used to combine
all decisions from different hypotheses for the final predicted
outputs. In such a situation, it would be interesting to compare the
performance of such a boosted ADASYN algorithm with those
of SMOTEBoost [16], DataBoost-IM [17] and other ensemble
based imbalanced learning algorithms.
Secondly, ADASYN can be generalized to multiple-class
imbalanced learning problems as well. Although two-class
imbalanced classification problems dominate the research
activities in today’s research community, this is not a limitation
to our method. To extend the ADASYN idea to multi-class
problems, one first needs to calculate and sort the degree
of class imbalance for each class with respect to the most
significant class, $y_s \in Y = \{1, ..., C\}$, which is defined as
the class identity label with the largest number of examples.
Then for all classes that satisfy the condition $d < d_{th}$, the
ADASYN algorithm is executed to balance them according to
their own data distribution characteristics. In this situation, the
update of $r_i$ in equation (3) can be modified to reflect different
needs in different applications. For instance, if one would like
to balance the examples in class $y_k$ ($y_k \in \{1, ..., C\}$ and
$y_k \neq y_s$), then the definition of $\Delta_i$ in equation (3) can be
defined as the number of examples in the nearest neighbors
belonging to class $y_s$, or belonging to all other classes except
$y_k$ (similar to transforming the calculation of the nearest
neighbors to a Boolean type function: belonging to $y_k$ or not
belonging to $y_k$).
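The two bookkeeping steps of this multi-class extension can be sketched as follows. This is one possible reading of the paragraph above rather than a tested extension of the algorithm, and the function names `imbalance_degrees` and `delta_i` are hypothetical.

```python
from collections import Counter


def imbalance_degrees(labels):
    """Degree of imbalance of every class relative to the largest class y_s."""
    counts = Counter(labels)
    m_largest = max(counts.values())
    return {cls: n / m_largest for cls, n in counts.items()}


def delta_i(neighbour_labels, y_k, y_s=None):
    """Delta_i for a point of class y_k, given the labels of its K neighbors.

    If y_s is given, count only neighbors from the largest class y_s.
    Otherwise count every neighbor that does not belong to y_k
    (the Boolean "belongs to y_k or not" variant).
    """
    if y_s is not None:
        return sum(1 for label in neighbour_labels if label == y_s)
    return sum(1 for label in neighbour_labels if label != y_k)
```

With labels made of ten examples of class 1, four of class 2 and eight of class 3, `imbalance_degrees` returns 1.0, 0.4 and 0.8, so with $d_{th} = 0.75$ only class 2 would be oversampled. The two variants of `delta_i` then differ in whether neighbors from class 3 count toward the difficulty of a class 2 example.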
Furthermore, the ADASYN algorithm can also be modified
to facilitate incremental learning applications. Most current
imbalanced learning algorithms assume that representative
data samples are available during the training process. How-
ever, in many real-world applications such as mobile sensor
networks, Web mining, surveillance, homeland security, and
communication networks, training data may continuously be-
come available in small chunks over a period of time. In
this situation, a learning algorithm should have the capability
to accumulate previous experience and use this knowledge
to learn additional new information to aid prediction and
future decision-making processes. The ADASYN algorithm
can potentially be adapted to such an incremental learning
scenario. To do this, one will need to dynamically update the $r_i$
distribution whenever a new chunk of data samples is received.
This can be accomplished by an online learning and evaluation
process.
IV. CONCLUSION
In this paper, we propose a novel adaptive learning algorithm,
ADASYN, for imbalanced data classification problems.
Based on the original data distribution, ADASYN can
adaptively generate synthetic data samples for the minority
class to reduce the bias introduced by the imbalanced data
distribution. Furthermore, ADASYN can also autonomously
shift the classifier decision boundary to be more focused on
those difficult-to-learn examples, therefore improving learning
performance. These two objectives are accomplished by a
dynamic adjustment of weights and an adaptive learning
procedure according to data distributions. Simulation results
on five data sets based on various evaluation metrics show the
effectiveness of this method.
Imbalanced learning is a challenging and active research
topic in artificial intelligence, machine learning, data
mining and many related areas. We are currently investigating
various issues, such as multi-class imbalanced learning
and incremental imbalanced learning. Motivated by the results
in this paper, we believe that ADASYN may provide a
powerful method in this domain.
REFERENCES
[1] F. Provost, “Machine Learning from Imbalanced Data Sets 101,” Invited
paper for the AAAI’2000 Workshop on Imbalanced Data Sets, Menlo
Park, CA, 2000.
[2] P. K. Chan, W. Fan, A. L. Prodromidis, and S. J. Stolfo, “Distributed Data
Mining in Credit Card Fraud Detection,” IEEE Intelligent Systems, pp.
67-74, November/December 1999.
[3] P. K. Chan and S. J. Stolfo, “Toward scalable learning with non-uniform
class and cost distributions: a case study in credit card fraud detection,”
in Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD’01),
pp. 164-168, 2001.
[4] G. M. Weiss, “Mining with Rarity: A Unifying Framework,” SIGKDD
Explorations, 6(1):7-19, 2004.
[5] G. M. Weiss, “Mining Rare Cases,” in O. Maimon and L. Rokach (eds.),
Data Mining and Knowledge Discovery Handbook: A Complete Guide
for Practitioners and Researchers, Kluwer Academic Publishers, pp. 765-
776, 2005.
[6] M. Kubat, R. C. Holte, and S. Matwin, “Machine Learning for the
Detection of Oil Spills in Satellite Radar Images,” Machine Learning,
30(2):195-215, 1998.
[7] H. He and X. Shen, “A Ranked Subspace Learning Method for Gene
Expression Data Classification,” in Proc. Int. Conf. Artificial Intelligence
(ICAI’07), pp. 358 - 364, June 2007
[8] R. Pearson, G. Goney, and J. Shwaber, “Imbalanced Clustering for
Microarray Time-Series,” in Proc. ICML’03 workshop on Learning from
Imbalanced Data Sets, 2003
[9] N. Japkowicz, (Ed.), “Learning from Imbalanced Data Sets,” the AAAI
Workshop, Technical Report WS-00-05, American Association for Arti-
ficial Intelligence, Menlo Park, CA, 2000.
[10] N. V. Chawla, N. Japkowicz, and A. Kolcz, (Eds.), “Imbalanced
Clustering for Microarray Time-Series,” in Proc. ICML’03 Workshop on
Learning from Imbalanced Data Sets, 2003
[11] N. V. Chawla, N. Japkowicz and A. Kolcz, SIGKDD Explorations:
Special issue on Learning from Imbalanced Datasets, vol.6, issue 1, 2004.
[12] C. Drummond and R. Holte, “C4.5, Class Imbalance, and Cost Sen-
sitivity: Why Under-sampling Beats Oversampling,” in Proc. ICML’03
Workshop on Learning from Imbalanced Data Sets, 2003
[13] N. Chawla, “C4.5 and Imbalanced Datasets: Investigating the Effect of
Sampling Method, Probabilistic Estimate, and Decision Tree Structure,” in
ICML-KDD’03 Workshop: Learning from Imbalanced Data Sets, 2003
[14] D. Mease, A. J. Wyner, and A. Buja, “Boosted Classification Trees
and Class Probability/Quantile Estimation,” Journal of Machine Learning
Research, vol. 8, pp. 409- 439, 2007.
[15] N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer,
“SMOTE: Synthetic Minority Oversampling TEchnique,” Journal of
Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[16] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, “Smoteboost:
Improving Prediction of the Minority Class in Boosting,” in Proc.
European Conf. Principles and Practice of Knowledge Discovery in
Databases, pp. 107-119, Dubrovnik, Croatia, 2003.
[17] H. Guo and H. L. Viktor, “Learning from Imbalanced Data Sets with
Boosting and Data Generation: the DataBoost-IM Approach,” in SIGKDD
Explorations: Special issue on Learning from Imbalanced Datasets, vol. 6,
issue 1, pp. 30-39, 2004.
[18] C. Elkan, “The foundations of cost-sensitive learning,” in Proc. Int. Joint
Conf. Artificial Intelligence (IJCAI’01), pp. 973-978, 2001.
[19] K. M. Ting, “An instance-weighting method to induce cost-sensitive
trees,” IEEE Transactions on Knowledge and Data Engineering, vol. 14,
pp. 659-665, 2002.
[20] P. Domingos, “Metacost: A general method for making classifiers cost-
sensitive,” in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and
Data Mining, pp. 155-164, San Diego, CA, 1999.
[21] Z. H. Zhou and X. Y. Liu, “Training Cost-Sensitive Neural Networks
with Methods Addressing the Class Imbalance Problem,” IEEE Trans-
actions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63-77,
2006.
2008 International Joint Conference on Neural Networks (IJCNN 2008) 1327
[22] N. Abe, “Invited talk: Sampling Approaches to Learning From
Imbalanced Datasets: Active Learning, Cost Sensitive Learning and Beyond,”
in ICML-KDD’03 Workshop: Learning from Imbalanced Data Sets, 2003.
[23] S. Ertekin, J. Huang, and C. L. Giles, “Active Learning for Class
Imbalance Problem,” in Proc. Annual Int. ACM SIGIR Conf. Research
and Development in Information Retrieval, pp. 823-824, Amsterdam,
Netherlands, 2007.
[24] S. Ertekin, J. Huang, L. Bottou, and C. L. Giles, “Learning on the
Border: Active Learning in Imbalanced Data Classification,” in CIKM’07,
November 6-8, 2007, Lisboa, Portugal.
[25] J. Zhu and E. Hovy, “Active Learning for Word Sense Disambiguation
with Methods for Addressing the Class Imbalance Problem,” in Proc.
Joint Conf. Empirical Methods in Natural Language Processing and
Computational Natural Language Learning, pp. 783-790, Prague, June
2007.
[26] X. Hong, S. Chen, and C. J. Harris, “A Kernel-Based Two-Class
Classifier for Imbalanced Data Sets,” IEEE Transactions on Neural
Networks, vol. 18, no. 1, pp. 28-41, 2007.
[27] G. Wu and E. Y. Chang, “KBA: Kernel Boundary Alignment Consid-
ering Imbalanced Data Distribution,” IEEE Transactions on Knowledge
and Data Engineering, vol. 17, no. 6, pp. 786-795, 2005.
[28] UCI Machine Learning Repository, [online], available:
http://archive.ics.uci.edu/ml/
[29] F. Provost, T. Fawcett, and R. Kohavi, “The Case Against Accuracy
Estimation for Comparing Induction Algorithms,” in Proc. Int. Conf.
Machine Learning, pp. 445-453, Madison, WI, Morgan Kaufmann, 1998.
[30] J. P. Siebert, “Vehicle Recognition Using Rule Based Methods,” Turing
Institute Research Memorandum TIRM-87-018, March 1987.
[31] T. Fawcett, “ROC Graphs: Notes and Practical Considerations for Data
Mining Researchers,” Technical Report HPL-2003-4, HP Labs, 2003.
[32] F. Provost and T. Fawcett, “Analysis and Visualization of Classifier
Performance: Comparison Under Imprecise Class and Cost Distributions,”
in Proc. Int. Conf. Knowledge Discovery and Data Mining, Menlo Park,
CA, AAAI Press, pp. 43-48, 1997.
[33] M. A. Maloof, “Learning When Data Sets Are Imbalanced and When
Costs Are Unequal and Unknown,” in ICML’03 Workshop on Learning from
Imbalanced Data Sets II, 2003.
[34] M. Kubat and S. Matwin, “Addressing the Curse of Imbalanced Training
Sets: One-sided Selection,” in Proc. Int. Conf. Machine Learning, San
Francisco, CA, Morgan Kaufmann, pp. 179-186, 1997.
[35] Y. Freund and R. E. Schapire, “Experiments With a New Boosting
Algorithm,” in Proc. Int. Conf. Machine Learning (ICML’96), pp. 148-
156, 1996.
[36] Y. Freund and R. E. Schapire, “Decision-theoretic Generalization of
On-line Learning and Application to Boosting,” J. Computer and Syst.
Sciences, vol. 55, no. 1, pp. 119-139, 1997.