A HYBRID APPROACH TO THE PROBLEM OF CLASS
IMBALANCE
Jeannie Fitzgerald and Conor Ryan
Biocomputing and Developmental Systems Group
University of Limerick
Ireland
jeannie.fitzgerald@ul.ie conor.ryan@ul.ie
Abstract: In Machine Learning classification tasks, the class imbalance problem is an important one which has received a lot of attention in the last few years. In binary classification, class imbalance occurs when there are significantly fewer examples of one class than the other. A variety of strategies have been applied to the problem with varying degrees of success. Typically, previous approaches have involved attacking the problem either algorithmically or by manipulating the data in order to mitigate the imbalance. We propose a hybrid approach which combines Proportional Individualised Random Sampling (PIRS) with two different fitness functions designed to improve performance on imbalanced classification problems in Genetic Programming. We investigate the efficacy of the proposed methods together with that of five different algorithmic GP solutions, two of which are taken from the recent literature. We conclude that the PIRS approach, combined with either average accuracy or Matthews Correlation Coefficient, delivers superior results in terms of AUC score when applied to either balanced or imbalanced datasets.
Keywords: Genetic Programming, Binary Classification, Class Imbalance Problem, Over Sampling, Under Sam-
pling
1 Introduction
Each day 2.5 quintillion bytes of data are created. This is a relatively recent phenomenon, such that 90% of
the data in the world today has been created in the last two years alone [1]. This explosion in data offers
tremendous opportunities for knowledge acquisition and decision support, but the potential for unleashing the
power of these insights is balanced by several complex challenges. Aside from the problem of handling the sheer
volume of data, there is the challenge of identifying those instances which may be interesting or useful, in an
environment where such items may be in the minority. From a machine learning perspective, at its simplest,
this can be viewed as a binary classification problem.
In binary classification tasks, the class imbalance problem arises where there is a disparity in the number of
instances of each class in a particular dataset. Greater disparity makes classification tasks more difficult, as there
is an inherent bias towards the class which has greater representation in the dataset. When a machine learning
algorithm, designed for general classification tasks, is confronted with significant imbalance, the “intelligent”
thing for it to do is to classify all instances as belonging to the majority class. Ironically, it is frequently the
case that the minority class is the one which contains the most important or interesting instances. In datasets
from the medical domain, for example, it is generally the case that instances which represent malignancy or
disease are far fewer than those which do not.
GP, like other approaches which adhere to the evolutionary computation paradigm, is realised through the evolution of a population of individuals over time (generations); this makes it a very flexible and potentially granular means of tackling this type of problem. We have chosen to investigate
a hybrid approach which seeks to influence the learning process at both the individual and population levels,
using a strategy which combines sampling and algorithmic techniques. In this work, we propose a new sampling technique, which we call Proportional Individualised Random Sampling (PIRS), and combine it with the Matthews Correlation Coefficient and balanced accuracy.
2 Previous Work
The class imbalance problem is an important one which has generated a lot of interest in the research community
in recent years. In general, approaches can be divided between those which tackle the imbalance at the data
level, and those which seek an algorithmic solution. There have also been several hybrid techniques proposed
which combine aspects of the other two.
Methods which operate on the data try to repair the imbalance by creating more balanced datasets for
training purposes. This is done by under-sampling the majority class or over-sampling the minority class,
where the former involves removing some examples of the majority class and the latter is accomplished by
adding duplicate copies of minority instances until some predefined measure of balance is achieved. Over or
under-sampling may be random in nature [2] or “informed” [3], where in the latter, various criteria are used to
determine which instances from the majority class should be discarded. An interesting approach called SMOTE
(Synthetic Minority Oversampling Technique) was suggested by Chawla et al. [4] in which rather than over
sampling the minority class with replacement they generated new synthetic examples.
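The basic data-level strategies can be sketched in a few lines. The following toy example is our own illustration with hypothetical helper names, not code from any of the cited works; it shows random under- and over-sampling applied to a 90/10 class split:

```python
import random

def random_undersample(majority, minority, seed=0):
    """Randomly discard majority examples until the classes are balanced."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def random_oversample(majority, minority, seed=0):
    """Randomly duplicate minority examples (with replacement) until balanced."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

# Toy data: 90 negative (majority) and 10 positive (minority) instances.
majority = [(i, 0) for i in range(90)]
minority = [(i, 1) for i in range(10)]

under = random_undersample(majority, minority)
over = random_oversample(majority, minority)
print(len(under), len(over))  # 20 180
```

Under-sampling shrinks the dataset to twice the minority size, while over-sampling grows it to twice the majority size, which illustrates the data-loss and computational-cost trade-offs discussed below.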
At the algorithmic level Joshi et al. [5] modified the well known AdaBoost [6] algorithm so that different
weights were applied for boosting instances of each class. Akbani et al. [7] modified the kernel function in
a Support Vector Machine implementation to use an adjusted decision threshold. Class imbalance tasks are
closely related to cost based learning problems, where misclassification costs are not the same for both classes.
Adacost [8] and MetaCost [9] are examples of this approach. See [10–12] for detailed reviews of these and
various other approaches found in the literature.
2.1 Genetic Programming (GP)
In the field of GP, much of the work on algorithmic approaches has been undertaken by Bhowan et al. [13–15]
in which they have studied the efficacy of a wide range of different fitness functions on various imbalanced data
sets. In this work we compare with two of those methods: Correlation Ratio based fitness, and Geometric Mean
based fitness, with which the researchers reported good results. These are described in Sections 4.1.4 and 4.1.5.
In other work, Patterson and Zhang [16] investigated the use of average accuracy as a fitness function and also
a modified version which used the squares of the individual accuracies for each class. Both methods resulted in
improved performance on the minority class and a more balanced classification overall.
With regard to sampling approaches in GP, Hunt et al. [17] examined several different sampling approaches
including under sampling, over sampling and a combined approach. In each case they maintained equal numbers
of instances from both classes in their training set and sampled the majority class with replacement. While
they found that the various sampling approaches improved the classification accuracy on the minority class,
performance on the majority class decreased. Overall, they reported that the method was not as successful as
algorithmic approaches previously suggested by Bhowan et al. [14].
In other work, Doucette and Heywood [18] proposed a Simple Active Learning Heuristic (SALH): a hybrid
approach which combined a simplified version of the Random Subset Selection algorithm proposed by Gathercole
and Ross [19], together with a modified Wilcoxon-Mann-Whitney statistic. They reported that their hybrid
approach compared favourably with several other machine learning algorithms.
3 A Hybrid Approach: Proportional Individualised Random Sampling (PIRS)
with Matthews Correlation Coefficient or Average Accuracy
There are several disadvantages associated with the use of over or under sampling strategies for tackling the class
imbalance problem. The obvious disadvantage with under-sampling is that it discards potentially useful data.
The main drawback with standard oversampling is that it introduces exact copies of minority instances which
may increase the potential for over-fitting. Also, the use of over-sampling increases the size of the dataset,
thus adding to the computational cost. Here we propose a sampling approach which we call Proportional
Individualised Random Sampling (PIRS) which either eliminates or mitigates these disadvantages.
Firstly, the size of the dataset is exactly the same as the original, so there is none of the additional computational cost generally incurred with random over sampling. Instead, with the new sampling technique, we vary the
number of instances of each class maintaining the original size of the dataset. At each generation and for each
individual in the population the percentage of majority instances is randomly selected in the range between
the percentages of minority (positive) and majority (negative) instances in the original distribution. Then, that
particular individual is trained on that percentage of majority instances with instances of the minority class
making up the remainder of the data. In both cases, each instance is randomly selected with replacement.
For example, in the case of the Yeast1.5 dataset, where 98.5% of the data makes up the majority class and 1.5% the minority, the training data for a given individual will be divided into n percent majority instances, where 1.5 <= n <= 98.5, and m percent minority instances, where m = 100 − n. In this way, individuals within the population are trained with different distributions of the data within the range of the original distribution.
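As a minimal sketch of the sampling rule just described (illustrative Python, not the implementation used in our experiments), the per-individual sample for a 98.5/1.5 split could be drawn as follows:

```python
import random

def pirs_sample(majority, minority, rng):
    """Draw one PIRS-style training sample for a single individual:
    the majority share n is chosen uniformly between the original
    minority and majority percentages, then instances are drawn with
    replacement so the sample keeps the original dataset size."""
    total = len(majority) + len(minority)
    min_pct = 100.0 * len(minority) / total      # e.g. 1.5 for Yeast1.5
    n = rng.uniform(min_pct, 100.0 - min_pct)    # majority share for this individual
    n_maj = round(total * n / 100.0)
    sample = [rng.choice(majority) for _ in range(n_maj)]
    sample += [rng.choice(minority) for _ in range(total - n_maj)]
    return sample

rng = random.Random(1)
majority = [(i, 0) for i in range(985)]          # 98.5% negative
minority = [(i, 1) for i in range(15)]           # 1.5% positive
sample = pirs_sample(majority, minority, rng)
print(len(sample))  # 1000, the same size as the original dataset
```

Calling this once per individual, at every generation, gives each individual its own class distribution while the population as a whole still sees all of the data.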
The benefit of this approach from the under sampling perspective is that while the majority class may not
be fully represented at the level of the individual, all of the data for that class is available to the population
as a whole. Because all of the available knowledge is spread across the population the system is less likely to
suffer from the loss of useful data that is normally associated with under sampling techniques. From the over sampling viewpoint, over-fitting may be less likely as the distribution of instances of each class is varied for
each individual at every generation. Also, as all sampling is done with replacement, there will be duplicates of
negative as well as positive instances.
A useful advantage of our proposed approach is that it is equally applicable to both balanced and unbalanced
datasets. Previous work [20] has shown that aside from the consideration of balance in the distribution of
instances, the use of random sampling techniques may have a beneficial effect in reducing over-fitting. Thus,
we believe that the proposed sampling approach can offer improved performance on a wide range of binary
classification tasks, whether a particular dataset is balanced or not. This important proposition was addressed succinctly by Provost [21] in his invited paper for the AAAI 2000 Workshop on Imbalanced Data Sets: “isn’t the best research strategy to concentrate on how machine learning algorithms can deal most effectively with whatever data they are given?”
We combine the PIRS sampling technique with two different fitness functions which are designed to function
well with either balanced or unbalanced data: Average Accuracy and Matthews Correlation Coefficient. In
Machine Learning, Matthews Correlation Coefficient is widely regarded as a good measure for evaluating the
performance of a given model on binary classification tasks, in part because it has fewer inherent biases than
some other popular methods [22]. But also, because it is considered suitable for both balanced and imbalanced
data sets. Here, rather than using the measure to assess the performance of our model, we investigate its use
in the actual evolution of the model by incorporating it as a fitness function as described in Equation 1.
MCC(P) = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (1)
MCC is regarded as a balanced measure of the quality of a binary classifier, which can be used even if the
classes are of different sizes. It is, in essence, a correlation coefficient between the observed and predicted binary
classifications. MCC returns a value between −1 and +1, where a value of +1 represents a perfect prediction, a value of 0 represents a prediction no better than random, and a value of −1 represents total disagreement between predicted and observed class labels.
In addition to investigating the use of PIRS with Matthews Correlation Coefficient, we also study the combination of PIRS and average accuracy, also known as balanced accuracy, which is a well known performance measure used in classification. This method modifies the calculation of overall accuracy to better emphasise the performance on each class, as shown in Equation 2.
AVGA(P) = 0.5 · (TP / (TP + FN) + TN / (TN + FP))    (2)
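Both measures are simple to compute from the confusion-matrix counts. The following sketch (our own illustrative code; defining MCC as 0 when a marginal sum is zero is a common convention not specified above) shows why they are preferable to overall accuracy on imbalanced data:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient (Equation 1); taken as 0
    when any marginal sum is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def avg_accuracy(tp, tn, fp, fn):
    """Average (balanced) accuracy (Equation 2)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

# A degenerate classifier that labels everything as the majority
# (negative) class scores well on overall accuracy but not here:
tp, tn, fp, fn = 0, 95, 0, 5
print((tp + tn) / (tp + tn + fp + fn))  # overall accuracy: 0.95
print(avg_accuracy(tp, tn, fp, fn))     # balanced accuracy: 0.5
print(mcc(tp, tn, fp, fn))              # MCC: 0.0
```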
4 Experimental Set-up
4.1 Configurations
In much of the literature on binary classification the classes in question are often identified as being positive
or negative, where instances of the positive class are usually (but not always) in the minority. This situation
is common, for example, in medical diagnosis, where the number of patients with the disease or condition
of interest are generally fewer in number than those without the disease. The results of a classifier can be represented by a confusion matrix, as shown in Table 1, where TP, TN, FP and FN represent the number of instances which fall into the corresponding categories: True Positive, True Negative, False Positive and False Negative.
Table 1: Confusion Matrix
Prediction
Positive Negative
Truth Positive TP FN
Negative FP TN
For the purpose of discussing class imbalance, we are interested in the majority and minority classes, where
the majority class corresponds to the negative class and the minority class to the positive class. TP represents
the number of minority class instances correctly classified, TN the number of majority class instances correctly
classified, FP the number of majority class instances which have been incorrectly classified as belonging to the
minority class, and FN the number of minority class instances which have been mis-classified as majority class
instances. In describing the various experimental configurations below, we adhere to the standard nomenclature
for clarity.
4.1.1 Standard GP (StdGP)
The fitness measure used for the standard GP configuration is a commonly used measure of “overall” classification accuracy. If a program P correctly classifies all instances, its overall accuracy will be 1. The fitness function for each program P is 1 − Accuracy(P), where Accuracy(P) is as described in Equation 3.
Accuracy(P) = (TP + TN) / (TP + TN + FP + FN)    (3)
4.1.2 Standard GP with Average Accuracy (AVGA)
For the second configuration, we use a slight modification of the overall accuracy, which aims to maximise
the average of the accuracy over both classes. The fitness function to be minimised is 1 − AVGA(P), where AVGA(P) is described by Equation 2.
4.1.3 GP with Matthews Correlation Coefficient (MCC)
For this configuration we employ a standard GP implementation with the Matthews Correlation Coefficient as
the fitness function. This fitness function is described in Section 3 and Equation 1.
4.1.4 GP with Correlation Based Fitness (CORR)
Bhowan et al. [13] proposed a correlation ratio fitness measure to mitigate bias introduced by class imbalance
for image classification problems. In this method the correlation ratio is used to measure how well the outputs
of a GP Individual for the minority and majority classes are separated with respect to each other. The higher
the correlation ratio achieved by a particular model, the better the classification performance. This fitness
function is aimed at evolving solutions that perform equally well on both classes with the minimum loss to the
overall classification rate. The correlation ratio “r” (generalised for M classes) is described in Equation 4.
r(P) = √( Σ_{c=1..M} N_c (μ̄_c − μ̄)² / Σ_{c=1..M} Σ_{i=1..N_c} (P_ci − μ̄)² )    (4)
Where μ̄_c is the mean of the outputs of the program for instances of class c, μ̄ is the mean of the program outputs over all classes, M is the number of classes, N is the total number of instances, N_c is the number of examples of class c, and P_ci represents the output of a genetic program classifier P when evaluated on the ith example belonging to class c. This equation returns a value between 0 and 1, where values closer to 1 indicate better separability.
The researchers also imposed an indicator function to guide the evolution such that outputs for instances of the majority class would be greater than zero, and outputs for instances of the minority class would be less than zero. Their final fitness function is shown in Equation 5.
Correlation(P) = r + I(μ̄_minority, μ̄_majority)    (5)
Where the indicator function I returns 1 if the means of the minority and majority observations are positive and negative respectively, and 0 otherwise. Thus, the final fitness function returns a value between 0 and 2, where values closer to 2 represent good fitness, and those nearer to 0, poor fitness.
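A minimal two-class sketch of this fitness function, written directly from Equations 4 and 5 (our own illustrative code, not the cited implementation; the indicator's sign convention follows the description of Equation 5 above):

```python
import math

def correlation_fitness(minority_out, majority_out):
    """Correlation-ratio fitness (Equations 4 and 5) for the two-class case."""
    outputs = minority_out + majority_out
    mu = sum(outputs) / len(outputs)
    mu_min = sum(minority_out) / len(minority_out)
    mu_maj = sum(majority_out) / len(majority_out)
    # between-class variance over total variance of the program outputs
    between = (len(minority_out) * (mu_min - mu) ** 2
               + len(majority_out) * (mu_maj - mu) ** 2)
    total = sum((p - mu) ** 2 for p in outputs)
    r = math.sqrt(between / total) if total else 0.0
    # Indicator term: rewards a positive minority mean and a
    # negative majority mean, as stated for Equation 5.
    i = 1 if mu_min > 0 and mu_maj < 0 else 0
    return r + i

# Well-separated outputs on the expected sides of zero score close to 2:
print(correlation_fitness([2.0, 2.1, 1.9], [-1.0, -1.1, -0.9]))
```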
4.1.5 GP with Geometric Mean based Fitness (GMF)
In other work, Bhowan et al. [14] proposed a fitness function using a geometric mean as shown in Equation 6.
GMF(P) = √( (TP / (TP + FN)) · (TN / (TN + FP)) )    (6)
This function has the property that if the number of instances of either class correctly classified is zero, then
the geometric mean itself will also be zero, which has the effect of penalising individuals which perform badly
on one or other class.
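A short sketch of this measure (illustrative code, not taken from the cited work) makes the zeroing property explicit:

```python
import math

def gmf(tp, tn, fp, fn):
    """Geometric-mean fitness (Equation 6): the square root of the
    product of the per-class accuracies."""
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

# Zero accuracy on either class zeroes the whole fitness:
print(gmf(0, 95, 0, 5))    # 0.0
print(gmf(40, 90, 5, 10))  # rewards balanced performance on both classes
```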
4.1.6 Individualised Random Sampling with Balanced Fitness Function (PIRS-BAL)
In this configuration, we employ the balanced fitness function defined in Equation 2, but we also randomly select training instances to train each individual. The data is randomly selected with replacement, varying the
proportions of minority and majority class instances. The detail of our sampling technique is as previously
described in Section 3.
Table 2: GP Parameters
Parameter Value
Strategy Generational
Initialisation Ramped half-and-half
Selection Tournament
Tournament Size 2
Crossover 80
Mutation 20
Initial Min Depth 1
Initial Max Depth 6
Max Depth 17
Function Set + - * /
ERC -5 to +5
Population 500
Max Gen 60
Table 3: Data Sets [23]
Data Set Acronym Features Instances %Minority
Bupa Liver Disorders BUPA42 7 345 42
Habermans Survival HS36 4 306 36
Yeast Yeast16 8 1484 16
Yeast(1) Yeast1.5 8 1484 1.5
Ecoli Ecoli10 7 332 10
4.1.7 Individualised Random Sampling with MCC (PIRS-MCC)
In this final experimental configuration we investigate Individualised Random Sampling (PIRS) together with
Matthews Correlation Coefficient: an aggregate objective function which represents a particular confusion
matrix as a single value. For the PIRS-MCC configuration, we minimise 1 − MCC(P), where MCC(P) is as
previously outlined in Equation 1. Here again, the sampling method is as described in Section 3.
4.2 GP Parameters
The Genetic Programming parameters used for this investigation are described in Table 2, and the datasets used are detailed in Table 3. The yeast and ecoli datasets were originally multi-class datasets. In order to
experiment with various levels of class imbalance, we have “collapsed” several of the classes into one to create
binary classification tasks. The acronym used for each dataset indicates the % of the minority class in each
dataset. In each case we have used two thirds of the available data for training and the remaining one third
for test. We undertook 50 runs for each configuration, on each dataset, using identical random seeds for each
set of 50 runs.
5 Results and Discussion
For this investigation we have chosen the Area Under the Receiver Operating Characteristic curve (AUC) as the primary
measure of classification performance. Values for this measure are calculated using the equivalent [24] Wilcoxon-
Mann-Whitney statistic. We are also interested in the overall classification accuracy (particularly on test data),
performance on the minority and majority classes, the sizes of the evolved classifiers and how early or late in
the evolutionary process the best-of-run individual is discovered.
In the following subsections, we detail for each dataset investigated, run statistics for the best of run
individuals; the AUC measure, average overall %accuracy on training and test data, best individual %accuracy
for training and test data, average %error on the minority and majority classes for both training and test data,
the average size in nodes and the average generation in which the best-of-run individual emerged.
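For reference, the Wilcoxon-Mann-Whitney formulation of AUC used here can be sketched as the fraction of correctly ranked positive/negative score pairs (our own illustrative code; counting ties as half is the usual convention):

```python
def auc_wmw(pos_scores, neg_scores):
    """AUC via the Wilcoxon-Mann-Whitney statistic: the fraction of
    (positive, negative) score pairs the classifier ranks correctly,
    with ties counting as half a correct ranking."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

print(auc_wmw([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # 1.0: perfect ranking
print(auc_wmw([0.5, 0.5], [0.5, 0.5]))            # 0.5: no better than random
```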
To gain a clearer insight as to which method performed best overall, we carried out the non-parametric Friedman test, which is regarded as a suitable test for the empirical comparison of the performance of different algorithms [25]. This resulted in a p-value of 0.003 and indicated that the best performing algorithm in terms of AUC score was PIRS-BAL, closely followed by PIRS-MCC, as shown in Figure 1.
5.1 BUPA42
The results shown in Table 4 illustrate that the stdGP method which uses the overall accuracy fitness measure
performs very poorly on the minority class. The best approach overall is the PIRS-BAL method which combines
PIRS with average accuracy. This method delivered a superior AUC measure of 0.80, produced the smallest
programs where the best of run individual was discovered earliest in the evolutionary process. It also exhibits
an absence of over-fitting, where the average test performance for both the minority and majority classes were
actually better than the training results.
5.2 Ecoli10
Looking at the Ecoli10 results in Table 5 we see that both methods which employed PIRS achieved good AUC
scores and performed very well on the minority class, having several runs with perfect classification in the
Figure 1: Methods ranked from 1 to 7 based on average AUC, where 1 is best and 7 is worst.
Table 4: Performance of best-of-run Trained Individuals on the BUPA42 data.
Method  AUC  Avg. Train  StdDev  Best Train  Avg. Test  StdDev  Best Test  Min. Train  Min. Test  Maj. Train  Maj. Test  Size  Gen
StdGP 0.74 73.68 1.09 76.32 70.96 2.24 75.44 46.97 51.95 22.85 11.75 213.0 50.50
AVGA 0.80 80.99 1.26 83.33 74.75 3.07 78.95 27.79 42.89 10.22 11.93 88.85 48.22
MCC 0.78 76.22 1.36 78.95 74.03 3.66 78.95 34.02 41.02 16.33 14.02 144.72 54.16
CORR 0.76 66.46 5.06 75.43 70.10 6.64 78.94 31.52 31.79 35.00 28.46 114.25 57.08
GMF 0.69 73.32 1.76 77.19 68.28 4.78 76.31 28.12 36.24 25.62 28.31 161.04 56.28
PIRS-BAL 0.83 65.82 3.78 72.80 76.14 3.31 80.70 41.41 37.89 26.72 13.29 63.48 36.16
PIRS-MCC 0.78 80.66 1.39 84.21 73.91 3.06 78.07 22.23 33.46 17.05 20.52 89.76 48.96
training phase. The GMF and AVGA approaches also achieved good training scores on the minority class, but
these did not translate into good test results.
Table 5: Performance of best-of-run Trained Individuals on the Ecoli10 data.
Method  AUC  Avg. Train  StdDev  Best Train  Avg. Test  StdDev  Best Test  Min. Train  Min. Test  Maj. Train  Maj. Test  Size  Gen
StdGP 0.52 91.17 0.90 94.54 86.07 3.34 89.29 80.09 91.07 6.58 3.80 79.76 23.44
AVGA 0.72 87.68 2.80 93.18 75.71 9.99 84.82 0.63 36.92 19.07 22.63 198.40 49.00
MCC 0.64 92.92 1.79 95.45 79.79 4.36 85.71 13.81 56.77 6.32 15.42 185.85 49.50
CORR 0.56 73.86 24.53 91.82 70.30 23.65 88.39 27.27 36.15 26.01 28.54 121.00 50.84
GMF 0.74 90.43 1.87 93.64 78.82 6.12 85.72 0.45 43.38 10.58 18.20 164.80 47.96
PIRS-BAL 0.85 99.70 0.26 100 72.24 11.63 83.04 0.00 8.50 2.26 30.32 70.88 34.16
PIRS-MCC 0.86 99.61 0.24 100 71.80 12.18 83.03 0.00 6.61 3.05 31.03 70.16 39.82
5.3 HS36
For the HS36 task, once again both PIRS methods produced the best AUC scores, the best minority performance
and smallest programs. Again these programs were discovered earlier in the evolutionary process.
5.4 Yeast16
For the Yeast16 dataset, the results in Table 7 show that the CORR fitness function resulted in the best AUC
score of 0.83. This method delivered the best accuracy on the minority class and the results were balanced
across both classes. PIRS-BAL, PIRS-MCC and MCC each had AUC scores of 0.82. MCC had relatively weak
accuracy on the minority class but very good results for the majority class. Between PIRS-BAL and PIRS-MCC,
the former had the better results on the minority class.
5.5 Yeast1.5
This dataset is the most unbalanced of those tested, and proved to be the most difficult from the point of view
of minority classification. The results in Table 8 illustrate that StdGP, MCC and CORR achieved relatively
Table 6: Performance of best-of-run Trained Individuals on the HS36 data.
Method  AUC  Avg. Train  StdDev  Best Train  Avg. Test  StdDev  Best Test  Min. Train  Min. Test  Maj. Train  Maj. Test  Size  Gen
StdGP 0.44 78.22 0.73 79.90 75.22 1.72 79.41 75.32 81.77 14.32 4.26 223.60 47.68
AVGA 0.65 72.06 1.61 75.49 76.17 1.99 80.40 34.98 46.07 26.66 15.81 228.32 52.42
MCC 0.73 75.32 2.06 80.39 77.63 1.89 80.40 36.68 41.71 22.37 15.05 167.96 50.38
CORR 0.72 66.81 5.15 77.94 72.61 5.78 80.39 35.58 42.44 27.39 21.97 205.40 58.82
GMF 0.66 72.87 1.39 76.47 76.11 2.14 79.41 32.72 45.26 25.17 16.18 190.88 52.80
PIRS-BAL 0.75 79.74 2.07 83.25 75.63 3.54 80.39 20.85 32.28 24.37 21.38 101.85 46.80
PIRS-MCC 0.75 80.19 1.79 83.82 76.78 3.26 81.37 23.40 34.28 23.21 19.02 104.88 47.66
Table 7: Performance of best-of-run Trained Individuals on the Yeast16 data.
Method  AUC  Avg. Train  StdDev  Best Train  Avg. Test  StdDev  Best Test  Min. Train  Min. Test  Maj. Train  Maj. Test  Size  Gen
StdGP 0.71 87.05 0.45 88.63 86.43 0.90 88.16 61.74 58.00 9.05 4.89 166.08 46.32
AVGA 0.80 82.04 2.35 86.12 81.91 2.37 86.33 29.95 32.03 17.26 15.37 141.36 50.04
MCC 0.82 88.80 0.48 90.04 86.14 0.83 87.75 38.65 40.07 5.77 8.74 114.20 53.46
CORR 0.83 74.81 6.89 88.43 75.52 6.13 87.75 26.70 23.42 24.48 24.69 119.76 58.50
GMF 0.80 81.96 2.30 85.41 81.13 2.16 84.28 27.45 31.15 16.16 16.47 150.64 55.12
PIRS-BAL 0.82 83.66 4.26 88.63 82.32 1.77 86.53 24.35 33.42 10.83 14.61 65.56 40.90
PIRS-MCC 0.82 84.73 2.03 90.24 84.18 1.32 86.94 28.99 37.20 7.84 11.64 61.48 42.28
poor results in this respect: correctly classifying fewer than half of the minority examples. In contrast, the
PIRS-BAL method produced relatively good results on both classes and had the highest AUC score.
Table 8: Performance of best-of-run Trained Individuals on the Yeast1.5 data.
Method  AUC  Avg. Train  StdDev  Best Train  Avg. Test  StdDev  Best Test  Min. Train  Min. Test  Maj. Train  Maj. Test  Size  Gen
StdGP 0.61 99.30 0.01 99.39 99.15 0.19 99.59 53.69 44.57 0.45 0.21 57.48 18.08
AVGA 0.78 84.83 10.35 98.49 83.67 11.02 97.96 26.46 32.86 15.41 16.09 96.36 38.16
MCC 0.64 99.30 0.02 99.40 99.06 0.22 99.39 53.38 46.57 0.00 0.28 51.36 58.38
CORR 0.75 98.85 1.20 99.30 98.32 1.33 99.38 53.07 33.42 0.46 0.71 151.16 58.30
GMF 0.77 86.86 4.92 96.17 85.39 5.16 95.30 12.61 32.57 13.25 14.34 168.24 56.38
PIRS-BAL 0.80 92.58 6.44 99.59 87.36 13.81 99.18 15.87 29.14 2.47 12.39 71.16 39.32
PIRS-MCC 0.77 99.32 0.19 99.70 99.16 0.08 99.39 25.18 32.85 0.08 0.37 37.25 29.38
6 Conclusion
Looking at trends in the reported results it is clear that the overall accuracy measure commonly used for
classification tasks in GP is inferior to all of the other methods investigated, performing poorly even on the
relatively balanced Bupa42 dataset. In contrast, both of the PIRS methods performed well on all of the tasks,
under each of the criteria examined: either PIRS-BAL or PIRS-MCC achieved or shared the best AUC score for
all but one of the tasks, each delivered competitive results for overall accuracy on training and test data, and
for both minority and majority classification. These configurations also produced the smallest trees on average,
and the best-of-run individuals were discovered on average earlier in the evolutionary process.
These results suggest that Individualised Random Sampling combined with a fitness function that is designed
to operate well with unbalanced datasets can deliver superior results on both balanced and unbalanced data.
Acknowledgement: This work has been supported by Science Foundation Ireland.
Grant No. 10/IN.1/I3031.
References
[1] Zikopoulos, P. et al., Understanding big data: Analytics for enterprise class hadoop and streaming data,
McGraw-Hill Osborne Media, 2011.
[2] Batista, G. E., Prati, R. C., and Monard, M. C., ACM SIGKDD Explorations Newsletter 6(2004) 20.
[3] Kubat, M. et al., Addressing the curse of imbalanced training sets: one-sided selection, in MA-
CHINE LEARNING-INTERNATIONAL WORKSHOP THEN CONFERENCE-, pages 179–186, MOR-
GAN KAUFMANN PUBLISHERS, INC., 1997.
[4] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P., arXiv preprint arXiv:1106.1813 (2011).
[5] Joshi, M. V., Kumar, V., and Agarwal, R. C., Evaluating boosting algorithms to classify rare classes: Com-
parison and improvements, in Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference
on, pages 257–264, IEEE, 2001.
[6] Freund, Y. and Schapire, R. E., Experiments with a new boosting algorithm, in Thirteenth International
Conference on Machine Learning, pages 148–156, San Francisco, 1996, Morgan Kaufmann.
[7] Akbani, R., Kwek, S., and Japkowicz, N., Applying support vector machines to imbalanced datasets, in
Machine Learning: ECML 2004, pages 39–50, Springer, 2004.
[8] Fan, W., Stolfo, S. J., Zhang, J., and Chan, P. K., Adacost: misclassification cost-sensitive boosting, in
MACHINE LEARNING-INTERNATIONAL WORKSHOP THEN CONFERENCE-, pages 97–105, Cite-
seer, 1999.
[9] Domingos, P., Metacost: a general method for making classifiers cost-sensitive, in Proceedings of the fifth
ACM SIGKDD international conference on Knowledge discovery and data mining, pages 155–164, ACM,
1999.
[10] Kotsiantis, S. et al., GESTS International Transactions on Computer Science and Engineering 30 (2006)
25.
[11] He, H. and Garcia, E. A., Knowledge and Data Engineering, IEEE Transactions on 21 (2009) 1263.
[12] SIGKDD Explor. Newsl. 6(2004).
[13] Bhowan, U., Zhang, M., and Johnston, M., Genetic programming for image classification with unbalanced
data, in Proceeding of the 24th International Conference Image and Vision Computing New Zealand,
IVCNZ ’09, pages 316–321, Wellington, 2009, IEEE.
[14] Bhowan, U., Johnston, M., and Zhang, M., Differentiating between individual class performance in genetic
programming fitness for classification with unbalanced data, in Evolutionary Computation, 2009. CEC’09.
IEEE Congress on, pages 2802–2809, IEEE, 2009.
[15] Bhowan, U., Johnston, M., and Zhang, M., Evolving ensembles in multi-objective genetic programming
for classification with unbalanced data, in Proceedings of the 13th annual conference on Genetic and
evolutionary computation, pages 1331–1338, ACM, 2011.
[16] Patterson, G. and Zhang, M., Fitness functions in genetic programming for classification with unbalanced
data, in Proceedings of the 20th Australian Joint Conference on Artificial Intelligence, edited by Orgun,
M. A. and Thornton, J., Lecture Notes in Computer Science, pages 769–775, Gold Coast, Australia, 2007,
Springer.
[17] Hunt, R., Johnston, M., Browne, W., and Zhang, M., Sampling methods in genetic programming for clas-
sification with unbalanced data, in AI 2010: Advances in Artificial Intelligence, pages 273–282, Springer,
2011.
[18] Doucette, J. and Heywood, M., GP classification under imbalanced data sets: Active sub-sampling and
AUC approximation, in Genetic Programming, edited by O'Neill, M. et al., volume 4971 of Lecture Notes in
Computer Science, pages 266–277, Springer Berlin Heidelberg, 2008.
[19] Gathercole, C. and Ross, P., Dynamic training subset selection for supervised learning in genetic program-
ming, in Parallel Problem Solving from Nature – PPSN III, pages 312–321, Springer, 1994.
[20] Liu, Y. and Khoshgoftaar, T., Reducing overfitting in genetic programming models for software qual-
ity classification, in High Assurance Systems Engineering, 2004. Proceedings. Eighth IEEE International
Symposium on, pages 56–65, IEEE, 2004.
[21] Provost, F., Learning with imbalanced data sets, in Invited paper for the AAAI 2000 Workshop on Imbalanced Data Sets, 2000.
[22] Powers, D., Journal of Machine Learning Technologies 2 (2011) 37.
[23] Frank, A. and Asuncion, A., UCI machine learning repository, 2010.
[24] Yan, L., Dodier, R., Mozer, M. C., and Wolniewicz, R., Optimizing classifier performance via the Wilcoxon-Mann-Whitney statistics, in Proceedings of the 20th International Conference on Machine Learning, pages 848–855, Citeseer, 2003.
[25] Gerevini, A., Saetti, A., and Serina, I., An experimental study based on Friedman’s test of some local
search techniques for planning, pages 59–68.
... Each of these three distributions exhibits very significant class imbalance which, in and of itself, increases the level of difficulty of the classification problem. The imbalance in the data was handled in all cases by using Proportional Individualised Random Sampling (Fitzgerald and Ryan 2013), as described in Sect. 10.4.2 ...
... Also, the use of over-sampling generally results in increased computational cost because of the increased size of the dataset. In this study, we have employed a proportional sampling approach (Fitzgerald and Ryan 2013) which eliminates or mitigates these disadvantages. ...
Chapter
This chapter describes a general approach for image classification using Genetic Programming (GP) and demonstrates this approach through the application of GP to the task of stage 1 cancer detection in digital mammograms. We detail an automated work-flow that begins with image processing and culminates in the evolution of classification models which identify suspicious segments of mammograms. Early detection of breast cancer is directly correlated with survival of the disease and mammography has been shown to be an effective tool for early detection, which is why many countries have introduced national screening programs. However, this presents challenges, as such programs involve screening a large number of women and thus require more trained radiologists at a time when there is a shortage of these professionals in many countries. Also, as mammograms are difficult to read and radiologists typically only have a few minutes allocated to each image, screening programs tend to be conservative, involving many callbacks which increase both the workload of the radiologists and the stress and worry of patients. Fortunately, the relatively recent increase in the availability of mammograms in digital form means that it is now much more feasible to develop automated systems for analysing mammograms. Such systems, if successful, could provide a very valuable second reader function. We present a work-flow that begins by processing digital mammograms to segment them into smaller sub-images and to extract features which describe textural aspects of the breast. The most salient of these features are then used in a GP system which generates classifiers capable of identifying which particular segments may have suspicious areas requiring further investigation. An important objective of this work is to evolve classifiers which detect as many cancers as possible but which are not overly conservative.
The classifiers give results of 100% sensitivity and a false positive per image rating of just 0.33, which is better than prior work. Not only this, but our system can use GP as part of a feedback loop, to both select and help generate further features.
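The Proportional Individualised Random Sampling referenced in the excerpts above is not specified in detail here. As a rough, illustrative sketch of the general family of proportional sampling approaches (the function name and `minority_fraction` parameter are our assumptions, not the published method), one can draw a fixed-size training subset with an explicit class mix and no duplicated examples:

```python
import random

def proportional_sample(majority, minority, sample_size, minority_fraction=0.5):
    """Draw a training subset with a chosen class mix, without duplication.

    Unlike over-sampling, no instance is copied, so the evaluated subset
    is never larger than the original dataset. (Illustrative sketch only.)
    """
    n_min = min(int(sample_size * minority_fraction), len(minority))
    n_maj = min(sample_size - n_min, len(majority))
    subset = random.sample(minority, n_min) + random.sample(majority, n_maj)
    random.shuffle(subset)
    return subset

# Example: a 900/100 imbalanced dataset reduced to a balanced subset of 200.
majority = [("maj", i) for i in range(900)]
minority = [("min", i) for i in range(100)]
subset = proportional_sample(majority, minority, 200)
```

Re-drawing such a subset per individual or per generation also reduces the computational cost relative to evaluating every fitness case, which is the advantage over over-sampling noted in the excerpt.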
... For comparison purposes we choose simple classification accuracy as a performance metric. It has been empirically established in the GP literature that simple classification accuracy is not a reliable measure of classification on unbalanced datasets [7], and that other measures such as average accuracy or Matthews Correlation Coefficient might be more appropriate, especially if combined with a sampling approach [18]. However, in this preliminary investigation, the classes are balanced which allows us to consider simple classification accuracy as a reasonable measure, particularly as we want to be able to observe differences in performance across the various levels of learning. ...
Conference Paper
In this paper, we propose a hybrid approach to solving multi-class problems which combines evolutionary computation with elements of traditional machine learning. The method, Grammatical Evolution Machine Learning (GEML), adapts machine learning concepts from decision tree learning and clustering methods and integrates these into a Grammatical Evolution framework. We investigate the effectiveness of GEML on several supervised, semi-supervised and unsupervised multi-class problems and demonstrate its competitive performance when compared with several well known machine learning algorithms. The GEML framework evolves human readable solutions which provide an explanation of the logic behind its classification decisions, offering a significant advantage over existing paradigms for unsupervised and semi-supervised learning. In addition we also examine the possibility of improving the performance of the algorithm through the application of several ensemble techniques.
... For comparison purposes we choose simple classification accuracy as a performance metric. It has been empirically established in the GP literature that simple classification accuracy is not a reliable measure of classification on unbalanced datasets (Bhowan et al., 2012), and that other measures such as average accuracy or Matthews Correlation Coefficient might be more appropriate, especially if combined with a sampling approach (Fitzgerald and Ryan, 2013). However, in this preliminary investigation, the classes are balanced which allows us to consider simple classification accuracy as a reasonable measure, particularly as we want to be able to observe differences in performance across the various levels of learning. ...
Conference Paper
This paper introduces a novel evolutionary approach which can be applied to supervised, semi-supervised and unsupervised learning tasks. The method, Grammatical Evolution Machine Learning (GEML), adapts machine learning concepts from decision tree learning and clustering methods, and integrates these into a Grammatical Evolution framework. With minor adaptations to the objective function the system can be trivially modified to work with the conceptually different paradigms of supervised, semi-supervised and unsupervised learning. The framework generates human readable solutions which explain the mechanics behind the classification decisions, offering a significant advantage over existing paradigms for unsupervised and semi-supervised learning. GEML is studied on a range of multi-class classification problems and is shown to be competitive with several state of the art multi-class classification algorithms.
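The two measures named in the excerpts above, average accuracy and the Matthews Correlation Coefficient, can be computed directly from the binary confusion matrix. The following minimal sketch (ours, not taken from any of the cited papers) shows why both expose the majority-class bias that plain accuracy hides:

```python
import math

def average_accuracy(tp, tn, fp, fn):
    """Mean of the per-class accuracies (also known as balanced accuracy)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient, taken as 0 when a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A degenerate classifier that always predicts the majority class on a
# 90:10 split scores 0.9 on plain accuracy, but:
print(average_accuracy(tp=0, tn=90, fp=0, fn=10))  # 0.5
print(mcc(tp=0, tn=90, fp=0, fn=10))               # 0.0
```

Either quantity can serve directly as a GP fitness value, since both reward balanced performance across the two classes rather than raw hit-rate.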
... For the volumes studied, of the 75 usable positive cases, ... difficulty of the classification problem. The imbalance in the data was mitigated in all cases by using Proportional Individualised Random Sampling, as described in [8]. Based on this master dataset, we consider several setups representing different configurations of breasts, segments and views (see Table 1). The following terminology is used to describe the composition of instances for a given setup: BXSYVZ, where X is the number of breasts, Y the number of segments and Z the number of views for a given instance. ...
Conference Paper
We present an automated, end-to-end approach for Stage 1 breast cancer detection. The first phase of our proposed work-flow takes individual digital mammograms as input and outputs several smaller sub-images from which the background has been removed. Next, we extract a set of features which capture textural information from the segmented images. In the final phase, the most salient of these features are fed into a Multi-Objective Genetic Programming system which then evolves classifiers capable of identifying those segments which may have suspicious areas that require further investigation. A key aspect of this work is the examination of several new experimental configurations which focus on textural asymmetry between breasts. The best evolved classifier using such a configuration can deliver results of 100% accuracy on true positives and a false positive per image rating of just 0.33, which is better than the current state of the art.
... We choose AUC as the primary measure of performance in considering bias error for these experiments. See [16] for a detailed explanation of why AUC should be preferred to accuracy as a measure of classifier performance, and also [5] as to why overall classification accuracy is a poor and misleading choice, even when the data is even mildly unbalanced. ...
Conference Paper
There have been many studies undertaken to determine the efficacy of parameters and algorithmic components of Genetic Programming, but historically, generalisation considerations have not been of central importance in such investigations. Recent contributions have stressed the importance of generalisation to the future development of the field. In this paper we investigate aspects of selection bias as a component of generalisation error, where selection bias refers to the method used by the learning system to select one hypothesis over another. Sources of potential bias include the replacement strategy chosen and the means of applying selection pressure. We investigate the effects on generalisation of two replacement strategies, together with tournament selection with a range of tournament sizes. Our results suggest that larger tournaments are more prone to overfitting than smaller ones, and that a small tournament combined with a generational replacement strategy produces relatively small solutions and is least likely to over-fit.
... Note that changing the perspective from mammograms to segments changes the distribution of decision classes, as each negative mammogram contributes three negative segments, but also, each positive mammogram contributes two negative segments (typically, only one segment contains cancerous growth). The imbalance in the data was handled in all cases by using Proportional Individualised Random Sampling [17]. ...
Conference Paper
We describe a fully automated workflow for performing stage 1 breast cancer detection with GP as its cornerstone. Mammograms are by far the most widely used method for detecting breast cancer in women, and its use in national screening can have a dramatic impact on early detection and survival rates. With the increased availability of digital mammography, it is becoming increasingly more feasible to use automated methods to help with detection. A stage 1 detector examines mammograms and highlights suspicious areas that require further investigation. A too conservative approach degenerates to marking every mammogram (or segment thereof) as suspicious, while missing a cancerous area can be disastrous. Our workflow positions us right at the data collection phase such that we generate textural features ourselves. These are fed through our system, which performs PCA on them before passing the most salient ones to GP to generate classifiers. The classifiers give results of 100% accuracy on true positives and a false positive per image rating of just 1.5, which is better than prior work. Not only this, but our system can use GP as part of a feedback loop, to both select and help generate further features.
Conference Paper
This paper investigates improvements to the fitness function in Genetic Programming to better solve binary classification problems with unbalanced data. Data sets are unbalanced when there is a majority of examples for one particular class over the other class(es). We show that using overall classification accuracy as the fitness function evolves classifiers with a performance bias toward the majority class at the expense of minority class performance. We develop four new fitness functions which consider the accuracy of majority and minority class separately to address this learning bias. Results using these fitness functions show that good accuracy for both the minority and majority classes can be achieved from evolved classifiers while keeping overall performance high and balanced across the two classes.
Conference Paper
Image classification methods using unbalanced data can produce results with a performance bias. If the class representing important objects-of-interest is in the minority class, learning methods can produce the deceptive appearance of 'good looking' results while recognition ability on the important minority class can be poor. This paper develops and compares two genetic programming (GP) methods for image classification problems with class imbalance. The first focuses on adapting the fitness function in GP to evolve classifiers with good individual class accuracy. The second uses a multi-objective approach to simultaneously evolve a set of classifiers along the trade-off surface representing minority and majority class accuracies. Evaluating our GP methods on two benchmark binary image classification problems with class imbalance, our results show that good solutions were evolved using both GP methods.
Conference Paper
When the goal is to achieve the best correct classification rate, cross entropy and mean squared error are typical cost functions used to optimize classifier performance. However, for many real-world classification problems, the ROC curve is a more meaningful performance measure. We demonstrate that minimizing cross entropy or mean squared error does not necessarily maximize the area under the ROC curve (AUC). We then consider alternative objective functions for training a classifier to maximize the AUC directly. We propose an objective function that is an approximation to the Wilcoxon-Mann-Whitney statistic, which is equivalent to the AUC. The proposed objective function is differentiable, so gradient-based methods can be used to train the classifier. We apply the new objective function to real-world customer behavior prediction problems for a wireless service provider and a cable service provider, and achieve reliable improvements in the ROC curve.
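In its exact (non-smoothed) form, the Wilcoxon-Mann-Whitney statistic discussed in this abstract is simply the fraction of (positive, negative) score pairs that a classifier ranks correctly, with ties counted as half. A minimal O(n·m) sketch of that exact estimator follows; note the paper itself trains on a differentiable approximation of this quantity, which the sketch does not attempt to reproduce:

```python
def wmw_auc(pos_scores, neg_scores):
    """Exact WMW estimate of AUC: the fraction of correctly ranked
    (positive, negative) pairs, with ties counting one half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(wmw_auc([0.9, 0.8, 0.4], [0.7, 0.3]))  # 5 of 6 pairs correct -> 0.8333...
```

Because the value depends only on the relative ordering of scores, it is insensitive to class imbalance, which is why AUC is preferred over accuracy throughout this literature.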
Conference Paper
Support Vector Machines (SVM) have been extensively studied and have shown remarkable success in many applications. However the success of SVM is very limited when it is applied to the problem of learning from imbalanced datasets in which negative instances heavily outnumber the positive instances (e.g. in gene profiling and detecting credit card fraud). This paper discusses the factors behind this failure and explains why the common strategy of undersampling the training data may not be the best choice for SVM. We then propose an algorithm for overcoming these problems which is based on a variant of the SMOTE algorithm by Chawla et al., combined with Veropoulos et al.'s different error costs algorithm. We compare the performance of our algorithm against these two algorithms, along with undersampling and regular SVM and show that our algorithm outperforms all of them.
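As a rough sketch of the SMOTE-style interpolation this abstract refers to, the toy function below synthesises new minority points along the segment between a minority example and one of its k nearest minority neighbours. The function name, brute-force neighbour search and default `k` are illustrative assumptions, not the algorithm of Chawla et al. or of this paper:

```python
import random

def smote_like(minority, n_new, k=3):
    """Generate n_new synthetic minority points by interpolating between
    a random minority example and one of its k nearest minority
    neighbours. Points are tuples of floats; distance is squared Euclidean."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = random.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: sq_dist(x, m))[:k]
        nb = random.choice(neighbours)
        gap = random.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Four minority points at the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote_like(minority, 5)
```

Since each synthetic point lies between two real minority examples, the method enlarges the minority class without exact duplication, at the cost of a larger training set, the drawback the PIRS excerpts above aim to avoid.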
Conference Paper
The problem of evolving binary classification models under increasingly unbalanced data sets is approached by proposing a strategy consisting of two components: sub-sampling and 'robust' fitness function design. In particular, recent work in the wider machine learning literature has recognized that maintaining the original distribution of exemplars during training is often not appropriate for designing classifiers that are robust to degenerate classifier behavior. To this end we propose a 'Simple Active Learning Heuristic' (SALH) in which a subset of exemplars is sampled with uniform probability under a class balance enforcing rule for fitness evaluation. In addition, an efficient estimator for the Area Under the Curve (AUC) performance metric is assumed in the form of a modified Wilcoxon-Mann-Whitney (WMW) statistic. Performance is evaluated in terms of six representative UCI data sets and benchmarked against: canonical GP, SALH based GP, SALH and the modified WMW statistic, and deterministic classifiers (Naive Bayes and C4.5). The resulting SALH-WMW model is demonstrated to be both efficient and effective at providing solutions maximizing performance assessed in terms of AUC.
Conference Paper
This paper describes a genetic programming (GP) approach to binary classification with class imbalance problems. This approach is examined on two benchmark and two synthetic data sets. The results show that when using the overall classification accuracy as the fitness function, the GP system is strongly biased toward the majority class. Two new fitness functions are developed to deal with the class imbalance problem. The experimental results show that both of them substantially improve the performance for the minority class, and the performance for the majority and minority classes is much more balanced.
Conference Paper
This work investigates the use of sampling methods in Genetic Programming (GP) to improve the classification accuracy in binary classification problems in which the datasets have a class imbalance. Class imbalance occurs when there are more data instances in one class than the other. As a consequence of this imbalance, when overall classification rate is used as the fitness function, as in standard GP approaches, the result is often biased towards the majority class, at the expense of poor minority class accuracy. We establish that the variation in training performance introduced by sampling examples from the training set is no worse than the variation between GP runs already accepted. Results also show that the use of sampling methods during training can improve minority class classification accuracy and the robustness of classifiers evolved, giving performance on the test set better than that of those classifiers which made up the training set Pareto front.