Selection Bias and Generalisation Error
in Genetic Programming
Jeannie Fitzgerald
Bio-Computing and Developmental
Systems Group
University of Limerick
Conor Ryan
Bio-Computing and Developmental
Systems Group
University of Limerick
Abstract — There have been many studies undertaken to
determine the efficacy of parameters and algorithmic
components of Genetic Programming, but historically,
generalization considerations have not been of central
importance in such investigations. Recent contributions have
stressed the importance of generalisation to the future
development of the field. In this paper we investigate aspects of
selection bias as a component of generalisation error, where
selection bias refers to the method used by the learning system to
select one hypothesis over another. Sources of potential bias
include the replacement strategy chosen and the means of
applying selection pressure. We investigate the effects on
generalisation of two replacement strategies, together with
tournament selection with a range of tournament sizes. Our
results suggest that larger tournaments are more prone to
overfitting than smaller ones, and that a small tournament
combined with a generational replacement strategy produces
relatively small solutions and is least likely to over-fit.
Keywords — Genetic Programming; Generalisation; Tournament
Size; Elitism; Replacement Strategy.
I. INTRODUCTION
Generalisation is one of the most important evaluation
criteria for any machine learning (ML) system [17] and a vital
concern of ML practitioners is how to develop learners which
generalize well [14]. In this context, under-fitting and over-
fitting can be understood to correspond to bias error and
variance error respectively, where generalisation error can be
decomposed into bias, variance and irreducible error
components [1]. In this view of generalisation error, bias is the
tendency of a learner to consistently learn the same wrong
thing (under-fitting) and variance is the tendency to learn
random things irrespective of the true signal (over-fitting) [3].
Or, as Keijzer and Babovic [13] expressed it, “the bias
component (of generalisation error) reflects the generic ability
of a method to fit a particular data set, while the variance error
reflects the intrinsic variability of the method in arriving at a …”
Genetic Programming (GP) is an evolutionary computation
approach inspired by Darwinian principles, in which a
population of candidate solutions is “evolved”, where this
evolutionary process realises a parallel guided search of the
problem space. The guided aspect of the search is controlled
through the application of genetic processes such as selection,
reproduction, crossover and mutation, roughly corresponding
to the Darwinian notions of natural selection, survival of the
fittest, sexual reproduction and genetic mutation. Through
these mechanisms, the population evolves increasingly better
solutions until an optimum solution is found, or until the
individuals in the population are no longer showing
significant improvement.
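The loop described above can be sketched as follows. This is a generic illustration of the evolutionary process, not the authors' implementation; the operator arguments (`select`, `crossover`, `mutate`) are placeholders to be supplied by the caller:

```python
import random

def evolve(pop, fitness, select, crossover, mutate,
           generations=60, p_cx=0.8, p_mut=0.2):
    """Generic evolutionary loop: selection, crossover and mutation
    realise a guided search over a population of candidate solutions."""
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < len(pop):
            # Selection applies the pressure that guides the search.
            p1, p2 = select(pop, fitness), select(pop, fitness)
            if random.random() < p_cx:
                c1, c2 = crossover(p1, p2)
            else:
                c1, c2 = p1, p2
            new_pop += [mutate(c) if random.random() < p_mut else c
                        for c in (c1, c2)]
        pop = new_pop[:len(pop)]
    return max(pop, key=fitness)  # best individual found
```

With a toy real-valued representation, averaging crossover and Gaussian mutation, the loop converges toward the optimum of a simple fitness function.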
It is typical of GP experiments that as training
performance improves (decreasing bias), testing performance,
or more particularly, the difference between training and test
performance, deteriorates (increasing variance). As the
purpose of training is to gain information concerning the
target function from the training data, sensitivity to the
training data is essential, and generally more sensitivity
results in lower bias [8]. However, this in turn increases
variance, and so there is a natural “bias-variance trade-off” [9]
associated with function approximation and classification tasks.
One of the perceived disadvantages of GP, relative to
other ML methods, is potentially poor generalisation
performance, where, at least for regression tasks, GP has been
categorised as a low bias / high variance learner [13]. Thus,
the challenge for practitioners is to strive for an optimal
situation where bias and variance are both balanced and
relatively low, and where generalisation error (in terms of
bias and variance components) is minimised. In early work
Whigham [22] defined three types of inductive bias and
mapped those to particular aspects of GP:
1) Search Bias: Search bias refers to the factors that
control the way in which the learning system
transforms one hypothesis into another. In terms of GP
this bias is represented by the crossover and mutation
operators and the maximum depth of any created
program tree;
2) Selection Bias: Refers to the method used to select one
hypothesis over another. In GP this relates to the
chosen fitness function, and other selection biases such
as tournament size, elitism and replacement strategy;
3) Language Bias: The representation language
implicitly defines the space of all hypotheses that a
learner can represent and therefore can ever learn, as it
restricts the forms that those hypotheses may represent.
In GP this relates to the function and terminal sets.
We believe that these definitions and mappings may be useful
as a focus for research, to identify and understand the potential
contribution that each of these inductive biases may make to
the bias and variance components of generalisation error. In
so doing, we may learn heuristics for improving generalisation
performance in GP. In this empirical study, we have chosen
to focus on aspects of selection bias including replacement
strategy, tournament size and elitism. We investigate the pos-
sible effects on generalisation of two popular replacement
strategies together with tournament selection with a range of
tournament sizes, with and without elitism on nine different
binary classification tasks. To our knowledge this is the first
study on this topic which is focused on possible implications
for generalisation with regard to classification problems. Aside
from tournament selection, there are various other selection
methods possible, but we choose to concentrate on tournament
selection as it is the most popular method used by GP
practitioners today. Of course, the most significant component
of selection bias in GP is the fitness function chosen and we
intend to address that aspect in future work. The remainder
of this paper is laid out as follows: Section II provides details
of important previous work concerning the identified selection
biases, Section IV explains the set-up of the empirical study
and details the results, and Section V outlines conclusions.
II. PREVIOUS WORK
In a study comparing the genetic robustness of steady-state versus
generational approaches, Jones and Soule [12] investigated
the behaviour of both algorithms on a simple “two peaks”
problem. In this type of problem the fitness landscape [24]
consists of two peaks: a broad low peak and a high narrow
peak, and the objective is for the population to discover the
tall narrow peak. Jones and Soule determined that when a
generational algorithm was used, once the low broad peak
was discovered the system required code growth in order to
move off that peak onto the taller peak. By contrast, when
a population was encoded using the steady state approach,
it naturally made a smooth transition from one peak to the
other without requiring code growth. In that context, the term
robust refers to a feature of individual programs, such that
they are relatively impervious to small changes introduced by
crossover or mutation, i.e. small changes in their structure will
not lead to big changes in their behaviour. In this way, a robust
individual, once on the slopes of the high narrow peak, will
be less likely to produce offspring prone to “falling off” in the
event of a small change to its structure.
Another investigation by Whitley et al. [23] compared
steady-state with generational replacement strategies using
tournament sizes of 2 and 7 on several popular GP prob-
lems including Artificial Ant, 11-Multiplexer and a symbolic
regression problem. The results of that study showed that a
generational strategy with tournament size of 2 was the worst
performing whereas the steady-state strategy with tournament
size of 2 was best overall. With regard to Grammatical
Evolution (GE), Ryan et al. [18] demonstrated that the steady
state approach delivered superior performance when used to
solve various symbolic regression problems. The authors hy-
pothesised that this was because GE has a tendency to produce
some sub-optimal solutions which a generational algorithm
may allow to filter through to later generations, but which are
usually eliminated with a steady state algorithm.
In early work comparing various selection strategies for
Genetic Algorithms (GAs) Goldberg and Deb [10] explained
that when steady-state genetic algorithms use a selective
strategy which involves replacing the worst individual coupled
with tournament selection the actual selective pressure is much
greater than the tournament size might suggest. Whitley et al.
[23] suggested that in terms of the time it takes for the
best individual to take over the population under selection,
a tournament size of 2 under the steady state model behaves
more like a tournament size of 7 in a generational model.
Xie [25] studied both tournament and population size from
the perspectives of selection pressure and an individual’s
probability of being sampled. They concluded that tournament
size affects sampling probability: larger tournaments give
an individual a higher probability of being sampled but, con-
versely, a lower probability of being selected. The issue of multiple sampling
of individuals was shown not to be a serious problem whereas
for binary tournaments the higher probability of individuals
not being sampled at all may lead to sub-optimal solutions.
Xie investigated tournaments of size 2, 4 and 7.
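Xie's sampling argument can be made concrete with a small calculation. Under tournament selection with replacement, with one tournament held per population slot, the probability that a given individual is never sampled in a generation is (1 − 1/N)^(kN) ≈ e^(−k). A sketch (the function name is ours):

```python
import math

def p_never_sampled(pop_size, tourn_size):
    """Probability that a given individual is not sampled in any of the
    pop_size tournaments of one generation (sampling with replacement,
    one tournament per population slot)."""
    return (1 - 1 / pop_size) ** (tourn_size * pop_size)

# Binary tournaments leave roughly e^-2 (about 13.5%) of a population
# of 300 unsampled, while tournaments of size 7 leave almost none:
for k in (2, 4, 7):
    print(k, round(p_never_sampled(300, k), 4))
```

This illustrates why, for binary tournaments, the higher probability of individuals never being sampled can matter in practice.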
Fitzgerald and Ryan [6] proposed a method of self-adaptive
tournament selection in which an initial small tournament size
becomes progressively more elitist depending on how well the
system is performing on the training data. They maintained
a threshold value t such that if the mean training fitness
failed to improve for t generations, the size of tournaments
increased by 1. They reported that their self-adaptive approach
produced improved accuracy on test data. There has been
quite a lot of other research on tournament selection, see for
example [19, 21, 25], much of which concentrates on effects
and properties of tournament selection with regard to fitness
distribution, diversity, selection pressure or sampling and is
focused on training data. See [4] for a review of the topic.
In this study we instead examine the effects of various
tournament sizes on generalisation for binary classification
problems.
III. MEASURING CLASSIFIER PERFORMANCE
Originating from World War II, where it was initially used
to evaluate the performance of radio personnel at accurately
reading radar images, the Receiver Operating Characteristic
(ROC) is a tool which may be used to measure the perfor-
mance of medical tests, radiographers, and classifiers. The
area under the ROC curve, known as the AUC is a scalar
value which captures the accuracy of the entity under scrutiny.
The AUC is a non-parametric measure representing ROC
performance independent of any threshold [2]. A perfect ROC
will have an AUC of 1, whereas the ROC plot of a random
classifier will result in an AUC of approximately 0.5.
There are various ways to calculate the AUC and we have
chosen the Wilcoxon Mann Whitney approximation as it is
easy to calculate and has the advantage that it facilitates the
estimation of confidence intervals [20].
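A minimal sketch of the Wilcoxon-Mann-Whitney estimate: the AUC is the fraction of (positive, negative) example pairs ranked correctly, with ties counting one half. The function name is illustrative:

```python
def wmw_auc(scores, labels):
    """AUC as the probability that a randomly chosen positive example
    is scored higher than a randomly chosen negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count correctly ordered pairs; ties score 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranking gives 1.0; a random classifier scores about 0.5.
print(wmw_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```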
IV. EXPERIMENTS
For these experiments we compare the performance of genera-
tional and steady-state replacement strategies with tournament
sizes of 2, 3, 6, 9, 12 and 15 on nine different binary
classification tasks. We report the AUC, best AUC, average
training accuracy, best training accuracy, average test accuracy,
best test accuracy and variance error. Each of these relates to
the best-of-run individuals.
For each configuration we carried out 50 runs with identical
random seeds. We choose a population size of 300 and
terminated evolution after 60 generations. A crossover rate of
0.8 and a mutation rate of 0.2 were employed. A generational
approach is indicated with “G” and a steady-state one with “S”
with the tournament size appended, together with an elitism
indicator: S3 represents a non-elitist steady-state algorithm
with tournament size of 3, whereas G2E represents an elitist
generational replacement strategy with a tournament of size 2.
Where elitism is applied, it is at the rate of 10%.
Both the generational and steady state algorithms used
for these experiments behave in a similar way in that each
successful crossover operation generates two offspring. How-
ever, in the case of steady state the offspring replace their
parents whereas for the generational approach they are copied
to the new population, leaving the parents available for re-
selection. In both cases, offspring are copied regardless of
fitness. Mutation functions in a similar fashion.
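The two replacement schemes can be sketched as follows. This is an illustration of the behaviour described above rather than the authors' code; `breed` stands in for a crossover operator producing two offspring:

```python
import random

def tournament(pop, fitness, k):
    """Index of the fittest of k uniformly sampled individuals."""
    return max(random.sample(range(len(pop)), k),
               key=lambda i: fitness(pop[i]))

def generational_step(pop, fitness, breed, k):
    """Offspring are copied to a new population; parents remain
    available for re-selection until that population is full."""
    new_pop = []
    while len(new_pop) < len(pop):
        i, j = tournament(pop, fitness, k), tournament(pop, fitness, k)
        new_pop.extend(breed(pop[i], pop[j]))
    return new_pop[:len(pop)]

def steady_state_step(pop, fitness, breed, k):
    """Offspring immediately replace their parents, so they may be
    selected again within the same iteration (overlapping population)."""
    pop = pop[:]
    for _ in range(len(pop) // 2):
        i, j = tournament(pop, fitness, k), tournament(pop, fitness, k)
        pop[i], pop[j] = breed(pop[i], pop[j])
    return pop
```

The overlapping nature of the steady-state variant, where freshly created offspring compete for selection within the same iteration, is what the later discussion of implicit elitism and solution size refers to.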
We have chosen 9 binary classification datasets from the
UCI machine learning repository [7]. This relatively large
number of benchmarks allows for a robust evaluation, pro-
viding some insulation against suitability or otherwise of
particular dataset/s to the inductive bias of the GP paradigm.
Thus, we can reasonably expect that any trends observed may
have general application for binary classification.
Data Set                 Abbr.  Attributes  Instances  % Positive
Biodegradation           BIO    42          1082       33
Blood Transfusion        BT     5           768        23
Liver Disorders          BUPA   7           345        42
Caravan                  CAR    85          5946       6
German Credit            GC     23          1000       30
Haberman's Survival      HS     4           306        36
Ionosphere               ION    34          348        36
Diabetes                 PIMA   9           768        35
Wisconsin Breast Cancer  WBC    10          699        46
We know from machine learning that generalisation error is
composed of bias and variance error:
generalisation error = bias + variance + noise (1)
We will assume for simplicity that our noise component is 0,
so that bias error = 1 − AUC (or 100 − accuracy) and
variance error = training accuracy − test accuracy, where
variance error (over-fitting) may have a negative value in the
case where test accuracy is better than training accuracy. We
acknowledge that this is a drastic simplification, however it
may be a useful way to reason about our practical experiments.
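Under this simplification, the quantities reported in the results tables can be computed directly. A sketch (the function name is ours), with accuracies expressed in percent as in the tables:

```python
def error_components(auc, train_acc, test_acc):
    """Paper's simplified decomposition, assuming zero noise:
    bias error = 1 - AUC, and variance error (over-fitting) is the
    training/test accuracy gap, which can be negative when the model
    does better on unseen data than on the training set."""
    return {"bias": 1 - auc,
            "variance": train_acc - test_acc}

# e.g. the steady-state averages reported later:
print(error_components(0.84, 79.12, 75.09))
```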
We choose AUC as the primary measure of performance in
considering bias error for these experiments. See [16] for
a detailed explanation of why AUC should be preferred to
accuracy as a measure of classifier performance, and also [5] as
to why overall classification accuracy is a poor and misleading
choice, even when the data is only mildly unbalanced.
A. Results
Experimental results for the nine datasets are extensive and
quite similar, so due to space restrictions we show full results
for two datasets here. Aggregate results for all datasets are
reported under various headings.
1) Variance: Table III summarises the variance error (over-
fitting) for each dataset, for each configuration, where “E”
indicates elitism and “NE” no elitism. The lowest average
over-fitting is highlighted in bold font.
Here we see that in 5 of the 9 datasets a non-elitist
generational strategy produced the smallest variance error
whereas in 8 of 9 cases a steady-state strategy resulted in
the greatest variance error; of these 8, 6 were elitist set-ups.
These results would seem to suggest that a non-elitist
generational approach is best to minimise over-fitting and that
elitism is not helpful in this respect. Even though the steady-
state algorithm employed here does not involve replacing the
worst individuals with newly created offspring or mutants, it
is likely that due to the overlapping nature of the algorithm it
is essentially more elitist in nature, and thus there may be
elitist effects similar to those reported by Whitley et al. [23],
although perhaps to a lesser extent.
To clarify the effects of elitism, we ran further experiments
using varying levels of elitism (20%, 30% and 40%) for
both generational and steady-state replacement
strategies with a tournament of size 2. The results of those
particular experiments revealed that in 7 of 9 datasets the
G10 configuration (generational with 10% elitism) exhibited
the lowest variance error. This would seem to suggest that
elitism contributes to variance error, and that if elitism is used,
then small values should be preferred if we wish to minimise
variance error.
2) Bias: For 8 of the 9 datasets there is a generational
set-up that produces or shares the best average AUC score,
and the majority of these are elitist. In contrast, only 4 of the
9 datasets have a steady-state configuration that achieves or
shares the best average AUC score.
Looking at the best overall AUC scores, the relative per-
formance of the steady-state algorithm is somewhat better:
8 of the 9 datasets have at least one elitist generational
configuration which achieves the top score and 6 of 9 have a
steady-state configuration which does so. In 6 of the 9 datasets,
higher percentages of elitism resulted in or shared the best
result for average or best overall AUC.
Data Config. Avg. AUC Best AUC Var. Err.
S10 0.82 0.90 3.26
S20 0.80 0.90 4.76
S30 0.82 0.89 5.12
S40 0.81 0.90 4.68
G10 0.78 0.88 4.03
G20 0.83 0.90 5.54
G30 0.83 0.91 5.11
G40 0.80 0.90 4.10
S10 0.71 0.85 2.58
S20 0.72 0.89 3.60
S30 0.74 0.86 1.78
S40 0.72 0.89 2.44
G10 0.77 0.89 -3.93
G20 0.75 0.89 0.79
G30 0.74 0.88 4.16
G40 0.75 0.89 -0.99
S10 0.72 0.77 0.47
S20 0.72 0.77 0.48
S30 0.72 0.79 0.55
S40 0.72 0.78 0.57
G10 0.72 0.79 0.07
G20 0.74 0.79 0.23
G30 0.74 0.78 0.80
G40 0.73 0.79 0.13
S10 0.66 0.78 -0.47
S20 0.67 0.76 -0.43
S30 0.67 0.77 -0.49
S40 0.66 0.77 -0.78
G10 0.65 0.76 -0.92
G20 0.67 0.78 -0.77
G30 0.68 0.76 -0.42
G40 0.66 0.76 -0.60
S10 0.59 0.75 7.89
S20 0.64 0.71 7.73
S30 0.63 0.72 8.24
S40 0.63 0.73 7.68
G10 0.61 0.70 6.36
G20 0.64 0.71 7.73
G30 0.64 0.73 8.30
G40 0.64 0.74 7.92
S10 0.76 0.87 -5.14
S20 0.73 0.87 -0.66
S30 0.74 0.84 -1.74
S40 0.75 0.89 -3.09
G10 0.78 0.90 -4.68
G20 0.75 0.87 -2.59
G30 0.74 0.87 -1.28
G40 0.77 0.88 -3.97
S10 0.83 0.95 4.10
S20 0.84 0.95 4.38
S30 0.84 0.95 5.39
S40 0.83 0.93 4.61
G10 0.80 0.94 2.61
G20 0.85 0.95 3.77
G30 0.86 0.97 4.44
G40 0.83 0.96 3.24
S10 0.65 0.74 7.26
S20 0.65 0.73 8.95
S30 0.64 0.73 8.33
S40 0.64 0.74 8.09
G10 0.64 0.74 6.26
G20 0.64 0.78 8.18
G30 0.65 0.73 8.45
G40 0.63 0.76 8.21
S10 0.98 1.00 -0.98
S20 0.96 1.00 -0.73
S30 0.93 0.99 -0.75
S40 0.96 1.00 -0.78
G10 0.98 0.99 -1.23
G20 0.96 1.00 -0.81
G30 0.94 0.99 -0.87
G40 0.90 1.00 -0.84
BIO 5.54 5.45 5.43 5.33 5.65 5.56
BT 1.00 0.57 1.13 0.49 0.87 0.65
BUPA 7.01 3.02 7.01 3.10 7.01 2.93
CAR -0.53 -0.43 -0.54 -0.46 -0.52 -0.40
GC 9.58 8.91 9.42 8.77 4.38 9.06
HS -1.53 -1.70 -2.43 -1.54 -0.62 -1.86
ION 5.61 5.12 5.50 4.95 5.70 5.29
PIMA 10.13 9.60 9.92 9.32 10.31 9.88
WBC -0.46 -0.56 -0.48 -0.56 -0.45 -0.55
Tourn. Size  Avg. AUC  Best AUC  Avg. Train  SD    Best Train  Avg. Test  SD    Best Test  Avg. Size
2            0.75      0.85      77.87       2.20  82.22       75.68      3.62  81.92      157.40     2.21
3            0.75      0.85      78.89       2.18  83.29       75.71      3.69  80.60      183.79     3.22
6            0.74      0.84      79.53       2.53  84.65       75.55      3.91  82.42      200.88     4.00
9            0.74      0.84      79.58       2.77  84.84       75.48      4.07  82.73      200.65     4.07
12           0.74      0.84      79.36       2.87  84.79       75.10      4.28  82.45      197.70     4.24
15           0.74      0.84      79.38       2.78  85.07       75.19      4.17  82.57      200.34     4.37
Overall, an elitist generational strategy delivers better per-
formance in terms of bias error. Good results for average
and best AUC were achieved at various tournament sizes, but
overall tournaments of size 2 delivered the best results.
3) Summary of Results: Here we summarise the results,
showing averages across the datasets under the relevant head-
ings: tournament size, replacement strategy and elitism.
4) Tournament Size: In Table IV we show the average
results for tournament size across all datasets and configura-
tions. These results indicate that smaller tournaments (2 or
3) result in better average and overall AUC scores and that
the smallest tournament produces the least over-fitting and the
smallest solutions, on average.
5) Replacement Strategy: Looking at results in Table V
organised by replacement strategy, we can see that the Gener-
ational approach results in best average and overall AUC and
produces smaller programs which have lower over-fitting.
6) Elitism: Finally, if we examine the results with respect to
elitism as seen in Table VI we can observe that the application
of elitism produces better results on training accuracy for
smaller tournaments but that the same benefit is not apparent
for larger tournaments. Equally good results for average AUC
are achieved with both elitist and non-elitist strategies, with the
elitist approach delivering the best overall AUC on average.
With regard to test accuracy, the application of elitism would
not appear to confer any significant advantage. The best
configuration overall from the perspective of generalisation is
a non-elitist strategy with a tournament size of 2.
Taking a more detailed look at the data, we observe that
in 8 of the 9 problems a non-elitist strategy resulted in the
least over-fitting overall. For the most part this was with a
Strategy      Avg. AUC  Best AUC  Avg. Train  SD    Best Train  Avg. Test  SD    Best Test  Avg. Size
Steady-State  0.74      0.84      79.12       2.76  84.52       75.09      3.64  81.48      210.24     4.05
Generational  0.75      0.85      79.08       2.35  83.77       75.82      4.28  82.75      170.02     3.32
Config.  Avg. AUC  Best AUC  Avg. Train  SD    Best Train  Avg. Test  SD    Best Test  Avg. Size
2E       0.75      0.85      78.75       2.13  82.96       75.84      3.73  82.74      154.15     2.89
3E       0.75      0.85      79.23       2.16  83.66       75.91      3.74  82.65      174.32     3.33
6E       0.74      0.84      79.46       2.49  84.55       75.53      3.81  82.09      181.93     3.93
9E       0.75      0.84      79.42       2.69  84.61       75.32      4.13  82.62      187.16     4.10
12E      0.74      0.84      79.17       2.89  84.86       75.07      4.30  82.42      188.51     4.10
15E      0.74      0.84      79.45       2.76  84.67       75.12      4.22  82.67      189.11     4.32
2        0.75      0.84      76.99       2.27  81.49       75.52      3.51  81.10      160.66     1.53
3        0.74      0.84      78.56       2.19  82.93       75.52      3.64  82.53      196.97     3.11
6        0.74      0.84      79.59       2.56  84.74       75.57      4.02  82.75      219.84     4.07
9        0.74      0.84      79.73       2.85  85.06       75.65      4.01  82.85      214.14     4.04
12       0.74      0.84      79.55       2.85  84.72       75.14      4.26  82.49      206.89     4.38
15       0.74      0.84      79.31       2.79  85.46       75.27      4.12  82.47      211.57     4.42
tournament of size 2. However, when we compare the effects
of elitism on over-fitting across the range of tournament sizes
for the same replacement strategy there is little difference.
7) Further Remarks on Replacement Strategy: In other
experiments (not shown) which used an optimised GP set-
up, results in terms of average AUC were sub-optimal for
both the GC and CAR datasets using a generational approach.
Investigation indicated that the average program size with the
generational method was disproportionately smaller for both
the GC and CAR datasets in comparison with a steady state
configuration. As some of the optimisations used had been
shown to produce smaller programs, we hypothesise that this
combined with the observation that a generational replacement
strategy produces smaller solutions, created a situation where
the GP system was unable to evolve programs of sufficient
size to encode a good solution for these datasets, both of
which have relatively large numbers of attributes and instances.
We informally confirmed this hypothesis by re-running the
CAR experiments using a steady-state algorithm with reduced
crossover probability, whereupon the performance in terms of
average AUC was significantly degraded. This problem could
be overcome by running evolution for more iterations/genera-
tions. Hunt et al. [11] demonstrated that for classification, GP
has good scalability with regard to instances but does not scale
well as the number of attributes increases. Our experience
here may hint that program size is a factor in scalability with
regard to attributes.
Considering the extensive experimental results, we can make
several interesting observations:
1) Average solution size is universally smaller for the
generational replacement strategy. A possible reason
for this phenomenon is that steady-state algorithms are
overlapping in the sense that until such time as the new
population is filled, all individuals in the population,
including newly created offspring, are available for
selection. Since crossover increases program size [15],
these newly created offspring are likely to be larger than
the individuals which they replace, and if these are in
turn selected for crossover in the same iteration that they
were created in, this will likely lead to an increase in
average size over a generational strategy;
2) In 8 of the 9 problems, a non-elitist strategy resulted in
the least over-fitting. Of these, 6 were associated with a
generational algorithm;
3) In 7 of the 9 problems, a tournament size of 2 resulted
in the least over-fitting;
4) For 7 of the 9 problems, variance error tends to
increase with tournament size. This, taken together with
the previous point, suggests that larger tournaments are
more prone to over-fitting than smaller ones, and that a
small tournament under a generational algorithm is least likely to over-fit;
5) Elitism may reduce bias error but lead to increased
variance error, and is thus a parameter which influences
the bias variance trade-off;
6) The best performing configuration in terms of AUC
score from the main experiments is the G2E config-
uration. Experiments with higher percentages of elitism
yielded even better results, on average;
7) When using a generational strategy together with param-
eter or algorithmic settings which may constrain program
size, it may be necessary to run evolution for longer in
order to achieve optimal results.
V. CONCLUSIONS
In this paper, we have examined the influence of selection bias
introduced through choices of tournament size, replacement
strategy and application of elitism on generalisation behaviour.
The results of the empirical study illustrate that, at least for
the 9 problems studied, a tournament size of 2 combined with
a non-elitist generational replacement strategy produced the
best results in terms of bias and variance error.
Acknowledgements: This work has been supported by the
Science Foundation of Ireland. Grant No. 10/IN.1/I3031.
Config.  Avg. AUC  Best AUC  Avg. Train  Best Train  Avg. Test  SD    Best Test  Avg. Size  Var. Err.
S2       0.76      0.87      75.14       79.65       80.92      2.58  84.21      223.04     -5.14
S3       0.75      0.88      76.21       80.08       80.18      2.91  85.52      256.00     -3.96
S6       0.76      0.88      76.99       83.55       79.63      2.67  84.21      280.60     -2.64
S9       0.73      0.86      77.80       83.98       78.71      3.96  85.52      273.88     -0.91
S12      0.73      0.87      77.62       83.11       78.66      3.57  85.52      303.80     -1.03
S15      0.74      0.86      77.16       85.71       78.07      4.55  84.21      299.80     -0.91
S2E      0.76      0.88      74.13       77.92       77.21      4.80  84.21      210.28     -3.08
S3E      0.73      0.91      75.93       82.55       75.47      4.94  85.53      222.40     0.46
S6E      0.74      0.90      77.03       84.42       76.92      4.80  85.53      228.21     0.10
S9E      0.76      0.90      76.88       82.88       77.05      5.41  85.53      256.68     -0.17
S12E     0.74      0.87      76.07       88.34       77.47      4.15  85.53      272.00     -1.40
S15E     0.75      0.86      77.22       85.71       76.84      4.78  85.53      256.48     0.37
G2       0.78      0.90      73.00       77.06       77.18      3.51  85.53      147.36     -4.68
G3       0.76      0.90      74.59       80.52       77.55      3.55  84.21      193.64     -2.96
G6       0.75      0.87      76.45       82.68       77.00      4.81  84.21      230.04     -0.54
G9       0.76      0.88      76.80       82.25       76.76      4.94  85.53      214.36     0.04
G12      0.73      0.89      77.15       83.98       77.44      5.01  84.21      225.28     -0.30
G15      0.76      0.87      76.60       83.55       77.39      4.98  84.21      226.36     -0.79
G2E      0.78      0.89      74.43       79.65       78.55      3.66  85.52      155.76     -4.11
G3E      0.77      0.91      75.83       81.81       77.86      4.71  84.21      164.24     -2.04
G6E      0.76      0.89      76.13       80.52       76.86      4.64  84.21      173.40     -0.73
G9E      0.78      0.88      76.19       83.11       77.99      4.60  86.84      190.44     -1.80
G12E     0.78      0.89      76.00       81.38       77.81      4.01  84.21      189.80     -1.85
G15E     0.75      0.88      76.78       83.12       77.39      3.79  82.89      210.04     -0.61
[1] Leo Breiman. Bias, variance, and arcing classifiers. Technical Report 460, Statistics Department, University of California, Berkeley, 1996.
[2] Christopher D. Brown and Herbert T. Davis. Receiver operating charac-
teristics curves and related decision measures: A tutorial. Chemometrics
and Intelligent Laboratory Systems, 80(1):24–38, 2006.
[3] Pedro Domingos. A few useful things to know about machine learning.
Communications of the ACM, 55(10):78–87, 2012.
[4] Yongsheng Fang and Jun Li. A review of tournament selection in
genetic programming. In Zhihua Cai et al., editors, Advances in
Computation and Intelligence, volume 6382 of LNCS, pages 181–192.
Springer Berlin Heidelberg, 2010.
[5] Jeannie Fitzgerald and Conor Ryan. A hybrid approach to the problem
of class imbalance. In International Conference on Soft Computing,
Brno, Czech Republic, June 2013.
[6] Jeannie Fitzgerald and Conor Ryan. Individualized self-adaptive genetic
operators with adaptive selection in genetic programming. In Nature
and Biologically Inspired Computing (NaBIC), 2013, pages 232–237.
IEEE, 2013.
[7] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
[8] Jerome H Friedman. On bias, variance, 0/1 loss, and the curse-of-
dimensionality. Data Mining and Knowledge Discovery, 1(1):55–77, 1997.
[9] Stuart Geman, Elie Bienenstock, and René Doursat. Neural networks
and the bias/variance dilemma. Neural computation, 4(1):1–58, 1992.
[10] David E Goldberg and Kalyanmoy Deb. A comparative analysis of
selection schemes used in genetic algorithms. In Foundations of Genetic
Algorithms, pages 69–93. Morgan Kaufmann, 1991.
[11] Rachel Hunt, Kourosh Neshatian, and Mengjie Zhang. Scalability
analysis of genetic programming classifiers. In Xiaodong Li, editor,
Proceedings of the 2012 IEEE Congress on Evolutionary Computation,
pages 509–516, Brisbane, Australia, 10-15 June 2012.
[12] Josh Jones and Terry Soule. Comparing genetic robustness in genera-
tional vs. steady state evolutionary algorithms. In Proceedings of the
8th annual conference on genetic and evolutionary computation, pages
143–150. ACM, 2006.
[13] Maarten Keijzer and Vladan Babovic. Genetic programming, ensemble
methods and the bias/variance tradeoff–introductory investigations. In
Genetic Programming, pages 76–90. Springer, 2000.
Config.  Avg. AUC  Best AUC  Avg. Train  Best Train  Avg. Test  SD    Best Test  Avg. Size  Var. Err.
S2       0.65      0.74      70.26       75.00       63.00      3.32  68.75      186.44     7.26
S3       0.62      0.73      71.81       77.65       62.13      4.05  68.75      234.08     9.68
S6       0.63      0.73      72.99       80.38       62.26      3.56  71.87      271.68     10.74
S9       0.62      0.73      72.66       78.64       61.56      3.89  69.27      258.88     11.10
S12      0.62      0.72      73.44       78.22       62.24      3.50  70.83      238.12     10.80
S15      0.63      0.71      72.87       80.56       61.69      3.34  68.75      263.00     11.17
S2E      0.64      0.75      72.18       77.26       63.79      2.76  69.27      180.24     8.38
S3E      0.62      0.73      72.57       79.51       62.99      4.11  71.88      199.48     9.58
S6E      0.62      0.72      72.29       80.56       61.32      3.70  68.75      214.92     10.76
S9E      0.62      0.73      73.15       79.16       62.22      4.43  71.35      252.72     10.92
S12E     0.62      0.70      71.93       78.64       60.87      4.19  68.75      234.44     11.05
S15E     0.63      0.72      72.60       77.08       61.82      4.21  68.22      242.80     10.79
G2       0.64      0.74      68.48       74.13       62.22      3.65  69.27      156.80     6.26
G3       0.64      0.77      71.11       76.04       62.81      3.57  72.92      184.84     8.30
G6       0.63      0.75      72.59       76.21       62.83      4.00  71.35      216.88     9.76
G9       0.63      0.74      73.01       79.00       62.61      3.29  70.31      217.84     10.39
G12      0.64      0.74      72.94       77.60       62.54      4.13  69.27      195.04     10.40
G15      0.63      0.72      73.29       81.59       61.64      4.27  70.31      195.84     11.65
G2E      0.65      0.75      71.92       76.38       63.70      3.15  73.43      126.20     8.21
G3E      0.64      0.72      72.58       78.29       63.02      3.81  74.44      154.96     9.56
G6E      0.64      0.72      72.75       77.78       62.91      3.63  69.27      164.00     9.83
G9E      0.65      0.74      73.17       78.47       63.68      3.86  70.31      173.81     9.49
G12E     0.63      0.71      73.09       78.30       62.53      3.42  68.23      161.64     10.55
G15E     0.63      0.70      72.88       78.65       62.23      3.57  69.37      172.92     10.64
[14] Tim Kovacs and Andy J. Wills. Generalization versus discrimination in machine learning. In Encyclopedia of the Sciences of Learning. Springer, 2012.
[15] William B. Langdon. Size fair and homologous tree crossovers for tree
genetic programming. Genetic programming and evolvable machines,
1(1-2):95–119, 2000.
[16] Charles X. Ling, Jin Huang, and Harry Zhang. AUC: a better measure than accuracy in comparing learning algorithms. In Advances in Artificial Intelligence, pages 329–341. Springer, 2003.
[17] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, NY, 1997.
[18] Conor Ryan, J.J. Collins, and Michael O'Neill. Grammatical evolution:
Evolving programs for an arbitrary language. In Genetic Programming,
pages 83–96. Springer, 1998.
[19] Artem Sokolov and Darrell Whitley. Unbiased tournament selection.
In Proceedings of the 2005 Conference on Genetic and Evolutionary
Computation, GECCO ’05, pages 1131–1138, New York, NY, USA,
2005. ACM.
[20] Paul Stober and Shi-Tao Yeh. An explicit functional form specification approach to estimate the area under a receiver operating characteristic (ROC) curve. SUGI 27 Proceedings, 2007. Available at http://www2.sas.com/proceedings/sugi27/p226-227.pdf, accessed March 2007.
[21] Dirk Thierens. Selection schemes, elitist recombination, and selection intensity. Technical Report UU-CS-1998-48, Utrecht University, 1998.
[22] Peter A Whigham. Search bias, language bias and genetic program-
ming. In Proceedings of the First Annual Conference on Genetic
Programming, pages 230–237. MIT Press, 1996.
[23] Darrell Whitley, Marc Richards, Ross Beveridge, and Andre da Motta
Salles Barreto. Alternative evolutionary algorithms for evolving pro-
grams: Evolution strategies and steady state GP. In Proceedings of
the 8th Annual Conference on Genetic and Evolutionary Computation,
GECCO ’06, pages 919–926, New York, NY, USA, 2006. ACM.
[24] Sewall Wright. The roles of mutation, inbreeding, crossbreeding and selection in evolution. In Proceedings of the Sixth International Congress on Genetics, volume 1, pages 356–366, 1932.
[25] Huayang Xie. An Analysis of Selection in Genetic Programming.
PhD thesis, Computer Science, Victoria University of Wellington, New
Zealand, 2008.