To appear in: Proceedings of Genetic and Evolutionary Computation Conference (GECCO), 2002 (Coevolution Workshop)
Coevolutionary Construction of Features
for Transformation of Representation in Machine Learning
Bir Bhanu
Center for Research in Intelligent Systems
University of California
Riverside, CA 92521
bhanu@cris.ucr.edu
Krzysztof Krawiec
Institute of Computing Science
Poznan University of Technology
Piotrowo 3A, 60-965 Poznań, Poland
krawiec@cs.put.poznan.pl
Abstract
The main objective of this paper is to study the usefulness of cooperative coevolutionary algorithms (CCA) for improving the classification performance of machine learning (ML) classifiers, in particular those following the symbolic paradigm. For this purpose, we present a genetic programming (GP) based coevolutionary feature construction procedure. In the experimental part, we confront the coevolutionary methodology with a difficult real-world ML task with unknown internal structure and complex interrelationships between solution subcomponents (features), as opposed to the artificial problems usually considered in the literature.
1 INTRODUCTION
Representation of knowledge and external stimuli is the key issue in the area of intelligent systems design. An inappropriate representation of the external world may seriously limit the performance of an intelligent agent, whereas a carefully designed one can significantly improve its operation.
This principle particularly affects machine learning (ML), a branch of artificial intelligence dealing with automatic induction of knowledge from data (Langley, 1996; Mitchell, 1997). Many ML classifiers do not perform well on some problems due to their limited ability to construct an internal representation of external inputs, i.e., the values of attributes that describe the instances to be classified. This mostly affects classifiers implementing the symbolic paradigm of knowledge representation, like decision trees or decision rules.
On the other hand, there are several classification approaches that do not suffer from this deficiency. Most of them represent the non-symbolic and/or sub-symbolic paradigm of knowledge representation and processing. For instance, neural nets, discriminant analysis, and support vector machines are able to benefit from the synergy of selected attributes (e.g., by building linear combinations of their values). A rich internal representation allows for better discrimination between decision classes. Consequently, as far as the accuracy of classification is concerned, sub-symbolic ML systems often outperform the symbolic ones (see, for instance, (Lim et al., 2001)). However, the price we usually pay for that is the inexplicability of the knowledge acquired by the classifier and, in particular, the incomprehensibility of the internal representation.
In this paper, we continue our former research on GP-based change of representation for machine learners. The evolutionary process constructs new features, deriving them from the original ones, and searches for a suboptimal set of them. The evolving GP individuals encode feature definitions expressed as LISP-like expressions; the constructed features therefore have symbolic, comprehensible definitions. In particular, the central topic of this paper is the application of cooperative coevolution to the feature construction task.
The following sections outline the background of the proposed method (Section 2), present the related work (Section 3), and describe the method itself (Section 4). Then, an extensive computational experiment is described in Section 5 and its results are discussed in Section 6.
2 FEATURE CONSTRUCTION FOR CHANGE OF REPRESENTATION IN MACHINE LEARNING
The topics described in this study refer to several concepts and research directions known in artificial intelligence (AI), machine learning (ML), and related disciplines. The crucial role of representation has been appreciated in AI since its infancy. That awareness has also been present in the ML community (e.g., Chapter 1.3 of (Langley, 1996)), where the representation problem appears at least at two points: in the context of input representation and in the context of hypothesis representation (also referred to as hypothesis language representation). In this paper, the former is of interest.
In particular, we will focus here on the paradigm of learning from examples and the attribute-value representation of input data (Mitchell, 1997). In this setting, the original representation is made up of a vector of attributes (features, variables) F0 describing examples (instances, objects). This representation is the starting point for the process of representation transformation, which can be posed as follows: given the original vector of features F0 and the training set L, invent a derived representation F better than F0 with respect to some criteria. Undoubtedly, the criterion most often considered at this point is the predictive accuracy or, to be more precise, the accuracy of classification on the test set (denoted hereafter by T). The other measure mentioned relatively frequently is the size of the representation, i.e., |F|; however, in this paper we will focus on the former.
The approaches embedded in such an environment and reported in the literature can be roughly divided into three categories:
- Feature selection methods (also referred to as variable selection). Here, the resulting representation is a subset of the original one, i.e., F ⊆ F0.
- Feature weighting methods. In this case, the transformation method assigns weights to particular attributes (thus, formally, the representation does not change here, i.e., F = F0). The weight reflects the relative importance of an attribute and may be utilized in the process of inductive learning. Unfortunately, the group of inductive learning methods that can successfully benefit from these extra data is rather limited and encompasses mostly distance-based classifiers (see (Dash & Liu, 1997) for an extensive overview and experimental comparison of various methods).
- Feature construction methods. Here, new features are invented and defined (in some language) as expressions which refer to the values of the original ones.
In principle, each of the approaches listed here encompasses its predecessors. For instance, feature selection may be regarded as a special case of feature weighting with a constrained domain of weights (e.g., the set {0,1}). Analogously, feature construction usually does not forbid creating features that are 'clones' of the original attributes (i.e., F0 ⊆ F), which is essentially equivalent to feature selection.
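To make the containment between these categories concrete, the following minimal sketch (ours, not from the paper; the data values are made up and NumPy is assumed) shows that a weight vector restricted to {0,1} amounts to feature selection:

import numpy as np

# Two examples described by three attributes (illustrative values).
X = np.array([[1.52, 13.64, 4.49],
              [1.52, 13.89, 3.60]])

w_weighting = np.array([0.9, 0.1, 0.5])  # graded relative importance
w_selection = np.array([1.0, 0.0, 1.0])  # weights constrained to {0,1}

print(X * w_weighting)  # feature weighting: attributes rescaled
print(X * w_selection)  # degenerates to feature selection: column 2 vanishes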
The subject of this study is feature construction, as the most general and, therefore, the most promising approach to representation transformation. According to (Matheus, 1989), feature construction may be further subdivided into constructive compilation and constructive induction of features. Feature compilation consists in re-writing the original representation in a new, usually more compact way, so the result is logically equivalent to the original. Constructive induction (CI) goes further and takes into account the inductive character of learning from examples, which is inherently bound up with the incompleteness and/or limited representativeness of the training set. Therefore, CI enables building features that are not necessarily supported by the training set, but may potentially improve the predictive accuracy of the classifier.
A more precise taxonomy, depending on the type of control mechanism used for feature construction, has been introduced in (Michalski, 1983). In particular, in data-driven constructive induction (DCI) the input data (training examples) guide the feature construction process. In hypothesis-driven constructive induction (HCI), the feature construction process benefits from the form of the induced hypothesis. The methodology described further in this paper represents both of the aforementioned paradigms.
3 EVOLUTIONARY COMPUTATION FOR REPRESENTATION TRANSFORMATION
Evolutionary computation has been used in machine learning for quite a long time. Today it is recognized as a useful engine for many ML problems or even as one of ML's paradigms (Langley, 1996; Mitchell, 1997). It is highly appreciated due to its ability to perform global parallel search in the solution space with a low probability of getting stuck in local minima.
Evolutionary computation is basically applied to learning in one of the following ways:
- Individuals (or groups of individuals) implement complete ML classifiers.
- Evolutionary search is responsible for only a part of the learning process, for instance for feature selection.
Although the former case (complete concept induction) belongs to the most widely known applications of evolutionary computation in ML (with classifier systems (Goldberg, 1989) as probably the most prominent example; see also (De Jong et al., 1993)), some efforts have been made in the domain of representation transformation (see the previous section).
Most research on evolutionary change of representation for ML learners reported so far in the literature focuses on feature selection and feature weighting. In our previous study on GA-based feature selection and weighting (Komosinski & Krawiec, 2000), we noted a significant improvement in the accuracy of classification in an experiment concerning medical image analysis and feature extraction. The experiments reported in (Raymer et al., 2000) also proved the usefulness of GA-based feature selection and feature weighting, which yielded an increase in classification accuracy on three different ML domains while requiring only a fraction of the available attributes (however, the authors admit that they conducted some preliminary experiments to determine the run parameters). Other experiments, reported for instance in (Vafaie & Imam, 1994), led to similar conclusions.
The topic of evolutionary feature construction has received more modest attention in the literature. One of the first attempts to apply GP to feature construction for machine learners was reported in (Bensusan & Kuscu, 1996). In (Kishore et al., 2000), an evolutionary approach to multi-category pattern classification has been proposed and a GP-based classifier has been applied to remotely sensed satellite data.
4 COOPERATIVE COEVOLUTION FOR CHANGE OF REPRESENTATION
4.1 GENETIC PROGRAMMING FOR FEATURE CONSTRUCTION
In this study we employ GP for constructive induction of features. Therefore, we do not expect the evolutionary computation to do the entire task of concept induction. On the other hand, we attack the most advanced form of representation transformation (feature construction). Thus, the proposed methodology may be considered as located on the boundary between the two approaches mentioned above.
Our rationale for such a choice is as follows. Firstly, feature selection and weighting do not offer much as far as the change of representation is concerned. On the other hand, the experience we gained from experimenting with GP-based visual pattern classification (Krawiec, 2001) and GP-based ML feature construction (Krawiec, 2002) led us to the conclusion that in most real-world cases it is rather unreasonable to expect GP individuals to evolve into complete, well-performing classifiers, even for the two-class discrimination problem. Therefore, GP-based constructive induction of features seems to be a good compromise.
Let us introduce the background more formally. It is assumed that all examples x ∈ L are described by the set of features F0, which will be further referred to as original features (or variables), as opposed to the features constructed later in the search. The evolving individuals encode feature definitions f_j, j = 1..n. Each feature f_j ∈ F is defined by a LISP-like expression built from the values of the original features F0, in the manner usual for GP (Koza, 1994). Therefore, an individual consists of a vector of GP expressions. Given a training instance x ∈ L and the values of the original features F0 that describe it, an individual is able to compute the values of the feature(s) F it implements.
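As an illustration, the following sketch (ours, not from the paper; the attribute names and expressions are hypothetical) encodes an individual as a vector of LISP-like expressions over the original attributes and evaluates its constructed features on one instance:

import operator

# Expressions are nested tuples (op, arg1, arg2), attribute names (strings),
# or ephemeral random constants (numbers).
OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def evaluate(expr, x):
    """Compute the value of one constructed feature for instance x,
    given as a dict of original attribute values (F0)."""
    if isinstance(expr, tuple):
        op, a, b = expr
        return OPS[op](evaluate(a, x), evaluate(b, x))
    if isinstance(expr, str):
        return x[expr]          # reference to an original attribute
    return expr                 # ephemeral random constant

# An individual encoding n = 2 features: (+ Na (* Mg 2.5)) and (- Al Si).
individual = [('+', 'Na', ('*', 'Mg', 2.5)), ('-', 'Al', 'Si')]
x = {'Na': 13.64, 'Mg': 4.49, 'Al': 1.10, 'Si': 71.78}
print([evaluate(f, x) for f in individual])  # values of constructed features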
4.2 PROBLEM DECOMPOSITION
The task outlined above is undoubtedly a complex one. That complexity manifests itself, among others, in the fact that many features are required to obtain competitive accuracy of classification (fitness); when facing real-world problems, no one expects reasonable results from constructing just one feature. It is the features' synergy that makes the representation useful.
For at least a decade, coevolution has been reported as an interesting approach to handling the increasing complexity of problems posed in artificial intelligence and related disciplines. In particular, its collaborative variant, cooperative coevolutionary algorithms (CCA) (Potter & De Jong, 2000), besides being appealing from the theoretical viewpoint, has been reported to yield interesting results in some experiments (Wiegand et al., 2001).
These reports encouraged us to consider the use of CCA for the task of representation transformation by feature construction. In particular, we expected the CCA to cope better with feature development for inductive learners than a plain evolutionary algorithm.
The main question that arises at this point is what the general framework of competence sharing between the evolving species should be. In general, as the task here is representation transformation, each individual should be made responsible for a part of that task. Therefore, it should be equipped with a kind of input and output. Then, a particular scheme of cooperation may be conveniently represented in the form of a directed graph showing the interconnections between particular individuals. Although such a scheme could be arbitrary, the following two approaches seem to be the most canonical:
- parallel: the transformed representation consists of a set of features, with each species responsible for one feature;
- sequential: the representation transformation consists of a sequence of chained steps, with each species responsible for one step.
This study is exclusively devoted to the former of these approaches. In particular, each CCA species is responsible for developing one feature of the final representation, and each individual of a particular species implements a single feature. The selection of representatives follows the optimistic CCA-1 approach: each individual (feature) is evaluated in the context of the best individuals of the remaining subpopulations (species), as determined in the previous evaluation process. This method has been selected mostly due to the positive results reported in (Wiegand et al., 2001).
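In pseudocode terms, one CCA-1 evaluation round could look as follows (a sketch under our assumptions; the identifiers and fitness interface are hypothetical, and wrapper_fitness stands for the evaluation function described in Section 5.2):

def evaluate_generation(subpops, representatives, wrapper_fitness):
    """CCA-1 sketch: evaluate every individual of every species in the
    context of the best individuals (representatives) of the other
    species from the previous evaluation round."""
    for k, subpop in enumerate(subpops):
        for ind in subpop:
            # Joint solution: this individual plus the representatives
            # of all remaining species.
            solution = [ind if j == k else representatives[j]
                        for j in range(len(subpops))]
            ind.fitness = wrapper_fitness(solution)
        # The best individual found now represents species k next round.
        representatives[k] = max(subpop, key=lambda i: i.fitness)
    return representatives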
5 EXPERIMENTAL EVALUATION
The described methodology has been verified in an extensive computational experiment. Two primary objectives of the experiment were: (i) to explore the usefulness of genetic programming-based construction of features, and (ii) to compare the cooperative coevolution (GP-CCA) with the standard approach (GP) on a real-world, difficult data set.
5.1 THE DATA
The experimental data was the GLASS benchmark from the Irvine repository of ML databases (Blake & Merz, 1998). Its training set L and test set T contain 142 and 72 examples, respectively. Each example describes, by means of 9 numeric attributes, selected physicochemical properties of a glass sample. The task of the machine learner is to identify the glass type (float window, non-float window, container, tableware, headlamp) for the purpose of criminological investigation. The decision classes are highly imbalanced: the majority class occupies 35.6% of the database (this is also the accuracy of classification of the so-called default classifier, which is a worst-case reference value for the experimental results), whereas the least representative one covers only 4.2%. There are no missing values of attributes. All conditional attributes were normalized before starting the experiment to make their values reasonably comparable in GP expressions.
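A minimal sketch of such preprocessing, assuming min-max normalization (the paper does not specify the exact scheme) and NumPy:

import numpy as np

def normalize(X):
    """Rescale each attribute column to [0, 1] so that magnitudes are
    comparable inside GP expressions."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + 1e-12)  # guard against constant columns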
5.2 EXPERIMENT DESIGN AND PARAMETER SETTINGS
A single experiment consisted of the following steps:
1. Performing evolutionary construction of features using the training data L, the original attributes F0, and the set of GP terminals and nonterminals described further in this section.
2. Inducing the classifier from the training set L, using the features constructed in the evolutionary search (F).
3. Testing the induced classifier on the external test set T.
To make the results statistically significant, this scheme was repeated 20 times for different initial populations and for each considered number of features n (n = 2, ..., 9). The settings of the evolutionary run were almost the same as the defaults provided in the ECJ package (Luke, 2001). The function set used was rather limited and included +, -, *, % (protected division), LOG (logarithmic function), and LT, GT, and EQ (arithmetic comparison operators). The terminal set encompassed the ephemeral random constant and the original attributes from F0. Weak typing has been used, so no constraints were imposed on the mutation and crossover operators, except for those concerning an individual's maximal depth (set to 5 for both crossover and mutation). Standard tournament selection with tournament size 7 and common GP recombination operations were applied. Each run was terminated after 50 generations.
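The following sketch (ours; the exact conventions for protected division and LOG are assumptions, as the paper does not spell them out) shows one plausible implementation of this function set:

import math

def pdiv(a, b):   # '%': protected division
    return a / b if b != 0 else 1.0

def plog(a):      # LOG: protected logarithm
    return math.log(abs(a)) if a != 0 else 0.0

FUNCTIONS = {
    '+': lambda a, b: a + b,
    '-': lambda a, b: a - b,
    '*': lambda a, b: a * b,
    '%': pdiv,
    'LOG': plog,
    'LT': lambda a, b: 1.0 if a < b else 0.0,  # arithmetic comparisons
    'GT': lambda a, b: 1.0 if a > b else 0.0,
    'EQ': lambda a, b: 1.0 if a == b else 0.0,
}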
For a given number of features n and population size s = 100, a pair of corresponding GP and GP-CCA experiments consisted of:
- a GP run involving one population of size s, with each individual encoding n features, and
- a GP-CCA run involving n subpopulations, each of size s, with each individual encoding one feature.
Such a design of the experiment ensures that the number of features potentially considered during the evolutionary search is the same (n×s) for GP and GP-CCA, so the approaches have equal chances in searching the space of features. (It should be noted, however, that for GP-CCA the number of calls of the global fitness function (the wrapper) is n times greater than for GP. Therefore, unless some sophisticated programming tricks are used, the computing times are significantly longer for GP-CCA.)
In GP-CCA, each individual is evaluated in the context of the best individuals (with respect to the previous evaluation) representing the remaining species, i.e., according to the CCA-1 scheme (see, for instance, (Wiegand et al., 2001)). After each generation, we estimate the contribution of each subpopulation to the entire solution based on the fitness differential of the best individual. Subpopulations that do not contribute to the solution (or even deteriorate its evaluation) are re-initialized.
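One possible reading of that rule, sketched under our assumptions (the paper does not detail how the differential is computed):

def reinit_stagnant(subpops, prev_best_fitness, curr_best_fitness, random_individual):
    """Re-initialize every species whose best individual no longer improves
    (or even deteriorates) the evaluation of the entire solution."""
    for k in range(len(subpops)):
        contribution = curr_best_fitness[k] - prev_best_fitness[k]  # fitness differential
        if contribution <= 0:
            subpops[k] = [random_individual() for _ in subpops[k]]
    return subpops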
The evaluation of the entire solution (i.e., the vector of n features implemented by one GP individual or by n cooperating GP-CCA individuals) relies on the so-called wrapper methodology (Kohavi & John, 1997). The values of the features defined by the evaluated solution are computed for all training examples from L, which produces a new, derived dataset. Then, a multiple train-and-test experiment (here: 3-fold cross-validation) is carried out using an inductive learning algorithm, and the resulting average accuracy of classification becomes the evaluation. For this purpose, the C4.5 decision tree inducer (Quinlan, 1992), as implemented in WEKA (Witten & Frank, 1999) with default settings (decision tree pruning on, pruning confidence level 0.25), has been used. The choice of this particular inducer was motivated by its relatively low computational complexity (of the training algorithm as well as of the querying process), the readability of the induced hypotheses, and its popularity in the ML community.
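A minimal sketch of this wrapper evaluation, substituting scikit-learn's CART-style decision tree for the original C4.5/WEKA inducer (the evaluate interpreter for feature expressions is an assumption, e.g., as sketched in Section 4.1):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_fitness(feature_defs, X_raw, y, evaluate):
    """Fitness of a solution = mean 3-fold CV accuracy of a decision tree
    trained on the derived dataset built from the constructed features."""
    X_new = np.array([[evaluate(f, x) for f in feature_defs] for x in X_raw])
    scores = cross_val_score(DecisionTreeClassifier(), X_new, y, cv=3)
    return scores.mean()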
According to (Kohavi & John, 1997) and (Dash & Liu, 1997), the wrapper approach has the advantage of taking into account the inductive bias of the classifier used in the evaluation function. It also supports the discovery of synergy between particular attributes, as opposed to some local approaches where particular features are evaluated on an individual basis (see (Dash & Liu, 1997) for a comparison of different feature selection strategies).
5.3 PRESENTATION OF RESULTS
Figures 1 and 2 present the performance of the GP and GP-CCA approaches on the training and test set, respectively. For the training set (Fig. 1) this is the fitness of the best individual of the run, whereas for the test set (Fig. 2) it is the accuracy of classification obtained on the test set T by the C4.5 algorithm trained using the best evolved representation (see the explanation at the beginning of Section 5). The charts are drawn as a function of the number n of constructed features. Both charts show means over 20 genetic runs with bars depicting 0.95 confidence intervals.
Figure 1: GP-CCA versus GP on the training set.
Figure 2: GP-CCA versus GP on the test set.
6 CONCLUSIONS
The results presented in Figure 1 clearly prove that the coevolutionary search of the representation space (GP-CCA) constructs better features with respect to the set of fitness cases (the training set L) than 'vanilla' GP. Except for n=2 and n=3, which are apparently too small numbers of features to build a reasonable representation on, GP-CCA outperforms GP with respect to Student's t-test at the confidence level 0.02. When comparing both methods in a broader view, Wilcoxon's rank test on the averages produces p=0.012.
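A minimal sketch of this statistical protocol (the data below is made up for illustration, and the paper does not state whether the t-test was paired; an unpaired test is shown):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gp_runs = rng.normal(63.0, 3.0, size=20)    # 20 GP accuracies at one n
cca_runs = rng.normal(65.0, 3.0, size=20)   # 20 GP-CCA accuracies at one n
print(stats.ttest_ind(cca_runs, gp_runs))   # per-n comparison

gp_means = rng.normal(62.0, 2.0, size=8)    # per-n averages, n = 2..9
cca_means = rng.normal(64.0, 2.0, size=8)
print(stats.wilcoxon(cca_means, gp_means))  # broader-view comparison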
Figure 2 allows us to state that the superiority of GP-CCA also concerns the performance on the test set T, which, let us stress again, was 'invisible' to the evolutionary run. This time the 0.02 t-test significance holds only for n=4 and n=7, but the general tendency remains in favor of the coevolutionary approach (Wilcoxon's p=0.017).
The more general conclusion of this experiment is that the representations learned by means of GP-based feature construction often outperform the original representation F0 as far as the predictive accuracy is concerned (C4.5 yields 62.5% accuracy of classification on the test set T when using F0). In particular, this statement is true for n ≥ 4 in the case of GP-CCA and for n = 6, 8, and 9 in the case of GP. Note also that most of the increases have been obtained by means of a more compact representation (in all runs, n does not exceed the size of the original representation |F0| = 9). Larger increases are likely to be achieved after more precise parameter tuning.
Let us note that the constructed features give extra insight into the knowledge hidden in the dataset. Carefully selected feature definitions (S-expressions) could be used for explanatory purposes after some rewriting, verification, and expert assessment. For some examples of such features, see (Krawiec, 2002).
As far as CCA-related issues are concerned, a more detailed analysis is required to investigate and explain the cooperation patterns and dynamics taking place in CCA-GP feature construction. In particular, the cooperation scheme seems to be more complex here than in other studies related to the topic. We base this hypothesis on the fact that the collaboration of separately coevolved features takes place through the mediation of the inductive learning algorithm implemented in the fitness function. The features developed by particular subpopulations are selectively utilized by the decision tree inducer embedded in the fitness function (C4.5), rather than being just put together to build up the solution. The inducer uses particular features at different stages of the top-down decision tree construction, so the effective contribution of a particular feature to the final fitness value is partially determined by its location in that tree. Therefore, it is mostly the C4.5 inductive bias that guides the search for promising synergies between representation components.
The results obtained also show that overfitting is still a challenge for feature-constructing learners. For both the GP and GP-CCA approaches, the external test set accuracy is much worse (by ca. 10%) than the estimate produced by the wrapper-based fitness function. This is due to the infamous 'curse of dimensionality': the presence of new features increases the dimensionality of the hypothesis space; instead of one representation space, we consider many of them. Consequently, the inducer is very prone to overfitting, as it has many more 'degrees of freedom'. The potential benefits that overfitting prevention may draw from the cooperative coevolution methodology might be another interesting topic for future research.
[Figure 1 plots the fitness of the best individual of the run [%] against n (the number of constructed features, 2-9) for GP and GP-CCA; Figure 2 plots the accuracy of classification on the test set [%] against n for the same two methods.]
Acknowledgments
We would like to thank the authors of the software packages ECJ (Evolutionary Computation in Java) (Luke, 2001) and WEKA (Witten & Frank, 1999) for making their software publicly available. This work has been supported by the State Committee for Scientific Research, from KBN research grant no. 8T11F 006 19, and by the Foundation for Polish Science, from subsidy no. 11/2001.
References
H.N. Bensusan and I. Kuscu, "Constructive induction using genetic programming," in T. Fogarty and G. Venturini (eds.), Proc. Int. Conf. Machine Learning, Evolutionary Computing and Machine Learning Workshop, 1996.
C.L. Blake and C.J. Merz, "UCI Repository of machine learning databases" [http://www.ics.uci.edu/~mlearn/MLRepository.html], University of California: Irvine, CA, 1998.
M. Dash and H. Liu, "Feature Selection for Classification," Intelligent Data Analysis, vol. 1(3), pp. 131-156, 1997.
K.A. De Jong, An analysis of the behavior of a class of genetic adaptive systems, Doctoral dissertation, University of Michigan: Ann Arbor, 1975.
K.A. De Jong, W.M. Spears, and D.F. Gordon, "Using genetic algorithms for concept learning," Machine Learning, vol. 13, pp. 161-188, 1993.
D. Goldberg, Genetic algorithms in search, optimization and machine learning, Addison-Wesley: Reading, 1989.
J.H. Holland, Adaptation in natural and artificial systems, University of Michigan Press: Ann Arbor, 1975.
J.K. Kishore, L.M. Patnaik, V. Mani, and V.K. Agrawal, "Application of Genetic Programming for Multicategory Pattern Classification," IEEE Trans. Evolutionary Computation, vol. 4(3), pp. 242-258, 2000.
M. Komosinski and K. Krawiec, "Evolutionary weighting of image features for diagnosing of CNS tumors," Artificial Intelligence in Medicine, vol. 19(1), pp. 25-38, 2000.
R. Kohavi and G.H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97(1-2), pp. 273-324, 1997.
J.R. Koza, Genetic programming II, MIT Press: Cambridge, 1994.
K. Krawiec, "Pairwise Comparison of Hypotheses in Evolutionary Learning," in Proc. Int. Conf. Machine Learning, C.E. Brodley and A. Pohoreckyj Danyluk (eds.), Morgan Kaufmann: San Francisco, 2001, pp. 266-273.
K. Krawiec, "Genetic Programming with Local Improvement for Visual Learning from Examples," in Computer Analysis of Images and Patterns (LNCS 2124), W. Skarbek (ed.), Springer: Berlin, 2001, pp. 209-216.
K. Krawiec, "Genetic Programming-based Construction of Features for Machine Learning and Knowledge Discovery Tasks," Genetic Programming and Evolvable Machines, 2002 (in press).
P. Langley, Elements of machine learning, Morgan Kaufmann: San Francisco, 1996.
T.-S. Lim, W.-Y. Loh, and Y.-S. Shih, "A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms," Machine Learning, 2001 (to appear).
S. Luke, ECJ 7: An EC and GP system in Java, http://www.cs.umd.edu/projects/plus/ec/ecj/, 2001.
C.J. Matheus, "A constructive induction framework," in Proceedings of the Sixth International Workshop on Machine Learning, Ithaca, NY, 1989, pp. 474-475.
R.S. Michalski, "A theory and methodology of inductive learning," Artificial Intelligence, vol. 20, pp. 111-161, 1983.
M. Mitchell, An introduction to genetic algorithms, MIT Press: Cambridge, MA, 1996.
T.M. Mitchell, Machine learning, McGraw-Hill: New York, 1997.
M.A. Potter and K.A. De Jong, "Cooperative Coevolution: An Architecture for Evolving Coadapted Subcomponents," Evolutionary Computation, vol. 8(1), pp. 1-29, 2000.
J.R. Quinlan, C4.5: Programs for machine learning, Morgan Kaufmann: San Mateo, 1992.
M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn, and A.K. Jain, "Dimensionality Reduction Using Genetic Algorithms," IEEE Trans. on Evolutionary Computation, vol. 4(2), pp. 164-171, 2000.
H. Vafaie and I.F. Imam, "Feature selection methods: genetic algorithms vs. greedy-like search," in Proceedings of the International Conference on Fuzzy and Intelligent Control Systems, 1994.
R.P. Wiegand, W.C. Liles, and K.A. De Jong, "An Empirical Analysis of Collaboration Methods in Cooperative Coevolutionary Algorithms," in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), Morgan Kaufmann: San Francisco, 2001, pp. 1235-1242.
I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann: San Francisco, 1999.
J. Yang and V. Honavar, "Feature subset selection using a genetic algorithm," in Feature extraction, construction, and subset selection: A data mining perspective, H. Motoda and H. Liu (eds.), Kluwer Academic: New York, 1998.