Completely Automated CNN Architecture Design
Based on Blocks
Yanan Sun, Member, IEEE, Bing Xue, Member, IEEE, Mengjie Zhang, Fellow, IEEE,
and Gary G. Yen, Fellow, IEEE
Abstract—The performance of Convolutional Neural Networks
(CNNs) highly relies on their architectures. In order to design
a CNN with promising performance, extensive expertise in
both CNNs and the investigated problem domain is required,
which is not necessarily available to every interested user. To
address this problem, we propose to automatically evolve CNN
architectures by using a genetic algorithm based on ResNet
and DenseNet blocks. The proposed algorithm is completely
automatic in designing CNN architectures. In particular, neither
pre-processing before it starts nor post-processing in terms of
CNNs is needed. Furthermore, the proposed algorithm does not
require users with domain knowledge on CNNs, the investigated
problem or even genetic algorithms. The proposed algorithm is
evaluated on the CIFAR10 and CIFAR100 benchmark datasets
against 18 state-of-the-art peer competitors. Experimental results
show that the proposed algorithm outperforms state-of-the-art
hand-crafted CNNs and CNNs designed by automatic peer competitors
in terms of the classification performance, and achieves
a competitive classification accuracy against semi-automatic peer
competitors. In addition, the proposed algorithm consumes much
less computational resource than most peer competitors in finding
the best CNN architectures.
Index Terms—Convolutional neural networks, genetic algo-
rithms, evolutionary deep learning, automatic architecture design,
neural networks.
I. INTRODUCTION
Convolutional Neural Networks (CNNs) [1] have
been showcasing their promising performance on various
real-world applications [2]–[5]. It has been known that the per-
formance of CNNs highly depends on their architectures, such
as how many building-block layers (e.g., the convolutional and
pooling layers) are used, how the used building-block layers
are composed, and how the parameters related to the used
building-block layers are specified.
This work was supported in part by the National Natural Science Foundation
of China under Grant 61803277, in part by the Fundamental Research Funds
for the Central Universities, in part by the National Natural Science Fund of
China for Distinguished Young Scholar under Grant 61625204, and in part by
the Marsden Fund of New Zealand Government under Contracts VUW1209,
VUW1509 and VUW1615, Huawei Industry Fund E2880/3663, and the
University Research Fund at Victoria University of Wellington 209862/3580,
and 213150/3662.
Yanan Sun is with the College of Computer Science, Sichuan University,
Chengdu 610065, China, and also with the School of Engineering and
Computer Science, Victoria University of Wellington, Wellington 6140, New
Zealand (e-mail:
Bing Xue, and Mengjie Zhang are with the School of Engineer-
ing and Computer Science, Victoria University of Wellington, PO Box
600, Wellington 6140, New Zealand (e-mail:; and
Gary G. Yen is with the School of Electrical and Computer Engi-
neering, Oklahoma State University, Stillwater, OK 74078 USA (email:
Generally, given a CNN, denoted by $A$, having $n$ architecture-related
parameters $\lambda_1, \cdots, \lambda_n$ whose decision spaces are
$\Lambda_1, \cdots, \Lambda_n$, respectively, the CNN architecture design
is to optimize the problem formulated by (1):
$$\arg\min_{\boldsymbol{\lambda}} \; \mathcal{L}(A_{\boldsymbol{\lambda}}, D_{train}, D_{valid}) \quad \text{s.t.} \quad \boldsymbol{\lambda} \in \boldsymbol{\Lambda} \tag{1}$$
where $\boldsymbol{\lambda} = \{\lambda_1, \cdots, \lambda_n\}$,
$\boldsymbol{\Lambda} = \Lambda_1 \times \cdots \times \Lambda_n$,
$A_{\boldsymbol{\lambda}}$ denotes the CNN $A$ adopting the architecture
parameter setting $\boldsymbol{\lambda}$, and $\mathcal{L}(\cdot)$ measures
the performance of $A_{\boldsymbol{\lambda}}$ on the validation data
$D_{valid}$ after $A_{\boldsymbol{\lambda}}$ has been trained on the
training data $D_{train}$. In the case of classification tasks,
$\mathcal{L}(\cdot)$ measures the classification error of the tasks to
which $A$ is applied. Typically, gradient-based algorithms, such as
stochastic gradient descent [6], are employed to train the weights of
$A_{\boldsymbol{\lambda}}$, as $\mathcal{L}(\cdot)$ is differentiable
(or approximately differentiable) with respect to the weights.
Unfortunately, with respect to the architecture-related parameters,
$\mathcal{L}(\cdot)$ is often non-convex and non-differentiable because
these parameters usually have discrete values, e.g., the feature
map sizes of convolutional layers are generally specified as
integers. Consequently, the exact optimization algorithms (e.g.,
the gradient-based algorithms) are incapable of or ineffective
in solving the architecture optimization problem [7], [8].
As a result, researchers have proposed various architecture
optimization algorithms based on the heuristic computational
paradigms [9], such as random search [10], Bayesian-based
Gaussian process [11], [12], tree-structured Parzen estima-
tors [13], sequential model-based global optimization [14],
neuroevolution of augmenting topologies [15], evolutionary
unsupervised deep learning [8], etc. However, in CNN architecture
optimization, it is impossible to know the optimal number of
layers in advance, e.g., the particular value of $n$ in
$\boldsymbol{\lambda}$ needed to compose the best CNN architecture;
i.e., the number of decision variables for an optimal CNN
architecture is also unknown before the best CNN architecture is
found. This prevents the aforementioned architecture optimization
methods from being used effectively and efficiently for CNN
architecture design, because they work under the assumption
that the number of optimized parameters is fixed. Although
we could enumerate each potential value of $n$ and then run
these methods for each different $n$, the run-time computational
complexity would increase by an order of magnitude as $n$ grows,
and satisfactory solutions might not be obtained within an
acceptable time [16].
Due to this, state-of-the-art CNNs such as ResNet [17] and
DenseNet [18] are primarily hand-crafted. Designing CNNs
manually requires considerable expertise in CNN architecture,
as well as in the problem domain. This is often not available
in practice. For example, a medical doctor could find a CNN
extremely useful in evaluating the results of a Magnetic
Resonance Imaging (MRI) scan. While the doctor clearly
has expertise in the problem domain, they are very unlikely
to have comparable experience in CNN architectures. This
barrier has prevented CNNs from being utilised in a variety of
image classification tasks. There is a significant demand for
algorithms which are able to effectively and efficiently design¹
CNN architectures without requiring such expertise.
Fortunately, in the last two years, multiple algorithms de-
veloped for designing CNN architectures have been proposed.
Based on whether the pre- or post-processing in terms of
CNNs is required when these algorithms are used, they can be
divided into two different categories: the semi-automatic CNN
architecture design algorithms and the completely automatic
ones. Particularly, the semi-automatic algorithms cover the
genetic CNN method (Genetic CNN) [19], the hierarchical
representation-based method (Hierarchical Evolution) [20], the
efficient architecture search method (EAS) [21], and the
block design method (Block-QNN-S) [22], to name a few.
The automatic algorithms include the large-scale evolution
method (Large-scale Evolution) [23], the Cartesian genetic pro-
gramming method (CGP-CNN) [24], the neural architecture
search method (NAS) [25], and the meta-modelling method
(MetaQNN) [26]. These algorithms are mainly based on
evolutionary algorithms [27] or reinforcement learning [28].
For example, Genetic CNN, Large-scale Evolution, Hierar-
chical Evolution and CGP-CNN are based on evolutionary
algorithms, while NAS, MetaQNN, EAS and Block-QNN-S
are built on reinforcement learning.
Experimental results from these algorithms have shown their
promising performance in finding the best CNN architectures
on the given data. However, major limitations remain. Firstly,
the expertise in the investigated data and CNNs is still needed
by the semi-automatic CNN architecture design algorithms.
For example, EAS takes effect on a base network which
already has a fairly good performance on the investigated
problem. However, the base network is manually designed
based on expertise. Block-QNN-S only designs several small
networks, and these networks are then integrated into a larger
CNN framework. However, the other types of layers, such
as the pooling layers, need to be properly assimilated into
the CNN framework with expertise. Secondly, the CNN ar-
chitecture design algorithms based on reinforcement learning
typically consume much more computational resource. For
instance, NAS consumes 28 days on 800 Graphics Processing
Unit (GPU) cards for the CIFAR10 dataset [29]. However,
sufficient computational resource is not necessarily available to
every interested user. Thirdly, the CNN architecture design
algorithms based on evolutionary algorithms exploit only part of
the principled merits of evolutionary algorithms, so the CNNs
they find usually do not achieve promising performance on the
investigated problems. For example, Genetic CNN employs a
fixed-length encoding scheme to represent CNNs. However, the
best depth of the CNN for solving a new problem is never known
in advance. To address this, Large-scale Evolution utilizes a
variable-length encoding scheme where the CNNs can adaptively
change their depths for the problems. However, Large-scale
Evolution uses only the mutation operator, without any crossover
operator, during the search process. In evolutionary algorithms,
the crossover operator and the mutation operator play the
complementary roles of local search and global search,
respectively. Without the crossover operator, the mutation
operator works just like random search from different start
positions. Nevertheless, it is not surprising that Large-scale
Evolution does not use the crossover operator, since the
crossover operator was originally designed for fixed-length
encoding schemes.
¹In this paper, the terms “design”, “find”, “learn” and “evolve” have
identical meaning when used to describe “CNN architectures”.
In summary, the development of CNN architecture design
algorithms, especially completely automatic ones that achieve
promising performance with limited computational resource,
is still in its infancy. The aim of this paper is to design and
develop a new genetic algorithm-based algorithm to automatically
design CNN architectures by addressing the limitations discussed
above. To achieve this goal, the objectives below have been
specified:
• The proposed algorithm does not mandate any prerequisite
knowledge from the users in base CNN design, the investigated
dataset or genetic algorithms. The CNN whose architecture is
designed by the proposed algorithm can be directly used without
any re-composition, pre-processing, or post-processing.
• The variable-length encoding scheme is employed for
searching the optimal depth of the CNN. To accommodate the
variable-length encoding, a new crossover operator and a
mutation operator are designed and incorporated into the
proposed algorithm to collectively exploit and explore the
search space in finding the best CNN architectures.
• An efficient encoding strategy is designed based on
the ResNet and DenseNet blocks to speed up the architecture
design, so that only limited computational resource is
needed while promising performance can still be achieved
by the proposed algorithm. Note that, although
the ResNet and DenseNet blocks are used in the proposed
algorithm, the users are not required to have expertise in
these blocks when they are using the proposed algorithm.
The remainder of the paper is organized as follows. The
background related to base knowledge of the proposed al-
gorithm is introduced in Section II. Then, the details of the
proposed algorithm are documented in Section III. To evaluate
the performance of the proposed algorithm, the experiment
design and the numerical results are shown in Sections IV
and V, respectively. Finally, the conclusions and future work
are summarized in Section VI.
II. BACKGROUND
As highlighted in Section I, the proposed algorithm is a novel
Genetic Algorithm (GA) that automatically designs CNN
architectures by using the blocks of ResNet and DenseNet, which
are state-of-the-art manually designed CNNs. In order to help
readers easily understand the details of the proposed algorithm
to be shown in Section III, the fundamentals of GAs, ResNet
Blocks (RBs) and DenseNet Blocks (DBs) are discussed in this
section.
Fig. 1. An example of the ResNet block (RB).
Fig. 2. An example of the DenseNet block (DB) including four convolutional layers.
A. Genetic Algorithms
GAs [30] are a class of heuristic population-based computational
paradigms. They are also the most popular type of evolutionary
algorithms (evolutionary algorithms broadly include
genetic programming [31], evolution strategies [32] and so
on, in addition to GAs). Because they are gradient-free
and insensitive to local minima, GAs are preferred
especially in engineering fields where the optimization problems
are commonly non-convex and non-differentiable [33],
[34]. GAs address optimization problems by imitating
biological evolution through a series of bio-inspired operators,
such as crossover, mutation and selection [35], [36]. Generally,
a GA works as follows:
Step 1: Initialization of a population of individuals each of
which represents a candidate solution of the problem
through the employed encoding strategy;
Step 2: Evaluation of the fitness of each individual in the
population based on the encoded information and the
fitness function;
Step 3: Mating selection of promising parent individuals from
the current population, and then generate offspring
with crossover and mutation operators;
Step 4: Evaluation of the fitness of the generated offspring;
Step 5: Environmental selection of a population of individuals
with promising performance from the current popula-
tion, and then replace the current population by the
selected population;
Step 6: Go to Step 3 if the termination criterion is not met;
otherwise return the individual with the best fitness as
the best solution for the problem.
Commonly, a maximal generation number is predefined as the
termination criterion.
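The six steps above can be sketched in a few lines of Python. This is a minimal illustration on a toy bit-string problem, not the AE-CNN implementation: the function names (`run_ga`, `onemax_demo`), the default parameter values, and the truncation-style environmental selection are all assumptions made for the example.

```python
import random

def run_ga(fitness, init, crossover, mutate, pop_size=20, generations=30, seed=0):
    """Generic GA loop; every operator is supplied by the caller."""
    rng = random.Random(seed)
    # Steps 1-2: initialise the population and evaluate its fitness.
    scored = [(fitness(ind), ind) for ind in (init(rng) for _ in range(pop_size))]
    for _ in range(generations):  # Step 6: fixed generation budget
        offspring = []
        while len(offspring) < pop_size:
            # Step 3: mating selection (binary tournament), then variation.
            p1 = max(rng.sample(scored, 2), key=lambda s: s[0])[1]
            p2 = max(rng.sample(scored, 2), key=lambda s: s[0])[1]
            for child in crossover(p1, p2, rng):
                offspring.append(mutate(child, rng))
        # Step 4: evaluate the offspring.
        scored_off = [(fitness(c), c) for c in offspring[:pop_size]]
        # Step 5: environmental selection. Keeping the best pop_size of
        # parents plus offspring is an assumption made for this sketch.
        scored = sorted(scored + scored_off, key=lambda s: s[0], reverse=True)[:pop_size]
    return scored[0]  # (best fitness, best individual)

def onemax_demo(n=16):
    """Toy problem: maximise the number of ones in a bit string."""
    def init(rng):
        return [rng.randint(0, 1) for _ in range(n)]
    def crossover(a, b, rng):
        cut = rng.randrange(1, n)
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    def mutate(ind, rng):
        return [1 - g if rng.random() < 1.0 / n else g for g in ind]
    return run_ga(sum, init, crossover, mutate)
```

Calling `onemax_demo()` returns a `(fitness, individual)` pair; swapping in different `init`, `crossover`, `mutate` and `fitness` callables changes the problem without touching the loop.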
B. ResNet and DenseNet Blocks
ResNet [17] and DenseNet [18] are two state-of-the-art
CNNs proposed in recent years. The success of ResNet and
DenseNet largely owes to their building blocks, i.e., RBs and
DBs, respectively.
Fig. 1 shows an example of an RB, which is composed of
three convolutional layers² and one skip connection. In this
example, the convolutional layers are denoted as conv1, conv2
and conv3. On conv1, the spatial size of the input is reduced
by a smaller number of filters with the size of 1×1, to lower
the computational complexity of conv2. On conv2, filters with
a larger size, such as 3×3, are used to learn features with
the same spatial size. On conv3, filters with the size of 1×1
are used again, and the spatial size is increased for generating
more features. The input is added, denoted by ⊕, to the output
of conv3 as the final output of the RB. Note that if the
spatial sizes of the input and conv3’s output are unequal, a group
of convolutional operations with filters of size 1×1 is
applied on the input, to achieve the same spatial size as that
of conv3’s output, for the addition.
Fig. 2 exhibits an example of a DB. For the convenience
of the introduction, we give only four convolutional layers in
the DB. In practice, a DB can have a different number of
convolutional layers, which is tuned by users. In a DB, each
convolutional layer receives inputs from not only the input data
but also the output of all the previous convolutional layers. In
addition, there is a parameter, k, for controlling the spatial
size of the input and output of the same convolutional layer.
If the spatial size of the input is a, then the spatial size of
the output is a+k, which is achieved by the convolutional
operation using the corresponding number of filters.
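The growth rule a → a + k can be made concrete with a small bookkeeping sketch. It reads the paper's "spatial size" as the number of feature maps entering and leaving a layer, which is our interpretation, and the function names are hypothetical.

```python
def dense_block_sizes(a, k, num_layers):
    """Sizes before the first layer and after each layer of a DB."""
    sizes = [a]
    for _ in range(num_layers):
        sizes.append(sizes[-1] + k)  # each layer grows the size by k
    return sizes

def layers_from_sizes(in_size, out_size, k):
    """Recover the number of convolutional layers from in, out and k."""
    grown = out_size - in_size
    if k <= 0 or grown < 0 or grown % k != 0:
        raise ValueError("out_size must equal in_size plus a multiple of k")
    return grown // k
```

For instance, `dense_block_sizes(64, 12, 4)` gives `[64, 76, 88, 100, 112]`, and `layers_from_sizes(64, 112, 12)` recovers the four layers.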
Efforts in [37], [38] have been put on investigating the
mechanism behind the success of RBs and DBs, and revealed
that RBs and DBs are able to mitigate the adverse impact of
the gradient vanishing problem [39], based on which a deep
architecture is capable of effectively learning the hierarchical
representations of the input data, and then improving the
final classification accuracy in turn. In addition, the dense
connections in DBs have also been claimed to be able to
reuse the low-level features, to increase the discrimination of
features learned at the top layers of CNNs [18]. Mainly based
on these good characteristics, RBs and DBs are chosen as the
building blocks in the proposed algorithm.
III. THE PROPOSED ALGORITHM
In this section, the framework of the proposed algorithm
and its main components are discussed in detail. For the
convenience of the development, the proposed algorithm is
named AE-CNN (Automatically Evolving CNNs) in short, and
the evolved CNN is used solely for image classification tasks.
A. Algorithm Overview
Algorithm 1 shows the framework of AE-CNN, which is
composed of three parts. Firstly, the population is randomly
initialized with a predefined size of N (line 1). Then, the
individuals are evaluated for the fitness (line 2). Next, all
individuals in the population take part in the evolutionary process
of the GA with the maximal generation number of T (lines 3-14).
Finally, the best CNN architecture is decoded from the
best individual that is chosen from the final population based
on the fitness (line 15). During the evolutionary process, an
empty population is initialized for including offspring (line 5),
and then new offspring are generated from selected parents
with the crossover and mutation operations, while the parents
are selected by the binary tournament selection (lines 6-10);
after the fitness of the generated offspring has been evaluated
(line 11), a new population is selected with the environmental
selection operation (line 12) from the current population
(containing the current individuals and the generated offspring)
as the parent solutions surviving into the next evolutionary
process (i.e., the next generation). Note that the symbol
|·| shown in line 6 is a cardinality operator. In the following
subsections, the phases of “Population Initialization,” “Fitness
Evaluation,” “Offspring Generation” and “Environmental
Selection” are documented in Subsections III-B, III-C, III-D and
III-E, respectively.
Algorithm 1: Framework of AE-CNN
Input: The population size N, the maximal generation
       number T, the crossover probability µ, the
       mutation probability ν.
Output: The best CNN.
1  P_0 ← Initialize a population with the size of N by
         using the proposed encoding strategy;
2  Evaluate the fitness of individuals in P_0;
3  t ← 0;
4  while t < T do
5      Q_t ← ∅;
6      while |Q_t| < N do
7          p1, p2 ← Select two parent individuals from P_t
                    by using binary tournament selection;
8          q1, q2 ← Generate two offspring by p1 and p2
                    by crossover operation with the probability of
                    µ and mutation operation with the probability
                    of ν;
9          Q_t ← Q_t ∪ q1 ∪ q2;
10     end
11     Evaluate the fitness of individuals in Q_t;
12     P_{t+1} ← Select N individuals from P_t ∪ Q_t by
                 environmental selection;
13     t ← t + 1;
14 end
15 Select the best individual from P_t and decode it to the
   corresponding CNN.
²Here we only detail this type of blocks which is used to build deeper
networks. Indeed, ResNet also has another type of block which is typically
used for building networks with no more than 34 layers.
B. Population Initialization
Population initialization provides a base population contain-
ing multiple individuals for the following evolutionary process.
Generally, all the individuals are initialized in a random
manner with a uniform distribution. As introduced in
Subsection II-A, each individual in GAs represents a
candidate solution of the problem to be solved. Because GAs
in the proposed algorithm are employed to find the best CNN
architecture, each individual in the proposed algorithm should
represent a CNN architecture. Generally, the architecture of a
CNN is constructed by multiple convolutional layers, pooling
layers and fully-connected layers in a particular order, as
well as their parameter settings. In the proposed algorithm,
CNNs are constructed based on RBs, DBs and pooling layers,
which is motivated by the remarkable success of ResNet [17]
and DenseNet [18], while the fully-connected layers are not
considered in the proposed algorithm. The main reason is
that the fully-connected layers easily cause the over-fitting
phenomenon [40] due to their full-connection nature. To
reduce this phenomenon, other techniques must be adopted,
such as Dropout [41]. However, these techniques will also give
rise to extra parameters that need to be carefully tuned, which
will increase the computational complexity of the proposed
algorithm. The experimental results shown in Section V will
justify that the promising performance of the proposed algo-
rithm can still be achieved without using the fully-connected
layers. The details of initializing the population of AE-CNN
are summarized in Algorithm 2.
Algorithm 2: Initialize Population
Input: The population size N, the training instance
       dimension d × d.
Output: The initialized population P_0.
1  P_0 ← ∅;
2  mp ← Calculate the maximal number of pooling
        layers by log2(d);
3  for i ← 1 to N do
4      k ← Randomly initialize a positive integer;
5      a ← Initialize an empty array with the size of k;
6      for j ← 1 to k do
7          u ← Randomly choose one from {RBU, DBU, PU};
8          if u is a PU and the number of used PU is not
           less than mp then
9              u ← Randomly choose one from {RBU, DBU};
10         end
11         Encode u and put the encoded information into
           the j-th position of a;
12     end
13     P_0 ← P_0 ∪ a;
14 end
15 Return P_0.
Next, we explain the details of lines 8 and 11, because
the other parts of Algorithm 2 are straightforward. Specifically,
the pooling layers in CNNs perform dimension reduction
on their input data, and the most commonly used pooling
operation halves the input size, as can be seen from
state-of-the-art CNNs [2]–[5], [17], [18]. Consequently, the
number of employed pooling layers cannot be arbitrarily
specified, but must follow the constraint calculated in
line 2. For example, if the input size is 32 × 32, the number
of used pooling layers cannot be larger than five (log2(32) = 5),
because five pooling layers will reduce the dimension of the
input data to 1×1, and one extra pooling layer on the dimension
of 1×1 will lead to a logic error.
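The pooling constraint and the random initialization of Algorithm 2 can be sketched as follows. Units are represented only by their type names here, and the upper bound `max_len` on the individual length is an assumption of this sketch (the algorithm itself only asks for a random positive integer).

```python
import math
import random

UNIT_TYPES = ("RBU", "DBU", "PU")

def max_pooling_layers(d):
    """Each pooling layer halves the input, so at most log2(d) fit."""
    return int(math.log2(d))

def init_population(n, d, max_len=10, seed=None):
    """Randomly build n individuals for d x d inputs (unit types only)."""
    rng = random.Random(seed)
    mp = max_pooling_layers(d)
    population = []
    for _ in range(n):
        k = rng.randint(1, max_len)  # max_len is a hypothetical cap
        individual = []
        for _ in range(k):
            u = rng.choice(UNIT_TYPES)
            # Redraw among non-pooling units once the pooling budget is used.
            if u == "PU" and individual.count("PU") >= mp:
                u = rng.choice(("RBU", "DBU"))
            individual.append(u)
        population.append(individual)
    return population
```

With 32 × 32 inputs, `max_pooling_layers(32)` evaluates to 5, so no individual produced by `init_population` contains more than five PUs.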
Encoding enables GAs with the ability to model real-
world problems, and then the problems can be solved by the
GAs directly. The encoding is achieved by the corresponding
encoding strategy which is the first step of employing GAs.
There is not a unified encoding strategy that can be used
for all the problems. In the proposed algorithm, we design a
new encoding strategy aiming at effectively modelling CNNs
with different architectures. For the used RBs, based on the
configuration of state-of-the-art CNNs [17], [42], we set the
filter size of conv2 to 3×3, which is also used for the
convolutional layers in the used DBs. For the used pooling
layers, we set the stride (i.e., the step size) to 2×2 based
on the conventions, which means that such a single pooling
layer in the evolved CNN halves the input dimension once.
Consequently, the unknown parameter settings for RBs
are the spatial sizes of input and output, those for DBs are
the spatial sizes of input and output, as well as k, and that
for pooling layers is only their type, i.e., the max or mean
pooling type. Note that the number of convolutional layers in
a DB is known because it can be derived by the spatial sizes
of input and output as well as k. Accordingly, the proposed
encoding strategy is based on three different types of units
and their positions in the CNNs. The units are the RB Unit
(RBU), the DB Unit (DBU) and the Pooling layer Unit (PU).
Specifically, an RBU and a DBU contain multiple RBs and
DBs, respectively, while a PU is composed of only a single
pooling layer. Our justifications are that: 1) by putting multiple
RBs or DBs into an RBU or a DBU, the depth of the CNN
can be changed significantly compared to stacking RBs or
DBs one by one, which speeds up the heuristic search
of the proposed algorithm by easily changing the depth of
the CNN; and 2) one PU consisting of a single pooling layer
is more flexible than one consisting of multiple pooling layers,
because the effect of multiple consecutive pooling layers can
be achieved by stacking multiple PUs. In addition, we also add
one parameter to represent the unit type for the convenience
of the algorithm implementation. In summary, the encoded
information for an RBU is the type, the number of RBs, the
input spatial size and the output spatial size, which are denoted
as type, amount, in and out, respectively. The encoded
information for a DBU is the same as that of an RBU, plus the
additional parameter k. Only one
parameter is needed in a PU for encoding the pooling type.
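One possible in-memory form of this encoding is sketched below. The field names follow the text (type, amount, in, out, k), with `in_size` and `out_size` used because `in` is a reserved word in Python; the type codes 1, 2 and 3 are the ones the paper gives for Fig. 3. The concrete numbers in `example_individual` are purely illustrative and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class RBU:
    amount: int    # number of RBs in the unit
    in_size: int   # "in" in the text
    out_size: int  # "out" in the text
    type: int = 1

@dataclass
class DBU:
    amount: int    # number of DBs in the unit
    in_size: int
    out_size: int
    k: int         # growth parameter of the DBs
    type: int = 2

@dataclass
class PU:
    pooling: str   # "max" or "mean"
    type: int = 3

Unit = Union[RBU, DBU, PU]

def example_individual() -> List[Unit]:
    """A variable-length individual; all values are made up for illustration."""
    return [
        RBU(amount=3, in_size=3, out_size=64),
        PU(pooling="max"),
        DBU(amount=2, in_size=64, out_size=112, k=12),
    ]
```

Because an individual is just a list of units, its length (and hence the depth of the decoded CNN) is unconstrained by the representation.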
Fig. 3. An example of the proposed encoding strategy.
Fig. 3 shows an example of the proposed encoding strategy
applied to a CNN containing nine units. Specifically, each
number in the upper-left corner of a block denotes the
position of the unit in the CNN. The unit is an RBU, a DBU
or a PU if the type is 1, 2 or 3, respectively. Note that the
proposed encoding strategy does not constrain the maximal
length of each individual, which means that the proposed
algorithm can adaptively find the best CNN architecture with
a proper depth through the designed variable-length encoding.
C. Fitness Evaluation
The fitness of the individuals provides a quantitative mea-
surement indicating how well they adapt to the environment,
and is calculated based on the information these individuals
encode and the task at hand. In AE-CNN, the fitness of
an individual is the classification accuracy based on the
architecture encoded by the individual and the corresponding
validation data. According to the principle of evolutionary
algorithms, an individual with a higher fitness has a higher
probability to generate an offspring hopefully with an even
higher fitness than itself. To evaluate the fitness, each
individual in AE-CNN is decoded to the corresponding CNN,
which is then connected to a classifier and trained like a
common CNN. Typically, the widely used classifiers are
logistic regression for binary classification and softmax
regression for multi-class classification. As formulated by (1), in
AE-CNN, the decoded CNN is trained on the training data, and
the fitness is the best classification accuracy on the validation
data after the CNN training.
Algorithm 3: Evaluate Fitness
Input: The population P_t for fitness evaluation, training
       data D_train, validation data D_valid.
Output: The population P_t with fitness.
1 for each individual in P_t do
2     cnn ← Transform the information encoded in
            individual to a CNN with the corresponding
            architecture;
3     Initialize the weights of cnn;
4     Train cnn on D_train;
5     acc ← Evaluate the classification accuracy of the
            trained cnn on D_valid;
6     Assign acc as the fitness of individual;
7 end
8 Return P_t.
The fitness evaluation of the proposed algorithm is shown in
Algorithm 3, where each individual in the population is evalu-
ated in the same manner. Firstly, the architecture information
encoded in the individual is transformed to a CNN with the
corresponding architecture (line 2), which is an inverse of the
encoding strategy introduced in Subsection III-B. Secondly,
the CNN is initialized with weights (line 3) like a
hand-crafted CNN and then trained on the provided training
data (line 4). Note that the weight initialization method and the
training method are the Xavier initializer [43] and
stochastic gradient descent with momentum, respectively, which are
commonly used in the deep learning community. Thirdly, the
trained CNN is evaluated on the validation data (line 5),
and the evaluated classification accuracy is considered as the
fitness of the individual (line 6).
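The control flow of Algorithm 3 can be sketched with caller-supplied callables. The names `decode`, `train` and `accuracy` are hypothetical placeholders for the decoding, training and validation steps; no particular deep learning library is assumed.

```python
def evaluate_fitness(population, decode, train, accuracy, train_data, valid_data):
    """Return one fitness value (validation accuracy) per individual."""
    fitness = []
    for individual in population:
        cnn = decode(individual)           # line 2: build the network
        trained = train(cnn, train_data)   # lines 3-4: initialise and train
        fitness.append(accuracy(trained, valid_data))  # lines 5-6
    return fitness
```

Since every individual is evaluated independently, the loop body is the natural unit to parallelise or cache if training is expensive.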
D. Offspring Generation
In order to generate a population of offspring, parent individ-
uals need to be chosen in advance. Based on the principle of
evolutionary algorithms, the generated offspring are expected
to have higher fitness than their parents, through inheriting the
quality traits from both parents. To this end, the individuals
having the best fitness should be chosen as the parent individ-
uals. However, adopting the best ones as the parents could
easily cause the loss of diversity in the population, which
in turn leads to the premature convergence [44], [45], and
as a result the best performance of the population cannot be
achieved [46], [47] due to trapping into the local minima [48],
[49]. To address this problem, a general way is to select
promising parents with some randomness. In the proposed
AE-CNN algorithm, the binary tournament selection [50], [51] is
used for this purpose, based on the conventions of the GA
community. The binary tournament selection randomly selects
two individuals from the population, and the one with the higher
fitness is chosen as one parent individual. By repeating this
process, another parent individual is chosen, and then
these two parent individuals perform the crossover operation.
Note that two offspring are generated after each crossover
operation, and N offspring are generated in each generation,
i.e., the crossover operation is performed N/2 times during
each generation, where N stands for the population size.
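A minimal sketch of binary tournament selection follows. Drawing the two contestants without replacement and breaking ties in favour of the first one are assumptions of this sketch; the text does not specify either detail.

```python
import random

def binary_tournament(population, fitness, rng=random):
    """Pick two distinct individuals at random and return the fitter one."""
    i, j = rng.sample(range(len(population)), 2)
    return population[i] if fitness[i] >= fitness[j] else population[j]

def select_parents(population, fitness, rng=random):
    """Run the tournament twice to obtain the two parents for crossover."""
    return (binary_tournament(population, fitness, rng),
            binary_tournament(population, fitness, rng))
```

Because only two individuals compete at a time, weaker individuals still have a chance of becoming parents, which is how the selection preserves diversity.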
In traditional GAs, the crossover operation is performed on
two individuals with the same length, which is biologically
evident. Based on the proposed encoding strategy, individuals
in the proposed algorithm have different lengths, i.e., the
corresponding CNNs are with different depths. In this regard,
the traditional crossover operator cannot be used. However,
the crossover operator is often responsible for the local search ability of
GAs, exploiting the search space for a promising performance.
The performance of the final solution may deteriorate
due to the lack of the crossover operation in GAs. In
the proposed algorithm, we employ the one-point crossover
operator. The reason is that the one-point crossover has been
widely used in Genetic Programming (GP) [31]. GP is another
important class of evolutionary algorithms, and the individuals
in GP are commonly with different lengths. Algorithm 4 shows
the crossover operation in the proposed algorithm.
Fig. 4. The two selected parent individuals for the crossover operation
(Fig. 4a, selected parent individuals) and the generated offspring
(Fig. 4b, generated offspring). The numbers in each block denote the
corresponding configuration, and the red numbers in Fig. 4b denote the
necessary changes after the crossover operation.
Algorithm 4: Crossover Operation of AE-CNN
Input: Two parent individuals, p1 and p2, selected by the binary tournament selection; crossover probability µ.
Output: Two offspring.
r ← Uniformly generate a number from [0, 1];
if r < µ then
    Randomly choose a position from p1 and p2, respectively;
    Separate p1 and p2 based on the chosen positions;
    q1 ← Combine the first part of p1 and the second part of p2;
    q2 ← Combine the first part of p2 and the second part of p1;
else
    q1 ← p1;
    q2 ← p2;
end
Return q1 and q2.
Note that some necessary changes are automatically made to the generated offspring if required. For example, the in of the current unit should be equal to the out of the previous unit, and other cascading adjustments caused by this change are applied as well. For a better understanding of the crossover operation, an example is shown in Fig. 4, where Fig. 4a shows the two parent individuals. Supposing the separation positions of these two parent individuals are the third and fourth units, respectively, Fig. 4b shows the corresponding generated offspring; the red numbers indicate the changes needed after the crossover operation so that each offspring represents a valid CNN.
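The variable-length one-point crossover and the in/out adjustment described above can be sketched as follows. The dictionary encoding of a unit (keys `type`, `amount`, `in`, `out`), the function names, and the requirement that each parent has at least two units are assumptions made for illustration. Only the `in` field is repaired here; any further cascading adjustments are not modelled.

```python
import copy
import random

def repair_channels(units, image_channels=3):
    """Set each unit's `in` to the `out` of the preceding non-pooling unit."""
    channels = image_channels
    for unit in units:
        if unit["type"] == "PU":
            continue  # pooling is assumed to leave the channel count unchanged
        unit["in"] = channels
        channels = unit["out"]
    return units

def one_point_crossover(p1, p2, mu=0.9, rng=random):
    """Sketch of a one-point crossover on parents of different lengths."""
    if rng.random() >= mu:
        # No crossover: the offspring are copies of the parents.
        return copy.deepcopy(p1), copy.deepcopy(p2)
    c1 = rng.randint(1, len(p1) - 1)  # cut point inside p1
    c2 = rng.randint(1, len(p2) - 1)  # cut point inside p2
    q1 = copy.deepcopy(p1[:c1] + p2[c2:])  # first part of p1 + second part of p2
    q2 = copy.deepcopy(p2[:c2] + p1[c1:])  # first part of p2 + second part of p1
    return repair_channels(q1), repair_channels(q2)
```

Because each parent is cut once and the tails are swapped, the two offspring together always contain exactly as many units as the two parents.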
The mutation operation typically performs the global search in GAs, exploring the search space for promising performance. It works on one generated offspring with a predefined probability and the allowed mutation types. The available mutation types are designed based on the proposed encoding strategy. In the proposed algorithm, they are:
- Adding (adding an RBU, adding a DBU, or adding a PU at the selected position);
- Removing (removing the unit at the selected position);
- Modifying (modifying the encoded information of the unit at the selected position).
The mutation operation in the proposed algorithm is detailed in Algorithm 5. Because all the generated offspring use the same routine for the mutation, Algorithm 5 shows only the process for one offspring for simplicity. Note that the offspring is kept unchanged if it is not mutated. In addition, a series of necessary adjustments is also automatically performed based on the logic of composing a valid CNN, as highlighted for the crossover operation. For a better understanding of the mutation, an example of “adding an RBU” is shown in Fig. 5, where Fig. 5a shows the individual selected for the mutation and the randomly initialized RBU, and Fig. 5b shows the mutated individual. The
Algorithm 5: Mutation Operation of AE-CNN
Input: The offspring q1, mutation probability ν.
Output: The mutated offspring.
r ← Uniformly generate a number from [0, 1];
if r < ν then
    Randomly choose a position from q1;
    type ← Randomly select one from {Adding, Removing, Modifying};
    if type is Adding then
        mu ← Randomly select one from {adding an RBU, adding a DBU, adding a PU};
    else if type is Removing then
        mu ← removing a unit;
    else
        mu ← modifying the encoded information;
    end
    Perform mu at the chosen position;
end
Return q1.
red numbers in Fig. 5b likewise indicate the necessary changes after the mutation has been performed. In the proposed crossover
and mutation operations, all these necessary changes are made
(a) The selected individual for mutation and the randomly initialized RBU for the corresponding mutation
(b) Mutated individual
Fig. 5. An example of the “adding an RBU” mutation. Specifically, the first row and the second row in Fig. 5a denote the individual selected for the mutation and the randomly initialized RBU, respectively; the RBU is added at the fourth position of the individual to be mutated. Fig. 5b shows the mutated individual, and the red numbers denote the necessary changes after the mutation.
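The mutation routine of Algorithm 5 can be sketched in Python as follows. The dictionary encoding of a unit, the helper `random_unit`, the candidate `out` values, and the concrete meaning given to "Modifying" (re-drawing `amount`, or toggling the pooling type) are hypothetical choices made for illustration. The subsequent channel adjustments that make the offspring a valid CNN are not modelled here.

```python
import copy
import random

def random_unit(rng):
    """Hypothetical unit initializer; the `out` choices are illustrative only."""
    kind = rng.choice(["RBU", "DBU", "PU"])
    if kind == "PU":
        return {"type": "PU", "pool": rng.choice(["max", "mean"])}
    unit = {"type": kind, "amount": rng.randint(3, 10),
            "in": None, "out": rng.choice([64, 128, 256])}
    if kind == "DBU":
        unit["k"] = rng.choice([12, 20, 40])
    return unit

def mutate(q1, nu=0.2, rng=random):
    """Sketch of Algorithm 5; returns a (possibly) mutated copy of q1."""
    q1 = copy.deepcopy(q1)
    if rng.random() >= nu:
        return q1  # not mutated: the offspring stays the same
    pos = rng.randrange(len(q1))
    kind = rng.choice(["Adding", "Removing", "Modifying"])
    if kind == "Adding":
        q1.insert(pos, random_unit(rng))
    elif kind == "Removing":
        if len(q1) > 1:  # assumption: never remove the last remaining unit
            del q1[pos]
    elif q1[pos]["type"] == "PU":
        q1[pos]["pool"] = "mean" if q1[pos]["pool"] == "max" else "max"
    else:
        q1[pos]["amount"] = rng.randint(3, 10)
    return q1
```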
E. Environmental Selection
In the environmental selection, a population of N individuals is selected from the current population, i.e., P_t ∪ Q_t, to serve as the parent individuals for the next generation. Theoretically, a good population has the characteristics of both convergence and diversity [30], which prevent the search from being trapped in local minima [48], [49] and from premature convergence [44], [45]. In practice, the parent individuals should be composed of individuals with the best fitness for convergence, and individuals whose fitness values differ significantly from each other for diversity. To this end, we purposely select the individual with the best fitness, along with N − 1 individuals selected by binary tournament selection [50], [51], as parent individuals to generate offspring
Algorithm 6: Environmental Selection
Input: The population P_t, the generated offspring population Q_t, the population size N.
Output: The population P_{t+1} surviving into the next generation.
1  P_{t+1} ← ∅;
2  for j ← 1 to N do
3      p1, p2 ← Randomly select two individuals from P_t ∪ Q_t;
4      p ← Select the one with the higher fitness from {p1, p2};
5      P_{t+1} ← P_{t+1} ∪ p;
6  end
7  p_best ← Select the one with the highest fitness from P_t ∪ Q_t;
8  if p_best is not in P_{t+1} then
9      Randomly select one from P_{t+1} and then replace it by p_best;
10 end
11 Return P_{t+1}.
for the new population. Explicitly selecting the best one as a parent for the next generation is an implementation of the “elitism” mechanism [52] in GAs, which prevents the performance of the population from degrading as the evolution progresses.
Algorithm 6 shows the details of the environmental selection in the proposed algorithm. Specifically, given the current population P_t and the generated offspring population Q_t, N individuals are selected by binary tournament selection, as shown in lines 2-6. After that, the best individual p_best (i.e., the individual with the highest fitness) is selected from P_t ∪ Q_t (line 7), and it is checked whether p_best has already been selected into P_{t+1}. If it has not, a randomly chosen individual in P_{t+1} is replaced by p_best (lines 8-10). Note that the offspring in Q_t must have had their fitness evaluated prior to the environmental selection, because binary tournament selection relies on the fitness.
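The environmental selection described above can be sketched in Python as follows. The function name, the callable `fitness`, and the use of plain lists for the populations are assumptions made for illustration. Individuals are assumed to support equality comparison so that membership of the best individual can be checked.

```python
import random

def environmental_selection(parents, offspring, fitness, n, rng=random):
    """N binary tournaments over the union of parents and offspring, then elitism."""
    pool = parents + offspring
    survivors = []
    for _ in range(n):
        p1, p2 = rng.sample(pool, 2)  # two distinct candidates
        survivors.append(p1 if fitness(p1) >= fitness(p2) else p2)
    best = max(pool, key=fitness)
    if best not in survivors:
        # Elitism: overwrite a random survivor with the best individual.
        survivors[rng.randrange(n)] = best
    return survivors
```

The final replacement step is what guarantees that the best fitness in the population never decreases from one generation to the next.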
The experiment is designed to verify whether the proposed automatic CNN architecture design algorithm can achieve promising performance on image classification tasks. In this section, we first introduce the chosen peer competitors (Subsection IV-A) against which the performance of the proposed algorithm is compared, and then describe the adopted benchmark datasets (Subsection IV-B) and the parameter settings (Subsection IV-C).
A. Peer Competitors
In order to demonstrate the superiority of the proposed algorithm, various peer competitors are chosen for the comparison. In particular, the chosen peer competitors can be divided into three categories.
The first includes the state-of-the-art CNNs whose architectures are hand-crafted with extensive domain expertise: DenseNet [18], ResNet [17], Maxout [53], VGG [54], Network in Network [55], Highway Network [56], All-CNN [57] and FractalNet [58]. In addition, considering the promising performance of ResNet, we use two different versions in the experiment, namely the ResNet with 101 layers and the ResNet with 1,202 layers, which are labelled ResNet (depth=101) and ResNet (depth=1,202), respectively. Owing to their promising performance, most peer competitors in this category have been champions of the large-scale vision challenge [59] in recent years. The intention of choosing these state-of-the-art CNNs is to verify whether the proposed automatic CNN architecture design algorithm can show performance competitive with the hand-crafted CNNs. The second covers the semi-automatic CNN architecture design algorithms, including Genetic CNN [19], Hierarchical Evolution [20], EAS [21], and Block-QNN-S [22]. The third refers to Large-scale Evolution [23], CGP-CNN [24], NAS [25], and MetaQNN [26], which design CNN architectures in a completely automatic way.
B. Benchmark Datasets
(a) CIFAR10
(b) CIFAR100
Fig. 6. Randomly selected examples from three categories each of CIFAR10 (shown in Fig. 6a) and CIFAR100 (shown in Fig. 6b); each category has 10 examples.
CNNs are typically compared through their classification performance on image classification tasks. For the state-of-the-art CNNs, the most commonly used image classification benchmark datasets are CIFAR10 and CIFAR100 [29], while for the CNN architecture design algorithms, the widely used benchmark dataset is only CIFAR10, because CIFAR100 is much more challenging due to its large number of classes. Considering that the adopted peer competitors cover both the state-of-the-art CNNs and the architecture design algorithms, both CIFAR10 and CIFAR100 are chosen as the benchmark datasets in the experiment.
CIFAR10 and CIFAR100 are two widely used image classification benchmark datasets for recognizing natural objects, such as birds, boats and airplanes. Each set has 50,000 training images and 10,000 test images. The difference between them is that CIFAR10 is a 10-class classification task while CIFAR100 is a 100-class one. Within each benchmark, every class has the same number of training images, i.e., each category of CIFAR10 has 5,000 training images, while each category of CIFAR100 has 500 training images.
Fig. 6 illustrates images from each benchmark for reference, where the images in each row come from the same class, and the words in the left column give the corresponding class names. As can be seen from Fig. 6, the objects to be recognized appear at different resolutions, blend with the background and occupy different positions, which generally increases the difficulty of correctly recognizing them. Following the conventions of the chosen peer competitors [17]–[26], CIFAR10 and CIFAR100 are augmented by padding four zeros to each side of an image, randomly cropping it to the original size, and then applying a random horizontal flip, prior to being input to the proposed algorithm.
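The augmentation described above (zero padding of four pixels per side, a random crop back to the original size, and a random horizontal flip) can be sketched without external dependencies as follows. The nested-list image representation, the function name, and the flip probability of 0.5 are assumptions made for illustration; in practice a tensor library would normally be used instead.

```python
import random

def augment(image, pad=4, fill=0, rng=random):
    """image: list of rows, each row a list of pixels. Returns an augmented copy."""
    h, w = len(image), len(image[0])
    blank_row = [fill] * (w + 2 * pad)
    # Zero-pad `pad` pixels on every side.
    padded = [list(blank_row) for _ in range(pad)]
    padded += [[fill] * pad + list(row) + [fill] * pad for row in image]
    padded += [list(blank_row) for _ in range(pad)]
    # Randomly crop back to the original h x w size.
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    # Assumption: the horizontal flip is applied with probability 0.5.
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]
    return crop
```

For a 32×32 CIFAR image the padded image is 40×40, so the crop offset in each direction ranges over nine positions (0 to 8).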
C. Parameter Settings
In the comparison, we take the results of the peer competitors as reported in their seminal papers rather than running them ourselves, because the reported results are usually the best. In doing so, there is no need to set the parameters of the peer competitors. For the proposed algorithm, we follow the principle that all the parameters are set to their commonly used values, to lower the barrier for researchers who would like to use the proposed algorithm to find the best CNN architectures for their own data, even if they have no expertise in GAs. In particular, the population size and the maximal generation number are both set to 20, and the probabilities of crossover and mutation are set to 0.9 and 0.2, respectively. Following the conventions of the machine learning community, the validation data is randomly split from the training data with a proportion of 1/5. Finally, all the classification error rates are evaluated on the same test data for the comparison.
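As a small illustration of the 1/5 validation split mentioned above, the following sketch shuffles the training indices and holds out one fifth of them. The function name and the index-based formulation are assumptions made for illustration.

```python
import random

def split_train_validation(n_samples, parts=5, rng=random):
    """Randomly hold out 1/parts of the sample indices for validation."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_val = n_samples // parts
    return indices[n_val:], indices[:n_val]

# For the 50,000 CIFAR training images this gives 40,000 / 10,000 indices.
train_idx, val_idx = split_train_validation(50000, rng=random.Random(0))
```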
In evaluating the fitness, each individual is trained by Stochastic Gradient Descent (SGD) with a batch size of 128. The parameter settings for SGD also follow the conventions of the peer competitors. Specifically, the momentum is set to 0.9. The learning rate is initialized to 0.01, with a warm-up setting of 0.1 from the second to the 150th epoch, and is divided by 10 at the 250th epoch. The weight decay is set to 5×10^-4. In addition, the fitness of an individual is set to zero if it runs out of memory during training. When the evolutionary process terminates, the best individual is retrained on the original training data with the same SGD settings, and the error rate on the test data is reported for the comparison. Considering the heuristic nature of the proposed algorithm as well as the expensive computational cost, the best individual is trained in five independent runs. Because all the peer competitors chosen for the comparison report only their best results regardless of how many runs they performed, the best result of the proposed algorithm among the five independent runs is presented here for a fair comparison.
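The SGD settings above can be written down as a small sketch. The text does not state the learning rate between the 151st and the 249th epoch; the function below assumes that the rate returns to its initial value of 0.01 after the warm-up and is divided by 10 from the 250th epoch on. This reading, the 1-indexed epochs, and all names are assumptions made for illustration.

```python
# Settings stated in the text.
SGD_SETTINGS = {"batch_size": 128, "momentum": 0.9, "weight_decay": 5e-4}

def learning_rate(epoch):
    """Learning rate for a 1-indexed epoch under one reading of the schedule."""
    if epoch < 1:
        raise ValueError("epochs are counted from 1")
    if epoch == 1:
        return 0.01   # initial learning rate
    if epoch <= 150:
        return 0.1    # warm-up setting, second to 150th epoch
    if epoch < 250:
        return 0.01   # assumption: back to the initial rate after warm-up
    return 0.001      # divided by 10 at the 250th epoch
```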
In addition, the available choices of k in a DB are 12, 20 and 40 based on the design of DenseNet, and the maximal number of convolutional layers in a DB is specified as 10 (when k = 12 and k = 20) and 5 (when k = 40). The maximal numbers of RBUs and DBUs in a CNN are both set to 4. The numbers of RBs in an RBU and DBs in a DBU are both set from 3 to 10. Note that these settings are mainly based on our available computational resources, because any number beyond these settings easily runs out of memory. If the user's computational platform is equipped with more powerful GPUs, these numbers can be set arbitrarily. The proposed algorithm is run on three Nvidia GeForce GTX 1080 Ti GPU cards, and the code is implemented on a GPU-based parallel framework designed in our previous work and written in PyTorch [60]. The codes are made available

Table I. Classification error (%) on CIFAR10 and CIFAR100, number of parameters, and consumed GPU Days.
--------------------------------------------------------------------------------------------------------------
Peer competitor               CIFAR10   CIFAR100   # of Parameters   GPU Days   Category
--------------------------------------------------------------------------------------------------------------
DenseNet (k=12) [18]          5.24      24.42      1.0M              –          hand-crafted architecture
ResNet (depth=101) [17]       6.43      25.16      1.7M              –          hand-crafted architecture
ResNet (depth=1,202) [17]     7.93      27.82      10.2M             –          hand-crafted architecture
Maxout [53]                   9.3       38.6       –                 –          hand-crafted architecture
VGG [54]                      6.66      28.05      20.04M            –          hand-crafted architecture
Network in Network [55]       8.81      35.68      –                 –          hand-crafted architecture
Highway Network [56]          7.72      32.39      –                 –          hand-crafted architecture
All-CNN [57]                  7.25      33.71      –                 –          hand-crafted architecture
FractalNet [58]               5.22      22.3       38.6M             –          hand-crafted architecture
--------------------------------------------------------------------------------------------------------------
Genetic CNN [19]              7.1       29.05      –                 17         semi-automatic algorithm
Hierarchical Evolution [20]   3.63      –          –                 300        semi-automatic algorithm
EAS [21]                      4.23      –          23.4M             10         semi-automatic algorithm
Block-QNN-S [22]              4.38      20.65      6.1M              90         semi-automatic algorithm
--------------------------------------------------------------------------------------------------------------
Large-scale Evolution [23]    5.4       –          5.4M              2,750      completely automatic algorithm
Large-scale Evolution [23]    –         23         40.4M             2,750      completely automatic algorithm
CGP-CNN [24]                  5.98      –          2.64M             27         completely automatic algorithm
NAS [25]                      6.01      –          2.5M              22,400     completely automatic algorithm
MetaQNN [26]                  6.92      27.14      –                 100        completely automatic algorithm
--------------------------------------------------------------------------------------------------------------
AE-CNN                        4.3       –          2.0M              27         completely automatic algorithm
AE-CNN                        –         20.85      5.4M              36         completely automatic algorithm
--------------------------------------------------------------------------------------------------------------
In the experiments, we investigate the performance of the proposed algorithm in terms of not only the classification error, but also the number of parameters and the computational complexity, for a comprehensive comparison with the chosen peer competitors (shown in Subsection V-A). Because it is hard to theoretically analyze the computational complexity of each peer competitor, the consumed “GPU Days” is used as an indicator of the computational complexity. Specifically, the number of GPU Days is calculated by multiplying the number of employed GPU cards by the number of days the algorithm ran to find the best architecture. For example, the proposed algorithm ran for nine days on three GPU cards for the CIFAR10 dataset; the corresponding GPU Days is therefore 27, i.e., nine (days) multiplied by three (GPU cards). Naturally, the hand-crafted state-of-the-art CNNs have no data regarding “GPU Days.” In addition, we also provide the evolutionary trajectories of the proposed algorithm in finding the best architectures on the chosen benchmark datasets, which helps readers see whether the proposed algorithm converges with the adopted parameter settings (shown in Subsection V-B). Finally, the best architectures found are provided in Subsection V-C, which may provide useful knowledge to researchers in hand-crafting CNN architectures.
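The GPU Days indicator defined above is a simple product, and the CIFAR10 example from the text can be reproduced with a short sketch. The function name is an assumption made for illustration.

```python
def gpu_days(num_gpu_cards, days):
    """GPU Days = number of employed GPU cards x days the search ran."""
    return num_gpu_cards * days

# Example from the text: nine days on three GPU cards for CIFAR10.
cifar10_gpu_days = gpu_days(3, 9)
```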
A. Performance Overview
Table I shows the experimental results of the proposed algorithm as well as the chosen peer competitors. To ease the comparison, Table I is divided into five “rows” by six horizontal lines. The first denotes the title of each column; the second, third and fourth rows refer to the state-of-the-art peer competitors whose architectures are manually designed, the semi-automatic CNN architecture design algorithms and the automatic ones, respectively. The fifth row shows the results of the proposed algorithm, which is an automatic algorithm for designing CNN architectures. In addition, the symbol “–” in Table I implies that no result is publicly reported by the corresponding peer competitor.
As shown in Table I, AE-CNN outperforms all the manually designed state-of-the-art peer competitors on CIFAR10. Specifically, AE-CNN achieves a classification error approximately 1.0% lower than DenseNet (k=12) and FractalNet, 2.1% lower than ResNet (depth=101), VGG and All-CNN, 3.5% lower than ResNet (depth=1,202) and Highway Network, and even 5.0% lower than Maxout and Network in Network. On CIFAR100, AE-CNN shows a significantly lower classification error than Maxout, Network in Network, Highway Network and All-CNN, a slightly lower classification error than DenseNet (k=12), ResNet (depth=101), ResNet (depth=1,202) and VGG, and a performance similar to but still better than that of FractalNet. The numbers of parameters of the CNNs evolved by AE-CNN on both CIFAR10 and CIFAR100 are larger than those of DenseNet (k=12) and ResNet (depth=101), but much smaller than those of ResNet (depth=1,202), VGG and FractalNet.
Compared with the semi-automatic peer competitors, AE-CNN performs much better than Genetic CNN on both CIFAR10 and CIFAR100. Although Hierarchical Evolution shows better performance than AE-CNN on CIFAR10, AE-CNN consumes only about 1/10 of the GPU Days consumed by Hierarchical Evolution on CIFAR10. Block-QNN-S shows slightly worse performance on CIFAR10 but slightly better performance on CIFAR100 compared with AE-CNN, while AE-CNN consumes about 1/3 of the GPU Days consumed by Block-QNN-S, and the best CNN found by AE-CNN also has fewer parameters than that of Block-QNN-S. In addition, EAS and AE-CNN achieve nearly the same classification error on CIFAR10, while the best CNN evolved by AE-CNN has only 2.0M parameters, which is about 1/11 of that of EAS. In summary, compared with the semi-automatic peer competitors, AE-CNN shows competitive performance with significantly fewer parameters. It is important to note that domain expertise is still required when using the algorithms from this category. For example, EAS consumes only 10 GPU Days for the best CNN on CIFAR10, but it starts from a base CNN that is already known to perform fairly well. Therefore, the comparison in terms of the consumed GPU Days is not fair to the proposed AE-CNN algorithm, which is completely automatic and uses no human expertise or extra resources.
Among the automatic peer competitors, on both the CIFAR10 and the CIFAR100 datasets, AE-CNN shows the best performance in terms of the classification error, the number of parameters and the consumed GPU Days. Specifically, AE-CNN achieves a 4.3% classification error on CIFAR10, while the best and worst classification errors of the peer competitors are 5.4% and 6.92%, respectively. In addition, AE-CNN also shows a lower classification error than MetaQNN. On CIFAR100, AE-CNN shows a 2.15% lower classification error than Large-scale Evolution, and has 5.4M parameters, far fewer than Large-scale Evolution (40.4M). Furthermore, AE-CNN also consumes far fewer GPU Days than Large-scale Evolution, NAS and MetaQNN on both CIFAR10 and CIFAR100. The comparison shows that the proposed algorithm achieves the best performance among the automatic peer competitors, the category to which it belongs.
The reasons why AE-CNN outperforms Large-scale Evolution, CGP-CNN, NAS and MetaQNN can be explained as follows. Firstly, Large-scale Evolution does not apply the crossover operator, which provides the local search ability; its GA-based design consequently suffers in performance. Secondly, CGP-CNN employs a fixed-length encoding strategy to design the best CNN architecture. In order to make the encoding strategy work, CGP-CNN must predefine a maximal length of CNNs during the architecture design. As can be seen from [24], the predefined maximal length of CGP-CNN is smaller than the length of the best architecture identified by AE-CNN. Thirdly, NAS and MetaQNN are designed based on reinforcement learning. Because the fitness value is not computed when reinforcement learning methods are used, these methods often consume more computational resources than GAs do for the same performance [7]. As expected, NAS and MetaQNN perform worse than AE-CNN given the available computational resources.
B. Evolution Trajectory
When evolutionary algorithms are used to address real-world problems, we usually like to know whether they have converged when they terminate. A common way to observe this is to plot the evolutionary trajectories. In this subsection, the evolutionary trajectories of the proposed algorithm on the investigated benchmark datasets are provided and analyzed. To achieve this, we first collect the classification accuracy of each individual in every generation, and then plot the statistical results.
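The statistics behind such a trajectory plot (best, mean and worst classification accuracy per generation) can be computed with a short sketch like the one below. The function name and the input layout (one list of accuracies per generation) are assumptions made for illustration, and the numbers in the usage line are made-up toy values, not results from the experiments.

```python
from statistics import mean

def trajectory(accuracy_by_generation):
    """Summarize each generation by its best, mean and worst accuracy."""
    return [
        {"generation": g, "best": max(accs), "mean": mean(accs), "worst": min(accs)}
        for g, accs in enumerate(accuracy_by_generation, start=1)
    ]

# Toy input: an accuracy of 0.0 stands for an out-of-memory individual.
toy = trajectory([[0.0, 0.3, 0.6], [0.4, 0.5, 0.6]])
```

Plotting `mean` as a line and filling the band between `worst` and `best` gives the kind of figure described in this subsection.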
(a) Evolution trajectory of CIFAR10
(b) Evolution trajectory of CIFAR100
Fig. 7. Evolution trajectories of the proposed algorithm in CIFAR10 (shown
in Fig. 7a) and CIFAR100 (shown in Fig. 7b).
The evolutionary trajectories of the proposed algorithm are shown in Fig. 7, where Figs. 7a and 7b show those on CIFAR10 and CIFAR100, respectively. In Fig. 7, the horizontal axis denotes the generation number, and the vertical axis denotes the classification accuracy; the red line denotes the mean classification accuracy of the individuals in the same generation, while the light-green area is bounded by the best and worst classification accuracy of the individuals in each generation.
As can be seen from Fig. 7a, the mean classification accuracy increases sharply from the first generation to the third generation, and then improves steadily as the evolution proceeds until the 14th generation; from then on, the mean classification accuracy increases significantly from about 75% to about 95%, and finally the proposed algorithm converges when it terminates. As can be seen from the lower boundary of the light-green area, the worst classification accuracy in the first two generations is zero, because some randomly initialized architectures cannot run on the employed GPUs due to the out-of-memory problem. From the third generation, the individuals with out-of-memory architectures are eliminated from the population due to their uncompetitive fitness, and the worst classification accuracy steadily improves until the algorithm terminates, although there is an exception at the fourth generation. As can be seen from the upper boundary of the light-green area, the best classification accuracy improves in almost the same way as the mean classification accuracy as the evolutionary process continues. In addition, the difference between the best and the worst classification accuracy also becomes smaller, which implies that the population converges to a steady state.
A similar situation can also be seen in Fig. 7b. Specifically, the mean classification accuracy increases from about 30% to about 45% from the first generation to the fourth generation, although there is a slight drop at the third generation; from the fourth generation, the mean classification accuracy keeps improving until the 14th generation, and then increases from about 50% at the 14th generation to about 79% at the 17th generation; after that, the mean classification accuracy stays converged until the evolutionary process terminates. During the first three generations, the worst classification accuracy stays at zero because of the randomly initialized out-of-memory individuals; from the fourth generation, the worst classification accuracy improves until the 20th generation, with exceptions at the 10th and 15th generations. As can be seen from the trajectory of the best classification accuracy, it improves with almost the same trend as the mean classification accuracy, and also reaches its converged performance from the 17th generation.
A common trend can be seen in both Figs. 7a and 7b: the best classification accuracy (i.e., the upper boundaries of the light-green areas) never degrades, which is achieved through the elitism detailed in Subsection III-E, i.e., the individual with the best fitness is unconditionally kept for the next generation. In summary, the proposed algorithm converges with the default GA parameter settings, which could help users employ the proposed algorithm to find the best CNN architectures for their own data, even if they have no expertise in GAs. However, the maximal generation number and the population size can be set to larger values if more computational resources are available.
C. Designed CNN Architectures
In this subsection, the best CNN architectures found by the
proposed algorithm on CIFAR10 and CIFAR100 are provided
in Tables II and III, respectively.
Table II. The best architecture found on CIFAR10
id   type   configuration
1    RBU    amount=8, in=3, out=64
2    PU     mean pooling
3    RBU    amount=5, in=64, out=128
4    PU     mean pooling
5    RBU    amount=7, in=128, out=64
6    DBU    amount=7, in=64, out=204, k=20
7    DBU    amount=7, in=204, out=204, k=20
8    PU     mean pooling
9    PU     max pooling

Table III. The best architecture found on CIFAR100
id   type   configuration
1    DBU    amount=10, in=3, out=203, k=20
2    PU     max pooling
3    PU     mean pooling
4    RBU    amount=7, in=203, out=256
5    PU     mean pooling
6    PU     mean pooling
As can be seen from Tables II and III, the best architecture on CIFAR10 is composed of nine units designed with the encoding strategy proposed in Subsection III-B, and has 38 layers altogether, consisting of 34 convolutional layers and four pooling layers; the best architecture on CIFAR100 is composed of six units comprising 21 layers, namely 17 convolutional layers and four pooling layers.
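The layer counts quoted above can be checked against Tables II and III with a short sketch. The counting rule used here, namely that each block in an RBU or DBU contributes one convolutional layer (so a unit contributes `amount` layers) and each PU contributes one pooling layer, is an inference that reproduces the totals stated in the text; it is not a rule quoted from the paper. The tuple encoding and names are assumptions made for illustration.

```python
# (type, amount) per unit, transcribed from Tables II and III; PUs have no amount.
TABLE_II = [("RBU", 8), ("PU", 0), ("RBU", 5), ("PU", 0), ("RBU", 7),
            ("DBU", 7), ("DBU", 7), ("PU", 0), ("PU", 0)]
TABLE_III = [("DBU", 10), ("PU", 0), ("PU", 0), ("RBU", 7), ("PU", 0), ("PU", 0)]

def count_layers(units):
    """Return (convolutional layers, pooling layers) under the assumed rule."""
    conv = sum(amount for kind, amount in units if kind in ("RBU", "DBU"))
    pool = sum(1 for kind, _ in units if kind == "PU")
    return conv, pool

cifar10_layers = count_layers(TABLE_II)
cifar100_layers = count_layers(TABLE_III)
```

Under this rule the CIFAR10 architecture has 34 + 4 = 38 layers and the CIFAR100 architecture has 17 + 4 = 21 layers, in agreement with the totals given in the text.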
Compared with the state-of-the-art CNNs that are built solely on DenseNet blocks or ResNet blocks, the automatically found architectures based on both blocks are much simpler and perform much better. This may serve as a priori knowledge for hand-crafting CNN architectures: combining the two kinds of blocks may be more effective. In addition, CIFAR100 is commonly viewed as a more complex benchmark than CIFAR10, and researchers usually consider CNN architectures with more layers for CIFAR100 than for CIFAR10. However, based on the architectures shown in Tables II and III, the best architecture found for CIFAR100 surprisingly has fewer layers than that found for CIFAR10. In this sense, finding the best architecture through evolutionary search may also provide useful domain knowledge that runs contrary to common sense.
The goal of this study is to develop a CNN architecture design algorithm using GAs that is capable of designing/searching/learning/evolving the best CNN architecture for a given task in a completely automatic manner and with limited computational resources. This goal has been achieved through the proposed encoding strategy, which is built on state-of-the-art blocks with a variable-length representation, a crossover operator for the variable-length individuals, and the corresponding mutation operators. Building upon the blocks speeds up the CNN architecture design. The variable length of individuals makes it possible to adaptively evolve a proper CNN depth for tasks of different complexity. The presented crossover operator and the designed mutation operators provide the proposed algorithm with effective local search and global search abilities, which in turn help it find the best CNN architectures. The proposed algorithm is examined on the CIFAR10 and CIFAR100 image classification datasets, against nine manually designed state-of-the-art CNNs, four peer competitors that design CNN architectures semi-automatically and five peer competitors that design CNN architectures completely automatically. The results show that the proposed algorithm outperforms all the hand-crafted state-of-the-art CNNs and all the peer competitors from the automatic category in terms of the classification error rate. In addition, the proposed algorithm consumes far fewer GPU Days than the peer competitors in the same category. Furthermore, the proposed algorithm shows competitive performance against the semi-automatic peer competitors. Our future work will focus on effectively speeding up the fitness evaluation.
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, pp. 436–444, 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in Neural Infor-
mation Processing Systems 25, Lake Tahoe, Nevada, USA., 2012, pp.
[3] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran,
“Deep convolutional neural networks for LVCSR,” in Proceeding of
2013 IEEE International Conference on Acoustics, Speech and Signal
Processing. Vancouver, Canada: IEEE, 2013, pp. 8614–8618.
[4] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
with neural networks,” in Advances in Neural Information Processing
Systems 27, Montreal, Canada, 2014, pp. 3104–3112.
[5] C. Clark and A. Storkey, “Training deep convolutional neural networks
to play go,” in 32nd International Conference on Machine Learning
(ICML 2015), Lille, France, 2015, pp. 1766–1774.
[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, 1998.
[7] Y. Sun, B. Xue, and M. Zhang, “Evolving deep convolutional neural
networks for image classification,” arXiv preprint arXiv:1710.10741,
[8] Y. Sun, G. G. Yen, and Z. Yi, “Evolving unsupervised deep neural net-
works for learning meaningful representations,” IEEE Transactions on
Evolutionary Computation, 2018, DOI:10.1109/TEVC.2018.2808689.
[9] M. Dorigo, V. Maniezzo, and A. Colorni, “Ant system: optimization by
a colony of cooperating agents,” IEEE Transactions on Systems, Man,
and Cybernetics, Part B (Cybernetics), vol. 26, no. 1, pp. 29–41, 1996.
[10] J. Bergstra and Y. Bengio, “Random search for hyper-parameter opti-
mization,” Journal of Machine Learning Research, vol. 13, no. Feb, pp.
281–305, 2012.
[11] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine
learning. MIT press Cambridge, 2006, vol. 1.
[12] J. Močkus, “On bayesian methods for seeking the extremum,” in
Optimization Techniques IFIP Technical Conference. Springer, 1975,
pp. 400–404.
[13] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms
for hyper-parameter optimization,” in Advances in Neural Information
Processing Systems 24, Granada, Spain, 2011, pp. 2546–2554.
[14] F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based
optimization for general algorithm configuration.” LION, vol. 5, pp. 507–
523, 2011.
[15] K. O. Stanley and R. Miikkulainen, “Evolving neural networks through
augmenting topologies,” Evolutionary Computation, vol. 10, no. 2, pp.
99–127, 2002.
[16] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “An experimental study on
hyper-parameter optimization for stacked auto-encoders,” in Proceedings
of 2018 IEEE Congress on Evolutionary Computation. IEEE, 2018, pp.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of 2016 IEEE Conference on Computer
Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770–
[18] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely
connected convolutional networks,” in Proceedings of 2017 IEEE Con-
ference on Computer Vision and Pattern Recognition, Honolulu, HI,
USA, 2017, pp. 2261–2269.
[19] L. Xie and A. Yuille, “Genetic CNN,” in Proceedings of 2017 IEEE
International Conference on Computer Vision, Venice, Italy, 2017, pp.
[20] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu,
“Hierarchical representations for efficient architecture search,” in Pro-
ceedings of 2018 Machine Learning Research, Stockholm, Sweden,
[21] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang, “Efficient architecture
search by network transformation,” in Proceedings of the 2018 AAAI
Conference on Artificial Intelligence, Louisiana, USA, 2018.
[22] Z. Zhong, J. Yan, and C.-L. Liu, “Practical network blocks design with
q-learning,” in Proceedings of the 2018 AAAI Conference on Artificial
Intelligence, Louisiana, USA, 2018.
[23] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and
A. Kurakin, “Large-scale evolution of image classifiers,” in Proceedings
of 2017 Machine Learning Research, Sydney, Australia, 2017, pp. 2902–
[24] M. Suganuma, S. Shirakawa, and T. Nagao, “A genetic programming
approach to designing convolutional neural network architectures,” in
Proceedings of the 2017 Genetic and Evolutionary Computation Con-
ference. Berlin, Germany: ACM, 2017, pp. 497–504.
[25] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement
learning,” in Proceedings of the 2017 International Conference on
Learning Representations, Toulon, France, 2017.
[26] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network
architectures using reinforcement learning,” in Proceedings of the 2017
International Conference on Learning Representations, Toulon, France,
[27] T. Back, Evolutionary Algorithms in Theory and Practice: Evolution
Strategies, Evolutionary Programming, Genetic Algorithms. England,
UK: Oxford university press, 1996.
[28] R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction.
MIT press Cambridge, 1998, vol. 1.
[29] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from
tiny images,” online:, 2009.
[30] J. H. Holland, Adaptation in natural and artificial systems: an
introductory analysis with applications to biology, control, and artificial
intelligence. MIT Press Cambridge, MA, USA, 1975.
[31] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic
programming: an introduction. Morgan Kaufmann San Francisco, 1998,
vol. 1.
[32] C. Janis, “The evolutionary strategy of the equidae and the origins of
rumen and cecal digestion,” Evolution, vol. 30, no. 4, pp. 757–774, 1976.
[33] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist
multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on
Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
[34] Y. Sun, G. G. Yen, and Z. Yi, “IGD indicator-based evolutionary algorithm
for many-objective optimization problems,” IEEE Transactions on
Evolutionary Computation, 2018, DOI:10.1109/TEVC.2018.2791283.
[35] M. Mitchell, An introduction to genetic algorithms. Cambridge,
Massachusetts, USA: MIT press, 1998.
[36] L. M. Schmitt, “Theory of genetic algorithms,” Theoretical Computer
Science, vol. 259, no. 1-2, pp. 1–61, 2001.
[37] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep
networks,” in Advances in Neural Information Processing Systems 28:
29th Annual Conference on Neural Information Processing Systems
2015, Montreal, Canada, 2015, pp. 2377–2385.
[38] A. E. Orhan and X. Pitkow, “Skip connections eliminate singularities,” in
Proceedings of 2018 Machine Learning Research, Stockholm, Sweden,
[39] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[40] G. C. Cawley and N. L. Talbot, “On over-fitting in model selection
and subsequent selection bias in performance evaluation,” Journal of
Machine Learning Research, vol. 11, no. Jul, pp. 2079–2107, 2010.
[41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-
dinov, “Dropout: a simple way to prevent neural networks from over-
fitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp.
1929–1958, 2014.
[42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions,”
in Proceedings of 2015 IEEE Conference on Computer Vision and
Pattern Recognition, Boston, MA, USA, 2015, pp. 1–9.
[43] X. Glorot and Y. Bengio, “Understanding the difficulty of training
deep feedforward neural networks,” in Proceedings of the thirteenth
international conference on artificial intelligence and statistics, 2010,
pp. 249–256.
[44] Y. Leung, Y. Gao, and Z.-B. Xu, “Degree of population diversity-a
perspective on premature convergence in genetic algorithms and its
markov chain analysis,” IEEE Transactions on Neural Networks, vol. 8,
no. 5, pp. 1165–1176, 1997.
[45] Z. Michalewicz and S. J. Hartley, “Genetic algorithms+ data structures=
evolution programs,” Mathematical Intelligencer, vol. 18, no. 3, p. 71,
[46] Y. Sun, G. G. Yen, and Z. Yi, “Improved regularity model-based EDA
for many-objective optimization,” IEEE Transactions on Evolutionary
Computation, 2018, DOI:10.1109/TEVC.2018.2794319.
[47] ——, “Reference line-based estimation of distribution algorithm for
many-objective optimization,” Knowledge-Based Systems, vol. 132, pp.
129–143, 2017.
[48] L. Davis, Handbook of genetic algorithms. New York: Van Nostrand
Reinhold, 1991.
[49] D. E. Goldberg and J. H. Holland, “Genetic algorithms and machine
learning,” Machine Learning, vol. 3, no. 2, pp. 95–99, 1988.
[50] B. L. Miller, D. E. Goldberg et al., “Genetic algorithms, tournament
selection, and the effects of noise,” Complex systems, vol. 9, no. 3, pp.
193–212, 1995.
[51] G. Zhang, Y. Gu, L. Hu, and W. Jin, “A novel genetic algorithm and its
application to digital filter design,” in Proceedings of 2003 Intelligent
Transportation Systems, vol. 2. IEEE, 2003, pp. 1600–1605.
[52] J. Vasconcelos, J. A. Ramirez, R. Takahashi, and R. Saldanha, “Improve-
ments in genetic algorithms,” IEEE Transactions on Magnetics, vol. 37,
no. 5, pp. 3414–3417, 2001.
[53] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Ben-
gio, “Maxout networks,” in Proceedings of the 30th International
Conference on Machine Learning, Atlanta, Georgia, USA, Jun 2013,
pp. 1319–1327.
[54] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” in 32nd International Conference on
Machine Learning (ICML 2015), Lille, France, 2015.
[55] M. Lin, Q. Chen, and S. Yan, “Network in network,” in Proceedings of
the 2014 International Conference on Learning Representations, Banff,
Canada, 2014.
[56] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,”
in Proceedings of the 2015 International Conference on Learning
Representations Workshop, San Diego, CA, 2015.
[57] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving
for simplicity: the all convolutional net,” in Proceedings of the 2015
International Conference on Learning Representations, San Diego, CA,
[58] G. Larsson, M. Maire, and G. Shakhnarovich, “Fractalnet: Ultra-
deep neural networks without residuals,” The 5th International
Conference on Learning Representations, 2016. [Online]. Available:
[59] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual
recognition challenge,” International Journal of Computer Vision, vol.
115, no. 3, pp. 211–252, 2015.
[60] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,
A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in
pytorch,” 2017. [Online]. Available:
Yanan Sun (S’15-M’18) received a Ph.D. degree in
engineering from the Sichuan University, Chengdu,
China, in 2017. He is currently a Professor (research)
in the College of Computer Science, Sichuan University,
China. Prior to that, he was a Research Fellow
in the School of Engineering and Computer Science,
Victoria University of Wellington, Wellington, New
Zealand. Dr. Sun’s research topics are evolutionary
algorithms, deep learning, and evolutionary deep
learning. He is the leading organizer of the First
Workshop on Evolutionary Deep Learning, the lead-
ing organizer of the Special Session on Evolutionary Deep Learning and
Applications in CEC19, and the founding chair of the IEEE CIS Task Force
on Evolutionary Deep Learning and Applications.
Bing Xue (M’10) received the B.Sc. degree from
the Henan University of Economics and Law,
Zhengzhou, China, in 2007, the M.Sc. degree in
management from Shenzhen University, Shenzhen,
China, in 2010, and the PhD degree in computer
science in 2014 at Victoria University of Wellington,
New Zealand. She is currently an Associate Professor
in the School of Engineering and Computer Science
at Victoria University of Wellington. Her research
focuses mainly on evolutionary computation, feature
selection, feature construction, multi-objective opti-
mization, image analysis, transfer learning, data mining, and machine learning.
She has over 100 papers published in fully refereed international journals
and conferences and most of them are on evolutionary feature selection
and construction. Dr Xue is currently the Chair of the IEEE Task Force
on Evolutionary Feature Selection and Construction, IEEE Computational
Intelligence Society (CIS), Vice-Chair of the IEEE CIS Data Mining and
Big Data Analytics Technical Committee, and Vice-Chair of IEEE CIS Task
Force on Transfer Learning & Transfer Optimization.
Mengjie Zhang (M’04-SM’10-F’18) received the
B.E. and M.E. degrees from Artificial Intelligence
Research Center, Agricultural University of Hebei,
Hebei, China, and the Ph.D. degree in computer
science from RMIT University, Melbourne, VIC,
Australia, in 1989, 1992, and 2000, respectively. He
is currently Professor of Computer Science, Head
of the Evolutionary Computation Research Group,
and the Associate Dean (Research and Innovation)
in the Faculty of Engineering. His current research
interests include evolutionary computation, particu-
larly genetic programming, particle swarm optimization, and learning classifier
systems with application areas of image analysis, multi-objective optimization,
feature selection and reduction, job shop scheduling, and transfer learning. He
has published over 350 research papers in refereed international journals and
conferences. Prof. Zhang is a Fellow of the Royal Society of New Zealand and
has been a Panel member of the Marsden Fund (New Zealand Government
Funding). He is a vice-chair of the IEEE CIS Task Force on Evolutionary
Feature Selection and Construction, a vice-chair of the Task Force on
Evolutionary Computer Vision and Image Processing, and the founding chair
of the IEEE Computational Intelligence Chapter in New Zealand. He is also
a committee member of the IEEE NZ Central Section.
Gary G. Yen (S’87-M’88-SM’97-F’09) received a
Ph.D. degree in electrical and computer engineer-
ing from the University of Notre Dame in 1992.
Currently he is a Regents Professor in the School
of Electrical and Computer Engineering, Oklahoma
State University (OSU). Before joining OSU in 1997,
he was with the Structure Control Division, U.S.
Air Force Research Laboratory in Albuquerque. His
research interest includes intelligent control, compu-
tational intelligence, conditional health monitoring,
signal processing and their industrial/defense applications.
Dr. Yen was an associate editor of the IEEE Control Systems Magazine,
IEEE Transactions on Control Systems Technology, Automatica, Mechatronics,
IEEE Transactions on Systems, Man and Cybernetics, Parts A and B and
IEEE Transactions on Neural Networks. He is currently serving as an associate
editor for the IEEE Transactions on Evolutionary Computation and the IEEE
Transactions on Cybernetics. He served as the General Chair for the 2003
IEEE International Symposium on Intelligent Control held in Houston, TX and
2006 IEEE World Congress on Computational Intelligence held in Vancouver,
Canada. Dr. Yen served as Vice President for the Technical Activities in 2005-
2006 and then President in 2010-2011 of the IEEE Computational Intelligence
Society. He was the founding editor-in-chief of the IEEE Computational
Intelligence Magazine, 2006-2009. In 2011, he received the Andrew P. Sage Best
Transactions Paper award from the IEEE Systems, Man and Cybernetics Society
and in 2014, he received the Meritorious Service award from the IEEE Computational
Intelligence Society.
... Several researchers conducted the insight study on the structure of block. In their works, a collection of basic computational nodes is pre-defended, which contains many different convolution and pool layers [14,15]. The encoding strategy is used to construct the bock structure using these computational nodes, and to generate the DL network based on the blocks further. ...
... Suganuma et al. [30] designed the CNN network using the Cartesian Genetic Programming (CGP), which adopted the convolutional nodes and tensor concatenation as the CGP nodes to form the complex CNN structure. Sun et al. [15] investigated the block based NAS for CNN network. The ResNet block and the DenseNet block are used to construct the deep CNN models as well as pooling layers. ...
... The denotation of "O-B2-C" mean that it is without cutout technique, Num is 2 and the attention type is CBAM. Other hyper-parameters are adopted as the literatures [15]. ...
Deep Learning (DL) has achieved the great breakthrough in image classification. As DL structure is problem-dependent and it has the crucial impact on its performance, it is still necessary to re-design the structures of DL according to the actual needs, even there exists various benchmark DL structures. Neural Architecture Search (NAS) which can design the DL network automatically has been widely investigated. However, many NAS methods suffer from the huge computation time. To overcome this drawback, this research proposed a new Evolutionary Neural Architecture Search with RepVGG nodes (EvoNAS-Rep). Firstly, a new encoding strategy is developed, which can map the fixed-length encoding individual to DL structure with variable depth using RepVGG nodes. Secondly, Genetic Algorithm (GA) is adopted for searching the optimal individual and its corresponding DL model. Thirdly, the iterative training process is designed to train the DL model and to evolve the GA simultaneously. The proposed EvoNAS-Rep is validated on the famous CIFAR 10 and CIFAR 100. The results show that EvoNAS-Rep has obtained 96.35% and 79.82% with only near 0.2 GPU days, which is both effectiveness and efficiency. EvoNAS-Rep is also validated on two real-world applications, including the NEU-CLS and the Chest XRay 2017 datasets, and the results show that EvoNAS-Rep has achieved the competitive results.
... To limit the search space, Yu et al. (2021) and Sun et al. (2019) proposed block-based methods. However, the results are insufficient and unstable due to the lack of theoretical support. ...
... Table 6 shows that test accuracy depends strongly on the number of images per class. We selected eight species, including Cirsium arvense, and searched for architecture based on s2] be the interval of the image number per class (for example (Rahaman et al., 2019;Sun et al., 2019),), and we searched twice with s1 and s2 per class (for >2,000, set [2,500, 3,000]), then calculated the mean accuracy. From Table 6, we found that our method got higher accuracy than Ref (Camille et al., 2021). ...
Full-text available
It is well known that crop classification is essential for genetic resources and phenotype development. Compared with traditional methods, convolutional neural networks can be utilized to identify features automatically. Nevertheless, crops and scenarios are quite complex, which makes it challenging to develop a universal classification method. Furthermore, manual design demands professional knowledge and is time-consuming and labor-intensive. In contrast, auto-search can create network architectures when faced with new species. Using rapeseed images for experiments, we collected eight types to build datasets (rapeseed dataset (RSDS)). In addition, we proposed a novel target-dependent search method based on VGGNet (target-dependent neural architecture search (TD-NAS)). The result shows that test accuracy does not differ significantly between small and large samples. Therefore, the influence of the dataset size on generalization is limited. Moreover, we used two additional open datasets (Pl@ntNet and ICL-Leaf) to test and prove the effectiveness of our method due to three notable features: (a) small sample sizes, (b) stable generalization, and (c) free of unpromising detections.
... Fortunately, the potential of evolutionary algorithm (EA) in neural architecture searching (ENAS) has attracted considerable attention in the last few years and has yielded very promising results [11,12]. Systematic reviews of ENAS can be found in [13,14], where several representative algorithms are introduced, such as Genetic CNN and Evo-CNN (CNN architectures searching based on genetic algorithm) [15,16], EAS and Meta-QNN (CNN architectures searching based on Q-learning) [17,18], Large-scale Evolution [19], CGP-CNN (CNN architectures searching based on Cartesian genetic programming) [20], NAS (CNN architectures searching based on reinforcement learning) [21], and CNN-GA and AE-CNN (CNN architectures searching based on genetic algorithm and block encoding) [22,23], etc. Experimental results of these methods have shown that pretty good performance in searching for the optimal network structures and excellent results have been achieved. But there are still some important limitations that need to be addressed. ...
Full-text available
As a popular research in the field of artificial intelligence in the last 2 years, evolutionary neural architecture search (ENAS) compensates the disadvantage that the construction of convolutional neural network (CNN) relies heavily on the prior knowledge of designers. Since its inception, a great deal of researches have been devoted to improving its associated theories, giving rise to many related algorithms with pretty good results. Considering that there are still some limitations in the existing algorithms, such as the fixed depth or width of the network, the pursuit of accuracy at the expense of computational resources, and the tendency to fall into local optimization. In this article, a multi-objective genetic programming algorithm with a leader–follower evolution mechanism (LF-MOGP) is proposed, where a flexible encoding strategy with variable length and width based on Cartesian genetic programming is designed to represent the topology of CNNs. Furthermore, the leader–follower evolution mechanism is proposed to guide the evolution of the algorithm, with the external archive set composed of non-dominated solutions acting as the leader and an elite population updated followed by the external archive acting as the follower. Which increases the speed of population convergence, guarantees the diversity of individuals, and greatly reduces the computational resources. The proposed LF-MOGP algorithm is evaluated on eight widely used image classification tasks and a real industrial task. Experimental results show that the proposed LF-MOGP is comparative with or even superior to 35 existing algorithms (including some state-of-the-art algorithms) in terms of classification error and number of parameters.
... The latter is also referred to as neuroevolution [14]. It has been reported that neuroevolution requires less computational time compared to RL-based NAS methods [15]. There are abundant mature neuroevolution theories lying idle subject to traditional simple DNN architecture composed by only fullconnected nodes and shortage of computing resources in early age. ...
Neuroevolution has greatly promoted Deep Neural Network (DNN) architecture design and its applications, while there is a lack of methods available across different DNN types concerning both their scale and performance. In this study, we propose a self-adaptive neuroevolution (SANE) approach to automatically construct various lightweight DNN architectures for different tasks. One of the key settings in SANE is the search space defined by cells and organs self-adapted to different DNN types. Based on this search space, a constructive evolution strategy with uniform evolution settings and operations is designed to grow DNN architectures gradually. SANE is able to self-adaptively adjust evolution exploration and exploitation to improve search efficiency. Moreover, a speciation scheme is developed to protect evolution from early convergence by restricting selection competition within species. To evaluate SANE, we carry out neuroevolution experiments to generate different DNN architectures including convolutional neural network, generative adversarial network and long short-term memory. The results illustrate that the obtained DNN architectures could have smaller scale with similar performance compared to existing DNN architectures. Our proposed SANE provides an efficient approach to self-adaptively search DNN architectures across different types.
... Modern architecture search methods produce state-of-the-art results for many tasks, but it usually takes a long time even on GPU to obtain high-performance architecture [22]. The genetic algorithm has also been applied to architecture searches, such as evolving unsupervised DNNS (EUDNN) [23], automatic evolving CNN (AE-CNN) [24], and the evolving deep convolutional neural network (EvoCNN) [25]. RL is also used to find compressed architecture from well-trained networks. ...
Full-text available
Deep neural networks (DNNs) have achieved great success in the field of computer vision. The high requirements for memory and storage by DNNs make it difficult to apply them to mobile or embedded devices. Therefore, compression and structure optimization of deep neural networks have become a hot research topic. To eliminate redundant structures in deep convolutional neural networks (DCNNs), we propose an efficient filter pruning framework via deep reinforcement learning (DRL). The proposed framework is based on a deep deterministic policy gradient (DDPG) algorithm for filter pruning rate optimization. The main features of the proposed framework are as follows: (1) AA tailored reward function considering both accuracy and complexity of DCNN is proposed for the training of DDPG and (2) a novel filter sorting criterion based on Taylor expansion is developed for filter pruning selection. To illustrate the effectiveness of the proposed framework, extensive comparative studies on large public datasets and well-recognized DCNNs are conducted. The experimental results demonstrate that the Taylor-expansion-based filter sorting criterion is much better than the widely used minimum-weight-based criterion. More importantly, the proposed filter pruning framework can achieve over 10× parameter compression and 3× floating point operations (FLOPs) reduction while maintaining similar accuracy to the original network. The performance of the proposed framework is promising compared with state-of-the-art DRL-based filter pruning methods.
... This algorithm is not able to explore a lot of network structures because of the fixed-length encoding scheme. In AE-CNN (Sun et al. 2019b), GA is automatically used to design CNN architecture. This algorithm uses ResNet and DenseNet blocks as the basic building blocks. ...
Full-text available
Convolutional neural networks (CNN) are highly effective for image classification and computer vision activities. The accuracy of CNN architecture depends on the design and selection of optimal parameters. The number of parameters increases exponentially with every connected layer in deep CNN architecture. Therefore, the manual selection of efficient parameters entirely remains ad-hoc. To solve that problem, we must carefully examine the relationship between the depth of architecture, input parameters, and the model’s accuracy. The evolutionary algorithms are prominent in solving the challenges in architecture design and parameter selection. However, the adoption of evolutionary algorithms itself is a challenging task as the computation cost increases with its evolution. The performance of evolutionary algorithms depends on the type of encoding technique used to represent a CNN architecture. In this article, we presented a comprehensive study of the recent approaches involved in the design and training of CNN architecture. The advantages and disadvantages of selecting a CNN architecture using evolutionary algorithms are discussed. The manual architecture is compared against automated CNN architecture based on the accuracy and range of parameters in the existing benchmark datasets. Furthermore, we have discussed the ongoing issues and challenges involved in evolutionary algorithms-based CNN architecture design.
... In 2017, Google company took the initial step in proposing an ENAS algorithm especially for CNNs [21]. Since then, a slew of excellent ENAS algorithms have been proposed [22]- [25], showing superior performance mainly in image classifications. ...
Evolutionary computation-based neural architecture search (ENAS) is a popular technique for automating architecture design of deep neural networks. In recent years, various ENAS algorithms have been proposed and shown promising performance on diverse real-world applications. In contrast to these groundbreaking applications, there is no theoretical guideline for assigning a reasonable running time (mainly affected by the generation number, population size, and evolution operator) given both the anticipated performance and acceptable computation budget on ENAS problems. The expected hitting time (EHT), which refers to the average generations, is considered to analyze the running time of ENAS algorithms. This paper proposes a general framework for estimating the EHT of ENAS algorithms, which includes common configuration, search space partition, transition probability estimation, and hitting time analysis. By exploiting the proposed framework, we consider the so-called ($\lambda$+$\lambda$)-ENAS algorithms with different mutation operators and manage to estimate the lower bounds of the EHT {which are critical for the algorithm to find the global optimum}. Furthermore, we study the theoretical results on the NAS-Bench-101 architecture searching problem, and the results show that the one-bit mutation with "bit-based fair mutation" strategy needs less time than the "offspring-based fair mutation" strategy, and the bitwise mutation operator needs less time than the $q$-bit mutation operator. To the best of our knowledge, this is the first work focusing on the theory of ENAS, and the above observation will be substantially helpful in designing efficient ENAS algorithms.
Existing neural architecture search (NAS) methods usually explore a limited feature-transformation-only search space, ignoring other advanced feature operations such as feature self-calibration by attention and dynamic convolutions. This disables the NAS algorithms to discover more advanced network architectures. We address this limitation by additionally exploiting feature self-calibration operations, resulting in a heterogeneous search space. To solve the challenges of operation heterogeneity and significantly larger search space, we formulate a neural operator search (NOS) method. NOS presents a novel heterogeneous residual block for integrating the heterogeneous operations in a unified structure, and an attention guided search strategy for facilitating the search process over a vast space. Extensive experiments show that NOS can search novel cell architectures with highly competitive performance on the CIFAR and ImageNet benchmarks.
Most VAEs were developed with symmetrical architecture in mind, which means that the encoder and decoder must have the same number of layers. However, when completing the step of unsupervised pre-training step, the decoder portion is deprecated and will never be employed when fine-tuning image classification problems. As a result, maintaining a symmetrical architecture is not nearly necessary [1]. However, new complications develop when the asymmetrical architecture is used, despite the fact that it is a quite appealing decision. This shows that if the asymmetrical architectures of CVAEs are planned to improve, one needs to evaluate the architectures of each partition on its own before making any changes to the architectures as a whole. In addition, for the correct forwarding of the asymmetrical CVAEs, it is essential to make certain that the decoder appropriately fits the resolution of the raw input data in the appropriate manner. This also adds another layer of complication to the manual design process for VAE designs.
Conference Paper
Full-text available
Techniques for automatically designing deep neural network architectures such as reinforcement learning based approaches have recently shown promising results. However, their success is based on vast computational resources (e.g. hundreds of GPUs), making them difficult to be widely used. A noticeable limitation is that they still design and train each network from scratch during the exploration of the architecture space, which is highly inefficient. In this paper, we propose a new framework toward efficient architecture search by exploring the architecture space based on the current network and reusing its weights. We employ a reinforcement learning agent as the meta-controller, whose action is to grow the network depth or layer width with function-preserving transformations. As such, the previously validated networks can be reused for further exploration, thus saves a large amount of computational cost. We apply our method to explore the architecture space of the plain convolutional neural networks (no skip-connections, branching etc.) on image benchmark datasets (CIFAR-10, SVHN) with restricted computational resources (5 GPUs). Our method can design highly competitive networks that outperform existing networks using the same design scheme. On CIFAR-10, our model without skip-connections achieves 4.23% test error rate, exceeding a vast majority of modern architectures and approaching DenseNet. Furthermore, by applying our method to explore the DenseNet architecture space, we are able to achieve more accurate networks with fewer parameters.
Full-text available
The performance of multi-objective evolutionary algorithms deteriorates appreciably in solving many-objective optimization problems which encompass more than three objectives. One of the known rationales is the loss of selection pressure which leads to the selected parents not generating promising offspring towards Pareto-optimal front with diversity. Estimation of distribution algorithms sample new solutions with a probabilistic model built from the statistics extracting over the existing solutions so as to mitigate the adverse impact of genetic operators. In this paper, an improved regularity-based estimation of distribution algorithm is proposed to effectively tackle unconstrained many-objective optimization problems. In the proposed algorithm, diversity repairing mechanism is utilized to mend the areas where need non-dominated solutions with a closer proximity to the Pareto-optimal front. Then favorable solutions are generated by the model built from the regularity of the solutions surrounding a group of representatives. These two steps collectively enhance the selection pressure which gives rise to the superior convergence of the proposed algorithm. In addition, dimension reduction technique is employed in the decision space to speed up the estimation search of the proposed algorithm. Finally, by assigning the Pareto-optimal solutions to the uniformly distributed reference vectors, a set of solutions with excellent diversity and convergence is obtained. To measure the performance, NSGA-III, GrEA, MOEA/D, HypE, MBN-EDA, and RM-MEDA are selected to perform comparison experiments over DTLZ and DTLZ- test suites with 3-, 5-, 8-, 10-, and 15-objective. Experimental results quantified by the selected performance metrics reveal that the proposed algorithm shows considerable competitiveness in addressing unconstrained many-objective optimization problems.
Inverted Generational Distance (IGD) has been widely considered a reliable performance indicator to concurrently quantify the convergence and diversity of multi- and many-objective evolutionary algorithms. In this paper, an IGD indicator-based evolutionary algorithm for solving many-objective optimization problems (MaOPs) is proposed. Specifically, the IGD indicator is employed in each generation to select the solutions with favorable convergence and diversity. In addition, a computationally efficient dominance comparison method is designed to assign the rank values of solutions, along with three newly proposed proximity distance assignments. Based on these two designs, the solutions are selected from a global view by a linear assignment mechanism to address convergence and diversity simultaneously. To improve the accuracy of the sampled reference points for the calculation of the IGD indicator, we also propose an efficient decomposition-based nadir point estimation method for constructing the Utopian Pareto front, which is regarded as the best approximate Pareto front for real-world MaOPs at the early stage of the evolution. To evaluate the performance, a series of experiments is performed on the proposed algorithm against a group of selected state-of-the-art many-objective optimization algorithms over optimization problems with 8, 15, and 20 objectives. Experimental results measured by the chosen performance metrics indicate that the proposed algorithm is very competitive in addressing MaOPs.
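For readers unfamiliar with the indicator, the standard IGD definition can be sketched in a few lines of NumPy: for each reference point sampled from the (approximate) Pareto front, take the Euclidean distance to the nearest obtained solution, then average over the reference points. This is the textbook form of the indicator only; it is not the selection procedure or the nadir point estimation method of the abstract above, and the function name `igd` is an illustrative assumption.

```python
import numpy as np


def igd(reference_points, solutions):
    """Inverted Generational Distance between a reference set and a solution set.

    Both arguments are array-likes of shape (n_points, n_objectives).
    Lower values indicate solutions that are closer to, and better spread over,
    the reference front.
    """
    ref = np.asarray(reference_points, dtype=float)
    sol = np.asarray(solutions, dtype=float)
    # Pairwise Euclidean distances, shape (n_reference, n_solutions).
    dists = np.linalg.norm(ref[:, None, :] - sol[None, :, :], axis=2)
    # Distance from each reference point to its nearest solution, then the mean.
    return dists.min(axis=1).mean()
```

A solution set that covers every reference point exactly gives an IGD of zero, while a set clustered in one region of the front is penalized by the reference points it leaves uncovered, which is why the indicator reflects convergence and diversity at the same time.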
Deep Learning (DL) aims at learning meaningful representations. A meaningful representation is one that, when it replaces the raw data as input, significantly improves the performance of the associated Machine Learning (ML) tasks. However, optimal architecture design and model parameter estimation in DL algorithms are widely considered to be intractable. Evolutionary algorithms are much preferable for complex and non-convex problems due to their inherent characteristics of being gradient-free and insensitive to local optima. In this paper, we propose a computationally economical algorithm for evolving unsupervised deep neural networks to efficiently learn meaningful representations, which is very suitable in the current Big Data era, where sufficient labeled data for training is often expensive to acquire. In the proposed algorithm, finding an appropriate architecture and the initialized parameter values for an ML task at hand is modeled by a computationally efficient gene encoding approach, which is employed to effectively model the task with a large number of parameters. In addition, a local search strategy is incorporated to facilitate the exploitation search for further improving the performance. Furthermore, a small proportion of labeled data is utilized during the evolutionary search to guarantee that the learnt representations are meaningful. The performance of the proposed algorithm has been thoroughly investigated over classification tasks. Specifically, a classification error rate of 1.15% on MNIST is reached by the proposed algorithm consistently, which is a very promising result against state-of-the-art unsupervised DL algorithms.
This book presents a unified view of evolutionary algorithms: the exciting new probabilistic search tools inspired by biological models that have immense potential as practical problem-solvers in a wide variety of settings, academic, commercial, and industrial. In this work, the author compares the three most prominent representatives of evolutionary algorithms: genetic algorithms, evolution strategies, and evolutionary programming. The algorithms are presented within a unified framework, thereby clarifying the similarities and differences of these methods. The author also presents new results regarding the role of mutation and selection in genetic algorithms, showing how mutation seems to be much more important for the performance of genetic algorithms than usually assumed. The interaction of selection and mutation, and the impact of the binary code are further topics of interest. Some of the theoretical results are also confirmed by performing an experiment in meta-evolution on a parallel computer. The meta-algorithm used in this experiment combines components from evolution strategies and genetic algorithms to yield a hybrid capable of handling mixed integer optimization problems. As a detailed description of the algorithms, with practical guidelines for usage and implementation, this work will interest a wide range of researchers in computer science and engineering disciplines, as well as graduate students in these fields.
Conference Paper
We explore efficient neural architecture search methods and present a simple yet powerful evolutionary algorithm that can discover new architectures achieving state-of-the-art results. Our approach combines a novel hierarchical genetic representation scheme that imitates the modularized design pattern commonly adopted by human experts, and an expressive search space that supports complex topologies. Our algorithm efficiently discovers architectures that outperform a large number of manually designed models for image classification, obtaining a top-1 error of 3.6% on CIFAR-10 and 20.3% when transferred to ImageNet, which is competitive with the best existing neural architecture search approaches and represents the new state of the art for evolutionary strategies on this task. We also present results using random search, achieving 0.3% less top-1 accuracy on CIFAR-10 and 0.1% less on ImageNet whilst reducing the architecture search time from 36 hours down to 1 hour.