All content in this area was uploaded by Yanan Sun on Dec 25, 2019

Completely Automated CNN Architecture Design Based on Blocks

Yanan Sun, Member, IEEE, Bing Xue, Member, IEEE, Mengjie Zhang, Fellow, IEEE,

and Gary G. Yen, Fellow, IEEE

Abstract—The performance of Convolutional Neural Networks

(CNNs) relies heavily on their architectures. Designing a CNN with
promising performance requires extensive expertise in both CNNs and the
investigated problem domain, which is not necessarily available to every
interested user. To address this problem, we propose to automatically
evolve CNN architectures by using a genetic algorithm based on ResNet
and DenseNet blocks. The proposed algorithm is completely automatic in
designing CNN architectures. In particular, neither pre-processing
before it starts nor post-processing in terms of CNNs is needed.
Furthermore, the proposed algorithm does not require users to have
domain knowledge of CNNs, the investigated problem or even genetic
algorithms. The proposed algorithm is evaluated on the CIFAR10 and
CIFAR100 benchmark datasets against 18 state-of-the-art peer
competitors. Experimental results show that the proposed algorithm
outperforms state-of-the-art hand-crafted CNNs and CNNs designed by
automatic peer competitors in terms of classification performance, and
achieves competitive classification accuracy against semi-automatic
peer competitors. In addition, the proposed algorithm consumes much
less computational resource than most peer competitors in finding
the best CNN architectures.

Index Terms—Convolutional neural networks, genetic algo-

rithms, evolutionary deep learning, automatic architecture design,

neural networks.

I. INTRODUCTION

Convolutional Neural Networks (CNNs) [1] have shown promising
performance in various real-world applications [2]–[5]. It is well known
that the performance of CNNs depends heavily on their architectures,
such as how many building-block layers (e.g., the convolutional and
pooling layers) are used, how these layers are composed, and how their
parameters are specified.

This work was supported in part by the National Natural Science Foundation

of China under Grant 61803277, in part by the Fundamental Research Funds

for the Central Universities, in part by the National Natural Science Fund of

China for Distinguished Young Scholar under Grant 61625204, and in part by

the Marsden Fund of New Zealand Government under Contracts VUW1209,

VUW1509 and VUW1615, Huawei Industry Fund E2880/3663, and the

University Research Fund at Victoria University of Wellington 209862/3580,

and 213150/3662.

Yanan Sun is with the College of Computer Science, Sichuan University,

Chengdu 610065, China, and also with the School of Engineering and

Computer Science, Victoria University of Wellington, Wellington 6140, New

Zealand (e-mail: ysun@scu.edu.cn).

Bing Xue, and Mengjie Zhang are with the School of Engineer-

ing and Computer Science, Victoria University of Wellington, PO Box

600, Wellington 6140, New Zealand (e-mail: bing.xue@ecs.vuw.ac.nz; and

mengjie.zhang@ecs.vuw.ac.nz).

Gary G. Yen is with the School of Electrical and Computer Engi-

neering, Oklahoma State University, Stillwater, OK 74078 USA (email:

gyen@okstate.edu).

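Generally, given a CNN, denoted by $A$, having $n$ architecture-related
parameters $\lambda_1, \cdots, \lambda_n$ whose decision spaces are
$\Lambda_1, \cdots, \Lambda_n$, respectively, the CNN architecture design is to
optimize the problem formulated by (1)

$$\arg\min_{\boldsymbol{\lambda}} \; \mathcal{L}(A_{\boldsymbol{\lambda}}, \mathcal{D}_{train}, \mathcal{D}_{valid}) \quad \text{s.t.} \quad \boldsymbol{\lambda} \in \boldsymbol{\Lambda} \tag{1}$$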

where $\boldsymbol{\lambda} = \{\lambda_1, \cdots, \lambda_n\}$,
$\boldsymbol{\Lambda} = \Lambda_1 \times \cdots \times \Lambda_n$,
$A_{\boldsymbol{\lambda}}$ denotes the CNN $A$ adopting the architecture
parameter setting $\boldsymbol{\lambda}$, and $\mathcal{L}(\cdot)$ measures the
performance of $A_{\boldsymbol{\lambda}}$ on the validation data
$\mathcal{D}_{valid}$ after $A_{\boldsymbol{\lambda}}$ has been trained on the
training data $\mathcal{D}_{train}$. In the case of classification tasks,
$\mathcal{L}(\cdot)$ measures the classification error of the tasks to which
$A$ is applied. Typically, gradient-based algorithms, such as stochastic
gradient descent [6], are employed to train the weights of
$A_{\boldsymbol{\lambda}}$, as $\mathcal{L}(\cdot)$ is differentiable (or
approximately differentiable) with respect to the weights.
Unfortunately, with respect to the architecture-related parameters,
$\mathcal{L}(\cdot)$ is often non-convex and non-differentiable because these
parameters usually take discrete values, e.g., the feature map sizes of
convolutional layers are generally specified as integers. As a
consequence, exact optimization algorithms (e.g., gradient-based
algorithms) are incapable of, or ineffective in, solving the
architecture optimization problem [7], [8].

As a result, researchers have proposed various architecture
optimization algorithms based on heuristic computational paradigms [9],
such as random search [10], Bayesian-based Gaussian process [11], [12],
tree-structured Parzen estimators [13], sequential model-based global
optimization [14], neuroevolution of augmenting topologies [15],
evolutionary unsupervised deep learning [8], etc. However, in CNN
architecture optimization, it is impossible to know the optimal number
of layers in advance, e.g., the particular value of $n$ in
$\boldsymbol{\lambda}$, to compose the best CNN architecture; i.e., the number
of decision variables for an optimal CNN architecture is also unknown
before the best CNN architecture is found. Because the aforementioned
architecture optimization methods assume that the number of optimized
parameters is fixed, they cannot be used effectively and efficiently
for CNN architecture design either. Although we could enumerate each
potential value of $n$ and then run these methods for each different
$n$, the run-time computational complexity would increase by an order of
magnitude as $n$ grows, and satisfactory solutions may not be obtained
within an acceptable time [16].

Due to this, state-of-the-art CNNs such as ResNet [17] and
DenseNet [18] are primarily hand-crafted. Designing CNNs
manually requires considerable expertise in CNN architecture,
as well as in the problem domain. This is often not available
in practice. For example, a medical doctor could find a CNN
extremely useful in evaluating the results of a Magnetic
Resonance Imaging (MRI) scan. While the doctor clearly
has expertise in the problem domain, they are very unlikely
to have comparable experience in CNN architectures. This
barrier has prevented CNNs from being utilised in a variety of
image classification tasks. There is a significant demand for
algorithms which are able to effectively and efficiently design¹
CNN architectures without requiring such expertise.

Fortunately, in the last two years, multiple algorithms for designing
CNN architectures have been proposed. Based on whether pre- or
post-processing in terms of CNNs is required when these algorithms are
used, they can be divided into two categories: the semi-automatic CNN
architecture design algorithms and the completely automatic ones. In
particular, the semi-automatic algorithms cover the genetic CNN method
(Genetic CNN) [19], the hierarchical representation-based method
(Hierarchical Evolution) [20], the efficient architecture search method
(EAS) [21], and the block design method (Block-QNN-S) [22], to name a
few. The automatic algorithms include the large-scale evolution method
(Large-scale Evolution) [23], the Cartesian genetic programming method
(CGP-CNN) [24], the neural architecture search method (NAS) [25], and
the meta-modelling method (MetaQNN) [26]. These algorithms are mainly
based on evolutionary algorithms [27] or reinforcement learning [28].
For example, Genetic CNN, Large-scale Evolution, Hierarchical Evolution
and CGP-CNN are based on evolutionary algorithms, while NAS, MetaQNN,
EAS and Block-QNN-S are built on reinforcement learning.

Experimental results from these algorithms have shown their promising
performance in finding the best CNN architectures on the given data.
However, major limitations remain. Firstly, expertise in the
investigated data and CNNs is still needed by the semi-automatic CNN
architecture design algorithms. For example, EAS operates on a base
network which already has fairly good performance on the investigated
problem. However, the base network is manually designed based on
expertise. Block-QNN-S only designs several small networks, and these
networks are then integrated into a larger CNN framework. However, the
other types of layers, such as the pooling layers, need to be properly
incorporated into the CNN framework with expertise. Secondly, the CNN
architecture design algorithms based on reinforcement learning
typically consume much more computational resource. For instance, NAS
consumes 28 days on 800 Graphics Processing Unit (GPU) cards for the
CIFAR10 dataset [29]. However, sufficient computational resource is not
necessarily available to every interested user. Thirdly, the CNN
architecture design algorithms based on evolutionary algorithms exploit
the principles of evolutionary algorithms only partially, so the CNNs
they find usually do not achieve promising performance on the
investigated problems. For example, Genetic CNN employs a fixed-length
encoding scheme to represent CNNs. However, we never know the best
depth of the CNN in solving a new problem. To address this, Large-scale
Evolution utilizes a variable-length encoding scheme where the CNNs can
adaptively change their depths for the problems. However, Large-scale
Evolution uses only the mutation operator, without any crossover
operator, during the search process. In evolutionary algorithms, the
crossover operator and the mutation operator play the complementary
roles of local search and global search, respectively. Without the
crossover operator, the mutation operator works just like random search
from different start positions. Nevertheless, it is not surprising that
Large-scale Evolution does not use the crossover operator, since the
crossover operator was originally designed for the fixed-length
encoding scheme.

¹ In this paper, the terms “design”, “find”, “learn” and “evolve” have
identical meaning when used to describe “CNN architectures”.

In summary, the development of CNN architecture design algorithms,
especially completely automatic ones that achieve promising performance
with limited computational resources, is still in its infancy. The aim
of this paper is to design and develop a new genetic algorithm-based
method to automatically design CNN architectures by addressing the
limitations discussed above. To achieve this goal, the objectives below
have been specified:

• The proposed algorithm does not mandate any prerequisite
knowledge from the users in base CNN design, the investigated
dataset or genetic algorithms. The CNN whose architecture is
designed by the proposed algorithm can be directly used without
any re-composition, pre-processing, or post-processing.

• The variable-length encoding scheme is employed for searching
the optimal depth of the CNN. To accommodate the variable-length
encoding, a new crossover operator and a mutation operator are
designed and incorporated into the proposed algorithm to
collectively exploit and explore the search space in finding the
best CNN architectures.

• An efficient encoding strategy is designed based on the ResNet
and DenseNet blocks to speed up the architecture design, so that
promising performance can be achieved by the proposed algorithm
with limited computational resources. Note that, although the
ResNet and DenseNet blocks are used in the proposed algorithm,
the users are not required to have expertise in these blocks when
using the proposed algorithm.

The remainder of the paper is organized as follows. The

background related to base knowledge of the proposed al-

gorithm is introduced in Section II. Then, the details of the

proposed algorithm are documented in Section III. To evaluate

the performance of the proposed algorithm, the experiment

design and the numerical results are shown in Sections IV

and V, respectively. Finally, the conclusions and future work

are summarized in Section VI.

II. BACKGROUND

As highlighted in Section I, the proposed algorithm is a novel Genetic
Algorithm (GA) that automatically designs CNN architectures by using
the blocks of ResNet and DenseNet, which are state-of-the-art manually
designed CNNs. To help readers understand the details of the proposed
algorithm presented in Section III, the fundamentals of GAs, ResNet
Blocks (RBs) and DenseNet Blocks (DBs) are discussed in this section.

Fig. 1. An example of the ResNet block (RB).

Fig. 2. An example of the DenseNet block (DB) including four convolutional
layers.

A. Genetic Algorithms

GAs [30] are a class of heuristic population-based computational
paradigms. They are also the most popular type of evolutionary
algorithms (evolutionary algorithms broadly include genetic
programming [31], evolutionary strategy [32] and so on, in addition to
GAs). Because they are gradient-free and insensitive to local minima,
GAs are preferred especially in engineering fields where the
optimization problems are commonly non-convex and non-differentiable
[33], [34]. GAs address optimization problems by imitating biological
evolution through a series of bio-inspired operators, such as
crossover, mutation and selection [35], [36]. Generally, a GA works as
follows:

Step 1: Initialize a population of individuals, each of which
represents a candidate solution of the problem through the employed
encoding strategy;

Step 2: Evaluate the fitness of each individual in the population based
on the encoded information and the fitness function;

Step 3: Select promising parent individuals from the current population
(mating selection), and then generate offspring with the crossover and
mutation operators;

Step 4: Evaluate the fitness of the generated offspring;

Step 5: Select a population of individuals with promising performance
from the current population (environmental selection), and then replace
the current population by the selected population;

Step 6: Go to Step 3 if the termination criterion is not met; otherwise
return the individual with the best fitness as the best solution for
the problem.

Commonly, a maximal generation number is predefined as the
termination criterion.
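The six steps can be sketched as a small generic loop. This is only an illustration under stated assumptions, not the AE-CNN implementation: the function name `run_ga`, the callback signatures, the purely elitist environmental selection (keep the best of parents plus offspring) and the OneMax toy problem are all choices made here for brevity.

```python
import random


def run_ga(init, fitness, crossover, mutate, pop_size=20, generations=30, seed=0):
    """Generic GA loop following Steps 1-6; returns (best individual, fitness)."""
    rng = random.Random(seed)
    # Steps 1-2: initialize the population and evaluate every individual.
    scored = [(ind, fitness(ind)) for ind in (init(rng) for _ in range(pop_size))]
    for _ in range(generations):  # Step 6: stop after a fixed number of generations
        offspring = []
        while len(offspring) < pop_size:
            # Step 3: binary tournaments pick two parents, then crossover and mutation.
            p1 = max(rng.sample(scored, 2), key=lambda s: s[1])[0]
            p2 = max(rng.sample(scored, 2), key=lambda s: s[1])[0]
            q1, q2 = crossover(p1, p2, rng)
            offspring += [mutate(q1, rng), mutate(q2, rng)]
        # Step 4: evaluate the offspring.
        scored_offspring = [(ind, fitness(ind)) for ind in offspring[:pop_size]]
        # Step 5: environmental selection. Keeping the best pop_size of
        # parents + offspring is an assumed rule, chosen for brevity.
        merged = scored + scored_offspring
        scored = sorted(merged, key=lambda s: s[1], reverse=True)[:pop_size]
    return scored[0]


# Toy problem (an assumption for illustration): maximize the number of ones
# in a 16-bit string.
def init_bits(rng):
    return [rng.randint(0, 1) for _ in range(16)]


def cross_bits(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def flip_bits(a, rng):
    return [bit ^ 1 if rng.random() < 0.05 else bit for bit in a]


best, best_fitness = run_ga(init_bits, sum, cross_bits, flip_bits)
```

Because the assumed environmental selection never discards the current best individual, the returned fitness cannot be worse than the best fitness in the initial population.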

B. ResNet and DenseNet Blocks

ResNet [17] and DenseNet [18] are two state-of-the-art

CNNs proposed in recent years. The success of ResNet and

DenseNet largely owes to their building blocks, i.e., RBs and

DBs, respectively.

Fig. 1 shows an example of an RB which is composed of three
convolutional layers² and one skip connection. In this example, the
convolutional layers are denoted as conv1, conv2 and conv3. On conv1,
the spatial size of the input is reduced by a smaller number of filters
with the size of 1×1, to lower the computational complexity of conv2.
On conv2, filters with a larger size, such as 3×3, are used to learn
features with the same spatial size. On conv3, filters with the size of
1×1 are used again, and the spatial size is increased for generating
more features. The input is added, denoted by ⊕, to the output of conv3
as the final output of the RB. Note that if the spatial sizes of the
input and conv3’s output are unequal, a group of convolutional
operations with filters of size 1×1 is applied on the input, to achieve
the same spatial size as that of conv3’s output, for the addition.

Fig. 2 exhibits an example of a DB. For the convenience

of the introduction, we give only four convolutional layers in

the DB. In practice, a DB can have a different number of

convolutional layers, which is tuned by users. In a DB, each

convolutional layer receives inputs from not only the input data

but also the output of all the previous convolutional layers. In

addition, there is a parameter, k, for controlling the spatial

size of the input and output of the same convolutional layer.

If the spatial size of the input is a, then the spatial size of

the output is a+k, which is achieved by the convolutional

operation using the corresponding number of ﬁlters.
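The growth rule for a DB can be checked with a few lines of arithmetic. The sketch below reads the text's "spatial size" as the number of feature maps of a layer, which is an interpretation made here; the function names and the values 64 and 12 are illustrative assumptions.

```python
def dense_block_widths(a: int, k: int, num_layers: int) -> list:
    """Widths seen along a DB: entry i is the width after i convolutional layers.

    Each layer adds k to the width it receives, so the sequence is
    a, a + k, a + 2k, ..., a + num_layers * k.
    """
    return [a + i * k for i in range(num_layers + 1)]


def layers_in_dense_block(a_in: int, a_out: int, k: int) -> int:
    """Invert the relation above: how many layers turn a_in into a_out."""
    grown = a_out - a_in
    if k <= 0 or grown < 0 or grown % k != 0:
        raise ValueError("a_out must equal a_in + n * k for some integer n >= 0")
    return grown // k


widths = dense_block_widths(a=64, k=12, num_layers=4)  # the four-layer case of Fig. 2
num_layers = layers_in_dense_block(a_in=64, a_out=112, k=12)
```

With these illustrative numbers the widths are 64, 76, 88, 100 and 112, and the inverse relation recovers four layers from the input width, the output width and k.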

Efforts in [37], [38] have been put on investigating the

mechanism behind the success of RBs and DBs, and revealed

that RBs and DBs are able to mitigate the adverse impact of

the gradient vanishing problem [39], based on which a deep

architecture is capable of effectively learning the hierarchical

representations of the input data, and then improving the

ﬁnal classiﬁcation accuracy in turn. In addition, the dense

connections in DBs have also been claimed to be able to

reuse the low-level features, to increase the discrimination of

features learned at the top layers of CNNs [18]. Mainly based

on these good characteristics, RBs and DBs are chosen as the

building blocks in the proposed algorithm.

III. THE PROPOSED ALGORITHM

In this section, the framework of the proposed algorithm

and its main components are discussed in detail. For the

convenience of the development, the proposed algorithm is

named AE-CNN (Automatically Evolving CNNs) in short, and

the evolved CNN is used solely for image classiﬁcation tasks.

A. Algorithm Overview

Algorithm 1 shows the framework of AE-CNN, which is composed of three
parts. Firstly, the population is randomly initialized with a
predefined size of N (line 1). Then, the individuals are evaluated for
the fitness (line 2). Next, all individuals in the population take part
in the evolutionary process of the GA with the maximal generation
number of T (lines 3-14). Finally, the best CNN architecture is decoded
from the best individual that is chosen from the final population based
on the fitness (line 15). During the evolutionary process, an empty
population is initialized for holding offspring (line 5), and then new
offspring are generated from selected parents with the crossover and
mutation operations, while the parents are selected by the binary
tournament selection (lines 6-10); after the fitness of the generated
offspring has been evaluated (line 11), a new population is selected
with the environmental selection operation (line 12) from the current
population (containing the current individuals and the generated
offspring) as the parent solutions surviving into the next evolutionary
process (i.e., the next generation). Note that the symbol |·| shown in
line 6 is a cardinality operator. In the following, the phases of
“Population Initialization,” “Fitness Evaluation,” “Offspring
Generation” and “Environmental Selection” are documented in Subsections
III-B, III-C, III-D and III-E, respectively.

² Here we only detail this type of block, which is used to build deeper
networks. Indeed, ResNet also has another type of block which is typically
used for building networks with no more than 34 layers.

Algorithm 1: Framework of AE-CNN
Input: The population size N, the maximal generation number T, the crossover probability µ, the mutation probability ν.
Output: The best CNN.
1  P_0 ← Initialize a population with the size of N by using the proposed encoding strategy;
2  Evaluate the fitness of individuals in P_0;
3  t ← 0;
4  while t < T do
5      Q_t ← ∅;
6      while |Q_t| < N do
7          p_1, p_2 ← Select two parent individuals from P_t by using binary tournament selection;
8          q_1, q_2 ← Generate two offspring from p_1 and p_2 by the crossover operation with the probability of µ and the mutation operation with the probability of ν;
9          Q_t ← Q_t ∪ q_1 ∪ q_2;
10     end
11     Evaluate the fitness of individuals in Q_t;
12     P_{t+1} ← Select N individuals from P_t ∪ Q_t by environmental selection;
13     t ← t + 1;
14 end
15 Select the best individual from P_t and decode it to the corresponding CNN.

B. Population Initialization

Population initialization provides a base population contain-

ing multiple individuals for the following evolutionary process.

Generally, all the individuals are initialized in a random
manner with a uniform distribution. As introduced in
Subsection II-A, each individual in GAs represents a
candidate solution of the problem to be solved. Because GAs

in the proposed algorithm are employed to ﬁnd the best CNN

architecture, each individual in the proposed algorithm should

represent a CNN architecture. Generally, the architecture of a

CNN is constructed by multiple convolutional layers, pooling

layers and fully-connected layers in a particular order, as

well as their parameter settings. In the proposed algorithm,

CNNs are constructed based on RBs, DBs and pooling layers,

which is motivated by the remarkable success of ResNet [17]

and DenseNet [18], while the fully-connected layers are not

considered in the proposed algorithm. The main reason is

that the fully-connected layers easily cause the over-ﬁtting

phenomenon [40] due to their full-connection nature. To

reduce this phenomenon, other techniques must be adopted,

such as Dropout [41]. However, these techniques will also give

rise to extra parameters that need to be carefully tuned, which

will increase the computational complexity of the proposed

algorithm. The experimental results shown in Section V will

justify that the promising performance of the proposed algo-

rithm can still be achieved without using the fully-connected

layers. The details of initializing the population of AE-CNN

are summarized in Algorithm 2.

Algorithm 2: Initialize Population
Input: The population size N, the training instance dimension d × d.
Output: The initialized population P_0.
1  P_0 ← ∅;
2  mp ← Calculate the maximal number of pooling layers by ⌊log_2(d)⌋;
3  for i ← 1 to N do
4      k ← Randomly initialize a positive integer;
5      a ← Initialize an empty array with the size of k;
6      for j ← 1 to k do
7          u ← Randomly choose one from {RBU, DBU, PU};
8          if u is a PU and the number of used PUs is not less than mp then
9              u ← Randomly choose one from {RBU, DBU};
10         end
11         Encode u and put the encoded information into the j-th position of a;
12     end
13     P_0 ← P_0 ∪ a;
14 end
15 Return P_0.

Next, we will explain the details of lines 8 and 11 because the other
parts of Algorithm 2 are straightforward. Specifically, the pooling
layers in CNNs perform dimension reduction on their input data, and the
most commonly used pooling operation halves the input size, as can be
seen from state-of-the-art CNNs [2]–[5], [17], [18]. As a result, the
number of employed pooling layers cannot be arbitrarily specified, but
must follow the constraint calculated in line 2. For example, if the
input size is 32 × 32, the number of used pooling layers cannot be
larger than five (⌊log_2(32)⌋ = 5), because five pooling layers will
reduce the dimension of the input data to 1 × 1, and one extra pooling
layer on the dimension of 1 × 1 will lead to a logic error.
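The bound from line 2 of Algorithm 2 is easy to verify numerically. The following sketch uses hypothetical helper names and assumes that every pooling layer halves the side length with integer division.

```python
def max_pooling_layers(d: int) -> int:
    """Upper bound mp = floor(log2(d)) on the number of halving pooling layers."""
    if d < 1:
        raise ValueError("d must be a positive integer")
    # For a positive integer, bit_length() - 1 equals floor(log2(d)) exactly.
    return d.bit_length() - 1


def size_after_pooling(d: int, num_pools: int) -> int:
    """Side length after applying `num_pools` stride-2 pooling layers."""
    for _ in range(num_pools):
        d //= 2
    return d


mp = max_pooling_layers(32)
sizes = [size_after_pooling(32, p) for p in range(mp + 2)]
```

For a 32 × 32 input, `mp` evaluates to 5 and the side lengths are 32, 16, 8, 4, 2, 1 and finally 0, so the layer beyond the bound is the one that no longer has a valid input.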

Encoding gives GAs the ability to model real-world problems, so that
the problems can be solved by the GAs directly. The encoding is
achieved by the corresponding encoding strategy, which is the first
step of employing GAs. There is no unified encoding strategy that can
be used for all problems. In the proposed algorithm, we design a new
encoding strategy aiming at effectively modelling CNNs with different
architectures. For the used RBs, based on the configuration of
state-of-the-art CNNs [17], [42], we set the filter size of conv2 to
3×3, which is also used for the convolutional layers in the used DBs.
For the used pooling layers, we set the stride (i.e., the step size) to
2×2 based on the conventions, which means that a single pooling layer
in the evolved CNN halves the input dimension once. Consequently, the
unknown parameter settings for RBs are the spatial sizes of input and
output; those for DBs are the spatial sizes of input and output, as
well as k; and that for pooling layers is only their type, i.e., the
max or mean pooling type. Note that the number of convolutional layers
in a DB is known because it can be derived from the spatial sizes of
input and output as well as k. Accordingly, the proposed encoding
strategy is based on three different types of units and their positions
in the CNNs. The units are the RB Unit (RBU), the DB Unit (DBU) and the
Pooling layer Unit (PU). Specifically, an RBU and a DBU contain
multiple RBs and DBs, respectively, while a PU is composed of only a
single pooling layer. Our justifications are that: 1) by putting
multiple RBs or DBs into an RBU or a DBU, the depth of the CNN can be
changed significantly compared to stacking RBs or DBs one by one, which
will speed up the heuristic search of the proposed algorithm by easily
changing the depth of the CNN; and 2) one PU consisting of a single
pooling layer is more flexible than one consisting of multiple pooling
layers, because the effect of multiple consecutive pooling layers can
be achieved by stacking multiple PUs. In addition, we also add one
parameter to represent the unit type for the convenience of the
algorithm implementation. In summary, the encoded information for an
RBU is the type, the number of RBs, the input spatial size and the
output spatial size, which are denoted as type, amount, in and out,
respectively. On the other hand, the encoded information for a DBU is
the same as that of an RBU, plus the additional parameter k. Only one
parameter is needed in a PU for encoding the pooling type.
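One possible way to hold the encoded information, together with the random initialization of Algorithm 2, is sketched below. The class and function names, the field names `in_size` and `out_size` (standing for in and out), all value ranges, and the cap `max_len` on the random length are assumptions made for illustration; the sketch also leaves out the step that makes neighbouring units consistent with each other.

```python
import random
from dataclasses import dataclass
from typing import List, Union


@dataclass
class RBU:
    amount: int      # number of RBs in the unit
    in_size: int     # "in" in the text
    out_size: int    # "out" in the text
    type: int = 1


@dataclass
class DBU:
    amount: int      # number of DBs in the unit
    in_size: int
    out_size: int
    k: int
    type: int = 2


@dataclass
class PU:
    pooling: str     # "max" or "mean"
    type: int = 3


Unit = Union[RBU, DBU, PU]


def random_unit(kind: str, rng: random.Random) -> Unit:
    # Every value range below is an illustrative assumption.
    if kind == "PU":
        return PU(pooling=rng.choice(["max", "mean"]))
    amount = rng.randint(1, 4)
    in_size = rng.choice([32, 64, 128])
    out_size = rng.choice([32, 64, 128])
    if kind == "RBU":
        return RBU(amount, in_size, out_size)
    return DBU(amount, in_size, out_size, k=rng.choice([12, 20, 40]))


def init_population(n: int, d: int, rng: random.Random, max_len: int = 10) -> List[List[Unit]]:
    mp = d.bit_length() - 1                      # line 2: floor(log2(d))
    population = []
    for _ in range(n):                           # line 3
        length = rng.randint(1, max_len)         # line 4 (upper bound assumed)
        individual, used_pu = [], 0
        for _ in range(length):                  # line 6
            kind = rng.choice(["RBU", "DBU", "PU"])
            if kind == "PU" and used_pu >= mp:   # line 8: pooling budget used up
                kind = rng.choice(["RBU", "DBU"])
            if kind == "PU":
                used_pu += 1
            individual.append(random_unit(kind, rng))  # line 11
        population.append(individual)
    return population


population = init_population(n=5, d=32, rng=random.Random(0))
```

Individuals are plain Python lists, so their lengths can differ freely, which is what the variable-length encoding requires.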

Fig. 3. An example of the proposed encoding strategy.

Fig. 3 shows an example of the proposed encoding strategy for a CNN
containing nine units. Specifically, each number in the upper-left
corner of a block denotes the position of the unit in the CNN. The unit
is an RBU, a DBU or a PU if the type is 1, 2, or 3, respectively. Note
that the proposed encoding strategy does not constrain the maximal
length of each individual, which means that the proposed algorithm can
adaptively find the best CNN architecture with a proper depth through
the designed variable-length encoding strategy.

C. Fitness Evaluation

The ﬁtness of the individuals provides a quantitative mea-

surement indicating how well they adapt to the environment,

and is calculated based on the information these individuals

encode and the task at hand. In AE-CNN, the ﬁtness of

an individual is the classiﬁcation accuracy based on the

architecture encoded by the individual and the corresponding

validation data. According to the principle of evolutionary

algorithms, an individual with a higher ﬁtness has a higher

probability to generate an offspring hopefully with an even

higher fitness than itself. To evaluate the fitness, each individual in
AE-CNN is decoded to the corresponding CNN, to which a classifier is
then appended, and the whole network is trained like a common CNN.
Typically, the widely used classifier is logistic regression for binary
classification and softmax regression for multi-class classification.
As formulated by (1), in AE-CNN, the decoded CNN is trained on the
training data, and the fitness is the best classification accuracy on
the validation data after the CNN training.

Algorithm 3: Evaluate Fitness
Input: The population P_t for fitness evaluation, training data D_train, validation data D_valid.
Output: The population P_t with fitness.
1  for each individual in P_t do
2      cnn ← Transform the information encoded in individual to a CNN with the corresponding architecture;
3      Initialize the weights of cnn;
4      Train cnn on D_train;
5      acc ← Evaluate the classification accuracy of the trained cnn on D_valid;
6      Assign acc as the fitness of individual;
7  end
8  Return P_t.

The ﬁtness evaluation of the proposed algorithm is shown in

Algorithm 3, where each individual in the population is evalu-

ated in the same manner. Firstly, the architecture information

encoded in the individual is transformed to a CNN with the

corresponding architecture (line 2), which is an inverse of the

encoding strategy introduced in Subsection III-B. Secondly,
the CNN is initialized with weights (line 3) like a
hand-crafted CNN and then trained on the provided training
data (line 4). Note that the weight initialization method and the
training method are the Xavier initializer [43] and stochastic
gradient descent with momentum, respectively, which are
commonly used in the deep learning community. Thirdly, the

trained CNN is evaluated on the validation data (line 5),

and the evaluated classiﬁcation accuracy is considered as the

ﬁtness of the individual (line 6).
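The control flow of Algorithm 3 can be sketched independently of any deep learning framework by passing the expensive steps in as callables. Everything named here (`evaluate_fitness`, `build_cnn`, `train`, `validate`, and the dictionary layout of an individual) is a hypothetical interface, and the stubs in the usage lines stand in for real decoding, training and validation.

```python
from typing import Callable, Dict, List, Sequence


def evaluate_fitness(
    population: List[Dict],
    build_cnn: Callable[[Sequence], object],
    train: Callable[[object], None],
    validate: Callable[[object], float],
) -> List[Dict]:
    """Attach a validation accuracy to every individual, as in Algorithm 3."""
    for individual in population:                 # line 1
        cnn = build_cnn(individual["units"])      # lines 2-3: decode and initialize
        train(cnn)                                # line 4: train on the training data
        individual["fitness"] = validate(cnn)     # lines 5-6: accuracy becomes fitness
    return population                             # line 8


# Usage with stand-in stubs instead of a real network and dataset.
toy_population = [{"units": ["RBU", "PU"]}, {"units": ["DBU"]}]
evaluate_fitness(
    toy_population,
    build_cnn=list,
    train=lambda cnn: None,
    validate=lambda cnn: 1.0 / len(cnn),
)
```

Keeping training behind a callable makes it straightforward to swap in a real training routine or to cache results for architectures that have already been evaluated.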

D. Offspring Generation

In order to generate a population of offspring, parent individuals need
to be chosen in advance. Based on the principle of evolutionary
algorithms, the generated offspring are expected to have higher fitness
than their parents, through inheriting the quality traits from both
parents. From this viewpoint, the individuals having the best fitness
should be chosen as the parent individuals. However, adopting only the
best ones as the parents could easily cause a loss of diversity in the
population, which in turn leads to premature convergence [44], [45],
and as a result the best performance of the population cannot be
achieved [46], [47] due to being trapped in local minima [48], [49]. To
address this problem, a general approach is to select promising parents
with some randomness. In the proposed AE-CNN algorithm, the binary
tournament selection is used for this purpose [50], [51], based on the
conventions of the GA community. The binary tournament selection
randomly selects two individuals from the population, and the one with
the higher fitness is chosen as one parent individual. By repeating
this process, another parent individual is chosen, and then these two
parent individuals perform the crossover operation. Note that two
offspring are generated after each crossover operation, and N offspring
are generated in each generation, i.e., the crossover operation is
performed N/2 times during each generation, where N stands for the
population size.
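A minimal sketch of the binary tournament described above follows. The function names and the `(individual, fitness)` tuple layout are assumptions, as is the choice to draw the two contestants without replacement, which the text does not specify.

```python
import random
from typing import List, Tuple

Scored = Tuple[list, float]  # (individual, fitness)


def binary_tournament(population: List[Scored], rng: random.Random) -> Scored:
    """Draw two distinct individuals at random and keep the fitter one."""
    a, b = rng.sample(population, 2)
    return a if a[1] >= b[1] else b


def select_parents(population: List[Scored], rng: random.Random) -> Tuple[Scored, Scored]:
    """Run the tournament twice to obtain the two parents for one crossover."""
    return binary_tournament(population, rng), binary_tournament(population, rng)


rng = random.Random(0)
toy = [(["RBU"], 0.91), (["DBU", "PU"], 0.87), (["PU"], 0.42)]
parent1, parent2 = select_parents(toy, rng)
```

Under these assumptions the least fit individual can never win a tournament, while every other individual still has a chance of becoming a parent, which is how the operator trades selection pressure against diversity.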

In traditional GAs, the crossover operation is performed on two
individuals with the same length, which is biologically evident. Based
on the proposed encoding strategy, individuals in the proposed
algorithm have different lengths, i.e., the corresponding CNNs have
different depths. In this regard, the traditional crossover operator
cannot be used. However, the crossover operator provides the local
search ability of GAs, exploiting the search space for promising
performance. The performance of the final solution may deteriorate due
to the lack of the crossover operation in GAs. In the proposed
algorithm, we employ the one-point crossover operator. The reason is
that the one-point crossover has been widely used in Genetic
Programming (GP) [31]. GP is another important class of evolutionary
algorithms, and the individuals in GP commonly have different lengths.
Algorithm 4 shows the crossover operation in the proposed algorithm.

(a) Selected parent individuals

(b) Generated offspring

Fig. 4. The two selected parent individuals for the crossover operation (shown

in Fig. 4a) and the generated offspring (shown in Fig. 4b). The numbers in

each block denote the corresponding conﬁguration, and the red numbers in

Fig. 4b denote the necessary changes after the crossover operation.

Algorithm 4: Crossover Operation of AE-CNN
Input: Two parent individuals, p_1 and p_2, selected by the binary tournament selection, crossover probability µ.
Output: Two offspring.
1  r ← Uniformly generate a number from [0, 1];
2  if r < µ then
3      Randomly choose a position from p_1 and p_2, respectively;
4      Separate p_1 and p_2 based on the chosen positions;
5      q_1 ← Combine the first part of p_1 and the second part of p_2;
6      q_2 ← Combine the first part of p_2 and the second part of p_1;
7  else
8      q_1 ← p_1;
9      q_2 ← p_2;
10 end
11 Return q_1 and q_2.

Note that some necessary changes are automatically made to the generated offspring if required. For example, the in of the current unit should be equal to the out of the previous unit, and this change may trigger further cascading adjustments. For a better understanding of the crossover operation, an example is shown in Fig. 4, where Fig. 4a shows the two parent individuals. Supposing the separation positions of these two parent individuals are the third and fourth units, respectively, Fig. 4b shows the corresponding generated offspring; the red numbers indicate the changes needed after the crossover operation so that each offspring represents a valid CNN.
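A variable-length one-point crossover with the channel repair described above can be sketched as follows. The dictionary-based unit encoding, the `repair` helper and the restriction to parents with at least two units are assumptions made for this illustration; only the `in` field is repaired here, and any further cascading adjustments (for instance recomputing the `out` of a DBU) are omitted.

```python
import copy
import random


def repair(units, image_channels=3):
    """Set each unit's `in` to the previous unit's `out`; pooling keeps channels."""
    channels = image_channels
    for unit in units:
        if unit["type"] == "PU":
            continue  # a pooling unit does not change the channel count
        unit["in"] = channels
        channels = unit["out"]
    return units


def one_point_crossover(p1, p2, mu, rng):
    """One-point crossover for parents of different lengths (cf. Algorithm 4)."""
    if rng.random() >= mu:
        return copy.deepcopy(p1), copy.deepcopy(p2)
    # One cut per parent; both parts of each parent stay non-empty,
    # which assumes every parent has at least two units.
    c1 = rng.randint(1, len(p1) - 1)
    c2 = rng.randint(1, len(p2) - 1)
    q1 = copy.deepcopy(p1[:c1] + p2[c2:])  # first part of p1 + second part of p2
    q2 = copy.deepcopy(p2[:c2] + p1[c1:])  # first part of p2 + second part of p1
    return repair(q1), repair(q2)
```

Because the cut positions are drawn independently, the two offspring generally differ in length from their parents, while the total number of units is preserved.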

The mutation operation typically performs the global search in GAs, exploring the search space for promising performance. It is applied to each generated offspring with a predefined probability and draws on a set of allowed mutation types, which are designed based on the proposed encoding strategy. In the proposed algorithm, the available mutation types are:
• Adding (adding an RBU, adding a DBU, or adding a PU at the selected position);
• Removing (removing the unit at the selected position);
• Modifying (modifying the encoded information of the unit at the selected position).

The mutation operation in the proposed algorithm is detailed in Algorithm 5. Because all the generated offspring use the same routine for mutation, Algorithm 5 shows the process for only one offspring for simplicity. Note that an offspring is kept unchanged if it is not mutated. In addition, a series of necessary adjustments is automatically performed so that the offspring still composes a valid CNN, as highlighted for the crossover operation. For a better understanding of the mutation, an example of the “adding an RBU” mutation is shown in Fig. 5, where Fig. 5a shows the individual selected for the mutation and the randomly initialized RBU, and Fig. 5b shows the mutated individual. The red numbers in Fig. 5b again mark the necessary changes after the mutation has been performed. In the proposed crossover and mutation operations, all these necessary changes are made automatically.

Algorithm 5: Mutation Operation of AE-CNN
Input: The offspring q1; the mutation probability ν.
Output: The mutated offspring.
r ← Uniformly generate a number from [0, 1];
if r < ν then
    Randomly choose a position from q1;
    type ← Randomly select one from {Adding, Removing, Modifying};
    if type is Adding then
        mu ← Randomly select one from {adding an RBU, adding a DBU, adding a PU};
    else if type is Removing then
        mu ← removing a unit;
    else
        mu ← modifying the encoded information;
    end
    Perform mu at the chosen position;
end
Return q1.
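The mutation routine can be sketched in the same spirit. The unit encoding, the `random_unit` initializer with its illustrative output-channel choices, and the rule of never removing the last remaining unit are assumptions of this example, not details taken from the paper; "Modifying" is approximated by re-drawing the encoded information of a unit of the same type.

```python
import copy
import random


def random_unit(kind, rng):
    """Hypothetical initializer; the value ranges are illustrative only."""
    if kind == "PU":
        return {"type": "PU", "pool": rng.choice(["max", "mean"])}
    unit = {"type": kind, "amount": rng.randint(3, 10),
            "in": 0, "out": rng.choice([64, 128, 256])}
    if kind == "DBU":
        unit["k"] = rng.choice([12, 20, 40])
    return unit


def repair(units, image_channels=3):
    """Set each unit's `in` to the previous unit's `out`; pooling keeps channels."""
    channels = image_channels
    for unit in units:
        if unit["type"] != "PU":
            unit["in"] = channels
            channels = unit["out"]
    return units


def mutate(q, nu, rng):
    """Mutate one offspring with probability nu (cf. Algorithm 5)."""
    if rng.random() >= nu:
        return q  # not mutated: the offspring is kept as it is
    q = copy.deepcopy(q)
    pos = rng.randrange(len(q))
    kind = rng.choice(["Adding", "Removing", "Modifying"])
    if kind == "Adding":
        q.insert(pos, random_unit(rng.choice(["RBU", "DBU", "PU"]), rng))
    elif kind == "Removing":
        if len(q) > 1:  # assumption: never remove the last remaining unit
            del q[pos]
    else:
        q[pos] = random_unit(q[pos]["type"], rng)  # re-draw the encoded information
    return repair(q)
```

The final `repair` call plays the role of the automatic adjustments marked in red in Fig. 5b.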

Fig. 5. An example of the “adding an RBU” mutation. (a) The selected individual for mutation and the randomly initialized RBU for the corresponding mutation. (b) Mutated individual. Specifically, the first row and the second row in Fig. 5a denote the individual selected for the mutation and the randomly initialized RBU, respectively; the RBU is added at the fourth position of the individual to be mutated. Fig. 5b shows the mutated individual, and the red numbers denote the necessary changes after the mutation.

E. Environmental Selection

In the environmental selection, a population of N individuals is selected from the current population, i.e., P_t ∪ Q_t, to serve as the parent individuals for the next generation. Theoretically, a good population exhibits both convergence and diversity [30], which prevents the search from being trapped in local minima [48], [49] and from premature convergence [44], [45]. In practice, the parent individuals should comprise individuals with the best fitness for convergence, and individuals whose fitness values differ significantly from each other for diversity. To this end, we purposely select the individual with the best fitness, along with N − 1 individuals selected by binary tournament selection [50], [51], as the parent individuals to generate offspring for the new population. Explicitly selecting the best individual as a parent for the next generation implements the “elitism” mechanism [52] in GAs, which prevents the performance of the population from degrading as the evolution progresses.

Algorithm 6: Environmental Selection
Input: The population P_t, the generated offspring population Q_t, the population size N.
Output: The population P_{t+1} surviving into the next generation.
1  P_{t+1} ← ∅;
2  for j ← 1 to N do
3      p1, p2 ← Randomly select two individuals from P_t ∪ Q_t;
4      p ← Select the one with the higher fitness from {p1, p2};
5      P_{t+1} ← P_{t+1} ∪ {p};
6  end
7  p_best ← Select the one with the highest fitness from P_t ∪ Q_t;
8  if p_best is not in P_{t+1} then
9      Randomly select one from P_{t+1} and then replace it by p_best;
10 end
11 Return P_{t+1}.

Algorithm 6 shows the details of the environmental selection in the proposed algorithm. Specifically, given the current population P_t and the generated offspring population Q_t, N individuals are selected by binary tournament selection (lines 2-6). After that, the best individual p_best (i.e., the individual with the highest fitness) is selected from P_t ∪ Q_t (line 7), and we check whether p_best has been selected into P_{t+1}. If it has not, a randomly selected individual in P_{t+1} is replaced by p_best (lines 8-10). Note that the offspring in Q_t must have had their fitness evaluated prior to the environmental selection, because binary tournament selection operates on fitness.
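The selection procedure of Algorithm 6 can be sketched as follows. The function name and the representation of the merged population as a list with a parallel list of fitness values are assumptions made for the illustration.

```python
import random


def environmental_selection(pool, fitness, n, rng):
    """Select n survivors from the merged population, keeping the elite.

    `pool` is the merged population (parents plus offspring) as a list and
    `fitness[i]` is the already evaluated fitness of `pool[i]`.
    """
    chosen = []
    for _ in range(n):
        a, b = rng.sample(range(len(pool)), 2)  # binary tournament
        chosen.append(a if fitness[a] >= fitness[b] else b)
    best = max(range(len(pool)), key=fitness.__getitem__)
    if best not in chosen:
        chosen[rng.randrange(n)] = best  # elitism: overwrite a random survivor
    return [pool[i] for i in chosen]
```

Whatever the tournaments return, the individual with the highest fitness is always present in the result, which is the elitism property the text relies on.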

IV. EXPERIMENT DESIGN

The experiment is designed to verify whether the proposed automatic CNN architecture design algorithm achieves promising performance on image classification tasks. In this section, we first introduce the chosen peer competitors (Subsection IV-A) against which the proposed algorithm is compared, and then describe the adopted benchmark datasets (Subsection IV-B) and the parameter settings (Subsection IV-C).

A. Peer Competitors

In order to demonstrate the superiority of the proposed

algorithm, various peer competitors are chosen to perform the

comparison. Particularly, the chosen peer competitors can be

divided into three different categories.

The first includes the state-of-the-art CNNs whose architectures are hand-crafted with extensive domain expertise: DenseNet [18], ResNet [17], Maxout [53], VGG [54], Network in Network [55], Highway Network [56], All-CNN [57] and FractalNet [58]. In addition, considering the promising performance of ResNet, we use two different versions in the experiment: ResNet with 101 layers and ResNet with 1,202 layers, labelled ResNet (depth=101) and ResNet (depth=1,202), respectively. Owing to their promising performance, most peer competitors in this category have won the large-scale vision challenge [59] in recent years. These state-of-the-art CNNs are chosen to verify whether the proposed automatic CNN architecture design algorithm can perform competitively against hand-crafted CNNs. The second covers the semi-automatic CNN architecture design algorithms, including Genetic CNN [19], Hierarchical Evolution [20], EAS [21], and Block-QNN-S [22]. The third refers to Large-scale Evolution [23], CGP-CNN [24], NAS [25], and MetaQNN [26], which design CNN architectures in a completely automatic way.

B. Benchmark Datasets

Fig. 6. Randomly selected examples from three categories of CIFAR10 (Fig. 6a: horse, ship and truck) and three categories of CIFAR100 (Fig. 6b: rose, squirrel and tank); each category has 10 examples.

CNNs are typically compared by their classification performance on image classification tasks. For the state-of-the-art CNNs, the most commonly used image classification benchmark datasets are CIFAR10 and CIFAR100 [29], whereas the CNN architecture design algorithms are widely evaluated on CIFAR10 only, because CIFAR100 is much more challenging owing to its larger number of classes. Because the adopted peer competitors cover both the state-of-the-art CNNs and the architecture design algorithms, both CIFAR10 and CIFAR100 are chosen as the benchmark datasets in the experiment.

CIFAR10 and CIFAR100 are two widely used image classification benchmark datasets for recognizing natural objects, such as birds, boats and airplanes. Each set has 50,000 training images and 10,000 test images. The difference between them is that CIFAR10 is a 10-class classification task while CIFAR100 is a 100-class task. Within each benchmark, every class has the same number of training images: each category of CIFAR10 has 5,000 training images, while each category of CIFAR100 has 500.

Fig. 6 illustrates images from each benchmark for reference, where the images in each row come from the same class, and the words in the left column give the corresponding class names. As can be seen from Fig. 6, the objects to be recognized differ in resolution from image to image, blend into the background and occupy different positions, all of which increases the difficulty of correctly recognizing them. Following the conventions of the chosen peer competitors [17]–[26], each CIFAR10 and CIFAR100 image is augmented by zero-padding four pixels on each side, randomly cropping the result back to the original size, and then applying a random horizontal flip, before being input to the proposed algorithm.
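The augmentation pipeline can be illustrated with a dependency-free sketch on a single-channel image stored as a list of rows. The function name and the flip probability of 0.5 are assumptions; an actual implementation would normally apply the same steps to all three colour channels through a tensor library.

```python
import random


def augment(image, rng, pad=4):
    """Zero-pad, randomly crop back to the original size, then randomly flip.

    `image` is an H x W list of rows holding one channel.
    """
    h, w = len(image), len(image[0])
    zero_row = [0] * (w + 2 * pad)
    padded = ([zero_row[:] for _ in range(pad)]
              + [[0] * pad + list(row) + [0] * pad for row in image]
              + [zero_row[:] for _ in range(pad)])
    top = rng.randint(0, 2 * pad)   # crop offsets inside the padded image
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    if rng.random() < 0.5:          # assumed flip probability
        crop = [row[::-1] for row in crop]
    return crop
```

For a 32 × 32 CIFAR image, the padded image is 40 × 40, so the crop offset ranges over nine positions per axis.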

C. Parameter Settings

In the comparison, we take the results of the peer competitors from their seminal papers rather than running them ourselves, because the reported results are usually their best. In doing so, there is no need to set the parameters of the peer competitors. For the proposed algorithm, we follow the principle that all the parameters are set to their commonly used values, which lowers the barrier for researchers who would like to use the proposed algorithm to find the best CNN architectures for their own data, even if they have no expertise in GAs. In particular, the population size and the maximal generation number are both set to 20, and the probabilities of crossover and mutation are set to 0.9 and 0.2, respectively. Following the conventions of the machine learning community, the validation data is randomly split from the training data with a proportion of 1/5. Finally, all the classification error rates are evaluated on the same test data for the comparison.
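As an illustration of the validation split, the sketch below holds out one fifth of the training indices at random. It assumes that the proportion of 1/5 refers to the share of the training data used for validation; the function name and the index-based interface are likewise assumptions.

```python
import random


def split_train_validation(indices, rng, parts=5):
    """Randomly hold out 1/parts of the indices as validation data."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_val = len(shuffled) // parts
    return shuffled[n_val:], shuffled[:n_val]
```

Under this reading, the 50,000 training images of either benchmark are divided into 40,000 images for training and 10,000 for validation during the fitness evaluation.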

In evaluating the fitness, each individual is trained by Stochastic Gradient Descent (SGD) with a batch size of 128. The parameter settings for SGD also follow the conventions of the peer competitors. Specifically, the momentum is set to 0.9. The learning rate is initialized to 0.01, warmed up to 0.1 from the second to the 150-th epoch, and divided by 10 at the 250-th epoch. The weight decay is set to 5 × 10^-4. In addition, the fitness of an individual is set to zero if it runs out of memory during training. When the evolutionary process terminates, the best individual is retrained on the original training data with the same SGD settings, and its error rate on the test data is reported for the comparison. Considering the heuristic nature of the proposed algorithm as well as its expensive computational cost, the best individual is trained in five independent runs. Because all the peer competitors chosen for the comparison report only their best results, no matter how many runs they performed, the best result of the proposed algorithm among the five independent runs is presented here for a fair comparison.
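One possible reading of the learning-rate schedule is sketched below. The text does not state the rate used between the 151st and the 249th epoch; the sketch assumes that it returns to the initial value of 0.01 and is then divided by 10 from the 250th epoch onwards. The function name and the 1-based epoch index are also assumptions.

```python
def learning_rate(epoch):
    """Learning rate for a 1-based epoch index under the assumed schedule."""
    if epoch == 1:
        return 0.01   # initial learning rate
    if epoch <= 150:
        return 0.1    # warm-up setting, second to 150th epoch
    if epoch < 250:
        return 0.01   # assumption: back to the initial rate
    return 0.001      # divided by 10 from the 250th epoch
```

Such a function can be queried once per epoch to set the learning rate of the SGD optimizer before training on that epoch's batches.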

In addition, the available choices of k in a DB are 12, 20 and 40 based on the design of DenseNet, and the maximal number of convolutional layers in a DB is specified as 10 (when k = 12 and k = 20) and 5 (when k = 40). The maximal numbers of RBUs and DBUs in a CNN are both set to 4. The numbers of DBs in a DBU and of RBs in an RBU both range from 3 to 10. Note that these settings are mainly determined by our available computational resources, because larger values easily cause out-of-memory failures. Users whose computational platform is equipped with more powerful GPUs can set these numbers to larger values. The proposed algorithm is run on three Nvidia GeForce GTX 1080 Ti GPU cards, and the code is implemented on a GPU-based parallel framework designed in our previous work and written in PyTorch [60]. The code is available at: https://gitlab.ecs.vuw.ac.nz/yanan/ea-cnn.

TABLE I
COMPARISONS BETWEEN THE PROPOSED ALGORITHM AND THE STATE-OF-THE-ART PEER COMPETITORS IN TERMS OF THE CLASSIFICATION ERROR (%), NUMBER OF PARAMETERS AND THE CONSUMED GPU DAYS ON THE CIFAR10 AND CIFAR100 BENCHMARK DATASETS.

-------------------------------------------------------------------------------------------------------------
Method                        CIFAR10  CIFAR100  # of Parameters  GPU Days  Category
-------------------------------------------------------------------------------------------------------------
DenseNet (k=12) [18]          5.24     24.42     1.0M             –         hand-crafted architecture
ResNet (depth=101) [17]       6.43     25.16     1.7M             –         hand-crafted architecture
ResNet (depth=1,202) [17]     7.93     27.82     10.2M            –         hand-crafted architecture
Maxout [53]                   9.3      38.6      –                –         hand-crafted architecture
VGG [54]                      6.66     28.05     20.04M           –         hand-crafted architecture
Network in Network [55]       8.81     35.68     –                –         hand-crafted architecture
Highway Network [56]          7.72     32.39     –                –         hand-crafted architecture
All-CNN [57]                  7.25     33.71     –                –         hand-crafted architecture
FractalNet [58]               5.22     22.3      38.6M            –         hand-crafted architecture
-------------------------------------------------------------------------------------------------------------
Genetic CNN [19]              7.1      29.05     –                17        semi-automatic algorithm
Hierarchical Evolution [20]   3.63     –         –                300       semi-automatic algorithm
EAS [21]                      4.23     –         23.4M            10        semi-automatic algorithm
Block-QNN-S [22]              4.38     20.65     6.1M             90        semi-automatic algorithm
-------------------------------------------------------------------------------------------------------------
Large-scale Evolution [23]    5.4      –         5.4M             2,750     completely automatic algorithm
Large-scale Evolution [23]    –        23        40.4M            2,750     completely automatic algorithm
CGP-CNN [24]                  5.98     –         2.64M            27        completely automatic algorithm
NAS [25]                      6.01     –         2.5M             22,400    completely automatic algorithm
MetaQNN [26]                  6.92     27.14     –                100       completely automatic algorithm
-------------------------------------------------------------------------------------------------------------
AE-CNN                        4.3      –         2.0M             27        completely automatic algorithm
AE-CNN                        –        20.85     5.4M             36        completely automatic algorithm
-------------------------------------------------------------------------------------------------------------

V. EXPERIMENTAL RESULTS

In the experiments, we investigate the performance of the proposed algorithm in terms of not only the classification error, but also the number of parameters and the computational complexity, for a comprehensive comparison with the chosen peer competitors (Subsection V-A). Because it is hard to analyze the computational complexity of each peer competitor theoretically, the consumed “GPU Days” is used as an indicator of the computational complexity. Specifically, the number of GPU Days is calculated by multiplying the number of employed GPU cards by the number of days the algorithm ran to find the best architecture. For example, the proposed algorithm ran for nine days on three GPU cards for the CIFAR10 dataset, so the corresponding number of GPU Days is 9 × 3 = 27. Naturally, no “GPU Days” data exist for the hand-crafted state-of-the-art CNNs. In addition, we provide the evolutionary trajectories of the proposed algorithm in finding the best architectures on the chosen benchmark datasets, which help the readers see whether the proposed algorithm converges with the adopted parameter settings (Subsection V-B). Finally, the best architectures found are provided in Subsection V-C, which may offer useful knowledge to researchers hand-crafting CNN architectures.
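The GPU Days indicator is a simple product, as the short sketch below shows. The function names are illustrative, and the inverse calculation assumes that a run used the same number of cards throughout.

```python
def gpu_days(num_gpu_cards, days_running):
    """GPU Days = number of GPU cards employed x days the search ran."""
    return num_gpu_cards * days_running


def days_running(total_gpu_days, num_gpu_cards):
    """Wall-clock days implied by a GPU Days figure on a fixed number of cards."""
    return total_gpu_days / num_gpu_cards
```

For the CIFAR10 example above, `gpu_days(3, 9)` gives 27. If the 36 GPU Days listed for CIFAR100 in Table I were also obtained on three cards, the search would have taken `days_running(36, 3)`, i.e., 12 days.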

A. Performance Overview

Table I shows the experimental results of the proposed algorithm as well as the chosen peer competitors. To make the comparisons convenient, Table I is divided into five “rows” by six horizontal lines. The first row gives the title of each column; the second, third and fourth rows refer to the state-of-the-art peer competitors whose architectures are manually designed, the semi-automatic CNN architecture design algorithms, and the automatic CNN architecture design algorithms, respectively. The fifth row shows the results of the proposed algorithm, which is an automatic algorithm for designing CNN architectures. In addition, the symbol “–” in Table I indicates that no result has been publicly reported by the corresponding peer competitor.

As shown in Table I, AE-CNN outperforms all the manually designed state-of-the-art peer competitors on CIFAR10. Specifically, AE-CNN achieves a classification error that is approximately 1.0% lower than DenseNet (k=12) and FractalNet, at least 2.1% lower than ResNet (depth=101), VGG and All-CNN, approximately 3.5% lower than ResNet (depth=1,202) and Highway Network, and 4.5% to 5.0% lower than Network in Network and Maxout. On CIFAR100, AE-CNN shows a significantly lower classification error than Maxout, Network in Network, Highway Network and All-CNN, a lower classification error than DenseNet (k=12), ResNet (depth=101), ResNet (depth=1,202) and VGG, and a similar but still lower classification error than FractalNet. The numbers of parameters of the CNNs evolved by AE-CNN on both CIFAR10 and CIFAR100 are larger than those of DenseNet (k=12) and ResNet (depth=101), but much smaller than those of ResNet (depth=1,202), VGG and FractalNet.

Compared with the semi-automatic peer competitors, AE-CNN performs much better than Genetic CNN on both CIFAR10 and CIFAR100. Although Hierarchical Evolution shows better performance than AE-CNN on CIFAR10, AE-CNN consumes only about 1/10 of the GPU Days consumed by Hierarchical Evolution on CIFAR10. Block-QNN-S performs slightly worse on CIFAR10 but slightly better on CIFAR100 than AE-CNN, while AE-CNN consumes about 1/3 of the GPU Days consumed by Block-QNN-S, and the best CNN found by AE-CNN also has fewer parameters than that of Block-QNN-S. In addition, EAS and AE-CNN achieve nearly the same classification error on CIFAR10, while the best CNN evolved by AE-CNN has only 2.0M parameters, about 1/11 of that of EAS. In summary, compared with the semi-automatic peer competitors, AE-CNN shows competitive performance with significantly fewer parameters. It is important to note that domain expertise is still required when using the algorithms from this category. For example, EAS consumes only 10 GPU Days to find its best CNN on CIFAR10, but it starts from a base CNN that is known to perform fairly well. Therefore, the comparison in terms of the consumed GPU Days is not fair to the proposed AE-CNN algorithm, which is completely automatic and uses no human expertise or extra resources.

Among the automatic peer competitors, AE-CNN shows the best performance on both the CIFAR10 and the CIFAR100 datasets in terms of the classification error, the number of parameters and the consumed GPU Days. Specifically, AE-CNN achieves a 4.3% classification error on CIFAR10, while the best and worst classification errors of the peer competitors are 5.4% and 6.92%, respectively. AE-CNN also shows a lower classification error than MetaQNN on both datasets. On CIFAR100, AE-CNN shows a 2.15% lower classification error than Large-scale Evolution, and has 5.4M parameters, far fewer than the 40.4M of Large-scale Evolution. Furthermore, AE-CNN consumes far fewer GPU Days than Large-scale Evolution, NAS and MetaQNN on both CIFAR10 and CIFAR100. The comparison shows that the proposed algorithm achieves the best performance among the automatic peer competitors, the category to which it belongs.
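The figures quoted for the automatic category can be recomputed directly from Table I. In the sketch below, the dictionaries copy the classification errors and GPU Days of the automatic algorithms from Table I; the helper names are illustrative.

```python
# (classification error in %, GPU Days), copied from Table I.
CIFAR10 = {
    "Large-scale Evolution": (5.4, 2750),
    "CGP-CNN": (5.98, 27),
    "NAS": (6.01, 22400),
    "MetaQNN": (6.92, 100),
    "AE-CNN": (4.3, 27),
}
CIFAR100 = {
    "Large-scale Evolution": (23.0, 2750),
    "MetaQNN": (27.14, 100),
    "AE-CNN": (20.85, 36),
}


def error_gap(results, rival, ours="AE-CNN"):
    """Percentage points by which `ours` undercuts `rival`."""
    return round(results[rival][0] - results[ours][0], 2)


def gpu_day_ratio(results, rival, ours="AE-CNN"):
    """How many times more GPU Days `rival` consumed than `ours`."""
    return results[rival][1] / results[ours][1]
```

For instance, `error_gap(CIFAR100, "Large-scale Evolution")` reproduces the 2.15% gap stated above, while `gpu_day_ratio(CIFAR10, "CGP-CNN")` equals 1.0, since both algorithms are listed with 27 GPU Days on CIFAR10.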

The reasons why AE-CNN outperforms Large-scale Evolution, CGP-CNN, NAS and MetaQNN can be explained as follows. Firstly, Large-scale Evolution does not apply the crossover operator, which provides the local search ability; omitting it from a GA-based design consequently degrades its performance. Secondly, CGP-CNN employs a fixed-length encoding strategy to design the best CNN architecture. To make this encoding strategy work, CGP-CNN must predefine a maximal length of CNNs during the architecture design. As can be seen from [24], the predefined maximal length of CGP-CNN is smaller than the length of the best architecture identified by AE-CNN. Thirdly, NAS and MetaQNN are designed based on reinforcement learning. Because no fitness value is computed when reinforcement learning methods are used, they often consume more computational resources than GAs do for the same performance [7]. As expected, NAS and MetaQNN perform worse than AE-CNN given the available computational resources.

B. Evolution Trajectory

When evolutionary algorithms are used to address real-world problems, we usually want to know whether they have converged when they terminate. A good way to observe this is to plot the evolutionary trajectories. In this subsection, the evolutionary trajectories of the proposed algorithm on the investigated benchmark datasets are provided and analyzed. To this end, we first collect the classification accuracy of each individual in every generation, and then plot the statistical results.
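The per-generation statistics behind such a trajectory plot can be computed as sketched below. The function name and the nested-list input are assumptions; the three returned values correspond to the worst, mean and best classification accuracy of a generation.

```python
def trajectory(accuracies_per_generation):
    """Return (worst, mean, best) accuracy for every generation.

    `accuracies_per_generation[g]` lists the classification accuracy of each
    individual in generation g; out-of-memory individuals contribute 0.
    """
    return [(min(gen), sum(gen) / len(gen), max(gen))
            for gen in accuracies_per_generation]
```

Plotting the mean as a line and filling the area between the worst and best values yields a figure of the kind described in this subsection.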

Fig. 7. Evolution trajectories of the proposed algorithm on CIFAR10 (Fig. 7a) and CIFAR100 (Fig. 7b).

The evolutionary trajectories of the proposed algorithm are shown in Fig. 7, where Figs. 7a and 7b show those on CIFAR10 and CIFAR100, respectively. In Fig. 7, the horizontal axis denotes the generation number, and the vertical axis denotes the classification accuracy; the red line denotes the mean classification accuracy of the individuals in the same generation, while the light-green area is bounded by the best and worst classification accuracy of the individuals in each generation.

As can be seen from Fig. 7a, the mean classification accuracy increases sharply from the 1st generation to the 3rd generation, and then improves steadily as the evolution proceeds until the 14th generation; from then on, the mean classification accuracy increases significantly from about 75% to about 95%, and finally the proposed algorithm converges when it terminates. As can be seen from the lower boundary of the light-green area, the worst classification accuracy in the first two generations is zero, because some randomly initialized architectures cannot run on the employed GPUs owing to the out-of-memory problem. From the 3rd generation, the individuals with out-of-memory architectures are eliminated from the population because of their uncompetitive fitness, and the worst classification accuracy improves steadily until the algorithm terminates, with an exception at the 4th generation. As can be seen from the upper boundary of the light-green area, the best classification accuracy improves at almost the same pace as the mean classification accuracy as the evolutionary process continues. In addition, the difference between the best and the worst classification accuracy becomes smaller, which implies that the population converges to a steady state.

A similar situation can be seen in Fig. 7b. Specifically, the mean classification accuracy increases from about 30% to about 45% from the 1st to the 4th generation, although there is a slight drop at the 3rd generation; from the 4th generation, it keeps improving until the 14th generation, and then increases from about 50% at the 14th generation to about 79% at the 17th generation; after that, the mean classification accuracy remains stable until the evolutionary process terminates. During the first three generations, the worst classification accuracy stays at zero because of the randomly initialized out-of-memory individuals; from the 4th generation, the worst classification accuracy improves until the 20th generation, with exceptions at the 10th and 15th generations. The best classification accuracy improves with almost the same trend as the mean classification accuracy, and also reaches its converged performance from the 17th generation.

A common trend in Figs. 7a and 7b is that the best classification accuracy (i.e., the upper boundaries of the light-green areas) never degrades, which is achieved through the elitism detailed in Subsection III-E, i.e., the individual with the best fitness is unconditionally kept into the next generation. In summary, the proposed algorithm converges under the default GA parameter settings, which helps users employ the proposed algorithm to find the best CNN architectures for their own data, even if they have no expertise in GAs. Nevertheless, the maximal generation number and the population size can be set to larger values if more computational resources are available.

C. Designed CNN Architectures

In this subsection, the best CNN architectures found by the

proposed algorithm on CIFAR10 and CIFAR100 are provided

in Tables II and III, respectively.

TABLE II
THE INFORMATION OF THE BEST ARCHITECTURE FOUND ON CIFAR10.

id  type  configuration
1   RBU   amount=8, in=3, out=64
2   PU    mean pooling
3   RBU   amount=5, in=64, out=128
4   PU    mean pooling
5   RBU   amount=7, in=128, out=64
6   DBU   amount=7, in=64, out=204, k=20
7   DBU   amount=7, in=204, out=204, k=20
8   PU    mean pooling
9   PU    max pooling

TABLE III
THE INFORMATION OF THE BEST ARCHITECTURE FOUND ON CIFAR100.

id  type  configuration
1   DBU   amount=10, in=3, out=203, k=20
2   PU    max pooling
3   PU    mean pooling
4   RBU   amount=7, in=203, out=256
5   PU    mean pooling
6   PU    mean pooling

As can be seen from Tables II and III, the best architecture on CIFAR10 is composed of nine units designed with the encoding strategy proposed in Subsection III-B, and has 38 layers in total, consisting of 34 convolutional layers and four pooling layers; the best architecture on CIFAR100 is composed of six units with 21 layers in total, containing 17 convolutional layers and four pooling layers.
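The layer totals quoted above coincide with summing the amount values of the RBUs and DBUs in Tables II and III and counting one layer per PU. The sketch below reproduces this count; the tuple encoding and the function name are illustrative, and treating each block as one convolutional layer is an assumption inferred from the reported totals.

```python
# (unit type, amount) per unit, copied from Tables II and III; PUs have no amount.
CIFAR10_BEST = [("RBU", 8), ("PU", None), ("RBU", 5), ("PU", None), ("RBU", 7),
                ("DBU", 7), ("DBU", 7), ("PU", None), ("PU", None)]
CIFAR100_BEST = [("DBU", 10), ("PU", None), ("PU", None), ("RBU", 7),
                 ("PU", None), ("PU", None)]


def count_layers(units):
    """Return (convolutional, pooling, total) layer counts for an architecture."""
    conv = sum(amount for kind, amount in units if kind != "PU")
    pool = sum(1 for kind, _ in units if kind == "PU")
    return conv, pool, conv + pool
```

Applied to the two encodings, the function yields 34 + 4 = 38 layers for CIFAR10 and 17 + 4 = 21 layers for CIFAR100.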

Compared to the state-of-the-art CNNs that are built solely on DenseNet blocks or ResNet blocks, the automatically found architectures, which are based on both kinds of blocks, are much simpler and perform much better. This may serve as a priori knowledge for hand-crafting CNN architectures: combining different kinds of blocks may be more effective. In addition, CIFAR100 is commonly viewed as a more complex benchmark than CIFAR10, and researchers usually consider CNN architectures with more layers for CIFAR100 than for CIFAR10. However, based on the architectures shown in Tables II and III, the best architecture for CIFAR100 surprisingly has fewer layers than that for CIFAR10. In this respect, finding the best architecture through evolutionary search may also provide useful domain expertise that runs contrary to common sense.

VI. CONCLUSIONS AND FUTURE WORK

The goal of this study is to develop a GA-based CNN architecture design algorithm that is capable of designing/searching/learning/evolving the best CNN architecture for a given task in a completely automatic manner and with limited computational resources. This goal has been successfully achieved by the proposed encoding strategy, which is built on the state-of-the-art blocks with a variable-length representation, together with a crossover operator for variable-length individuals and the corresponding mutation operators. Building upon the blocks speeds up the CNN architecture design. The variable-length representation allows the depth of a CNN to evolve adaptively for tasks of different complexity. The presented crossover operator and the designed mutation operators provide the proposed algorithm with effective local search and global search abilities, which in turn help it find the best CNN architectures. The proposed algorithm is examined on the CIFAR10 and CIFAR100 image classification datasets against nine manually designed state-of-the-art CNNs, four peer competitors that design CNN architectures semi-automatically, and five peer competitors that design CNN architectures completely automatically. The results show that the proposed algorithm outperforms all the hand-crafted state-of-the-art CNNs and all the peer competitors from the automatic category in terms of the classification error rate. In addition, the proposed algorithm consumes far fewer GPU Days than most peer competitors in the same category. Furthermore, the proposed algorithm shows competitive performance against the semi-automatic peer competitors. Our future work will focus on effectively speeding up the fitness evaluation.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,

no. 7553, pp. 436–444, 2015.

[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classiﬁcation

with deep convolutional neural networks,” in Advances in Neural Infor-

mation Processing Systems 25, Lake Tahoe, Nevada, USA., 2012, pp.

1097–1105.

[3] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran,

“Deep convolutional neural networks for LVCSR,” in Proceeding of

2013 IEEE International Conference on Acoustics, Speech and Signal

Processing. Vancouver, Canada: IEEE, 2013, pp. 8614–8618.

[4] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning

with neural networks,” in Advances in Neural Information Processing

Systems 27, Montreal, Canada, 2014, pp. 3104–3112.

[5] C. Clark and A. Storkey, “Training deep convolutional neural networks

to play go,” in 32nd International Conference on Machine Learning

(ICML 2015), Lille, France, 2015, pp. 1766–1774.

[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning

applied to document recognition,” Proceedings of the IEEE, vol. 86,

no. 11, pp. 2278–2324, 1998.

[7] Y. Sun, B. Xue, and M. Zhang, “Evolving deep convolutional neural

networks for image classiﬁcation,” arXiv preprint arXiv:1710.10741,

2017.

[8] Y. Sun, G. G. Yen, and Z. Yi, “Evolving unsupervised deep neural net-

works for learning meaningful representations,” IEEE Transactions on

Evolutionary Computation, 2018, DOI:10.1109/TEVC.2018.2808689.

[9] M. Dorigo, V. Maniezzo, and A. Colorni, “Ant system: optimization by

a colony of cooperating agents,” IEEE Transactions on Systems, Man,

and Cybernetics, Part B (Cybernetics), vol. 26, no. 1, pp. 29–41, 1996.

[10] J. Bergstra and Y. Bengio, “Random search for hyper-parameter opti-

mization,” Journal of Machine Learning Research, vol. 13, no. Feb, pp.

281–305, 2012.

[11] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine

learning. MIT press Cambridge, 2006, vol. 1.

[12] J. Močkus, “On bayesian methods for seeking the extremum,” in Optimization Techniques IFIP Technical Conference. Springer, 1975, pp. 400–404.

[13] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” in Advances in Neural Information Processing Systems 24, Granada, Spain, 2011, pp. 2546–2554.

[14] F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based

optimization for general algorithm configuration,” LION, vol. 5, pp. 507–

523, 2011.

[15] K. O. Stanley and R. Miikkulainen, “Evolving neural networks through

augmenting topologies,” Evolutionary Computation, vol. 10, no. 2, pp.

99–127, 2002.

[16] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “An experimental study on

hyper-parameter optimization for stacked auto-encoders,” in Proceedings

of 2018 IEEE Congress on Evolutionary Computation. IEEE, 2018, pp.

1–8.

[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image

recognition,” in Proceedings of 2016 IEEE Conference on Computer

Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770–

778.

[18] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely

connected convolutional networks,” in Proceedings of 2017 IEEE Con-

ference on Computer Vision and Pattern Recognition, Honolulu, HI,

USA, 2017, pp. 2261–2269.

[19] L. Xie and A. Yuille, “Genetic CNN,” in Proceedings of 2017 IEEE

International Conference on Computer Vision, Venice, Italy, 2017, pp.

1388–1397.

[20] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu,

“Hierarchical representations for efficient architecture search,” in

Proceedings of 2018 Machine Learning Research, Stockholm, Sweden,

2018.

[21] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang, “Efficient architecture

search by network transformation,” in Proceedings of the 2018 AAAI

Conference on Artificial Intelligence, Louisiana, USA, 2018.

[22] Z. Zhong, J. Yan, and C.-L. Liu, “Practical network blocks design with

q-learning,” in Proceedings of the 2018 AAAI Conference on Artificial

Intelligence, Louisiana, USA, 2018.

[23] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and

A. Kurakin, “Large-scale evolution of image classifiers,” in Proceedings

of 2017 Machine Learning Research, Sydney, Australia, 2017, pp. 2902–

2911.

[24] M. Suganuma, S. Shirakawa, and T. Nagao, “A genetic programming

approach to designing convolutional neural network architectures,” in

Proceedings of the 2017 Genetic and Evolutionary Computation Con-

ference. Berlin, Germany: ACM, 2017, pp. 497–504.

[25] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement

learning,” in Proceedings of the 2017 International Conference on

Learning Representations, Toulon, France, 2017.

[26] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network

architectures using reinforcement learning,” in Proceedings of the 2017

International Conference on Learning Representations, Toulon, France,

2017.

[27] T. Back, Evolutionary Algorithms in Theory and Practice: Evolution

Strategies, Evolutionary Programming, Genetic Algorithms. England,

UK: Oxford university press, 1996.

[28] R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction.

MIT press Cambridge, 1998, vol. 1.

[29] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from

tiny images,” online: http://www.cs.toronto.edu/~kriz/cifar.html, 2009.

[30] J. H. Holland, Adaptation in natural and artificial systems: an

introductory analysis with applications to biology, control, and artificial

intelligence. MIT Press Cambridge, MA, USA, 1975.

[31] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic

programming: an introduction. Morgan Kaufmann San Francisco, 1998,

vol. 1.

[32] C. Janis, “The evolutionary strategy of the equidae and the origins of

rumen and cecal digestion,” Evolution, vol. 30, no. 4, pp. 757–774, 1976.

[33] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist

multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on

Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.

[34] Y. Sun, G. G. Yen, and Z. Yi, “IGD indicator-based evolutionary algo-

rithm for many-objective optimization problems,” IEEE Transactions on

Evolutionary Computation, 2018, DOI:10.1109/TEVC.2018.2791283.

[35] M. Mitchell, An introduction to genetic algorithms. Cambridge,

Massachusetts, USA: MIT press, 1998.

[36] L. M. Schmitt, “Theory of genetic algorithms,” Theoretical Computer

Science, vol. 259, no. 1-2, pp. 1–61, 2001.

[37] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep

networks,” in Advances in Neural Information Processing Systems 28:

29th Annual Conference on Neural Information Processing Systems

2015, Montreal, Canada, 2015, pp. 2377–2385.

[38] A. E. Orhan and X. Pitkow, “Skip connections eliminate singularities,” in

Proceedings of 2018 Machine Learning Research, Stockholm, Sweden,

2018.

[39] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural

Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[40] G. C. Cawley and N. L. Talbot, “On over-fitting in model selection

and subsequent selection bias in performance evaluation,” Journal of

Machine Learning Research, vol. 11, no. Jul, pp. 2079–2107, 2010.

[41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,

“Dropout: a simple way to prevent neural networks from

overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp.

1929–1958, 2014.

[42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,

V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions,”

in Proceedings of 2015 IEEE Conference on Computer Vision and

Pattern Recognition, Boston, MA, USA, 2015, pp. 1–9.

[43] X. Glorot and Y. Bengio, “Understanding the difficulty of training

deep feedforward neural networks,” in Proceedings of the thirteenth

international conference on artificial intelligence and statistics, 2010,

pp. 249–256.

[44] Y. Leung, Y. Gao, and Z.-B. Xu, “Degree of population diversity-a

perspective on premature convergence in genetic algorithms and its

markov chain analysis,” IEEE Transactions on Neural Networks, vol. 8,

no. 5, pp. 1165–1176, 1997.

[45] Z. Michalewicz and S. J. Hartley, “Genetic algorithms+ data structures=

evolution programs,” Mathematical Intelligencer, vol. 18, no. 3, p. 71,

1996.

[46] Y. Sun, G. G. Yen, and Z. Yi, “Improved regularity model-based EDA

for many-objective optimization,” IEEE Transactions on Evolutionary

Computation, 2018, DOI:10.1109/TEVC.2018.2794319.

[47] ——, “Reference line-based estimation of distribution algorithm for

many-objective optimization,” Knowledge-Based Systems, vol. 132, pp.

129–143, 2017.

[48] L. Davis, Handbook of genetic algorithms. New York: Van Nostrand

Reinhold, 1991.

[49] D. E. Goldberg and J. H. Holland, “Genetic algorithms and machine

learning,” Machine Learning, vol. 3, no. 2, pp. 95–99, 1988.

[50] B. L. Miller, D. E. Goldberg et al., “Genetic algorithms, tournament

selection, and the effects of noise,” Complex systems, vol. 9, no. 3, pp.

193–212, 1995.

[51] G. Zhang, Y. Gu, L. Hu, and W. Jin, “A novel genetic algorithm and its

application to digital filter design,” in Proceedings of 2003 Intelligent

Transportation Systems, vol. 2. IEEE, 2003, pp. 1600–1605.

[52] J. Vasconcelos, J. A. Ramirez, R. Takahashi, and R. Saldanha, “Improve-

ments in genetic algorithms,” IEEE Transactions on Magnetics, vol. 37,

no. 5, pp. 3414–3417, 2001.

[53] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Ben-

gio, “Maxout networks,” in Proceedings of the 30th International

Conference on Machine Learning, Atlanta, Georgia, USA, Jun 2013,

pp. 1319–1327.

[54] K. Simonyan and A. Zisserman, “Very deep convolutional networks

for large-scale image recognition,” in 32nd International Conference on

Machine Learning (ICML 2015), Lille, France, 2015.

[55] M. Lin, Q. Chen, and S. Yan, “Network in network,” in Proceedings of

the 2014 International Conference on Learning Representations, Banff,

Canada, 2014.

[56] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,”

in Proceedings of the 2015 International Conference on Learning

Representations Workshop, San Diego, CA, 2015.

[57] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving

for simplicity: the all convolutional net,” in Proceedings of the 2015

International Conference on Learning Representations, San Diego, CA,

2015.

[58] G. Larsson, M. Maire, and G. Shakhnarovich, “Fractalnet: Ultra-

deep neural networks without residuals,” The 5th International

Conference on Learning Representations, 2016. [Online]. Available:

https://openreview.net/forum?id=S1VaB4cex

[59] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,

A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual

recognition challenge,” International Journal of Computer Vision, vol.

115, no. 3, pp. 211–252, 2015.

[60] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,

A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in

pytorch,” 2017. [Online]. Available: https://openreview.net/forum?id=

BJJsrmfCZ

Yanan Sun (S’15-M’18) received the Ph.D. degree in

engineering from Sichuan University, Chengdu,

China, in 2017. He is currently a Professor (research)

in the College of Computer Science, Sichuan University,

China. Prior to that, he was a Research Fellow

in the School of Engineering and Computer Science,

Victoria University of Wellington, Wellington, New

Zealand. Dr. Sun’s research topics are evolutionary

algorithms, deep learning, and evolutionary deep

learning. He is the lead organizer of the First

Workshop on Evolutionary Deep Learning, the lead

organizer of the Special Session on Evolutionary Deep Learning and

Applications in CEC19, and the founding chair of the IEEE CIS Task Force

on Evolutionary Deep Learning and Applications.

Bing Xue (M’10) received the B.Sc. degree from

the Henan University of Economics and Law,

Zhengzhou, China, in 2007, the M.Sc. degree in

management from Shenzhen University, Shenzhen,

China, in 2010, and the Ph.D. degree in computer

science from Victoria University of Wellington,

New Zealand, in 2014. She is currently an Associate Professor

in the School of Engineering and Computer Science

at Victoria University of Wellington. Her research

focuses mainly on evolutionary computation, feature

selection, feature construction, multi-objective opti-

mization, image analysis, transfer learning, data mining, and machine learning.

She has over 100 papers published in fully refereed international journals

and conferences, most of which are on evolutionary feature selection

and construction. Dr. Xue is currently the Chair of the IEEE Task Force

on Evolutionary Feature Selection and Construction, IEEE Computational

Intelligence Society (CIS), Vice-Chair of the IEEE CIS Data Mining and

Big Data Analytics Technical Committee, and Vice-Chair of the IEEE CIS Task

Force on Transfer Learning & Transfer Optimization.

Mengjie Zhang (M’04-SM’10-F’18) received the

B.E. and M.E. degrees from the Artificial Intelligence

Research Center, Agricultural University of Hebei,

Hebei, China, and the Ph.D. degree in computer

science from RMIT University, Melbourne, VIC,

Australia, in 1989, 1992, and 2000, respectively. He

is currently a Professor of Computer Science, Head

of the Evolutionary Computation Research Group,

and the Associate Dean (Research and Innovation)

in the Faculty of Engineering. His current research

interests include evolutionary computation, particularly

genetic programming, particle swarm optimization, and learning classifier

systems with application areas of image analysis, multi-objective optimization,

feature selection and reduction, job shop scheduling, and transfer learning. He

has published over 350 research papers in refereed international journals and

conferences. Prof. Zhang is a Fellow of the Royal Society of New Zealand and

has been a panel member of the Marsden Fund (New Zealand Government

Funding). He is a vice-chair of the IEEE CIS Task Force on Evolutionary

Feature Selection and Construction, a vice-chair of the Task Force on

Evolutionary Computer Vision and Image Processing, and the founding chair

of the IEEE Computational Intelligence Chapter in New Zealand. He is also

a committee member of the IEEE NZ Central Section.

Gary G. Yen (S’87-M’88-SM’97-F’09) received the

Ph.D. degree in electrical and computer engineering

from the University of Notre Dame in 1992.

He is currently a Regents Professor in the School

of Electrical and Computer Engineering, Oklahoma

State University (OSU). Before joining OSU in 1997,

he was with the Structure Control Division, U.S.

Air Force Research Laboratory in Albuquerque. His

research interests include intelligent control, computational

intelligence, conditional health monitoring,

signal processing, and their industrial/defense applications.

Dr. Yen was an associate editor of the IEEE Control Systems Magazine,

IEEE Transactions on Control Systems Technology, Automatica, Mechatronics,

IEEE Transactions on Systems, Man and Cybernetics, Parts A and B, and

IEEE Transactions on Neural Networks. He is currently serving as an associate

editor for the IEEE Transactions on Evolutionary Computation and the IEEE

Transactions on Cybernetics. He served as the General Chair for the 2003

IEEE International Symposium on Intelligent Control held in Houston, TX, and the

2006 IEEE World Congress on Computational Intelligence held in Vancouver,

Canada. Dr. Yen served as Vice President for Technical Activities in 2005-2006

and then President in 2010-2011 of the IEEE Computational Intelligence

Society. He was the founding editor-in-chief of the IEEE Computational

Intelligence Magazine, 2006-2009. In 2011, he received the Andrew P. Sage Best

Transactions Paper Award from the IEEE Systems, Man and Cybernetics Society,

and in 2014, he received the Meritorious Service Award from the IEEE Computational

Intelligence Society.