Zero-Cost Proxies Meet
Differentiable Architecture Search
Lichuan Xiang1†, Łukasz Dudziak2†, Mohamed S. Abdelfattah2†
Thomas Chau2, Nicholas D. Lane2,3, Hongkai Wen1,2∗
1University of Warwick, UK 2Samsung AI Center Cambridge, UK
3University of Cambridge, UK †Indicates equal contributions
{l.xiang.2, hongkai.wen}@warwick.ac.uk
{l.dudziak, mohamed1.a, thomas.chau, nic.lane}@samsung.com
Abstract
Differentiable neural architecture search (NAS) has attracted significant attention
in recent years due to its ability to quickly discover promising architectures of deep
neural networks even in very large search spaces. Despite its success, DARTS
lacks robustness in certain cases, e.g. it may degenerate to trivial architectures
with an excessive number of parameter-free operations such as skip connections or random noise,
leading to inferior performance. In particular, operation selection based on the
magnitude of architectural parameters was recently proven to be fundamentally
wrong, showcasing the need to rethink this aspect. On the other hand, zero-cost
proxies have been recently studied in the context of sample-based NAS showing
promising results – speeding up the search process drastically in some cases but
also failing on some of the large search spaces typical for differentiable NAS.
In this work we propose a novel operation selection paradigm in the context of
differentiable NAS which utilises zero-cost proxies. Our “perturbation-based zero-
cost operation selection” (Zero-Cost-PT) improves searching time and, in many
cases, accuracy compared to the best available differentiable architecture search,
regardless of the search space size. Specifically, we are able to find architectures comparable
to those of DARTS-PT on the DARTS CNN search space while being over 40× faster
(total searching time of 25 minutes on a single GPU). Our code will be
available at: https://github.com/avail-upon-acceptance.
1 Introduction
Since the recent dawn of deep learning, researchers have been working on developing new archi-
tectures of neural networks on an unprecedented scale, with more efficient and accurate networks
being proposed each year [1, 2, 3]. However, designing new architectures in order to push the
frontier has proven to be a challenging task, requiring extensive domain knowledge and persistence
in searching for the optimal combination of hyperparameters [4, 5]. Recently this process has
been successfully aided by the usage of automated methods, especially neural architecture search
(NAS) which, at the moment of writing, can be found behind many of the state-of-the-art deep neural
networks [6, 7, 8, 9].
However, one of the biggest problems in NAS is the computational cost – even training a single deep
network can require enormous computational resources and many NAS methods need to train tens,
if not hundreds, of networks in order to fully show their potential [6, 10, 11]. A related problem
concerns search space size – often more available options mean that we can find a better combination,
but it also means longer searching time [6]. Differentiable architecture search (DARTS) was first
∗Corresponding author.
Preprint. Under review.
proposed to tackle those challenges, showcasing promising results when searching for a network in a
set of over $10^{18}$ possible variations [12].
Unfortunately, DARTS has been proven to have some significant robustness issues [13, 14, 15]. It
also requires careful selection of hyperparameters which makes it relatively hard to use on a new task.
At the same time, sample-based NAS often does not exhibit those challenges although it requires
orders of magnitude more computation. In order to reduce searching time during sample-based NAS,
proxies are often used to obtain approximated performance of different models without the necessity
of their full training (which is the main factor behind their expensive searching cost). Most recently,
zero-cost proxies [16, 17], which are extreme types of NAS proxies that do not require any training,
have gained interest and have been shown to empirically deliver outstanding results on common NAS
benchmarks [16]. However, their efficient usage in the much larger search spaces typical for differentiable
NAS is considerably more challenging than in sample-based NAS, and thus remains an open problem [17].
In this paper we combine the most recent advances in differentiable NAS and zero-cost proxies to
further push the efficiency of NAS on very large search spaces.
2 Related work
Classic NAS and Proxies.
Zoph & Le were among the first to propose an automated method to
search for a neural network’s structure [18]. In their paper, a reinforcement learning agent is optimised
to maximise rewards coming from training of models with different architectures. Since then, a
number of alternative approaches have been proposed in order to reduce the significant searching time
introduced by the need to train each proposed model. Among those, Pham et al. proposed to train
models using only one epoch but preserve weights between trainings [19] – an approach classified as
a one-shot method that is very close to the differentiable NAS described below [20, 21]. In general,
reduced training (with or without weight sharing) can be found in many works using NAS [22]. Other
popular proxies involve searching for a model on a smaller dataset and then transferring the architecture
to the larger target dataset [6, 23], or incorporating a predictor into the search process [24, 11, 25, 26].
Zero-cost Proxies.
Very recently, zero-cost proxies for NAS emerged from the pruning-at-initialisation
literature [17, 16]. Zero-cost proxies for NAS aim to score a network architecture at initialisation,
in the hope that the score correlates with its final trained accuracy. The scores, indicating the
“saliency” of the architecture, can either be used to substitute the expensive training step in traditional
NAS, or to better guide the exploration in existing NAS solutions. Such proxies can be formulated
as architecture scoring functions $S(A)$ that evaluate the potential of a given neural architecture $A$
in achieving accuracy. In this paper, we adopt the zero-cost proxies from [16], namely grad_norm,
snip, grasp, synflow and fisher. Those metrics draw heavily from the pruning-at-initialisation
literature [27, 28, 29, 30] and aggregate the saliency of model parameters to compute the score of
an architecture. In addition, we also consider the metric nwot introduced in [17], which uses the
overlap of activations between different samples within a minibatch of data as a performance
indicator.
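To make the notion of an architecture scoring function $S(A)$ concrete, the snippet below sketches a proxy in the spirit of grad_norm: an untrained network is scored by the aggregate gradient norm obtained from a single labelled minibatch. This is a minimal illustration under our own simplifying assumptions (a generic PyTorch model and one random minibatch); it is not the exact implementation used in [16] or [17].

```python
# Illustrative grad_norm-style zero-cost proxy: one forward/backward pass on a
# single minibatch, then sum the L2 norms of all parameter gradients.
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_norm_score(model: nn.Module, inputs: torch.Tensor, targets: torch.Tensor) -> float:
    model.zero_grad()
    loss = F.cross_entropy(model(inputs), targets)
    loss.backward()
    return sum(p.grad.norm(p=2).item() for p in model.parameters() if p.grad is not None)

# Toy usage: score a randomly initialised network on a random CIFAR-like minibatch.
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))
print(grad_norm_score(net, x, y))
```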
Differentiable NAS.
Liu et al. first proposed to search for a neural network’s architecture by
parameterizing it with continuous values (called architectural parameters) in a differentiable way
and then optimising $\nabla_{\alpha} \mathcal{L}_{v}(w - \xi \nabla_{w} \mathcal{L}_{t}(w, \alpha), \alpha)$ [12]. Their method achieved a significant
reduction in searching time, due to the fact that the architectural parameters ($\alpha$) are optimised together
with the network’s parameters ($w$) in the same process. However, to do that, all candidate operations
have to be present and have architectural parameters assigned to them in a single structure commonly
called a supernet, that is, a network that is a superposition of all networks in a search space. In order
to extract the final architecture from the supernet, after the architectural parameters have converged,
the operations with the largest magnitudes of $\alpha$ are preserved (it is assumed that a higher $\alpha$ means
a more important operation). Despite the efficiency of DARTS in terms of searching time, its
stability and generalizability have been challenged. For instance, Zela et al. pointed out that DARTS
generates architectures which yield high validation accuracy during the search phase but perform
poorly in the final evaluation, and that it favours architectures dominated by skip connections [13].
SDARTS [31] proposed to overcome the issues mentioned above by smoothing the loss landscape
through either random smoothing or adversarial training. SGAS [32] alleviated the discrepancy between
the search and evaluation phases by selecting and pruning candidate operations sequentially with
a greedy algorithm, such that at the end of the search phase a network without weight sharing is
obtained, and thus the validation accuracy reflects the final evaluation accuracy.
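To illustrate the supernet construction and the magnitude-based selection criticised above, the snippet below sketches a single DARTS-style mixed edge in PyTorch: candidate operations are blended with softmax-normalised architectural parameters $\alpha$, and the edge is later discretized by keeping the operation with the largest $\alpha$. The candidate operations and tensor sizes are arbitrary placeholders, and the bilevel optimisation of $w$ and $\alpha$ from [12] is omitted.

```python
import torch
import torch.nn as nn

class MixedEdge(nn.Module):
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        # One architectural parameter per candidate operation.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(candidate_ops)))

    def forward(self, x):
        # Continuous relaxation: weighted sum of all candidate operations.
        weights = torch.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def select(self):
        # Magnitude-based selection (the step questioned in [33]).
        return self.ops[int(self.alpha.argmax())]

edge = MixedEdge([nn.Identity(), nn.Conv2d(8, 8, 3, padding=1), nn.AvgPool2d(3, stride=1, padding=1)])
x = torch.randn(2, 8, 16, 16)
print(edge(x).shape, edge.select())
```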
[Figure 1 diagram: a cell with nodes n(1), n(2), n(3) and edges e(1), e(2), e(3), each initially carrying the candidate operations {o0, o1, o2}. Discretizing e(2) with o1 gives A + (e(2), o1), where e(2) = o1; perturbing e(2) by o1 gives A − (e(2), o1), where e(2) = {o0, o2}.]
Figure 1: Visualisation of perturbation and discretization of an edge in a supernet.
Perturbation-based NAS.
Recently, Wang et al. showed that operation selection in DARTS
based on the magnitude of $\alpha$ is fundamentally wrong, and proposed an alternative approach based
on perturbations [33]. In this approach, the importance of an operation is determined by the decrease
in the super-network’s validation accuracy when the operation is removed (perturbation), and the
most important operations are identified by exhaustively comparing them with the other alternatives
on a single edge of the super-network. Although their method uses a super-network as in DARTS,
architecture selection is no longer performed by means of derivatives of structure-controlling
parameters; formally, it should not be put under the umbrella of differentiable NAS because operation
selection occurs in a non-differentiable manner. However, it was proposed as a solution to problems
specific to differentiable NAS and, to the best of our knowledge, is currently the only method in
its category. Therefore, in the remaining parts of the paper we will consider perturbation-based
algorithms to be a special case of differentiable methods.
3 Rethinking Operation Scoring in Differentiable NAS
In the context of differentiable architecture search, a supernet would contain multiple candidate
operations on each edge as shown in Figure 1. Operation scoring functions assign a score to each
operation out of the candidate operations – the operation with the best score is selected. In this section,
we empirically quantify the effectiveness of existing operation scoring methods in differentiable NAS,
with a specific focus on DARTS [12] and the recently-proposed DARTS-PT [33]. We challenge some
of the assumptions made in previous work and show that, at least empirically, we can outperform
existing operation scoring methods with a lightweight alternative based on zero-cost proxies [16].
3.1 Operation Scoring Preliminaries
For a supernet $A$ we want to be able to start discretizing edges in order to derive a subnetwork.
When discretizing, we replace an edge composed of multiple candidate operations and their respective
(optional) architectural parameters $\alpha$ with only one operation selected from the candidates. We will
denote the process of discretization of an edge $e$ with operation $o$, given a model $A$, as $A + (e, o)$.
Analogously, the perturbation of a supernet $A$ by removing an operation $o$ from an edge $e$ will be
denoted as $A - (e, o)$. Figure 1 illustrates discretization and perturbation.
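The sketch below makes the $A + (e, o)$ and $A - (e, o)$ notation concrete using a deliberately simplified, hypothetical representation of a supernet as a mapping from edges to their remaining candidate operations; a real supernet would, of course, also carry the operations’ weights and the cell topology.

```python
# Toy supernet as {edge: set of remaining candidate operations}.
from copy import deepcopy

def discretize(supernet, edge, op):
    """A + (e, o): keep only `op` on `edge`."""
    new = deepcopy(supernet)
    assert op in new[edge]
    new[edge] = {op}
    return new

def perturb(supernet, edge, op):
    """A - (e, o): remove `op` from `edge`, keeping the other candidates."""
    new = deepcopy(supernet)
    new[edge] = new[edge] - {op}
    return new

A0 = {"e1": {"o0", "o1", "o2"}, "e2": {"o0", "o1", "o2"}, "e3": {"o0", "o1", "o2"}}
print(discretize(A0, "e2", "o1"))  # e2 -> {'o1'}
print(perturb(A0, "e2", "o1"))     # e2 -> {'o0', 'o2'}
```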
NAS can then be performed by iterative discretization of the edges in the supernet, yielding in the process
a sequence of partially discretized architectures $A_0, A_1, ..., A_{|E|}$, where $A_0$ is the original supernet,
$A_{|E|}$ is the final fully-discretized subnetwork (the result of NAS), and $A_t$ is $A_{t-1}$ after discretizing
the next edge, i.e., $A_t = A_{t-1} + (e_t, o_t)$, where $t$ is an iteration counter.
The goal of operation scoring in this setting is to find the operation that maximizes the achievable
accuracy for a given edge and its supernet at iteration $t$ – let us denote this function as
$f(A_t, e): \mathcal{A} \times \mathcal{E} \to \mathcal{O}_e$, where $\mathcal{A}$ is the set of all possible variations of $A$, $\mathcal{E}$ is the set of edges in a supernet,
and $\mathcal{O}_e$ is the set of candidate operations for the edge $e$. Let $\mathcal{A}_{|E|}$ denote the set of all possible fully-discretized
subnetworks. Furthermore, let us denote the set of possible fully-discretized subnetworks derived
from $A_t + (e, o)$ at iteration $t$ as $\mathcal{A}_{t,e,o} \subset \mathcal{A}_{|E|}$. The optimal operation scoring function is:
$$f_{\text{best-acc}}(A_t, e) = \arg\max_{o \in \mathcal{O}_e} \; \max_{A_{|E|} \in \mathcal{A}_{t,e,o}} V^*(A_{|E|}) \qquad (1)$$
where $V^*$ is the validation accuracy of a network after full training (we will use $V$ to denote validation
accuracy without training). It is easy to see that if $f_{\text{best-acc}}$ were to be used to discretize all edges in a
supernet, it should yield the best model in the search space, regardless of the order in which the edges are
discretized. Because of this, we argue that this function is the ultimate target for operation scoring in
supernet-based NAS.
Table 1: Model selected based on maximizing each operation strength independently.
                        best-acc  avg-acc  disc-acc  darts-pt  zc-pt  disc-zc  darts
Avg. Error¹ [%]         5.63      6.24     13.55     19.43     5.81   22.96    45.7
Rank in NAS-Bench-201   1         166      12,744    13,770    14     14,274   15,231
¹Computed as the average of all available seeds for the selected model in the NAS-Bench-201 CIFAR-10 dataset.
It might be more practical to consider the expected achievable accuracy when an operation is selected,
instead of the best. Therefore we also define the function $f_{\text{avg-acc}}$:
$$f_{\text{avg-acc}}(A_t, e) = \arg\max_{o \in \mathcal{O}_e} \; \mathop{\mathbb{E}}_{A_{|E|} \in \mathcal{A}_{t,e,o}} \left[ V^*(A_{|E|}) \right] \qquad (2)$$
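For intuition, the sketch below evaluates both oracle scores on a tiny, exhaustively enumerated toy space in which the accuracy $V^*$ of every fully-discretized architecture is known. The two-edge space and the accuracy values are invented purely for illustration; on NAS-Bench-201 these accuracies come from the benchmark itself.

```python
# Toy oracle operation scores: f_best-acc (Eq. 1) and f_avg-acc (Eq. 2).
import itertools
import statistics

edges, ops = ["e1", "e2"], ["o0", "o1"]
# Map every fully-discretized architecture (one op per edge) to a made-up accuracy V*.
acc = {arch: i / 10 for i, arch in enumerate(itertools.product(ops, repeat=len(edges)))}

def oracle_score(edge, reduce_fn):
    """reduce_fn=max gives f_best-acc; reduce_fn=statistics.mean gives f_avg-acc."""
    idx = edges.index(edge)
    per_op = {o: reduce_fn([v for arch, v in acc.items() if arch[idx] == o]) for o in ops}
    return max(per_op, key=per_op.get)   # arg max over candidate operations

print(oracle_score("e1", max))              # best-acc choice for edge e1
print(oracle_score("e1", statistics.mean))  # avg-acc choice for edge e1
```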
In practice, we are unable to use either of the two functions $f_{\text{best-acc}}$ or $f_{\text{avg-acc}}$, since we would need
to have the final validation accuracy $V^*$ of all the networks in the search space. There have been
many attempts at approximating the operation scoring functions; in the following we consider the
practical alternatives presented in DARTS [12] and DARTS-PT [33]:
$$f_{\text{darts}}(A_t, e) = \arg\max_{o \in \mathcal{O}_e} \alpha_{e,o} \qquad (3)$$
$$f_{\text{disc-acc}}(A_t, e) = \arg\max_{o \in \mathcal{O}_e} V^*(A_t + (e, o)) \qquad (4)$$
$$f_{\text{darts-pt}}(A_t, e) = \arg\min_{o \in \mathcal{O}_e} V(A_t - (e, o)) \qquad (5)$$
where $\alpha_{e,o}$ is the architectural parameter assigned to operation $o$ on edge $e$ as presented in DARTS [12].
$f_{\text{disc-acc}}$ is the accuracy of a supernet after an operation $o$ is assigned to an edge $e$ – this is referred
to as “discretization accuracy” in the DARTS-PT paper and is assumed to be a good operation
scoring function [33]; most intuitively, it could approximate $f_{\text{avg-acc}}$. $f_{\text{darts-pt}}$ is the perturbation-based
approach used by DARTS-PT – it is presented as a practical and lightweight alternative to $f_{\text{disc-acc}}$ [33].
Similarly, we present the following scoring functions that use a zero-cost proxy $S$ instead of validation
accuracy when discretizing an edge or perturbing an operation. Note that the supernet is randomly
initialized and untrained in this case.
$$f_{\text{disc-zc}}(A_t, e) = \arg\max_{o \in \mathcal{O}_e} S(A_t + (e, o)) \qquad (6)$$
$$f_{\text{zc-pt}}(A_t, e) = \arg\min_{o \in \mathcal{O}_e} S(A_t - (e, o)) \qquad (7)$$
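Equation (7) is the score used throughout the rest of the paper. A minimal sketch, reusing the hypothetical `perturb` helper and a generic `zc_score` stand-in (any proxy from [16, 17] mapping a perturbed, untrained supernet to a scalar), is shown below.

```python
def f_zc_pt(supernet, edge, zc_score, perturb):
    """Eq. (7): pick the operation whose removal hurts the zero-cost score the most,
    i.e. arg min over o of S(A - (e, o))."""
    return min(supernet[edge], key=lambda o: zc_score(perturb(supernet, edge, o)))
```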
3.2 Empirical Evaluation of Operation Scoring Methods
In this subsection we investigate the performance of different operation scoring methods. Because we
want to compare with the optimal $f_{\text{best-acc}}$ and $f_{\text{avg-acc}}$, we conducted experiments on NAS-Bench-201,
which contains the validation accuracy of all 15,625 subnetworks in its supernet search space [34].
Figure 2: Spearman’s rank correlation coefficient of
different operation scoring metrics with each other.
We computed the operation score for all operations and all edges at the first iteration of NAS, that is,
$g(A_0, e, o)$² for all $e \in \mathcal{E}, o \in \mathcal{O}_e$. We then used these operation strengths to perform two experiments.
The first experiment is shown in Figure 2: we plot the Spearman rank correlation coefficient of the
different scoring functions, averaged over all edges. In the second experiment, we use the operation
scores $g(A_0, e, o)$ to select the best operation according to each score. Note that this is precisely the
NAS approach used in DARTS [12], where they check their operation score $\alpha$ in one shot at the end
of the supernet training. However, with the remaining methods, the operation score is
²Note that $g(A_0, e, o)$ is simply $f(A_0, e)$ without the arg min or arg max part.
Algorithm 1: Zero-Cost Perturbation-based Architecture Search (Zero-Cost-PT)
Input: An untrained supernetwork A_0 with set of edges E, # of search iterations N, # of validation iterations V
Result: A selected architecture A*_{|E|}
// Stage 1: search for architecture candidates
1:  C = ∅
2:  for i = 1 : N do
3:      for t = 1 : |E| do
4:          Select the next edge e_t using the chosen discretization ordering
5:          o_t = f_zc-pt(A_{t-1}, e_t)
6:          A_t = A_{t-1} + (e_t, o_t)
7:      end
8:      Add A_{|E|} to the set of candidate architectures C
9:  end
// Stage 2: validate the architecture candidates
10: for j = 1 : V do
11:     Calculate S^(j)(A) for each A ∈ C using a random minibatch of data
12: end
13: Select the best architecture A*_{|E|} = arg max_{A∈C} Σ_{j=1:V} S^(j)(A)
iteratively computed after each edge is discretized, potentially with additional training epochs in
between iterations (as in the darts-pt and disc-acc metrics).
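For reference, the per-edge correlation analysis behind Figure 2 boils down to computing Spearman’s ρ between two operation score vectors on each edge and averaging over edges, as in the sketch below; the score arrays are random placeholders standing in for $g(A_0, e, o)$.

```python
# Average per-edge Spearman rank correlation between two operation scoring functions.
import numpy as np
from scipy.stats import spearmanr

def avg_spearman(scores_a, scores_b):
    """scores_*: arrays of shape (num_edges, num_ops) holding g(A0, e, o)."""
    rhos = [spearmanr(a, b).correlation for a, b in zip(scores_a, scores_b)]
    return float(np.mean(rhos))

best_acc = np.random.rand(6, 5)  # e.g. 6 edges x 5 candidate ops, as in NAS-Bench-201
zc_pt = -np.random.rand(6, 5)    # perturbation-based scores are negated (lower is better)
print(avg_spearman(best_acc, zc_pt))
```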
Figure 2 shows many surprising findings. First, disc-acc is inversely correlated with best-acc. This
refutes the claim in the DARTS-PT paper that disc-acc is a reasonable operation score [33]. Our
findings are aligned with prior work that has already shown that the supernet accuracy is unrelated to
the final subnetwork accuracy [32]. Second, the darts-pt score does not track disc-acc; in fact, it is
inversely correlated with it as well. This means that the darts-pt score is not a good approximation of
disc-acc. However, darts-pt is weakly correlated with the “oracle” best-acc and avg-acc scores, which
could explain (empirically) why it works well. Third, zc-pt is strongly correlated with both the
best-acc and avg-acc metrics, indicating that there could be huge promise in using this scoring
function within NAS. Note that disc-zc, like disc-acc, is inversely correlated with our oracle scores,
suggesting that perturbation is generally a better scoring paradigm than discretization. Finally, the
original darts $\alpha$ score is weakly and inversely correlated with the oracle scores, further supporting
arguments in prior work that this is not an effective operation scoring method.
Table 1 shows which subnetwork is found when we use our seven operation scoring functions. As
expected, best-acc chooses the best subnetwork from the NAS-Bench-201 supernet. avg-acc selects
a very good model, but not the best one; this is likely due to the large variance of accuracies in
NAS-Bench-201. zc-pt selected one of the top models in NAS-Bench-201, as expected from the
strong correlation with the oracle best-acc function. The remaining operation scoring functions
failed to produce a good model in this experiment. This does not mean that these scores do not
work in the general NAS setting in which operations are selected iteratively and with training in
between iterations. However, our results suggest that these metrics do not make a good initial choice
of operations at iteration 0. We take these results as an indication that existing operation scoring
functions could be improved upon, especially using zc-pt which performed exceptionally well in our
NAS-Bench-201 experiment.
4 Zero-Cost-PT Neural Architecture Search
In this section, we introduce our proposed NAS algorithm based on zero-cost perturbation, and we perform
ablation studies to find the best set of heuristics for our search methodology, including: edge
discretization order, number of search and validation iterations, and the choice of the zero-cost metric.
4.1 Architecture Search with Zero-cost Proxies
Our algorithm takes as input an untrained supernet $A_0$ containing a set of edges $\mathcal{E}$, the number of
search iterations $N$, and the number of validation iterations $V$. In each search iteration $i$, we
start discretizing the supernet $A_0$ using one of the possible edge orderings (more on that later). When
discretizing each edge, we decide on the operation to use by applying our proposed zero-cost-based
Table 2: Test error (%) of Zero-Cost-PT when using different search orders on NAS-Bench-201.
Search Order¹        # of Perturbations²    C10          C100          ImageNet-16
fixed                |O||E|                 5.98±0.50    27.60±1.63    54.23±0.93
global-op-iter       ½|O||E|(|E|+1)         5.69±0.19    26.80±0.51    53.64±0.40
global-op-once       2|O||E| − |O|          6.30±0.57    28.96±1.66    55.04±1.47
global-edge-iter     ½|O||E|(|E|+1)         6.23±0.45    28.42±0.59    54.39±0.47
global-edge-once     2|O||E| − |O|          6.30±0.57    28.96±1.66    55.04±1.47
random               |O||E|                 5.97±0.17    27.47±0.28    53.82±0.77
¹All methods use the nwot metric, N=10 search iterations and V=100 validation iterations.
²Number of perturbations per search iteration.
perturbation function $f_{\text{zc-pt}}$, which was able to achieve promising results in the preliminary experiments
presented in the previous section. After all edges have been discretized, the final architecture is added
to the set of candidates, and we begin the process again for iteration $i+1$, starting with the original $A_0$.
After all $N$ candidate architectures have been constructed, the second stage begins. We score the
candidate architectures again using a selected zero-cost metric (the same one used in $f_{\text{zc-pt}}$), but
this time computing their end-to-end score rather than using the perturbation paradigm. We calculate
the zero-cost metric for each network using $V$ different minibatches of data. The final architecture
is the one that achieves the best total score during the second stage. The full algorithm is
outlined as Algorithm 1.
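The following condensed Python sketch mirrors the two stages of Algorithm 1, reusing the hypothetical `perturb`/`discretize` helpers from Section 3.1; here `zc_score(arch, batch)` stands in for the chosen zero-cost metric $S$ evaluated on one minibatch, and `batches` is an iterator over random minibatches.

```python
import random

def zero_cost_pt(supernet, edges, zc_score, batches, N=10, V=100):
    candidates = []
    # Stage 1: build N fully-discretized candidate architectures.
    for _ in range(N):
        arch = supernet
        for edge in random.sample(list(edges), len(edges)):  # the `random` edge ordering
            best_op = min(arch[edge],
                          key=lambda o: zc_score(perturb(arch, edge, o), next(batches)))
            arch = discretize(arch, edge, best_op)
        candidates.append(arch)
    # Stage 2: rank candidates by their summed end-to-end score over V minibatches.
    return max(candidates,
               key=lambda a: sum(zc_score(a, next(batches)) for _ in range(V)))
```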
Our algorithm has four main hyperparameters: $N$, $V$, the ordering of edges to follow when discretizing,
and the zero-cost metric $S$ to use. In the remainder of this section we present the detailed ablation
study we performed to decide on the best possible configuration of these.
4.2 Ablation Study on NAS-Bench-201
We conduct ablations of the proposed Zero-Cost-PT approach on NAS-Bench-201 [34]. NAS-
Bench-201 provides a unified cell-based search space, where each architecture has been trained on
three different datasets: CIFAR-10, CIFAR-100 and ImageNet-16-120³. In our experiments, we take
a randomly initialised supernet for this search space and apply our Zero-Cost-PT algorithm to search
for architectures without any training. We run the search with four different random seeds (0, 1, 2,
3) and report the average and standard deviation of the test errors of the obtained architectures. All
searches were performed on CIFAR-10, and the obtained architectures were then additionally evaluated
on the other two datasets.
Edge Discretization Order.
We first study how different edge discretization orders may impact the
performance of our Zero-Cost-PT approach. We consider the following edge discretization orders:
• fixed: discretizes the edges in a fixed order, where in our experiments we discretize from the
input towards the output of the cell structure;
• random: discretizes the edges in a random order (as in DARTS-PT);
• global-op-iter: iteratively evaluates $S(A - (e, o))$ for all operations on all edges in $\mathcal{E}$ and
selects the edge $e$ containing the operation $o^*$ with the globally best score. It discretizes $e$ with $o^*$,
then repeats the process to decide on the next edge (involving re-evaluation of the scores) until all
edges have been discretized (a sketch of this procedure is given after this list);
• global-op-once: evaluates $S(A - (e, o))$ for all operations only once to obtain a ranking
of the operations, and discretizes the edges $\mathcal{E}$ according to this order;
• global-edge-iter: similar to global-op-iter, but iteratively selects the edge $e$ from $\mathcal{E}$ based
on the average score of all operations on each edge;
• global-edge-once: similar to global-op-once, but uses the average score of the operations on
each edge to obtain the edge discretization order.
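As referenced in the global-op-iter item above, the following sketch illustrates that ordering using the same simplified supernet representation and hypothetical `zc_score`, `perturb` and `discretize` helpers introduced in Section 3.

```python
def global_op_iter(supernet, edges, zc_score, perturb, discretize):
    """At every step, evaluate S(A - (e, o)) for all operations on all remaining edges,
    discretize the edge holding the globally best-scoring operation, and repeat."""
    arch, remaining = supernet, list(edges)
    while remaining:
        # The best (edge, op) pair is the one whose removal hurts the score the most.
        edge, op = min(((e, o) for e in remaining for o in arch[e]),
                       key=lambda eo: zc_score(perturb(arch, eo[0], eo[1])))
        arch = discretize(arch, edge, op)
        remaining.remove(edge)
    return arch
```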
In our experiments we run $N = 10$ search iterations and $V = 100$ validation iterations for all variants.
Table 2 shows the performance of these approaches. We see that global-op-iter consistently
performs best across all three datasets, since it iteratively explores the search space of remaining
operations, aiming to select the currently best one in a greedy way. It also comes with a higher cost
than fixed or random, since it needs to perturb $\frac{1}{2}|O||E|(|E|+1)$ operations in total, while the latter
³We use the three random seeds available in NAS-Bench-201: 777, 888, 999.
[Figure 3 plots: (a) test accuracy vs. architecture score of architectures discovered with the random ordering for N = 10, 100, 1000 and with global-op-iter for N = 10; (b) test accuracy distributions for N ∈ {10, 100, 1000} and V ∈ {1, 10, 100}, all on CIFAR-10.]
Figure 3: (a) Accuracy vs. score of architectures discovered on C10 by Zero-Cost-PT with different
N. (b) Accuracy distribution of discovered architectures with different Nand V.
require $|O||E|$ perturbations ($|E|$ is the number of edges in the cell and $|O|$ is the number of operations on each
edge). On the other hand, we see that although cheaper, the performance of global-op-once is
inferior, since it determines the order in which to perturb edges by assessing the importance of operations
once and for all at the beginning, which may become inappropriate as discretization continues. Note that
when discretizing an edge according to the obtained order, global-op-once still needs to perturb
the $|O|$ operations on each remaining edge. We observe similar behaviour in global-edge-iter
and global-edge-once, both of which use the average importance of the operations on an edge to decide the
search order, leading to suboptimal performance. It is also worth pointing out that fixed performs
relatively well compared to the other variants, offering performance comparable to random. This
shows that Zero-Cost-PT is generally robust to the edge discretization order. For simplicity, in the
following experiments we use the random order with a moderate setting of search iterations ($N = 10$) to
balance exploration and exploitation during search, while maintaining the efficiency of Zero-Cost-PT.
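As a concrete count (assuming the standard NAS-Bench-201 cell with $|E| = 6$ edges and $|O| = 5$ candidate operations per edge), the per-iteration perturbation budgets in Table 2 evaluate to:
$$|O||E| = 30, \qquad \tfrac{1}{2}|O||E|(|E|+1) = 105, \qquad 2|O||E| - |O| = 55,$$
for fixed/random, global-op-iter/global-edge-iter, and global-op-once/global-edge-once, respectively.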
Search vs. Validation.
We also study the impact of the number of search iterations $N$ and validation
iterations $V$ when our Zero-Cost-PT uses random as the search order. Intuitively, a larger $N$ leads to more
architecture candidates being found, while $V$ controls the amount of data used to rank the search
candidates. As shown in Figure 3a, a larger $N$ does lead to more architectures being discovered, but the
number does not grow proportionally to $N$ on the NAS-Bench-201 space: for $N$=100 we discover 27.8 distinct
architectures on average, but when increasing to $N$=1000 the number only roughly doubles. We also
see that even with $N$=10, Zero-Cost-PT (with both random and global-edge-iter orderings) can already discover
top models in the space, demonstrating a desirable balance between search quality and efficiency. On
the other hand, as shown in Figure 3b, a larger $V$ tends to reduce the performance variance, especially
for smaller $N$. This is also expected, as more validation iterations stabilise the ranking of the selected
architecture candidates, helping Zero-Cost-PT to retain the most promising ones at a manageable cost
($V$ minibatches of data).
Table 3: Test error (%) of Zero-Cost-PT with different proxies on NAS-Bench-201.
Proxy¹       CIFAR-10     CIFAR-100    ImageNet-16
fisher       10.64±1.27   38.48±1.96   82.85±12.63
grad_norm    10.55±1.11   38.43±2.10   80.71±12.10
grasp        9.81±3.42    36.52±6.33   64.27±8.82
snip         8.32±2.02    34.00±4.03   65.35±11.04
synflow²     6.24±0.00    28.89±0.00   58.56±0.00
nwot         5.97±0.17    27.47±0.28   53.82±0.77
¹All proxies use N=10 search iterations and V=100 validation iterations.
²Only 1 model was selected across all 4 seeds.
Different Zero-cost Metrics.
We empirically investigate the performance of our Zero-Cost-PT algorithm when using different zero-cost
metrics. In particular, we consider the following metrics that have been proposed in the recent zero-cost
NAS literature [16, 17]: grad_norm [16], snip [29], grasp [28], synflow [27], fisher [35] and nwot [17].
Table 3 compares the average test errors of the architectures selected by the different proxies on NAS-Bench-201.
We see that nwot and synflow perform considerably better across the three datasets than the others,
with nwot offering around a 0.27% improvement over synflow. However, even the worst-performing
fisher and the naive grad_norm outperform the state-of-the-art DARTS-PT on this benchmark (see Table 4).
This confirms that zero-cost metrics, when combined with the perturbation-based NAS paradigm
as in Zero-Cost-PT, can become promising proxies for the actual trained accuracy. We also observed
that the ranking of those metrics is quite stable across the three datasets (descending order in terms
of error as in Table 3), indicating that architectures discovered by our Zero-Cost-PT have good
transferability across datasets. It is also clear that nwot consistently performs the best, reducing test
errors on all three datasets by a considerable margin.
5 Results
In the previous section we introduced our Zero-Cost-PT method and performed ablation studies
on NAS-Bench-201 [34] to find the best-performing set of hyperparameters. In this section we further
compare the proposed Zero-Cost-PT approach with state-of-the-art zero-cost and perturbation-based
NAS algorithms on a number of search spaces, including NAS-Bench-201 [34], the DARTS CNN
space [12] and the four DARTS subspaces S1-S4 [13]. We use the same settings as in Section 4.2;
further details can be found in the supplementary materials (S.M.).
5.1 Comparison with SOTA on NAS-Bench-201
Table 4: Comparison in test error (%) with the state-of-the-art perturbation-based and zero-cost NAS on NAS-Bench-201.
Method                     CIFAR-10      CIFAR-100     ImageNet-16
Random                     13.39±13.28   39.17±12.58   66.87±9.66
DARTS [12]                 45.70±0.00    84.39±0.00    83.68±0.00
DARTS-PT [33]¹             11.89±0.00    45.72±6.26    69.60±4.40
DARTS-PT (fix α) [33]¹     6.20±0.00     34.03±2.24    61.36±1.91
NASWOT(synflow) [16]²      6.54±0.62     29.53±2.13    58.22±4.18
NASWOT(nwot) [17]²         7.04±0.80     29.97±1.16    55.57±2.07
Zero-Cost-PT(synflow)      6.24±0.00     28.89±0.00    58.56±0.00
Zero-Cost-PT(nwot)         5.97±0.17     27.47±0.28    53.82±0.77
¹Results on C10 taken from [33]. Results on the other datasets computed using the official code in [33], and averaged over results using 4 random seeds (0-3).
²Calculated with N=1000 and averaged over 500 runs as described in [17].
Table 4 shows the average test error (%) of the competing approaches and our Zero-Cost-PT on the
three datasets in NAS-Bench-201. Here we include naive random search and the original DARTS as
baselines, and compare our approach with the recent zero-cost NAS algorithm NASWOT [17], as
well as with the perturbation-based NAS approaches DARTS-PT and DARTS-PT (fix $\alpha$) [33]. As in
all competing approaches, we perform the search on CIFAR-10 and evaluate the final model on all
three datasets. We see that on all datasets our Zero-Cost-PT consistently offers superior performance,
especially on CIFAR-100 and ImageNet-16. Even the best perturbation-based algorithm, DARTS-PT
(fix $\alpha$), fails on those two datasets, producing suboptimal results with limited improvements over
random search. This suggests that architectures discovered by DARTS-PT might not transfer well to
other datasets. On the other hand, our Zero-Cost-PT can discover architectures that consistently
perform well even after transferring to a different dataset – especially when used with the nwot metric.
5.2 Results on DARTS CNN Search Space
Table 5: Comparison with the state-of-the-art differentiable NAS methods on the DARTS CNN search space (CIFAR-10).
Method                         Test Error (%) Avg.   Best    Params (M)   Cost²
DARTS [12]                     3.00±0.14             -       3.3          0.4
SDARTS-RS [31]                 2.67±0.03             -       3.4          0.4
SGAS [32]                      2.66±0.24             -       3.7          0.25
DARTS+PT [33]                  2.61±0.08             2.48    3.0          0.8
DARTS+PT+none¹                 2.73±0.13             2.67    3.2          0.8
Zero-Cost-PT (random)          2.68±0.17             2.43    4.7          0.018
Zero-Cost-PT (global-op-iter)  2.62±0.09             2.49    4.6          0.17
¹The none operation is muted in the code provided by DARTS+PT [33]; these results were obtained by re-enabling none during the search.
²In GPU-days. Cost of existing approaches taken from [33]. Cost of Zero-Cost-PT measured on a single 2080Ti GPU.
We now move to the much larger DARTS CNN search space. We compare the proposed Zero-Cost-PT
with the original DARTS and its variants, as well as with the recent perturbation-based DARTS-PT [33].
We use the same settings as in DARTS-PT [33], but instead of pre-training the supernetwork and
fine-tuning it after each perturbation, we take an untrained supernet and directly perform zero-cost
perturbation-based architecture search as in Algorithm 1. As in previous experiments, we run our
Zero-Cost-PT algorithm with $N$=10 search iterations and $V$=100 validation iterations, using the same
random seeds as in DARTS-PT. We then train the selected four
architectures under different initialisations (seeds 0-3) for 600 epochs, and report both the best and
average test errors on CIFAR-10. Training details can be found in the S.M.
As shown in Table 5, the proposed Zero-Cost-PT approaches achieve a much better average test error
than the original DARTS and are comparable to its newer variants SDARTS-RS [31] and SGAS [32],
at a much lower searching cost (especially when using the random edge ordering). The significant
improvement over DARTS+PT comes from the fact that DARTS-PT needs to compute the validation
accuracy of the supernet after each operation is perturbed (removed), where the remaining supernet
requires extra fine-tuning between two perturbations to recover from the accuracy drop. On the other hand,
in our Zero-Cost-PT (random) we only need to evaluate the score of the perturbed supernet with zero-cost
proxies (in these experiments we use $S_{\text{nwot}}$), requiring no more than a minibatch of data. Note that
the cost of Zero-Cost-PT reported in Table 5 is for $N$=10 search iterations (randomly selecting
the order of edges to perturb in each iteration), and thus a single search iteration only takes
a few minutes to run. The other variant, Zero-Cost-PT (global-op-iter), which uses global-op-iter to
determine the edge discretization order (see Section 4.2), offers better performance with lower variance
compared to random, but incurs slightly heavier computation.
5.3 Results on DARTS Spaces S1-S4
As described in the previous sections, it is well known that DARTS can generate trivial architectures
with degenerate performance in certain cases. Zela et al. [13] designed various special search
spaces for DARTS in order to investigate such failure cases. As in DARTS-PT, we consider the
spaces S1-S4 and study the performance of our Zero-Cost-PT on them to validate its robustness in a
controlled environment. Detailed specifications of S1-S4 can be found in the S.M.
Table 6: Comparison in test error (%) with state-of-the-art perturbation-based NAS on DARTS spaces S1-S4 (best in red, 2nd best in blue).
Space   DARTS¹ Best   DARTS-PT¹ Best   DARTS-PT¹ Best (fix α)   Zero-Cost-PT² Avg.   Zero-Cost-PT² Best
CIFAR-10
S1      3.84          3.5              2.86                     2.75±0.28            2.55
S2      4.85          2.79             2.59                     2.49±0.05            2.45
S3      3.34          2.49             2.52                     2.47±0.09            2.40
S4      7.20          2.64             2.58                     5.23±0.76            4.69
CIFAR-100
S1      29.64         24.48            24.4                     22.05±0.29           21.84
S2      26.05         23.16            23.3                     20.97±0.50           20.61
S3      28.9          22.03            21.94                    21.02±0.57           20.61
S4      22.85         20.80            20.66                    25.70±0.01           25.69
SVHN
S1      4.58          2.62             2.39                     2.37±0.06            2.33
S2      3.53          2.53             2.32                     2.40±0.05            2.36
S3      3.41          2.42             2.32                     2.34±0.05            2.30
S4      3.05          2.42             2.39                     2.83±0.06            2.79
¹Results taken from [33].
²Results obtained using random seeds 0 and 2.
As shown in Table 6, our approach consistently outperforms the original DARTS, the state-of-the-art
DARTS-PT and DARTS-PT (fix $\alpha$) across S1 to S3 on both CIFAR-10 and CIFAR-100, while on SVHN
it offers competitive performance compared to the competing algorithms (best in S1, second best in
S2/S3 with a 0.08%/0.02% gap). This confirms that our Zero-Cost-PT is robust in finding
well-performing architectures in spaces where DARTS typically fails; e.g. it has been shown [33]
that in S2 DARTS tends to produce trivial architectures saturated with skip connections. On the other
hand, we observe that Zero-Cost-PT does not perform well in search space S4. In particular, our
approach struggles with the operation noise, which simply outputs random Gaussian noise $\mathcal{N}(0,1)$
regardless of the input. This leads to unpredictable behaviour when our approach assesses the
importance of this operation, i.e., the score $S(A \setminus o)$ can be completely random when $o$ = noise.
It is therefore expected that our Zero-Cost-PT approach, which involves no training during architecture
search, generates suboptimal results in this space.
6 Conclusion
In this paper, we propose Zero-Cost-PT, a perturbation-based NAS algorithm that utilises zero-cost proxies in
the context of differentiable NAS. We show that lightweight operation scoring methods based on
zero-cost metrics empirically outperform existing operation scoring functions such as darts [12] and
darts-pt [33]. We then present a lightweight NAS algorithm based on perturbation and zero-cost
metrics. Our approach outperforms the best available differentiable architecture search methods in terms
of searching time and accuracy, even in very large search spaces – something that was previously
impossible to achieve using zero-cost proxies.
References
[1]
Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and
Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360, 2016.
[2]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias
Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural
Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861, 2017.
[3]
Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural
networks. In International Conference on Machine Learning (ICML), 2019.
[4]
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen.
Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018.
[5]
Mingxing Tan and Quoc V. Le. Efficientnetv2: Smaller models and faster training.
arXiv:2104.00298, 2021.
[6]
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized Evolution for
Image Classifier Architecture Search. In AAAI Conference on Artificial Intelligence (AAAI),
2019.
[7]
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong
Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet
design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages 10726–10734, 2019.
[8]
Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once for all: Train one
network and specialize it for efficient deployment. In International Conference on Learning
Representations (ICLR), 2020.
[9]
Bert Moons, Parham Noorzad, Andrii Skliar, Giovanni Mariani, Dushyant Mehta, Chris Lott,
and Tijmen Blankevoort. Distilling optimal neural networks: Rapid search in diverse spaces.
arXiv:2012.08859, 2020.
[10]
Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimiza-
tion. In Proceedings of the 32nd International Conference on Neural Information Processing
Systems, NIPS’18, page 7827–7838, Red Hook, NY, USA, 2018. Curran Associates Inc.
[11]
Łukasz Dudziak, Thomas Chau, Mohamed S. Abdelfattah, Royson Lee, Hyeji Kim, and
Nicholas D. Lane. BRP-NAS: Prediction-based NAS using GCNs. In Neural Information
Processing Systems (NeurIPS), 2020.
[12]
Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search.
In International Conference on Learning Representations (ICLR), 2019.
[13]
Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank
Hutter. Understanding and robustifying differentiable architecture search. In International
Conference on Learning Representations (ICLR), volume 3, page 7, 2020.
[14]
Yao Shu, Wei Wang, and Shaofeng Cai. Understanding architectures learnt by cell-based neural
architecture search. In International Conference on Learning Representations, 2020.
[15]
Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluating
the search phase of neural architecture search. In International Conference on Learning
Representations, 2020.
[16]
Mohamed S Abdelfattah, Abhinav Mehrotra, Łukasz Dudziak, and Nicholas Donald Lane. Zero-
cost proxies for lightweight NAS. In International Conference on Learning Representations
(ICLR), 2021.
[17]
Joseph Mellor, Jack Turner, Amos Storkey, and Elliot J. Crowley. Neural architecture search
without training. In International Conference on Machine Learning (ICML), 2021 (to appear).
[18]
Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In
International Conference on Learning Representations (ICLR), 2017.
[19]
Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture
search via parameters sharing. In International Conference on Machine Learning (ICML), pages
4095–4104, 2018.
[20]
Gabriel Bender, Hanxiao Liu, Bo Chen, Grace Chu, Shuyang Cheng, Pieter-Jan Kindermans,
and Quoc V. Le. Can weight sharing outperform random architecture search? an investigation
with tunas. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2020.
[21]
Royson Lee, Łukasz Dudziak, Mohamed Abdelfattah, Stylianos I. Venieris, Hyeji Kim, Hongkai
Wen, and Nicholas D. Lane. Journey towards tiny perceptual super-resolution. In European
Conference on Computer Vision (ECCV), 2020.
[22]
Dongzhan Zhou, Xinchi Zhou, Wenwei Zhang, Chen Change Loy, Shuai Yi, Xuesen Zhang,
and Wanli Ouyang. Econas: Finding proxies for economical neural architecture search. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2020.
[23]
Abhinav Mehrotra, Alberto Gil Ramos, Sourav Bhattacharya, Łukasz Dudziak, Ravichander
Vipperla, Thomas Chau, Mohamed S. Abdelfattah, Samin Ishtiaq, and Nicholas D. Lane. NAS-
Bench-ASR: Reproducible Neural Architecture Search for Speech Recognition. In International
Conference on Learning Representations (ICLR), 2021.
[24]
Chen Wei, Chuang Niu, Yiping Tang, and Ji min Liang. Npenas: Neural predictor guided
evolution for neural architecture search. arXiv:2003.12857, 2020.
[25]
Junru Wu, Xiyang Dai, Dongdong Chen, Yinpeng Chen, Mengchen Liu, Ye Yu, Zhangyang
Wang, Zicheng Liu, Mei Chen, and Lu Yuan. Weak NAS predictors are all you need.
arXiv:2102.10490, 2021.
[26]
Wei Wen, Hanxiao Liu, Hai Li, Yiran Chen, Gabriel Bender, and Pieter-Jan Kindermans. Neural
predictor for neural architecture search. arXiv:1912.00848, 2019.
[27]
Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, and Surya Ganguli. Pruning neural
networks without any data by iteratively conserving synaptic flow. In Neural Information
Processing Systems (NeurIPS), 2020.
[28]
Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by
preserving gradient flow. In International Conference on Learning Representations (ICLR),
2020.
[29]
Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning
based on connection sensitivity. In International Conference on Learning Representations
(ICLR), 2019.
[30]
Jack Turner, Elliot J. Crowley, Michael O’Boyle, Amos Storkey, and Gavin Gray. Block-
swap: Fisher-guided block substitution for network compression on a budget. In International
Conference on Learning Representations (ICLR), 2020.
[31]
Xiangning Chen and Cho-Jui Hsieh. Stabilizing differentiable architecture search via
perturbation-based regularization. In International Conference on Machine Learning (ICML),
pages 1554–1565. PMLR, 2020.
[32]
Guohao Li, Guocheng Qian, Itzel C Delgadillo, Matthias Muller, Ali Thabet, and Bernard
Ghanem. Sgas: Sequential greedy architecture search. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pages 1620–1630, 2020.
[33]
Ruochen Wang, Minhao Cheng, Xiangning Chen, Xiaocheng Tang, and Cho-Jui Hsieh. Re-
thinking architecture selection in differentiable NAS. In International Conference on Learning
Representations (ICLR), 2021.
[34]
Xuanyi Dong and Yi Yang. NAS-Bench-201: Extending the Scope of Reproducible Neural
Architecture Search. In International Conference on Learning Representations (ICLR), 2020.
[35]
Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction
with dense networks and fisher pruning. arXiv:1801.05787, 2018.
[36]
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong.
Pc-darts: Partial channel connections for memory-efficient architecture search. In International
Conference on Learning Representations, 2020.
A Appendix
A.1 More on Operation Scoring
In this section, we provide more experimental details and results of our analysis of the operation
scoring functions introduced in Section 3. Most notably, we empirically study these functions in the
iterative setting.
A.1.1 Iterative Operation Selection
One major finding from Figure 2 is that discretization accuracy does not represent the true operation
strength at all; in fact, in our experiments it was negatively correlated with best-acc (which is the
ideal scoring function). Another interesting finding is that zero-cost-based methods, especially
zc-pt, performed very well, even outperforming the recently-introduced darts-pt [33]. However, one
shortcoming of our analysis and conclusions in Section 3 is that we only consider operation scores at
iteration 0, whereas most of the operation scoring functions are meant to be used within an iterative
algorithm: score → discretize → (optional) train → score.
To investigate the correlation of scoring functions in the iterative setting, we do the following:
1. score operations on all undiscretized edges,
2. discretize edge i,
3. retrain for 5 epochs (darts-pt and disc-acc only),
4. increment i and repeat from step 1 until all edges are discretized.
At each iteration $i$, we calculate the scores for the operations on all remaining undiscretized edges and
compute their Spearman rank correlation coefficients (Spearman-$\rho$) w.r.t. best-acc. This is plotted
in Figure A1, averaged over four seeds. Our results confirm much of our iteration-0 analysis: zc-pt
continues to be the best operation scoring function, and darts-pt is the second best. Additionally,
disc-acc continues to be unrepresentative of operation strength even when used in the iterative setting.
This is not what we expected, especially in the very last iteration when disc-acc is supposed to match
a subnetwork exactly. As Figure A1 shows, the variance in the last iteration is quite large – we believe
this happens because we do not train to convergence every time we discretize an edge, and instead
we only train for 5 epochs. In future work (and possibly, in revisions of this manuscript) we will
re-evaluate discretization accuracy with a larger number of retraining epochs. For now, our main
conclusion regarding the disc-acc scoring function is that it is unrepresentative of true operation
strength when we only use 5 epochs of retraining.
Figure A1: Rank correlation coefficient of operation scoring functions versus best-acc when invoked
iteratively for each edge. In iteration $i$, only edge $i$ is discretized; then the scores for all operations on
the remaining edges are computed and correlated against best-acc.
Table A1: Comparison to previous work on the ImageNet classification task. We performed our search using the CIFAR-10 dataset as described in Section 5, similarly to previous work.
Architecture             Top-1 Test Error [%]   Top-5 Test Error [%]   Params [M]   Search Time [GPU-days]
DARTS [12]               26.7                   8.7                    4.7          0.4
SDARTS-RS [31]           25.6                   8.2                    -            0.4
DARTS-PT [33]            25.5                   8.0                    4.6          0.8
PC-DARTS [36]            25.1                   7.8                    5.3          0.1
SGAS [32]                24.1                   7.3                    5.4          0.25
Zero-Cost-PT¹ (best)     24.4                   7.5                    6.3          0.018
Zero-Cost-PT¹ (4 seeds)  24.6±0.13              7.6±0.09               6.3          0.018
¹We use the same training pipeline as DARTS [12].
A.1.2 Detailed Methodology
Here, we provide some additional experimental details for the data presented in Section 3. The
following list describes how we compute each operation score.
• best-acc: To get the score for an operation o on a specific edge e, we find the maximum test
accuracy over all NAS-Bench-201 architectures containing (o, e).
• avg-acc: Same as best-acc, but we average the test accuracies of all NAS-Bench-201 architectures
containing (o, e) instead of taking the maximum.
• disc-acc: We discretize one edge e by selecting an operation o, then train for 5 epochs⁴
and record the supernet accuracy – this is used as the score for (o, e).
• darts-pt: We perturb one edge with one operation, A − (e, o), and record the validation
accuracy. For perturbation-based scoring functions, we multiply the score by −1 before
computing correlations.
• disc-zc: We discretize one edge e by selecting an operation o and then compute the zero-cost
metric.
• zc-pt: We perturb one edge with one operation, A − (e, o), and compute the zero-cost metric.
For perturbation-based scoring functions, we multiply the score by −1 before computing
correlations.
• darts: We record the values of the architecture parameters α after 60 epochs of training the
supernet.
A.1.3 Detailed Operation Scores
Table A2 (at the end of this Appendix) shows all operation scores at iteration 0. This data was used to
compute the Spearman-$\rho$ values in Figure 2. Note that we compute Spearman-$\rho$ per edge and average over all
edges – this summarizes how well each score tracks our “oracle” best-acc score.
A.2 Evaluation on ImageNet
Table A1 shows the ImageNet classification accuracy of architectures searched on CIFAR-10. Our
Zero-Cost-PT (random) algorithm (from Table 5) is able to find architectures with comparable accuracy
much faster than previous work.
A.3 Description of DARTS subspaces (S1-S4)
RobustDARTS introduced four different DARTS subspaces to evaluate the robustness of the original
DARTS algorithm [13]. In our work we validate the robustness of Zero-Cost-PT against some of the
more recent algorithms using the same subspaces originally proposed in the RobustDARTS paper
(Section 5.3). The search spaces are defined as follows:
⁴DARTS-PT defines discretization accuracy as the accuracy after convergence. We elected to only train for 5 epochs to make our experiments feasible, but we are now investigating whether longer training will affect our results.
Algorithm 2: Zero-Cost Perturbation-based Architecture Search (Zero-Cost-PT)
Input: An untrained supernetwork A_0 with set of edges E and set of nodes 𝒩, # of search iterations N, # of validation iterations V
Result: A selected architecture A*_{|E|}
// Stage 1: search for architecture candidates
1:  C = ∅
2:  for i = 1 : N do
3:      for t = 1 : |E| do
4:          Select the next edge e_t using the chosen discretization ordering
5:          o_t = f_zc-pt(A_{t-1}, e_t)
6:          A_t = A_{t-1} + (e_t, o_t)
7:      end
8:      while |𝒩| > 0 do            // prune the edges of the obtained architecture A_{|E|}
9:          Randomly select a node n ∈ 𝒩
10:         for all input edges e to node n do
11:             Evaluate the zc-pt score of the architecture A_{|E|} when e is removed
12:         end
13:         Retain only the edges e_n^(1)*, e_n^(2)* with the 1st and 2nd best zc-pt score, and remove n from 𝒩
14:     end
15:     Add A_{|E|} to the set of candidate architectures C
16: end
// Stage 2: validate the architecture candidates
17: for j = 1 : V do
18:     Calculate S^(j)(A) for each A ∈ C using a random minibatch of data
19: end
20: Select the best architecture A*_{|E|} = arg max_{A∈C} Σ_{j=1:V} S^(j)(A)
• in S1, each edge of the supernet consists of only the two candidate operations with the
highest magnitude of α in vanilla DARTS (these operations can be different for different edges);
• S2 only considers two operations: skip_connect and sep_conv_3x3;
• similarly, S3 consists of three operations: none, skip_connect and sep_conv_3x3;
• finally, S4 again considers just two operations: noise and sep_conv_3x3, where the noise
operation generates random Gaussian noise N(0,1) in every forward pass, independently of the input.
A.4 Detailed Zero-Cost-PT Algorithm
For the DARTS CNN search space, our Zero-Cost-PT algorithm has an additional topology selection
step, where for each node in an architecture we only retain the top two incoming edges based on the
zc-pt score – this is similar to the vanilla DARTS algorithm [12]. The detailed algorithm is presented
in Algorithm 2.
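A minimal sketch of the extra pruning loop in Algorithm 2 is shown below; `in_edges` maps each intermediate node to its input edges, and `score_without_edge` is a hypothetical stand-in for the zc-pt evaluation of the architecture with that edge removed.

```python
def prune_topology(in_edges, score_without_edge):
    """Keep only the two most important input edges per node, mirroring f_zc-pt:
    the lower the score after removing an edge, the more important that edge is."""
    kept = {}
    for node, edges in in_edges.items():
        ranked = sorted(edges, key=lambda e: score_without_edge(node, e))
        kept[node] = ranked[:2]
    return kept
```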
A.5 Experimental Details
All searches were run multiple times with different searching seeds (usually 0, 1, 2 and 3). Ad-
ditionally, each found architecture was trained multiple times using different training seeds – for
DARTS the same set of seeds was used for training and searching, for NAS-Bench-201 (NB201)
training seeds were taken from the dataset (777, 888 and 999, based on their availability in the
dataset). Therefore, for each experiment we got a total of searching_seeds × training_seeds
searching_seeds ×training_seeds
accuracy values. Whenever average performance is reported, it was averaged across all obtained
results. Similarly, best values were selected by taking the best single result from among the searching
and training seeds.
A.5.1 Experimental Details – NAS-Bench-201
Searching was performed using 4 different seeds (0, 1, 2, and 3) to initialise a supernet. Whenever
we had to perform training of a supernet during the searching phase (Section 3), we used the same
hyperparameters as the original DARTS-PT code. When searching with our Zero-Cost-PT
we used a batch size of 256, N=10, V=100 and S=nwot, unless mentioned otherwise (e.g., during
ablation studies). Inputs for calculating zero-cost scores came from the training dataloader(s),
as defined for the relevant datasets in the original DARTS-PT code (including augmentation). For zero-cost
proxies that require a loss function, standard cross-entropy was used. For any searching method,
after a final subnetwork had been identified by the algorithm, we extracted the final architecture and
queried the NB201 dataset to obtain the test accuracy – one value for each training seed available in the
dataset.
We did not search for architectures targeting CIFAR-100 or ImageNet-16 directly – whenever we
report results for these datasets we used the same architecture that had been found using CIFAR-10.
A.5.2 Experimental Details – DARTS
DARTS experiments follow a similar methodology to NB201. Each algorithm was run with 4 different
initialisation seeds for the supernet (0, 1, 2 and 3). When running Zero-Cost-PT, we used the following
hyperparameters: batch size of 64, N=10, V=100 and S=nwot. Inputs and the loss function for zero-cost
metrics were defined analogously to NB201. We did not run any baseline method on the DARTS
search space (all results were taken from the literature), so we did not have to perform any training
of a supernet. After an algorithm had identified a final subnetwork, we trained it 4 times using
different initialisation seeds (0, 1, 2 and 3). When training subnetworks we used a setting
aligned with the previous work [12, 33].
Unlike NB201, whenever different datasets were considered (Section 5.3) architectures were searched
on each relevant dataset directly.
For CIFAR-10 experiments, we trained models using a heavy configuration with init_channels = 36
and layers = 20. Models found on CIFAR-100 and SVHN were trained using a mobile setting with
init_channels = 16 and layers = 8. Both choices follow the previous work [13, 33].
A.6 Discovered Architectures
Figures A2 and A3 present the cells found by our Zero-Cost-PT on the DARTS CNN search space
(Section 5.2) when using the global-op-iter and random discretization orders, respectively (see
Section 4.2 for the definition of the two discretization orders). Figures A4 through A15 show the cells
discovered on the four DARTS subspaces and the three relevant datasets (Sections 5.3 and A.3).
[Figures A2–A3: cell diagrams, each showing (a) the normal cell and (b) the reduction cell of the discovered architecture.]
Figure A2: Cells found by Zero-Cost-PT (global-op-iter discretization order) on the DARTS search space using CIFAR-10.
Figure A3: Cells found by Zero-Cost-PT (random discretization order) on the DARTS search space using CIFAR-10.
[Figures A4–A8: cell diagrams, each showing (a) the normal cell and (b) the reduction cell of the discovered architecture.]
Figure A4: Cells found by Zero-Cost-PT (random discretization order) on the DARTS-S1 space using CIFAR-10.
Figure A5: Cells found by Zero-Cost-PT (random discretization order) on the DARTS-S2 space using CIFAR-10.
Figure A6: Cells found by Zero-Cost-PT (random discretization order) on the DARTS-S3 space using CIFAR-10.
Figure A7: Cells found by Zero-Cost-PT (random discretization order) on the DARTS-S4 space using CIFAR-10.
Figure A8: Cells found by Zero-Cost-PT (random discretization order) on the DARTS-S1 space using CIFAR-100.
[Figures A9–A14: cell diagrams, each showing (a) the normal cell and (b) the reduction cell of the discovered architecture.]
Figure A9: Cells found by Zero-Cost-PT (random discretization order) on the DARTS-S2 space using CIFAR-100.
Figure A10: Cells found by Zero-Cost-PT (random discretization order) on the DARTS-S3 space using CIFAR-100.
Figure A11: Cells found by Zero-Cost-PT (random discretization order) on the DARTS-S4 space using CIFAR-100.
Figure A12: Cells found by Zero-Cost-PT (random discretization order) on the DARTS-S1 space using SVHN.
Figure A13: Cells found by Zero-Cost-PT (random discretization order) on the DARTS-S2 space using SVHN.
Figure A14: Cells found by Zero-Cost-PT (random discretization order) on the DARTS-S3 space using SVHN.
[Figure A15: cell diagrams, showing (a) the normal cell and (b) the reduction cell of the discovered architecture.]
Figure A15: Cells found by Zero-Cost-PT (random discretization order) on the DARTS-S4 space using SVHN.
Table A2: Raw values of operation scoring functions at iteration 0 to reproduce Figure 2.
edge\op   none       skip_connect   nor_conv_1x1   nor_conv_3x3   avg_pool_3x3
best-acc
0         94.15      94.18          94.44          94.68          93.86
1         94.24      94.16          94.49          94.68          94.09
2         94.25      94.43          94.49          94.68          94.19
3         94.16      94.68          94.03          94.04          93.85
4         94.29      94.18          94.56          94.68          94.23
5         94.05      94.16          94.68          94.56          94.1
avg-acc
0         77.36      81.02          83.81          86.38          87.32
1         80.03      83.11          85.23          85.99          81.52
2         82.9       82.44          84.05          84.49          81.98
3         74.02      85.17          87.3           88.28          81.38
4         80.14      83.05          85.09          85.7           81.89
5         77.61      83.43          86.18          86.95          81.74
disc-acc
0         83.27      82.24          65.0           71.76          54.31
1         84.94      83.23          73.23          76.77          83.45
2         83.87      83.73          77.33          76.83          83.25
3         65.77      84.44          75.82          78.68          62.7
4         83.57      82.03          75.02          76.09          82.56
5         83.95      82.45          66.69          71.36          80.31
darts-pt¹
0         -85.43     -17.02         -78.13         -59.09         -85.34
1         -85.52     -36.1          -84.39         -80.95         -85.49
2         -85.51     -80.29         -81.86         -77.68         -85.32
3         -85.49     -9.86          -81.79         -59.18         -85.48
4         -85.45     -51.15         -78.84         -64.64         -85.14
5         -85.54     -32.43         -81.04         -72.75         -85.51
disc-zc
0         3331.01    3445.49        3366.88        3437.55        3423.18
1         3429.07    3435.75        3407.87        3434.58        3421.44
2         3428.8     3423.36        3440.93        3437.29        3416.89
3         3408.99    3464.05        3359.89        3382.18        3431.81
4         3433.99    3435.57        3424.47        3431.14        3423.15
5         3434.42    3437.66        3418.57        3397.52        3424.17
zc-pt¹
0         -3455.23   -3449.9        -3449.54       -3441.82       -3461.18
1         -3452.15   -3448.7        -3441.81       -3440.65       -3453.74
2         -3446.52   -3447.61       -3435.46       -3436.4        -3449.28
3         -3453.81   -3435.99       -3444.04       -3445.6        -3447.07
4         -3451.06   -3449.8        -3442.63       -3441.13       -3453.31
5         -3450.97   -3448.21       -3440.8        -3443.24       -3452.99
darts
0         0.14       0.48           0.13           0.18           0.07
1         0.12       0.55           0.11           0.12           0.09
2         0.24       0.33           0.15           0.17           0.11
3         0.06       0.65           0.08           0.13           0.07
4         0.12       0.48           0.13           0.17           0.1
5         0.16       0.49           0.12           0.14           0.09
¹Lower is better, so we add a negative sign to the *-pt scores.