Less is More: Proxy Datasets in NAS approaches
Brian Moser1,2, Federico Raue1, Jörn Hees1, Andreas Dengel1,2
1German Research Center for Artificial Intelligence (DFKI), Germany
2TU Kaiserslautern, Germany
Abstract
Neural Architecture Search (NAS) defines the design of
Neural Networks as a search problem. Unfortunately, NAS
is computationally intensive because of various possibili-
ties depending on the number of elements in the design
and the possible connections between them. In this work,
we extensively analyze the role of the dataset size based
on several sampling approaches for reducing the dataset
size (unsupervised and supervised cases) as an agnostic
approach to reduce search time. We compared these tech-
niques with four common NAS approaches in NAS-Bench-
201 in roughly 1,400 experiments on CIFAR-100. One of
our surprising findings is that in most cases we can reduce
the amount of training data to 25%, consequently reducing
search time to 25%, while at the same time maintaining the
same accuracy as if training on the full dataset. Addition-
ally, some designs derived from subsets out-perform designs
derived from the full dataset by up to 22 p.p. accuracy.
1. Introduction
In recent years, a novel field called Neural Architecture
Search (NAS) has gained interest, which aims to automat-
ically find designs instead of hand-designed Neural Net-
works (NNs) created by researchers based on their knowl-
edge and experience [19]. For example, the NAS approach
AmoebaNet (developed by Google) reached state-of-the-art
performances for the ImageNet Classification Task [4, 18].
Despite promising results, the main drawback is the com-
putation time (especially for large datasets) that NAS ap-
proaches require to derive an architecture. In addition, regu-
lar weight optimization of found architecture designs is still
necessary to evaluate the quality of design choices. For in-
stance, this is the case for AmoebaNet, which selects differ-
ent configurations via trial-and-error as an evolutionary ap-
proach. Thus, the selection requires training of each config-
uration to evaluate the fitness. Therefore, researchers tend
to constrain the search space of a given NAS algorithm as a
trade-off to runtime speed [15, 17, 6, 21, 19].
Nevertheless, NAS approaches sometimes use datasets
that are sub-optimal for NAS as a whole. In more detail, the
role of each sample is not always positive and can even hurt
the performance, which is observable for datasets used for
Image Classification tasks like ImageNet [4, 20, 9]. With
this in mind, we are interested in analyzing the role of the
training dataset size as an approach to reducing the search
time in NAS. Thus, this work evaluates several sampling
methods for selecting a subset of a dataset for supervised
and unsupervised scenarios with four NAS approaches from
NAS-Bench-201 [7]. On CIFAR-100 [12], the NAS approach
DARTS [15] derived a baseline architecture with 53.75%
top-1 accuracy in a search time of 54 hours on an
NVIDIA RTX 2080 GPU. In contrast,
it was possible to reach 75.20% top-1 accuracy within a
search time of just 13 hours on 25% of the training data with
the same NAS approach. Furthermore, for another NAS ap-
proach, GDAS [6], it was possible to derive an architecture
with comparable results on a 50% reduced subset compared
to the baseline. The contributions of this work are:
• Evaluation of six different sampling methods for NAS,
  divided into three supervised and three unsupervised
  methods. The evaluation was done with ca. 1,400
  experiments on CIFAR-100 with four NAS algorithms
  from NAS-Bench-201 (DARTS-V1 [15], DARTS-V2
  [15], ENAS [17], and GDAS [6]).
• Improvement of NAS search time by using 25% of the
  dataset, resulting in only 25% of the computation time,
  alongside better cell designs that outperformed the
  baseline, sometimes by a large margin (22 p.p.).
• Explanation of performances by detailed investigation
  of design choices taken by the NAS algorithms and
  showing the generalizability with ImageNet-16-120.
2. Related Work
Two related areas need to be described for this work.
The first area is the role of each sample of the
dataset in model performance. We use the idea of reducing
the dataset size as an approach to scale down the search time
in NAS. The second area is benchmarking NAS approaches
for Image Classification using a common framework called
NAS-Bench-201 to evaluate different sampling methods.
arXiv:2203.06905v1 [cs.LG] 14 Mar 2022
2.1. Proxy Datasets
A proxy usually refers to an intermediary in Computer
Science [8]. In our case, a proxy dataset $D_r$ is an
intermediary between the original dataset $D$ and the NAS
search phase. Mathematically, a proxy dataset is a subset of
the original dataset, $D_r \subset D$, with a size ratio
$r \in (0,1)$. So far, the proxy dataset concept was shown
to be successful in Image Classification tasks [20, 9].
There are two ways of creating such proxy datasets. One way
is to generate synthetic samples which represent a compressed
version of the original dataset. Dataset Distillation [24] and
Dataset Condensation [27] propose similar approaches in which
NNs train on small synthetic datasets (e.g., 10 or 30
samples per class) and reach better results than with real
samples from the original datasets. Another way is to select
only training samples that are beneficial for training.
Schleifer et al. [20] proposed a hyper-parameter search on
proxy datasets such that experiments on the proxies highly
correlate with experimental results on the entire dataset.
Therefore, the hyper-parameter search can be performed
faster on a proxy dataset without forfeiting significant per-
formance on the complete dataset.
In this work, we explore the second approach for se-
lecting samples from training data during the search for
architecture designs. Our goal is to compare several sam-
pling methods that derive proxy datasets and speed up
NAS approaches by reducing the training dataset size (i.e.,
computation-intensive Cell-Based Search).
2.2. NAS-Bench-201
One major problem in NAS research is how hard it is to
compare NAS approaches due to different spaces, e.g., un-
alike macro skeletons or sets of possible operations for an
architecture [28, 23, 17]. Additionally, researchers use dif-
ferent training procedures, such as hyper-parameters, data
augmentation, and regularization, which makes a fair com-
parison even harder [14, 25, 5, 7]. Therefore, Dong et
al. [7] released NAS-Bench-201 (an extension of NAS-
Bench-101 [25]) that supports fair comparisons between
NAS approaches. The benefit of using NAS-Bench-201
is that the contributions of various NAS algorithms can be
assessed efficiently. Its set of operations $\mathcal{O}$
provides five types of operations that are commonly used in
the NAS literature: zeroize, skip connection, 1×1 convolution,
3×3 convolution, and 3×3 average pooling.
In this work, we exploit NAS-Bench-201 for comparing
several NAS approaches applied to proxy datasets.
3. Methodology
Our work relies on several sampling methods for extract-
ing proxy datasets. Additionally, we consider sampling ap-
proaches for both scenarios, supervised and unsupervised,
as will be described in the following sections.
3.1. Proxy Datasets and Sampling
Let $D = \{(X_i, y_i)\}$ be a dataset with cardinality $n_D$.
$X_i$ with $0 \le i < n_D$ denotes the $i$-th sample of the
dataset and $y_i$ its corresponding label, if it exists
(supervised case). For the dataset, $\mathcal{C} = \{C_j\}$
with cardinality $n_C$ denotes the labels. A subset
$D_r \subset D$ with cardinality $n_{D_r}$ will be called proxy.
The ratio $r \in (0,1)$, written as an index here, indicates
the remaining percentage size of the original dataset $D$:
$n_{D_r} \approx r \cdot n_D$ (approximate because the dataset
size is not always divisible without remainder). The proxy
dataset operates between the original dataset and the NN. In
this work, defining a proxy dataset aims to decrease the time
needed for experiments without suffering a quality loss
compared to a run on the entire dataset. Ideally, the proxy
dataset should also improve the quality of the NAS design
choices. The goal is to derive a proxy dataset $D_r \subset D$
with cardinality $n_{D_r} \approx r \cdot n_D$, $r \in (0,1)$.
Thus, for any given $r \in (0,1)$, the sampling method has to
ensure
$$\sum_{X_i \in D} \mathbb{1}_{D_r}(X_i) \approx r \cdot n_D, \tag{1}$$
where $\mathbb{1}_{D_r}: D \to \{0,1\}$ is the indicator function
that indicates the membership of an element in a subset $D_r$
of $D$.
3.2. Unsupervised Sampling
Unsupervised sampling methods do not take the label $y_i$
of a training sample into account for sampling.
3.2.1 Random Sampling (RS)
Each sample of the dataset has an equal probability of
being chosen. Therefore, it holds for $i \neq j$ that
$P[\mathbb{1}_{D_r}(X_i) = 1] = P[\mathbb{1}_{D_r}(X_j) = 1]$.
In consequence, for any $r \in (0,1)$, one can derive a
randomly composed subset $D_r \subset D$ such that
$$\sum_{X_i \in D} P[\mathbb{1}_{D_r}(X_i) = 1] \approx r \cdot n_D \tag{2}$$
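Random Sampling can be illustrated with a minimal sketch. The function name `random_proxy_indices`, the fixed seed, and the rounding of $r \cdot n_D$ are assumptions made for illustration; the 50,000 samples correspond to the 100 classes with 500 training samples each described for CIFAR-100.

```python
import random


def random_proxy_indices(n_samples, r, seed=0):
    """Return sorted indices of a uniformly drawn subset of size ~ r * n_samples."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie in (0, 1)")
    n_keep = round(r * n_samples)      # n_{D_r} is only approximately r * n_D
    rng = random.Random(seed)          # fixed seed: the proxy is selected once
    # Sampling without replacement gives every index the same inclusion probability.
    return sorted(rng.sample(range(n_samples), n_keep))


indices = random_proxy_indices(n_samples=50_000, r=0.25)
print(len(indices))  # 12500
```

Because the subset is drawn without replacement, every sample has the same inclusion probability, and the class balance of the proxy is left to chance.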
3.2.2 K-Means Outlier Removal (KM-OR)
An outlier is a data point that differs significantly from the
leading group of data. While there is no generally accepted
mathematical definition of what constitutes an outlier, in the
context of this work it is straightforward to define outliers
for any $r \in (0,1)$ as the $(1-r) \cdot n_D$ sample points
that have the highest cumulative distance to their group
centers. In order to identify groups, one can use the K-Means
clustering algorithm. Thus, it is possible to derive a proxy
dataset by removing outliers from each cluster. Let $d$ be a
distance metric, e.g., the Frobenius norm. Given the cluster
centroids $S = \{\mu_1, ..., \mu_K\}$ of K-Means and
$r \in (0,1)$, the derived proxy dataset is then
$$D_r = \underset{D' \subset D \ \text{s.t.}\ |D'| \approx r \cdot n_D}{\arg\min} \ \sum_{X_i \in D'} \min_{\mu_k \in S} d(X_i, \mu_k) \tag{3}$$
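K-Means Outlier Removal can be sketched as follows. This is a sketch under assumptions: it uses the Euclidean distance on flat feature vectors, measures each sample against its nearest centroid, and runs a plain Lloyd iteration; the helper names `kmeans` and `kmeans_outlier_removal` are hypothetical and not taken from the released code.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


def kmeans_outlier_removal(points, r, k, seed=0):
    """Keep the ~r * n points that lie closest to their nearest centroid."""
    centroids = kmeans(points, k, seed=seed)

    def distance_to_center(i):
        return min(math.dist(points[i], c) for c in centroids)

    order = sorted(range(len(points)), key=distance_to_center)
    return sorted(order[: round(r * len(points))])


toy = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(kmeans_outlier_removal(toy, r=0.8, k=1))  # [0, 1, 2, 3]
```

With a single cluster the centroid is the mean of all points, so the far-away point (10, 10) is the one removed.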
3.2.3 Loss-value-based sampling via AE (AE)
The typical use case of an Autoencoder (AE) is to approximate
$X_i \approx \mathcal{D}(\mathcal{E}(X_i))$, where $\mathcal{E}$
is an encoder and $\mathcal{D}$ a decoder of an AE [10]. Given a
trained AE, it can provide a distance metric based on a loss
function. Let
$L: \mathbb{R}^{h \times w \times c} \times \mathbb{R}^{h \times w \times c} \to \mathbb{R}$
be a loss function like Mean Squared Error, with $h$, $w$, $c$
as the height, width, and channel size, respectively. Given
$r \in (0,1)$, one can derive a proxy dataset with
$$D_r = \underset{D' \subset D \ \text{s.t.}\ |D'| \approx r \cdot n_D}{\arg\min} \ \sum_{X_i \in D'} L(\mathcal{D}(\mathcal{E}(X_i)), X_i), \tag{4}$$
where samples with high loss-values are removed, which
are typically the hardest samples to reconstruct for the AE.
The opposite direction (arg max instead of arg min) was
tested with less significant results. It can be found in the
supplemental material [1].
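The selection in Equation (4) amounts to ranking samples by their reconstruction loss and keeping the lowest ones. The sketch below assumes flat lists of floats as samples and a callable `reconstruct` that stands in for the trained encoder-decoder pair; both the callable and the function names are hypothetical.

```python
def mse(a, b):
    """Mean Squared Error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def loss_based_proxy(samples, reconstruct, r):
    """Keep the ~r * n samples with the lowest reconstruction loss."""
    losses = [mse(reconstruct(x), x) for x in samples]
    order = sorted(range(len(samples)), key=losses.__getitem__)
    return sorted(order[: round(r * len(samples))])


# Toy stand-in for an AE: it can only reproduce values inside [0, 1].
def clamp_reconstruct(x):
    return [min(max(v, 0.0), 1.0) for v in x]


samples = [[0.2, 0.4], [0.9, 0.1], [3.0, 0.5], [0.5, 0.5]]
print(loss_based_proxy(samples, clamp_reconstruct, r=0.75))  # [0, 1, 3]
```

Replacing the ascending sort by a descending one would give the arg max variant mentioned above.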
3.3. Supervised Sampling
Supervised sampling methods take the label $y_i$ of a training
sample into account for sampling.
3.3.1 Class-Conditional Random Sampling (CC-RS)
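Each sample of a class of the dataset has an equal probability
of being chosen. In contrast to pure Random Sampling,
the class is considered to ensure an equal sampling within
each class:
$$\sum_{X_i \in D} P[\mathbb{1}_{D_r}(X_i) = 1] \approx r \cdot n_D \quad \text{s.t.} \quad P[\mathbb{1}_{D_r}(X_i) = 1 \mid y_i = C_j] = P[\mathbb{1}_{D_r}(X_i) = 1] \quad \forall C_j \in \mathcal{C} \tag{5}$$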
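A minimal sketch of Class-Conditional Random Sampling draws the same fraction from every class. The function name and the per-class rounding are assumptions for illustration.

```python
import random
from collections import defaultdict


def class_conditional_random_indices(labels, r, seed=0):
    """Draw ~r of the samples of every class uniformly at random."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    keep = []
    for y in sorted(by_class):          # fixed class order for reproducibility
        members = by_class[y]
        keep.extend(rng.sample(members, round(r * len(members))))
    return sorted(keep)


# CIFAR-100-like layout: 100 classes with 500 training samples each.
labels = [c for c in range(100) for _ in range(500)]
proxy = class_conditional_random_indices(labels, r=0.25)
print(len(proxy))  # 12500
```

In contrast to pure Random Sampling, the class proportions of the proxy match those of the full dataset exactly (up to rounding).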
3.3.2 Class-Conditional Outlier Removal (CC-OR)
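Similar to K-Means Outlier Removal, one can use the class
centroids in order to define clusters. Therefore, it is
possible to derive a proxy dataset by removing outliers from
each class. Let $C_j \in \mathcal{C}$ be a class, $\mathbf{C}_j$
be the data points that lie within the class $C_j$, and
$\mu_{C_j}$ the class centroid of $C_j$. Collecting the samples
to derive a proxy dataset $D_r$ with $r \in (0,1)$ can be
defined as
$$D_r = \bigcup_{C_j \in \mathcal{C}} \ \underset{D' \subset \mathbf{C}_j \ \text{s.t.}\ |D'| \approx r \cdot n_{C_j}}{\arg\min} \ \sum_{X_i \in D'} d(X_i, \mu_{C_j}) \tag{6}$$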
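Class-Conditional Outlier Removal can be sketched in the same spirit, with class means taking the role of the K-Means centroids. The Euclidean distance, the flat feature vectors, and the function name are assumptions made for this illustration.

```python
import math
from collections import defaultdict


def class_conditional_outlier_removal(points, labels, r):
    """Per class, keep the ~r fraction of points closest to the class centroid."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    keep = []
    for members in by_class.values():
        coords = zip(*(points[i] for i in members))
        centroid = [sum(c) / len(members) for c in coords]
        ranked = sorted(members, key=lambda i: math.dist(points[i], centroid))
        keep.extend(ranked[: round(r * len(members))])
    return sorted(keep)


points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10),
          (20, 20), (21, 20), (20, 21), (21, 21), (40, 40)]
labels = [0] * 5 + [1] * 5
print(class_conditional_outlier_removal(points, labels, r=0.8))
# [0, 1, 2, 3, 5, 6, 7, 8]
```

The two far-away points (indices 4 and 9) are removed, one per class, so the class balance is preserved.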
3.3.3 Loss-value-based sampling via Transfer Learning (TL)
Instead of using centroids or an AE, one can use Transfer
Learning (TL) to derive a classifier $\varphi$ that enables a
loss-value-based distance metric. Hence, it gives the
possibility to obtain a proxy dataset of $D$ with the
easiest-to-classify samples. Given $r \in (0,1)$, one can
derive such a proxy dataset, where samples with high
loss-values are removed:
$$D_r = \underset{D' \subset D \ \text{s.t.}\ |D'| \approx r \cdot n_D}{\arg\min} \ \sum_{X_i \in D'} L(\varphi(X_i), y_i) \tag{7}$$
In the following, we use a classifier $\varphi$ trained on
ImageNet and transfer it to CIFAR-100 via Transfer Learning.
Similar to the unsupervised case AE, the opposite direction
(arg max instead of arg min) was tested with less significant
results. It can be found in the supplemental material [1].
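The supervised loss-value-based selection can be sketched as follows. The cross-entropy loss is an assumption (the text does not name the loss used for the classifier), and `classifier` is a hypothetical callable that returns class probabilities; here it is replaced by a lookup table.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label (clipped for stability)."""
    return -math.log(max(probs[label], 1e-12))


def easiest_samples(samples, labels, classifier, r):
    """Keep the ~r * n samples with the lowest classification loss."""
    losses = [cross_entropy(classifier(x), y) for x, y in zip(samples, labels)]
    order = sorted(range(len(samples)), key=losses.__getitem__)
    return sorted(order[: round(r * len(samples))])


# Toy stand-in for a fine-tuned classifier: fixed probabilities per sample id.
table = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
sample_ids = [0, 1, 2, 3]
true_labels = [0, 1, 1, 0]
print(easiest_samples(sample_ids, true_labels, lambda x: table[x], r=0.5))  # [0, 1]
```

Samples 2 and 3 are misclassified by the toy classifier, receive the highest loss, and are therefore dropped.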
4. Experiments
This Section introduces the CIFAR-100 dataset and the
evaluation strategy used in this work. Then, it continues
with the quantitative results, which show the performance
of all experiments and the time savings, and ends with the
qualitative results. The code for our experiments can be
found on GitHub¹. For more details, the supplemental material
[1] lists the hyper-parameters for the experiments as
well as other experimental results, e.g., alternative loss-
value-based sampling methods or additional K-values for
K-Means Outlier Removal.
4.1. Data Set
The dataset $D$ used in this work is CIFAR-100 [12],
which is a standard dataset used for benchmarking NAS
approaches because of its size and complexity compared
to CIFAR-10 [11] and SVHN [16]. It contains $n_C = 100$
classes and has a uniform distribution of samples between
all classes ($n^{\text{train}}_{C_j} = 500$ training and
$n^{\text{test}}_{C_j} = 100$ testing samples
$\forall C_j \in \mathcal{C}$). In total, $D$ has
$n_D = 60{,}000$ samples.
4.2. Evaluation Strategy
This work uses NAS-Bench-201² (MIT licence) as a
framework for the search space as discussed in Section 2.2.
2 Y/AutoDL-Projects/
[Figure 1 diagram: Modify dataset to reduce size → Cell Search on modified dataset → Cell Evaluation from scratch on full dataset]
Figure 1. Evaluation Process. First, sampling is applied to derive
a proxy dataset. After that, this work applies a NAS algorithm to
the reduced proxy dataset. Next, the derived Cell is trained from
scratch on the full dataset.
The NAS algorithms used in this work are DARTS (first-
order approximation V1 and second-order approximation
V2) [15], ENAS [17], and GDAS [6]. The sample ratios
are defined by $r \in \{0.25, 0.50, 0.75, 1.0\}$, where
$r = 1.0$ means that the whole dataset is used. Given any
sampling method, the proxy dataset $D_r$ is selected once as
discussed in Section 3 and evaluated with the
previously-mentioned NAS algorithms for a fair comparison. We
evaluate the sampling methods on the proxy dataset $D_r$ based
on the NAS algorithm, which returns a cell design.
Two processes are essential in our experimental setup:
Cell Search and Cell Evaluation. The Cell Search uses the
proxy dataset and is applied once for all NAS approaches
and sampling methods. Additionally, we have fixed the
channel size of each convolution layer in the cell to 16 since
it is suggested by the NAS-Bench-201 framework. After-
wards, the Cell Evaluation process starts: A macro skeleton
network uses the cell design and is trained from scratch on
the original dataset $D$. This is repeated three times with different
weight initializations for evaluating the robustness of
the cell design. Thus, the results report a mean and standard
deviation value. Figure 1 illustrates the proxy dataset sam-
pling and the processes taken afterward. Compared to the
default setup of NAS-Bench-201, where the Cell Evalua-
tion is done with one fixed channel size, this work extended
the evaluation for different channel sizes (16, 32, and 64) to
survey the scalability of the found cell designs.
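The Cell Evaluation protocol described above can be sketched as a small loop. `train_and_test` is a hypothetical callable that trains the macro skeleton with the found cell from scratch on the full dataset and returns top-1 accuracy in percent; the use of the sample standard deviation is also an assumption.

```python
import statistics


def evaluate_cell(train_and_test, channels=(16, 32, 64), seeds=(0, 1, 2)):
    """Train one cell design per channel size and seed; report (mean, std)."""
    report = {}
    for c in channels:
        accs = [train_and_test(channels=c, seed=s) for s in seeds]
        report[c] = (statistics.mean(accs), statistics.stdev(accs))
    return report


# Dummy stand-in for the expensive training run, for illustration only.
def fake_training(channels, seed):
    return 30.0 + channels / 16 + seed


for c, (mean, std) in evaluate_cell(fake_training).items():
    print(f"C={c}: {mean:.1f} ±{std:.1f}")
```

Each cell design therefore costs nine training runs (three channel sizes times three initializations).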
In summary, each sampling method is applied under two
conditions. The first condition is the size of the dataset
after applying the sampling method, for which there are three
different $r$ settings. The second condition is the NAS
approach, of which there are four in this work. Furthermore,
the Cell Evaluation applies three different channel sizes,
each repeated three times with non-identical starting weights,
which results in nine experiments for each design choice.
Thus, we run $(4 \cdot 3) + (4 \cdot 3) \cdot (3 \cdot 3) = 120$
experiments to evaluate an individual sampling method. This
work presents six sampling methods (listed in Section 3).
Additionally, we evaluate five other sampling approaches (two
alternative loss-value-based formulations and three other
K-values for K-Means) for completeness, and they are presented
in the supplemental material [1]. As a result, we run roughly
1,400 experiments ($11 \cdot 120$ + baselines).
Table 1. Cell Evaluation with DARTS-V1 (all accuracy values in
percent). The poorly performing so-called “local optimum cell
design” consisting only of Skip-Connections (discussed in
Section 4.4.2) is marked with *. It is found on randomly sampled
proxy datasets. Besides that, cell designs derived in all other
unsupervised cases reach results similar to the baseline. However,
cell designs for the supervised case outperform the baseline by a
large margin (7.3 p.p. up to 34 p.p. for C=16). Note also that
manually adding more channels (C=32 and C=64) to the found
cell design has a positive effect. The performance improvement
gained by adding more channels seems to be consistent.
Method  r  Acc. (C=16)  Acc. (C=32)  Acc. (C=64)
Full DS 1.0 32.6 ±0.7 40.2 ±0.4 47.3 ±0.3
RS .75 40.0 ±0.1 50.9 ±0.5 55.2 ±0.5
RS* .50 15.9 ±0.7 17.7 ±0.2 18.3 ±0.1
RS* .25 15.9 ±0.7 17.7 ±0.2 18.3 ±0.1
KM-OR .75 35.2 ±0.2 44.8 ±0.3 51.9 ±0.7
KM-OR .50 27.3 ±0.8 33.1 ±0.2 38.5 ±0.4
KM-OR .25 40.0 ±0.1 51.0 ±0.5 55.3 ±0.5
AE .75 34.1 ±0.3 42.6 ±0.2 50.6 ±0.0
AE .50 66.6 ±0.5 72.4 ±0.3 75.7 ±0.1
AE .25 33.3 ±0.4 42.3 ±0.2 50.6 ±0.3
CC-RS* .75 15.9 ±0.7 17.7 ±0.2 18.3 ±0.1
CC-RS* .50 15.9 ±0.7 17.7 ±0.2 18.3 ±0.1
CC-RS* .25 15.9 ±0.7 17.7 ±0.2 18.3 ±0.1
CC-OR .75 63.8 ±0.4 70.4 ±0.3 74.9 ±0.5
CC-OR .50 64.1 ±0.3 70.4 ±0.2 75.4 ±0.3
CC-OR .25 54.6 ±0.2 59.1 ±0.4 60.8 ±1.1
TL .75 60.6 ±0.4 62.3 ±0.5 64.6 ±0.3
TL .50 60.6 ±0.2 62.8 ±0.4 65.0 ±0.5
TL .25 39.9 ±0.4 51.3 ±0.2 57.0 ±0.5
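The experiment count from Section 4.2 can be checked with a few lines of arithmetic. The number of baseline runs is an assumption: the text only states "+ baselines", and the sketch assumes one search plus nine evaluations per NAS algorithm on the full dataset.

```python
nas_algorithms = 4     # DARTS-V1, DARTS-V2, ENAS, GDAS
ratios = 3             # r in {0.75, 0.50, 0.25}
channel_sizes = 3      # C in {16, 32, 64}
repetitions = 3        # different weight initializations

cell_searches = nas_algorithms * ratios                          # 4 * 3
cell_evaluations = cell_searches * channel_sizes * repetitions   # (4 * 3) * (3 * 3)
per_sampling_method = cell_searches + cell_evaluations

sampling_methods = 6 + 5   # six presented here, five in the supplemental material
# Assumption: each algorithm is searched once and evaluated nine times at r = 1.0.
baseline_runs = nas_algorithms * (1 + channel_sizes * repetitions)
total = sampling_methods * per_sampling_method + baseline_runs

print(per_sampling_method, total)  # 120 1360
```

Under this assumption the total of 1,360 runs is consistent with the stated "roughly 1,400 experiments".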
4.3. Quantitative Analysis
This Section describes and analyzes the results of all four
NAS algorithms applied on sampled proxy datasets (as dis-
cussed in Section 3).
4.3.1 DARTS-V1
Table 1 shows the Cell Evaluation for DARTS-V1. As
mentioned in Section 4.2, the Cell Search was done with
a channel size of 16, and the found cell design was eval-
uated by training it from scratch on the full dataset with
channel sizes of 16, 32, and 64. Apart from RS and CC-RS,
the proxy datasets worked well for DARTS-V1. As can
be observed, even the worst accuracy results among proxy
datasets derived by loss-value-based (AE & TL) sampling
are similar to the baseline, which makes it possible to
reduce the size to 25%. Additionally, it was possible to
outperform the baseline on 25% of the dataset size with
CC-OR. The best performance gain (AE, $r = 0.5$) achieved a
margin of +28.4 p.p. compared to the baseline. Also, one can
observe that increasing the channel sizes consistently
increases the network performance. Regarding RS, DARTS-V1
and the following NAS algorithms derive a cell design
consisting only of Skip-Connections (marked with *). This
explains the poor accuracy during Cell Evaluation because
such a cell has no learnable parameters. This happens due to
instability within DARTS, which is known in the literature [2]
and discussed in Section 4.4.2.
4.3.2 DARTS-V2
Table 2 shows the Cell Evaluation for DARTS-V2, which
is similar to DARTS-V1. However, the evaluation on all
loss-value-based (AE & TL) sampled proxy datasets deliv-
ers better results than the baseline. The best performance
gain (TL, r= 0.5) achieved a margin of +22 p.p. compared
to baseline. The AE approach even gets better with decreas-
ing dataset size. The observation of good results also holds
for CC-OR. For KM-OR and CC-RS, the results are close to
the baseline if the cell design with only Skip-Connections is
not derived. Nevertheless, the proxy dataset derived by RS
leads to the aforementioned bad-performing design for all $r$.
4.3.3 ENAS
Table 3 shows the Cell Evaluation for ENAS. Unfortunately,
ENAS performed poorly on all datasets, including the
baseline. However, the evaluations on the proxy datasets
reach results similar to the baseline in almost all
experiments, which are also close to the results reported by
Dong et al. [7]. Surprisingly, the best cell design is the
design with only Skip-Connections. The worse results of the
other designs can be explained by cells consisting of
Skip-Connections and Average-Connections, where the averaging
effect seems to lower the performance further. We observed
that ENAS only chooses between those two operations, which
will be examined in Section 4.4.4 in more detail.
4.3.4 GDAS
Table 4 shows the Cell Evaluation for GDAS. It performs
robustly on proxies, which means that almost all
experiments reach performances similar to the baseline,
except for $r = 0.25$ (excluding CC-OR). Another exception
is the proxy dataset KM-OR with $r = 0.50$.
Thus, GDAS also benefits from proxy datasets down to a size of 50%.
Table 2. Cell Evaluation with DARTS-V2. The local optimum cell
design (discussed in Section 4.4.2) is marked with *. The results
are similar to DARTS-V1, slightly better for AE, TL, and CC-OR.
Method  r  Acc. (C=16)  Acc. (C=32)  Acc. (C=64)
Full DS 1.0 35.7 ±0.5 46.0 ±0.3 53.7 ±0.4
RS* .75 15.9 ±0.7 17.7 ±0.2 18.3 ±0.1
RS* .50 15.9 ±0.7 17.7 ±0.2 18.3 ±0.1
RS* .25 15.9 ±0.7 17.7 ±0.2 18.3 ±0.1
KM-OR .75 32.1 ±0.6 39.7 ±0.4 46.4 ±0.2
KM-OR* .50 15.9 ±0.6 17.7 ±0.2 18.3 ±0.1
KM-OR .25 32.1 ±0.6 39.7 ±0.4 46.4 ±0.2
AE .75 46.8 ±0.6 56.1 ±0.3 58.9 ±0.1
AE .50 58.6 ±0.3 60.0 ±0.2 61.7 ±0.1
AE .25 65.9 ±0.6 71.6 ±0.1 75.2 ±0.4
CC-RS .75 33.5 ±0.2 42.0 ±0.5 49.5 ±0.4
CC-RS .50 35.2 ±0.2 44.8 ±0.3 51.9 ±0.7
CC-RS* .25 15.9 ±0.7 17.7 ±0.2 18.3 ±0.1
CC-OR .75 66.6 ±0.2 72.5 ±0.2 75.7 ±0.4
CC-OR .50 57.0 ±0.8 60.9 ±0.7 64.2 ±0.3
CC-OR .25 58.6 ±0.3 62.4 ±0.1 66.6 ±0.2
TL .75 58.8 ±0.3 60.0 ±0.3 62.2 ±0.5
TL .50 67.0 ±0.2 73.1 ±0.1 75.7 ±0.2
TL .25 49.0 ±0.2 54.5 ±0.4 54.7 ±0.1
Also, one can observe that the cell design with only Skip-
Connections does not appear, which indicates that GDAS is
more stable in search than DARTS or ENAS.
4.3.5 Time Savings
The search time during the experiments was measured with
a system setup of a GPU model RTX 2080 by NVIDIA
and a CPU model i9-9820X by Intel. Figure 2 shows the
time savings and Top-1 accuracy over r. We can observe a
linear dependency between search time and sampling size.
Interestingly, ENAS does not profit like other approaches
because the time savings are present for the controller train-
ing, not the child model’s training, see Pham et al. [17].
DARTS-V2 has the most significant gap between baseline
and the time needed for the most reduced proxy dataset
(r= 0.25) with around 41 hours. Thus, it is possible to de-
rive a superior performing cell design with Cell Search and
Cell Evaluation in one day with this setup. In addition, it
demonstrates that proxy datasets can improve the accuracy
of the resulting architectures. This is a very significant ob-
servation, especially for DARTS, where a long search time
is necessary otherwise.
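The linear dependency between search time and sampling size can be turned into a rough estimate. The sketch assumes an exactly proportional relation without fixed overheads and uses the 54-hour DARTS baseline reported in the introduction; the function name is hypothetical.

```python
def estimated_search_hours(full_hours, r):
    """Linear model: search time scales with the kept fraction r of the data."""
    return r * full_hours


full_hours = 54.0  # DARTS baseline search time on the full dataset (hours)
for r in (1.0, 0.75, 0.50, 0.25):
    hours = estimated_search_hours(full_hours, r)
    print(f"r={r:.2f}: ~{hours:.1f} h, saving ~{full_hours - hours:.1f} h")
```

For $r = 0.25$ the model predicts 13.5 hours and a saving of 40.5 hours, which is close to the measured 13 hours and the gap of around 41 hours. As noted above, this proportionality does not carry over to ENAS in the same way.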
Table 3. Cell Evaluation with ENAS. The local optimum cell de-
sign (discussed in Section 4.4.2) is marked with *, which results
in the best performing cell design. ENAS achieves only low ac-
curacy compared to the other NAS approaches. Thus, there is no
significant performance improvement or drop. All designs found
on proxy datasets reach similar results compared to the baseline.
Method  r  Acc. (C=16)  Acc. (C=32)  Acc. (C=64)
Full DS 1.0 11.0 ±0.5 12.5 ±0.0 13.0 ±0.0
RS .75 11.7 ±0.53 13.2 ±0.1 13.4 ±0.0
RS .50 11.0 ±0.50 12.5 ±0.0 13.0 ±0.0
RS .25 11.9 ±0.35 13.8 ±0.1 13.4 ±0.0
KM-OR .75 11.0 ±0.5 12.5 ±0.0 13.0 ±0.0
KM-OR .50 11.5 ±0.4 12.8 ±0.1 13.1 ±0.2
KM-OR .25 11.4 ±0.4 12.8 ±0.1 13.1 ±0.2
AE .75 11.4 ±0.4 12.8 ±0.1 13.1 ±0.2
AE .50 11.0 ±0.5 12.5 ±0.0 13.0 ±0.1
AE .25 11.4 ±0.4 12.8 ±0.1 13.1 ±0.2
CC-RS .75 10.6 ±0.3 12.0 ±0.1 12.6 ±0.1
CC-RS .50 11.9 ±0.4 13.2 ±0.1 13.4 ±0.0
CC-RS .25 11.4 ±0.4 12.8 ±0.1 13.1 ±0.2
CC-OR .75 11.9 ±0.4 13.2 ±0.1 13.4 ±0.0
CC-OR* .50 15.9 ±0.7 17.7 ±0.2 18.3 ±0.1
CC-OR .25 12.2 ±0.4 13.3 ±0.1 13.5 ±0.1
TL .75 12.3 ±0.4 13.4 ±0.1 13.6 ±0.1
TL .50 11.7 ±0.3 12.9 ±0.1 13.2 ±0.0
TL .25 11.9 ±0.4 13.2 ±0.1 13.4 ±0.0
[Figure 2 plots: search time (Time [h]) and top-1 accuracy (Top-1 [%]) over r ∈ {1.00, 0.75, 0.50, 0.25}]
Figure 2. Time savings and Accuracy results. The search time dur-
ing Cell Search (all experiments) decreases with the dataset size.
The standard deviation in time saving is not displayed because it
is close to zero. In addition, top-1 accuracy mean and standard
deviation for loss-value-based (AE & TL) sampling is plotted. In-
terestingly, the accuracy improves for DARTS with decreasing r.
Table 4. Cell Evaluation with GDAS. There is no significant differ-
ence when comparing most experimental results with the baseline.
Nevertheless, one can observe a performance drop for r= 0.25.
Moreover, the local optimum cell (discussed in Section 4.4.2)
does not occur for GDAS, which indicates a more stable NAS
algorithm compared to DARTS and ENAS.
Method  r  Acc. (C=16)  Acc. (C=32)  Acc. (C=64)
Full DS 1.0 65.8 ±0.3 71.6 ±0.3 74.3 ±0.5
RS .75 66.2 ±0.0 71.3 ±0.3 73.8 ±0.2
RS .50 66.3 ±0.3 71.1 ±0.1 74.4 ±0.3
RS .25 47.7 ±0.1 57.0 ±0.4 60.4 ±0.5
KM-OR .75 65.9 ±0.8 70.9 ±0.2 74.2 ±0.4
KM-OR .50 60.2 ±0.3 65.6 ±0.5 70.4 ±0.2
KM-OR .25 60.4 ±0.6 66.0 ±0.8 70.5 ±0.4
AE .75 65.1 ±0.6 69.8 ±0.9 73.8 ±0.4
AE .50 64.2 ±0.4 69.4 ±0.2 73.0 ±0.6
AE .25 57.4 ±0.8 63.9 ±0.4 68.7 ±0.5
CC-RS .75 65.8 ±0.3 71.6 ±0.3 74.3 ±0.5
CC-RS .50 65.7 ±0.1 71.0 ±0.3 74.1 ±0.8
CC-RS .25 30.6 ±1.0 39.0 ±0.8 45.9 ±0.3
CC-OR .75 66.7 ±0.5 71.5 ±0.2 75.4 ±0.2
CC-OR .50 65.9 ±0.2 71.4 ±0.3 75.1 ±0.5
CC-OR .25 64.8 ±0.2 70.7 ±0.4 74.6 ±0.2
TL .75 66.0 ±0.5 71.5 ±0.4 74.2 ±0.1
TL .50 66.4 ±0.3 71.7 ±0.4 75.0 ±0.4
TL .25 54.2 ±0.0 62.4 ±0.6 65.6 ±0.1
[Figure 3 panels: Supervised: TL 50%, DARTS-V2 (75.70%); Unsupervised: AE 50%, DARTS-V1 (75.72%)]
Figure 3. Best performing cell designs for the unsupervised and
supervised case.
4.4. Qualitative Analysis
This Section discusses the best performing cells, a local
optimum cell design, and the operational decisions taken by
NAS algorithms on the proxy datasets.
4.4.1 Best Performing Cells
Loss-value-based sampling methods (AE & TL) found the
best performing cell designs within proxy datasets, shown
in Figure 3. DARTS-V2 found it via the supervised
sampling method TL ($r = 0.50$) with 75.70% accuracy. It
outperforms its baseline accuracy of 53.74% by +21.96 p.p.
and GDAS (baseline) with 74.31% accuracy by +1.39 p.p.,
achieving a time saving of roughly 27.5 hours. Better by
a small margin of +0.02 p.p. is the best cell design found
via DARTS-V1 and the unsupervised sampling method AE
($r = 0.50$), reaching 75.72% top-1 accuracy. It comes with
a time saving of ca. nine hours and a performance +28.42
p.p. better than its baseline.
Figure 4. Local optimum cell. One downside of proxy datasets is
that many cell designs converge to Skip-Connections between the
vertices for unstable NAS algorithms like DARTS. Hence, the cell
design does not contain any learnable parameter.
4.4.2 Local Optimum Cell
The experiments show that some Cell Searches obtained the
same cell design, which yields very poor accuracy. The most
remarkable detail of this design is that the cell only uses
Skip-Connections. Consequently, it does not contain any
learnable parameter in the cell, which explains the poor
performance without further analysis. Figure 4 illustrates the
cell. For DARTS, this is called the aggregation of
Skip-Connections. The local optimum cell problem is known
[3, 13, 2, 26]. Unfortunately, there is no perfect solution
to this problem up to this point. Nonetheless, proxy
datasets seem to encourage this problem, so they can serve as a
benchmark for possible solutions. This work’s experiments
show that proxy datasets can increase the search process’s
instability, especially for DARTS. On the other hand, the
large number of experiments shows that this is not
true for sample-based methods like GDAS. Thus, adding
stochastic uncertainty seems to increase the stability.
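Whether a found cell is such a degenerate design can be checked mechanically. The sketch below assumes the five NAS-Bench-201 operation types and uses hypothetical string labels for them; a cell is represented as a list with one operation per edge.

```python
# Hypothetical labels for the five operation types of the search space.
PARAMETER_FREE_OPS = {"zeroize", "skip", "avg_pool_3x3"}
LEARNABLE_OPS = {"conv_1x1", "conv_3x3"}


def has_learnable_parameters(cell_ops):
    """True if at least one edge of the cell carries a convolution."""
    return any(op in LEARNABLE_OPS for op in cell_ops)


def is_skip_only(cell_ops):
    """True for the local optimum cell: every edge is a Skip-Connection."""
    return all(op == "skip" for op in cell_ops)


local_optimum_cell = ["skip"] * 10
print(is_skip_only(local_optimum_cell), has_learnable_parameters(local_optimum_cell))
# True False
```

Such a check could be used to flag a search run as collapsed before the expensive Cell Evaluation starts.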
4.4.3 Generalizability
In order to test the generalizability, we applied the best
three cell designs (supervised and unsupervised, see
Section 4.4.1) derived from CIFAR-100 to ImageNet-16-120,
which is a down-sampled variant (16 × 16) of 120 classes
from ImageNet [4]. Table 5 lists the results. Interestingly, a
performance drop is observable when applying designs derived
from CIFAR-100 to another dataset like ImageNet-16-120.
Nonetheless, the top-3 performing cell designs (supervised
and unsupervised) still outperform the baseline except for
one case (TL, $r = 0.50$, GDAS). We can therefore conclude
that searching on proxy datasets does not hurt the
generalizability of found cell designs since the performance
drop does not differ significantly from that of non-proxy
datasets. Moreover, as observed in Section 4.4.2, we can
conclude that the bad experimental results of NAS-Bench-201
for DARTS are a product of the local optimum cell,
which also explains the zero variance. This is because a cell
design consisting only of Skip-Connections does not work
better by applying different weight initializations.
Table 5. Top-1 accuracy of the top-3 best performing cells
derived by loss-value-based sampling and applied on CIFAR-100
and ImageNet-16-120. Note that the NAS approaches are trained
on CIFAR-100 and are evaluated on ImageNet-16-120 similar to
Dong et al. [7]. Also, our baseline uses a different macro
skeleton than NAS-Bench-201 and outperforms their results. The
presented sampling methods (AE & TL) reach better results in
both datasets than all baselines and ResNet using proxy datasets.
Method             r    Model     CIFAR-100 Acc [%]  ImageNet-16-120 Acc [%]
NAS-Bench-201 [7]  1.0  ResNet    70.9               43.6
                   1.0  DARTS-V1  15.6 ±0.0          16.3 ±0.0
                   1.0  DARTS-V2  15.6 ±0.0          16.3 ±0.0
                   1.0  GDAS      70.6 ±0.3          41.8 ±0.9
Baseline (ours)    1.0  DARTS-V1  47.3 ±0.3          28.7 ±0.7
                   1.0  DARTS-V2  53.7 ±0.4          31.3 ±0.5
                   1.0  GDAS      74.3 ±0.5          46.5 ±0.4
AE                 .50  DARTS-V1  75.7 ±0.1          54.0 ±0.5
                   .75  GDAS      73.8 ±0.4          47.9 ±1.0
                   .50  GDAS      73.0 ±0.6          48.1 ±0.2
TL                 .50  DARTS-V2  75.7 ±0.2          54.2 ±0.8
                   .50  GDAS      75.0 ±0.4          36.4 ±0.7
                   .75  GDAS      74.2 ±0.1          48.0 ±0.5
4.4.4 Operation Decisions
This work was also interested in seeing how each NAS al-
gorithm’s operation choices change when applied to proxy
datasets. Thus, we derived an empirical probability distri-
bution from all experiments and show the operations taken
for each edge. This is done for all four algorithms to make
a comparison feasible. Consequently, it enables a distribu-
tion comparison with decreasing sampling size concerning
operation choice. It will be referred to as Cell Edge Distri-
bution in the following and is shown in Figure 5. To the best
of our knowledge, this is the first work that uses this kind of
The most conspicuous finding in the distribution is the
coloring for ENAS. It picks no operations other than
Average- or Skip-Connections (red and blue color). Thus,
similar to the local optimum cell of Section 4.4.2, its cells
do not contain any learnable parameters. This holds for all
r ∈ {0.75, 0.50, 0.25}. Consequently, it explains
the low performance of ENAS throughout the experiments
and its high robustness concerning the sampling methods
tested. Unfortunately, this raises the question of whether
the NAS algorithm itself performs poorly on NAS-Bench-201
or whether the implementation in the code framework of the
authors of NAS-Bench-201 is incorrect.
Figure 5. Cell Edge Distribution for r = 0.75, r = 0.50, and r = 0.25. It shows the probability of an operation for a specific edge, following the notation at the bottom. The
edges are numbered, and the operations are colored according to the legend on the right. Each cell design has ten edges, and therefore, there
are ten pie charts for each entry. A strong dominance of Skip-Connections can be observed for DARTS and ENAS. In addition, ENAS only
chooses between Skip- and Average-Connections, which explains its poor performance in the experiments. In contrast, GDAS uses more
convolution operations. An exception is r = 0.25, where Average-Connections become dominant, which aligns with the performance
drop observed in the experiments.
For DARTS-V1 and V2,
one can observe that both make similar decisions. Also,
a strong dominance of Skip-Connections (blue color) is
present. Relating to Section 4.4.2, the phenomenon of a
cell design with only Skip-Connections extends to a general
affinity for Skip-Connections. GDAS differs significantly
from the other NAS approaches. It relies mainly
on Convolution operations. The Zeroize-Connections on edges
3, 5, and 6, which are very present for r = 0.75 and
partially for r = 0.50, indicate a similarity to the
Inception modules (wide cell design) of GoogLeNet [22]. As for
DARTS, Average-Connections become more present with
decreasing r, especially for the edges to the last vertex (7-10).
This is a possible explanation for why GDAS underperforms for
r = 0.25 in almost all experiments.
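The Cell Edge Distribution discussed in this section can be illustrated with a minimal sketch. It assumes each search run yields one cell encoded as a sequence of operation labels, one per edge. The labels in `OPERATIONS`, the function name `cell_edge_distribution`, and the toy runs are hypothetical and are not taken from the NAS-Bench-201 code base. The toy example uses three edges for brevity, whereas the cells analyzed here have ten.

```python
from collections import Counter

# Illustrative operation labels (assumed names, not taken from the paper's code).
OPERATIONS = ("zeroize", "skip", "conv_1x1", "conv_3x3", "avg_pool")


def cell_edge_distribution(cells, num_edges):
    """Return, per edge, the empirical probability of each operation.

    `cells` is a list of discovered cells; each cell is a sequence of
    `num_edges` operation labels, one label per edge.
    """
    if not cells:
        raise ValueError("at least one cell is required")
    counts = [Counter() for _ in range(num_edges)]
    for cell in cells:
        if len(cell) != num_edges:
            raise ValueError("every cell must assign one operation per edge")
        for edge, op in enumerate(cell):
            if op not in OPERATIONS:
                raise ValueError(f"unknown operation: {op}")
            counts[edge][op] += 1
    total = len(cells)
    # Relative frequency of every operation on every edge.
    return [
        {op: counts[edge][op] / total for op in OPERATIONS}
        for edge in range(num_edges)
    ]


# Toy example: four hypothetical search runs on a three-edge cell.
runs = [
    ("skip", "skip", "avg_pool"),
    ("skip", "conv_3x3", "avg_pool"),
    ("skip", "skip", "skip"),
    ("skip", "zeroize", "avg_pool"),
]
dist = cell_edge_distribution(runs, num_edges=3)
for edge, probs in enumerate(dist, start=1):
    print(edge, probs)
```

Each dictionary in `dist` corresponds to one pie chart in a figure of this kind. Computing one such list per algorithm and per ratio r would give the grid of entries that is compared across decreasing sampling sizes.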
5. Conclusion & Future Work
In this paper, we explored several sampling methods
(supervised and unsupervised) for creating smaller proxy
datasets, thereby reducing the search time of NAS
approaches. Our evaluation is based on the prominent
NAS-Bench-201 framework, to which we add the dataset ratio
and different reduction techniques, resulting in roughly
1,400 experiments on CIFAR-100. We further show the
generalizability of the discovered architectures to
ImageNet-16-120. Within the evaluation, we find that many
NAS approaches benefit from reduced dataset sizes (in
contrast to current trends in research): not only does the
training time decrease linearly with the dataset reduction,
but the accuracy of the resulting cells is also often higher
than when training on the full dataset. Along those lines,
DARTS-V2 found a cell design that achieves 75.7% accuracy
with only 25% of the dataset, whereas the NAS baseline
achieves only 53.7% with all samples. Hence, reducing the
size of the dataset not only helps to reduce the NAS search
time but often also improves the resulting accuracies (less
is more).
We observed that DARTS is more prone to instability on
randomly sampled proxy datasets. For future work, this
effect may also be present in other NAS approaches and
could be used as a benchmark to improve stability. Another
direction for future work is to exploit synthetic datasets
as proxy datasets, e.g., dataset distillation [24].
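As an illustration of the proxy-dataset idea summarized above, the following minimal sketch draws a uniformly random subset with ratio r and estimates the search time under the linear scaling reported in the conclusion. The function names `random_proxy_indices` and `estimated_search_time` and the fixed seed are assumptions made for this example. It covers only plain random sampling, not the other supervised or unsupervised sampling methods evaluated in this work, and the 50,000 figure is the size of the standard CIFAR-100 training set rather than the exact split used during search.

```python
import random


def random_proxy_indices(num_samples, r, seed=0):
    """Pick round(r * num_samples) distinct indices uniformly at random."""
    if not 0.0 < r <= 1.0:
        raise ValueError("r must lie in (0, 1]")
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    k = round(r * num_samples)
    return sorted(rng.sample(range(num_samples), k))


def estimated_search_time(full_time, r):
    """Estimate search time, assuming it scales linearly with r."""
    return r * full_time


# The standard CIFAR-100 training set has 50,000 images.
indices = random_proxy_indices(50_000, r=0.25)
print(len(indices))
print(estimated_search_time(full_time=8.0, r=0.25))
```

The returned indices can be used to select the corresponding samples from any indexable dataset. The value `full_time=8.0` is an arbitrary placeholder, not a measured search time.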
References
[1] Anonymous Authors. Additional experimental results,
2022. Supplied as supplemental material.
[2] Kaifeng Bi, Changping Hu, Lingxi Xie, Xin Chen, Longhui
Wei, and Qi Tian. Stabilizing darts with amended gradi-
ent estimation on architectural parameters. arXiv preprint
arXiv:1910.11831, 2019.
[3] Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progres-
sive differentiable architecture search: Bridging the depth
gap between search and evaluation. In Proceedings of the
IEEE/CVF International Conference on Computer Vision,
pages 1294–1303, 2019.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
and Li Fei-Fei. Imagenet: A large-scale hierarchical image
database. In 2009 IEEE conference on computer vision and
pattern recognition, pages 248–255. Ieee, 2009.
[5] Xuanyi Dong and Yi Yang. One-shot neural architecture
search via self-evaluated template network. In Proceedings
of the IEEE International Conference on Computer Vision,
pages 3681–3690, 2019.
[6] Xuanyi Dong and Yi Yang. Searching for a robust neu-
ral architecture in four gpu hours. In Proceedings of the
IEEE Conference on computer vision and pattern recogni-
tion, pages 1761–1770, 2019.
[7] Xuanyi Dong and Yi Yang. Nas-bench-201: Extending
the scope of reproducible neural architecture search. arXiv
preprint arXiv:2001.00326, 2020.
[8] Erich Gamma, Richard Helm, Ralph Johnson, and John
Vlissides. Design Patterns: Elements of reusable
object-oriented software. Addison-Wesley, Reading,
Massachusetts, 1995.
[9] Angelos Katharopoulos and François Fleuret. Not all
samples are created equal: Deep learning with importance
sampling. In International conference on machine learning,
pages 2525–2534. PMLR, 2018.
[10] Mark A Kramer. Nonlinear principal component analy-
sis using autoassociative neural networks. AIChE journal,
37(2):233–243, 1991.
[11] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10
(canadian institute for advanced research).
[12] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-
100 (canadian institute for advanced research).
[13] Hanwen Liang, Shifeng Zhang, Jiacheng Sun, Xingqiu He,
Weiran Huang, Kechen Zhuang, and Zhenguo Li. Darts+:
Improved differentiable architecture search with early stop-
ping. arXiv preprint arXiv:1909.06035, 2019.
[14] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon
Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan
Huang, and Kevin Murphy. Progressive neural architecture
search. In Proceedings of the European Conference on Com-
puter Vision (ECCV), pages 19–34, 2018.
[15] Hanxiao Liu, Karen Simonyan, and Yiming Yang.
Darts: Differentiable architecture search. arXiv preprint
arXiv:1806.09055, 2018.
[16] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bis-
sacco, Bo Wu, and Andrew Y Ng. Reading digits in natural
images with unsupervised feature learning. 2011.
[17] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and
Jeff Dean. Efficient neural architecture search via parameter
sharing. arXiv preprint arXiv:1802.03268, 2018.
[18] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V
Le. Regularized evolution for image classifier architecture
search. In Proceedings of the aaai conference on artificial
intelligence, volume 33, pages 4780–4789, 2019.
[19] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang,
Zhihui Li, Xiaojiang Chen, and Xin Wang. A comprehen-
sive survey of neural architecture search: Challenges and so-
lutions. arXiv preprint arXiv:2006.02903, 2020.
[20] Sam Shleifer and Eric Prokop. Using small proxy datasets to
accelerate hyperparameter search, 2019.
[21] Masanori Suganuma, Shinichi Shirakawa, and Tomoharu
Nagao. A genetic programming approach to designing con-
volutional neural network architectures. In Proceedings of
the genetic and evolutionary computation conference, pages
497–504, 2017.
[22] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet,
Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, and Andrew Rabinovich. Going deeper with
convolutions. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1–9, 2015.
[23] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan,
Mark Sandler, Andrew Howard, and Quoc V Le. Mnas-
net: Platform-aware neural architecture search for mobile.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 2820–2828, 2019.
[24] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and
Alexei A Efros. Dataset distillation. arXiv preprint
arXiv:1811.10959, 2018.
[25] Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real,
Kevin Murphy, and Frank Hutter. Nas-bench-101: Towards
reproducible neural architecture search. In International
Conference on Machine Learning, pages 7105–7114. PMLR,
[26] Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Mar-
rakchi, Thomas Brox, and Frank Hutter. Understanding
and robustifying differentiable architecture search. arXiv
preprint arXiv:1909.09656, 2019.
[27] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset
condensation with gradient matching. In International Con-
ference on Learning Representations, 2021.
[28] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V
Le. Learning transferable architectures for scalable image
recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 8697–8710,