Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision
XIANGZHONG LUO, Nanyang Technological University, Singapore
DI LIU, Norwegian University of Science and Technology, Norway
HAO KONG, Nanyang Technological University, Singapore
SHUO HUAI, Nanyang Technological University, Singapore
HUI CHEN, Nanyang Technological University, Singapore
GUOCHU XIONG, Nanyang Technological University, Singapore
WEICHEN LIU∗, Nanyang Technological University, Singapore
Deep neural networks (DNNs) have recently achieved impressive success across a wide range of real-world vision and language processing tasks, spanning from image classification to many other downstream vision tasks, such as object detection, tracking, and segmentation. However, previous well-established DNNs, despite being able to maintain superior accuracy, have also been evolving to be deeper and wider and thus inevitably necessitate prohibitive computational resources for both training and inference. This trend further enlarges the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems, making it challenging to deploy powerful DNNs upon real-world embedded computing systems towards ubiquitous embedded intelligence. To alleviate the above computational gap and enable ubiquitous embedded intelligence, we, in this survey, focus on discussing recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems. Furthermore, we also envision promising future directions and trends that have the potential to deliver more ubiquitous embedded intelligence. We believe this survey can shed light on future research and help researchers quickly and smoothly get started in this emerging field.
CCS Concepts: • Embedded and cyber-physical systems → Embedded software; Embedded hardware; • Computing methodologies → Artificial intelligence; Machine learning; Modeling and simulation.
Additional Key Words and Phrases: Embedded Computing Systems, Embedded Intelligence, Artificial Intelligence, Efficient Deep Learning Algorithms, Efficient Network Design, Efficient Neural Architecture Search, Efficient Model Compression, Efficient On-Device Learning, Efficient Large Language Models, Efficient Deep Learning Software and Hardware, and Intelligent Embedded Applications.
∗The corresponding author is Weichen Liu (Email: liu@ntu.edu.sg).
This research is partially supported by the Ministry of Education, Singapore, under its Academic Research
Fund Tier 1 (RG94/23), and partially supported by Nanyang Technological University, Singapore, under its NAP
(M4082282/04INS000515C130).
Authors’ addresses: Xiangzhong Luo, xiangzho001@e.ntu.edu.sg, Nanyang Technological University, Singapore; Di Liu,
Norwegian University of Science and Technology, Norway, di.liu@ntnu.no; Hao Kong, Nanyang Technological University,
Singapore, kong.hao@ntu.edu.sg; Shuo Huai, Nanyang Technological University, Singapore, huai.shuo@ntu.edu.sg; Hui
Chen, Nanyang Technological University, Singapore, chen.hui@ntu.edu.sg; Guochu Xiong, Nanyang Technological Univer-
sity, Singapore, guochu.xiong@ntu.edu.sg; Weichen Liu, Nanyang Technological University, Singapore, liu@ntu.edu.sg.
arXiv:2411.01431v1 [cs.LG] 3 Nov 2024
[Fig. 1 content: Section 2 Manual Network Design (2.1 Manual Convolutional Network Design, 2.2 Manual Transformer Design, 2.3 Future Envision); Section 3 Automated Network Design (3.1 Modular Search Spaces, 3.2 Efficient Search Strategies, 3.3 Speedup Techniques and Extensions, 3.4 Future Envision); Section 4 Network Compression (4.1 Network Pruning, 4.2 Network Quantization, 4.3 Network Distillation, 4.4 Future Envision); Section 5 On-Device Learning (5.1 General On-Device Learning, 5.2 On-Device Continual Learning, 5.3 On-Device Transfer Learning, 5.4 On-Device Federated Learning, 5.5 Future Envision); Section 6 Large Language Models (6.1 Preliminaries on LLMs, 6.2 Efficient LLM Architectures, 6.3 Efficient LLM Compression, 6.4 Efficient LLM Systems, 6.5 Future Envision); Section 7 ML Software and Hardware (7.1 ML Software Frameworks, 7.2 ML Hardware Frameworks, 7.3 Future Envision); Section 8 Intelligent Applications (8.1 Computer Vision, 8.2 Natural Language Processing, 8.3 Future Envision).]
Fig. 1. The organization of this paper, in which we omit Section 1 and Section 9 for the sake of simplicity.
1 INTRODUCTION
With the increasing availability of large-scale datasets and advanced computing paradigms, deep neural networks (DNNs)¹ have empowered a wide range of intelligent applications and have demonstrated strong performance [1–3]. These intelligent applications may span from image classification [2] to downstream vision tasks, such as object detection [4], tracking [5], and segmentation [6], to natural language processing (NLP) tasks, such as automatic speech recognition [7], machine translation [8], and question answering [9]. In the subsequent years, deep neural networks have been evolving deeper and deeper with more and more layers in order to maintain state-of-the-art accuracy on target task [1–3]. In the meantime, novel network structures and advanced training techniques have also emerged, which further push forward the attainable accuracy [10–12]. These powerful deep learning (DL) networks and advanced training techniques, starting from VGGNet [1] and ResNet [2], mark the emergence of the deep learning era.
The tremendous breakthroughs of DNNs have subsequently attracted a huge amount of attention from both academia and industry to deploy powerful DNNs upon real-world embedded computing systems, including mobile phones [13, 14], autonomous vehicles [15, 16], and healthcare devices [17, 18], to enable intelligent embedded applications towards embedded intelligence [19]. In practice, this may bring significant benefits. For example, embedded computing systems explicitly allow real-time on-device data processing, which significantly improves the processing efficiency and thus delivers an enhanced user experience. This also protects data security and privacy since everything can be locally processed without being uploaded to remote servers [19]. Despite the above promising benefits, deploying powerful DNNs upon real-world embedded computing systems still suffers from several critical limitations. On the one hand, in order to maintain competitive accuracy, recent representative networks have been evolving deeper and deeper with hundreds of layers [2, 3], which leads to prohibitive computational complexity [19, 20]. For example, ResNet50 [2], as one of the most representative deep networks, consists of over 4 billion floating-point operations (FLOPs) and 25 million parameters, and requires over 87 MB of on-device storage to process one single input image. On the other hand, real-world embedded computing systems like mobile phones and autonomous vehicles typically feature limited computational resources in order to optimize on-device power and energy consumption. In light of the above, the evolving network complexity continues to enlarge the computational gap between computation-intensive deep neural networks and resource-constrained embedded computing systems [20], inevitably making it increasingly challenging to embrace ubiquitous embedded intelligence.
To bridge the aforementioned computational gap towards ubiquitous embedded intelligence, a plethora of model compression techniques have been recently proposed, including network pruning [21–23], network quantization [24–26], and network distillation [11, 27, 28], which strive for better accuracy-efficiency trade-offs to accommodate the limited computational resources in real-world embedded scenarios. For example, network pruning focuses on removing redundant network units, such as weights [29], channels [21], and layers [30], to trim down the network redundancy, which can boost the efficiency on target hardware with minimal accuracy loss on target task. In addition to network compression, another parallel alternative is to manually design resource-efficient networks instead, such as SqueezeNet [31], MobileNets [32, 33], ShuffleNets [34, 35], and GhostNets [36, 37], which have dominated the early progress from the lens of efficient network design. These efficient networks, despite being able to exhibit superior efficiency, highly rely on human expertise to explore novel network structures through trial and error, which also involves non-trivial engineering efforts and prohibitive computational resources [38–40]. To overcome such limitations, recent network design practices have shifted from manual to automated, also referred to as neural architecture search (NAS) or automated machine learning (AutoML), which focuses on automatically exploring novel network structures [41]. The tremendous success of NAS has subsequently sparked rich hardware-aware NAS works, such as MnasNet [38], ProxylessNAS [40], FBNet [39], and Once-for-All [42], to automate the design of accurate yet hardware-efficient network solutions, which have shown strong accuracy-efficiency trade-offs and have been widely deployed upon real-world embedded computing systems to deliver intelligent services [43].
¹ In this work, we may interchangeably use some technical terms, such as deep learning models, machine learning models, DL models, ML models, deep neural networks (DNNs), and convolutional neural networks (CNNs).
Apart from the above efficient networks and techniques that typically focus on improving the on-device inference efficiency, recent research has also turned to the on-device training efficiency [44, 45]. The rationale here is that previous representative networks, despite being able to exhibit superior accuracy, have to be trained for hundreds of epochs, which may require multiple days on powerful GPUs [44]. Even worse, the expensive training process on remote GPUs does not allow on-device customization on local hardware, especially in resource-constrained embedded scenarios [45]. Note that local on-device customization has the potential to further improve the attainable accuracy using newly collected data since local sensors continue to collect new data from users over time. To overcome such limitations, several efficient on-device learning techniques have been recently established, such as on-device continual learning [46], on-device transfer learning [44], and on-device federated learning [47], making it possible to train and fine-tune powerful deep networks on local hardware for further performance improvement.
More recently, large language models (LLMs), such as GPT-3 [48] and GPT-4 [49], have demonstrated impressive success across various real-world language processing tasks [50]. However, the strong learning capability of these powerful LLMs also comes at the cost of excessive computational complexity. For example, OpenAI's GPT-3 [48], as one of the most representative LLMs, consists of 175 billion parameters. Furthermore, in order to achieve state-of-the-art performance, recent LLMs continue to evolve to be larger and larger with ever-increasing model sizes [51, 52]. This makes it increasingly challenging to deploy recent powerful LLMs on modern embedded computing systems towards intelligent language processing services. To overcome such limitations, a series of effective techniques have been recently proposed, which focus on alleviating the prohibitive computational complexity of LLMs to explore computation-efficient LLMs, including efficient LLM architecture design [53–56], efficient LLM compression techniques (i.e., pruning [57, 58], quantization [59, 60], and knowledge distillation [61, 62]), and efficient LLM system design [63–65].
In parallel to the booming emergence of powerful deep networks and advanced training techniques, a plethora of representative deep learning software frameworks and hardware accelerators have been tailored to facilitate the development of efficient deep learning solutions for embedded computing systems, such as TensorFlow [66], PyTorch [67], Google edge TPUs [68], Nvidia edge GPUs [69], and Intel Neural Compute Stick [70]. These deep learning software and hardware frameworks have been extensively adopted in the deep learning era and bring two main benefits. On the one hand, they lift the roadblock for both software and hardware engineers and thus allow them to quickly develop intelligent embedded applications, such as on-device object detection [4], tracking [5], and segmentation [6], with less domain-specific expertise. On the other hand, they typically feature domain-specific optimization and thus can achieve superior accuracy-efficiency trade-offs with minimal engineering efforts. For example, Nvidia Jetson AGX Xavier, as one representative Nvidia Jetson edge GPU, supports the development of intelligent embedded applications with INT8 precision (i.e., 8-bit weights), which can deliver significant efficiency improvement over its full-precision counterpart (32-bit weights) without degrading the accuracy on target task [69].
1.1 Organization of This Paper
In this survey, we focus on summarizing recent efficient deep learning infrastructures that may benefit current and future embedded computing systems towards ubiquitous embedded intelligence. In practice, some existing surveys [71–74] typically focus on efficient deep learning algorithms, which, however, may be out-of-date since recent deep learning infrastructures have been rapidly evolving, especially from the perspective of large language models. In contrast to [71–74], we focus on providing a more comprehensive and holistic view of recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems. We believe this survey can shed light on future research and help researchers quickly and smoothly get started in this emerging field. Finally, we demonstrate the organization of this survey in Fig. 1, which is also summarized as follows:
• Section 2 extensively discusses recent representative efficient manual networks.
• Section 3 extensively discusses recent representative efficient automated networks.
• Section 4 extensively discusses recent representative network compression techniques.
• Section 5 extensively discusses recent representative on-device learning techniques.
• Section 6 extensively discusses recent representative large language models.
• Section 7 extensively discusses recent representative deep learning software and hardware.
• Section 8 extensively discusses recent representative intelligent embedded applications.
Furthermore, at the end of each section, we also envision possible future directions in the respective field, which have the potential to pave the way for future ubiquitous embedded intelligence.
2 MANUAL NETWORK DESIGN FOR EMBEDDED COMPUTING SYSTEMS
The tremendous success of DNNs highly relies on the prohibitive network complexity, leading to
the computational gap between computation-intensive DNNs and resource-constrained embedded
computing systems [
20
]. To bridge the above computational gap, one of the most representative
solutions is to design computation-ecient DNNs to accommodate the limited computational
resources on embedded computing systems. To this end, we, in this section, systematically discuss
recent state-of-the-art ecient manual networks. For better understanding, we divide these ecient
networks into two main categories and sub-sections, including ecient convolutional networks
4
(b) the Ghost convolution(a) the standard convolution
Fig. 2. Comparisons between the standard convolution (le) and the Ghost convolution (right ) of GhostNets
[
36
,
37
,
75
]. In particular, compared with the standard convolutional layer, the Ghost convolutional layer can
generate rich features using simple and cheaper linear operations. (figure from [36])
in Section 2.1 and ecient transformers in Section 2.2, since these ecient networks may feature
dierent network structures and also target dierent intelligent embedded applications.
2.1 Manual Convolutional Neural Network Design
As shown in previous state-of-the-art deep convolutional networks, such as AlexNet [76], VGGNet [1], GoogleNet [77], ResNet [2], DenseNet [3], and EfficientNets [78, 79], despite being able to push forward the attainable accuracy on ImageNet [80] from 57.2% [81] to 87.3% [79], the network complexity has increased over time. We note that a convolutional network consists of convolutional layers, pooling layers, and fully-connected layers, where most of the network complexity comes from convolutional layers [82]. For example, in ResNet50 [2], more than 99% of the floating-point operations (FLOPs) are from convolutional layers. In light of this, designing efficient convolutional layers is critical to innovating computation-efficient convolutional networks. In practice, there are six typical efficient convolutional layers, including pointwise convolution, groupwise convolution, depthwise convolution, dilated convolution, Ghost convolution, and partial convolution:
• Pointwise Convolution. Pointwise convolution is a type of convolutional layer with the fixed kernel size of 1×1, which performs an element-wise multiplication and addition along the depth dimension. On the one hand, compared with the standard K×K convolutional layer, the pointwise convolutional layer is able to reduce the number of FLOPs and parameters by K² times, which therefore significantly improves the efficiency. On the other hand, we note that the output from the pointwise convolutional layer typically has the same spatial dimensions as the input but may have a different number of channels. As such, the pointwise convolutional layer can be used to adjust the intermediate feature maps in terms of the number of channels. Specifically, it can reduce or increase the number of channels, making it a practical technique for compressing or expanding convolutional networks.
• Groupwise Convolution. Groupwise convolution is a type of convolutional layer that (1) divides the input feature map into G groups along the depth dimension, (2) performs convolution within each group, respectively, and (3) concatenates the outputs along the depth dimension to derive the final output. For example, given an input feature map with the size of B×C×H×W, each kernel in the K×K groupwise convolutional layer is of size (C/G)×K×K, which convolves the above G groups of feature maps, respectively. Therefore, compared with the standard K×K convolutional layer, the groupwise convolutional layer is able to reduce the number of FLOPs and parameters by G times.
• Depthwise Convolution. Depthwise convolution is a type of convolutional layer that has gained popularity due to its ability to significantly reduce the number of FLOPs and parameters in convolutional networks. It is a special case of the groupwise convolutional layer, in which the number of groups G is equal to the number of input channels. Specifically, each input channel is convolved with a unique kernel of size 1×K×K, after which the outputs from all input channels are concatenated along the depth dimension to derive the final output. In practice, this has the potential to achieve a significant reduction in the number of FLOPs and parameters because the intermediate feature maps may consist of thousands of channels, as shown in previous state-of-the-art convolutional networks [2, 3, 78, 79].
• Dilated Convolution. Dilated convolution [83], also referred to as atrous convolution, is a type of convolutional layer that is designed to increase the receptive field size. Specifically, in the dilated convolutional layer, there is an adjustable parameter called the dilation rate, which determines the spacing between different kernel elements and can be varied to adjust the size of the receptive field. For example, the 3×3 dilated convolutional layer with a dilation rate of 2 maintains the same receptive field as the standard 5×5 convolutional layer. This further allows us to increase the receptive field size to unlock better accuracy without introducing additional computational overheads, such as FLOPs and parameters.
• Ghost Convolution. The Ghost convolution [36, 37, 75] is a type of convolutional layer that is designed to generate rich feature maps using cheaper computational resources, as illustrated in Fig. 2. Specifically, the Ghost convolutional layer consists of two sequential parts. The first part corresponds to the standard convolutional layer, in which the number of output channels is rigorously controlled. Subsequently, in the second part, to generate rich feature maps, a series of simple linear operations are applied to the output feature maps from the first part. As a result, the size of the output feature maps still remains the same as the standard convolutional layer, but the total required computational resources, such as the number of FLOPs and parameters, are significantly reduced as shown in Fig. 2.
• Partial Convolution. The partial convolution [84] is designed to reduce the computational redundancy and memory access simultaneously. Specifically, the partial convolution is built upon the regular convolution, in which only a small number of input channels are convolved with the regular convolution to extract representative spatial features and the remaining input channels are left unchanged. Similar to the Ghost convolution, the resulting output channels are further concatenated along the depth dimension to produce the final output channels. In practice, the partial convolution brings significant computational efficiency and memory efficiency since only a small number of input channels are convolved, and it also maintains better on-device resource utilization than the Ghost convolution (see the sketch after this list).
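To make the above layer types concrete, the following is a minimal PyTorch sketch of the six efficient convolutional layers. The channel sizes, kernel sizes, and module names (e.g., GhostConv, PartialConv) are illustrative assumptions for this survey, not the exact implementations of GhostNets or FasterNet.

```python
import torch
import torch.nn as nn

C_in, C_out, K, G = 64, 128, 3, 4                              # illustrative sizes

pointwise = nn.Conv2d(C_in, C_out, kernel_size=1)              # 1x1 conv: K^2x fewer FLOPs than KxK
groupwise = nn.Conv2d(C_in, C_out, kernel_size=K, groups=G,
                      padding=K // 2)                          # G groups: Gx fewer FLOPs and params
depthwise = nn.Conv2d(C_in, C_in, kernel_size=K, groups=C_in,
                      padding=K // 2)                          # groups == input channels
dilated = nn.Conv2d(C_in, C_out, kernel_size=K, dilation=2,
                    padding=2)                                 # 3x3 kernel, 5x5 receptive field


class GhostConv(nn.Module):
    """Ghost convolution sketch: a small standard conv followed by cheap
    depthwise (linear) operations that generate the remaining channels."""
    def __init__(self, c_in, c_out, ratio=2):
        super().__init__()
        primary = c_out // ratio
        self.primary = nn.Conv2d(c_in, primary, 1)
        self.cheap = nn.Conv2d(primary, c_out - primary, 3, padding=1, groups=primary)

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)


class PartialConv(nn.Module):
    """Partial convolution sketch: convolve only the first c_in // div channels
    and pass the remaining channels through untouched."""
    def __init__(self, c_in, div=4):
        super().__init__()
        self.c_conv = c_in // div
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, 3, padding=1)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_conv, x.size(1) - self.c_conv], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)


x = torch.randn(1, C_in, 56, 56)
for layer in [pointwise, groupwise, depthwise, dilated, GhostConv(C_in, C_out), PartialConv(C_in)]:
    print(type(layer).__name__, layer(x).shape)
```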
Built on top of the aforementioned efficient convolutional layers and structures, there are several representative families of manually designed efficient convolutional networks, including SqueezeNet [81], MobileNets [32, 85, 86], ShuffleNets [34, 35], CondenseNets [87, 88], GhostNets [36, 37, 75], and FasterNet [84]. We compare the above representative efficient convolutional networks in Fig. 3², which are also discussed in the remainder of this section.
Fig. 3. Comparisons of efficient convolutional networks that have been discussed in Section 2.1, including SqueezeNet [81], MobileNets [32, 85, 86], ShuffleNets [34, 35], CondenseNets [87, 88], and GhostNets [36, 37, 75], in which the accuracy is evaluated on ImageNet [80] and is taken from the respective paper. Note that the convolutional networks in this figure may be trained under different training recipes.
SqueezeNet [81] is stacked using a series of Fire modules, which aims to achieve AlexNet-level accuracy with fewer parameters. Specifically, each Fire module consists of two convolutional layers, including one squeeze layer and one expand layer. In the squeeze layer, only pointwise convolutional layers are used to reduce the number of input channels for the subsequent expand layer. Next, the expand layer performs feature expansion using a pair of 1×1 and 3×3 convolutional layers. In particular, SqueezeNet is able to achieve slightly better accuracy on ImageNet than AlexNet (i.e., 57.5% in SqueezeNet vs. 57.2% in AlexNet) using a 50× smaller model size. Meanwhile, SqueezeNet is more compression-friendly than AlexNet. For example, we are allowed to further compress SqueezeNet using [89], which delivers more compact network variants with 363×–510× smaller model size, and more importantly, without degrading the accuracy on ImageNet.
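As a concrete illustration, below is a minimal PyTorch sketch of a Fire module as described above. The channel configuration is an illustrative assumption and the code is not the official SqueezeNet implementation.

```python
import torch
import torch.nn as nn


class FireModule(nn.Module):
    def __init__(self, c_in, squeeze, expand1x1, expand3x3):
        super().__init__()
        # 1x1 squeeze layer reduces channels before the more expensive expand step.
        self.squeeze = nn.Sequential(nn.Conv2d(c_in, squeeze, 1), nn.ReLU(inplace=True))
        # Parallel 1x1 and 3x3 expand layers, concatenated along the channel dimension.
        self.expand1x1 = nn.Sequential(nn.Conv2d(squeeze, expand1x1, 1), nn.ReLU(inplace=True))
        self.expand3x3 = nn.Sequential(nn.Conv2d(squeeze, expand3x3, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)
        return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)


fire = FireModule(c_in=96, squeeze=16, expand1x1=64, expand3x3=64)
print(fire(torch.randn(1, 96, 55, 55)).shape)    # torch.Size([1, 128, 55, 55])
```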
MobileNets [32, 85, 86] are a family of lightweight convolutional networks, including MobileNetV1 [32], MobileNetV2 [85], and MobileNeXt [86], which are tailored for mobile devices with limited computational resources. Specifically, MobileNetV1 is built upon a series of building blocks, where each building block consists of two convolutional layers, including one 3×3 depthwise convolutional layer and one 1×1 pointwise convolutional layer. With 569 M FLOPs and 4.2 M parameters, MobileNetV1 achieves 70.6% top-1 accuracy on ImageNet. In addition, MobileNetV2 is an improved version of MobileNetV1, which aims to unlock higher accuracy with fewer FLOPs and parameters. Specifically, MobileNetV2 introduces the inverted residual building block that consists of three convolutional layers, including one 1×1 pointwise convolutional layer, one 3×3 depthwise convolutional layer, and one 1×1 pointwise convolutional layer. Here, the inverted residual building block also borrows the residual connection from ResNet [2] to stabilize the training process and improve the accuracy. With 300 M FLOPs and 3.4 M parameters, MobileNetV2 achieves 72.0% top-1 accuracy on ImageNet. Furthermore, MobileNeXt investigates the inverted residual building block in MobileNetV2 and introduces the sandglass block to enhance the accuracy without increasing the network complexity. Specifically, the sandglass block consists of four convolutional layers, including one 3×3 depthwise convolutional layer, one 1×1 pointwise convolutional layer, one 1×1 pointwise convolutional layer, and one 3×3 depthwise convolutional layer. With 300 M FLOPs and 3.4 M parameters, MobileNeXt achieves 74.0% top-1 accuracy on ImageNet.
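For reference, the following is a minimal PyTorch sketch of the MobileNetV2 inverted residual building block described above with an expansion ratio of 6; details such as the activation and normalization placement are simplified assumptions rather than the exact reference implementation.

```python
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    def __init__(self, c_in, c_out, stride=1, expand=6):
        super().__init__()
        hidden = c_in * expand
        self.use_residual = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            # 1x1 pointwise expansion
            nn.Conv2d(c_in, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 1x1 pointwise linear projection
            nn.Conv2d(hidden, c_out, 1, bias=False), nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out


block = InvertedResidual(32, 32)
print(block(torch.randn(1, 32, 56, 56)).shape)   # torch.Size([1, 32, 56, 56])
```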
ShuffleNets [34, 35] are a family of efficient convolutional networks, including ShuffleNetV1 [35] and ShuffleNetV2 [34], which exploit channel shuffling to reduce the network complexity while maintaining competitive accuracy. Specifically, ShuffleNetV1, for the first time, introduces channel shuffling to enhance the information flow across different channels. In practice, the channel shuffling operation is inserted after the 3×3 depthwise convolutional layer to shuffle the feature maps from different groups, which is capable of generating richer and more diverse feature maps while not increasing the number of FLOPs and parameters. With 292 M FLOPs and 3.4 M parameters, ShuffleNetV1 achieves 71.5% top-1 accuracy on ImageNet, which is +0.9% higher than MobileNetV1 under comparable FLOPs settings. Furthermore, ShuffleNetV2 improves the accuracy and efficiency of ShuffleNetV1 with several architectural modifications. Specifically, ShuffleNetV2 first leverages channel splitting to divide the input feature maps into two parallel branches, one of which is fed into three convolutional layers, including one 1×1 pointwise convolutional layer, one 3×3 depthwise convolutional layer, and one 1×1 pointwise convolutional layer. After that, the above two branches of feature maps are concatenated along the depth dimension, which are then shuffled using the channel shuffling operation. In particular, with 299 M FLOPs and 3.5 M parameters, ShuffleNetV2 is able to achieve 72.6% top-1 accuracy on ImageNet, which is +1.1% higher than ShuffleNetV1 under comparable FLOPs settings.
² We do not include FasterNet [84] in Fig. 3 for comparisons since FasterNet does not optimize the number of FLOPs.
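As a concrete illustration of the channel shuffling operation described above, the following is a minimal PyTorch sketch; it is a zero-FLOP, zero-parameter reshaping operation and is not taken from the official ShuffleNet code.

```python
import torch


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Reshape the channel dimension into (groups, channels_per_group) and
    transpose it, so that features from different groups are interleaved."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)     # (B, G, C/G, H, W)
    x = x.transpose(1, 2).contiguous()           # interleave channels across groups
    return x.view(b, c, h, w)


x = torch.arange(8).float().view(1, 8, 1, 1)     # channels 0..7, two groups of four
print(channel_shuffle(x, groups=2).flatten())    # tensor([0., 4., 1., 5., 2., 6., 3., 7.])
```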
• 2017.6 | Transformer [90]: the first network based on attention mechanisms, which achieves strong performance in NLP tasks.
• 2018.10 | BERT [91]: pre-trained transformers show promising performance in NLP tasks and begin to dominate the field of NLP.
• 2020.5 | GPT-3 [48]: an autoregressive transformer with 175 billion parameters, which takes a big step towards general NLP solutions.
• 2020.5 | DETR [106]: a simple yet effective transformer for end-to-end object detection, which achieves promising detection performance.
• 2020.10 | ViT [107]: a pure transformer, which demonstrates surprisingly strong performance in various vision tasks.
• Early 2021 | YOLOS/HRViT: applications of the transformer in low-level vision tasks like object detection and segmentation begin to flourish, such as YOLOS [111] and HRViT [115].
• 2021 | ViT Variants: a myriad of ViT variants emerge, such as DeiT [133] and Swin [108], and push forward the performance in a wide range of vision tasks.
• 2022 to Now | Efficient ViTs: the community begins to pay attention to the prohibitive complexity of ViTs and designs efficient ViTs, such as EfficientViT [125] and FastViT [129].
Fig. 4. Illustration of the key milestones of the transformer, which is originally applied to NLP tasks and has recently gained increasing popularity in the vision community. Here, we mark the vision transformers in red.
CondenseNets [87, 88] are a family of efficient convolutional networks, including CondenseNetV1 [87] and CondenseNetV2 [88], which are built upon another representative convolutional network named DenseNet [3]. Specifically, CondenseNetV1 enhances the dense connection with a novel module called learned group convolution. Note that the dense connection re-uses the features from preceding convolutional layers to enhance the information flow, as seen in DenseNet. In contrast, the learned group convolution removes the redundant dense connections between different convolutional layers to reduce network redundancy. With 274 M FLOPs and 2.9 M parameters, CondenseNetV1 achieves 71.0% top-1 accuracy on ImageNet. Furthermore, CondenseNetV2 introduces an alternative named sparse feature re-activation (SFR) to increase feature re-use. In particular, integrated with SFR, each convolutional layer can learn to (1) selectively re-use a set of the most important features from preceding convolutional layers and (2) actively update a set of preceding features to increase their re-use in subsequent convolutional layers. With 146 M FLOPs and 3.6 M parameters, CondenseNetV2 achieves 71.9% top-1 accuracy on ImageNet.
GhostNets [36, 37, 75] are a family of efficient deep convolutional networks, including GhostNetV1 [36, 75] and GhostNetV2 [37], which focus on generating rich feature maps using computationally cheap and simple yet powerful operations. To this end, GhostNetV1 introduces a powerful yet computation-efficient convolution dubbed Ghost convolution, as shown in Fig. 2, which consists of two sequential parts. The first part corresponds to the standard convolutional layer, where the number of output channels is rigorously controlled. Next, in the second part, a series of computationally cheap and simple linear operations are applied to the output feature maps from the first part to generate rich feature maps. In particular, with only 141 M FLOPs and 5.2 M parameters, GhostNetV1 achieves 73.9% top-1 accuracy on ImageNet. Furthermore, GhostNetV2 introduces a novel hardware-friendly attention mechanism, namely DFC attention, to enhance the learned feature maps and boost the expressiveness, which is seamlessly integrated into GhostNetV1 to push forward the accuracy and efficiency. For example, with 167 M FLOPs and 6.1 M parameters, GhostNetV2 is able to achieve 75.3% top-1 accuracy on ImageNet.
FasterNet [84] is built upon the partial convolution. In contrast to the above efficient networks that typically optimize the number of FLOPs, FasterNet pioneers the design of efficient networks with optimized FLOPS (i.e., FLOPs per second). The motivation behind FasterNet is that the on-device latency is determined by both FLOPs and FLOPS (i.e., Latency = FLOPs / FLOPS). To this end, FasterNet allows an increased number of FLOPs to maintain competitive accuracy on target task, while at the same time optimizing FLOPS to maintain competitive efficiency on target hardware. For example, compared with GhostNetV1x1.3, which involves 0.24 G FLOPs and exhibits 75.7% top-1 accuracy on ImageNet, FasterNet-T1 achieves +0.5% higher top-1 accuracy with many more FLOPs (i.e., 0.85 G), and more importantly, achieves a 1.7× speedup on ARM processors.
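To illustrate the latency relation above (Latency ≈ FLOPs / FLOPS), the following back-of-the-envelope sketch uses the FLOPs numbers quoted in this section together with assumed, illustrative throughput (FLOPS) values; the throughput numbers are not measurements from the FasterNet paper.

```python
# FLOPs per image, taken from the comparison in this section.
ghostnet_flops, fasternet_flops = 0.24e9, 0.85e9
# Assumed effective on-device throughput (FLOPS); purely illustrative numbers.
ghostnet_flops_per_s = 2.0e9
fasternet_flops_per_s = 12.0e9       # higher FLOPS thanks to fewer memory-bound operations

print(f"GhostNetV1x1.3 latency ~ {ghostnet_flops / ghostnet_flops_per_s * 1e3:.1f} ms")
print(f"FasterNet-T1   latency ~ {fasternet_flops / fasternet_flops_per_s * 1e3:.1f} ms")
# Even with ~3.5x more FLOPs, a sufficiently higher FLOPS yields a lower latency,
# which is the kind of speedup FasterNet reports on ARM processors.
```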
2.2 Manual Transformer Design
2.2.1 Transformer for NLP. In parallel to convolutional networks, the transformer [90] is another well-established branch of DNNs, which exploits multi-head self-attention mechanisms. In practice, the transformer was first designed and applied to natural language processing (NLP) tasks, where it has achieved tremendous success. For example, BERT [91], as one of the most representative transformers in the field of NLP, was able to achieve state-of-the-art performance across 11 downstream NLP tasks, such as language translation, question answering, and language generation, at the moment BERT was proposed. Furthermore, GPT-3 [48], also known as Generative Pre-trained Transformer 3, pioneers to scale up and pre-train a massive transformer that consists of 175 billion parameters on 45 TB of compressed plaintext data, which unlocks even stronger performance across almost all downstream NLP tasks, and more importantly, without requiring fine-tuning on specific NLP tasks. More recently, GPT-4 [49] has been proposed by OpenAI, which significantly outperforms GPT-3 across a wide range of language processing tasks and has also been widely integrated into various real-world applications, such as ChatGPT [92], to provide intelligent language processing services. These early transformer-based deep networks, despite their prohibitive computational complexity, have been pushing forward the boundaries of various language processing tasks and dominating recent advances in the field of NLP (see Fig. 4).
Nonetheless, it is quite challenging to deploy powerful transformers on embedded computing systems due to the computational gap between computation-intensive transformers and computation-limited embedded computing systems. For example, as pointed out in [93], to translate a short sentence with only 30 words, a typical transformer model needs to execute 13 G FLOPs, which takes 20 seconds on a Raspberry Pi device. This significantly hinders the user experience in real-world embedded scenarios. To tackle this issue, a series of computation-efficient transformers have emerged, among which TinyBERT [94], MobileBERT [95], DistilBERT [96], Linformer [97], and Reformer [98] are some of the most representative ones. The main intuition behind these efficient transformers is to resolve the memory bottleneck and increase the parallelism, making it possible to deploy NLP workloads on resource-constrained embedded computing systems. Note that, compared with computer vision tasks like image classification and object detection, running NLP workloads on embedded computing systems is less common due to the high inference latency. For example, as demonstrated in [93], running language translation workloads with hardware-tailored transformers on a Raspberry Pi device still takes seconds, whereas running image classification workloads typically takes milliseconds per image. More recently, inspired by the remarkable success of GPTs [48, 49], transformer-based large language models have become increasingly popular in the NLP community. To optimize the efficiency of transformer-based large language models, a plethora of efficient variants have been proposed, which typically focus on improving the training efficiency [99–101], the inference efficiency [102, 103], and the fine-tuning efficiency [104, 105] of transformers in the context of large language models. For example, to optimize the inference efficiency of transformer-based large language models, [102] partitions large language models over different hardware chips in order to fit weights and activation tensors into memory and run computation and memory workloads within the given latency constraint, which also features a simple yet effective strategy to alleviate the communication overheads among different hardware chips for cost-effective and latency-efficient inference.
Fig. 5. Overview of the Vision Transformer (ViT) [107], which (1) splits the image into fixed-size patches, (2) linearly embeds each of them, and (3) feeds the sequence of vectors into the encoder. (figure from [107])
2.2.2 Transformer for Vision. Inspired by the tremendous success of the transformer in the field of NLP, researchers have recently applied the transformer to vision tasks, where it achieves surprisingly strong performance (see Fig. 4). This opens up a new direction and further challenges the dominant role of convolutional networks in vision tasks. Specifically, DETR [106] and Vision Transformer (ViT) [107] are the very early transformers in vision tasks, among which ViT is the most representative one. These early pioneers have motivated a myriad of subsequent transformers in various vision tasks, such as image classification [107–109], object detection [110–112], semantic segmentation [113–116], and video analysis [117–119]. For example, ViT was first proposed in October 2020 and has since gained over 20,000 citations according to Google Scholar. In particular, the main intuition behind ViT is surprisingly simple and straightforward: it (1) splits the input image into a series of fixed-size patches, (2) linearly embeds each of them, and (3) feeds the resulting sequence of vectors into the standard transformer encoder, as illustrated in Fig. 5. However, there is no free lunch. The surprisingly strong performance of ViT and its variants comes at the cost of prohibitive computational complexity, which significantly hinders the practical deployment of ViT and its variants on embedded computing systems with limited computational resources.
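As a concrete illustration of steps (1)–(3) above, the following is a minimal PyTorch sketch of the ViT front-end (patch embedding, class token, and position embeddings). The dimensions follow the common ViT-Base configuration but are illustrative assumptions rather than the exact reference implementation.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch=16, c_in=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # A stride-`patch` convolution is equivalent to flattening each patch and
        # linearly projecting it into the embedding space.
        self.proj = nn.Conv2d(c_in, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))              # extra learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed   # sequence ready for the encoder


tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                                          # torch.Size([1, 197, 768])
```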
To resolve the complexity bottleneck, some recent works have pioneered the design of computation-efficient transformers for vision tasks, with the aim of reducing the computational complexity while maintaining competitive accuracy. Representative computation-efficient transformers for vision tasks include LeViT [120], MobileFormer [121], MobileViTs [122–124], EfficientViT [125], EdgeViT [126], EdgeNeXt [127], CastlingViT [128], and FastViT [129]. The above computation-efficient vision transformers are summarized and compared in Fig. 6.
LeViT [120] is a hybrid vision transformer built on top of convolutional networks, which aims to improve the trade-off between accuracy and efficiency. To this end, LeViT introduces several enhancements to shrink down the network size, including (1) a multi-stage transformer architecture that uses attention mechanisms as down-sampling, (2) a computation-efficient patch descriptor that shrinks down the number of features in the early layers, (3) a per-head translation-invariant attention bias that replaces ViT's positional embeddings, and (4) an efficient MLP-based attention block that improves the network capacity under given computational budgets. With 406 M FLOPs and 9.2 M parameters, LeViT achieves 78.6% top-1 accuracy on ImageNet.
MobileFormer [121] parallelizes MobileNetV2 [85] and the transformer [107] with a two-way bridge, which shifts the network design paradigm from series to parallel. The network is named MobileFormer, where Mobile refers to MobileNetV2 and Former stands for the transformer. Specifically, Mobile takes the image as input and stacks inverted residual blocks that consist of efficient pointwise and depthwise convolutional layers to extract local features. Former takes learnable tokens as input and stacks multi-head attention and feed-forward networks, in which the learnable tokens encode global features of the image. As such, Mobile and Former can communicate through a two-way bridge to fuse local and global features for better expressiveness. With 294 M FLOPs and 11.4 M parameters, MobileFormer achieves 77.9% top-1 accuracy on ImageNet.
MobileViTs, including MobileViTv1 [122], MobileViTv2 [123], and MobileViTv3 [124], are a family of efficient hybrid networks that combine the benefits of CNNs (e.g., spatial inductive bias and less sensitivity to data augmentations) and vision transformers (e.g., input-adaptive weighting and global processing). Different from mainstream vision transformers, both MobileViTv1 and MobileViTv2 are designed with the aim of low inference latency rather than low FLOPs since the number of FLOPs cannot accurately reflect the inference efficiency on target hardware. To this end, MobileViTv1 introduces a novel block that is able to efficiently and effectively encode both local and global features. In addition, MobileViTv1 also replaces local processing in convolutional layers with global processing using transformers, which can lead to better representation capability with fewer parameters and simpler training recipes. Finally, with 5.6 M parameters, MobileViTv1 achieves 78.4% top-1 accuracy on ImageNet. Furthermore, MobileViTv2 introduces a separable self-attention mechanism with linear complexity, which is integrated into MobileViTv1 to boost the accuracy and hardware efficiency. For example, MobileViTv2 achieves 75.6% top-1 accuracy on ImageNet, which is +0.8% higher than MobileViTv1 while maintaining a 3.2× speedup on iPhone 12. In addition, MobileViTv3 introduces two simple yet effective enhancements, including (1) replacing 3×3 convolutional layers with 1×1 convolutional layers and (2) scaling up building blocks in terms of the network width. With 927 M FLOPs, MobileViTv3 achieves 76.7% top-1 accuracy on ImageNet, which is +1.9% higher than MobileViTv1 under similar FLOPs.
EfficientViT [125] investigates high-resolution low-computation visual recognition tasks using ViT and its variants, and identifies that the complexity bottleneck of ViT and its variants comes from the excessively used softmax attention mechanism. To resolve the complexity bottleneck, EfficientViT challenges the dominant role of softmax attention in vision transformers and further introduces a strong alternative, namely enhanced linear attention, to replace softmax attention, which demonstrates strong representation capability in local feature extraction while being able to maintain low computational complexity and high hardware efficiency. With 406 M FLOPs and 7.9 M parameters, EfficientViT achieves 78.6% top-1 accuracy on ImageNet.
EdgeViT [126] investigates the design of efficient vision transformers from the perspective of on-device deployment, enabling vision transformers to compete with state-of-the-art CNNs in terms of the accuracy-efficiency trade-off. Specifically, EdgeViT is designed based on an optimal decomposition of self-attention using standard primitive operations, optimizing EdgeViT towards target hardware to achieve superior accuracy-efficiency trade-offs. With 600 M FLOPs and 4.1 M parameters, EdgeViT achieves 74.4% top-1 accuracy on ImageNet, which is +2.4% higher than MobileNetV2 under comparable latency constraints on Samsung Galaxy S21.
EdgeNeXt [127] is an efficient hybrid network that marries both worlds of convolutional networks and vision transformers. To better encode the global information, EdgeNeXt introduces an efficient split depthwise transpose attention (SDTA) encoder to address the issue of limited receptive fields in CNNs without increasing the number of FLOPs and parameters. In addition, EdgeNeXt also leverages adaptive kernel sizes to shrink down the network complexity. With 538 M FLOPs and 2.3 M parameters, EdgeNeXt achieves 75.0% top-1 accuracy on ImageNet, which is comparable to MobileViTv1 [122] in terms of both accuracy and on-device latency.
Fig. 6. Comparisons of efficient vision transformers that have been discussed in Section 2.2, including LeViT [120], MobileFormer [121], MobileViTs [122–124], EfficientViT [125], EdgeViT [126], EdgeNeXt [127], CastlingViT [128], and FastViT [129], in which the accuracy is evaluated on ImageNet [80] and is taken from the respective paper. Note that the vision transformers here may be trained under different training recipes.
CastlingViT [128] proposes to (1) train ViT and its variants using both linear-angular attention and masked softmax-based quadratic attention and (2) switch to having only linear-angular attention during inference in order to save computational resources. Specifically, the linear-angular attention leverages angular kernels to bridge the accuracy gap between linear attention and softmax-based attention. It expands angular kernels such that linear terms are kept while complex high-order residuals are approximated. This aligns with the observation in EfficientViT [125] that the complexity bottleneck of ViT and its variants comes from the excessively involved softmax attention mechanism. To address the complexity bottleneck, CastlingViT replaces softmax attention with linear-angular attention to further improve the efficiency of ViT and its variants. With 490 M FLOPs and 10.5 M parameters, CastlingViT achieves 79.6% top-1 accuracy on ImageNet.
FastViT [129] is an efficient hybrid network that combines both CNNs and vision transformers, which aims to marry the best of both and enable state-of-the-art accuracy-efficiency trade-offs. To this end, FastViT introduces a novel token mixing operator named RepMixer, the basic building block of FastViT, which leverages structural reparameterization to reduce the memory access cost by removing the less important skip connections. In addition, FastViT also applies training-time over-parameterization and large kernel convolutions to further boost the accuracy with minimal effect on the inference latency. In practice, structural reparameterization enables FastViT to achieve strong accuracy on target task during the training process and maintain superior efficiency on target hardware during the on-device inference process. With 700 M FLOPs and 3.6 M parameters, FastViT achieves 75.6% top-1 accuracy on ImageNet.
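As a concrete illustration of structural reparameterization, the following minimal PyTorch sketch folds an identity skip connection into a depthwise 3×3 convolution so that a single branch reproduces the two-branch output exactly at inference time; this is a simplified example in the spirit of RepMixer/RepVGG, not the exact FastViT implementation.

```python
import torch
import torch.nn as nn

C = 8
dw = nn.Conv2d(C, C, 3, padding=1, groups=C)           # training-time depthwise branch (plus skip)

# Inference-time fusion: the identity branch equals a depthwise kernel with a 1.0
# at the center of every channel, so it can be added into the convolution weights.
fused = nn.Conv2d(C, C, 3, padding=1, groups=C)
with torch.no_grad():
    fused.weight.copy_(dw.weight)
    fused.weight[:, 0, 1, 1] += 1.0                     # fold the skip connection into the kernel
    fused.bias.copy_(dw.bias)

x = torch.randn(1, C, 16, 16)
print(torch.allclose(dw(x) + x, fused(x), atol=1e-5))   # True: same output, one branch fewer
```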
2.3 Future Envision
In this section, we envision the future trends and possible directions of manual network design,
including convolutional networks and transformers, which are summarized as follows:
(1) Hardware-Aware Optimization. The trend in the field of network design is to reduce the number of FLOPs. However, the number of FLOPs only represents the theoretical complexity, and a reduction in the number of FLOPs does not necessarily lead to inference speedup on target hardware [122, 123, 126, 130, 131]. For example, PiT [132] has 3× fewer FLOPs than DeiT [133], but both have similar inference latency on iPhone 12 (i.e., DeiT vs. PiT on iPhone 12: 10.99 ms vs. 10.56 ms) [122]. In parallel, attention mechanisms are powerful plug-in enhancements in various real-world scenarios [134, 135], such as Squeeze-and-Excitation (SE) [31] in vision tasks and self-attention [90] in NLP tasks, which can further boost the attainable accuracy on target task while slightly increasing the number of FLOPs. However, DNNs with attention mechanisms, despite being able to push forward the accuracy on target task, introduce considerable extra parameters and are difficult to parallelize on target hardware, especially transformers that are full of self-attention mechanisms. For example, EfficientViT [125] demonstrates that the prohibitive computational complexity of ViT and its variants comes from the excessively used softmax attention. In light of the above, we should focus on optimizing more direct efficiency metrics, such as latency and energy, which may directly benefit real-world embedded computing systems.
(2) Interpretability and Explainability. Recent manually designed DNNs, including efficient convolutional networks and transformers, have been empirically developed through trial and error. The reason behind this is that DNNs suffer from limited interpretability and explainability [136]. Therefore, to find one decent network solution with competitive accuracy, we have to repeat a plethora of training experiments to evaluate the accuracy of possible network configurations [137, 138], thereby necessitating non-trivial computational resources for repeated training workloads [130, 131]. To avoid this, future work should focus on addressing the interpretability and explainability of DNNs so as to facilitate the network design process and minimize the required engineering efforts.
(3) Hybrid Multi-Modal Networks. Compared with vision transformers, convolutional networks are able to maintain superior efficiency on target hardware, but may suffer from inferior accuracy on target task. Meanwhile, self-attention mechanisms are excessively involved in vision transformers, which are difficult to parallelize on mainstream embedded computing systems [125, 128]. For example, as demonstrated in EdgeViT [126], under similar FLOPs settings, MobileNetV2 [85] is about 2× faster on Samsung Galaxy S21 than MobileViTv1 [122]. This further hinders the practical deployment of vision transformers in real-world embedded scenarios. In parallel, [139] demonstrates that, similar to transformers, graph neural networks, when properly engineered, can also achieve competitive performance in vision tasks. More importantly, hybrid networks have the potential to handle various modalities (i.e., different types of input), such as text, image, and audio [140]. For example, convolutional networks are particularly effective at handling spatial data, such as images. In contrast, transformers are better suited for sequential data, such as text. Therefore, in order to achieve better accuracy-efficiency trade-offs and allow diverse input modalities, one natural and promising future direction is to continue exploring hybrid multi-modal networks that combine the strengths of existing representative networks, such as convolutional networks, vision transformers, and graph networks.
(4) Simpler Training Recipes. As demonstrated in [141], the competitive performance of ViT and its variants highly relies on advanced training recipes, such as pre-training on larger datasets, more training epochs, stronger data augmentations, and stronger regularization strategies. For example, ViT [107] is first pre-trained on ImageNet-21k and JFT and then fine-tuned on ImageNet. Note that ImageNet consists of 1,000 categories, whereas ImageNet-21k has 21,000 categories. This makes it more difficult and challenging to train vision transformers under regular training settings and significantly increases the total training cost. Therefore, training vision transformers in a more computation-efficient manner and under simpler training recipes is a promising future direction.
(5) Adversarial Robustness. In addition to efficiency, adversarial robustness is another desirable network property since efficient networks, especially vision transformers [142], are more sensitive to input perturbations, and as a result, are more vulnerable to adversarial attacks than non-efficient ones [143]. Specifically, adversarial robustness refers to the ability of the network to maintain its accuracy even when encountering adversarial attacks that are intentionally designed to mislead the network. Adversarial robustness is critical in real-world scenarios, especially those where the environments are complex and unpredictable, such as autonomous vehicles. Therefore, innovating efficient yet robust DNNs is a promising future direction in the field of network design.
Fig. 7. Illustration of the cell-based search space A in NASNet [144] and DARTS [138], in which NASNet assigns operator candidates to nodes and DARTS assigns operator candidates to edges. (figure from [41])
3 AUTOMATED NETWORK DESIGN FOR EMBEDDED COMPUTING SYSTEMS
In contrast to manual network design, automated network design, also known as neural architecture search (NAS) [137], has recently flourished, which strives to automate the design of efficient neural networks. In the past decade, NAS has achieved impressive performance in the field of network design, delivering more advanced networks with both higher accuracy and efficiency than conventional manual network design (see Section 2). To this end, we, in this section, further discuss recent advances in the field of NAS, especially from the perspective of hardware-aware NAS that searches for hardware-efficient network solutions, including the modular search space in Section 3.1, the search strategy in Section 3.2, and speedup techniques and extensions in Section 3.3.
3.1 Modular Search Space
The search space A plays a prominent role in the success of NAS since the search engine of NAS strives to search for top-performing architecture candidates within the pre-defined search space. This also indicates that the search space determines the upper performance limit of modern NAS algorithms. However, designing efficient and effective search spaces is quite difficult and challenging since there are a myriad of possible operator candidates (e.g., 1×1, 3×3, 5×5, and 7×7 convolutional layers) and different network configurations (e.g., the combination strategies of different operator candidates and the network channel layouts) [137, 138, 144]. Therefore, to reduce the search space size and trim down the search complexity, previous state-of-the-art NAS methods [137, 138, 144] often restrict the search space to allow efficient search and leverage modular search spaces, which are coarse-grained in contrast to layer-wise fine-grained search spaces. In practice, previous state-of-the-art NAS methods are based on the following two representative types of modular search spaces, including the cell-based search space and the block-based search space.
[Fig. 8 content: the network is a predefined skeleton of sequential blocks (Block 1 to Block 7) between the input image and the output; each block i stacks Ni searched layers (e.g., a 1×1 conv, 5×5 dconv, 1×1 conv inverted residual layer), and the per-block search space covers ConvOp (dconv, conv, ...), KernelSize (3×3, 5×5), SERatio (0, 0.25, ...), SkipOp (identity, pool, ...), FilterSize Fi, and the number of layers Ni.]
Fig. 8. Illustration of the block-based search space A, which is based on MobileNetV2. (figure from [38])
Cell-Based Search Space. The cell-based search space A has dominated the early success in the field of NAS. Specifically, the cell-based search space is first introduced by NASNet [144] and DARTS [138]. As defined in NASNet, the cell-based search space consists of two types of cell structures, which are denoted as the normal cell and the reduction cell. In practice, both types of cells are encoded into directed acyclic graphs (DAGs) and maintain the same cell structure as illustrated in Fig. 7, except that the reduction cell starts with one convolutional layer with a stride of 2 to reduce the input spatial dimension. Once the cell structure is determined at the end of search, it is repeatedly stacked to derive the final architecture candidate. In addition, DARTS introduces another type of cell-based search space, which has motivated a plethora of subsequent NAS methods that are also built on top of the same cell-based search space, such as RobustDARTS [145], EdgeNAS [130], PC-DARTS [146], P-DARTS [147], DARTS+ [148], DARTS- [149], FairDARTS [150], and β-DARTS [151]. Similar to NASNet, the cell-based search space in DARTS consists of two types of cells, including the normal cell and the reduction cell. As shown in Fig. 7, each cell has an ordered sequence of nodes, where each node is a latent representation (e.g., a feature map in convolutional networks) and each directed edge has a set of possible operators {o^{(i,j)}} that transform the input x^{(i)}. Different from NASNet, the cell in DARTS is assumed to have two different input nodes and one single output node. With the above in mind, we are able to mathematically calculate each intermediate node based on all of its predecessors as follows:
x^{(j)} = \sum_{i < j} o^{(i,j)}(x^{(i)})    (1)
Finally, the search space of DARTS contains 6.3 × 10^29 possible architecture candidates [130].
Block-Based Search Space. The block-based search space A advocates for simple and diverse network topologies as illustrated in Fig. 8, in which each architecture candidate consists of multiple sequential operator candidates. As shown in previous NAS practices, the operator candidates in the block-based search space are usually taken from state-of-the-art manual DNNs, such as MobileNets [32, 85] and ShuffleNets [34, 35]. For example, the block-based search space in ProxylessNAS [40] is built on top of MobileNetV2, whereas the block-based search space in HSCoNAS [152] is built on top of ShuffleNetV2. In parallel, HURRICANE [153] demonstrates that different hardware platforms favor different search spaces, based on which HURRICANE introduces a hybrid block-based search space that combines both MobileNetV2 and ShuffleNetV2 to deliver superior architecture solutions. Different from the cell-based search space, the block-based search space is hardware-friendly, due to which the block-based search space has been widely adopted in previous hardware-aware NAS methods, such as MnasNet [38], ProxylessNAS [40], OFA [42], HSCoNAS [152], SurgeNAS [154], and LightNAS [131]. The intuition behind this is that the architecture candidate in the cell-based search space consists of multiple parallel branches as shown in Fig. 7, which introduce additional overheads in terms of memory access, and as a result, deteriorate the inference efficiency on target hardware according to the roofline analysis [155]. In addition, different from the cell-based search space that repeatedly stacks the same cell structure across the entire network, the block-based search space allows operator diversity within different blocks, encouraging to find architecture candidates with better accuracy-efficiency trade-offs [156].

Fig. 9. Illustration of how the recurrent neural network (RNN) controller samples possible convolutional architecture candidates from the search space in reinforcement learning-based NAS. (figure from [137])
3.2 Search Strategy
In this section, we discuss recent state-of-the-art NAS algorithms and divide them into three main categories, including reinforcement learning-based search [137], evolutionary algorithm-based search [157], and gradient-based search (also known as differentiable search) [138].
Reinforcement Learning-Based Search. In the field of NAS, [137] is the first NAS work³ that opens up the possibility to automate the design of top-performing DNNs, which features reinforcement learning (RL) [159] as the search engine. Specifically, [137] leverages a simple yet effective recurrent neural network (RNN) as the RL controller to generate possible architecture candidates from the search space as shown in Fig. 9. The generated architecture candidate is then trained from scratch on target task to evaluate its accuracy. Next, the accuracy of the generated architecture candidate is fed back into the aforementioned RNN controller, which optimizes the RNN controller to generate better architecture candidates in the next iteration. Once the search process terminates, the well-optimized RNN controller is able to provide DNNs with superior accuracy on target task. For example, the network generated by the RNN controller achieves 96.35% top-1 accuracy on CIFAR-10, which is comparable to or even better than the family of manually designed DNNs, such as ResNet [2]. The promising performance of [137] marks an important milestone in the field of NAS, pioneering an effective alternative to automate the design of competitive DNNs.

Subsequently, based on [137], NASNet [144] introduces the flexible cell-based search space as shown in Fig. 7, which further boosts the attainable accuracy on target task. For example, NASNet achieves 97.6% top-1 accuracy on CIFAR-10, which is +1.25% higher than [137] while involving fewer parameters (i.e., 37.4 M in [137] vs. 27.6 M in NASNet). Despite the promising performance, [137] and NASNet have to train a large number of possible architecture candidates from scratch, thus inevitably necessitating prohibitive computational resources. For example, to optimize the RNN controller, [137] needs to train 12,800 stand-alone architecture candidates. To overcome such limitations, ENAS [160] proposes an efficient NAS paradigm dubbed parameter sharing, which forces all the architecture candidates to share network weights to eschew training each architecture candidate from scratch. In practice, this leads to a significant reduction in search cost, while still maintaining strong accuracy on target task. For example, in [137], one single search experiment takes 3∼4 days on 450 Nvidia GTX 1080 Ti GPUs [144]. In contrast, benefiting from the paradigm of parameter sharing, ENAS is able to find one decent network solution with 97.11% top-1 accuracy on CIFAR-10, and more importantly, in less than 16 hours on one single Nvidia GTX 1080 Ti GPU. Thanks to its significant search efficiency, the paradigm of parameter sharing has been dominating subsequent breakthroughs in the NAS community [42, 138, 161].

³ MetaQNN [158] is another seminal NAS work in parallel to [137], both of which feature reinforcement learning as the search engine to automate the design of top-performing DNNs with competitive accuracy on target task.
Fig. 10. Overview of MnasNet [38], in which the controller samples models from the search space, the trainer evaluates their accuracy, mobile phones provide the measured latency, and both are combined into a multi-objective reward. (figure from [38])
Although early RL-based NAS methods [137, 144, 160] have made tremendous success in automated network design, they focus on accuracy-only optimization and ignore other important performance metrics, such as latency and energy. To search for hardware-efficient network solutions, MnasNet [38] formulates the search process as a multi-objective optimization problem that optimizes both accuracy and latency as shown in Fig. 10. To achieve this, MnasNet introduces a flexible block-based search space (see Fig. 8) and designs an effective multi-objective RL reward function to optimize the RNN controller. Specifically, the goal of MnasNet is to find Pareto-optimal architecture candidates arch in the search space A that maximize the pre-defined multi-objective RL reward, which can be formulated as follows:

maximize_{arch ∈ A}  Accuracy(arch) × [Latency(arch) / T]^w    (2)

where Accuracy(·) and Latency(·) denote the accuracy on target task and the latency on target hardware, respectively. Besides, T is the specified latency constraint. It is worth noting that the latency Latency(·) in MnasNet is directly measured on target hardware, which suffers from non-trivial engineering efforts due to the prohibitive search space (e.g., |A| ≈ 10^39 in MnasNet) [40, 42]. To avoid the tedious on-device latency measurements, we later discuss several efficient latency predictors in this section. Apart from these, w is the trade-off coefficient that controls the trade-off magnitude between accuracy and latency, which is defined as follows:

w = α, if Latency(arch) ≤ T;  β, otherwise    (3)

where α and β are application-specific hyper-parameters to control the trade-off magnitude between accuracy and efficiency. According to the empirical observation that doubling the latency usually brings ∼5% relative accuracy improvement, MnasNet assigns α = β = −0.07. In practice, α and β are both sensitive and difficult to tune. And even worse, given new hardware devices or new search spaces, α and β involve additional engineering efforts for hyper-parameter tuning. For example, as observed in MobileNetV3 [162], the accuracy changes much more dramatically with latency for small networks. Therefore, to obtain the required architecture candidate that satisfies the specified latency constraint T, we typically need to repeat 7 search experiments to tune α and β through trial and error [163], which significantly increases the total search cost by 7×. To eliminate such additional hyper-parameter tuning, TuNAS [163] investigates the multi-objective RL reward in Eq (2) and further introduces a similar RL reward function, which can be formulated as follows:

maximize_{arch ∈ A}  Accuracy(arch) + γ × |Latency(arch) / T − 1|    (4)

where |·| is the absolute function. Besides, γ < 0 is a finite negative value, which controls how strongly we enforce the architecture candidate to maintain the latency close to T.

Fig. 11. Overview of the one-shot supernet, where solid lines mean that the operator candidates are enabled while dashed lines mean that the operator candidates are part of the search space but disabled. Here, the one-shot supernet contains all the possible architecture candidates in the search space. (figure from [169])
In addition, MONAS [164] also introduces a simple yet effective RL reward function that considers optimizing both accuracy and energy, which can be formulated as follows:

Reward(arch) = η × Accuracy(arch) − (1 − η) × Energy(arch)    (5)

where η ∈ [0, 1] is the coefficient to control the trade-off between accuracy and energy. We note that the RL reward function in Eq (5) aims to find the architecture candidate with high accuracy and low energy, which can be generalized to other performance constraints like latency.
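To make the differences among these reward formulations concrete, the following minimal Python sketch implements Eqs (2)-(5) as plain functions. The accuracy, latency, and energy inputs are assumed to be scalars produced elsewhere (e.g., by a trainer and an on-device profiler), and the default hyper-parameter values are only illustrative (α = β = −0.07 follows MnasNet as reported above, while the γ and η defaults are our own placeholders).

```python
# Minimal sketch of the multi-objective RL rewards in Eqs (2)-(5).
# Accuracy/latency/energy are assumed to be scalars measured elsewhere.

def mnasnet_reward(accuracy, latency_ms, target_ms, alpha=-0.07, beta=-0.07):
    """Eqs (2)-(3): Accuracy(arch) x [Latency(arch)/T]^w with a piecewise exponent."""
    w = alpha if latency_ms <= target_ms else beta
    return accuracy * (latency_ms / target_ms) ** w

def tunas_reward(accuracy, latency_ms, target_ms, gamma=-0.07):
    """Eq (4): Accuracy(arch) + gamma * |Latency(arch)/T - 1| with gamma < 0."""
    return accuracy + gamma * abs(latency_ms / target_ms - 1.0)

def monas_reward(accuracy, energy, eta=0.5):
    """Eq (5): eta * Accuracy(arch) - (1 - eta) * Energy(arch)."""
    return eta * accuracy - (1.0 - eta) * energy

if __name__ == "__main__":
    # A hypothetical candidate: 75.2% top-1 accuracy, 80 ms measured vs. a 75 ms target.
    print(mnasnet_reward(0.752, 80.0, 75.0))
    print(tunas_reward(0.752, 80.0, 75.0))
    print(monas_reward(0.752, energy=0.3))
```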
Evolutionary Algorithm-Based Search. In addition to reinforcement learning-based search, evolutionary algorithm-based search is another popular branch in the NAS literature, thanks to its flexibility, conceptual simplicity, and competitive performance [157]. As seen in the very early evolutionary practices [165–168], evolutionary algorithm-based search typically consists of four key steps, including (1) sampling a set of possible architecture candidates from the search space as the child population, (2) evaluating the architecture candidates in the child population to interpret their performance, such as accuracy and efficiency, (3) reserving the top-k architecture candidates in the latest child population to form the parent population and discarding the architecture candidates with poor performance, and (4) manipulating the architecture candidates in the latest parent population to generate new architecture candidates to form the next-generation child population. The above four steps are repeated until the evolutionary process converges.
There are many other aspects in which evolutionary algorithms may differ, including (1) how to sample the initial population, (2) how to select the parent population, and (3) how to generate the child population from the parent population. Among them, generating the child population from the parent population is of utmost importance in order to produce superior architecture candidates [157]. In practice, to allow efficient exploration and exploitation [170], crossover and mutation are two of the most popular strategies to generate the child population [171, 172]. Specifically, for crossover, two random architecture candidates from the parent population are crossed to produce one new child architecture candidate. For mutation, one randomly selected architecture candidate mutates its operators with a fixed probability. However, the early evolutionary NAS works have to train a large number of stand-alone architecture candidates from scratch to evaluate their accuracy [157], and as a result, incur non-trivial computational costs [169]. A minimal sketch of this evolutionary loop is given below.
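The following Python sketch instantiates the four-step loop together with crossover and mutation. It assumes an architecture is encoded as a list of operator indices (one per layer) and that an `evaluate` function returning a fitness score (e.g., a supernet-based accuracy estimate, as discussed next) is provided; all names and sizes are illustrative.

```python
import random

# Minimal sketch of the four-step evolutionary search loop described above.
# An architecture is encoded as a list of operator indices (one per layer),
# and `evaluate` is assumed to return a fitness score (e.g., supernet accuracy).

def random_arch(num_layers, num_ops):
    return [random.randrange(num_ops) for _ in range(num_layers)]

def crossover(parent_a, parent_b):
    # Per-layer uniform crossover between two parents.
    return [random.choice(pair) for pair in zip(parent_a, parent_b)]

def mutate(arch, num_ops, prob=0.1):
    # Mutate each operator with a fixed probability.
    return [random.randrange(num_ops) if random.random() < prob else op for op in arch]

def evolutionary_search(evaluate, num_layers=12, num_ops=6,
                        population_size=50, num_parents=10, generations=20):
    # Step (1): sample the initial child population from the search space.
    population = [random_arch(num_layers, num_ops) for _ in range(population_size)]
    for _ in range(generations):
        # Step (2): evaluate every candidate in the child population.
        scored = sorted(population, key=evaluate, reverse=True)
        # Step (3): keep the top-k candidates as the parent population.
        parents = scored[:num_parents]
        # Step (4): generate the next-generation child population.
        children = []
        while len(children) < population_size:
            if random.random() < 0.5:
                children.append(crossover(*random.sample(parents, 2)))
            else:
                children.append(mutate(random.choice(parents), num_ops))
        population = children
    return max(population, key=evaluate)

# Example with a toy fitness function (replace with supernet-based evaluation).
best = evolutionary_search(evaluate=lambda arch: -sum(arch))
```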
Fig. 12. Illustration of the architecture candidate evaluation in one-shot NAS [169]. (figure from [169])
To reduce the computational resources required for neural architecture search, [169] introduces the paradigm of one-shot NAS, which has been widely applied in subsequent NAS methods [40, 42, 138] thanks to its significant search efficiency. In parallel to [169], SMASH [173] also proposes a similar one-shot NAS paradigm, but [169] is much more popular in the NAS community. Specifically, [169] designs an effective one-shot supernet as visualized in Fig. 11, which consists of all the possible architecture candidates in the search space. Therefore, we only need to train the one-shot supernet, after which we can evaluate different architecture candidates in the search space with network weights inherited from the pre-trained one-shot supernet as shown in Fig. 12. This effectively avoids training a large number of stand-alone architecture candidates from scratch. In practice, the one-shot supernet is simply trained using the standard SGD optimizer with momentum. Once the one-shot supernet is well trained, it is able to quickly and reliably approximate the performance of different architecture candidates using the paradigm of weight sharing [160]. With the well-trained one-shot supernet, it is straightforward and technically easy to leverage the standard evolutionary algorithm to search for top-performing architecture candidates with superior accuracy on target task [169]. We note that the searched architecture candidates still need to be re-trained or fine-tuned on target task in order to recover their accuracy for further deployment on target hardware.
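The weight-sharing mechanism itself is easy to illustrate. The PyTorch sketch below builds a toy supernet in which every layer holds a few candidate operators; a single-path sub-network is specified by one operator index per layer and can be evaluated directly with the inherited supernet weights. The operator set, channel sizes, and the sampled path are illustrative choices rather than those of [169].

```python
import torch
import torch.nn as nn

# Minimal PyTorch sketch of a one-shot supernet: every layer holds all operator
# candidates, and any sub-network can be evaluated with inherited weights by
# simply selecting one operator index per layer (no stand-alone re-training).

class MixedLayer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),   # candidate 0: 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),   # candidate 1: 5x5 conv
            nn.Identity(),                                 # candidate 2: skip
        ])

    def forward(self, x, op_index):
        return self.ops[op_index](x)

class Supernet(nn.Module):
    def __init__(self, channels=16, num_layers=4, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.layers = nn.ModuleList([MixedLayer(channels) for _ in range(num_layers)])
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x, arch):
        # `arch` is a list with one operator index per layer (a single path).
        x = self.stem(x)
        for layer, op_index in zip(self.layers, arch):
            x = torch.relu(layer(x, op_index))
        return self.head(x.mean(dim=(2, 3)))

supernet = Supernet()
images = torch.randn(2, 3, 32, 32)
# During supernet training, a path such as [0, 2, 1, 0] would be sampled per step
# (uniformly at random in SPOS-style training); after training, candidate paths are
# scored with inherited weights, e.g. by measuring validation accuracy of:
logits = supernet(images, arch=[0, 2, 1, 0])
```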
Furthermore, SPOS [161] investigates one-shot NAS [169] and identifies two critical issues. On the one hand, the network weights in the one-shot supernet are deeply coupled during the training process. On the other hand, the joint optimization introduces further coupling between architecture candidates and supernet weights. To address these, SPOS proposes the paradigm of single-path one-shot NAS, which uniformly samples one single-path sub-network from the supernet and trains the sampled single-path sub-network instead. This brings two main benefits, including (1) reducing the memory consumption to the single-path level and (2) improving the performance of the final searched architecture candidate. The success of SPOS has motivated a series of follow-up works [42, 152, 153, 174–179]. Note that all of the above follow-up works [42, 152, 153, 174–179] focus on training an effective and reliable supernet, which then serves as the evaluator to quickly query the performance of different architecture candidates. For example, FairNAS [174] demonstrates that the uniform sampling strategy only implies soft fairness, and to achieve strict fairness, FairNAS samples multiple single-path sub-networks to enforce that all the operator candidates in the supernet are equally optimized during each training iteration. In parallel, OFA [42] is another representative evolutionary NAS method that aims to train the supernet, after which we are allowed to detach single-path sub-networks from the supernet with inherited network weights for further deployment on target hardware. Note that the detached sub-network in OFA still requires to be fine-tuned on target task for several epochs (e.g., 25 epochs) in order to obtain competitive accuracy. To eliminate the fine-tuning process, BigNAS [175] proposes several enhancements to train one single-stage supernet, where the single-path sub-network detached from the supernet with inherited network weights can achieve superior accuracy without being re-trained or fine-tuned on target task and can be directly deployed on target hardware. This significantly saves the computational resources required for training stand-alone architecture candidates, especially when targeting multiple different deployment scenarios like multiple different hardware platforms.
Thanks to its search flexibility, evolutionary algorithm-based NAS can be easily extended to search for hardware-efficient architecture candidates, which maximize the accuracy on target task while satisfying various real-world performance constraints [153], such as latency, energy, memory, etc. Without loss of generality, we consider the following multi-objective optimization:

maximize_{arch ∈ A}  Accuracy(arch)   s.t.  Constraint_1(arch) ≤ C_1, ..., Constraint_n(arch) ≤ C_n    (6)

where {Constraint_i(·)}_{i=1}^n and {C_i}_{i=1}^n are a set of real-world performance constraints.
Gradient-Based Search. In addition to reinforcement learning-based search and evolutionary algorithm-based search, gradient-based search [138], also known as differentiable search, is another representative branch of NAS, which has since gained increasing popularity in the NAS community and motivated a plethora of subsequent differentiable NAS works [145–151, 180–188], thanks to its significant search efficiency [189]. For example, DARTS [138], as the seminal differentiable NAS work, is able to deliver one superior architecture candidate in ∼1 day on one single Nvidia GTX 1080 Ti GPU. In contrast to previous non-differentiable NAS practices [137, 144, 160, 169] that highly rely on discrete search spaces, DARTS leverages a list of architecture parameters α to relax the discrete search space to become continuous. Benefiting from the continuous search space, both the network weights w and the architecture parameters α can be optimized via alternating gradient descent. Once the differentiable search process terminates, we can interpret the optimal architecture candidate from the architecture parameters α. Specifically, the supernet in DARTS is initialized by stacking multiple over-parameterized cells (see Fig. 13 (1)), in which each cell consists of all the possible cell structures in the cell-based search space A. As shown in Fig. 13, each cell is represented using a directed acyclic graph (DAG) that consists of N nodes {x_i}_{i=1}^N. Note that the nodes here correspond to the intermediate feature maps. In addition, the directed edges between x_i and x_j correspond to a list of operator candidates {o | o ∈ O} in the operator space O. Meanwhile, the directed edges between x_i and x_j are also assigned a list of architecture parameters {α_o^(i,j) | o ∈ O}. Finally, following DARTS, we formulate x_j as follows:

x_j = \sum_{o ∈ O} [exp(α_o^(i,j)) / \sum_{o' ∈ O} exp(α_{o'}^(i,j))] · o(x_i)    (7)

Note that the output x_j is continuous with respect to x_i, α, and w. In sight of this, DARTS proposes to optimize α and w using the following bi-level optimization scheme:

minimize_α  L_val(w*(α), α)   s.t.  w*(α) = argmin_w L_train(w, α)    (8)

where L_train(·) and L_val(·) are the loss functions on the training and validation datasets, respectively. Once the differentiable search process terminates, DARTS determines the optimal architecture candidate by reserving the strongest operator and removing the other operators between x_i and x_j, in which the operator strength is defined as exp(α_o^(i,j)) / \sum_{o' ∈ O} exp(α_{o'}^(i,j)). It is worth noting that the searched optimal architecture candidate still needs to be re-trained on target task in order to recover its accuracy for further deployment on target hardware.
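The continuous relaxation in Eq (7) and the alternating update in Eq (8) can be captured in a few lines of PyTorch. The sketch below uses a first-order approximation of the bi-level problem (updating w and α alternately on training and validation batches), a toy operator set, a dummy loss, and random data; it is meant to illustrate the mechanics rather than reproduce DARTS itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the DARTS-style continuous relaxation in Eq (7): every edge
# mixes its candidate operators with softmax-normalized architecture parameters,
# and w / alpha are updated alternately (first-order approximation of Eq (8)).

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),
        ])
        # One architecture parameter per candidate operator on this edge.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)               # Eq (7) relaxation
        return sum(w * op(x) for w, op in zip(weights, self.ops))

model = nn.Sequential(MixedOp(8), MixedOp(8))
arch_params = [p for n, p in model.named_parameters() if n.endswith("alpha")]
weight_params = [p for n, p in model.named_parameters() if not n.endswith("alpha")]
opt_w = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9)
opt_a = torch.optim.Adam(arch_params, lr=3e-4)

x_train, x_val = torch.randn(4, 8, 16, 16), torch.randn(4, 8, 16, 16)
for _ in range(3):
    # Update network weights w on the training split (dummy loss for illustration).
    model.zero_grad()
    model(x_train).pow(2).mean().backward()
    opt_w.step()
    # Update architecture parameters alpha on the validation split.
    model.zero_grad()
    model(x_val).pow(2).mean().backward()
    opt_a.step()

# Discretization: keep the strongest operator on each edge.
selected = [int(op.alpha.argmax()) for op in model]
```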
Fig. 13. Overview of DARTS [138], which consists of four stages, including (1) initializing w and α in the supernet, (2) optimizing w and α via alternating gradient descent, (3) discretizing the optimal architecture candidate from the supernet, and (4) re-training the optimal architecture candidate to recover the accuracy.

Inspired by the promising performance of DARTS, a plethora of follow-up works [145–151, 180–188] have recently emerged, which strive to unleash the power of differentiable NAS so as to deliver superior architecture candidates. For example, in contrast to DARTS that simultaneously optimizes all the operator candidates in the supernet, PC-DARTS [146] introduces partial channel connections to alleviate the excessive memory consumption of DARTS. In addition, DARTS+ [148] investigates the performance collapse issue of DARTS and finds that the performance collapse issue is caused by the over-selection of skip-connect. To tackle this, DARTS+ proposes a simple yet effective early-stopping strategy to terminate the search process upon fulfilling a set of pre-defined criteria. In parallel, DARTS- [149] also observes that the performance collapse issue of DARTS comes from the over-selection of skip-connect and further leverages an auxiliary skip connection to mitigate the performance collapse issue and stabilize the search process. Apart from these, Single-DARTS [185] and Gold-NAS [184] investigate the bi-level optimization in Eq (8) and point out that the bi-level optimization may end up with sub-optimal architecture candidates, based on which Single-DARTS and Gold-NAS turn back to one-level optimization. Besides, to accelerate the search process, GDAS [186] introduces an efficient Gumbel-Softmax [190] based differentiable sampling approach to reduce the optimization complexity to the single-path level. Similar to GDAS, SNAS [187] also leverages Gumbel-Softmax reparameterization to improve the search process, which can make use of gradient information from generic differentiable loss without sacrificing the completeness of NAS pipelines. Furthermore, PT-DARTS [182] revisits the architecture selection in differentiable NAS and demonstrates that the architecture parameters α cannot always imply the optimal architecture candidate, based on which PT-DARTS introduces the perturbation-based architecture selection to determine the optimal architecture candidate at the end of search.
The aforementioned differentiable NAS works [145–151, 180–188], however, focus on accuracy-only neural architecture search, which indeed demonstrates promising performance in terms of finding architecture candidates with competitive accuracy but fails to accommodate the limited computational resources available in real-world embedded scenarios. To overcome such limitations, the paradigm of hardware-aware differentiable NAS [130, 191–194] has recently emerged, which is based on DARTS and focuses on finding top-performing architecture candidates within the cell-based search space that can achieve both high accuracy on target task and high inference efficiency on target hardware. To achieve the above goal, one widely adopted approach is to integrate a latency-constrained loss term into the overall loss function to penalize the architecture candidate with high latency, which can be mathematically formulated as follows:

minimize_α  L_val(w*(α), α) + λ · Latency(α)   s.t.  w*(α) = argmin_w L_train(w, α)    (9)

where λ is the trade-off coefficient to control the trade-off magnitude between accuracy and latency. As demonstrated in [131, 156], a larger λ ends up with an architecture candidate that maintains low accuracy and low latency, whereas a smaller λ leads to an architecture candidate with high accuracy and high latency. Besides, Latency(α) corresponds to the latency of the architecture candidate encoded by α. We note that the optimization objective in Eq (9) can be easily generalized to jointly optimize other types of hardware performance constraints, such as energy and memory consumption, in which we only need to incorporate Energy(α) and Memory(α) into the optimization objective in Eq (9). For example, we can re-formulate the optimization objective in Eq (9) as follows to jointly optimize the on-device latency, energy, and memory consumption:

minimize_α  L_val(w*(α), α) + λ_1 · Latency(α) + λ_2 · Energy(α) + λ_3 · Memory(α)    (10)

where λ_1, λ_2, and λ_3 are trade-off coefficients that determine the trade-off magnitudes between accuracy and latency, energy, and memory, respectively.

Fig. 14. Overview of TF-NAS [195], which investigates the three search freedoms in conventional hardware-aware differentiable NAS, including (1) operator-level, (2) depth-level, and (3) width-level. (figure from [195])
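To make Eqs (9) and (10) concrete, the sketch below expresses the latency term as a differentiable expectation over a pre-collected per-operator latency lookup table (in the spirit of the lookup-table-based works discussed next), so that gradients flow into the architecture parameters α. The per-operator latencies, layer count, and trade-off coefficients are made-up values for illustration only.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the latency-regularized objective in Eqs (9)-(10): the expected
# latency of the supernet is expressed as a softmax-weighted sum over a pre-collected
# per-operator lookup table, so it stays differentiable with respect to alpha.

op_latency_ms = torch.tensor([
    [1.8, 3.2, 0.1],   # layer 1: conv3x3, conv5x5, skip (made-up numbers)
    [2.0, 3.6, 0.1],   # layer 2
])
alpha = torch.zeros_like(op_latency_ms, requires_grad=True)   # architecture parameters

def expected_latency(alpha):
    weights = F.softmax(alpha, dim=1)            # per-layer operator probabilities
    return (weights * op_latency_ms).sum()       # differentiable latency estimate

def hardware_aware_loss(task_loss, alpha, lam_lat=0.1, lam_energy=0.0, lam_mem=0.0,
                        energy=None, memory=None):
    # Eq (9) with the optional extra terms of Eq (10).
    loss = task_loss + lam_lat * expected_latency(alpha)
    if energy is not None:
        loss = loss + lam_energy * energy
    if memory is not None:
        loss = loss + lam_mem * memory
    return loss

task_loss = torch.tensor(2.3)                    # placeholder for L_val(w*(alpha), alpha)
loss = hardware_aware_loss(task_loss, alpha)
loss.backward()                                  # gradients flow into alpha
```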
Despite the significant progress to date, the aforementioned hardware-aware differentiable NAS works [130, 191–194] highly rely on the cell-based search space, which first determines the optimal cell structure and then repeatedly stacks the same cell structure across the entire network [138]. However, as demonstrated in MnasNet [38], such NAS practices suffer from inferior accuracy and efficiency due to the lack of operator diversity. And even worse, the architecture candidates in the cell-based search space consist of multiple parallel branches as shown in Fig. 7, which introduce considerable memory access overheads, and as a result, are difficult to benefit from the high computational parallelism on mainstream hardware platforms [34, 35]. To overcome such limitations, recent hardware-aware differentiable NAS works [39, 40, 154, 188, 195–199] have shifted their attention from the cell-based search space (see Fig. 7) to the block-based search space (see Fig. 8). Among them, the most representative ones include FBNet [39], ProxylessNAS [40], SP-NAS [197], and TF-NAS [195]. Specifically, similar to GDAS [186] and SNAS [187], FBNet leverages Gumbel-Softmax reparameterization [190] to relax the discrete search space to be continuous. Besides, FBNet collects a simple yet effective latency lookup table to quickly approximate the latency of different architecture candidates. The pre-collected latency lookup table is then integrated into the search process to derive hardware-efficient architecture candidates. However, similar to DARTS [138], FBNet requires to simultaneously optimize all the operator candidates in the supernet during the search process, which is not scalable to large search spaces and suffers from the memory bottleneck [40, 131]. In sight of this, ProxylessNAS introduces an effective path-level binarization approach to reduce the memory consumption to the single-path level, which significantly improves the search efficiency without compromising the search accuracy. In parallel, SP-NAS demonstrates that different operator candidates in the supernet can be viewed as subsets of an over-parameterized superkernel, based on which SP-NAS proposes to encode all the operator candidates into the superkernel. In practice, this explicitly reduces the memory consumption to the single-path level, which therefore alleviates the memory bottleneck during the search process. Furthermore, TF-NAS thoroughly investigates the three search freedoms in hardware-aware differentiable NAS, including (1) operator-level search, (2) depth-level search, and (3) width-level search as shown in Fig. 14, which enables fine-grained architecture search. Besides, to obtain hardware-efficient architecture candidates, TF-NAS integrates the pre-collected latency lookup table into the search process. In the meantime, TF-NAS introduces a simple yet effective bi-sampling search algorithm to accelerate the search process towards enhanced search efficiency.
But even so, we should consider not only the explicit search cost, i.e., the time required for one single search experiment, but also the implicit search cost, i.e., the time required for manual hyper-parameter tuning in order to find the desired architecture candidate. This is because, in real-world embedded scenarios like autonomous vehicles, DNNs must be executed under strict latency constraints (e.g., 24 ms), in which any violation may lead to catastrophic consequences [20, 163]. However, to find the architecture candidate with a latency of 24 ms, the aforementioned hardware-aware differentiable NAS works [39, 40, 130, 154, 188, 191–199] have to repeat a plethora of search experiments to tune the trade-off coefficient λ (see Eq (9)) through trial and error [131, 156], which significantly increases the total search cost. The intuition behind this is that λ, despite being able to trade off between accuracy and latency, is quite sensitive and difficult to control [131, 156]. To overcome such limitations, HardCoRe-NAS [200] leverages an elegant Block Coordinate Stochastic Frank-Wolfe (BCSFW) algorithm [201] to restrict the search direction around the specified latency requirement. In addition, LightNAS [131, 156] introduces a simple yet effective hardware-aware differentiable NAS approach, which investigates the optimization objective in Eq (9) and proposes to optimize the trade-off coefficient λ during the search process in order to satisfy the specified latency requirement. In other words, LightNAS focuses on automatically learning a λ that strictly complies with the specified latency requirement, which is able to find the required architecture candidate in one single search (i.e., you only search once) and avoids performing manual hyper-parameter tuning over λ. Specifically, the optimization objective of LightNAS is formulated as follows:

minimize_α  L_val(w*(α), α) + λ · (Latency(α) / T − 1)   s.t.  w*(α) = argmin_w L_train(w, α)    (11)

where T is the specified latency requirement. Different from previous hardware-aware differentiable NAS works [39, 40, 130, 154, 188, 191–199], λ in Eq (11) is not a constant but a learnable hyper-parameter that can be automatically optimized during the search process. For the sake of simplicity, below we use L(w, α, λ) to denote the optimization objective in Eq (11). Finally, to satisfy the specified latency requirement (i.e., Latency(α) = T), w and α are updated using gradient descent [138], whereas λ is updated using gradient ascent as follows:

w* = w − lr_w · ∂L(w, α, λ)/∂w,   α* = α − lr_α · ∂L(w, α, λ)/∂α,
λ* = λ + lr_λ · ∂L(w, α, λ)/∂λ = λ + lr_λ · (Latency(α) / T − 1)    (12)

where lr_w, lr_α, and lr_λ are the learning rates of w, α, and λ, respectively. Below we further demonstrate why LightNAS guarantees Latency(α) = T. As shown in LightNAS, a larger λ leads to an architecture candidate with low latency, whereas a smaller λ results in an architecture candidate with high latency. Therefore, if Latency(α) > T, the gradient ascent scheme increases λ to reinforce the latency regularization magnitude. As a result, Latency(α) decreases towards T in the next search iteration. Likewise, if Latency(α) < T, the gradient ascent scheme decreases λ to diminish the latency regularization magnitude, after which Latency(α) increases towards T in the next search iteration.
| Method | Search Space | Search Strategy | Search Dataset | Search Cost (GPU-Hours) | GPU | Target Hardware | Hardware Modeling | ImageNet FLOPs (M) | ImageNet Top-1 Acc (%) |
|---|---|---|---|---|---|---|---|---|---|
| MnasNet [38] | Block | Reinforce | ImageNet | 40,000 | V100 | Mobile Phones | N/A | 312 | 75.2 |
| ProxylessNAS [40] | Block | Gradient | ImageNet | 200 | V100 | GPUs, CPUs, and Mobile Phones | LUT | N/A | 75.1 |
| MobileNetV3 [162] | Block | Evolution | ImageNet | N/A | N/A | Mobile Phones | N/A | 219 | 75.2 |
| FBNet [39] | Block | Gradient | ImageNet | 216 | N/A | Mobile Phones | LUT | 375 | 74.9 |
| TuNAS [163] | Block | Reinforce | ImageNet | N/A | N/A | Mobile Phones | LUT | - | 75.4 |
| OFA [42] | Block | Evolution | ImageNet | 1,200 | V100 | GPUs, CPUs, Edge GPUs, and Mobile Phones | LUT | 230 | 76.0 |
| SP-NAS [197] | Block | Gradient | ImageNet | 30 | TPU | Mobile Phones | LUT | N/A | 75.0 |
| LA-DARTS [191] | Cell | Gradient | CIFAR-10 | 17 | P100 | GPUs and CPUs | Predictor | 575 | 74.8 |
| MDARTS [194] | Cell | Gradient | CIFAR-10 | ∼6.5 | Titan XP | Eyeriss | Predictor | N/A | N/A |
| EH-DNAS [203] | Cell | Gradient | CIFAR-10 | 24 | 1080 Ti | Customized Accelerators | Predictor | 840 | 69.6 |
| E-DNAS [204] | Block | Gradient | ImageNet | N/A | V100 | CPUs and DSPs | Predictor | 365 | 76.9 |
| SNAS [198] | Block | Gradient | ImageNet | 30 | N/A | TPUs | Predictor | 1290 | 79.4 |
| HSCoNAS [152] | Block | Evolution | ImageNet | N/A | N/A | GPUs, CPUs, and Edge GPUs | LUT | N/A | 74.9 |
| DenseNAS [199] | Block | Gradient | ImageNet | 64 | Titan XP | GPUs | LUT | 361 | 75.3 |
| TF-NAS [195] | Block | Gradient | ImageNet | 43 | Titan RTX | GPUs | LUT | 284 | 75.2 |
| HardCoRe-NAS [200] | Block | Gradient | ImageNet | 400 | P100 | GPUs and CPUs | LUT | N/A | 75.7 |
| LightNAS [131] | Block | Gradient | ImageNet | 10 | RTX 3090 | Edge GPUs | Predictor | N/A | 75.2 |
| SurgeNAS [154] | Block | Gradient | ImageNet | 30 | V100 | GPUs, CPUs, and Edge GPUs | Predictor | N/A | 75.5 |
| SPOS [161] | Block | Evolution | ImageNet | 288 | 1080 Ti | GPUs | LUT | 328 | 74.7 |
| HURRICANE [153] | Block | Evolution | ImageNet | N/A | N/A | CPUs, DSPs, and VPUs | LUT | 409 | 75.1 |
| ProxyNAS [176] | Block | Evolution | ImageNet | N/A | N/A | GPUs, CPUs, TPUs, and FPGAs | Predictor | N/A | N/A |

Table 1. Comparisons of representative hardware-aware NAS works. This table roughly compares different hardware-aware NAS works, in which N/A means that the related data is not reported in the respective paper. Note that the accuracy in this table may be trained under different training recipes.
Finally, the search engine ends up with the architecture candidate that strictly satisfies the specified latency requirement (i.e., Latency(α) = T). More recently, Double-Win NAS [202] proposes deep-to-shallow transformable search to further marry the best of both deep and shallow networks towards an aggressive accuracy-efficiency win-win. Similar to LightNAS [131, 156], the resulting shallow network can also satisfy the specified latency constraint. Finally, we compare previous representative hardware-aware NAS works, which are summarized in Table 1. A minimal sketch of the learnable-λ update in Eq (12) is given below.
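The following PyTorch sketch illustrates the gradient-descent/gradient-ascent dynamics of Eqs (11) and (12) on a toy two-layer lookup-table latency model: α is updated by gradient descent while λ is updated by gradient ascent, so the expected latency is pushed towards the target T. The latencies, target, and learning rates are illustrative, and the task loss is a placeholder.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the LightNAS-style update in Eq (12): alpha follows gradient
# descent while the trade-off coefficient lambda follows gradient ascent, pushing
# the expected latency towards the specified target T. Numbers are illustrative.

op_latency_ms = torch.tensor([[1.8, 3.2, 0.1], [2.0, 3.6, 0.1]])
alpha = torch.zeros_like(op_latency_ms, requires_grad=True)
lam = torch.tensor(0.5, requires_grad=True)
target_ms = 4.0
lr_alpha, lr_lambda = 0.1, 0.05

for _ in range(100):
    weights = F.softmax(alpha, dim=1)
    latency = (weights * op_latency_ms).sum()
    task_loss = torch.tensor(0.0)                 # placeholder for L_val(w*(alpha), alpha)
    loss = task_loss + lam * (latency / target_ms - 1.0)        # Eq (11)
    grad_alpha, grad_lam = torch.autograd.grad(loss, [alpha, lam])
    with torch.no_grad():
        alpha -= lr_alpha * grad_alpha            # gradient descent on alpha
        lam += lr_lambda * grad_lam               # gradient ascent on lambda, Eq (12)
        # Note: grad_lam equals (latency / T - 1), exactly as in Eq (12).

# The expected latency of the relaxed architecture is pushed towards T over the search.
print(float((F.softmax(alpha, dim=1) * op_latency_ms).sum()))
```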
3.3 Speedup Techniques and Extensions
In this section, we further discuss recent state-of-the-art advances in general speedup techniques and extensions for NAS algorithms, including one-shot NAS enhancements, efficient latency prediction, efficient accuracy prediction, low-cost proxies, zero-cost proxies, efficient transformer search, efficient domain-specific search, and mainstream NAS benchmarks, which have the potential to significantly benefit NAS algorithms and largely facilitate the search process.
Beyond One-Shot NAS. Despite the high search efficiency, one-shot NAS often suffers from poor ranking correlation between one-shot search and stand-alone training. As pointed out in [205], one-shot search results do not necessarily correlate with stand-alone training results across various search experiments. To overcome such limitations, a plethora of one-shot NAS enhancements have been recently proposed [206–211]. Specifically, [206–210] turn to few-shot NAS. In contrast to one-shot NAS [160] that only features one supernet, few-shot NAS further introduces multiple supernets to explore different regions of the pre-defined search space, which slightly increases the search cost over one-shot NAS but can deliver much more reliable search results. For example, as shown in [206], with only up to 7 supernets, few-shot NAS can establish new state-of-the-art search results on ImageNet. Among them, [209] demonstrates that zero-cost proxies can be integrated into few-shot NAS, which can further enhance the search process of one-shot NAS and thus end up with better search results. More recently, [208] generalizes few-shot NAS to distill large language models, which focuses on automatically distilling multiple compressed student models under various computational budgets from a large teacher model. In contrast to few-shot NAS that leverages multiple supernets to improve the ranking correlation performance of one-shot NAS, CLOSE [211] instead features an effective curriculum learning-like schedule to control the parameter sharing extent within the proposed supernet dubbed CLOSENet, in which the parameter sharing extent can be flexibly adjusted during the search process and the parameter sharing scheme is built upon an efficient graph-based encoding scheme.
Efficient Latency Prediction⁴. As seen in MnasNet [38], the latency is directly measured on target hardware, which is then integrated into the RL reward (see Eq (2)) to penalize the architecture candidate with high latency. The direct on-device latency measurement is indeed accurate, which, however, is time-consuming and unscalable to large search spaces [40]. To overcome such limitations, several latency prediction strategies have been recently proposed. For example, ProxylessNAS [40], FBNet [39], and OFA [42] leverage the latency lookup table to approximate the on-device latency, which sums up the latency of all the operator candidates. In addition, HSCoNAS [152, 178] demonstrates that the data movements and communications among different operator candidates introduce additional latency overheads, making the pre-collected latency lookup table inaccurate. To mitigate this issue, HSCoNAS quantifies the latency that corresponds to the intermediate data movements and communications, which is then fed into the pre-collected latency lookup table to achieve more accurate latency prediction performance. However, the latency lookup table is only applicable to the block-based search space, which leads to unreliable latency prediction performance in terms of the cell-based search space [213]. To this end, EdgeNAS [130], LA-DARTS [191], and LC-NAS [192] propose to use learning-based approaches for the latency prediction purpose. For example, EdgeNAS trains an efficient multi-layer perceptron (MLP) to predict the latency of different architecture candidates in the cell-based search space, which can also be generalized to predict the latency of different architecture candidates in the block-based search space as shown in [131, 156, 176, 212, 214]. Furthermore, BRP-NAS [213] and SurgeNAS [154] introduce graph neural network (GNN) based latency predictors to achieve more reliable latency prediction performance. In practice, the above latency predictors (1) rely on a large number of training samples to achieve decent latency prediction performance (e.g., 100,000 training samples in EdgeNAS) and (2) need to be reconstructed for either new hardware or new search spaces. To avoid these, HELP [215] and MAPLE-Edge [216] focus on building an efficient latency predictor using only a few training samples (e.g., as few as 10 training samples in HELP), which can be generalized to new hardware or new search spaces with only minimal re-engineering efforts. More recently, EvoLP [217] considers an effective self-evolving scheme to construct efficient yet accurate latency predictors, which can adapt to unseen hardware with only minimal re-engineering efforts.

⁴ We mainly discuss latency prediction since latency is the most dominant performance constraint in hardware-aware NAS [176, 212], which can be generalized to predict other performance constraints, such as energy and memory consumption.
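A learning-based latency predictor of the kind used by EdgeNAS and LA-DARTS can be sketched in a few lines: architectures are flattened into one-hot operator encodings and a small MLP is regressed against measured latencies. The encoding, layer counts, and the synthetic latency targets below are illustrative placeholders for real on-device measurements.

```python
import torch
import torch.nn as nn

# Minimal sketch of a learning-based latency predictor: an architecture is flattened
# into a one-hot encoding (one operator choice per layer), and a small MLP regresses
# the measured on-device latency of (encoding, latency) pairs collected offline.

NUM_LAYERS, NUM_OPS = 12, 6

def encode(arch):
    # arch: list of operator indices, one per layer -> flat one-hot vector.
    x = torch.zeros(NUM_LAYERS, NUM_OPS)
    x[torch.arange(NUM_LAYERS), torch.tensor(arch)] = 1.0
    return x.flatten()

predictor = nn.Sequential(
    nn.Linear(NUM_LAYERS * NUM_OPS, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

# Toy training data: random architectures with synthetic latencies (replace with
# real measurements from the target hardware).
archs = [torch.randint(0, NUM_OPS, (NUM_LAYERS,)).tolist() for _ in range(256)]
x = torch.stack([encode(a) for a in archs])
y = torch.tensor([[1.0 + 0.5 * sum(a)] for a in archs])   # synthetic latency in ms

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
for _ in range(200):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(predictor(x), y)
    loss.backward()
    optimizer.step()

predicted_ms = predictor(encode([0] * NUM_LAYERS).unsqueeze(0))
```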
Efficient Accuracy Prediction. In parallel to latency prediction, accuracy prediction has also received increasing attention from the NAS community [213, 218–222], which strives to directly predict the accuracy of different architecture candidates in the search space. Specifically, [218] introduces a simple yet effective graph convolutional network (GCN) based accuracy predictor, which can achieve reliable accuracy prediction performance thanks to GCNs' strong capability to learn graph-structured data. Similar to [218], BRP-NAS [213] also considers GCNs for reliable accuracy prediction, and further introduces transfer learning from the pre-trained latency predictor to improve the accuracy prediction performance. In parallel, [219] leverages a non-neural-network model (i.e., GBDT) as the accuracy predictor, which has a stronger capability to learn representations than neural network based accuracy predictors. In addition, NASLib [220] investigates a wide range of accuracy predictors from learning curve extrapolation, weight-sharing, supervised learning, and zero-cost proxies on three popular NAS benchmarks (i.e., NAS-Bench-101 [223], NAS-Bench-201 [224], and NAS-Bench-NLP [225]). In particular, NASLib reveals that different accuracy predictors can be combined to achieve substantially better accuracy prediction performance than any single accuracy predictor. Furthermore, DONNA [221] proposes to build an efficient accuracy predictor, which only involves minimal computational resources, and more importantly, can scale to diverse search spaces. To achieve this, DONNA uses blockwise knowledge distillation to construct an architecture candidate pool, in which each architecture candidate only needs to be fine-tuned for several epochs to derive the accuracy rather than being trained from scratch. Different from the aforementioned accuracy predictors that feature graph-based encoding schemes, GATES [222, 226] instead models the operations as the transformation of the propagating information, which can effectively mimic the actual data processing of different neural architecture candidates. More importantly, the encoding scheme of GATES can be integrated into the above accuracy predictors to further boost their accuracy prediction performance. Similar to GATES, TA-GATES [227] also introduces an effective encoding scheme with analogous modeling of the training process of different neural architecture candidates, which can further achieve better accuracy prediction performance than GATES on various representative NAS benchmarks.

Fig. 15. Comparisons of different zero-cost proxies on CIFAR-100, including (a) LRC [228], (b) NTK [229], and (c) LRC + NTK [230], in which the accuracy is queried from NAS-Bench-201 [224]. (figure from [230])
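As a rough illustration of the graph-based predictors above, the sketch below hand-rolls a two-layer graph convolution over a cell's adjacency matrix and one-hot operator features, followed by a linear head that regresses accuracy. It is a generic GCN-style regressor rather than the exact architectures of [213, 218]; the cell, operator vocabulary, and shapes are illustrative, and in practice such a predictor would be trained with MSE on (architecture, accuracy) pairs from a NAS benchmark.

```python
import torch
import torch.nn as nn

# Minimal sketch of a graph-based accuracy predictor: a cell is described by an
# adjacency matrix and one-hot node (operator) features, two simple graph-convolution
# layers propagate information along the edges, and a linear head regresses accuracy.

class GraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj, feats):
        # Degree-normalized neighborhood aggregation followed by a linear map.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.linear((adj @ feats) / deg))

class AccuracyPredictor(nn.Module):
    def __init__(self, num_ops, hidden=64):
        super().__init__()
        self.gc1 = GraphConv(num_ops, hidden)
        self.gc2 = GraphConv(hidden, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, adj, feats):
        h = self.gc2(adj, self.gc1(adj, feats))
        return self.head(h.mean(dim=-2))           # mean-pool nodes, predict accuracy

# One toy cell: 4 nodes, 5 possible operators per node.
adj = torch.tensor([[0, 1, 1, 0],
                    [0, 0, 1, 1],
                    [0, 0, 0, 1],
                    [0, 0, 0, 0]], dtype=torch.float)
feats = torch.eye(5)[torch.tensor([0, 2, 1, 4])]    # one-hot operator per node
predictor = AccuracyPredictor(num_ops=5)
predicted_accuracy = predictor(adj, feats)
```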
Low-Cost Proxies (Learning Curve Extrapolation). Low-cost proxies, also referred to as learning curve extrapolation [231], aim to interpret the accuracy of a given architecture candidate using only its early training statistics, such as the training loss in the first few training epochs, which has motivated a plethora of subsequent works to continue exploring learning curve extrapolation [156, 232–237]. For example, different from the conventional accuracy predictor that only uses the network configuration as input features, [236] proposes to combine the network configuration and a series of validation accuracies in the first few training epochs as input features to train a simple regression model, which can be generalized to predict the accuracy of unseen architecture candidates. In addition, [232] introduces Training Speed Estimation (TSE), which simply accumulates the early training statistics to achieve reliable yet computationally cheap ranking among different architecture candidates. Besides, [156, 237] introduce Batchwise Training Estimation (BTE) and Trained Batchwise Estimation (TBE), which both consider fine-grained batchwise training statistics to provide more reliable prediction performance using minimal computational resources. In parallel, [234] introduces Loss Curve Gradient Approximation (LCGA) to rank the accuracy of different architecture candidates with minimal training. Furthermore, [233] introduces NAS-Bench-x11 to unleash the power of learning curve extrapolation by predicting the training trajectories, which can be easily integrated into the aforementioned learning curve extrapolation works to quickly estimate the performance of a given architecture candidate.
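In its simplest form, learning curve extrapolation reduces to fitting a regressor from early-epoch statistics to final accuracy, as in the sketch below, which uses synthetic curves and ordinary least squares purely for illustration; works such as [236] additionally feed the network configuration into the regressor.

```python
import numpy as np

# Minimal sketch of learning-curve-extrapolation-style accuracy prediction: the
# validation accuracies from the first few epochs of several already-trained
# candidates are used to fit a linear regressor that predicts final accuracy.

rng = np.random.default_rng(0)
num_candidates, early_epochs = 100, 5

# Synthetic learning curves: early accuracies plus a noisy "final" accuracy.
early_acc = np.sort(rng.uniform(0.3, 0.7, size=(num_candidates, early_epochs)), axis=1)
final_acc = early_acc[:, -1] + rng.uniform(0.1, 0.2, size=num_candidates)

# Fit final_acc ~ [1, early accuracies] with least squares.
features = np.hstack([np.ones((num_candidates, 1)), early_acc])
coef, *_ = np.linalg.lstsq(features, final_acc, rcond=None)

# Predict the final accuracy of a new candidate after only 5 training epochs.
new_curve = np.array([0.35, 0.45, 0.52, 0.57, 0.60])
predicted_final = np.hstack([[1.0], new_curve]) @ coef
```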
Zero-Cost Proxies⁵. In addition to the above low-cost proxies (i.e., learning curve extrapolation), zero-cost proxies have recently flourished [228–230, 238–247], which focus on interpreting the performance of a given architecture candidate in training-free manners. Specifically, zero-cost proxies, such as EPE [240], Fisher [241], GradNorm [238], Grasp [242], Jacov [243], Snip [244], SynFlow [245], ZenScore [246], LRC [228], and NTK [229], can provide reliable performance estimation using only one single mini-batch of data and one single forward/backward propagation pass, which necessitates near-zero computational cost [230, 238, 239]. Thanks to their reliable performance estimation and low cost, these zero-cost proxies have been widely adopted in recent NAS works to accelerate the search process [230, 243, 247]. In particular, as demonstrated in [230, 238], combining different zero-cost proxies may lead to more reliable ranking performance estimation than any single zero-cost proxy. For example, as shown in Fig. 15, combining LRC and NTK is able to provide more reliable ranking performance estimation than LRC or NTK itself. In sight of this, TE-NAS [230] further leverages LRC and NTK to jointly estimate the ranking performance among different architecture candidates in the search space, which quickly ends up with the optimal architecture candidate on ImageNet in less than 4 hours on one single Nvidia GTX 1080 Ti GPU.

⁵ Note that most of the covered zero-cost proxies are available at https://github.com/automl/naslib/tree/zerocost.
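The sketch below shows what such a training-free score can look like, using a gradient-norm-style proxy in the spirit of GradNorm [238]: a candidate network is scored with a single mini-batch and a single forward/backward pass, and the score is used only to rank candidates. The toy network and random batch are illustrative and do not correspond to any particular search space.

```python
import torch
import torch.nn as nn

# Minimal sketch of a gradient-norm-style zero-cost proxy: a candidate network is
# scored with one mini-batch and one forward/backward pass, without any training.
# Higher scores are taken as a (rough) signal of more trainable candidates.

def gradnorm_score(model, images, labels):
    model.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    return sum(p.grad.norm().item() for p in model.parameters() if p.grad is not None)

candidate = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))
score = gradnorm_score(candidate, images, labels)   # near-zero-cost, training-free ranking signal
```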
Fig. 16. Illustration of weight sharing [160] and weight entanglement [248]. (figure from [248])
Efficient Transformer Search. In addition to CNNs, transformers are another important branch of DNNs. Inspired by the tremendous success of NAS in searching for superior CNNs, automated transformer search has gained increasing popularity, which applies NAS techniques to automatically search for superior transformers, including transformers for NLP tasks [93, 249–254] and vision transformers for vision tasks [248, 255–260]. In practice, automated transformer search is technically the same as automated convolutional network search, as both feature the same search pipeline. For example, HAT [93], as one of the state-of-the-art NAS works in the field of NLP, focuses on searching for hardware-efficient transformers for NLP tasks. To achieve this, HAT first initializes an over-parameterized superformer that consists of all the possible transformer candidates in the search space, which is technically the same as the supernet in automated convolutional network search. After that, HAT trains the superformer using the standard weight-sharing technique [160], which then serves as the accuracy predictor to quickly interpret the accuracy of different transformer candidates. In the meantime, HAT builds an efficient latency predictor to avoid the tedious on-device latency measurement. Finally, HAT applies the standard evolutionary algorithm to find hardware-efficient transformer candidates with both high accuracy and high efficiency, which is technically the same as OFA [42] that searches for hardware-efficient convolutional networks. Furthermore, due to the tremendous success of vision transformers in vision tasks as discussed in Section 2.2, a plethora of NAS works [248, 255–260] have been subsequently proposed to automate the design of superior vision transformers. Among them, [248], as the first one, introduces an evolutionary algorithm-based NAS framework dubbed AutoFormer. Similar to HAT, AutoFormer first constructs an over-parameterized superformer that consists of all the possible vision transformer candidates in the search space, which is then trained using the weight entanglement scheme. The difference between weight sharing and weight entanglement is visualized in Fig. 16, in which weight entanglement is technically similar to the superkernel in SP-NAS [197]. Finally, AutoFormer applies the standard evolutionary algorithm to explore the optimal vision transformer candidate. These clearly demonstrate that we can easily leverage recent state-of-the-art NAS techniques that focus on searching for competitive CNNs to automate the design of top-performing transformers for both NLP and vision tasks.
Efficient Domain-Specific Search. In addition to image classification, NAS can also be applied to a wide range of real-world scenarios, such as object detection [268–270], semantic segmentation [271–273], point cloud processing [192, 274–276], image super-resolution [277, 278], etc. For example, MobileDets [268] are a family of hardware-efficient object detection networks, which can deliver promising detection accuracy while maintaining superior detection efficiency on multiple embedded computing systems, including mobile CPUs, edge TPUs, and edge GPUs. Specifically, MobileDets first construct an enlarged search space that contains a large number of possible object detection networks and then leverage an MnasNet-like reinforcement learning-based search algorithm [38] to find top-performing object detection networks, which also features the same reward function as TuNAS [163] to trade off between detection accuracy and efficiency. Besides, [192] introduces an efficient hardware-aware differentiable NAS framework dubbed LC-NAS, aiming to automate the design of competitive network solutions for point cloud processing. Here, similar to EdgeNAS [130] and LA-DARTS [191] that focus on finding top-performing architecture candidates for image classification, LC-NAS exploits the same cell-based search space and integrates the latency constraint into the optimization objective to penalize the architecture candidate with high latency. These demonstrate that we can easily include domain-specific knowledge (e.g., domain-specific search spaces) into mainstream NAS techniques (e.g., differentiable, evolutionary algorithm-based, and reinforcement learning-based NAS) to search for domain-specific network solutions.
Mainstream NAS Benchmarks. Although NAS has achieved substantial performance improvement across various NLP and vision tasks, fair comparisons between different NAS works are frustratingly hard and still an open issue as demonstrated in [279]. This is because different NAS works may feature quite different training recipes, such as different training epochs and training enhancements. For example, DARTS+ [148] trains the searched architecture candidate on CIFAR-10 for 2,000 epochs, whereas DARTS [138] only applies 600 training epochs. Meanwhile, DARTS+ trains the searched architecture candidate on ImageNet for 800 epochs with a batch size of 2,048, where AutoAugment [280] is also integrated in order to achieve stronger data augmentations. In contrast, DARTS only applies 250 training epochs with a batch size of 128 by default. We note that, for the same architecture candidate, longer training epochs and stronger data augmentations typically achieve better training accuracy on target task as shown in [279]. Furthermore, RandomNAS [281] challenges the effectiveness of early state-of-the-art NAS works and demonstrates that random search, as one strong search baseline to explore random networks, can achieve even better performance on target task than early state-of-the-art NAS works. In parallel, RandWire [282] shows that randomly wired networks can also exhibit strong accuracy on ImageNet. Therefore, it remains unknown whether the performance improvement of NAS is due to the more advanced training recipe or the search algorithm itself, making it difficult to evaluate and compare the technical contributions of different NAS works [223, 224, 279].
To overcome such limitations, a plethora of tabular and surrogate NAS benchmarks have been subsequently proposed, including NAS-Bench-101 [223], NAS-Bench-201 [224], NATS-Bench [261], NAS-Bench-301 [262], NAS-Bench-360 [263], NAS-Bench-1Shot1 [264], NAS-Bench-ASR [265], NAS-Bench-Graph [266], NAS-Bench-NLP [225], HW-NAS-Bench [214], NAS-Bench-x11 [233], and NAS-Bench-Suite [267]. We note that NAS benchmarks typically have two important parts, including the pre-defined search space and the related performance metrics for all the possible architecture candidates, which can be easily queried.
| Benchmark | Size | Search Space Type | Tabular | Surrogate | Tasks | Datasets | Metrics |
|---|---|---|---|---|---|---|---|
| NAS-Bench-101 [223] | 423k | Cell-Based | ✓ | ✗ | Image Classification | CIFAR-10 | Training/Validation/Testing Accuracy, Training Time, and Number of Parameters |
| NAS-Bench-201 [224] | 15.6k | Cell-Based | ✓ | ✗ | Image Classification | CIFAR-10, CIFAR-100, and ImageNet-16-120 | Training/Validation/Testing Accuracy, Training/Validation/Testing Loss, Training Time, Number of FLOPs, and Number of Parameters |
| NATS-Bench [261] | 39.3k | Cell-Based | ✓ | ✗ | Image Classification | CIFAR-10, CIFAR-100, and ImageNet-16-120 | Training/Validation/Testing Accuracy, Training/Validation/Testing Loss, Training Time, Number of FLOPs, and Number of Parameters |
| NAS-Bench-301 [262] | 10^18 | Cell-Based | ✗ | ✓ | Image Classification | CIFAR-10 | Validation Accuracy |
| NAS-Bench-360 [263] | N/A | Cell- and Block-Based | ✓ | ✗ | 10 Diverse Tasks | 10 Diverse Datasets | N/A |
| NAS-Bench-1Shot1 [264] | 399k | Cell-Based | ✓ | ✗ | Image Classification | CIFAR-10 | Validation Accuracy |
| NAS-Bench-ASR [265] | 8.2k | Cell-Based | ✓ | ✗ | Automatic Speech Recognition | TIMIT | CTC Loss, Phoneme Error Rate (PER), On-Device Latency, Number of FLOPs, and Number of Parameters |
| NAS-Bench-Graph [266] | 26.2k | Cell-Based | ✓ | ✗ | 9 Graph Tasks | 9 Graph Datasets | Training/Validation/Testing Loss, Validation Accuracy, On-Device Latency, and Number of Parameters |
| NAS-Bench-NLP [225] | 14k | Cell-Based | ✓ | ✗ | Language Understanding | PTB and WikiText-2 | Testing Perplexity, Training Time, and Number of Parameters |
| NAS-Bench-111 [233] | 423k | Cell-Based | ✗ | ✓ | Image Classification | CIFAR-10 | Training/Validation/Testing Accuracy and Training/Validation/Testing Loss |
| NAS-Bench-311 [233] | 10^18 | Cell-Based | ✗ | ✓ | Image Classification | CIFAR-10 | Same as NAS-Bench-111 |
| NAS-Bench-NLP11 [233] | 10^53 | Cell-Based | ✗ | ✓ | Language Understanding | PTB | Same as NAS-Bench-111 |
| NAS-Bench-Suite [267] | N/A | Cell-Based | ✓ | ✓ | A suite of 11 tabular and surrogate NAS benchmarks | | |
| HW-NAS-Bench [214] | 15.6k | Cell-Based | ✓ | ✗ | Image Classification | CIFAR-10, CIFAR-100, and ImageNet-16-120 | On-Device Latency |
| HW-NAS-Bench [214] | 10^21 | Block-Based | ✗ | ✓ | Image Classification | CIFAR-100 and ImageNet | On-Device Latency |

Table 2. Comparisons of different NAS benchmarks, in which Size and Type describe the search space and Tabular/Surrogate indicate how the performance metrics can be queried. Note that ImageNet-16-120 is a subset of ImageNet that consists of 120 object categories, in which the input image resolution is fixed to 16×16 [224].
Specifically, in tabular NAS benchmarks [214, 223–225, 261, 263–266], all the possible architecture candidates are enumerated and trained from scratch on target task, respectively, to obtain the performance metrics, such as the training and validation accuracy. In contrast, surrogate NAS benchmarks [214, 233, 262, 267] leverage learning-based methods to predict the performance metrics of different architecture candidates rather than directly enumerating and training all the possible architecture candidates on target task, thus leading to significantly reduced computational resources. In sight of this, surrogate NAS benchmarks can be easily extended to deal with larger search spaces than tabular NAS benchmarks (10^18 in NAS-Bench-301 [262] vs. 15,625 in NAS-Bench-201 [224]). Finally, we compare and summarize the aforementioned state-of-the-art NAS benchmarks in Table 2.
3.4 Future Envision
In this section, we further envision several promising future trends and possible directions in the field of automated network design, which are summarized as follows:
(1) General Search Spaces. The success of NAS highly relies on well-engineered search spaces, such as the cell-based search space [137, 144, 160] and the block-based search space [38–40]. In the past, researchers manually designed search spaces using heuristic-based strategies, which are typically based on existing state-of-the-art networks, such as MobileNets [123, 268] and ShuffleNets [34, 35]. This effectively restricts the search space to improve the search efficiency and delivers competitive architecture candidates with promising accuracy and efficiency. In the meantime, this, however, may significantly limit the search performance, since it may reject more competitive architecture candidates outside the well-engineered search space. To overcome such limitations, [283, 284] pioneer the design of more general search spaces than the cell-based and block-based search spaces, which, unfortunately, remain under-explored since [283, 284] still suffer from human biases. Therefore, one promising future direction in the field of NAS is to innovate and explore more general search spaces to unleash the power of automated network design.
(2) Fully Automated Architecture Search. Early NAS practices either focus on searching for the optimal architecture candidate [137, 144, 160], the optimal data augmentation [280, 285], the optimal activation function [286, 287], or the optimal training recipe [288, 289]. As demonstrated in FBNetV3 [288] and AutoHAS [289], different architecture candidates may prefer different training recipes, in which jointly searching for the optimal architecture candidate and its tailored training recipe has the potential to push forward the attainable accuracy. This observation can be easily generalized. For example, different architecture candidates may prefer different data augmentations. Therefore, one promising future direction in the field of NAS is fully automated search, which jointly searches for the optimal architecture candidate and its tailored data augmentation, activation function, and training recipe in one single search experiment to maximize the attainable accuracy.
(3) Multi-Task Architecture Search. Previous NAS works typically focus on searching for task-specific architecture candidates that can achieve promising performance on the specified task, such as image classification [138, 146], object detection [268–270], and semantic segmentation [271–273]. This search paradigm, however, significantly increases the total search cost as the number of tasks grows since we have to conduct search experiments for each task, respectively. To alleviate this issue, FBNetV5 [290] takes the first step to search for multi-task architecture candidates that can achieve competitive performance across multiple tasks, including image classification on ImageNet [80], object detection on COCO [291], and semantic segmentation on ADE20K [292]. Nonetheless, this is far from enough since we have a large number of tasks in real-world scenarios. Therefore, one promising future direction in the field of NAS is multi-task search, which automates the design of top-performing architecture candidates that can be generalized to multiple different tasks without being re-engineered (i.e., once for all).
(4) Dynamic Architecture Search. Previous NAS works [137, 138, 144, 160] typically focus on searching for static neural networks that can only run at fixed computational budgets and thus cannot adapt to lower or higher computational complexity. In contrast, dynamic neural networks, such as slimmable neural networks [293–295], are another important branch of DNNs, which can be executed to accommodate different computational resources in real-world environments. This is because, even on the same hardware device, the available computational resources may vary with respect to time. For example, mobile phones may be in low-power or power-saving modes to reduce the power consumption. To overcome such limitations, deploying multiple static neural networks on the same hardware device seems to be the first-in-mind solution, which, unfortunately, demands high on-device storage requirements. Therefore, one promising future direction in the field of NAS is to search for top-performing dynamic neural networks, which can instantly, adaptively, and efficiently trade off between accuracy and inference efficiency to accommodate the rapidly-changing computational budgets in real-world embedded computing scenarios.
(5) Hybrid Architecture Search. As discussed in Section 2, both convolutional networks and vision transformers have their own technical merits when applied to vision tasks. Specifically, convolutional networks demonstrate superior efficiency on target hardware, whereas vision transformers achieve better accuracy on the target task. In light of this, designing hybrid networks on top of both convolutional networks and vision transformers has the potential to push forward accuracy-efficiency trade-offs. Nonetheless, previous NAS works typically focus on searching for either convolutional networks [137, 138, 144, 160] or vision transformers [248, 256, 257, 260]. To alleviate this, [296, 297] have taken the very first steps to investigate hybrid architecture search, which, however, still remains under-explored. Therefore, one promising future direction in the field of NAS is to search for competitive hybrid networks that combine the technical merits of both convolutional networks and vision transformers to achieve better accuracy-efficiency trade-offs.
(6) Explainable Architecture Search. Previous representative NAS works [40, 42, 138, 144] highly rely on the weight-sharing paradigm [160], also known as one-shot NAS, which initializes an over-parameterized supernet that consists of all the possible architectures in the search space and then searches for the optimal architecture candidate within the supernet through weight-sharing. Despite the promising search efficiency, one-shot NAS has been widely criticized due to its limited explainability, which implies that weight-sharing may lead to sub-optimal architectures due to weight interference. And even worse, the intuition behind one-shot NAS still remains unknown in the NAS community. To alleviate this, a plethora of zero-shot NAS works have been recently proposed [228–230, 238–247], which leverage zero-cost proxies to quickly interpret the accuracy of different architectures. However, existing zero-cost proxies still cannot achieve reliable performance estimation as shown in Fig. 15. Therefore, one promising future direction in the field of NAS is to develop more explainable NAS techniques and innovate more reliable zero-cost proxies.
(7) Meta Architecture Search. Meta-learning [298], also referred to as learning-to-learn, aims to facilitate and accelerate common learning-based practices such that the learned model can quickly adapt to unseen tasks/environments using minimal engineering efforts. For example, HELP [215] introduces an efficient meta-learning based latency predictor, which can be generalized to new hardware platforms using as few as 10 latency measurements. In particular, the widely used weight-sharing paradigm in the field of NAS can be considered to be a special case of meta-learning, which takes the over-parameterized supernet as the meta-model. As the weight-sharing paradigm has been dominating recent advances in the field of NAS, one promising future direction is to explore meta-learning to accelerate the search process and enhance the few-shot learning capability [299–301].
Fig. 17. Illustration of different structured and non-structured pruning strategies: (a) network without pruning, (b) weight pruning, (c) channel pruning, and (d) layer pruning. Among them, weight pruning is non-structured, whereas channel pruning and layer pruning are structured.
4 NETWORK COMPRESSION FOR EMBEDDED COMPUTING SYSTEMS
In addition to designing novel networks, another alternative is to compress existing networks at hand, either manually designed or automatically searched, to reduce the network redundancy, which therefore leads to network variants with better accuracy-efficiency trade-offs. As illustrated in previous relevant literature [82, 302], there are three popular branches of network compression techniques, including network pruning, network quantization, and network distillation. Note that these three branches are parallel to each other as shown in [303], which indicates that they can be combined to further enable better accuracy-efficiency trade-offs. Among them, network pruning and network quantization focus on improving the accuracy-efficiency trade-off from the efficiency perspective, whereas network distillation enhances the accuracy-efficiency trade-off from the accuracy perspective. To this end, we, in this section, systematically discuss recent state-of-the-art network compression techniques. For better understanding, we divide these network compression techniques into three main categories and sub-sections, including network pruning in Section 4.1, network quantization in Section 4.2, and network distillation in Section 4.3, since these network compression techniques feature different algorithms to improve the accuracy-efficiency trade-off from different perspectives. Note that these network compression techniques can typically generalize across different networks (e.g., convolutional networks and transformers). For example, we can leverage knowledge distillation to enhance the training process of both convolutional networks and transformers towards better training accuracy.
4.1 Network Pruning
The rationale behind network pruning is that DNNs are usually over-parameterized and redundant in terms of network weights and channels [304, 305]. As such, eliminating redundant network weights and channels can largely benefit the network efficiency at the cost of minimal accuracy loss, thus being able to accommodate the limited available computational resources and rigorous storage requirements in real-world embedded scenarios. Following previous well-established pruning conventions, we divide recent state-of-the-art pruning methods into two main categories according to their pruning granularity, in which non-structured pruning (i.e., weight pruning) is fine-grained, whereas structured pruning (i.e., channel pruning and layer pruning) is coarse-grained. As illustrated in Fig. 17, weight pruning focuses on removing the redundant weight connections, whereas channel pruning and layer pruning focus on removing the redundant channels and layers. In practice, both non-structured pruning and structured pruning can explore simplified network structures with optimized computational efficiency. Nonetheless, non-structured weight pruning highly relies on specialized hardware accelerators [29] and cannot provide realistic runtime speedups on modern embedded computing systems due to irregular network sparsity [305, 306]. In contrast, structured channel pruning and layer pruning are coarse-grained and do not introduce irregular network sparsity, and thus can deliver realistic runtime speedups on modern embedded computing systems. For better coverage, below we further discuss recent representative works in the field of both non-structured pruning and structured pruning, which are also summarized in Fig. 19.
4.1.1 Non-Structured Pruning. Non-structured pruning, also referred to as weight pruning, removes the less important network weights, and is typically more fine-grained than structured pruning as illustrated in Fig. 17. In particular, applying weight pruning for network compression can be traced back to the early 1990s. For example, Optimal Brain Damage [307] and Optimal Brain Surgeon [308], as the very early weight pruning approaches, pioneer to investigate the efficacy of weight pruning on vanilla fully-connected networks, in which the less important network weights are removed based on the Hessian of the loss function. More recently, [29] proposes a simple yet effective weight pruning technique to compress deep convolutional networks, such as AlexNet [76] and VGGNet [1], instead of vanilla fully-connected networks. Specifically, [29] observes that the network weights with smaller magnitudes typically contribute less to the network accuracy, based on which [29] removes the less important network weights with smaller magnitudes. Subsequently, this weight pruning technique is further integrated into Deep Compression [89] to obtain highly compressed networks, making it possible to aggressively reduce the network size without sacrificing the network accuracy. For example, Deep Compression is able to significantly reduce the network size of VGGNet by 49×, from 552 MB to 11.3 MB, while maintaining comparable accuracy on ImageNet. Nonetheless, the reduction in terms of the network size cannot directly translate into speedups on target hardware since the resulting compressed networks are of highly irregular network sparsity. To overcome such limitations, EIE [306] designs an efficient specialized inference engine to maximize the inference efficiency of compressed networks. In parallel, [309] proposes an efficient data-free weight pruning approach to iteratively remove redundant network weights. Besides, [310] and [311] leverage Variational Dropout and $L_0$-norm regularization-based stochastic gates, respectively, to remove the less important network weights.
Fig. 18. Distribution of weight gates [312].
Weight Importance Criteria. The core of weight pruning is to determine the importance of different network weights, based on which we can easily rank different network weights and remove the less important ones at the cost of minimal accuracy loss. There have been several representative importance criteria to measure the importance of different network weights after the network is trained. Among them, the most straightforward criterion is based on the weight magnitude thanks to its conceptual simplicity and surprisingly strong performance, which leverages the absolute weight $|w|$ to interpret the importance of different network weights (i.e., the larger, the more important) [29, 313, 314]. The rationale behind magnitude-based weight pruning is that smaller network weights typically contribute less to the output of the network. Apart from this, other importance criteria include second-order derivative-based [307, 315], Taylor expansion-based [316], and output sensitivity-based [317] strategies. More recently, GDP [312] proposes an effective strategy, namely gates with differentiable polarization, which introduces learnable gates to interpret the importance of different network weights. In particular, GDP encourages a large margin between exact zero gates and non-zero gates as shown in Fig. 18, while still allowing gradient optimization. Finally, GDP removes the network weights with exact zero gates and further merges the remaining non-zero gates into the resulting pruned network without hurting the network accuracy once the optimization process terminates. Despite the impressive progress to date, the design of efficient and effective importance criteria is quite under-explored and still remains an open challenge in the community.
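To make the magnitude criterion concrete, below is a minimal PyTorch-style sketch (not taken from any of the cited works) that keeps the largest-magnitude weights of a single layer and zeroes out the rest; the function name and the 90% sparsity setting are illustrative assumptions.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a binary mask that zeroes out the smallest-magnitude weights.

    `sparsity` is the fraction of weights to remove (e.g., 0.9 keeps only 10%).
    """
    num_prune = int(weight.numel() * sparsity)
    if num_prune == 0:
        return torch.ones_like(weight)
    # Rank all weights by absolute value and take the num_prune-th smallest
    # magnitude as the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(num_prune).values
    return (weight.abs() > threshold).float()

# Usage sketch: prune 90% of a layer's weights and apply the mask.
w = torch.randn(256, 128)
mask = magnitude_prune(w, sparsity=0.9)
w_pruned = w * mask
```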
Sparse Network Acceleration. Different from structured pruning, which is hardware-friendly, non-structured pruning, despite being able to maintain competitive accuracy under high compression ratios, introduces considerable irregular network sparsity, making it difficult to parallelize the resulting sparse networks on mainstream hardware systems like GPUs and CPUs [306]. This indicates that non-structured pruning highly relies on specialized hardware to achieve superior on-device speedups. To this end, a plethora of specialized hardware accelerators [306, 318–325] and compiler-based optimization techniques [326, 327] have been recently developed to accelerate the on-device inference of sparse networks, which typically focus on improving the irregular memory access on target hardware. For example, Cambricon-X [321] features an access-efficient indexing module to select and transfer irregular network weights to different processing elements (PEs) with reduced bandwidth requirements. This indexing module allows each PE to store irregular network weights for local computation in an asynchronous manner, and as a result, significantly reduces the irregular memory access overheads across different PEs.
Sparse Training Techniques. As shown in [29], weight pruning can effectively lead to efficient sparse network variants that are up to 90% smaller than the unpruned network, significantly alleviating the storage requirements and reducing the computational complexity. Despite the promising efficiency improvement, training the resulting sparse networks is quite challenging, in which using conventional training strategies may lead to non-negligible accuracy loss as demonstrated in [328]. To recover the accuracy, a plethora of sparse training techniques have been developed to train the resulting sparse networks [29, 89, 329–332]. For example, [29] proposes to fine-tune the pruned sparse network with inherited network weights from the unpruned network. Furthermore, [89] generalizes the above fine-tuning strategy to become iterative, where multiple iterations of pruning and fine-tuning are repeated to recover the attainable accuracy. In addition, [330] investigates the performance collapse issue of training sparse networks, which may simply be trapped into sub-optimal local minima. To escape the sub-optimal local minima, [330] proposes to traverse extra dimensions between dense and sparse sub-spaces during training sparse networks. More recently, [331] introduces an alternative sparse training strategy, which customizes the sparse training techniques to deviate from the default vanilla training protocols, consisting of introducing ghost neurons and skip connections at the early training stage, and strategically modifying the initialization as well as the labels. Below we further introduce another representative branch of weight pruning and sparse training techniques, namely the lottery ticket hypothesis [328], which demonstrates that the pruned sparse networks, when properly initialized, can be trained from scratch to recover accuracy comparable to the unpruned network.
Lottery Ticket Hypothesis. The lottery ticket hypothesis [328] is a special case of non-structured pruning and has since gained increasing popularity in the pruning community [333–338]. Specifically, the lottery ticket hypothesis reveals that a randomly-initialized unpruned network contains sparse sub-networks (i.e., winning tickets) that are initialized such that, when trained in isolation, they can match the accuracy of the unpruned network after training for at most the same number of iterations. In particular, the winning tickets can be up to 90% smaller than the unpruned network, while at the same time maintaining comparable accuracy. To identify the winning ticket, the lottery ticket hypothesis [328] proposes to leverage the following steps:
(1) Randomly initialize the unpruned network $f(x; \theta_0)$, in which $\theta_0 \sim \mathcal{D}_\theta$;
(2) Train the unpruned network for a number of $j$ iterations, arriving at weights $\theta_j$;
(3) Prune $p\%$ of the weights in $\theta_j$, creating the sparse mask $m \in \{0, 1\}^{|\theta_0|}$;
(4) Reset the remaining weights to their values in $\theta_0$, creating the winning ticket $f(x; m \odot \theta_0)$.
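The four steps above can be sketched as follows in PyTorch-style code; `train_fn`, the per-round pruning fraction, and the number of rounds are illustrative placeholders rather than the exact protocol of [328].

```python
import copy
import torch

def lottery_ticket(model, train_fn, prune_fraction=0.2, rounds=5):
    """Iterative magnitude pruning sketch following the four steps above.

    `train_fn(model)` is a placeholder that trains the model in place; the
    mask is re-applied to the rewound weights after every round.
    """
    theta_0 = copy.deepcopy(model.state_dict())           # step (1): save initialization
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model)                                    # step (2): train to theta_j
        for name, param in model.named_parameters():      # step (3): prune a fraction per round
            alive = param[masks[name].bool()].abs()
            k = int(alive.numel() * prune_fraction)
            if k == 0:
                continue
            threshold = alive.kthvalue(k).values
            masks[name] *= (param.abs() > threshold).float()
        model.load_state_dict(theta_0)                     # step (4): rewind to theta_0
        with torch.no_grad():
            for name, param in model.named_parameters():
                param.mul_(masks[name])                    # winning ticket m ⊙ theta_0
    return model, masks
```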
As demonstrated in [328], directly pruning $p\%$ of the network weights in one shot may lead to significant accuracy loss, which also makes the training process unstable. To overcome such limitations, the lottery ticket hypothesis proposes to iteratively repeat the above steps, also referred to as iterative pruning, which repeatedly trains, prunes, and resets the network weights over $n$ rounds, where each round prunes $p^{1/n}\%$ of the network weights, respectively. The lottery ticket hypothesis opens up the possibility and provides empirical guidelines to train sparse networks from scratch to match the accuracy of the unpruned network. Subsequently, [333, 334] prove the lottery ticket hypothesis from insightful theoretical perspectives. In parallel, [335] investigates the performance collapse issue of the lottery ticket hypothesis, especially when dealing with deeper networks, such as ResNets [2] and DenseNets [3], based on which [335] proposes an effective modified iterative pruning scheme called rewinding iteration to stabilize the lottery ticket hypothesis. Furthermore, [336–338] generalize the lottery ticket hypothesis to other types of networks beyond convolutional networks, such as graph networks [336], spiking networks [110], and photonic networks [338].
Semi-Structured Pruning. In contrast to the above mainstream non-structured pruning methods that introduce considerable irregular network sparsity, semi-structured pruning focuses on removing the less important consecutive weight connections [339]. The resulting semi-structured sparse networks exhibit less irregular network sparsity than non-structured pruning and are well supported by some existing deep learning libraries (e.g., cuSPARSElt [340] and TVM [341]), and thus can maintain much higher parallelism and speedups on modern embedded computing systems than non-structured pruning. For example, popular BERT models can achieve about 1.3× to 1.6× runtime inference speedups on Nvidia A100 GPUs using the optimized sparse tensor cores [340, 342]. Thanks to its superior accuracy-efficiency trade-offs over non-structured pruning, semi-structured pruning has been widely employed to optimize the computational complexity of convolutional networks [343, 344], transformers [342, 345], and large language models [346, 347]. For example, [344] introduces an effective channel permutation scheme to optimize the attainable accuracy of the resulting semi-structured sparse convolutional network. [343] introduces the sparse-refined straight-through estimator (SR-STE) to explore the optimal semi-structured sparse convolutional network. In addition to convolutional networks, [342] and [345] introduce the alternating direction method of multipliers (ADMM) and progressive gradient flow to explore the optimal semi-structured sparse transformer for real-world language processing tasks. More recently, [346, 347] investigate semi-structured sparsity to enhance the inference efficiency of large language models. In parallel to the above semi-structured pruning methods that optimize the inference efficiency, [348, 349] instead focus on optimizing the training efficiency of semi-structured sparse networks. Furthermore, [350–352] also focus on exploring dedicated hardware accelerators to further enhance the runtime inference efficiency of semi-structured sparse networks.
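As a concrete illustration of semi-structured sparsity, the following hedged sketch builds a 2:4 mask that keeps the two largest-magnitude weights in every group of four consecutive weights; it only produces the mask and does not invoke any sparse library or sparse tensor cores.

```python
import torch

def two_four_mask(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every group of 4 (2:4 sparsity)."""
    # Assumes the number of weights is divisible by 4.
    flat = weight.abs().reshape(-1, 4)
    # Indices of the two largest entries within each group of four.
    topk = flat.topk(k=2, dim=1).indices
    mask = torch.zeros_like(flat)
    mask.scatter_(1, topk, 1.0)
    return mask.reshape(weight.shape)

w = torch.randn(64, 64)
w_sparse = w * two_four_mask(w)   # every 4 consecutive weights keep exactly 2
```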
Non-Structured Pruning: Weight Importance [29, 307, 312–317]; Sparse Acceleration [306, 318–327]; Sparse Training [29, 89, 329–332]; Lottery Ticket Hypothesis [328, 333–338]; Semi-Structured Pruning [339, 342–352].
Structured Pruning: Weight-Based [21–23, 353–356]; Activation-Based [244, 295, 357–363]; Statistics-Based [364–368]; Search-Based [369–377]; Layer-Based [30, 378–382].
Fig. 19. Illustration of non-structured and structured pruning works that have been discussed in Section 4.1.
4.1.2 Structured Pruning. In parallel to non-structured pruning, structured pruning, including channel pruning6 and layer pruning, is another popular branch, which removes the less important channels or layers to reduce the network complexity as shown in Fig. 17. In practice, layer pruning is a special case of channel pruning: channel pruning is equivalent to layer pruning when all the channels in the same layer are removed. We emphasize that non-structured pruning, despite being able to achieve significant compression ratios, introduces considerable irregular network sparsity, and the resulting pruned network also features irregular computational patterns, which highly relies on specialized hardware accelerators to achieve realistic speedups as demonstrated in [306]. In contrast, structured pruning can easily achieve realistic speedups on mainstream hardware such as GPUs and CPUs, thanks to its high on-device parallelism [305]. This unique technical merit has been making structured pruning more and more popular, especially in the context of designing hardware-friendly network solutions [383]. With the above in mind, below we further elaborate on recent representative structured pruning works, which can be roughly divided into the following four categories, including weight-based pruning, activation-based pruning, batch normalization statistics-based pruning, and search-based pruning.
6 Filter pruning is another name for channel pruning since removing filters is technically equivalent to removing channels [383]. In this work, we use channel pruning by default and may interchangeably use filter pruning and channel pruning.
Weight-Based Pruning. Weight-based pruning, also referred to as weight-dependent pruning, determines the importance of different channels based on the corresponding weights, which is technically similar to magnitude-based pruning as discussed in Section 4.1.1. There have been two popular weight-based pruning criteria, including weight norm and weight correlation. Without loss of generality, we can easily calculate the $L_n$-norm as $\|w\|_n$, where $w$ is the corresponding network weight. For example, [21] proposes to remove the less important channels based on their $L_1$-norm values, which indicates that the channel with a smaller $L_1$-norm is considered less important and contributes less to the network output. Besides, [22] observes that the $L_2$-norm can achieve better pruning performance than the $L_1$-norm. Furthermore, [23] challenges the empirical assumption in [21, 22] and demonstrates that the channels with smaller $L_1$- and $L_2$-norm magnitudes are not necessarily less important. To avoid this, [23] instead turns to the channel correlation, which reveals that the channels close to the geometric median are typically redundant since they represent similar feature maps in the same layer. As a result, removing the channels close to the geometric median only leads to minimal accuracy loss. Inspired by the promising performance of [23], [353, 354] propose to first apply scalar hashing on the weights of each layer and then remove redundant channels based on the corresponding weight similarity. This is because similar channels are of high redundancy in terms of their contributions to the network representation capability. In addition, unlike [21–23, 353, 354] that measure the channel redundancy in the same layer, [355] investigates the channel redundancy across multiple different layers in order to minimize the accuracy loss. Furthermore, [356] prioritizes removing the channels in more redundant layers rather than globally ranking different channels across all the network layers.
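As a minimal illustration of weight-based structured pruning, the sketch below ranks the filters of a convolutional layer by their L1 norms in the spirit of [21]; the function name, the OIHW weight layout, and the 50% keep ratio are assumptions for illustration only.

```python
import torch

def l1_filter_ranking(conv_weight: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Rank the output channels (filters) of a conv layer by their L1 norm and
    return the indices of the filters to keep (assumes OIHW weight layout)."""
    # L1 norm of each filter: sum of absolute weights over (in, kH, kW).
    scores = conv_weight.abs().sum(dim=(1, 2, 3))
    num_keep = max(1, int(scores.numel() * keep_ratio))
    return scores.topk(num_keep).indices.sort().values

w = torch.randn(64, 32, 3, 3)                 # 64 filters
kept = l1_filter_ranking(w, keep_ratio=0.5)   # indices of the 32 most important filters
w_pruned = w[kept]                            # structurally pruned weight tensor
```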
Activation-Based Pruning. Activation-based pruning typically leverages the intermediate activation maps to interpret the importance of different channels, in which activation maps, also known as feature maps, correspond to the output features of one specific network layer. As demonstrated in [383], there have been three representative techniques to determine the importance of different channels in the $L$-th layer, including (1) using the activation maps of the $L$-th layer, (2) using the activation maps of adjacent layers (e.g., the $(L+1)$-th and $(L+2)$-th layers), and (3) using the activation maps of the last layer (i.e., the network output):
(1) Current Layer. To determine the importance of different channels in the $L$-th layer, [357] proposes a simple yet effective two-step scheme, which first removes different channels and then measures the reconstruction error based on the output activation maps of the $L$-th layer. Next, the channels that lead to smaller reconstruction errors are removed to reduce the network complexity. Similarly, [358] measures the channel importance according to the decomposition error. Furthermore, subsequent works utilize the channel independence [359] and post-activation maps [360] to measure the channel importance.
(2) Adjacent Layers. Recent state-of-the-art DNNs are naturally coupled and of sequential layer structures, which indicates that there is significant layer dependency between different adjacent layers. In light of this convention, [361, 362] investigate the dependency between the current layer and the subsequent layer in order to measure the channel importance in the current layer. In parallel, [295, 363] demonstrate that the activation maps of previous layers can also reflect the channel importance in the subsequent layers.
(3) Last Layer. The channel importance can also be evaluated using the activation maps of the last layer, which correspond to the network output. The rationale behind this is that we are allowed to use the network output to interpret the accuracy of the pruned network. For example, we can simply determine the channel importance based on the reconstruction error [244] and the discrimination of the entire network [384].
Statistics-Based Pruning. Statistics-based pruning refers to those methods that exploit batch normalization statistics [385] to interpret the channel importance, which has since gained increasing popularity thanks to its conceptual simplicity and surprisingly strong pruning performance. As shown in previous representative networks [2, 3, 32, 34, 35, 123], batch normalization is a widely used plug-and-play technique to accelerate and stabilize the network training process towards better training convergence, while also reducing internal covariate shift to benefit the network accuracy. Specifically, batch normalization $BN(\cdot)$ transforms the input $x \in \mathbb{R}^{B \times C \times W \times H}$ as follows:
$$BN(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\delta^2 + \epsilon}} + \beta \qquad (13)$$
where $\mu$ and $\delta$ are the mean and standard deviation of the input $x$, respectively, and $\epsilon$ is a small constant (e.g., $1 \times 10^{-9}$) to avoid zero-division. Besides, $\gamma \in \mathbb{R}^{C}$ and $\beta \in \mathbb{R}^{C}$ are learnable parameters to scale and shift $\frac{x - \mu}{\sqrt{\delta^2 + \epsilon}}$, which are optimized during the training process to recover the input $x$. Note that $\gamma$ and $\beta$ have the same dimension as the number of input channels. As seen in [2, 3, 32, 34, 35, 123], it is common practice to insert one batch normalization layer after one convolutional layer. In light of this, [364] pioneers to leverage the batch normalization statistics $\gamma$ to enable and disable different input channels, among which those disabled channels are pruned at the end of the training process. To this end, [364] applies $L_1$-norm regularization on $\gamma$ to
achieve sparsity. Similar to [364], Gate Decorator [365] introduces gated batch normalization, which leverages the batch normalization statistics $\gamma$ as channel gates to enable and disable different input channels from the previous convolutional layer. To achieve sparsity, Gate Decorator also exploits $L_1$-norm regularization to penalize $\gamma$ during the training process. In addition, Gate Decorator introduces an iterative pruning scheme, which progressively prunes redundant channels during the training process and fine-tunes the resulting pruned network to recover the accuracy. However, as demonstrated in [366], $L_1$-norm regularization suffers from inferior discrimination between different channels since $L_1$-norm regularization pushes all the scaling factors $\gamma$ towards zero. To tackle this, different from [364, 365] that regularize $\gamma$ with the $L_1$-norm penalty, [366] instead polarizes $\gamma$ to enforce a large margin between zero and non-zero $\gamma$. Furthermore, [367] challenges [364–366] and observes that a smaller non-zero $\gamma$ does not imply that the corresponding channel is less important. Based on this observation, [367] introduces a simple yet effective iterative pruning approach, which (1) prunes the less important channels with exact zero $\gamma$, (2) rescales the magnitude of $\gamma$, and (3) fine-tunes the resulting pruned network to recover the accuracy. In addition to the scaling factors $\gamma$, [368] demonstrates that the shifting factors $\beta$ can also be leveraged to interpret the channel importance, and that jointly considering $\gamma$ and $\beta$ has the potential to achieve more reliable channel pruning. The aforementioned statistics-based channel pruning works [364–368] explicitly demonstrate that batch normalization statistics (i.e., $\gamma$ and $\beta$), when properly engineered, can reflect the importance of input channels from the previous convolutional layer.
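The following hedged sketch illustrates the overall recipe of [364]-style statistics-based pruning in PyTorch: an L1 penalty on the batch normalization scaling factors during training, followed by thresholding |γ| to decide which channels to keep; the penalty strength and threshold values are illustrative assumptions.

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """Sparsity-inducing L1 regularization on all BN scaling factors (gamma)."""
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()
    return lam * penalty

def bn_channel_mask(bn: nn.BatchNorm2d, threshold: float = 1e-2) -> torch.Tensor:
    """Channels whose |gamma| falls below the threshold are treated as prunable."""
    return bn.weight.detach().abs() > threshold

# During training (sketch): loss = task_loss + bn_l1_penalty(model)
# After training: keep = bn_channel_mask(layer_bn); remove the channels where keep is False.
```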
Fig. 20. Overview of AMC [369], which formulates channel pruning as reinforcement learning-based search and automatically searches for the less important channels to be pruned (figure from [369]).
Search-Based Pruning. Inspired by the tremendous success of neural architecture search (NAS) as discussed in Section 3, a plethora of search-based pruning works have recently emerged, which typically leverage search-based techniques, including reinforcement learning-based [369–371], evolutionary algorithm-based [372–374], and gradient-based search [375–377], to automatically search for the optimal pruning policy instead of using manually designed pruning heuristics. The rationale behind this is that different channels can be alternatively viewed as a list of possible operator candidates, making it possible to generalize the novel findings and advances in the field of NAS to address research challenges in the field of pruning. Specifically, previous search-based channel pruning works can be divided into the following three categories:
(1) Reinforcement Learning-Based Search. Reinforcement learning is a well-established technique for solving search problems as discussed in Section 3.2. Several pruning works [369–371] have pioneered to exploit reinforcement learning to search for the optimal channel pruning policy. For example, AMC [369] proposes to train an efficient deep deterministic policy gradient (DDPG) [386] agent such that the well-trained DDPG agent can output the optimal layer-wise channel pruning policy to maximize the pre-defined reward function as shown in Fig. 20. In addition, AGMC [370] demonstrates that AMC may yield sub-optimal pruning policies due to the fixed number of environment states. To tackle this, AGMC instead leverages graph convolutional networks (GCNs) to encode the pruned network and exploits the graph-based encoder-decoder to automatically learn the optimal environment state. Furthermore, DECORE [371] turns to multi-agent search, which assigns each agent to one specific network layer to learn better pruning policies.
(2) Evolutionary Algorithm-Based Search. The evolutionary algorithm is another well-established search technique, thanks to its conceptual simplicity, flexibility, and surprisingly strong performance. There have been several pruning works [372–374] that employ evolutionary algorithms to automatically search for the optimal channel pruning policy. For example, MetaPruning [372] introduces an efficient two-stage pruning pipeline. In the first stage, MetaPruning trains an over-parameterized PruningNet that consists of all the possible pruned network configurations. Note that PruningNet here is technically the same as the supernet in the field of NAS as discussed in Section 3. Next, in the second stage, MetaPruning leverages the well-trained PruningNet to quickly evaluate the accuracy of different pruned networks with inherited weights from the well-trained PruningNet [160]. In the meantime, an evolutionary engine is integrated to explore the optimal pruned network.
(3) Gradient-Based Search. Unlike the aforementioned reinforcement learning-based and evolutionary algorithm-based pruning works that explore the optimal pruning policy within the discrete space, gradient-based search instead allows the optimal pruning policy to be learned within the continuous space [375–377]. As a result, gradient-based search is able to maintain much better computational efficiency than its reinforcement learning-based and evolutionary algorithm-based counterparts. For example, DSA [377] proposes an efficient differentiable sparsity allocation approach to learn optimal layer-wise pruning ratios with gradient-based optimization, in which each pruning experiment only requires about 5 GPU-hours. To relax the discrete search space to become continuous, DSA introduces learnable pruning ratios, which are conceptually the same as the architecture parameters in differentiable NAS [138]. During the training process, the above learnable pruning ratios can be jointly optimized together with the network weights using standard gradient descent.
In general, search-based pruning is similar to NAS, in which search-based pruning searches for
pruned network structures and NAS searches for stand-alone network structures. This indicates
that we can generalize more advanced NAS algorithms to search for better pruned networks.
Layer-Based Pruning. Layer pruning is a special case of channel pruning, which aggressively removes all the channels in the same layer as shown in Fig. 17. In practice, under similar compression ratios, layer pruning can achieve better performance in terms of latency reduction than channel pruning as demonstrated in [379]. However, there is no free lunch, which indicates that layer pruning may suffer from greater accuracy loss than channel pruning. Note that layer pruning is conceptually and technically similar to channel pruning. Specifically, channel pruning aims to remove the less important channels, whereas layer pruning focuses on removing the less important layers as seen in previous representative layer pruning works [30, 378–382]. In light of this, the aforementioned channel pruning techniques can be easily generalized to prune redundant layers. For example, [379] introduces several importance criteria from the lens of channel pruning, such as weight magnitudes, activation maps, and batch normalization statistics, which are further combined to reliably determine the less important layers. Besides, [382] investigates the lottery ticket hypothesis [328] from the perspective of layer pruning, which confirms that there also exist winning tickets at initialization in terms of layer pruning. More importantly, the winning
tickets here are more environment-friendly with less carbon emission, while at the same time achieving better training efficiency and adversarial robustness [382]. In addition, several recent methods [387, 388] observe that the intermediate non-linear activation layers can also be grafted with negligible accuracy loss. Based on this observation, [387, 388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. Furthermore, several recent pruning methods [389–392] focus on multi-dimensional pruning, which strives to actively prune less important channels, layers, and input resolutions to aggressively trim down the model's complexity towards enhanced inference efficiency on target hardware. These multi-dimensional pruning methods can achieve much better accuracy-efficiency trade-offs than traditional channel-based and layer-based pruning methods. Similarly, HACScale [393] proposes an effective scaling paradigm to re-scale different channels and layers towards more efficient inference on target hardware.
Quantized Networks: Binarized Networks [24–26, 394–400]; Ternarized Networks [401–408]; INT8 Quantized Networks [409–414]; Mixed-Precision Networks [415–421].
Quantization Extensions: Quantization-Aware Training [414, 422–427]; Automated Mixed-Precision [428–436]; Quantization Accelerators [437–446].
Fig. 21. Illustration of different network quantization techniques that have been discussed in Section 4.2.
4.2 Network Quantization
Different from network pruning, which aims to reduce the network complexity at the structure level, network quantization instead focuses on representing the network weights and activations with fewer bits, which is able to significantly reduce the network complexity at the precision level. Therefore, the resulting quantized network maintains the same network structure (i.e., the same number of layers and channels), but with lower-bit network weights and activations. In practice, network quantization can be traced back to the 1990s, when early quantization works pioneered to quantize the network weights for Boltzmann machines [447], optical networks [448], and multi-layer perceptrons (MLPs) [449]. Note that quantization has the potential to significantly trim down the network size to accommodate the limited storage in real-world embedded scenarios. For example, Deep Compression [89] is able to reduce the network size of VGGNet by 49×, from 552 MB to 11.3 MB, while delivering comparable accuracy on ImageNet [80]. Thanks to its surprisingly strong performance in reducing the computational complexity and alleviating the storage requirements, renewed research interest in network quantization has emerged since the 2010s [409], which demonstrates that, compared with full-precision weights (i.e., 32 bits), 8-bit quantized weights can effectively accelerate the network inference on mainstream CPUs without significant accuracy degradation. To this end, we, in this section, discuss recent advances in the field of network quantization, including representative quantized networks and popular quantization-related extensions and implementations, which are also summarized in Fig. 21.
4.2.1 Quantized Networks. Below we introduce several representative quantized networks, including binarized networks, ternarized networks, INT8 networks, and mixed-precision networks.
Binarized Networks. Binarized networks are built upon only 1-bit weights, which are constrained to be either +1 or −1 during forward and backward propagations [24, 25]. This can effectively eliminate computation-intensive multiply-accumulate operations and makes it possible to replace multiply-accumulate operations with cheap additions and subtractions, and as a result, can lead to significant performance improvement in terms of latency and energy consumption as demonstrated in [24]. In the relevant literature, BinaryConnect [24] and BinaryNet [25] are the very early seminal binarized networks, which pioneer to investigate the efficacy of 1-bit weights in order to reduce the computational complexity and alleviate the storage bottleneck. Specifically, BinaryConnect [24] introduces the first binarized network, which explores both deterministic binarization:
$$w_b = \mathrm{sign}(w) = \begin{cases} +1, & \text{if } w \geq 0 \\ -1, & \text{otherwise} \end{cases} \qquad (14)$$
and stochastic binarization to stochastically binarize the network weights:
$$w_b = \begin{cases} +1, & \text{with probability } p = \sigma(w) \\ -1, & \text{with probability } 1 - \sigma(w) \end{cases} \qquad (15)$$
where $\sigma(\cdot)$ is the hard sigmoid function and can be mathematically formulated as follows:
$$\sigma(x) = \mathrm{clip}\left(\frac{x+1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x+1}{2}\right)\right) \qquad (16)$$
Note that BinaryConnect exploits the above hard sigmoid function rather than the soft version because it is far less computationally expensive and can still yield competitive results. As shown in BinaryConnect, the stochastic binarization is more advanced and can achieve much better quantization accuracy than the deterministic counterpart. So far, BinaryConnect only enables weight-level binarization, whereas the network inputs are still required to be full-precision. In light of this, BinaryNet [25] extends BinaryConnect to support both binarized weights and binarized inputs in order to maximize the inference efficiency of binarized networks. Furthermore, XNOR-Net [26] demonstrates that [24, 25] cannot be generalized to large-scale datasets like ImageNet. To address this, XNOR-Net introduces an effective approach to estimate binarized weights such that $w \approx \alpha \cdot w_b$, after which the estimated $\alpha$ can be detached from the binarized weight to rescale the input. To further enhance the binarization accuracy, a plethora of binarized networks [394–400] have been subsequently proposed. For example, [399, 400] propose a learnable activation binarizer [399] and adaptive binary sets [400] to explore more accurate binarized networks.
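For illustration, the deterministic and stochastic binarization of Eq. (14)-(16) can be sketched as follows; this is a minimal PyTorch rendering of the formulas rather than the official BinaryConnect implementation.

```python
import torch

def binarize_deterministic(w: torch.Tensor) -> torch.Tensor:
    """Deterministic binarization (Eq. 14): +1 if w >= 0, otherwise -1."""
    return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

def binarize_stochastic(w: torch.Tensor) -> torch.Tensor:
    """Stochastic binarization (Eq. 15) using the hard sigmoid of Eq. (16)."""
    p = torch.clamp((w + 1.0) / 2.0, 0.0, 1.0)   # hard sigmoid sigma(w)
    return torch.where(torch.rand_like(w) < p,
                       torch.ones_like(w), -torch.ones_like(w))
```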
Ternarized Networks. In addition to binarized networks, ternarized networks [401, 402] are another representative branch of quantized networks and have gained increasing popularity, thanks to their superior accuracy. Specifically, ternarized networks quantize the network weights from 32 bits to 2 bits, in which the 2-bit weights are constrained to −1, 0, and +1, in contrast to ±1 in binarized networks. As such, ternarized networks can achieve much better accuracy than binarized networks at the cost of slightly increased computational complexity. To achieve this, [401], as the first ternarized network, proposes to quantize the full-precision weights as follows:
$$w_t = \begin{cases} +1, & \text{if } w > \Delta \\ 0, & \text{if } |w| \leq \Delta \\ -1, & \text{if } w < -\Delta \end{cases} \qquad (17)$$
where $\Delta$ is a positive constant to control the ternarization threshold. To derive the optimal ternarization threshold $\Delta^*$, [401] turns back to XNOR-Net [26] and borrows the binarization estimating scheme from XNOR-Net, which introduces an adjustable scaling factor to minimize $\|w - \alpha \cdot w_t\|_2^2$. Finally, [401] demonstrates an empirical rule of thumb to derive $\Delta^*$ as follows:
$$\Delta^* = 0.7 \cdot E(|w|) \approx \frac{0.7}{n} \sum_{i=1}^{n} |w_i| \qquad (18)$$
where $n$ is the number of elements within $w$. To boost the ternarization accuracy, [402] further introduces two trained scaling coefficients $w_l^p$ and $w_l^n$ for the $l$-th layer, which are then trained using gradient descent during backward propagation. Once the training process terminates, [402] deploys the ternarized networks on target hardware, including the trained ternarized weights and the corresponding scaling coefficients, reducing the network size by at least 16×. Subsequently, several ternarized networks [403–408] have been proposed to further improve the ternarization accuracy. Among them, [408] demonstrates that the hard ternarization threshold $\Delta$, despite being simple and effective, often leads to sub-optimal results. To avoid this, [408] introduces the paradigm of a soft ternarization threshold, which instead enables the network to automatically determine the optimal ternarization intervals to maximize its accuracy.
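The hard-threshold ternarization of Eq. (17)-(18) can be sketched as follows; the scaling factor α follows the XNOR-Net-style least-squares estimate mentioned above, and the sketch is illustrative rather than the exact implementation of [401].

```python
import torch

def ternarize(w: torch.Tensor):
    """Ternarization following Eq. (17)-(18): weights become {-1, 0, +1},
    with the threshold set to 0.7 * mean(|w|)."""
    delta = 0.7 * w.abs().mean()                 # Eq. (18)
    w_t = torch.zeros_like(w)
    w_t[w > delta] = 1.0
    w_t[w < -delta] = -1.0
    # Scaling factor alpha minimizing ||w - alpha * w_t||_2^2 over non-zero entries.
    nonzero = w_t != 0
    alpha = w[nonzero].abs().mean() if nonzero.any() else w.new_tensor(0.0)
    return w_t, alpha
```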
Fig. 23. TensorRT’s INT8 calibration [410].
INT8 Quantization. Binarized and ternarized networks have the potential to achieve 16×–32× speedups, which, however, suffer from non-negligible accuracy loss, and even worse, require considerable engineering efforts to design specialized hardware for further deployment. The rationale behind this is that mainstream hardware does not support low-bit quantized networks. To overcome such limitations, an effective alternative is INT8 quantization, which trims down the network weights from 32 bits to 8 bits within the range of [−128, 127]. As such, INT8 quantization can lead to about 4× compression in terms of the network size, and more importantly, at the cost of negligible accuracy loss [409]. Besides, thanks to its well-suited software support (e.g., Google's TensorFlow Lite and Nvidia's TensorRT), we can easily deploy INT8 quantized networks on mainstream hardware, such as mobile devices, CPUs, and edge GPUs, with minimal engineering efforts [68, 69]. For example, as shown in [410], TensorRT allows post-training INT8 quantization, which leverages simple weight calibration to convert pre-trained full-precision weights into 8-bit weights (see Fig. 23) with only trivial accuracy loss. Furthermore, several follow-up works [411–414] have been recently proposed to investigate INT8 quantization and improve INT8 quantization accuracy. Among them, [414], as the very first INT8 quantization work, pioneers to quantize both weights and activations with 8-bit integers to boost the inference efficiency. Besides, [411] evaluates the performance of various INT8 quantized networks on mobile GPUs, based on which [411] introduces a unified INT8 quantization framework that integrates various off-the-shelf INT8 quantization techniques, such as symmetric, asymmetric, per-layer, and per-channel INT8 quantization. Furthermore, [412] investigates the efficiency bottleneck of INT8 quantization and introduces hardware-friendly search space design to enable efficient INT8 quantization. More recently, [450, 451] explore INT8 quantization to compress redundant CNNs for efficient in-memory computing infrastructures. In addition to quantizing CNNs, [413] turns back to transformers and leverages INT8 quantization to quantize computation-intensive transformers in order to boost the inference efficiency for general NLP tasks.
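As a minimal illustration (not TensorRT's actual calibration procedure), the following sketch performs symmetric per-tensor INT8 quantization and dequantization; deriving the scale from the maximum absolute value is an assumption, whereas calibration-based approaches derive scales from representative data.

```python
import torch

def int8_quantize(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: map [-max|x|, max|x|] to [-127, 127]."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def int8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(128, 128)
q, s = int8_quantize(w)
w_hat = int8_dequantize(q, s)   # w_hat approximates w up to the quantization error
```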
Mixed-Precision Networks. Mixed-precision quantization is another well-established branch of network quantization. As shown in [415], mixed-precision quantization allows more fine-grained quantization schemes across different weights and activations, and as a result, can usually achieve better accuracy-efficiency trade-offs than conventional fixed-precision quantization, such as binarized (1-bit), ternarized (2-bit), and INT8 (8-bit) quantization. For example, TBN [415], as the very first mixed-precision network, proposes to combine layer-wise ternarized inputs and binarized weights, which delivers surprisingly better accuracy-efficiency trade-offs than stand-alone binarized networks and ternarized networks. The success of TBN has motivated several subsequent mixed-precision quantization works [416, 417] to continue improving the quantization accuracy. For example, SYQ [416] proposes to quantize the network weights with 1/2 bits and the intermediate activations with 8 bits, whereas PACT [417] allows 2-bit activations and 2/3/4/5-bit weights. These early mixed-precision quantization works have demonstrated promising performance. Later, we will introduce automated mixed-precision quantization, which exploits automated techniques to search for the optimal bit allocation and can achieve more fine-grained quantization. Furthermore, [418–421] also consider leveraging mixed-precision quantization to improve the training efficiency of full-precision networks, which can significantly accelerate the training process, and more importantly, achieve comparable accuracy to full-precision training.
4.2.2 Quantization Extensions and Implementations. Below we introduce several quantization extensions and implementations, including quantization-aware training, automated mixed-precision quantization, and quantization-aware hardware accelerators.
Fig. 22. Comparisons between post-training quantization and quantization-aware training. Different from post-training quantization, quantization-aware training integrates the quantization loss into the training loss, which allows the optimizer to minimize the quantization loss so as to improve the quantization accuracy.
Quantization-Aware Training. Quantization-aware training refers to the technique that trains quantized networks, which is fundamentally different from post-training quantization as shown in Fig. 22. Note that post-training quantization can achieve satisfactory performance on early networks like AlexNet [76] and VGGNet [1], but suffers from significant accuracy loss when applied to more advanced lightweight networks like MobileNets [32, 85] and ShuffleNets [34, 35]. In general, quantization-aware training incorporates the quantization loss into the training loss, which then allows the optimizer to minimize the quantization loss during the training process in order to unlock better quantization accuracy than post-training quantization. In practice, [414], as the seminal quantization-aware training work, proposes to quantize both weights and activations with 8-bit integers. To maximize the accuracy of INT8 quantized networks, [414] also introduces an effective tailored quantization-aware training approach to train the resulting INT8 quantized networks. Similar to [414], [422, 423] unify and improve quantization-aware training of INT8 quantized networks to minimize the accuracy degradation. To generalize quantization-aware training to other types of quantized networks (e.g., 1-bit and 2-bit networks), a plethora of quantization-aware training works [424–427] have been subsequently proposed, which further push forward the attainable quantization accuracy.
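The core mechanism of quantization-aware training can be sketched with a fake-quantization operator that quantizes in the forward pass and passes gradients straight through in the backward pass; this minimal sketch assumes symmetric per-tensor INT8 fake quantization and is not the exact scheme of [414].

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Fake INT8 quantization: quantize-dequantize in the forward pass,
    pass gradients straight through in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        scale = x.abs().max().clamp(min=1e-8) / 127.0
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output   # straight-through estimator

# In a quantization-aware training loop (sketch), weights are fake-quantized
# before each forward pass so the optimizer directly sees the quantization error:
#   w_q = FakeQuantSTE.apply(layer.weight)
```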
Data-Dependent KD: Logits-Based KD [11, 27, 452–460]; Intermediate Layers KD [28, 461–466]; Multi-Teacher KD [467–474]; Teacher-Free KD [475–481]; Privileged KD [482–487].
Data-Efficient KD: GANs-Based KD [488–494]; Few-Sample KD [33, 495–500].
Fig. 24. Illustration of data-dependent and data-efficient knowledge distillation (KD) works in Section 4.3.
Automated Mixed-Precision Quantization. Early quantization works typically quantize all the weights and activations with the same level of precision, such as 1 bit for binarized networks and 2 bits for ternarized networks. Despite the promising performance, early uniform quantization practices suffer from sub-optimal accuracy-efficiency trade-offs. For example, as shown in TBN [415], mixed-precision quantization that combines layer-wise ternarized inputs and binarized weights can achieve much better accuracy-efficiency trade-offs than stand-alone binarized networks and ternarized networks. However, determining the optimal mixed-precision quantization strategy is difficult due to the large number of possible quantization combinations across different layers. To overcome such limitations, recent research has shifted to automated mixed-precision quantization [428–436], thanks to the tremendous success of neural architecture search (NAS) as discussed in Section 3. Among them, [428], as the first automated mixed-precision quantization work, follows early differentiable NAS practices [39, 138] to search for the optimal layer-wise precision assignment. Furthermore, HAQ [429] leverages reinforcement learning-based search to explore the huge quantization design space with hardware feedback in the loop, which focuses on finding the optimal layer-wise precision assignment to maximize both quantization accuracy and hardware efficiency. Note that we can easily generalize recent advances in the field of NAS to further improve automated mixed-precision quantization.
Quantization-Aware Accelerators. Different from INT8 quantized neural networks, binarized, ternarized, and mixed-precision quantized neural networks are not supported by mainstream hardware, such as mobile devices, CPUs, and edge GPUs. This further demands the design of quantization-aware accelerators to efficiently execute low-bit quantized networks at run time. To this end, a plethora of representative quantization-aware accelerators have been recently proposed [437–446], including binarized network-based accelerators [437–439], ternarized network-based accelerators [440–442], and mixed-precision network-based accelerators [443–446]. These quantization-aware accelerators have demonstrated significant efficiency improvements in terms of latency, memory, area, and energy consumption in various real-world embedded scenarios.
4.3 Network Distillation
Network distillation, also referred to as knowledge distillation7, is another well-established paradigm to further push forward the accuracy-efficiency trade-off, which was initially proposed by [501] and subsequently generalized by [11, 28]. Note that knowledge distillation is a plug-and-play training technique, which has been applied to various tasks to achieve better training performance, such as object detection [502] and language understanding [96]. Different from network pruning and network quantization, which focus on improving the network efficiency without sacrificing the network accuracy as discussed in Section 4.1 and Section 4.2, network distillation instead boosts the accuracy-efficiency trade-off from the accuracy perspective, which aims to improve the network accuracy without changing the network structure. In other words, unlike network pruning and network quantization that lead to simplified network structures, network distillation results in the same network, but the resulting network can typically achieve higher accuracy. Specifically, as shown in [11, 28, 501], knowledge distillation refers to the training process that leverages a larger pre-trained teacher network to benefit the training process of a smaller student network (see Fig. 25), which transfers the rich and discriminative knowledge from the larger pre-trained teacher network to the smaller student network to further achieve better accuracy on the target task than simply training the student network alone. Below we first present the preliminaries of knowledge distillation and then introduce recent representative data-dependent and data-efficient knowledge distillation works. These knowledge distillation works can also be found in Fig. 24.
7 We interchangeably use network distillation and knowledge distillation to refer to the distillation-based training process.
Fig. 25. Illustration of the teacher-student knowledge transfer
process in seminal knowledge distillation techniques [11, 28].
Knowledge Distillation Basics. In order to better understand knowledge distillation, we first elaborate on its preliminaries, which are mainly based on the most representative knowledge distillation work [11]. As shown in previous state-of-the-art networks [32, 34, 35, 85], the network outputs, also referred to as the network logits, are typically fed into the softmax function to calculate the probability distribution over different categories for further prediction purposes. However, given a pre-trained teacher network, the output logits after the softmax function are discriminative but less informative, being close to either 1 or 0 (e.g., [0.02, 0.95, 0.01, 0.01, 0.01]). This makes it difficult to directly transfer the discriminative knowledge from the pre-trained teacher network to the student network. The rationale behind this is that the student network has a smaller network size than the teacher network and is thus less capable of learning such discriminative knowledge [11]. To mitigate this issue, [11] leverages the distillation temperature $T$ to soften the knowledge from the pre-trained teacher network and facilitate the teacher-student knowledge transfer process, which can be mathematically formulated as follows:
$$z_i = \frac{\exp(y_i / T)}{\sum_{j=1}^{n} \exp(y_j / T)} \quad s.t., \; i = 1, ..., n \quad (19)$$
where $\{y_i\}_{i=1}^{n}$ denotes the output logits without softmax and $T$ is the temperature that softens the output logits $\{y_i\}_{i=1}^{n}$ into $\{z_i\}_{i=1}^{n}$. Note that Eq (19) is equivalent to the standard softmax function when $T = 1$. As shown in [11], a larger $T$ produces a softer probability distribution over different categories (e.g., [0.1, 0.6, 0.1, 0.1, 0.1] when $T = 5$ and [0.2, 0.2, 0.2, 0.2, 0.2] when $T = +\infty$). Besides, fixing the temperature $T$ to 2 can empirically yield the best performance. Furthermore, [11] exploits the softened knowledge from the pre-trained teacher network to guide the training process of the
student network, which can be mathematically formulated as follows:
$$\underset{w}{\text{minimize}} \;\; \mathcal{L}_{train}(x, w) = \mathcal{L}(y, y^*) + \alpha \cdot T^2 \cdot \mathcal{L}(y, z) \quad s.t., \; y = f_w(x) \quad (20)$$
where $x$ is the input data, $y^*$ is the ground-truth label, $z$ is the softened knowledge from the pre-trained teacher network, $\alpha$ is the constant that controls the teacher-student distillation magnitude, and $\mathcal{L}(\cdot)$ is the standard cross-entropy loss function. Apart from these, $f_w(\cdot)$ parameterizes the student network with the weights $w$. As demonstrated in [11], it is important to multiply the teacher-student distillation loss term (i.e., $\mathcal{L}(y, z)$) by $T^2$ because the gradients produced by this softened term scale as $1/T^2$ during the training process, and the $T^2$ factor keeps its contribution balanced against the hard-label term.
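To make the above formulation concrete, the snippet below gives a minimal PyTorch-style sketch of the distillation objective in Eq. (19) and Eq. (20); the function name and default hyperparameters are illustrative, and the soft-label term is implemented with a KL divergence, which differs from the cross entropy in Eq. (20) only by a constant that does not affect the gradients.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard-label term L(y, y*): standard cross entropy with the ground truth.
    hard = F.cross_entropy(student_logits, labels)
    # Temperature-softened teacher/student distributions, i.e., Eq. (19).
    soft_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # Soft-label term L(y, z); the T^2 factor in Eq. (20) compensates for the
    # 1/T^2 scaling of the gradients produced by the softened targets.
    soft = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    return hard + alpha * (T ** 2) * soft
```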
With the above in mind, below we further introduce recent representative knowledge distillation practices, which are built upon [11] and can be roughly divided into two categories: data-dependent and data-efficient knowledge distillation.
4.3.1 Data-Dependent Knowledge Distillation. In this section, we introduce several representative
data-dependent knowledge distillation techniques, including logits-based knowledge distillation,
intermediate layers-based knowledge distillation, multi-teacher knowledge distillation, teacher-free
knowledge distillation, and privileged knowledge distillation.
Knowledge from Logits. Knowledge distillation from logits is one representative branch of knowledge distillation and has been widely applied to improve network accuracy, thanks to its conceptual simplicity and surprisingly strong performance. Specifically, [11, 27] pioneer leveraging the logits-based knowledge from the pre-trained teacher network to facilitate the training process of the less-capable student network. As seen in [11], the logits-based knowledge from the pre-trained teacher network, also referred to as soft labels, corresponds to the output of the teacher network after being fed into the softmax function to calculate the probability distribution over different categories, as shown in Eq (19). Subsequently, [452–454] investigate the efficacy of knowledge distillation and demonstrate that early knowledge distillation practices [11, 27] only yield sub-optimal results since $\alpha$ and $T$ are fixed for different teacher-student networks, as shown in Eq (20). Besides, [455–457] demonstrate that the accuracy of the student network may significantly degrade when there is a large gap between the teacher and the student. To overcome such limitations, [456] instead introduces an intermediate-sized network (i.e., a teacher assistant) to facilitate the knowledge transfer from the pre-trained teacher network to the student network, thus effectively bridging the gap between the powerful teacher and the less-capable student. In addition to soft labels, [458–460] demonstrate that noisy labels are also helpful for knowledge distillation and can be leveraged to further improve the accuracy of the student network.
Knowledge from Intermediate Layers. Apart from knowledge distillation from logits, knowledge distillation from intermediate layers is another representative branch, which provides more fine-grained knowledge to better guide the training process of the smaller student network. The rationale behind this is that intermediate features are also discriminative and can be combined with the final network output to further enhance the feature expressiveness, as seen in [503]. Specifically, [28] pioneers the investigation of knowledge distillation from intermediate layers and introduces hint learning to improve the training process of the student network, in which hints correspond to the intermediate features of the teacher network. Compared with logits-based knowledge, knowledge from intermediate layers is often richer and more fine-grained, as shown in [28]. Furthermore, a plethora of subsequent knowledge distillation works [461–466] have been proposed to enhance the knowledge transferred from the pre-trained teacher network to the student network, which continue to explore rich intermediate features to facilitate the training process of the student network. For example, [466] delves into more fine-grained channel-level knowledge distillation, leading to more fine-grained and discriminative knowledge.
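As a concrete illustration of intermediate-layer (hint-based) distillation, the sketch below matches a student feature map to a teacher feature map through a learnable 1×1 regressor that aligns the channel dimensions; it assumes both feature maps share the same spatial resolution, and the class name is ours rather than from any specific cited work.

```python
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """Match student intermediate features to (frozen) teacher features."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # Learnable regressor that maps student channels to teacher channels.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # The teacher features are detached: only the student (and the regressor)
        # receive gradients from this loss term.
        return F.mse_loss(self.regressor(student_feat), teacher_feat.detach())
```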
Multi-Teacher Knowledge Distillation. The standard knowledge distillation paradigm exploits the pre-trained knowledge from one single teacher network to guide the training process of the less-capable student network [11, 28]. Furthermore, [467] demonstrates that the student network may learn richer and more discriminative knowledge from multiple teacher networks, which pushes the student network to achieve better accuracy since multiple teacher networks can provide more informative and instructive knowledge than a single teacher network. To this end, [467] proposes to average the network weights of multiple teacher networks (i.e., mean teachers) to better guide the training process of the student network. Similar to [467], several follow-up knowledge distillation works [468–471] propose to average the output logits of multiple pre-trained teacher networks and then exploit the averaged knowledge to enhance the training process of the student network. In addition, [472–474] demonstrate that directly averaging the output logits of multiple teacher networks ignores teacher diversity, since different teacher networks may have different capabilities. To avoid this, [472–474] propose to actively enable and disable different teacher networks through gates during the training process, so as to better guide the student network to learn more discriminative knowledge from different teacher networks.
Teacher-Free Knowledge Distillation. Despite the promising accuracy improvement, previous knowledge distillation works [11, 28] rely heavily on off-the-shelf pre-trained teacher networks, which necessitate considerable computational resources to train. In addition, to maximize the accuracy improvement, it is also of utmost importance to design proper teacher networks, leading to additional engineering effort. To overcome such limitations, several knowledge distillation works [475–481] have been recently proposed to exclude teacher networks and instead exploit the knowledge from the student network itself to guide its training in a teacher-free manner. For example, [475] introduces deep mutual learning, which demonstrates that a pre-trained teacher network is not necessary in the context of knowledge distillation. Instead, [475] shows that an ensemble of student networks can collaboratively learn from each other throughout the training process and, more importantly, can achieve surprisingly better training accuracy than standard knowledge distillation practices [11, 28]. This explicitly indicates that the knowledge from the student network itself can also be leveraged to improve its own training towards better accuracy.
Privileged Knowledge Distillation. Privileged knowledge distillation is a special case of knowledge distillation, where the student network has access to additional information or features that are not available to the teacher network during the training process [482]. In contrast to the standard knowledge distillation paradigm [11], privileged knowledge distillation allows the student network to learn from both the teacher network and the additional information that is only available to the student network, which has the potential to further improve the attainable accuracy, as demonstrated in [482]. The rationale behind privileged knowledge distillation is that the student network can leverage the additional information to improve its ability to mimic the behavior of the pre-trained teacher network. Furthermore, inspired by [482], several privileged knowledge distillation works [483–487] have been recently proposed to improve the performance of the student network in various tasks. For example, [484] explores progressive privileged knowledge distillation for better online action detection. In addition, [487] introduces privileged feature distillation to improve product recommendations at Taobao. These privileged knowledge distillation works clearly demonstrate that the student network may benefit from additional information and knowledge to achieve better training accuracy on the target task.
4.3.2 Data-Efficient Knowledge Distillation. In this section, we introduce GANs-based knowledge distillation and few-sample knowledge distillation, which are highly data-efficient and can perform teacher-student distillation using only a small amount of training data.
GANs-based Knowledge Distillation. Despite the promising accuracy improvement, knowledge distillation is often data-driven and relies heavily on sufficient training data to transfer the rich pre-trained knowledge from the teacher network to the student network, thus inevitably leading to significant engineering effort for data preparation, such as data collection, cleaning, and labeling. As seen in the relevant literature, generative adversarial networks (GANs) have been applied to a wide range of tasks and are considered one of the most effective approaches for generating high-quality synthetic data [504]. In light of this, a plethora of GANs-based knowledge distillation works [488–494] have been recently proposed to leverage GANs to generate sufficient training data and then use the generated data to train the student network. These GANs-based knowledge distillation works have demonstrated significant data efficiency, since a well-optimized generator can produce a large amount of high-quality synthetic data, while at the same time achieving promising accuracy improvement on the target task.
Few-Sample Knowledge Distillation. In addition to GANs-based knowledge distillation, another promising direction is to perform efficient knowledge distillation that transfers the rich knowledge from the pre-trained teacher network to the student network with only a small amount of training data or only a few data samples, which can also bring significant data efficiency. To achieve this, several few-sample knowledge distillation works [33, 495–500] have been recently proposed. Among them, [497] proposes a simple yet effective solution for knowledge distillation using label-free few samples to realize both data efficiency and training efficiency. Specifically, [497] first inserts a 1×1 convolutional layer at the end of each building block of the student network and then optimizes the inserted 1×1 convolutional layers to minimize the knowledge distillation loss, which can quickly converge using only a few data samples. More recently, [500] introduces an effective mimicking-then-replacing knowledge distillation technique to quickly train the student network with only a few data samples, which maintains significant data efficiency while still achieving superior training accuracy.
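The block-level adapter idea described above can be sketched as follows: a 1×1 convolution, initialized as identity, is appended to each student block, the original student weights are frozen, and only the adapters are optimized against a distillation loss on a handful of samples. The helper name, the identity initialization, and the assumption that each block preserves its channel count are ours.

```python
import torch.nn as nn

def insert_adapters_and_freeze(student_blocks, channels_per_block):
    """Append an identity-initialized 1x1 conv adapter to each frozen block."""
    adapted_blocks = []
    for block, channels in zip(student_blocks, channels_per_block):
        adapter = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        nn.init.dirac_(adapter.weight)        # start as the identity mapping
        for param in block.parameters():      # freeze the original student block
            param.requires_grad = False
        adapted_blocks.append(nn.Sequential(block, adapter))
    return nn.Sequential(*adapted_blocks)

# Only the adapter parameters are handed to the optimizer; with a frozen teacher
# providing targets, this small parameter set converges with only a few samples.
```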
4.4 Future Envision
In this section, we further envision several promising future trends and possible directions in the field of network compression, which are summarized as follows:
(1) Automated Teacher-Student Search. Knowledge distillation transfers the rich knowledge from the pre-trained teacher network to the student network to facilitate the training process of the student network, which has achieved promising accuracy improvement [11, 28]. In the past, researchers have empirically used larger networks as teacher networks and smaller networks as student networks. However, such empirical practices may lead to sub-optimal accuracy and cannot always guarantee accuracy improvement. The rationale behind this is that different student networks may prefer quite different teacher networks, as shown in [505, 506]. This further motivates designing the optimal teacher-student network pair to maximize the attainable accuracy of the student network. To achieve this, one promising alternative is to leverage recent advances in neural architecture search (NAS) to automatically search for the optimal teacher-student network pair.
(2) Joint Network Compression. To embrace better accuracy-efficiency trade-offs, an intuitive and straightforward approach is sequential network compression, which applies multiple network compression techniques to progressively reduce the network complexity. For example, [507] introduces a simple yet effective sequential compression pipeline, which starts by searching for an efficient network with ProxylessNAS [40] and then applies automated channel pruning [369] and mixed-precision quantization [429] to further trim down the network size. However, such a sequential compression pipeline has critical drawbacks and, as a result, leads to sub-optimal results, because the searched network is not necessarily optimal for the subsequent pruning and quantization stages. To address this, one promising future direction is joint network compression, which jointly optimizes the network structure, pruning, and quantization to yield the best accuracy-efficiency trade-off.
(3) Federated Network Compression. Federated learning is an emerging decentralized learning approach that allows multiple hardware devices to collaboratively learn the same network without sharing their raw data [508]. Specifically, federated learning allows the network to be trained locally on each hardware device using its own data, in which only the network updates, rather than the raw data, are sent back to the central server for further aggregation. In light of this, one promising future direction is federated network compression, including federated pruning, federated quantization, and federated distillation, which can significantly enhance data privacy and protect data security while still achieving competitive performance in terms of accuracy and training efficiency.
(4) Domain-Specific Network Compression. In addition to image classification, there is also a wide range of popular downstream applications, such as object detection, tracking, and semantic segmentation, where the involved networks are still quite computation-intensive. This makes it difficult to accommodate the limited computational resources available in real-world embedded scenarios. To tackle this issue, some early practices have attempted to leverage general network compression techniques to compress domain-specific networks. For example, [502, 509, 510] propose to leverage pruning [510], quantization [509], and knowledge distillation [502] to improve the accuracy-efficiency trade-off in real-world object detection scenarios; these approaches, however, remain under-explored and cannot simply generalize to other scenarios. Therefore, one promising future direction is domain-specific network compression, which exploits domain-specific knowledge to largely boost the network compression performance towards better accuracy-efficiency trade-offs.
(5) Mixed-Precision Training. Mixed-precision training refers to training the network with both full-precision and low-bit weights, which has the potential to significantly improve training efficiency without sacrificing accuracy. For example, PyTorch [67] provides Automatic Mixed Precision (AMP), which combines 32-bit full-precision weights with 16-bit half-precision computation during the training process. As a result, AMP achieves the same level of accuracy as stand-alone 32-bit full-precision training, while delivering about 2× training speedups for convolutional networks [511], as illustrated in the sketch below. In light of this, one promising future direction is to leverage low-bit mixed-precision training techniques to train full-precision networks, which may aggressively push forward training efficiency without degrading accuracy.
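The following sketch shows the basic AMP training loop on a toy model (a CUDA device is assumed); the model, data, and hyperparameters are placeholders, and the autocast/GradScaler pattern follows standard PyTorch AMP usage.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()                 # toy placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()              # guards against fp16 underflow

for _ in range(100):
    inputs = torch.randn(32, 128, device="cuda")  # placeholder batch
    targets = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # fp16 where safe, fp32 elsewhere
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()                 # backward on the scaled loss
    scaler.step(optimizer)                        # unscale gradients, then update
    scaler.update()                               # adapt the loss scale
```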
5 EFFICIENT ON-DEVICE LEARNING FOR EMBEDDED COMPUTING SYSTEMS
On-device learning consists of two branches: on-device inference and on-device training. Specifically, on-device inference refers to the process of deploying efficient pre-trained networks on local hardware devices, which allows these devices to run various intelligent inference tasks, such as image classification and object detection. There have been several representative techniques [513, 514] to enable efficient on-device inference, which focus on either designing computation-efficient networks with less redundancy or compressing computation-intensive networks to reduce the computational complexity, in order to accommodate the limited on-device computational resources. Note that this paper has already discussed popular techniques for efficient on-device inference, such as efficient manual/automated network design and efficient network compression; readers may refer to Section 2, Section 3, and Section 4 for more details.
General On-Device Learning: On-Device Inference [32, 85, 162, 512–515]; On-Device Training [45, 516–523]
Advanced On-Device Learning: On-Device Continual Learning [46, 524–534]; On-Device Transfer Learning [44, 535–540]; On-Device Federated Learning [47, 414, 541–545]
Fig. 26. Comparisons of efficient on-device learning techniques that have been discussed in Section 5.
On the other hand, on-device training refers to the capability of local hardware to perform training tasks directly on the hardware itself without the need for remote servers [44]. Unlike on-device inference, where the deployed network always remains static, on-device training may further enhance the deployed network over time, allowing it to adapt to new data collected from local sensors so as to achieve better accuracy. Thanks to its strong capability of protecting data privacy and ensuring data security, on-device training has become increasingly popular over the past few years as a path towards secure embedded intelligence [546]. To this end, we, in this section, systematically discuss recent state-of-the-art on-device learning techniques (especially on-device training), including general on-device learning in Section 5.1, on-device continual learning in Section 5.2, on-device transfer learning in Section 5.3, and on-device federated learning in Section 5.4, since these techniques feature different learning algorithms to enhance on-device learning performance. For better understanding, we also summarize these on-device learning methods in Fig. 26. Note that these on-device learning techniques can typically generalize across different networks (e.g., convolutional networks and transformers). For example, we can leverage on-device federated learning to optimize both convolutional networks and transformers on multiple local hardware devices.
5.1 General On-Device Learning
In this section, we introduce recent state-of-the-art works on general on-device learning techniques, including efficient on-device inference and efficient on-device training.
Efficient On-Device Inference. To enable efficient on-device inference, one straightforward approach is to design tiny networks with less redundancy in order to accommodate the limited on-device computational resources. To this end, a plethora of representative tiny networks [512–515] have been recently proposed, including MicroNets [512], MCUNets [513, 514], and EtinyNet [515]. Among them, MCUNetV1 [513], as one of the early tiny networks, proposes to jointly design the lightweight tiny network using TinyNAS and the lightweight inference engine using TinyEngine, enabling ImageNet-scale inference on microcontrollers. Furthermore, MCUNetV2 [514] introduces an efficient patch-based inference pipeline to trim down on-device memory consumption, since memory consumption is the key bottleneck of on-device inference. However, different from training large networks, training tiny networks poses significant challenges, as demonstrated in [547]. The rationale here is that existing regularization techniques (e.g., data augmentation and dropout), despite benefiting the training process of large networks, may degrade the training performance of tiny networks [547]. To tackle this issue, [547] proposes to augment the tiny network itself rather than augmenting the input data, which shows promising accuracy improvement over the standard training scheme.
Efficient On-Device Training. The key difference between on-device training and inference is that on-device training requires saving all the intermediate activations, which are used to update parameters via gradient descent during backward propagation. In contrast, on-device inference only performs forward propagation and does not need to keep intermediate activations, which can be progressively released to reduce memory consumption. In light of this, on-device training suffers from non-negligible memory consumption, since the activation size grows with the training batch size and training typically involves a large batch size to accelerate the training process. As a result, intermediate activations become the major bottleneck of on-device training, as demonstrated in [44]. For example, under a batch size of 16, the activation size of ResNet50 [2] is 13.9× larger than its parameter size, as shown in Fig. 27. To alleviate the excessive memory consumption caused by intermediate activations, there have been several representative strategies, including gradient checkpointing, activation gradient pruning, and low-bit training:
(1) Gradient Checkpointing. Gradient checkpointing is a simple yet effective memory optimization technique, which seeks to reduce the training memory consumption at the cost of increased training time [516]. To this end, gradient checkpointing keeps only a minimal set of intermediate activations during forward propagation, which are then used to recompute the remaining intermediate activations during backward propagation. As shown in [516], gradient checkpointing has the potential to reduce the activation memory consumption from $O(n)$ to $O(\sqrt{n})$, where $n$ is the number of network layers. More importantly, gradient checkpointing does not degrade the training accuracy, since the training behavior remains the same as the standard training scheme. Furthermore, several subsequent works generalize gradient checkpointing to arbitrary computation graphs [517] and to training graph neural networks [518].
(2) Activation Gradient Pruning. Activation gradient pruning removes less important intermediate activation gradients to optimize the training memory consumption [519]. This relies on an empirical observation that most intermediate activation gradients during backward propagation are very close to zero and thus have minimal impact on gradient descent [519]. Therefore, pruning these very small activation gradients can effectively reduce the training memory consumption at the cost of minimal accuracy loss, which also accelerates the training process. Similar to [519], [520] proposes an efficient gradient filtering scheme, which filters similar activation gradients during backward propagation and only keeps those with unique elements, reducing the number of elements in the activation gradient maps. Apart from these, another popular approach is to build dynamic sparse computation graphs that eliminate intermediate activations in an input-dependent manner, which can also reduce the training memory consumption [521].
(3) Low-Bit Training. Low-bit training refers to training the given network with low-bit weights (e.g., 8 or 16 bits) rather than full-precision 32-bit weights, which has the potential to reduce the training memory consumption by up to 32× [82]. The rationale here is that low-bit training reduces the memory consumption of both network weights and intermediate activations. Specifically, [522], as an early exploration, proposes to train the given network with 16-bit weights under stochastic rounding, which leads to 2× less training memory consumption than the standard full-precision 32-bit training counterpart and, more importantly, maintains comparable training accuracy. Besides, [523] introduces an efficient INT8 training pipeline, consisting of loss-aware compensation and backward quantization, to enable tiny on-device training, thanks to the well-optimized INT8 support on mainstream hardware. Furthermore, [45] proposes to optimize real-quantized graphs, integrating an effective memory-efficient sparse update scheme and a tiny training engine to achieve on-device training under 256 KB of memory. It is worth noting that low-bit training is similar to network quantization, as discussed in Section 4.2, since both leverage quantized weights to trim down the network complexity; this also allows recent advanced quantization techniques to be generalized to benefit low-bit on-device training.
Fig. 27. Comparisons between on-device training and inference in terms of memory consumption. This reveals that the activation size, instead of the parameter size, is the major bottleneck of on-device training, motivating future research to reduce the activation size for efficient on-device training. (figure from [44])
We note that the aforementioned strategies can also be combined to further reduce the overall memory consumption during training and to accelerate the training process. For example, gradient checkpointing can be combined with low-bit training: checkpointing shrinks the activation footprint from $O(n)$ to $O(\sqrt{n})$ (where $n$ is the number of network layers), while low-bit storage further reduces the bytes per value by up to 32×.
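A minimal sketch of such a combination is shown below, pairing torch.utils.checkpoint (store a few activations, recompute the rest in the backward pass) with mixed-precision autocast; the toy network, segment count, and loss are illustrative, and a CUDA device is assumed.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy deep network split into checkpointed segments (illustrative sizes).
net = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(8)]).cuda()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

x = torch.randn(16, 256, device="cuda", requires_grad=True)
optimizer.zero_grad()
with torch.cuda.amp.autocast():             # low-bit (fp16) compute where safe
    out = checkpoint_sequential(net, 4, x)  # keep activations only at 4 segment boundaries
    loss = out.pow(2).mean()                # placeholder loss
loss.backward()                             # dropped activations are recomputed here
optimizer.step()
```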
5.2 On-Device Continual Learning
On-device continual learning, also known as on-device lifelong or incremental learning, is an advanced learning paradigm that allows the deployed network to continuously learn from newly collected data to further push forward the attainable accuracy [46, 533]. This is particularly favored in real-world embedded scenarios, especially those with rich local sensors, where embedded devices can continue to collect new data through local sensors over time [44, 45]. The newly collected data is then used to train the deployed network to unlock better performance over time. In other words, we are allowed to utilize the newly collected data to train or fine-tune the deployed network on the target hardware itself, which typically leads to better accuracy on the target task. In the meantime, on-device continual learning performs local training and does not need to send the newly collected data back to remote servers, which also protects data privacy and ensures data security. However, despite these significant benefits, on-device continual learning suffers from the catastrophic forgetting issue, i.e., the tendency to forget previously learned knowledge when adapting to newly collected data [46]. The rationale is that on-device continual learning must adjust the pre-trained network weights in order to adapt to the newly collected data, which deteriorates the previously learned knowledge accordingly.
To alleviate the catastrophic forgetting issue, a plethora of state-of-the-art on-device continual learning works have been recently established [46, 524–534], which seek to stabilize the on-device continual learning process and further push forward the attainable accuracy on the target task. Among them, [46], as an early exploration, investigates three common continual learning scenarios and demonstrates that it is frustratingly hard to evaluate different continual learning approaches, based on which [46] establishes several evaluation protocols to compare them. It is worth noting that [46] itself does not support on-device continual learning, but recent advances in on-device training (see Section 5.1) can be easily integrated into [46] to enable efficient on-device continual learning. Several subsequent works [524–527] explore on-device continual learning on resource-constrained embedded computing systems and have demonstrated promising accuracy improvement. In parallel, [528–530] attempt to generalize on-device continual learning to benefit audio and speech tasks in real-world embedded scenarios, such as environmental sound classification [528] and automatic speech recognition [530]. Inspired by the tremendous success of vision transformers [107], [531] also investigates the efficacy of on-device continual learning to continuously improve the accuracy of mainstream vision transformers. Furthermore, [532–534] delve deeper into on-device continual learning, focusing more on the training pipeline and introducing several on-device training enhancements to maximize the accuracy improvement, such as selective weight updates [532], weight freezing [533], and deep network ensembles [534].
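Beyond the update-centric enhancements above, rehearsal over a small buffer of previously seen samples is another common, generic way to mitigate catastrophic forgetting on device; it is not the specific mechanism of the works cited here, and the class name, capacity, and reservoir sampling policy in the sketch below are illustrative.

```python
import random
import torch

class ReplayBuffer:
    """Tiny reservoir-sampling rehearsal buffer for on-device continual learning."""

    def __init__(self, capacity=256):
        self.capacity, self.samples, self.num_seen = capacity, [], 0

    def add(self, x, y):
        self.num_seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append((x, y))
        else:
            # Reservoir sampling keeps an (approximately) uniform subset of the stream.
            idx = random.randrange(self.num_seen)
            if idx < self.capacity:
                self.samples[idx] = (x, y)

    def sample(self, batch_size):
        batch = random.sample(self.samples, min(batch_size, len(self.samples)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)

# During each on-device update, the loss on newly collected samples is mixed with
# the loss on a small replayed batch so that old knowledge is not overwritten.
```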
5.3 On-Device Transfer Learning
As demonstrated in [44], it is often difficult to directly train DNNs from scratch in real-world embedded scenarios, where the collected data samples are very limited. To tackle this issue, an effective alternative is on-device transfer learning, which instead fine-tunes networks pre-trained on large-scale datasets. The rationale here is that DNNs pre-trained on large-scale datasets (e.g., ImageNet [80]) can serve as powerful feature extractors for transfer learning, where only a few layers (e.g., batch normalization layers and the last layer) are fine-tuned while the other layers are typically frozen. Different from previous on-device learning practices discussed in Section 5.1 and Section 5.2, on-device transfer learning does not need to store most memory-intensive intermediate activations and, as a result, maintains significant efficiency in terms of training memory consumption, as illustrated in Fig. 28 [44]. Despite the promising memory efficiency, on-device transfer learning is quite challenging and may result in poor accuracy, especially on datasets whose data distribution is far from ImageNet [82].
To overcome such limitations, early transfer learning practices [535, 536] propose to fine-tune all the network layers, which indeed achieves better accuracy but leads to considerable memory consumption due to memory-intensive intermediate activations. To avoid this, several subsequent transfer learning works [537–540] demonstrate that it is often unnecessary to fine-tune all the network layers and that fine-tuning only the batch normalization layers can also achieve strong accuracy on the target task. This fine-tuning paradigm has the potential to significantly reduce the number of trainable parameters during the transfer learning process. In light of this, [537–540] propose to only optimize the learnable parameters in batch normalization layers (see $\gamma$ and $\beta$ in Eq (13)), whereas the other learnable parameters are frozen during the transfer learning process. For example, [537] leverages batch normalization layers as scale-and-bias patches and then trains the patched parameters, optionally also the last layer, whereas the remaining parameters are left unchanged. Furthermore, [538] reveals that, for networks with sufficient depth, training only $\gamma$ and $\beta$ can reach surprisingly strong accuracy, which demonstrates the expressive power of the learnable parameters in batch normalization layers. However, fewer trainable parameters do not directly translate into superior training memory efficiency, as shown in Fig. 27, and such fine-tuning may still involve a large amount of memory (e.g., 326 MB under a training batch size of 8) to store the memory-intensive intermediate activations of batch normalization layers [44].
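A minimal sketch of this freeze-most, tune-few recipe is given below: all parameters are frozen and gradients are re-enabled only for the batch-normalization affine parameters ($\gamma$ and $\beta$), optionally the biases, and the last classifier layer; the function and argument names are ours.

```python
import torch.nn as nn

def select_transfer_parameters(model, last_layer=None, tune_bias=False):
    """Freeze the backbone and return only the parameters to fine-tune."""
    for param in model.parameters():
        param.requires_grad = False
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)):
            if module.weight is not None:
                module.weight.requires_grad = True   # gamma
            if module.bias is not None:
                module.bias.requires_grad = True     # beta
    if tune_bias:
        for name, param in model.named_parameters():
            if name.endswith(".bias"):
                param.requires_grad = True           # bias-only updates
    if last_layer is not None:
        for param in last_layer.parameters():
            param.requires_grad = True               # last classifier layer
    return [p for p in model.parameters() if p.requires_grad]
```

The returned parameter list is what gets passed to the optimizer, so the frozen backbone contributes no weight gradients during on-device fine-tuning.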
To further alleviate the prohibitive training memory consumption, [44] introduces a simple yet effective transfer learning solution with significant training memory efficiency. Specifically, [44] relies on an empirical observation that intermediate activations are only required to update the network weights, whereas updating the network biases does not involve intermediate activations. This observation also reveals that the training memory bottleneck comes from updating the network weights rather than the biases. In light of this, [44] proposes to freeze the network weights and only update the network biases. Freezing the network weights and only updating the biases, however, may lead to significant accuracy loss.
Fig. 28. Comparisons of various on-device transfer learning methods, including TinyTL [44], FT-Norm+Last [537], FT-Last [536], and FT-Full [548]. Among them, TinyTL freezes the weights and only optimizes the bias modules. FT-Norm+Last fine-tunes the normalization layers and the last linear layer, whereas FT-Last fine-tunes only the last linear layer and FT-Full fine-tunes the full network. (figure from [44])
To compensate for the accuracy loss caused by freezing the network weights, [44] introduces an effective lite residual learning scheme, which leverages generalized memory-efficient bias modules to refine the memory-intensive intermediate activations. In particular, the lite residual learning scheme improves the attainable accuracy on the target task and, more importantly, does so at the cost of negligible memory overhead. Finally, [44] reduces the training memory consumption from more than 250 MB to only 16 MB, making it possible to explore in-memory computing infrastructures to perform memory-efficient transfer learning. Note that recent advances in on-device training (see Section 5.1) can be easily integrated into the aforementioned transfer learning works to further boost on-device transfer learning performance.
5.4 On-Device Federated Learning
On-device federated learning is an advanced decentralized learning paradigm, which enables efficient training on a large corpus of decentralized data residing on local client devices, such as mobile phones, and allows multiple local client devices to jointly train a given network without explicitly sharing their raw data [541, 549]. In practice, on-device federated learning has the potential to significantly accelerate the training process as the number of client devices grows. Besides, on-device federated learning is also one instance of the more general approach of "bringing the neural network to the data" rather than "bringing the data to the neural network", and as a result, it addresses the fundamental problems of data privacy, security, and ownership [508]. This is particularly favored in real-world embedded scenarios, where embedded devices themselves can continue to collect new data through local sensors. Thanks to these practical benefits, on-device federated learning has garnered increasing attention from both academia and industry [47]. In the past decade, on-device federated learning has been utilized to empower a plethora of real-world intelligent applications, such as mobile keyboard content suggestions [550], medical image analysis [551], and smart healthcare infrastructures [552]. As demonstrated in [541], standard on-device federated learning practices typically consist of the following five iterative steps (a minimal sketch of the server-side aggregation step follows the list):
(1) Initialization. On-device federated learning begins with a randomly initialized network, namely the global model, which is shared among the local client devices. At the early learning stage, the global model is sent from the centralized server to all the local client devices, each of which receives the same copy of the global model.
(2) Local Training. Once the local client devices receive the global model, they start to perform local on-device training, which treats the global model as the local model and then trains the local model using the locally collected data. Note that the locally collected data only resides on the local client device itself and is not shared with other client devices.
(3) Model Update. After local on-device training terminates, each local client device generates its respective model update, which reflects what the local client device has learned from the locally collected data. These model updates, instead of the locally collected data, are then sent back to the centralized server for further aggregation, which effectively eliminates data leakage and protects data privacy.
(4) Aggregation. The centralized server receives the model updates from all the local client devices, after which it aggregates them to produce an improved global model.
(5) Distribution. The centralized server then distributes the improved global model to all the local client devices, and the above steps repeat until convergence.
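The server-side aggregation step (4) is most commonly realized as federated averaging; the sketch below combines client model states weighted by their local dataset sizes, with function and argument names of our own choosing.

```python
import copy
import torch

def federated_average(client_states, client_sizes):
    """Weighted average of client state_dicts (weights proportional to data size)."""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        # Integer buffers (e.g., batch-norm counters) may need separate handling.
        global_state[key] = sum(
            state[key].float() * (size / total)
            for state, size in zip(client_states, client_sizes)
        )
    return global_state

# The server then loads the aggregated state into the global model, e.g.:
#   global_model.load_state_dict(federated_average(states, sizes))
```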
Despite being able to deliver superior learning performance across various real-world embedded tasks, on-device federated learning also suffers from critical limitations, especially from the perspective of data transmission [508], posing significant challenges when generalizing on-device federated learning to real-world intelligent embedded applications. Different from the centralized server, which is equipped with high-end network infrastructures, local client devices in real-world embedded scenarios are often low-end with less-capable network infrastructures. In such a case, it may be time-consuming to (1) distribute the global model from the centralized server to local client devices and (2) send model updates back from local client devices to the centralized server for further aggregation. To overcome such limitations, a plethora of advanced federated learning techniques have been recently established to accommodate the limited data bandwidth of local client devices [47, 414, 541–545], which primarily focus on reducing the total data bits transferred between the remote centralized server and the local client devices, such as federated averaging [541], gradient compression [542, 543], quantization [414], delayed gradient averaging [544], partial variable training [545], and local training sparsity [47]. Note that recent advances in general on-device training techniques (see Section 5.1) can be easily integrated into the aforementioned federated learning works to further enhance on-device federated learning.
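As a concrete example of the communication-reduction idea behind gradient compression, the sketch below keeps only the top-k largest-magnitude gradient entries and transmits them as (index, value) pairs; the sparsity ratio and function names are illustrative, and practical systems additionally accumulate the dropped residuals locally.

```python
import math
import torch

def topk_compress(grad, ratio=0.01):
    """Keep the top-k largest-magnitude entries of a gradient tensor."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices], grad.shape     # what the client transmits

def topk_decompress(indices, values, shape):
    """Rebuild a dense (mostly zero) gradient tensor on the server."""
    flat = torch.zeros(math.prod(shape), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)
```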
5.5 Future Envision
In this section, we further envision several promising future trends and possible directions in the field of on-device learning, which are summarized as follows:
(1) Offline On-Device Federated Learning. As discussed in Section 5.4, on-device federated learning relies heavily on the centralized server for updating local models, which requires a stable internet connection for data movement between local devices and the remote server. As a result, it may suffer from inferior on-device learning efficiency due to the communication overheads between local devices and the remote server, especially when internet connectivity is limited or unavailable. Therefore, one promising future trend is offline on-device federated learning, which excludes the remote centralized server and exploits local devices themselves to perform learning tasks. In particular, offline on-device federated learning has the potential to significantly boost on-device learning efficiency.
(2) Personalized On-Device Learning. As discussed in Section 5.1, on-device learning exhibits strong local personalization, which distinguishes it from its global training counterpart. In practice, personalized on-device learning brings two-fold benefits. On the one hand, it allows local devices to directly learn from local users to provide user-tailored AI solutions, which protects data privacy since the collected data does not need to be transferred to the remote cloud. On the other hand, it can achieve better learning accuracy since it can continue to collect rich, personalized training data from local users. Therefore, future research should leverage this unique capability to provide more highly personalized on-device learning solutions, where local devices can actively and quickly adapt themselves to users' diverse needs so as to deliver user-tailored services, such as personalized voice assistants.
(3) Robust On-Device Learning. On-device learning, despite its promising success, still suffers from critical limitations, such as poor adversarial robustness [553]. This is particularly important in real-world embedded computing systems, such as embedded visual sensing [554, 555], where the environment may dynamically change over time. This further makes local on-device learning more vulnerable to adversarial attacks, especially unseen ones, which may significantly degrade the on-device learning performance even under simple adversarial attacks [553]. To overcome such limitations, future research should focus on developing robust on-device learning techniques featuring novel adversarial training algorithms, which can achieve competitive on-device learning performance while also maintaining superior adversarial robustness against well-engineered or even unseen adversarial attacks.
(4) Efficient On-Device Learning Ecosystems. As discussed in Section 5.1, on-device learning has gained increasing popularity in both academia and industry, thanks to its strong capability to ensure data privacy and security. In light of this, future research should also develop efficient on-device learning ecosystems, including software and hardware frameworks, to further support the development, deployment, and management of on-device learning applications, making it easier for developers to create and optimize models for various on-device learning purposes. For example, [45], as one of the most representative on-device learning methods, leverages quantization to trim down the training memory consumption. However, mainstream embedded computing systems do not support low-bit training, making it difficult for them to benefit from such methods.
6 EFFICIENT LARGE LANGUAGE MODELS FOR EMBEDDED COMPUTING SYSTEMS
In the past few years, large language models (LLMs), such as GPT-3 [48] and GPT-4 [49], have achieved impressive success across various real-world language processing tasks [50]. However, the strong learning capability of LLMs also comes at the cost of excessive computational complexity. For example, OpenAI's GPT-3 [48], as one of the most representative LLMs, consists of 175 billion parameters. More recently, LLMs have continued to evolve with ever-increasing model sizes in order to achieve state-of-the-art performance [51, 52], which makes it even more challenging to deploy LLMs on modern embedded computing systems. To this end, we, in this section, first introduce the preliminaries of LLMs and then discuss recent state-of-the-art advances in efficient LLMs, including efficient LLM architecture design in Section 6.2, efficient LLM compression techniques in Section 6.3, and efficient LLM system design in Section 6.4. We also summarize these state-of-the-art advances in Fig. 29. Finally, in Section 6.5, we envision several promising future directions in the field of efficient LLMs.
6.1 Preliminaries on LLMs
Large language models (LLMs) are emerging machine learning models dedicated to understanding, generating, and interacting with human language by leveraging extensive textual data. In practice, LLMs are typically built upon the transformer architecture with encoder and decoder [90] and rely heavily on self-attention mechanisms to measure the significance of different words in a given sentence, regardless of their positional relationships. Thanks to their strong capability to interpret rich information, LLMs exhibit remarkable performance across a wide range of language processing tasks, such as text summarization, translation, question answering, and conversational response generation. As discussed in [50], recent state-of-the-art LLMs can be divided into three main categories according to their inherent architectures: encoder-only, decoder-only, and encoder-decoder, as follows:
(1) Encoder-Only Language Models. Encoder-only language models typically focus on transforming the given input text into continuous representations, which capture and reflect the context of the input. These models are usually used for real-world language processing tasks that require understanding or embedding the given input text, such as sentence classification, named entity recognition, and extractive question answering, where the output does not need to be sequential or generated text. For example, BERT [91] is one of the most representative encoder-only language models, featuring masked language modeling during training, which enables the model to understand context from both directions (i.e., left and right context).
(2) Decoder-Only Language Models. Decoder-only language models typically focus on generating text based on the given input text, interpreting the context of the input. These models are usually used for real-world language processing tasks where text generation is required, such as text generation and language modeling. For example, GPT-3 [48] is one of the most representative decoder-only language models, featuring auto-regressive training, which learns to accurately predict the next word in a sequence from all the previous words.
(3) Encoder-Decoder Language Models. Encoder-decoder language models, also known as sequence-to-sequence (seq2seq) models, typically consist of two parts: (1) the encoder, which processes the given input text and encodes it into feature representations, and (2) the decoder, which exploits these feature representations to generate an output sequence. The encoder-decoder architecture is versatile and suitable for real-world language processing tasks that require transforming the given input text into different formats, such as language translation, summarization, and dialogue systems. For example, T5 [556] is one of the most representative encoder-decoder language models, which formulates the given task as a text-to-text transformation problem and converts the given input text to the target output text. In parallel, BART [557] is particularly effective in generative and comprehension tasks, thanks to its bidirectional encoder and auto-regressive decoder.
6.2 Efficient LLM Architectures
As discussed in Section 6.1, recent LLMs are typically built upon the transformer architecture [90] and rely heavily on self-attention mechanisms to interpret the significance of different words in a given sentence, regardless of their positional relationships. However, self-attention mechanisms, despite their strong capability for language processing, also introduce considerable computational complexity. As pointed out in [599], the quadratic time and memory complexity of self-attention may significantly slow down the pre-training, inference, and fine-tuning stages of LLMs. To optimize the prohibitive computational complexity of LLMs, recent state-of-the-art efficient LLMs often focus on exploring computation-efficient self-attention mechanisms.
Efficient LLM Architectures: General Attention [558–563]; Hardware-Aware Attention [53–55, 564–566]
Efficient LLM Compression: LLM Pruning [57, 58, 346, 567–579]; LLM Quantization [59, 60, 580–586]; LLM Distillation [61, 62, 587–592]
Efficient LLM Systems: LLM Inference [63–65, 593–598]
Fig. 29. Overview of efficient LLM architectures, LLM compression techniques, and LLM systems in Section 6.
self-attention [90]. Among them, [558] introduces clustered attention, which groups different queries into clusters and computes the attention only for the centroids rather than for every query. To further improve this approximation, [558] also employs the computed clusters to identify the keys with the highest attention per query and computes the exact key-query dot products. [559] draws insights from the Nyström method and proposes to approximate the standard self-attention mechanism with linear complexity, which enables applications to longer sequences with even thousands of tokens. [560] demonstrates that the complexity bottleneck of self-attention mainly comes from the computation of partition functions in the denominator of the softmax function and the multiplication of the softmax matrix with the matrix of values. To this end, [560] features an efficient kernel density estimation (KDE) solver to resolve the above complexity bottleneck via sub-sampling based fast matrix products, which can approximate the attention in sub-quadratic time with provable spectral norm bounds. [561] introduces an efficient single-head gated attention mechanism with exponential moving average to incorporate the inductive bias of position-aware local dependencies into the position-agnostic attention mechanism, which exhibits linear time and space complexity while causing minimal performance loss. [562] introduces an efficient universal approximation for self-attention, which exhibits linear time and space complexity and also reveals the theoretical insights behind existing efficient transformers (e.g., Linformer [97]) with linear time and space complexity. [563] introduces an efficient attention approximation mechanism featuring fused low-rank kernel approximation, which provides sizable runtime performance gains while maintaining high approximation quality.
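To make the idea of linear-complexity attention concrete, the NumPy sketch below contrasts vanilla quadratic attention with a simple kernelized approximation. It is only an illustrative toy under stated assumptions (the feature map, shapes, and constants are ours), not an implementation of any specific method cited above.

```python
# Illustrative sketch: linear-complexity attention via a kernel feature map, computed in
# O(N * d^2) instead of O(N^2 * d). The feature map phi and constants are assumptions.
import numpy as np

def softmax_attention(Q, K, V):
    # Vanilla attention: materializes an N x N score matrix (quadratic in sequence length).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Kernelized approximation: phi(Q) (phi(K)^T V) avoids forming the N x N matrix.
    phi = lambda x: np.maximum(x, 0.0) + 1e-2        # simple positive feature map (assumption)
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                                    # d x d summary of keys and values
    normalizer = Qf @ Kf.sum(axis=0) + eps           # per-query normalization term
    return (Qf @ kv) / normalizer[:, None]

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)  # (1024, 64) each
```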
Hardware-Aware Efficient Attention. In parallel to the above efficient attention approximation works, some recent works [53–55, 564–566] focus on exploring efficient hardware-aware attention mechanisms, which can exhibit considerable efficiency improvement on modern hardware systems. Among them, FlashAttention [54] features an efficient IO-aware exact attention algorithm, which explores tiling to reduce the total number of memory reads and writes between GPU high-bandwidth memory and GPU on-chip SRAM. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, which is mainly due to the sub-optimal work partitioning between different thread blocks and warps on GPUs. To tackle this issue, FlashAttention-2 [55] further introduces an improved work partitioning scheme, which (1) tweaks the algorithm to reduce the number of non-matmul FLOPs, (2) parallelizes the attention computation workloads across different thread blocks, and (3) distributes the work between warps to reduce communication through shared memory. FLASHLINEARATTENTION [566] dives deeper into I/O-awareness and introduces an effective hardware-efficient algorithm for linear attention, which trades off memory movement against parallelizability and can even be faster than FlashAttention-2. PagedAttention [53] draws insights from the operating system's solution to memory fragmentation and sharing through virtual memory with paging, and further divides the request's key-value (KV) cache into different blocks, each of which contains the attention keys and values of a fixed number of tokens. A3 [564] demonstrates that implementing attention mechanisms using matrix-vector multiplication is often sub-optimal and further proposes to accelerate attention mechanisms with joint algorithmic approximation and hardware specialization. Similar to A3, ELSA [565] features an effective approximation scheme to significantly reduce the amount of computation workloads by efficiently filtering out relationships that are unlikely to affect the final output.
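The core trick that IO-aware attention kernels exploit is that softmax attention can be computed block by block with a running (online) softmax, so the full attention matrix never has to be materialized in fast memory. The NumPy sketch below illustrates this tiling idea only at a functional level; the block size and variable names are assumptions, and real systems such as FlashAttention fuse these steps into hand-tuned GPU kernels.

```python
# Minimal functional sketch of tiled attention with an online softmax: keys/values are
# processed block by block, so no N x N score matrix is ever stored. Illustrative only.
import numpy as np

def tiled_attention(Q, K, V, block=128):
    N, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(N, -np.inf)      # running max of scores per query
    row_sum = np.zeros(N)              # running softmax denominator per query
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T / np.sqrt(d)                 # N x block tile of scores
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)         # rescale previously accumulated results
        probs = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + probs @ Vb
        row_sum = row_sum * correction + probs.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

N, d = 512, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
print(tiled_attention(Q, K, V).shape)  # (512, 64), identical to vanilla attention output
```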
6.3 Efficient LLM Compression
In addition to designing LLMs with efficient architectures, another promising direction is to explore efficient LLM compression techniques to optimize the computational complexity of existing computation-intensive LLMs. With this in mind, we, in this section, further discuss recent state-of-the-art compression techniques for LLMs, including efficient LLM pruning in Section 6.3.1, efficient LLM quantization in Section 6.3.2, and efficient LLM distillation in Section 6.3.3.
6.3.1 Efficient LLM Pruning. Pruning is one of the most effective strategies to optimize the computational efficiency of LLMs, which removes the less important parameters of LLMs while incurring minimal accuracy loss. Recent state-of-the-art LLM pruning methods can be divided into two main categories, including non-structured LLM pruning and structured LLM pruning as follows:
(1) Non-Structured LLM Pruning. Non-structured LLM pruning removes the less important LLM weights/connections, which can yield more aggressive compression ratios than structured pruning while also exhibiting strong accuracy [58, 346, 567–572]. For example, SparseGPT [567] shows that LLMs can be pruned to at least 50% sparsity in one shot without any retraining, and more importantly, at minimal accuracy loss. In parallel, Wanda [58] proposes to prune the less important weights with the smallest magnitudes multiplied with the corresponding input activation norms on a per-output basis (see the sketch after this list). More importantly, both SparseGPT and Wanda can generalize to semi-structured pruning [346, 568] towards better hardware parallelism, which can deliver realistic on-device inference speedups with the support of some existing deep learning libraries (e.g., cuSPARSELt [340] and TVM [341]). Furthermore, [569] advocates for reinstating ReLU activation in LLMs and explores sparse patterns in ReLU-based LLMs, which shows that ReLU activation can effectively reduce LLM inference computation overheads by up to three times. In practice, non-structured pruning has also been widely employed to enhance the pre-training and fine-tuning of LLMs [570–572].
(2) Structured LLM Pruning. In contrast to non-structured LLM pruning, structured LLM pruning can achieve realistic inference speedups on target hardware, which, however, also suffers from more aggressive accuracy loss than non-structured pruning. To tackle this dilemma, recent state-of-the-art structured LLM pruning methods [57, 573–579] typically feature an additional fine-tuning stage to further recover the attainable accuracy of the pruned LLM. For example, LLM-Pruner [57] employs structural pruning to selectively remove non-critical coupled structures according to their gradient information, which can preserve the majority of the LLM's functionality while optimizing its computational efficiency. Furthermore, LLM-Pruner recovers the performance of the pruned LLM using another state-of-the-art tuning technique (i.e., LoRA [600]), which merely takes 3 hours with 50K data. Similar to LLM-Pruner, ZipLM [574] iteratively identifies and removes LLM components with the worst loss-runtime trade-off, which can yield efficient LLMs and also generalize across various runtime constraints. In addition, LoRAShear [575] first creates the dependency graphs over LoRA modules and then performs progressive structured pruning on LoRA adaptors to enable inherent knowledge transfer. To further recover the information lost during pruning, LoRAShear also introduces an effective fine-tuning scheme with dynamic data adaptors to narrow down the performance gap between the pruned LLM and the non-pruned LLM. More recently, several LLM layer pruning methods [577–579] demonstrate that LLM layers are also redundant and thus can be removed to largely enhance the inference efficiency of LLMs at minimal accuracy loss. For example, ShortGPT [578] and Shortened LLaMA [579] propose to remove the less important LLM layers according to their layer importance scores, whereas LLM-Streamline [577] proposes to replace the less important LLM layers with more lightweight ones.
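As a concrete illustration of the magnitude-times-activation scoring used by Wanda [58] (referenced in item (1) above), the PyTorch sketch below scores each weight by its magnitude multiplied with the L2 norm of the corresponding input activation channel and zeroes the lowest-scoring weights per output row. The calibration setup, variable names, and sparsity ratio are assumptions for illustration, not the authors' exact implementation.

```python
# Hedged sketch of Wanda-style unstructured pruning for one linear layer.
import torch

def prune_linear_wanda_style(weight: torch.Tensor, calib_acts: torch.Tensor, sparsity=0.5):
    # weight: (out_features, in_features); calib_acts: (num_tokens, in_features)
    act_norm = calib_acts.norm(p=2, dim=0)            # per-input-channel L2 norm
    score = weight.abs() * act_norm.unsqueeze(0)      # importance of every weight
    k = int(weight.shape[1] * sparsity)               # weights removed per output row
    _, idx = torch.topk(score, k, dim=1, largest=False)
    mask = torch.ones_like(weight)
    mask.scatter_(1, idx, 0.0)                        # zero marks pruned positions
    return weight * mask

W = torch.randn(256, 512)
X = torch.randn(1024, 512)                            # hypothetical calibration activations
W_pruned = prune_linear_wanda_style(W, X, sparsity=0.5)
print((W_pruned == 0).float().mean())                 # roughly 0.5 sparsity
```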
6.3.2 Efficient LLM Quantization. Recent state-of-the-art LLM quantization techniques [59, 60, 580–586] focus on reducing the precision of LLM weights from higher to lower bit-widths (e.g., from 32 bits to 8 bits or even 1 bit), which can substantially enhance the inference efficiency of LLMs at the cost of slight accuracy loss. Among them, SmoothQuant [59] introduces an efficient training-free post-training quantization solution to enable 8-bit weight and 8-bit activation quantization for LLMs. Given that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with an equivalent mathematical transformation. Similar to SmoothQuant, AWQ [60] introduces an efficient hardware-friendly quantization approach for low-bit LLM weight-only quantization, which is built upon the interesting observation that weights are not equally important and that protecting only 1% of salient weights can greatly reduce the quantization error. In light of this, AWQ proposes to search for the optimal per-channel scaling scheme that protects the salient weights by observing the activations rather than the weights. SpQR [580] introduces a new compressed format for efficient LLM quantization, which can enable near-lossless compression of LLMs across various model scales while maintaining comparable compression levels to previous quantization methods. Specifically, SpQR first identifies and isolates the outlier weights that may cause particularly large quantization errors, after which SpQR stores them in high precision while compressing all other weights to 3–4 bits. OS+ [581] features channel-wise shifting for asymmetry and channel-wise scaling for concentration, since these operations can be seamlessly migrated into the subsequent quantization modules while maintaining strict equivalence. OS+ also introduces a fast and stable scheme to calculate effective shifting and scaling values, which can further balance the quantization burden towards better quantization performance.
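To make the smoothing idea more tangible, the sketch below follows the per-channel scaling described for SmoothQuant [59]: a scale s divides the activations and multiplies the corresponding weight columns, leaving the linear layer's output unchanged while migrating quantization difficulty from activations to weights. The alpha value, calibration statistics, and per-tensor fake quantizer are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of SmoothQuant-style activation smoothing before 8-bit quantization.
import torch

def smooth_scales(act_absmax: torch.Tensor, weight: torch.Tensor, alpha=0.5, eps=1e-5):
    # act_absmax: per-input-channel max |activation| from calibration; weight: (out, in)
    w_absmax = weight.abs().amax(dim=0)
    return (act_absmax.clamp(min=eps) ** alpha) / (w_absmax.clamp(min=eps) ** (1 - alpha))

def fake_quant_int8(t: torch.Tensor):
    scale = t.abs().amax() / 127.0
    return (t / scale).round().clamp(-128, 127) * scale

W = torch.randn(256, 512)                         # hypothetical linear layer weight
act_absmax = torch.rand(512) * 20 + 0.1           # hypothetical calibration stats with outliers
s = smooth_scales(act_absmax, W, alpha=0.5)

W_smoothed = W * s.unsqueeze(0)                   # fold s into the weight columns
X = torch.randn(8, 512)
y_ref = X @ W.t()
y_smoothed = (X / s) @ W_smoothed.t()             # mathematically equivalent before quantization
print(torch.allclose(y_ref, y_smoothed, atol=1e-3))
W_q = fake_quant_int8(W_smoothed)                 # the smoothed weights now quantize better
```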
Furthermore, OWQ [582] introduces an efficient outlier-aware weight quantization strategy, which aims to minimize the LLM's memory footprint through low-bit quantization. OWQ prioritizes a small subset of structured weights that are sensitive to quantization and stores them in higher bits, while applying highly tuned quantization to the remaining dense weights. QuIP [583] introduces quantization with incoherence processing, which consists of two independent stages, including (1) an adaptive rounding stage to minimize a pre-defined quadratic proxy objective and (2) an efficient pre- and post-processing stage to ensure weight and Hessian incoherence via multiplication by random orthogonal matrices. OmniQuant [584] introduces an omnidirectionally calibrated quantization technique for LLMs, which consists of two novel components, including learnable weight clipping (LWC) and learnable equivalent transformation (LET). LWC modulates the extreme weight values by optimizing the clipping threshold, and LET eliminates the activation outliers by shifting the challenge of quantization from activations to weights. Both LWC and LET can be seamlessly integrated into an effective differentiable optimization framework featuring block-wise error minimization for both weight-only and weight-activation quantization. [585] further dives deeper into LLM quantization and analyzes the effect of LLM quantization with comprehensive experiments to evaluate current state-of-the-art LLM quantization techniques, which systematically summarizes the effect of LLM quantization, provides recommendations to apply LLM quantization techniques, and points out future directions of LLM quantization. In contrast to the above works that focus on efficient LLM quantization algorithms, OliVe [586] presents an algorithm/architecture co-designed solution to explore efficient quantized LLMs, which features an outlier-victim pair (OVP) quantization scheme and handles outlier values locally with low hardware overheads and high performance gains. This enables an efficient hardware-aligned OVP encoding scheme, which can be integrated into existing hardware accelerators (e.g., systolic arrays and tensor cores) towards more efficient quantized LLMs for generative inference.
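A recurring pattern in the works above (e.g., SpQR [580] and OWQ [582]) is to isolate a small fraction of outlier weights in full precision while quantizing the remaining dense weights to very low bit-widths. The sketch below captures only this high-level pattern with an illustrative per-tensor scale; the threshold selection, bit-width, and outlier fraction are assumptions and do not reproduce either paper's exact algorithm.

```python
# Hedged sketch of outlier-aware weight quantization: keep a few large weights in full
# precision, quantize the rest to low bits with a scale fitted to the dense part.
import torch

def quantize_with_outliers(weight: torch.Tensor, bits=3, outlier_frac=0.01):
    flat = weight.abs().flatten()
    k = max(1, int(outlier_frac * flat.numel()))
    thresh = flat.topk(k).values.min()                # magnitude threshold for outliers
    outlier_mask = weight.abs() >= thresh
    qmax = 2 ** (bits - 1) - 1
    scale = weight[~outlier_mask].abs().max() / qmax  # scale fitted to the dense part only
    dense_q = (weight / scale).round().clamp(-qmax - 1, qmax) * scale
    return torch.where(outlier_mask, weight, dense_q) # recombine outliers with dense part

W = torch.randn(256, 512)
W[0, 0] = 40.0                                        # inject an artificial outlier
W_q = quantize_with_outliers(W)
print((W_q - W).abs().max().item(), W_q[0, 0].item()) # outlier weight preserved exactly
```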
6.3.3 Efficient LLM Distillation. Another promising direction is to leverage the pre-trained knowledge from large LLMs to enhance the training or fine-tuning process of small LLMs, which allows small LLMs to approach the strong performance of large LLMs while exhibiting superior efficiency. As discussed in [50], recent LLM distillation methods can be divided into two main categories, including black-box LLM distillation and white-box LLM distillation as follows:
(1) Black-Box LLM Distillation. In the context of black-box distillation, the teacher LLM's parameters are not available to the student LLM, and the student LLM can only see the final output from the teacher LLM. In practice, black-box distillation typically uses commercial LLMs (e.g., GPT-3 [48] and GPT-4 [49]) as the teacher and leverages the predictions from the teacher to further enhance the training or fine-tuning process of small student LLMs [61, 587–589]. For example, Self-Instruct [61] first generates a large number of instruction, input, and output sequences from GPT-3 using its APIs, after which Self-Instruct filters out the invalid or similar ones before using them to fine-tune the original GPT-3 model. Finally, Self-Instruct achieves an absolute improvement of 33% over the original GPT-3 model on Super-NaturalInstructions. Similar to Self-Instruct, [587] uses GPT-4 to first generate rich instruction-following data pairs and then uses the generated data pairs to fine-tune small LLaMA models to improve their performance.
(2) White-Box LLM Distillation. In the context of white-box distillation, the teacher LLM's parameters are available to the student LLM, and the student LLM can also see the hidden intermediate output from the teacher LLM. More recently, with the emergence of open-source LLMs, white-box distillation has become more popular and more valuable for the LLM community since the student LLM can potentially benefit from the hidden states of the teacher LLM towards better distillation performance [62, 590–592]. Among them, MiniLLM [62] first replaces the forward Kullback-Leibler divergence (KLD) objective with reverse KLD, which can prevent the student LLM from overestimating the low-probability regions of the teacher distribution. Next, MiniLLM introduces an effective optimization approach to learn the above reverse KLD objective, which can enhance the student LLM to generate high-quality responses. TED [590] presents an effective task-aware layer-wise distillation strategy, which features task-aware filters to align the hidden states of teacher and student at each layer. The above filters can select the knowledge from the hidden states that is useful for target tasks, which can further reduce the knowledge gap between teacher and student LLMs. GKD [591] proposes to train the student LLM on its self-generated output sequences along with the feedback from the teacher LLM on such self-generated sequences. In addition, GKD also offers the flexibility to employ alternative loss functions between teacher and student, which can enhance the distillation performance of the student even when the student lacks the expressivity to mimic the teacher's distribution. More recently, [592] introduces token-scaled logit distillation for quantization-aware training of LLMs, which can effectively mitigate the overfitting issue and also largely enhance the distillation process from both the teacher predictions and the ground truths.
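The sketch below illustrates a per-token reverse-KL objective of the kind MiniLLM [62] advocates, i.e., KL(student || teacher) rather than the usual forward direction. It is a deliberately simplified illustration (MiniLLM's full optimization over sampled sequences is more involved), and the shapes and temperature are assumptions.

```python
# Hedged sketch of a per-token reverse-KL distillation loss between teacher and student.
import torch
import torch.nn.functional as F

def reverse_kl_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # logits: (batch, seq_len, vocab_size)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)
    t_logprob = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_prob = s_logprob.exp()
    # KL(p_student || p_teacher) = sum_v p_s * (log p_s - log p_t), averaged over positions
    return (s_prob * (s_logprob - t_logprob)).sum(dim=-1).mean()

student_logits = torch.randn(2, 16, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 16, 32000)
loss = reverse_kl_distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```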
6.4 Efficient LLM Systems
In parallel to the rapid development of efficient LLM algorithms, a plethora of efficient LLM systems and infrastructures have also recently emerged [63–65, 593–598], which further optimize the generative inference efficiency of LLMs from the perspective of efficient system-level implementations. Among them, FlexGen [63] features an efficient high-throughput generation engine for running LLMs on a single GPU with limited memory, which can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, FlexGen also searches for efficient patterns to store and access tensors. Tabi [65] features an inference system with an efficient multi-level inference engine, which can serve queries using small models and optional LLMs for demanding applications. Tabi is particularly optimized for discriminative models (not generative LLMs) in a serving framework, which uses the calibrated confidence score to determine whether to directly return the accurate results of small models or further re-route them to LLMs. DeepSpeed [593] presents a comprehensive system solution for efficient transformer inference, which consists of (1) a multi-GPU inference engine to minimize the runtime latency while maximizing the runtime throughput of both dense and sparse transformers when they fit into aggregate GPU memory and (2) a heterogeneous inference engine that leverages CPU and NVMe memory in addition to GPU memory and computation to enable high inference throughput with large models that do not fit into aggregate GPU memory. FastServe [594] presents an efficient distributed inference serving system for LLMs, which (1) exploits the auto-regressive pattern of LLM inference to enable preemption at the granularity of each output token and (2) explores preemptive scheduling to minimize job completion time with a novel skip-join multi-level feedback queue scheduler. Petals [64] features an efficient collaborative system for runtime inference and fine-tuning of LLMs by joining the resources of multiple parties. In contrast to concurrent LLM inference engines, Petals also natively exposes the hidden states of the served model, allowing users to train and share custom model extensions based on efficient fine-tuning schemes.
Furthermore, S3 [595] demonstrates that designing an inference system with prior knowledge of the output sequence can largely increase the runtime inference throughput of LLMs. Therefore, in order to increase the runtime inference throughput of LLMs, S3 proposes to (1) first predict the output sequence length, (2) then schedule generation queries based on the prediction to increase runtime resource utilization and throughput, and (3) finally handle mispredictions. Thanks to the prior knowledge of the output sequence, S3 can achieve much better inference throughput than earlier LLM systems. Splitwise [596] proposes to split the two phases of typical LLM inference workloads onto different hardware, which allows each inference phase to run on well-suited hardware with independently provisioned computational resources, thereby improving the runtime resource utilization across different hardware. Splitwise also optimizes the state transfer across different hardware using the fast back-plane interconnects in today's GPU clusters to further increase the runtime LLM inference throughput. DistServe [597] proposes to disaggregate the prefill and decoding computation to enhance the runtime serving performance of LLMs, which assigns the prefill and decoding computation workloads to different GPUs and thus largely eliminates the prefill-decoding interference towards better runtime inference throughput. DistServe also optimizes the above two phases according to the serving cluster's bandwidth to minimize the communication overheads caused by the disaggregation. Liger [598] features an efficient distributed collaborative inference system for LLMs, which can achieve low inference latency at high throughput on multiple GPUs. In addition, to achieve high parallelism and throughput, Liger also introduces an efficient scheduling strategy to effectively schedule the computation and communication kernels across different input requests onto multiple streams of multiple GPUs.
6.5 Future Envision
In this section, we further envision several promising future trends and possible directions in the field of efficient LLMs, which are summarized as follows:
(1) AutoML for Efficient LLMs. Recent state-of-the-art efficient LLMs are typically built upon manual heuristics, which, despite their efficacy, often require considerable human expertise and engineering efforts. In light of this, one promising future direction is to automatically explore efficient LLMs using automated machine learning (AutoML) techniques [137]. For example, given an efficient LLM, we can leverage AutoML techniques to automatically search for its tailored efficient system implementation towards the optimal on-device inference speedup. Similarly, we can also leverage AutoML techniques to automatically search for its tailored pruning or quantization strategy towards the optimal accuracy-efficiency trade-off. This has the potential to largely push forward the frontier of efficient LLM design.
(2) Alternative Structures for Efficient LLMs. Recent state-of-the-art LLMs heavily rely on the self-attention mechanism in the transformer [90], which, however, suffers from quadratic time and memory complexity and greatly slows down the pre-training, inference, and fine-tuning stages of LLMs [599]. To tackle this dilemma, several alternative structures have recently emerged (e.g., RWKV [601], Mamba [602], and RetNet [603]), which exhibit optimized computational efficiency and also allow researchers to perform efficient language modeling tasks without transformers. For example, RetNet [603] introduces a recurrent representation to enable low-cost inference, which improves the decoding throughput, runtime latency, and GPU memory consumption without sacrificing the language modeling performance. In light of this, one promising future direction is to explore more efficient alternative structures for LLMs, which may deliver considerable efficiency gains over existing transformer-based LLMs without sacrificing the language modeling performance.
(3) Hardware-Aware Benchmarks for Efficient LLMs. Recent state-of-the-art efficient LLMs are typically optimized in terms of the number of parameters or FLOPs. However, these theoretical complexity metrics cannot accurately reflect the runtime performance on target hardware (e.g., latency and energy). This further makes it challenging to fairly compare different efficient LLMs in terms of their runtime inference efficiency on target hardware. In light of this, one promising future direction is to design hardware-aware benchmarks for efficient LLMs, which may include different hardware performance metrics (e.g., latency and energy) across different hardware systems.
(4) Infrastructures for Efficient LLMs. Recently, there have been a large number of works on efficient LLM compression, including LLM pruning and LLM quantization, which, however, often require specialized hardware accelerators and thus cannot achieve realistic on-device inference speedups on modern embedded computing systems. For example, non-structured LLM pruning can remove the less important weights to explore highly sparse LLMs with aggressive compression ratios. However, the resulting sparse LLMs cannot achieve realistic on-device inference speedups due to the irregular network sparsity [57]. Another recent work [586] has also explored accelerating quantized LLMs and achieved promising performance. However, these efforts are far from enough for real-world large-scale deployments. In light of this, one promising future direction is to design specialized software and hardware infrastructures to further optimize LLMs for efficient on-device inference.
7 DEEP LEARNING FRAMEWORKS FOR EMBEDDED COMPUTING SYSTEMS
In the past few years, DNNs have been achieving tremendous success in a myriad of real-world intelligent embedded computing scenarios, such as on-device speech recognition [604, 605], object
| Software | Created by | Year | Programming Languages | Computation Graph | Training | Maintenance |
|---|---|---|---|---|---|---|
| TensorFlow [66] | Google | 2015 | Python, C++, Java, and JavaScript | Static and Dynamic | ✓ | ✓ |
| PyTorch [67] | Facebook (now Meta) | 2016 | Python and C++ | Dynamic | ✓ | ✓ |
| Caffe [610] | Berkeley | 2014 | C++ | Static | ✓ | ✗ |
| MXNet [611] | Amazon | 2015 | Python, C++, R, Java, Julia, JavaScript, Scala, Go, and Perl | Static and Dynamic | ✓ | ✗ |
| Keras [612] | Personal | 2015 | Python | Static and Dynamic | ✓ | ✓ |
| CoreML [613] | Apple | 2017 | Python, Swift, and Objective-C | Static | ✗ | ✓ |
| PaddlePaddle [614] | Baidu | 2016 | Python and C++ | Static and Dynamic | ✓ | ✓ |
| BigDL [615] | Intel | 2017 | Python and Scala | Dynamic | ✓ | ✓ |

Table 3. Illustration of deep learning software frameworks discussed in Section 7.1. Note that we refer to a framework as under active maintenance if there are new releases within the previous six months.
detection and tracking [606, 607], autonomous vehicles [608, 609], etc. In the meantime, a series of customized software [66, 67, 610–615] and hardware frameworks [68–70, 616–618] have also been developed to facilitate the deployment of DNNs on embedded computing systems. Therefore, we, in this section, further discuss recent popular deep learning software and hardware frameworks that bring deep learning to embedded computing systems to embrace ubiquitous embedded intelligence.
7.1 Deep Learning Software Frameworks
In this section, we introduce popular deep learning software frameworks that have been widely used to develop deep learning solutions for embedded computing systems, including TensorFlow [66], PyTorch [67], Caffe [610], MXNet [611], Keras [612], CoreML [613], PaddlePaddle [614], and BigDL [615]. The covered deep learning software frameworks are summarized in Table 3.
TensorFlow [66] is an open-source deep learning software framework developed by Google, which was released in 2015 and has since become one of the most popular deep learning software frameworks for training and deploying DNNs. In practice, TensorFlow, and especially TensorFlow Lite, allows developers to easily build and deploy DNNs on a wide range of embedded computing systems, including mobile phones, microcontrollers (MCUs), Raspberry Pi, TPUs, and edge GPUs. In the meantime, TensorFlow also supports various real-world applications, ranging from image and speech recognition to natural language processing and predictive analytics. With its flexible architecture and vast collection of pre-trained models, TensorFlow is regarded as one of the most important tools for researchers and developers in the field of deep learning.
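As a minimal sketch of the embedded deployment path mentioned above, the snippet below converts a small placeholder Keras model to the TensorFlow Lite format with default post-training optimization; the model architecture and output file name are assumptions for illustration.

```python
# Hedged example: export a placeholder Keras model to TensorFlow Lite for on-device use.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enable default post-training optimization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:                  # hypothetical output path
    f.write(tflite_model)
```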
PyTorch [67] is an open-source deep learning software framework that is widely used for training and deploying deep neural networks. It was developed by Facebook (now known as Meta) and released in 2016. One of its key features is the dynamic computation graph, which allows developers to change the computation behavior of DNNs on the fly. This feature distinguishes PyTorch from earlier deep learning software frameworks (e.g., early versions of TensorFlow) that only supported static computation graphs. In addition, PyTorch offers a number of high-level features that make it easier to build more complex DNNs. For example, the TorchVision package provides various useful tools and pre-trained models for image and video processing. With its dynamic computation graph, ease of use, and range of high-level features, PyTorch has become an essential software framework for training and deploying DNNs in the deep learning community.
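The toy module below illustrates the dynamic computation graph mentioned above: ordinary Python control flow inside forward() can change which layers are executed on every call, and eager-mode PyTorch records the resulting graph on the fly. The module itself and its depth argument are illustrative assumptions.

```python
# Toy illustration of PyTorch's dynamic computation graph via runtime control flow.
import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    def __init__(self, dim=16, max_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(max_layers))
        self.head = nn.Linear(dim, 1)

    def forward(self, x, depth):
        # The number of layers actually executed depends on a runtime value.
        for layer in self.layers[:depth]:
            x = torch.relu(layer(x))
        return self.head(x)

net = DynamicDepthNet()
x = torch.randn(8, 16)
print(net(x, depth=2).shape, net(x, depth=4).shape)  # a different graph is built per call
```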
Caffe [610] is a popular deep learning software framework developed by Berkeley and released in 2014, which has gained increasing popularity due to its speed, modularity, and ease of use. One of the key features is its ability to deal with large datasets with millions of images. Another important feature is its modularity, which allows developers to add or remove components with ease. Besides, Caffe includes a large library that contains hundreds of pre-trained models, which can be used to quickly build deep learning applications, such as image classification, object detection, and segmentation. In addition to its powerful features, Caffe has a user-friendly interface, allowing developers to train and deploy DNNs without extensive knowledge of deep learning. With its powerful features and user-friendly interfaces, Caffe has become an invaluable tool for researchers and developers and has inspired subsequent deep learning software frameworks.
MXNet [611] is an open-source deep learning software framework for training and deploying DNNs, which was developed by Amazon and released in 2015. One of its key technical merits is its distributed training capability, which allows DNNs to be trained across multiple computation nodes, and more importantly, in a computationally efficient manner. Besides, MXNet also supports multiple programming languages, such as Python, C++, R, and Julia, which further increases its accessibility to researchers and developers with diverse skill levels. Apart from these, MXNet's integration with other deep learning software frameworks and tools, such as Apache Spark and Apache Flink, is another important feature that facilitates the integration of deep learning into existing data processing pipelines. Thanks to its scalability, flexibility, and efficiency, MXNet has become a popular option for developers and researchers in the deep learning community.
Keras [612] is an open-source deep learning software framework written in Python, which provides high-level APIs for building and training efficient DNN solutions. It was developed by François Chollet in 2015 and is now maintained by a community of developers. In particular, Keras has been integrated into TensorFlow, and starting from TensorFlow 2.0, Keras has become the default API for building DNN solutions in TensorFlow. Specifically, one of the key features of Keras is its modularity, which allows developers and researchers to easily construct and customize DNNs. Furthermore, Keras also allows users to productize DNN solutions on mobile platforms such as iOS and Android, on the web, or on the Java virtual machine. Last but not least, Keras supports training DNN solutions in an efficient distributed manner on clusters of multiple GPUs and TPUs. The aforementioned strengths make Keras increasingly popular in both industry and academia.
CoreML [613] is a deep learning software framework developed by Apple in 2017, which aims to integrate DNNs into Apple commercial products, such as iPhone, iPad, and Apple Watch. In addition to supporting extensive DNNs with over 30 layer types, CoreML also covers standard machine learning models, such as tree ensembles, support vector machines, and generalized linear models. Furthermore, another key feature of CoreML is its ability to directly run DNNs on the device, without the need for cloud-based inference. Besides, CoreML provides a range of optimization techniques, such as quantization and pruning, to reduce the complexity of DNNs. It is worth noting that this is particularly important for modern mobile devices, which typically have limited storage and computational resources. Also, CoreML, built on top of advanced technologies like Metal and Accelerate, seamlessly takes advantage of CPUs and GPUs to provide the maximum inference performance at run time. These technical features make CoreML the first choice for developing efficient DNN solutions on Apple commercial products.
PaddlePaddle [614], also known as Paddle, is an open-source deep learning software framework developed by Baidu, which has been released to benefit the deep learning community since 2016. Specifically, PaddlePaddle is designed to be an industrial platform with advanced technologies and rich features that cover core deep learning frameworks, basic model libraries, end-to-end tools, and service platforms. In particular, PaddlePaddle originated from industrial practice with a dedication and commitment to industrialization. It has been adopted by a wide range of sectors, including manufacturing, agriculture, and enterprise services. With these industrial benefits, PaddlePaddle has motivated an increasing number of developers and researchers to commercialize AI.
| Hardware | RAM | Storage | Power | Performance | Price | Supported Deep Learning Software |
|---|---|---|---|---|---|---|
| Nvidia Jetson TX2 [69] | 8 GB LPDDR4 | 32 GB eMMC 5.1 | 7.5 W ∼ 15 W | 1.33 TFLOPS | $399 | TensorFlow, PyTorch, Caffe, Keras, and MXNet |
| Nvidia Jetson Nano [69] | 4 GB LPDDR4 | 16 GB eMMC 5.1 | 5 W ∼ 10 W | 0.472 TFLOPS | $99 | TensorFlow, PyTorch, Caffe, Keras, and MXNet |
| Nvidia Jetson AGX Xavier [69] | 32 GB LPDDR4x | 32 GB eMMC 5.1 | 10 W ∼ 30 W | 32 TOPS | $1,099 | TensorFlow, PyTorch, Caffe, Keras, and MXNet |
| Nvidia Jetson Xavier NX [69] | 8 GB LPDDR4x | 16 GB eMMC 5.1 | 10 W ∼ 20 W | 21 TOPS | $399 | TensorFlow, PyTorch, Caffe, Keras, and MXNet |
| Nvidia Jetson AGX Orin [69] | 32 GB LPDDR5 | 64 GB eMMC 5.1 | 15 W ∼ 40 W | 275 TOPS | $1,999 | TensorFlow, PyTorch, Caffe, Keras, and MXNet |
| Nvidia Jetson Orin NX [69] | 16 GB LPDDR5 | 32 GB eMMC 5.1 | 10 W ∼ 25 W | 100 TOPS | $599 | TensorFlow, PyTorch, Caffe, Keras, and MXNet |
| Intel Neural Compute Stick [70] | N/A | N/A | 0.5 W ∼ 1.5 W | 0.1 TFLOPS | $79 | TensorFlow, Caffe, and MXNet |
| Google Edge TPU [68] | N/A | N/A | 2 W | 4 TOPS | $75 | TensorFlow and TensorFlow Lite |
| Google Coral Dev Board [616] | 1 GB LPDDR4 | 8 GB eMMC 5.1 | 1 W ∼ 6 W | 4 TOPS | $149 | TensorFlow and TensorFlow Lite |
| Huawei HiKey 970 [617] | 6 GB LPDDR4 | 64 GB UFS 2.1 | 6 W ∼ 12 W | 1.88 TOPS | $299 | TensorFlow, PyTorch, and Caffe |
| Orange Pi AI Stick Lite [618] | N/A | N/A | 1 W ∼ 2 W | 4 TOPS | $69 | TensorFlow, PyTorch, and Caffe |

Table 4. Illustration of deep learning hardware frameworks discussed in Section 7.2. Note that the price here refers to the initial price at product launch, which is subject to fluctuations over time.
BigDL [615] is an open-source deep learning software framework that runs on top of Apache Spark. It was developed by Intel and released in 2017. The goal of BigDL is to provide a high-performance, scalable, and easy-to-use platform, especially for distributed deep learning. To this end, BigDL includes a comprehensive set of features that cover various deep learning applications, including image classification, object detection, and natural language processing. One of the key features of BigDL is its ability to take full advantage of distributed computing resources, such as CPU, GPU, and FPGA clusters, to accelerate the training of DNNs. Besides, BigDL is seamlessly integrated with Apache Spark, which enables users to leverage the distributed computing capability of Spark for data preprocessing and postprocessing. This integration also makes it possible to build end-to-end deep learning pipelines that span from data ingestion to model deployment.
7.2 Deep Learning Hardware Frameworks
In this section, we introduce popular embedded hardware platforms that are designed to run powerful DNNs in embedded scenarios without cloud-based assistance, including Nvidia Jetson [69], Intel Neural Compute Stick [70], Google Edge TPU [68], Google Coral Dev Board [616], Huawei HiKey 970 [617], and Orange Pi AI Stick Lite [618]. The covered deep learning hardware frameworks are summarized in Table 4.
Nvidia Jetson [69] is a series of embedded system-on-modules (SoMs) designed by Nvidia for running advanced deep learning workloads, especially the inference of DNNs. Specifically, the Nvidia Jetson family consists of Nvidia Jetson TX2, Nvidia Jetson Nano, Nvidia Jetson AGX Xavier, Nvidia Jetson Xavier NX, Nvidia Jetson AGX Orin, and Nvidia Jetson Orin NX. To accelerate deep learning workloads, Nvidia Jetson runs on top of Nvidia's CUDA parallel computing architecture and features an integrated system-on-chip (SoC) with a powerful Nvidia GPU, a multi-core CPU, and various high-speed interfaces, including Ethernet, USB, HDMI, and CSI/DSI. More importantly, Nvidia Jetson is compatible with various deep learning software frameworks, including TensorFlow, PyTorch, Caffe, Keras, and MXNet. Thanks to its advanced architecture design and powerful interfaces, Nvidia Jetson is able to support a wide range of embedded deep learning applications to accommodate different resource and performance requirements.
Intel Neural Compute Stick [70] is a small, low-power, and cost-effective embedded hardware designed to run deep learning workloads without cloud-based assistance, which was developed by Movidius (now acquired by Intel). The Neural Compute Stick (NCS) is a small USB device that can be connected to a host computer or embedded computing system. Specifically, NCS features the Myriad 2 Vision Processing Unit (VPU), which is optimized for the inference of DNNs. Besides, NCS is integrated with various high-speed interfaces, including USB 3.0 and Wi-Fi. Meanwhile, developers can use the Intel Movidius SDK, which provides a set of tools for developing, testing, and deploying DNNs on NCS. Furthermore, NCS supports various deep learning software frameworks, including TensorFlow, Caffe, and MXNet. Thanks to its significant flexibility and cost efficiency, NCS makes it possible to deploy advanced DNNs in a wide range of embedded scenarios.
Google Edge TPU [68] is a custom-built ASIC chip to accelerate deep learning workloads on resource-constrained edge computing systems. Specifically, Google Edge TPU is designed to seamlessly work together with TensorFlow Lite, a lightweight version of TensorFlow, and is optimized for the inference of DNNs towards enhanced inference efficiency. It is worth noting that the Google Edge TPU itself cannot work alone, and similar to the Intel Neural Compute Stick, it must be connected to other embedded computing systems, such as the Raspberry Pi 4 and the Google Coral Dev Board, to deliver deep learning solutions. In particular, the Google Edge TPU is capable of performing up to four trillion operations per second (TOPS) using only two watts of power (i.e., two TOPS per watt). This further allows us to build and deploy powerful deep learning solutions on embedded computing systems with limited computational resources. Thanks to its easy integration with other embedded computing systems, powerful performance, and significant efficiency, the Google Edge TPU has gained increasing popularity in the deep learning community for deploying deep learning solutions on embedded computing systems.
Google Coral Dev Board [616] is a single-board computer designed for building embedded deep learning applications. Specifically, it features an on-board Google Edge TPU, which is a custom-built chip to run TensorFlow Lite models with high performance and low power consumption. The Coral Dev Board has various built-in interfaces, including Audio, Wi-Fi, Bluetooth, Ethernet, and USB 3.0, which enable it to be connected to other embedded computing systems. Besides, the Coral Dev Board is integrated with 1 GB LPDDR4 RAM, 8 GB eMMC 5.1 flash memory, and a MicroSD slot for additional storage. The Coral Dev Board also comes with pre-installed software tools, including TensorFlow Lite, the Google Edge TPU API, and various sample applications. These allow users to easily and quickly start building their deep learning solutions. Thanks to its powerful Google Edge TPU, various useful interfaces, and software tools, the Coral Dev Board has become increasingly popular for developing and deploying embedded deep learning solutions.
Huawei HiKey 970 [617] is a high-performance single-board embedded computer designed by Huawei. Specifically, the HiKey 970 features a powerful neural processing unit (NPU) to accelerate various deep learning workloads. The HiKey 970 is also integrated with 6 GB LPDDR4 RAM and 64 GB UFS 2.1 flash memory, while at the same time allowing MicroSD extension for additional storage. Besides, the HiKey 970 supports various high-speed interfaces, including Ethernet, USB 3.0, and PCIe 3.0. Furthermore, the HiKey 970 is compatible with popular deep learning software frameworks, such as TensorFlow, PyTorch, and Caffe, allowing users to easily and quickly build on-board deep learning solutions. Thanks to its powerful NPU, rich memory and storage, and extensive connectivity options, the HiKey 970 is suitable for a wide range of intelligent embedded applications, such as robotics, autonomous vehicles, and smart home devices.
Orange Pi AI Stick Lite [618] is a tiny and cost-effective USB stick designed for small to medium-sized deep learning workloads. It is equipped with a single-core Cortex-A7 processor and a neural processing unit (NPU) that provides hardware acceleration. Note that, similar to the Intel Neural Compute Stick and Google Edge TPU, the Orange Pi AI Stick Lite cannot work alone and must be connected to a host device using the on-device USB 3.0 interface. Furthermore, the Orange Pi AI Stick Lite supports various deep learning software frameworks, including TensorFlow, PyTorch, and Caffe. Thanks to its cost efficiency, the Orange Pi AI Stick Lite is suitable for embedded computing systems that deal with small to medium-sized deep learning workloads.
7.3 Future Envision
In this section, we envision the future trends and possible directions of deep learning software and
hardware infrastructures, which are summarized as follows:
(1) Integration with Emerging Technologies. In the future, we should consider developing deep learning software and hardware that can be seamlessly integrated with emerging technologies. For example, quantum computing [619–621] has the potential to deliver significant speedups and computational capability, which can accelerate the training and inference of DNNs. Therefore, it is of paramount importance to explore the potential of integrating deep learning software and hardware with emerging technologies to unlock new possibilities and new advances in various real-world scenarios.
(2) Democratization of Deep Learning. The democratization of deep learning [622] has emerged as a prominent trend in the deep learning era, with the explicit goal of making deep learning software and hardware more accessible to a wider range of developers and researchers. Therefore, in order to lower the technical barrier to entry for building efficient embedded deep learning solutions, we, in the future, should continue to develop more user-friendly deep learning software and hardware frameworks, democratizing the benefits and advances of deep learning in real-world embedded scenarios.
(3) Development of Specialized Hardware. Conventional embedded computing systems typically focus on optimizing and accelerating the training and inference of traditional convolutional networks, neglecting recent advances in the deep learning era. Among them, the Vision Transformer (ViT) [107] is the most representative one, which has opened up a new direction and has been challenging the dominant role of traditional convolutional networks in various real-world vision applications, such as image classification [107–109], object detection [110–112], semantic segmentation [113–116], and video analysis [117–119]. Therefore, in order to unleash the promise of ViT and its variants, we should also develop specialized embedded computing systems to accelerate the family of ViTs rather than only focusing on accelerating complicated convolutional networks.
(4) Development of More Powerful Hardware. As seen in recent advanced DNNs [623–625], the network complexity has continued to explode and, as a result, continues to enlarge the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems. In parallel, large language models (LLMs), such as ChatGPT [49], have been achieving impressive success in various natural language processing (NLP) tasks, such as language generation, language translation, and question answering, at the cost of pushing the network complexity to another unseen level, which significantly enlarges the computational gap. These trends further demonstrate the necessity of innovating more powerful yet cost-effective embedded computing systems to further bridge the aforementioned computational gap, especially from the hardware perspective.
(5) Development of Infrastructures for On-Device Training. In the past, the convention in the deep learning community has been to (1) first train DNNs on powerful GPUs or the remote cloud and (2) then deploy the pre-trained DNNs on local embedded computing systems for further inference at run time. Compared with this convention, the emerging paradigm of on-device training enables the pre-trained DNNs to adapt to the new data collected from local sensors or by the users [45]. As such, the users can benefit from customized DNNs without having to transfer the collected data to the remote cloud, thereby significantly protecting data privacy and security [45]. Nonetheless, conventional embedded computing systems are typically optimized for inference and do not support efficient on-device training due to the training memory bottleneck during the training process [44, 45]. This motivates us to further develop efficient infrastructures, including specialized deep learning software and hardware, to effectively accommodate future on-device training demands.
8 DEEP LEARNING APPLICATIONS FOR EMBEDDED COMPUTING SYSTEMS
In the previous sections, we have extensively discussed recent advances towards ubiquitous embedded intelligence from various perspectives of efficient deep learning networks, algorithms, software, and hardware. In this section, we further elaborate on recent popular intelligent deep learning applications in real-world embedded scenarios, spanning from vision to NLP tasks. Note that these intelligent embedded applications rely heavily on the efficient deep networks and efficient deep learning algorithms that have been extensively discussed in the previous sections.
8.1 Computer Vision Applications
Computer vision is an emerging field that focuses on interpreting and understanding visual information from real-world environments, such as images and videos, spanning from image classification [1] to downstream vision tasks, such as object detection [4], tracking [5], and segmentation [6]. Below we discuss recent popular intelligent embedded vision applications.
Fig. 30. Milestones of early convolutional networks [626].
Image Classification. Image classification, also referred to as image recognition, is the most fundamental vision task, which focuses on recognizing the input image based on its visual information [1]. This enables various intelligent applications in real-world embedded computing systems, such as mobile phones and IoT sensors, allowing these embedded computing systems to automatically recognize objects, scenes, or patterns within the given image [627]. For example, face recognition [628], person re-identification [503], and hand gesture recognition [629] have been widely integrated into mainstream embedded computing systems, such as mobile phones, ATMs, and intelligent cameras, for the purpose of identity authentication. In practice, image classification typically features deep convolutional networks, such as VGGNet [1], ResNet [2], and DenseNet [3], thanks to their strong capabilities to capture rich visual information, especially for large-scale datasets like ImageNet [80]. For example, as shown in Fig. 30, AlexNet [76], as the first of its kind, demonstrates the possibility of leveraging convolutional layers to learn discriminative features from vision inputs, which exhibits significantly better recognition performance on ImageNet than previous well-established non-convolutional networks, such as multi-layer perceptrons (MLPs) and other learning-based techniques. Furthermore, ResNet [2] investigates the training collapse of deep convolutional networks and introduces a simple yet effective deep residual learning paradigm, which allows us to significantly increase the network depth for stronger learning capabilities and also marks the booming development of the deep learning era. As a result, ResNet, for the first time, achieves better recognition performance on ImageNet than humans thanks to its significant network depth, as shown in Fig. 30.
Downstream Vision Applications. Downstream vision applications typically refer to practical and specific usages, where the results or outputs from other fundamental vision tasks, such as image classification, are applied to deal with real-world challenges. Popular downstream vision applications in practice include but are not limited to object detection [4], object tracking [5], object segmentation [6], image super-resolution [277, 278], image restoration [630], pose estimation [631], image captioning [632, 633], augmented reality (AR) and virtual reality (VR) [634], and video-related analysis [635]. Among them, [630] features memory-oriented structured pruning to optimize the on-device memory consumption during runtime image restoration, which can accommodate the limited memory and storage budgets in real-world embedded scenarios. These downstream vision applications have evolved to be ubiquitous in real-world embedded scenarios and serve as important components towards ubiquitous embedded intelligence. For example, object detection and tracking have been widely used in recent autonomous vehicles [636] to detect other vehicles and in surveillance systems [637] to detect suspicious persons or activities. These downstream vision applications have also been widely applied to other real-world embedded scenarios, such as smart cities and intelligent healthcare [638]. To further facilitate the development of intelligent applications, several powerful tools have been recently proposed. For example, Precog [639] introduces an efficient object detection infrastructure to enable real-time object detection on resource-constrained embedded computing systems, such as Raspberry Pi, which also features YOLOv3 [640] to achieve superior on-device object detection accuracy.
From CNNs to Vision Transformers. More recently, vision transformers (ViTs) [107] and their variants have demonstrated surprisingly strong performance in various vision tasks, including but not limited to image classification [107–109], object detection [110–112], semantic segmentation [113–116], and video analysis [117–119], and continue to push forward the state-of-the-art performance over their convolutional counterparts across various vision tasks. Specifically, [107], as the very first vision transformer, proposes to divide the input image into a series of smaller image patches (e.g., with patch sizes of 8, 16, and 32), each of which is then fed into the transformer-based encoder to learn discriminative features. The learned discriminative features are further aggregated and fed into the classification layer to make predictions, as shown in Fig. 5. However, despite their strong performance across various vision tasks, ViTs and their variants often exhibit inferior on-device efficiency [139] since they are typically more difficult to parallelize on resource-constrained embedded computing systems than their convolutional counterparts and thus inevitably suffer from considerable resource underutilization, as pointed out in [127]. To overcome such limitations, a plethora of resource-efficient vision transformers have recently flourished, and we refer interested readers to Section 2.2 for more details about recent representative resource-efficient vision transformers. We emphasize that significant efforts are still required in order to further alleviate the on-device efficiency bottleneck and also unleash the promise of modern vision transformers, which is of paramount importance to bring powerful vision transformers to less capable embedded computing systems towards ubiquitous embedded intelligence.
8.2 Natural Language Processing Applications
In parallel to vision tasks, natural language processing (NLP) is another representative application domain that has been widely deployed in real-world embedded scenarios to explore auditory and textual inputs, which has largely revolutionized how embedded computing systems interact with users and their surroundings [641]. In practice, embedded computing systems, ranging from traditional IoT systems to wearable systems and autonomous systems, are transitioning from simple responsive systems to more proactive and interactive systems, which can comprehend context and also anticipate users' needs based on their linguistic inputs. To this end, below we introduce several representative NLP applications in real-world embedded scenarios.
(1) Sentiment Analysis. Recent intelligent embedded computing systems, such as wearable devices and intelligent healthcare infrastructures, have largely featured sentiment analysis, which can effectively capture users' physiological status through language interactions [642]. This also allows a more comprehensive understanding of users' emotional well-being, which paves the way for future holistic health ecosystem solutions [643].
(2) Automatic Speech Recognition. Automatic speech recognition has gained increasing interest in real-world embedded scenarios, such as autonomous vehicles and smart homes, as it allows users to control complicated functions using vocal commands. This can reduce manual interactions and also enhance safety and user convenience [644, 645].
(3) Conversational Agents. Conversational agents have been playing an important role in recent intelligent embedded computing systems, such as home automation systems [646] and interactive assistant systems [647]. These intelligent conversational agents maintain strong abilities to comprehend and interpret users' commands, preferences, and behavioral patterns towards better intelligent services in subsequent interactions.
(4) Speech-to-Text/Text-to-Speech Synthesis. The integration of text-to-speech (TTS) [253] and speech-to-text (STT) [648] marks an important milestone in enriching human-computer interactions, especially for wearable systems, such as mobile phones and intelligent translation devices. Specifically, TTS can synthesize digital text into speech to provide auditory feedback to users and vice versa for STT, both of which are particularly important in hands-free environments.
(5) Real-Time Translation. The emergence of real-time translation has served as an effective technique to eliminate cross-language barriers. More recently, real-time translation has been widely integrated into real-world embedded scenarios, especially wearable communication devices, which can largely facilitate cross-language interactions [649, 650].
To summarize, the integration of NLP and embedded computing systems is more than simple
technical enhancements. Instead, it is an important paradigm shift towards ubiquitous embedded
intelligence. It can enable real-world embedded computing systems to understand and interpret
not only short commands but also longer contexts and conversations, which can further ensure
seamless and enriched interfaces between humans and embedded computing systems.
8.3 Future Envision
In this section, we envision some future trends and possible directions of intelligent embedded
applications, which are summarized as follows:
(1) LLMs-Enabled Embedded Applications. Large language models (LLMs), starting from GPT-3 [48], have attracted considerable interest from both academia and industry, thanks to their surprisingly strong performance across various language tasks. Among them, ChatGPT [92], as one of the most representative LLMs-enabled applications, has achieved promising performance that rivals humans across diverse domains of knowledge. Nonetheless, modern LLMs, despite their promise, require a huge amount of computational resources for both training and inference, making it challenging to deploy powerful LLMs on resource-constrained embedded computing systems. Therefore, modern LLMs can typically only be deployed on remote GPU servers and provide remote services to local users through network connectivity. This, however, is often less convenient and also involves data security/privacy concerns. To overcome such limitations, a plethora of works have been recently proposed to compress computation-intensive LLMs towards better on-device inference efficiency. For example, SmoothQuant [59] and AWQ [60] pioneer the quantization of the weights of powerful LLMs from higher bits to lower bits in order to reduce their prohibitive computational complexity, making it possible to run powerful LLMs on resource-constrained embedded computing systems. These are also important milestones to bring LLMs to real-world embedded computing systems towards ubiquitous embedded intelligence.
(2) Multi-Modal Embedded Applications. Modern embedded applications largely focus on a single modality, either from the perspective of vision or language processing. Nonetheless, recent embedded computing systems typically feature various advanced sensors, which can simultaneously collect rich data from multiple modalities, including but not limited to visual, auditory, and tactile information. In practice, the most important benefit of these multi-modal embedded applications is their strong ability to provide a comprehensive understanding of real-world dynamic environments using the information collected from different modalities. This also has the potential to significantly boost the attainable accuracy on the target task and greatly improve the reliability in real-world dynamic environments. For example, visual information can be easily augmented with other modalities, such as radar and lidar, which can be jointly leveraged to deliver better and safer driving experiences in autonomous vehicles [651]. However, despite these promising benefits, the development of multi-modal embedded applications is also challenging. On the one hand, the real-time synchronization of diverse data modalities may require significant computational resources. On the other hand, the development of multi-modal embedded applications introduces additional complexity for data alignment, calibration, and fusion, which may also require more advanced software algorithms to ensure real-time processing.
9 CONCLUSION
In this survey, we focus on summarizing recent efficient deep learning infrastructures for embedded computing systems towards ubiquitous embedded intelligence, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. To this end, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems. Furthermore, we also envision promising future directions and trends to enable more efficient and ubiquitous embedded intelligence. We believe this survey can shed light on future research and allow researchers to quickly and smoothly get started in this emerging field.
REFERENCES
[1]
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.
arXiv
preprint arXiv:1409.1556, 2014.
[2]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[3]
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional
networks. In
Proceedings of the IEEE conference on computer vision and pattern recognition
, pages 4700–4708,
2017.
[4]
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg.
Ssd: Single shot multibox detector. In
Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The
Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 21–37. Springer, 2016.
[5]
Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Luka Cehovin, Gustavo Fernandez, Tomas Vojir, Gustav
Hager, Georg Nebehay, and Roman Pflugfelder. The visual object tracking vot2015 challenge results. In
Proceedings
of the IEEE international conference on computer vision workshops, pages 1–23, 2015.
[6]
Yin Li, Xiaodi Hou, Christof Koch, James M Rehg, and Alan L Yuille. The secrets of salient object segmentation. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 280–287, 2014.
[7] Dong Yu and Lin Deng. Automatic speech recognition, volume 1. Springer, 2016.
[8]
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun,
Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between
human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[9]
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac:
Question answering in context. arXiv preprint arXiv:1808.07036, 2018.
[10]
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for
deep neural networks. In
Proceedings of the IEEE conference on computer vision and pattern recognition
, pages
1492–1500, 2017.
[11]
Georey Hinton, Oriol Vinyals, and Je Dean. Distilling the knowledge in a neural network.
arXiv preprint
arXiv:1503.02531, 2015.
[12]
Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization.
arXiv preprint arXiv:1710.09412, 2017.
[13]
Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. Ai benchmark:
Running deep neural networks on android smartphones. In
Proceedings of the European Conference on Computer
Vision (ECCV) Workshops, pages 0–0, 2018.
[14]
Ke Tan, Xueliang Zhang, and DeLiang Wang. Deep learning based real-time speech enhancement for dual-microphone
mobile phones. IEEE/ACM transactions on audio, speech, and language processing, 29:1853–1863, 2021.
[15]
Branislav Kisačanin. Deep learning for autonomous vehicles. In
2017 IEEE 47th International Symposium on
Multiple-Valued Logic (ISMVL), pages 142–142. IEEE, 2017.
[16]
Jamil Fayyad, Mohammad A Jaradat, Dominique Gruyer, and Homayoun Najjaran. Deep learning sensor fusion for
autonomous vehicle perception and localization: A review. Sensors, 20(15):4220, 2020.
[17]
Beau Norgeot, Benjamin S Glicksberg, and Atul J Butte. A call for deep-learning healthcare.
Nature medicine
,
25(1):14–15, 2019.
[18]
Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou,
Claire Cui, Greg Corrado, Sebastian Thrun, and Je Dean. A guide to deep learning in healthcare.
Nature medicine
,
25(1):24–29, 2019.
[19]
Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad
Isaac, Yangqing Jia, Bill Jia, et al. Machine learning at facebook: Understanding inference at the edge. In 2019 IEEE
international symposium on high performance computer architecture (HPCA), pages 331–344. IEEE, 2019.
[20]
Di Liu, Hao Kong, Xiangzhong Luo, Weichen Liu, and Ravi Subramaniam. Bringing ai to edge: From deep learning’s
perspective. Neurocomputing, 485:297–320, 2022.
[21]
Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In
International Conference on Learning Representations, 2017.
[22]
Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional
neural networks. In International Joint Conference on Articial Intelligence, 2018.
[23]
Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolu-
tional neural networks acceleration. In
Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pages 4340–4349, 2019.
[24]
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with
binary weights during propagations. Advances in neural information processing systems, 28, 2015.
[25]
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks.
Advances in neural information processing systems, 29, 2016.
[26]
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classication using
binary convolutional neural networks. In
Computer Vision–ECCV 2016: 14th European Conference, Amsterdam,
The Netherlands, October 11–14, 2016, Proceedings, Part IV, pages 525–542. Springer, 2016.
[27]
Jimmy Ba and Rich Caruana. Do deep nets really need to be deep?
Advances in neural information processing
systems, 27, 2014.
[28]
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets:
Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[29]
Song Han, Je Pool, John Tran, and William Dally. Learning both weights and connections for ecient neural
network. Advances in neural information processing systems, 28, 2015.
[30]
Artur Jordao, Maiko Lie, and William Robson Schwartz. Discriminative layer pruning for convolutional neural
networks. IEEE Journal of Selected Topics in Signal Processing, 14(4):828–837, 2020.
[31]
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In
Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 7132–7141, 2018.
[32]
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto,
and Hartwig Adam. Mobilenets: Ecient convolutional neural networks for mobile vision applications.
arXiv
preprint arXiv:1704.04861, 2017.
[33]
Pavlo Molchanov, Jimmy Hall, Hongxu Yin, Jan Kautz, Nicolo Fusi, and Arash Vahdat. Lana: latency aware network
acceleration. In
Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022,
Proceedings, Part XII, pages 137–156. Springer, 2022.
[34]
Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn
architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018.
[35]
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural
network for mobile devices. In
Proceedings of the IEEE conference on computer vision and pattern recognition
,
pages 6848–6856, 2018.
[36]
Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. Ghostnet: More features from cheap
operations. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
, pages 1580–1589,
2020.
[37]
Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Chao Xu, and Yunhe Wang. Ghostnetv2: Enhance cheap operation
with long-range attention. arXiv preprint arXiv:2211.12905, 2022.
[38]
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet:
Platform-aware neural architecture search for mobile. In
Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 2820–2828, 2019.
[39]
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing
Jia, and Kurt Keutzer. Fbnet: Hardware-aware ecient convnet design via dierentiable neural architecture search. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.
[40]
Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. In
International Conference on Learning Representations, 2019.
[41]
Colin White, Mahmoud Safari, Rhea Sukthanker, Binxin Ru, Thomas Elsken, Arber Zela, Debadeepta Dey, and Frank
Hutter. Neural architecture search: Insights from 1000 papers. arXiv preprint arXiv:2301.08727, 2023.
[42]
Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once for all: Train one network and specialize
it for ecient deployment. In International Conference on Learning Representations, 2020.
[43]
Hadjer Benmeziane, Kaoutar El Maghraoui, Hamza Ouarnoughi, Smail Niar, Martin Wistuba, and Naigang Wang. A
comprehensive survey on hardware-aware neural architecture search. arXiv preprint arXiv:2101.09336, 2021.
[44]
Han Cai, Chuang Gan, Ligeng Zhu, and Song Han. Tinytl: Reduce memory, not parameters for ecient on-device
learning. Advances in Neural Information Processing Systems, 33:11285–11297, 2020.
[45]
Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song Han. On-device training under 256kb
memory. Advances in Neural Information Processing Systems, 2022.
[46]
Gido M Van de Ven and Andreas S Tolias. Three scenarios for continual learning.
arXiv preprint arXiv:1904.07734
,
2019.
[47]
Xinchi Qiu, Javier Fernandez-Marques, Pedro PB Gusmao, Yan Gao, Titouan Parcollet, and Nicholas Donald Lane.
Zero: Ecient on-device training for federated learning with local sparsity.
International Conference on Learning
Representations, 2022.
[48]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.
Advances in neural
information processing systems, 33:1877–1901, 2020.
[49] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[50]
Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu,
Yifei Zhang, et al. Beyond eciency: A systematic survey of resource-ecient large language models.
arXiv preprint
arXiv:2401.00625, 2024.
[51]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.
Journal of Machine Learning Research, 24(240):1–113, 2023.
[52]
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexan-
dra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language
model. 2023.
[53]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang,
and Ion Stoica. Ecient memory management for large language model serving with pagedattention. In
Proceedings
of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
[54]
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-ecient exact
attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
[55]
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.
arXiv preprint
arXiv:2307.08691, 2023.
[56]
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Ecient streaming language models with
attention sinks. arXiv preprint arXiv:2309.17453, 2023.
[57]
Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models.
Advances in neural information processing systems, 36:21702–21720, 2023.
[58]
Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and eective pruning approach for large language
models. arXiv preprint arXiv:2306.11695, 2023.
[59]
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and
ecient post-training quantization for large language models. In
International Conference on Machine Learning
,
pages 38087–38099. PMLR, 2023.
[60]
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight
quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
[61]
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi.
Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
[62]
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In
The
Twelfth International Conference on Learning Representations, 2023.
[63]
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion
Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In
International Conference on Machine Learning, pages 31094–31116. PMLR, 2023.
[64]
Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko,
Pavel Samygin, and Colin Rael. Petals: Collaborative inference and ne-tuning of large models.
arXiv preprint
arXiv:2209.01188, 2022.
[65]
Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. Tabi: An ecient multi-level inference system for large
language models. In
Proceedings of the Eighteenth European Conference on Computer Systems
, pages 233–248,
2023.
[66]
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al. Tensorow: Large-scale machine learning on
heterogeneous systems, 2015. Software available from tensorow.org.
[67]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming
Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.
Advances in neural information processing systems, 32, 2019.
[68] Google. Google edge tpu. https://cloud.google.com/edge-tpu/.
[69] NVIDIA. Nvidia jetson. https://www.nvidia.com/en-sg/autonomous-machines/embedded-systems/.
[70] Intel. Intel movidius neural compute stick. https://movidius.github.io/ncsdk/ncs.html.
[71]
Gaurav Menghani. Ecient deep learning: A survey on making deep learning models smaller, faster, and better.
ACM Computing Surveys, 55(12):1–37, 2023.
[72]
Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural
networks. arXiv preprint arXiv:1710.09282, 2017.
[73]
Tejalal Choudhary, Vipul Mishra, Anurag Goswami, and Jagannathan Sarangapani. A comprehensive survey on
model compression and acceleration. Articial Intelligence Review, 53:5113–5155, 2020.
[74]
Zhuo Li, Hengyi Li, and Lin Meng. Model compression for deep neural networks: A survey.
Computers
, 12(3):60,
2023.
[75]
Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chunjing Xu, Enhua Wu, and Qi Tian. Ghostnets on heterogeneous
devices via cheap operations. International Journal of Computer Vision, 130(4):1050–1069, 2022.
[76]
Alex Krizhevsky, Ilya Sutskever, and Georey E Hinton. Imagenet classication with deep convolutional neural
networks. Advances in neural information processing systems, 25:1097–1105, 2012.
[77]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In
Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1–9, 2015.
[78]
Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In
International
conference on machine learning, pages 6105–6114. PMLR, 2019.
[79]
Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In
International conference on
machine learning, pages 10096–10106. PMLR, 2021.
[80]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image
database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
[81]
Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet:
Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[82]
Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang, Ligeng Zhu, and Song Han. Enable deep learning
on mobile devices: Methods, systems, and applications.
ACM Transactions on Design Automation of Electronic
Systems (TODAES), 27(3):1–50, 2022.
[83]
Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions.
arXiv preprint
arXiv:1511.07122, 2015.
[84]
Jierun Chen, Shiu-hong Kao, Hao He, Weipeng Zhuo, Song Wen, Chul-Ho Lee, and S-H Gary Chan. Run, don’t walk:
Chasing higher flops for faster neural networks. In
Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 12021–12031, 2023.
[85]
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted
residuals and linear bottlenecks. In
Proceedings of the IEEE conference on computer vision and pattern recognition
,
pages 4510–4520, 2018.
[86]
Daquan Zhou, Qibin Hou, Yunpeng Chen, Jiashi Feng, and Shuicheng Yan. Rethinking bottleneck structure for
ecient mobile network design. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
23–28, 2020, Proceedings, Part III 16, pages 680–697. Springer, 2020.
[87]
Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Condensenet: An ecient densenet using
learned group convolutions. In
Proceedings of the IEEE conference on computer vision and pattern recognition
,
pages 2752–2761, 2018.
[88]
Le Yang, Haojun Jiang, Ruojin Cai, Yulin Wang, Shiji Song, Gao Huang, and Qi Tian. Condensenet v2: Sparse
feature reactivation for deep networks. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 3569–3578, 2021.
[89]
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning,
trained quantization and huffman coding. In International Conference on Learning Representations, 2016.
[90]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[91]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. NAACL-HLT, 2019.
[92] OpenAI. Chatgpt: A variant of gpt by openai. https://openai.com/, 2020.
[93]
Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. Hat: Hardware-aware
transformers for ecient natural language processing. In
Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, pages 7675–7688, 2020.
[94]
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling
bert for natural language understanding. In
Findings of the Association for Computational Linguistics: EMNLP 2020
,
pages 4163–4174, 2020.
[95]
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. Mobilebert: a compact
task-agnostic bert for resource-limited devices. In
Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 2158–2170, 2020.
[96]
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller,
faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[97]
Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity.
arXiv preprint arXiv:2006.04768, 2020.
[98]
Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The ecient transformer.
arXiv preprint
arXiv:2001.04451, 2020.
[99]
Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, and Matt J Kusner. No train no gain: Revisiting ecient
training algorithms for transformer-based language models.
Advances in Neural Information Processing Systems
,
36, 2024.
[100]
Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris,
David Daniel Cox, Zhangyang Wang, and Yoon Kim. Learning to grow pretrained models for ecient transformer
training. arXiv preprint arXiv:2303.00980, 2023.
[101]
Malte Ostendor and Georg Rehm. Ecient language model training through cross-lingual and progressive transfer
learning. arXiv preprint arXiv:2301.09626, 2023.
[102]
Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao,
Shivani Agrawal, and Je Dean. Eciently scaling transformer inference.
Proceedings of Machine Learning and
Systems, 5, 2023.
[103]
Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew M Dai,
Yifeng Lu, et al. Brainformers: Trading simplicity for eciency. In
International Conference on Machine Learning
,
pages 42531–42542. PMLR, 2023.
[104]
Zhen-Ru Zhang, Chuanqi Tan, Haiyang Xu, Chengyu Wang, Jun Huang, and Songfang Huang. Towards adaptive
prex tuning for parameter-ecient language model ne-tuning. arXiv preprint arXiv:2305.15212, 2023.
[105]
Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li,
and Yu Qiao. Llama-adapter: Ecient ne-tuning of language models with zero-init attention.
arXiv preprint
arXiv:2303.16199, 2023.
[106]
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-
to-end object detection with transformers. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow,
UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer, 2020.
[107]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Trans-
formers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[108]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer:
Hierarchical vision transformer using shifted windows. In
Proceedings of the IEEE/CVF international conference
on computer vision, pages 10012–10022, 2021.
[109]
Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al.
Swin transformer v2: Scaling up capacity and resolution. In
Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 12009–12019, 2022.
[110]
Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for ob-
ject detection. In
Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022,
Proceedings, Part IX, pages 280–296. Springer, 2022.
[111]
Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, and Wenyu Liu. You only
look at one sequence: Rethinking transformer in vision through object detection.
Advances in Neural Information
Processing Systems, 34:26183–26197, 2021.
[112]
Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy,
Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object
detection with vision transformers. arXiv preprint arXiv:2205.06230, 2022.
[113]
Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation.
In Proceedings of the IEEE/CVF international conference on computer vision, pages 7262–7272, 2021.
[114]
Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou.
Transunet: Transformers make strong encoders for medical image segmentation.
arXiv preprint arXiv:2102.04306
,
2021.
[115]
Jiaqi Gu, Hyoukjun Kwon, Dilin Wang, Wei Ye, Meng Li, Yu-Hsin Chen, Liangzhen Lai, Vikas Chandra, and David Z
Pan. Multi-scale high-resolution vision transformer for semantic segmentation. In
Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 12094–12103, 2022.
[116]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer
Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
[117]
Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022.
[118]
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision
transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846, 2021.
[119]
Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. In
Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 3163–3172, 2021.
[120]
Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze.
Levit: a vision transformer in convnet’s clothing for faster inference. In
Proceedings of the IEEE/CVF international
conference on computer vision, pages 12259–12269, 2021.
[121]
Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. Mobile-former:
Bridging mobilenet and transformer. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 5270–5279, 2022.
[122]
Sachin Mehta and Mohammad Rastegari. Mobilevit: light-weight, general-purpose, and mobile-friendly vision
transformer. In International Conference on Learning Representations, 2022.
[123]
Sachin Mehta and Mohammad Rastegari. Separable self-attention for mobile vision transformers.
arXiv preprint
arXiv:2206.02680, 2022.
[124]
Shakti N Wadekar and Abhishek Chaurasia. Mobilevitv3: Mobile-friendly vision transformer with simple and eective
fusion of local, global and input features. arXiv preprint arXiv:2209.15159, 2022.
[125]
Han Cai, Chuang Gan, and Song Han. Efficientvit: Enhanced linear attention for high-resolution low-computation
visual recognition. arXiv preprint arXiv:2205.14756, 2022.
[126]
Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Lukasz Dudziak, Hongsheng Li, Georgios Tzimiropoulos, and
Brais Martinez. Edgevits: Competing light-weight cnns on mobile devices with vision transformers. In
Computer
Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI
, pages
294–311. Springer, 2022.
[127]
Muhammad Maaz, Abdelrahman Shaker, Hisham Cholakkal, Salman Khan, Syed Waqas Zamir, Rao Muhammad
Anwer, and Fahad Shahbaz Khan. Edgenext: eciently amalgamated cnn-transformer architecture for mobile vision
applications. In
Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part
VII, pages 3–20. Springer, 2023.
[128]
Haoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang, Haoqi Fan, Peter Vajda, and Yingyan Lin.
Castling-vit: Compressing self-attention via switching towards linear-angular attention during vision transformer
inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[129]
Pavan Kumar Anasosalu Vasu, James Gabriel, Je Zhu, Oncel Tuzel, and Anurag Ranjan. Fastvit: A fast hybrid vision
transformer using structural reparameterization. arXiv preprint arXiv:2303.14189, 2023.
[130]
Xiangzhong Luo, Di Liu, Hao Kong, and Weichen Liu. Edgenas: Discovering ecient neural architectures for edge
systems. In 2020 IEEE 38th International Conference on Computer Design (ICCD), pages 288–295. IEEE, 2020.
[131]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, and Weichen Liu. You only search once: On lightweight
dierentiable architecture search for resource-constrained embedded platforms. In
Proceedings of the 59th ACM/IEEE
Design Automation Conference, pages 475–480, 2022.
[132]
Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. Rethinking spatial
dimensions of vision transformers. In
Proceedings of the IEEE/CVF International Conference on Computer Vision
,
pages 11936–11945, 2021.
[133]
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training
data-ecient image transformers & distillation through attention. In
International conference on machine learning
,
pages 10347–10357. PMLR, 2021.
[134]
Dichao Hu. An introductory survey on attention mechanisms in nlp problems. In
Intelligent Systems and Applications:
Proceedings of the 2019 Intelligent Systems Conference (IntelliSys) Volume 2, pages 432–448. Springer, 2020.
[135]
Meng-Hao Guo, Tian-Xing Xu, Jiang-Jiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai Zhang,
Ralph R Martin, Ming-Ming Cheng, and Shi-Min Hu. Attention mechanisms in computer vision: A survey.
Computational Visual Media, 8(3):331–368, 2022.
[136]
Plamen Angelov and Eduardo Soares. Towards explainable deep neural networks (xdnn).
Neural Networks
, 130:185–
194, 2020.
[137]
Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning.
arXiv preprint arXiv:1611.01578
,
2016.
[138]
Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Dierentiable architecture search. In
International Conference
on Learning Representations, 2019.
[139]
Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and Enhua Wu. Vision gnn: An image is worth graph of nodes.
arXiv preprint arXiv:2206.00272, 2022.
[140]
Anubhav Jangra, Sourajit Mukherjee, Adam Jatowt, Sriparna Saha, and Mohammad Hasanuzzaman. A survey on
multi-modal summarization. ACM Computing Surveys, 2021.
[141]
Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to
train your vit? data, augmentation, and regularization in vision transformers.
arXiv preprint arXiv:2106.10270
, 2021.
[142]
Yonggan Fu, Shunyao Zhang, Shang Wu, Cheng Wan, and Yingyan Lin. Patch-fool: Are vision transformers always
robust against adversarial perturbations? In International Conference on Learning Representations, 2022.
[143]
Shaokai Ye, Kaidi Xu, Sijia Liu, Hao Cheng, Jan-Henrik Lambrechts, Huan Zhang, Aojun Zhou, Kaisheng Ma, Yanzhi
Wang, and Xue Lin. Adversarial robustness vs. model compression, or both? In
Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 111–120, 2019.
[144]
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image
recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition
, pages 8697–8710,
2018.
[145]
Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank Hutter. Understanding and
robustifying dierentiable architecture search. In International Conference on Learning Representations, 2020.
[146]
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong. Pc-darts: Partial channel
connections for memory-ecient architecture search. In
International Conference on Learning Representations
,
2020.
[147]
Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive dierentiable architecture search: Bridging the depth gap
between search and evaluation. In
Proceedings of the IEEE/CVF international conference on computer vision
, pages
1294–1303, 2019.
[148]
Hanwen Liang, Shifeng Zhang, Jiacheng Sun, Xingqiu He, Weiran Huang, Kechen Zhuang, and Zhenguo Li. Darts+:
Improved dierentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035, 2019.
[149]
Xiangxiang Chu, Xiaoxing Wang, Bo Zhang, Shun Lu, Xiaolin Wei, and Junchi Yan. Darts-: Robustly stepping out of
performance collapse without indicators. In International Conference on Learning Representations, 2021.
[150]
Xiangxiang Chu, Tianbao Zhou, Bo Zhang, and Jixiang Li. Fair darts: Eliminating unfair advantages in dierentiable
architecture search. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part XV, pages 465–480. Springer, 2020.
[151]
Peng Ye, Baopu Li, Yikang Li, Tao Chen, Jiayuan Fan, and Wanli Ouyang. β-darts: Beta-decay regularization for differentiable architecture search. In
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 10864–10873. IEEE, 2022.
[152]
Xiangzhong Luo, Di Liu, Shuo Huai, and Weichen Liu. Hsconas: Hardware-software co-design of ecient dnns via
neural architecture search. In
2021 Design, Automation & Test in Europe Conference & Exhibition (DATE)
, pages
418–421. IEEE, 2021.
[153]
Li Lyna Zhang, Yuqing Yang, Yuhang Jiang, Wenwu Zhu, and Yunxin Liu. Fast hardware-aware neural architecture
search. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
, pages
692–693, 2020.
[154]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, and Weichen Liu. Surgenas: A comprehensive surgery on
hardware-aware dierentiable neural architecture search. IEEE Transactions on Computers, 2022.
[155]
Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for
multicore architectures. Communications of the ACM, 52(4):65–76, 2009.
[156]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, and Weichen Liu. Lightnas: On lightweight and scalable
neural architecture search for embedded platforms.
IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 2022.
[157]
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classier architecture
search. In Proceedings of the aaai conference on articial intelligence, pages 4780–4789, 2019.
[158]
Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using
reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
[159]
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Reinforcement learning, pages 5–32, 1992.
[160]
Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Je Dean. Ecient neural architecture search via parameters
sharing. In International conference on machine learning, pages 4095–4104. PMLR, 2018.
[161]
Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot
neural architecture search with uniform sampling. In
Computer Vision–ECCV 2020: 16th European Conference,
Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 544–560. Springer, 2020.
[162]
Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu,
Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In
Proceedings of the IEEE/CVF international
conference on computer vision, pages 1314–1324, 2019.
[163]
Gabriel Bender, Hanxiao Liu, Bo Chen, Grace Chu, Shuyang Cheng, Pieter-Jan Kindermans, and Quoc V Le. Can
weight sharing outperform random architecture search? an investigation with tunas. In
Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 14323–14332, 2020.
[164]
Chi-Hung Hsu, Shu-Huan Chang, Jhao-Hong Liang, Hsin-Ping Chou, Chun-Hao Liu, Shih-Chieh Chang, Jia-Yu Pan,
Yu-Ting Chen, Wei Wei, and Da-Cheng Juan. Monas: Multi-objective neural architecture search using reinforcement
learning. arXiv preprint arXiv:1806.10332, 2018.
[165]
Georey F Miller, Peter M Todd, and Shailesh U Hegde. Designing neural networks using genetic algorithms. In
ICGA, volume 89, pages 379–384, 1989.
[166]
Peter J Angeline, Gregory M Saunders, and Jordan B Pollack. An evolutionary algorithm that constructs recurrent
neural networks. IEEE transactions on Neural Networks, 5(1):54–65, 1994.
[167]
Dario Floreano, Peter Dürr, and Claudio Mattiussi. Neuroevolution: from architectures to learning.
Evolutionary
intelligence, 1:47–62, 2008.
[168]
Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies.
Evolutionary
computation, 10(2):99–127, 2002.
[169]
Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying
one-shot architecture search. In International conference on machine learning, pages 550–559. PMLR, 2018.
[170]
Matej Črepinšek, Shih-Hsi Liu, and Marjan Mernik. Exploration and exploitation in evolutionary algorithms: A
survey. ACM computing surveys (CSUR), 45(3):1–33, 2013.
[171]
Juan José Domínguez-Jiménez, Antonia Estero-Botaro, Antonio García-Domínguez, and Inmaculada Medina-Bulo.
Evolutionary mutation testing. Information and Software Technology, 53(10):1108–1123, 2011.
[172]
William M Spears et al. Adapting crossover in evolutionary algorithms. In
Evolutionary programming
, pages
367–384, 1995.
[173]
Andrew Brock, Theodore Lim, James Millar Ritchie, and Nicholas J Weston. Smash: One-shot model architecture
search through hypernetworks. In 6th International Conference on Learning Representations 2018, 2018.
[174]
Xiangxiang Chu, Bo Zhang, and Ruijun Xu. Fairnas: Rethinking evaluation fairness of weight sharing neural
architecture search. In
Proceedings of the IEEE/CVF International Conference on computer vision
, pages 12239–
12248, 2021.
[175]
Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang,
Xiaodan Song, Ruoming Pang, and Quoc Le. Bignas: Scaling up neural architecture search with big single-stage
models. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings,
Part VII 16, pages 702–717. Springer, 2020.
[176]
Bingqian Lu, Jianyi Yang, Weiwen Jiang, Yiyu Shi, and Shaolei Ren. One proxy device is enough for hardware-aware
neural architecture search.
Proceedings of the ACM on Measurement and Analysis of Computing Systems
, 5(3):1–34,
2021.
[177]
Shan You, Tao Huang, Mingmin Yang, Fei Wang, Chen Qian, and Changshui Zhang. Greedynas: Towards fast
one-shot nas with greedy supernet. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 1999–2008, 2020.
[178]
Xiangzhong Luo, Di Liu, Shuo Huai, Hao Kong, Hui Chen, and Weichen Liu. Designing ecient dnns via hardware-
aware neural architecture search and beyond.
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, 41(6):1799–1812, 2021.
[179]
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Ecient multi-objective neural architecture search via
lamarckian evolution. In International Conference on Learning Representations, 2019.
[180]
Guohao Li, Guocheng Qian, Itzel C Delgadillo, Matthias Muller, Ali Thabet, and Bernard Ghanem. Sgas: Sequential
greedy architecture search. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
,
pages 1620–1630, 2020.
[181]
Yibo Yang, Shan You, Hongyang Li, Fei Wang, Chen Qian, and Zhouchen Lin. Towards improving the consistency,
eciency, and exibility of dierentiable neural architecture search. In
Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 6667–6676, 2021.
[182]
Ruochen Wang, Minhao Cheng, Xiangning Chen, Xiaocheng Tang, and Cho-Jui Hsieh. Rethinking architecture
selection in dierentiable nas. In International Conference on Learning Representations, 2021.
[183]
Xiangning Chen, Ruochen Wang, Minhao Cheng, Xiaocheng Tang, and Cho-Jui Hsieh. Drnas: Dirichlet neural
architecture search. In International Conference on Learning Representations, 2021.
[184]
Kaifeng Bi, Lingxi Xie, Xin Chen, Longhui Wei, and Qi Tian. Gold-nas: Gradual, one-level, dierentiable.
arXiv
preprint arXiv:2007.03331, 2020.
[185]
Pengfei Hou, Ying Jin, and Yukang Chen. Single-darts: Towards stable architecture search. In
Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 373–382, 2021.
[186]
Xuanyi Dong and Yi Yang. Searching for a robust neural architecture in four gpu hours. In
Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1761–1770, 2019.
[187]
Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural architecture search. In
International
Conference on Learning Representations, 2019.
[188]
Jieru Mei, Yingwei Li, Xiaochen Lian, Xiaojie Jin, Linjie Yang, Alan Yuille, and Jianchao Yang. Atomnas: Fine-grained
end-to-end neural architecture search. In International Conference on Learning Representations, 2020.
[189]
Xuanyi Dong, David Jacob Kedziora, Katarzyna Musial, and Bogdan Gabrys. Automated deep learning: Neural
architecture search is not the end. arXiv preprint arXiv:2112.09245, 2021.
[190]
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In
International
Conference on Learning Representations, 2017.
[191]
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Bowen Shi, Qi Tian, and Hongkai Xiong. Latency-aware dieren-
tiable neural architecture search. arXiv preprint arXiv:2001.06392, 2020.
[192]
Guohao Li, Mengmeng Xu, Silvio Giancola, Ali Thabet, and Bernard Ghanem. Lc-nas: Latency constrained neural
architecture search for point cloud networks. arXiv preprint arXiv:2008.10309, 2020.
[193]
Mohammad Loni, Hamid Mousavi, Mohammad Riazati, Masoud Daneshtalab, and Mikael Sjödin. Tas: ternarized
neural architecture search for resource-constrained edge devices. In
2022 Design, Automation & Test in Europe
Conference & Exhibition (DATE), pages 1115–1118. IEEE, 2022.
[194]
Sunghoon Kim, Hyunjeong Kwon, Eunji Kwon, Youngchang Choi, Tae-Hyun Oh, and Seokhyeong Kang. Mdarts:
Multi-objective dierentiable neural architecture search. In
2021 Design, Automation & Test in Europe Conference
& Exhibition (DATE), pages 1344–1349. IEEE, 2021.
[195]
Yibo Hu, Xiang Wu, and Ran He. Tf-nas: Rethinking three search freedoms of latency-constrained dierentiable
neural architecture search. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28,
2020, Proceedings, Part XV 16, pages 123–139. Springer, 2020.
[196]
Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu, Matthew Yu, Tao Xu,
Kan Chen, et al. Fbnetv2: Dierentiable neural architecture search for spatial and channel dimensions. In
Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12965–12974, 2020.
[197]
Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Mar-
culescu. Single-path nas: Designing hardware-ecient convnets in less than 4 hours. In
Machine Learning and
Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September
16–20, 2019, Proceedings, Part II, pages 481–497. Springer, 2020.
[198]
Jaeseong Lee, Jungsub Rhim, Duseok Kang, and Soonhoi Ha. Snas: Fast hardware-aware neural architecture search
methodology. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41(11):4826–4836,
2021.
[199]
Jiemin Fang, Yuzhu Sun, Qian Zhang, Yuan Li, Wenyu Liu, and Xinggang Wang. Densely connected search space
for more exible neural architecture search. In
Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 10628–10637, 2020.
[200]
Niv Nayman, Yonathan Aalo, Asaf Noy, and Lihi Zelnik. Hardcore-nas: Hard constrained dierentiable neural
architecture search. In International Conference on Machine Learning, pages 7979–7990. PMLR, 2021.
[201]
Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate frank-wolfe optimization
for structural svms. In International Conference on Machine Learning, pages 53–61. PMLR, 2013.
[202]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, and Weichen Liu. Double-win nas: Towards deep-to-shallow
transformable neural architecture search for intelligent embedded systems. In
Proceedings of the 61th ACM/IEEE
Design Automation Conference, pages 1–6, 2024.
[203]
Qian Jiang, Xiaofan Zhang, Deming Chen, Minh N Do, and Raymond A Yeh. Eh-dnas: End-to-end hardware-aware
dierentiable neural architecture search. arXiv preprint arXiv:2111.12299, 2021.
[204]
Javier García López, Antonio Agudo, and Francesc Moreno-Noguer. E-dnas: Dierentiable neural architecture search
for embedded systems. In
2020 25th International Conference on Pattern Recognition (ICPR)
, pages 4704–4711. IEEE,
2021.
[205]
Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluating the search phase of
neural architecture search. arXiv preprint arXiv:1902.08142, 2019.
[206]
Yiyang Zhao, Linnan Wang, Yuandong Tian, Rodrigo Fonseca, and Tian Guo. Few-shot neural architecture search. In
International Conference on Machine Learning, pages 12707–12718. PMLR, 2021.
[207]
Shoukang Hu, Ruochen Wang, Lanqing Hong, Zhenguo Li, Cho-Jui Hsieh, and Jiashi Feng. Generalizing few-shot
nas with gradient matching. arXiv preprint arXiv:2203.15207, 2022.
[208]
Dongkuan DK Xu, Subhabrata Mukherjee, Xiaodong Liu, Debadeepta Dey, Wenhui Wang, Xiang Zhang, Ahmed
Awadallah, and Jianfeng Gao. Few-shot task-agnostic neural architecture search for distilling large language models.
Advances in Neural Information Processing Systems, 35:28644–28656, 2022.
[209]
Timotée Ly-Manson, Mathieu Léonardon, and Abdeldjalil Aissa El Bey. Understanding few-shot neural architecture
search with zero-cost proxies. https://gretsi.fr/data/colloque/pdf/2023_lymanson1237.pdf, 2023.
[210]
Xiu Su, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, and Chang Xu. K-shot nas: Learnable
weight-sharing for nas with k-shot supernets. In
International Conference on Machine Learning
, pages 9880–9890.
PMLR, 2021.
[211]
Zixuan Zhou, Xuefei Ning, Yi Cai, Jiashu Han, Yiping Deng, Yuhan Dong, Huazhong Yang, and Yu Wang. Close:
Curriculum learning on the sharing extent towards better one-shot nas. In
European Conference on Computer
Vision, pages 578–594. Springer, 2022.
[212]
Kevin Alexander Laube, Maximus Mutschler, and Andreas Zell. What to expect of hardware metric predictors in nas.
In International Conference on Automated Machine Learning, pages 13–1. PMLR, 2022.
[213]
Lukasz Dudziak, Thomas Chau, Mohamed Abdelfattah, Royson Lee, Hyeji Kim, and Nicholas Lane. Brp-nas:
Prediction-based nas using gcn. Advances in Neural Information Processing Systems, 33:10480–10490, 2020.
[214]
Chaojian Li, Zhongzhi Yu, Yonggan Fu, Yongan Zhang, Yang Zhao, Haoran You, Qixuan Yu, Yue Wang, Cong Hao, and
Yingyan Lin. Hw-nas-bench: Hardware-aware neural architecture search benchmark. In
International Conference
on Learning Representations, 2021.
[215]
Hayeon Lee, Sewoong Lee, Song Chong, and Sung Ju Hwang. Hardware-adaptive ecient latency prediction for nas
via meta-learning. Advances in Neural Information Processing Systems, 34:27016–27028, 2021.
[216]
Saeejith Nair, Saad Abbasi, Alexander Wong, and Mohammad Javad Shaee. Maple-edge: A runtime latency predictor
for edge devices. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pages
3660–3668, 2022.
[217]
Shuo Huai, Hao Kong, Shiqing Li, Xiangzhong Luo, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu.
Evolp: Self-evolving latency predictor for model compression in real-time edge systems.
IEEE Embedded Systems
Letters, 2023.
[218]
Wei Wen, Hanxiao Liu, Yiran Chen, Hai Li, Gabriel Bender, and Pieter-Jan Kindermans. Neural predictor for neural
architecture search. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part XXIX, pages 660–676. Springer, 2020.
[219]
Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, and Tie-Yan Liu. Accuracy prediction with non-neural
model for neural architecture search. arXiv preprint arXiv:2007.04785, 2020.
[220]
Colin White, Arber Zela, Robin Ru, Yang Liu, and Frank Hutter. How powerful are performance predictors in neural
architecture search? Advances in Neural Information Processing Systems, 34:28454–28469, 2021.
[221]
Bert Moons, Parham Noorzad, Andrii Skliar, Giovanni Mariani, Dushyant Mehta, Chris Lott, and Tijmen Blankevoort.
Distilling optimal neural networks: Rapid search in diverse spaces. In
Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 12229–12238, 2021.
[222]
Xuefei Ning, Yin Zheng, Tianchen Zhao, Yu Wang, and Huazhong Yang. A generic graph-based neural architecture
encoding scheme for predictor-based nas. In
European Conference on Computer Vision
, pages 189–204. Springer,
2020.
[223]
Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. Nas-bench-101: Towards
reproducible neural architecture search. In
International Conference on Machine Learning
, pages 7105–7114. PMLR,
2019.
[224]
Xuanyi Dong and Yi Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search. In
International Conference on Learning Representations, 2020.
[225]
Nikita Klyuchnikov, Ilya Tromov, Ekaterina Artemova, Mikhail Salnikov, Maxim Fedorov, Alexander Filippov, and
Evgeny Burnaev. Nas-bench-nlp: neural architecture search benchmark for natural language processing.
IEEE Access
,
10:45736–45747, 2022.
[226]
Xuefei Ning, Yin Zheng, Zixuan Zhou, Tianchen Zhao, Huazhong Yang, and Yu Wang. A generic graph-based neural
architecture encoding scheme with multifaceted information.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2022.
[227]
Xuefei Ning, Zixuan Zhou, Junbo Zhao, Tianchen Zhao, Yiping Deng, Changcheng Tang, Shuang Liang, Huazhong
Yang, and Yu Wang. Ta-gates: An encoding scheme for neural network architectures.
Advances in Neural Information
Processing Systems, 35:32325–32339, 2022.
[228]
Huan Xiong, Lei Huang, Mengyang Yu, Li Liu, Fan Zhu, and Ling Shao. On the number of linear regions of
convolutional neural networks. In
International Conference on Machine Learning
, pages 10514–10523. PMLR, 2020.
[229]
Lechao Xiao, Jerey Pennington, and Samuel Schoenholz. Disentangling trainability and generalization in deep
neural networks. In International Conference on Machine Learning, pages 10462–10472. PMLR, 2020.
[230]
Wuyang Chen, Xinyu Gong, and Zhangyang Wang. Neural architecture search on imagenet in four gpu hours: A
theoretically inspired perspective. In International Conference on Learning Representations, 2021.
[231]
Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization
of deep neural networks by extrapolation of learning curves. In
Twenty-fourth international joint conference on
articial intelligence, 2015.
[232]
Robin Ru, Clare Lyle, Lisa Schut, Miroslav Fil, Mark van der Wilk, and Yarin Gal. Speedy performance estimation for
neural architecture search. Advances in Neural Information Processing Systems, 34:4079–4092, 2021.
[233]
Shen Yan, Colin White, Yash Savani, and Frank Hutter. Nas-bench-x11 and the power of learning curves.
Advances
in Neural Information Processing Systems, 34:22534–22549, 2021.
[234]
Dan Zhao, Nathan C Frey,Vijay Gadepally, and Siddharth Samsi. Loss curve approximations for fast neural architecture
ranking & training elasticity estimation. In
2022 IEEE International Parallel and Distributed Processing Symposium
Workshops (IPDPSW), pages 715–723. IEEE, 2022.
[235]
Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve prediction with bayesian
neural networks. In International Conference on Learning Representations, 2017.
[236]
Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Accelerating neural architecture search using
performance prediction. arXiv preprint arXiv:1705.10823, 2017.
[237]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, and Weichen Liu. Work-in-progress: What to expect
of early training statistics? an investigation on hardware-aware neural architecture search. In
2022 International
Conference on Hardware/Software Codesign and System Synthesis (CODES+ ISSS), pages 1–2. IEEE, 2022.
[238]
Mohamed S Abdelfattah, Abhinav Mehrotra, Łukasz Dudziak, and Nicholas D Lane. Zero-cost proxies for lightweight
nas. arXiv preprint arXiv:2101.08134, 2021.
[239]
Arjun Krishnakumar, Colin White, Arber Zela, Renbo Tu, Mahmoud Safari, and Frank Hutter. Nas-bench-suite-zero:
Accelerating research on zero cost proxies. arXiv preprint arXiv:2210.03230, 2022.
[240]
Vasco Lopes, Saeid Alirezazadeh, and Luís A Alexandre. Epe-nas: Ecient performance estimation without training
for neural architecture search. In
Articial Neural Networks and Machine Learning–ICANN 2021: 30th International
Conference on Articial Neural Networks, Bratislava, Slovakia, September 14–17, 2021, Proceedings, Part V
, pages
552–563. Springer, 2021.
[241]
Jack Turner, Elliot J Crowley, Michael O’Boyle, Amos Storkey, and Gavin Gray. Blockswap: Fisher-guided block
substitution for network compression on a budget. arXiv preprint arXiv:1906.04113, 2019.
[242]
Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient
ow. arXiv preprint arXiv:2002.07376, 2020.
[243]
Joe Mellor, Jack Turner, Amos Storkey, and Elliot J Crowley. Neural architecture search without training. In
International Conference on Machine Learning, pages 7588–7598. PMLR, 2021.
[244]
Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection
sensitivity. arXiv preprint arXiv:1810.02340, 2018.
[245]
Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. Pruning neural networks without any data by
iteratively conserving synaptic ow. Advances in neural information processing systems, 33:6377–6389, 2020.
[246]
Ming Lin, Pichao Wang, Zhenhong Sun, Hesen Chen, Xiuyu Sun, Qi Qian, Hao Li, and Rong Jin. Zen-nas: A zero-shot
nas for high-performance image recognition. In
Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 347–356, 2021.
[247]
Yash Akhauri, Juan Munoz, Nilesh Jain, and Ravishankar Iyer. Eznas: Evolving zero-cost proxies for neural architecture
scoring. Advances in Neural Information Processing Systems, 35:30459–30470, 2022.
[248]
Minghao Chen, Houwen Peng, Jianlong Fu, and Haibin Ling. Autoformer: Searching transformers for visual
recognition. In
Proceedings of the IEEE/CVF international conference on computer vision
, pages 12270–12280, 2021.
[249]
Jiahui Gao, Hang Xu, Han Shi, Xiaozhe Ren, LH Philip, Xiaodan Liang, Xin Jiang, and Zhenguo Li. Autobert-zero:
Evolving bert backbone from scratch. In
Proceedings of the AAAI Conference on Articial Intelligence
, pages
10663–10671, 2022.
[250]
David R So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V Le. Primer: Searching for ecient
transformers for language modeling. arXiv preprint arXiv:2109.08668, 2021.
[251]
Yichun Yin, Cheng Chen, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Autotinybert: Automatic hyper-parameter
optimization for ecient pre-trained language models. In
Proceedings of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume
1: Long Papers), pages 5146–5157, 2021.
[252]
Jin Xu, Xu Tan, Renqian Luo, Kaitao Song, Jian Li, Tao Qin, and Tie-Yan Liu. Nas-bert: task-agnostic and adaptive-
size bert compression with neural architecture search. In
Proceedings of the 27th ACM SIGKDD Conference on
Knowledge Discovery & Data Mining, pages 1933–1943, 2021.
[253]
Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Jinzhu Li, Sheng Zhao, Enhong Chen, and Tie-Yan Liu. Lightspeech:
Lightweight and fast text to speech with neural architecture search. In
ICASSP 2021-2021 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5699–5703. IEEE, 2021.
[254]
Jihwan Kim, Jisung Wang, Sangki Kim, and Yeha Lee. Evolved speech-transformer: Applying neural architecture
search to end-to-end automatic speech recognition. In INTERSPEECH, pages 1788–1792, 2020.
[255]
Charles Jin, Phitchaya Mangpo Phothilimthana, and Sudip Roy. αNAS: Neural architecture search using property
guided synthesis. arXiv preprint arXiv:2205.03960, 2022.
[256]
Boyu Chen, Peixia Li, Chuming Li, Baopu Li, Lei Bai, Chen Lin, Ming Sun, Junjie Yan, and Wanli Ouyang. Glit:
Neural architecture search for global and local image transformer. In
Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 12–21, 2021.
[257]
Chengyue Gong, Dilin Wang, Meng Li, Xinlei Chen, Zhicheng Yan, Yuandong Tian, Vikas Chandra, et al. Nasvit:
Neural architecture search for efficient vision transformers with gradient conflict aware supernet training. In
International Conference on Learning Representations, 2021.
[258]
Yong Guo, Yin Zheng, Mingkui Tan, Qi Chen, Jian Chen, Peilin Zhao, and Junzhou Huang. Nat: Neural architecture
transformer for accurate and compact architectures.
Advances in Neural Information Processing Systems
, 32, 2019.
[259]
Mingyu Ding, Xiaochen Lian, Linjie Yang, Peng Wang, Xiaojie Jin, Zhiwu Lu, and Ping Luo. Hr-nas: Searching efficient
high-resolution neural architectures with lightweight transformers. In
Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 2982–2992, 2021.
[260]
Xiu Su, Shan You, Jiyang Xie, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, Xiaogang Wang, and Chang
Xu. Vitas: Vision transformer architecture search. In
Computer Vision–ECCV 2022: 17th European Conference, Tel
Aviv, Israel, October 23–27, 2022, Proceedings, Part XXI, pages 139–157. Springer, 2022.
[261]
Xuanyi Dong, Lu Liu, Katarzyna Musial, and Bogdan Gabrys. Nats-bench: Benchmarking nas algorithms for
architecture topology and size.
IEEE transactions on pattern analysis and machine intelligence
, 44(7):3634–3646,
2021.
[262]
Julien Siems, Lucas Zimmer, Arber Zela, Jovita Lukasik, Margret Keuper, and Frank Hutter. Nas-bench-301 and the
case for surrogate benchmarks for neural architecture search. arXiv preprint arXiv:2008.09777, 2020.
[263]
Renbo Tu, Nicholas Roberts, Misha Khodak, Junhong Shen, Frederic Sala, and Ameet Talwalkar. Nas-bench-360:
Benchmarking neural architecture search on diverse tasks.
Advances in Neural Information Processing Systems
,
35:12380–12394, 2022.
[264]
Arber Zela, Julien Siems, and Frank Hutter. Nas-bench-1shot1: Benchmarking and dissecting one-shot neural
architecture search. In International Conference on Learning Representations, 2020.
[265]
Abhinav Mehrotra, Alberto Gil CP Ramos, Sourav Bhattacharya, Łukasz Dudziak, Ravichander Vipperla, Thomas
Chau, Mohamed S Abdelfattah, Samin Ishtiaq, and Nicholas Donald Lane. Nas-bench-asr: Reproducible neural
architecture search for speech recognition. In International Conference on Learning Representations, 2021.
[266]
Yijian Qin, Ziwei Zhang, Xin Wang, Zeyang Zhang, and Wenwu Zhu. Nas-bench-graph: Benchmarking graph
neural architecture search. In
Thirty-sixth Conference on Neural Information Processing Systems Datasets and
Benchmarks Track, 2022.
[267]
Yash Mehta, Colin White, Arber Zela, Arjun Krishnakumar, Guri Zabergja, Shakiba Moradian, Mahmoud Safari,
Kaicheng Yu, and Frank Hutter. Nas-bench-suite: Nas evaluation is (now) surprisingly easy. In
International
Conference on Learning Representations, 2022.
[268]
Yunyang Xiong, Hanxiao Liu, Suyog Gupta, Berkin Akin, Gabriel Bender, Yongzhe Wang, Pieter-Jan Kindermans,
Mingxing Tan, Vikas Singh, and Bo Chen. Mobiledets: Searching for object detection architectures for mobile
accelerators. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
, pages 3825–
3834, 2021.
[269]
Ning Wang, Yang Gao, Hao Chen, Peng Wang, Zhi Tian, Chunhua Shen, and Yanning Zhang. Nas-fcos: Fast neural
architecture search for object detection. In
proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pages 11943–11951, 2020.
[270]
Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object
detection. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
, pages 7036–7045,
2019.
[271]
Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab:
Hierarchical neural architecture search for semantic image segmentation. In
Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, pages 82–92, 2019.
[272]
Albert Shaw, Daniel Hunter, Forrest Iandola, and Sammy Sidhu. Squeezenas: Fast neural architecture search for faster
semantic segmentation. In
Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops
,
pages 0–0, 2019.
[273]
Xiong Zhang, Hongmin Xu, Hong Mo, Jianchao Tan, Cheng Yang, Lei Wang, and Wenqi Ren. Dcnas: Densely
connected neural architecture search for semantic image segmentation. In
Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, pages 13956–13967, 2021.
[274]
Chenxi Liu, Zhaoqi Leng, Pei Sun, Shuyang Cheng, Charles R Qi, Yin Zhou, Mingxing Tan, and Dragomir Anguelov.
Lidarnas: Unifying and searching neural architectures for 3d point clouds. In
Computer Vision–ECCV 2022: 17th
European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXI, pages 158–175. Springer, 2022.
[275]
Zhijian Liu, Haotian Tang, Shengyu Zhao, Kevin Shao, and Song Han. Pvnas: 3d neural architecture search with
point-voxel convolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):8552–8568, 2021.
[276]
Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han. Searching efficient 3d
architectures with sparse point-voxel convolution. In
Computer Vision–ECCV 2020: 16th European Conference,
Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII, pages 685–702. Springer, 2020.
[277]
Shaoli Liu, Chengjian Zheng, Kaidi Lu, Si Gao, Ning Wang, Bofei Wang, Diankai Zhang, Xiaofeng Zhang, and Tianyu
Xu. Evsrnet: Efficient video super-resolution with neural architecture search. In
Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 2480–2485, 2021.
[278]
Yushu Wu, Yifan Gong, Pu Zhao, Yanyu Li, Zheng Zhan, Wei Niu, Hao Tang, Minghai Qin, Bin Ren, and Yanzhi Wang.
Compiler-aware neural architecture search for on-mobile real-time super-resolution. In
Computer Vision–ECCV
2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIX
, pages 92–111. Springer,
2022.
[279]
Antoine Yang, Pedro M Esperança, and Fabio M Carlucci. Nas evaluation is frustratingly hard. In
International
Conference on Learning Representations, 2020.
[280]
Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation
strategies from data. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
, pages
113–123, 2019.
[281]
Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. In
Uncertainty in
articial intelligence, pages 367–377. PMLR, 2020.
[282]
Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for
image recognition. In
Proceedings of the IEEE/CVF International Conference on Computer Vision
, pages 1284–1293,
2019.
[283]
Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design
spaces. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
, pages 10428–10436,
2020.
[284]
Jianyuan Guo, Kai Han, Yunhe Wang, Chao Zhang, Zhaohui Yang, Han Wu, Xinghao Chen, and Chang Xu. Hit-
detector: Hierarchical trinity architecture search for object detection. In
Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 11405–11414, 2020.
[285]
Ryuichiro Hataya, Jan Zdenek, Kazuki Yoshizoe, and Hideki Nakayama. Faster autoaugment: Learning augmentation
strategies using backpropagation. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
23–28, 2020, Proceedings, Part XXV 16, pages 1–16. Springer, 2020.
[286]
Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions.
arXiv preprint arXiv:1710.05941
,
2017.
[287]
Yucong Zhou, Zezhou Zhu, and Zhao Zhong. Learning specialized activation functions with the piecewise linear
unit. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12095–12104, 2021.
[288]
Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong Tian, Matthew
Yu, Peter Vajda, et al. Fbnetv3: Joint architecture-recipe search using predictor pretraining. In
Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16276–16285, 2021.
[289]
Xuanyi Dong, Mingxing Tan, Adams Wei Yu, Daiyi Peng, Bogdan Gabrys, and Quoc V Le. Autohas: Efficient
hyperparameter and architecture search. arXiv preprint arXiv:2006.03656, 2020.
[290]
Bichen Wu, Chaojian Li, Hang Zhang, Xiaoliang Dai, Peizhao Zhang, Matthew Yu, Jialiang Wang, Yingyan Lin, and
Peter Vajda. Fbnetv5: Neural architecture search for multiple tasks in one run.
arXiv preprint arXiv:2111.10007
, 2021.
[291]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence
Zitnick. Microsoft coco: Common objects in context. In
Computer Vision–ECCV 2014: 13th European Conference,
Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[292]
Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through
ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641,
2017.
[293]
Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks.
arXiv preprint
arXiv:1812.08928, 2018.
[294]
Jiahui Yu and Thomas S Huang. Universally slimmable networks and improved training techniques. In
Proceedings
of the IEEE/CVF international conference on computer vision, pages 1803–1811, 2019.
[295]
Changlin Li, Guangrun Wang, Bing Wang, Xiaodan Liang, Zhihui Li, and Xiaojun Chang. Dynamic slimmable
network. In
Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition
, pages 8607–8617,
2021.
[296]
Changlin Li, Tao Tang, Guangrun Wang, Jiefeng Peng, Bing Wang, Xiaodan Liang, and Xiaojun Chang. Bossnas:
Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search. In
Proceedings of
the IEEE/CVF International Conference on Computer Vision, pages 12281–12291, 2021.
[297]
Lotfi Abdelkrim Mecharbat, Hadjer Benmeziane, Hamza Ouarnoughi, and Smail Niar. Hyt-nas: Hybrid transformers
neural architecture search for edge devices. arXiv preprint arXiv:2303.04440, 2023.
[298]
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks.
In International conference on machine learning, pages 1126–1135. PMLR, 2017.
[299]
Albert Shaw, Wei Wei, Weiyang Liu, Le Song, and Bo Dai. Meta architecture search.
Advances in Neural Information
Processing Systems, 32, 2019.
[300]
Jiaxing Wang, Jiaxiang Wu, Haoli Bai, and Jian Cheng. M-nas: Meta neural architecture search. In
Proceedings of
the AAAI Conference on Artificial Intelligence, pages 6186–6193, 2020.
[301]
Hayeon Lee, Eunyoung Hyung, and Sung Ju Hwang. Rapid neural architecture search by learning to generate graphs
from datasets. arXiv preprint arXiv:2107.00860, 2021.
[302]
Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. Model compression and hardware acceleration for neural
networks: A comprehensive survey. Proceedings of the IEEE, 108(4):485–532, 2020.
[303]
Tianzhe Wang, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Hanrui Wang, Yujun Lin, and Song Han. Apq: Joint search
for network architecture, pruning and quantization policy. In
Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 2078–2087, 2020.
[304]
Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized
neural networks. In International Conference on Learning Representations, 2019.
[305]
Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning.
In International Conference on Learning Representations, 2019.
[306]
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. Eie: Efficient
inference engine on compressed deep neural network.
ACM SIGARCH Computer Architecture News
, 44(3):243–254,
2016.
[307]
Yann LeCun, John Denker, and Sara Solla. Optimal brain damage.
Advances in neural information processing
systems, 2, 1989.
[308]
Babak Hassibi, David G Stork, and Gregory J Wolff. Optimal brain surgeon and general network pruning. In
IEEE
international conference on neural networks, pages 293–299. IEEE, 1993.
[309]
Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks.
British Machine Vision
Conference, 2015.
[310]
Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In
International Conference on Machine Learning, pages 2498–2507. PMLR, 2017.
[311]
Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l_0 regularization.
In International Conference on Learning Representations, 2018.
[312]
Yi Guo, Huan Yuan, Jianchao Tan, Zhangyang Wang, Sen Yang, and Ji Liu. Gdp: Stabilized neural network pruning
via gates with differentiable polarization. In
Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 5239–5250, 2021.
[313]
Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks.
arXiv preprint
arXiv:1902.09574, 2019.
[314]
Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning.
In International Conference on Learning Representations, 2020.
[315]
Babak Hassibi and David Stork. Second order derivatives for network pruning: Optimal brain surgeon.
Advances in
neural information processing systems, 5, 1992.
[316]
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for
resource efficient inference. In International Conference on Learning Representations, 2017.
[317]
Andries P Engelbrecht. A new pruning heuristic based on variance analysis of sensitivity information.
IEEE
transactions on Neural Networks, 12(6):1386–1399, 2001.
[318]
Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator
for deep convolutional neural networks. IEEE journal of solid-state circuits, 52(1):127–138, 2016.
[319]
Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. Eyeriss v2: A flexible accelerator for emerging deep neural
networks on mobile devices.
IEEE Journal on Emerging and Selected Topics in Circuits and Systems
, 9(2):292–308,
2019.
[320]
Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang,
et al. Ese: Efficient speech recognition engine with sparse lstm on fpga. In
Proceedings of the 2017 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pages 75–84, 2017.
[321]
Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen.
Cambricon-x: An accelerator for sparse neural networks. In
2016 49th Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO), pages 1–12. IEEE, 2016.
[322]
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany,
Joel Emer, Stephen W Keckler, and William J Dally. Scnn: An accelerator for compressed-sparse convolutional neural
networks. ACM SIGARCH computer architecture news, 45(2):27–40, 2017.
[323]
Chunhua Deng, Yang Sui, Siyu Liao, Xuehai Qian, and Bo Yuan. Gospa: An energy-efficient high-performance
globally optimized sparse convolutional neural network accelerator. In
2021 ACM/IEEE 48th Annual International
Symposium on Computer Architecture (ISCA), pages 1110–1123. IEEE, 2021.
[324]
Jie-Fang Zhang, Ching-En Lee, Chester Liu, Yakun Sophia Shao, Stephen W Keckler, and Zhengya Zhang. Snap: An
efficient sparse neural acceleration processor for unstructured sparse deep neural network inference.
IEEE Journal
of Solid-State Circuits, 56(2):636–647, 2020.
[325]
Sumanth Gudaparthi, Sarabjeet Singh, Surya Narayanan, Rajeev Balasubramonian, and Visvesh Sathe. Candles:
Channel-aware novel dataflow-microarchitecture co-design for low energy sparse neural network acceleration. In
2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
, pages 876–891. IEEE,
2022.
[326]
Xiaolong Ma, Fu-Ming Guo, Wei Niu, Xue Lin, Jian Tang, Kaisheng Ma, Bin Ren, and Yanzhi Wang. Pconv: The
missing but desirable sparsity in dnn weight pruning for real-time execution on mobile devices. In
Proceedings of
the AAAI Conference on Artificial Intelligence, pages 5117–5124, 2020.
[327]
Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren. Patdnn: Achieving
real-time dnn execution on mobile devices with pattern-based weight pruning. In Proceedings of the Twenty-Fifth
International Conference on Architectural Support for Programming Languages and Operating Systems
, pages 907–
922, 2020.
[328]
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In
International Conference on Learning Representations, 2019.
[329]
Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. Training sparse neural networks. In
Proceedings
of the IEEE conference on computer vision and pattern recognition workshops, pages 138–145, 2017.
[330]
Utku Evci, Fabian Pedregosa, Aidan Gomez, and Erich Elsen. The difficulty of training sparse neural networks.
arXiv
preprint arXiv:1906.10732, 2019.
[331]
Ajay Kumar Jaiswal, Haoyu Ma, Tianlong Chen, Ying Ding, and Zhangyang Wang. Training your sparse neural
network better with any mask. In International Conference on Machine Learning, pages 9833–9844. PMLR, 2022.
[332]
Yi-Lin Sung, Varun Nair, and Colin A Raffel. Training neural networks with fixed sparse masks.
Advances in Neural
Information Processing Systems, 34:24193–24205, 2021.
[333]
Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning
is all you need. In International Conference on Machine Learning, pages 6682–6691. PMLR, 2020.
[334]
Zeru Zhang, Jiayin Jin, Zijie Zhang, Yang Zhou, Xin Zhao, Jiaxiang Ren, Ji Liu, Lingfei Wu, Ruoming Jin, and
Dejing Dou. Validating the lottery ticket hypothesis with inertial manifold theory.
Advances in Neural Information
Processing Systems, 34:30196–30210, 2021.
[335]
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Stabilizing the lottery ticket
hypothesis. arXiv preprint arXiv:1903.01611, 2019.
[336]
Tianlong Chen, Yongduo Sui, Xuxi Chen, Aston Zhang, and Zhangyang Wang. A unified lottery ticket hypothesis
for graph neural networks. In International Conference on Machine Learning, pages 1695–1706. PMLR, 2021.
[337]
Youngeun Kim, Yuhang Li, Hyoungseob Park, Yeshwanth Venkatesha, Ruokai Yin, and Priyadarshini Panda. Exploring
lottery ticket hypothesis in spiking neural networks. In
Computer Vision–ECCV 2022: 17th European Conference,
Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII, pages 102–120. Springer, 2022.
[338]
Sanmitra Banerjee, Mahdi Nikdast, Sudeep Pasricha, and Krishnendu Chakrabarty. Pruning coherent integrated
photonic neural networks using the lottery ticket hypothesis. In
2022 IEEE Computer Society Annual Symposium
on VLSI (ISVLSI), pages 128–133. IEEE, 2022.
[339]
Yuxin Zhang, Mingbao Lin, Zhihang Lin, Yiting Luo, Ke Li, Fei Chao, Yongjian Wu, and Rongrong Ji. Learning best
combination for efficient N:M sparsity. Advances in Neural Information Processing Systems, 35:941–953, 2022.
[340]
Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius
Micikevicius. Accelerating sparse deep neural networks. arXiv preprint arXiv:2104.08378, 2021.
[341]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan
Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In
13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.
[342]
Connor Holmes, Minjia Zhang, Yuxiong He, and Bo Wu. Nxmtransformer: semi-structured sparsification for natural
language understanding via admm. Advances in neural information processing systems, 34:1818–1830, 2021.
[343]
Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng Li. Learning
N:M fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010, 2021.
[344]
Jeff Pool and Chong Yu. Channel permutations for N:M sparsity.
Advances in neural information processing systems
,
34:13316–13327, 2021.
[345]
Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani
Agrawal, Utku Evci, and Tushar Krishna. Progressive gradient flow for robust N:M sparsity training in transformers.
arXiv preprint arXiv:2402.04744, 2024.
[346]
Yun Li, Lin Niu, Xipeng Zhang, Kai Liu, Jianchen Zhu, and Zhanhui Kang. E-sparse: Boosting the large language
model inference through entropy-based N:M sparsity. arXiv preprint arXiv:2310.15929, 2023.
[347]
Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Zhangyang Wang, et al. The emergence of essential sparsity in large
pre-trained models: The weights that matter. Advances in Neural Information Processing Systems, 36, 2024.
[348]
Yuxin Zhang, Yiting Luo, Mingbao Lin, Yunshan Zhong, Jingjing Xie, Fei Chao, and Rongrong Ji. Bi-directional masks
for efficient N:M sparse training. In
International Conference on Machine Learning
, pages 41488–41497. PMLR, 2023.
[349]
Mike Lasby, Anna Golubeva, Utku Evci, Mihai Nica, and Yani Ioannou. Dynamic sparse training with structured
sparsity. arXiv preprint arXiv:2305.02299, 2023.
[350]
Chao Fang, Aojun Zhou, and Zhongfeng Wang. An algorithm–hardware co-optimized framework for accelerating N:M
sparse transformers. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 30(11):1573–1586, 2022.
[351]
Chao Fang, Shouliang Guo, Wei Wu, Jun Lin, Zhongfeng Wang, Ming Kai Hsu, and Lingzhi Liu. An efficient hardware
accelerator for sparse transformer neural networks. In
2022 IEEE International Symposium on Circuits and Systems
(ISCAS), pages 2670–2674. IEEE, 2022.
[352]
Yixuan Luo, Payman Behnam, Kiran Thorat, Zhuo Liu, Hongwu Peng, Shaoyi Huang, Shu Zhou, Omer Khan, Alexey
Tumanov, Caiwen Ding, et al. Codg-reram: An algorithm-hardware co-design to accelerate semi-structured gnns on
reram. In 2022 IEEE 40th International Conference on Computer Design (ICCD), pages 280–289. IEEE, 2022.
[353]
Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, and Kevin Bailly. Red: Looking for redundancies for data-
free structured compression of deep neural networks.
Advances in Neural Information Processing Systems
, 34:20863–
20873, 2021.
[354]
Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, and Kevin Bailly. Red++: Data-free pruning of deep neural
networks via input splitting and output merging.
IEEE Transactions on Pattern Analysis and Machine Intelligence
,
45(3):3664–3676, 2022.
[355]
Wenxiao Wang, Cong Fu, Jishun Guo, Deng Cai, and Xiaofei He. Cop: customized deep model compression via
regularized correlation-based filter-level pruning. In International Joint Conference on Artificial Intelligence, 2019.
[356]
Zi Wang, Chengcheng Li, and Xiangyang Wang. Convolutional neural network pruning with structural redundancy
reduction. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pages 14913–
14922, 2021.
[357]
Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In
Proceedings
of the IEEE international conference on computer vision, pages 1389–1397, 2017.
[358]
Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. Hrank: Filter
pruning using high-rank feature map. In
Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pages 1529–1538, 2020.
[359]
Yang Sui, Miao Yin, Yi Xie, Huy Phan, Saman Aliari Zonouz, and Bo Yuan. Chip: Channel independence-based
pruning for compact neural networks. Advances in Neural Information Processing Systems, 34:24604–24616, 2021.
[360]
Chong Min John Tan and Mehul Motani. Dropnet: Reducing neural network complexity via iterative pruning. In
International Conference on Machine Learning, pages 9356–9366. PMLR, 2020.
[361]
Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression.
In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017.
[362]
Xiaohan Ding, Guiguang Ding, Yuchen Guo, Jungong Han, and Chenggang Yan. Approximated oracle filter pruning
for destructive cnn width optimization. In
International Conference on Machine Learning
, pages 1607–1616. PMLR,
2019.
[363]
Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning.
Advances in neural information processing
systems, 30, 2017.
[364]
Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient
convolutional networks through network slimming. In
Proceedings of the IEEE international conference on computer
vision, pages 2736–2744, 2017.
[365]
Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, and Ping Wang. Gate decorator: Global filter pruning method for
accelerating deep convolutional neural networks. Advances in neural information processing systems, 32, 2019.
[366]
Tao Zhuang, Zhixuan Zhang, Yuheng Huang, Xiaoyi Zeng, Kai Shuang, and Xiang Li. Neuron-level structured
pruning using polarization regularizer. Advances in neural information processing systems, 33:9865–9877, 2020.
[367]
Jianbo Ye, Xin Lu, Zhe Lin, and James Z Wang. Rethinking the smaller-norm-less-informative assumption in channel
pruning of convolution layers. In International Conference on Learning Representations, 2018.
[368]
Minsoo Kang and Bohyung Han. Operation-aware soft channel pruning using differentiable masks. In
International
Conference on Machine Learning, pages 5122–5131. PMLR, 2020.
[369]
Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and
acceleration on mobile devices. In
Proceedings of the European conference on computer vision (ECCV)
, pages
784–800, 2018.
[370]
Sixing Yu, Arya Mazaheri, and Ali Jannesari. Auto graph encoder-decoder for neural network pruning. In
Proceedings
of the IEEE/CVF International Conference on Computer Vision, pages 6362–6372, 2021.
[371]
Manoj Alwani, Yang Wang, and Vashisht Madhavan. Decore: Deep compression with reinforcement learning. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12349–12359, 2022.
[372]
Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Kwang-Ting Cheng, and Jian Sun. Metaprun-
ing: Meta learning for automatic neural network channel pruning. In
Proceedings of the IEEE/CVF international
conference on computer vision, pages 3296–3305, 2019.
[373]
Mingbao Lin, Rongrong Ji, Yuxin Zhang, Baochang Zhang, Yongjian Wu, and Yonghong Tian. Channel pruning
via automatic structure search. In
Proceedings of the Twenty-Ninth International Conference on International Joint
Conferences on Artificial Intelligence, pages 673–679, 2021.
[374]
Xuhua Li, Weize Sun, Lei Huang, and Shaowu Chen. Sub-network multi-objective evolutionary algorithm for filter
pruning. arXiv preprint arXiv:2211.01957, 2022.
[375]
Yawei Li, Shuhang Gu, Kai Zhang, Luc Van Gool, and Radu Timofte. Dhp: Differentiable meta pruning via hypernet-
works. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings,
Part VIII 16, pages 608–624. Springer, 2020.
[376]
Shaopeng Guo, Yujie Wang, Quanquan Li, and Junjie Yan. Dmcp: Differentiable markov channel pruning for neural
networks. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
, pages 1539–1547,
2020.
[377]
Xuefei Ning, Tianchen Zhao, Wenshuo Li, Peng Lei, Yu Wang, and Huazhong Yang. Dsa: More efficient budgeted
pruning via differentiable sparsity allocation. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow,
UK, August 23–28, 2020, Proceedings, Part III, pages 592–607. Springer, 2020.
[378]
Shi Chen and Qi Zhao. Shallowing deep networks: Layer-wise pruning based on feature representations.
IEEE
transactions on pattern analysis and machine intelligence, 41(12):3048–3056, 2018.
[379]
Sara Elkerdawy, Mostafa Elhoushi, Abhineet Singh, Hong Zhang, and Nilanjan Ray. To filter prune, or to layer prune,
that is the question. In Proceedings of the Asian Conference on Computer Vision, 2020.
[380]
Hui Tang, Yao Lu, and Qi Xuan. Sr-init: An interpretable layer pruning method.
arXiv preprint arXiv:2303.07677
,
2023.
[381]
Ke Zhang and Guangzhe Liu. Layer pruning for obtaining shallower resnets.
IEEE Signal Processing Letters
,
29:1172–1176, 2022.
[382]
Artur Jordao, George Correa de Araujo, Helena de Almeida Maia, and Helio Pedrini. When layers play the lottery, all
tickets win at initialization. arXiv preprint arXiv:2301.10835, 2023.
[383]
Yang He and Lingao Xiao. Structured pruning for deep convolutional neural networks: A survey.
arXiv preprint
arXiv:2303.00566, 2023.
[384]
Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui
Zhu. Discrimination-aware channel pruning for deep neural networks.
Advances in neural information processing
systems, 31, 2018.
[385]
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal
covariate shift. In International conference on machine learning, pages 448–456. PMLR, 2015.
[386]
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and
Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[387]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, Shiqing Li, Guochu Xiong, and Weichen Liu. Pearls hide
behind linearity: Simplifying deep convolutional networks for embedded hardware systems via linearity grafting. In
2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 250–255. IEEE, 2024.
[388]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Guochu Xiong, and Weichen Liu. Domino-pro-max: Towards
efficient network simplification and reparameterization for embedded hardware systems.
IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 2024.
[389]
Hao Kong, Di Liu, Shuo Huai, Xiangzhong Luo, Weichen Liu, Ravi Subramaniam, Christian Makaya, and Qian Lin.
Smart scissor: Coupling spatial redundancy reduction and cnn compression for embedded hardware. In
Proceedings
of the 41st IEEE/ACM International Conference on Computer-Aided Design, pages 1–9, 2022.
[390]
Hao Kong, Di Liu, Xiangzhong Luo, Shuo Huai, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu.
Towards efficient convolutional neural network for embedded hardware via multi-dimensional pruning. In
2023 60th
ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2023.
[391]
Hao Kong, Xiangzhong Luo, Shuo Huai, Di Liu, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu.
Emnape: Efficient multi-dimensional neural architecture pruning for edgeai. In
2023 Design, Automation & Test in
Europe Conference & Exhibition (DATE), pages 1–2. IEEE, 2023.
[392]
Hao Kong, Di Liu, Shuo Huai, Xiangzhong Luo, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu.
Edgecompress: Coupling multidimensional model compression and dynamic inference for edgeai.
IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 42(12):4657–4670, 2023.
[393]
Hao Kong, Di Liu, Xiangzhong Luo, Weichen Liu, and Ravi Subramaniam. Hacscale: Hardware-aware compound
scaling for resource-efficient dnns. In
2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)
,
pages 708–713. IEEE, 2022.
[394]
Adrian Bulat and Georgios Tzimiropoulos. Xnor-net++: Improved binary neural networks.
arXiv preprint
arXiv:1909.13863, 2019.
[395]
Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-real net: Enhancing the
performance of 1-bit cnns with improved representational capability and advanced training algorithm. In
Proceedings
of the European conference on computer vision (ECCV), pages 722–737, 2018.
[396]
Haotong Qin, Ruihao Gong, Xianglong Liu, Mingzhu Shen, Ziran Wei, Fengwei Yu, and Jingkuan Song. Forward and
backward information retention for accurate binary neural networks. In
Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, pages 2250–2259, 2020.
[397]
Mingbao Lin, Rongrong Ji, Zihan Xu, Baochang Zhang, Yan Wang, Yongjian Wu, Feiyue Huang, and Chia-Wen Lin.
Rotated binary neural network. Advances in neural information processing systems, 33:7474–7485, 2020.
[398]
Mingbao Lin, Rongrong Ji, Zihan Xu, Baochang Zhang, Fei Chao, Chia-Wen Lin, and Ling Shao. Siman: Sign-to-
magnitude network binarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[399]
Sieger Falkena, Hadi Jamali-Rad, and Jan van Gemert. Lab: Learnable activation binarizer for binary neural networks.
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6425–6434, 2023.
[400]
Zhijun Tu, Xinghao Chen, Pengju Ren, and Yunhe Wang. Adabin: Improving binary neural networks with adaptive
binary sets. In
Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022,
Proceedings, Part XI, pages 379–395. Springer, 2022.
[401]
Fengfu Li, Bin Liu, Xiaoxing Wang, Bo Zhang, and Junchi Yan. Ternary weight networks.
arXiv preprint
arXiv:1605.04711, 2016.
[402]
Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. In
International Conference
on Learning Representations, 2017.
[403]
Hande Alemdar, Vincent Leroy, Adrien Prost-Boucle, and Frédéric Pétrot. Ternary neural networks for resource-
efficient ai applications. In
2017 international joint conference on neural networks (IJCNN)
, pages 2547–2554. IEEE,
2017.
[404]
Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey. Ternary
neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462, 2017.
[405]
Yue Li, Wenrui Ding, Chunlei Liu, Baochang Zhang, and Guodong Guo. Trq: Ternary neural networks with residual
quantization. In Proceedings of the AAAI conference on artificial intelligence, pages 8538–8546, 2021.
[406]
Yuhang Li, Xin Dong, Sai Qian Zhang, Haoli Bai, Yuanpeng Chen, and Wei Wang. Rtn: Reparameterized ternary
network. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4780–4787, 2020.
[407]
Peng Chen, Bohan Zhuang, and Chunhua Shen. Fatnn: Fast and accurate ternary neural networks. In
Proceedings of
the IEEE/CVF International Conference on Computer Vision, pages 5219–5228, 2021.
[408]
Weixiang Xu, Xiangyu He, Tianli Zhao, Qinghao Hu, Peisong Wang, and Jian Cheng. Soft threshold ternary networks.
arXiv preprint arXiv:2204.01234, 2022.
[409]
Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on cpus. In
Deep
Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
[410] Han Vanholder. Efficient inference with tensorrt. In GPU Technology Conference, 2016.
[411]
Sumin Kim, Gunju Park, and Youngmin Yi. Performance evaluation of int8 quantized inference on mobile gpus.
IEEE
Access, 9:164245–164255, 2021.
[412]
Li Lyna Zhang, Xudong Wang, Jiahang Xu, Quanlu Zhang, Yujing Wang, Yuqing Yang, Ningxin Zheng, Ting
Cao, and Mao Yang. Spaceevo: Hardware-friendly search space design for efficient int8 inference.
arXiv preprint
arXiv:2303.08308, 2023.
[413]
Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi, Kushal Datta, and Vikram Saletore.
Efficient 8-bit quantization of transformer neural machine language translation model.
arXiv preprint arXiv:1906.00532
,
2019.
[414]
Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and
Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018.
[415]
Diwen Wan, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, and Heng Tao Shen. Tbn: Convolutional neural network
with ternary inputs and binary weights. In
Proceedings of the European Conference on Computer Vision (ECCV)
,
pages 315–332, 2018.
[416]
Julian Faraone, Nicholas Fraser, Michaela Blott, and Philip HW Leong. Syq: Learning symmetric quantization for
efficient deep neural networks. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
,
pages 4300–4309, 2018.
[417]
Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and
Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks.
arXiv preprint
arXiv:1805.06085, 2018.
[418]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Gins-
burg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training.
arXiv preprint
arXiv:1710.03740, 2017.
[419]
Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee,
Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, et al. Mixed precision training of
convolutional neural networks using integer operations. In
International Conference on Learning Representations
,
2018.
[420]
Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou
Yang, Liwei Yu, et al. Highly scalable deep learning training system with mixed-precision: Training imagenet in four
minutes. arXiv preprint arXiv:1807.11205, 2018.
[421]
Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Carl Case, and Paulius Micikevicius. Openseq2seq:
extensible toolkit for distributed and mixed precision training of sequence-to-sequence models. In
Proceedings of
Workshop for NLP Open Source Software (NLP-OSS), pages 41–46, 2018.
[422]
Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, and Junjie Yan. Towards
unified int8 training for convolutional neural network. In
Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 1969–1979, 2020.
[423]
Kang Zhao, Sida Huang, Pan Pan, Yinghan Li, Yingya Zhang, Zhenyu Gu, and Yinghui Xu. Distribution adaptive int8
quantization for training cnns. In
Proceedings of the AAAI Conference on Artificial Intelligence
, pages 3483–3491,
2021.
[424]
Shyam A Tailor, Javier Fernandez-Marques, and Nicholas D Lane. Degree-quant: Quantization-aware training for
graph neural networks. In International Conference on Learning Representations, 2021.
[425]
Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, and Tijmen Blankevoort. Overcoming oscillations in
quantization-aware training. In International Conference on Machine Learning, pages 16318–16330. PMLR, 2022.
[426]
Charbel Sakr, Steve Dai, Rangharajan Venkatesan, Brian Zimmer, William Dally, and Brucek Khailany. Optimal clipping
and magnitude-aware differentiation for improved quantization-aware training. In
International Conference on
Machine Learning, pages 19123–19138. PMLR, 2022.
[427]
Jiseok Youn, Jaehun Song, Hyung-Sin Kim, and Saewoong Bahk. Bitwidth-adaptive quantization-aware neural
network training: A meta-learning approach. In
Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv,
Israel, October 23–27, 2022, Proceedings, Part XII, pages 208–224. Springer, 2022.
[428]
Bichen Wu, Yanghan Wang, Peizhao Zhang, Yuandong Tian, Peter Vajda, and Kurt Keutzer. Mixed precision
quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090, 2018.
[429]
Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization with
mixed precision. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pages
8612–8620, 2019.
[430]
Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization
of neural networks with mixed-precision. In
Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 293–302, 2019.
[431]
Haibao Yu, Qi Han, Jianbo Li, Jianping Shi, Guangliang Cheng, and Bin Fan. Search what you want: Barrier panelty
nas for mixed precision quantization. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK,
August 23–28, 2020, Proceedings, Part IX 16, pages 1–16. Springer, 2020.
[432]
Weihan Chen, Peisong Wang, and Jian Cheng. Towards mixed-precision quantization of neural networks via
constrained optimization. In
Proceedings of the IEEE/CVF International Conference on Computer Vision
, pages
5350–5359, 2021.
[433]
Zhaowei Cai and Nuno Vasconcelos. Rethinking differentiable search for mixed-precision neural networks. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2349–2358, 2020.
[434]
Ziwei Wang, Han Xiao, Jiwen Lu, and Jie Zhou. Generalizable mixed-precision quantization via attribution rank
preservation. In
Proceedings of the IEEE/CVF International Conference on Computer Vision
, pages 5291–5300, 2021.
[435]
Hai Victor Habi, Roy H Jennings, and Arnon Netzer. Hmq: Hardware friendly mixed precision quantization block for
cnns. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings,
Part XXVI 16, pages 448–463. Springer, 2020.
[436]
Zhaohui Yang, Yunhe Wang, Kai Han, Chunjing Xu, Chao Xu, Dacheng Tao, and Chang Xu. Searching for low-bit
weights in quantized neural networks. Advances in neural information processing systems, 33:4091–4102, 2020.
[437]
Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. Yodann: An ultra-low power convolutional neural
network accelerator based on binary weights. In
2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)
,
pages 236–241. IEEE, 2016.
[438]
Peng Guo, Hong Ma, Ruizhi Chen, Pin Li, Shaolin Xie, and Donglin Wang. Fbna: A fully binarized neural network
accelerator. In
2018 28th International Conference on Field Programmable Logic and Applications (FPL)
, pages
51–513. IEEE, 2018.
[439]
Francesco Conti, Pasquale Davide Schiavone, and Luca Benini. Xnor neural engine: A hardware accelerator ip for
21.6-fj/op binary neural network inference.
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, 37(11):2940–2951, 2018.
[440] Shubham Jain, Sumeet Kumar Gupta, and Anand Raghunathan. Tim-dnn: Ternary in-memory accelerator for deep
neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 28(7):1567–1577, 2020.
[441]
Moritz Scherer, Georg Rutishauser, Lukas Cavigelli, and Luca Benini. Cutie: Beyond petaflop/s/w ternary dnn inference
acceleration with better-than-binary energy efficiency.
IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 41(4):1020–1033, 2021.
[442]
Shien Zhu, Luan HK Duong, Hui Chen, Di Liu, and Weichen Liu. Fat: An in-memory accelerator with fast addition for
ternary weight neural networks.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
,
2022.
[443]
Nahsung Kim, Dongyeob Shin, Wonseok Choi, Geonho Kim, and Jongsun Park. Exploiting retraining-based mixed-
precision quantization for low-cost dnn accelerator design.
IEEE Transactions on Neural Networks and Learning
Systems, 32(7):2925–2938, 2020.
[444]
Mengshu Sun, Zhengang Li, Alec Lu, Yanyu Li, Sung-En Chang, Xiaolong Ma, Xue Lin, and Zhenman Fang. Film-qnn:
Efficient fpga acceleration of deep neural networks with intra-layer, mixed-precision quantization. In
Proceedings of
the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 134–145, 2022.
[445]
Jinsu Lee, Juhyoung Lee, Donghyeon Han, Jinmook Lee, Gwangtae Park, and Hoi-Jun Yoo. An energy-efficient sparse
deep-neural-network learning accelerator with fine-grained mixed precision of fp8–fp16.
IEEE Solid-State Circuits
Letters, 2(11):232–235, 2019.
[446]
Sitao Huang, Aayush Ankit, Plinio Silveira, Rodrigo Antunes, Sai Rahul Chalamalasetti, Izzat El Hajj, Dong Eun
Kim, Glaucimar Aguiar, Pedro Bruel, Sergey Serebryakov, et al. Mixed precision quantization for reram-based dnn
inference accelerators. In
Proceedings of the 26th Asia and South Pacific Design Automation Conference
, pages
372–377, 2021.
[447]
Wolfgang Balzer, Masanobu Takahashi, Jun Ohta, and Kazuo Kyuma. Weight quantization in boltzmann machines.
Neural Networks, 4(3):405–409, 1991.
[448]
Emile Fiesler, Amar Choudry, and H John Caulfield. Weight discretization paradigm for optical neural networks. In
Optical interconnections and networks, volume 1281, pages 164–173. SPIE, 1990.
[449]
Gunhan Dundar and Kenneth Rose. The effects of quantization on multilayer neural networks.
IEEE Transactions
on Neural Networks, 6(6):1446–1451, 1995.
[450]
Shuo Huai, Di Liu, Xiangzhong Luo, Hui Chen, Weichen Liu, and Ravi Subramaniam. Crossbar-aligned & integer-only
neural network compression for efficient in-memory acceleration. In
Proceedings of the 28th Asia and South Pacific
Design Automation Conference, pages 234–239, 2023.
[451]
Shuo Huai, Hao Kong, Xiangzhong Luo, Shiqing Li, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen
Liu. Crimp: Compact & reliable dnn inference on in-memory processing via crossbar-aligned compression and
non-ideality adaptation. ACM Transactions on Embedded Computing Systems, 22(5s):1–25, 2023.
[452]
Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation.
arXiv preprint
arXiv:1910.10699, 2019.
[453]
Srinidhi Hegde, Ranjitha Prasad, Ramya Hebbalaguppe, and Vishwajeet Kumar. Variational student: Learning compact
and sparser networks in knowledge distillation framework. In
ICASSP 2020-2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 3247–3251. IEEE, 2020.
[454]
Tiancheng Wen, Shenqi Lai, and Xueming Qian. Preparing lessons: Improve knowledge distillation with better
supervision. Neurocomputing, 454:25–33, 2021.
[455]
Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In
Proceedings of the IEEE/CVF
international conference on computer vision, pages 4794–4802, 2019.
[456]
Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh.
Improved knowledge distillation via teacher assistant. In
Proceedings of the AAAI conference on artificial intelligence
,
pages 5191–5198, 2020.
[457]
Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. Knowledge
distillation: A good teacher is patient and consistent. In
Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 10925–10934, 2022.
[458]
Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with
distillation. In Proceedings of the IEEE international conference on computer vision, pages 1910–1918, 2017.
[459]
Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet
classification. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
, pages 10687–
10698, 2020.
[460]
Guanzhe Hong, Zhiyuan Mao, Xiaojun Lin, and Stanley H Chan. Student-teacher learning from clean inputs to
noisy inputs. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pages
12075–12084, 2021.
[461]
Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization,
network minimization and transfer learning. In
Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 4133–4141, 2017.
[462]
Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor
transfer. Advances in neural information processing systems, 31, 2018.
[463]
Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information
distillation for knowledge transfer. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 9163–9171, 2019.
[464]
Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In
Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 1365–1374, 2019.
[465]
Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. Bert-of-theseus: Compressing bert by progressive
module replacing. arXiv preprint arXiv:2002.02925, 2020.
[466]
Zaida Zhou, Chaoran Zhuge, Xinwei Guan, and Wen Liu. Channel distillation: Channel-wise attention for knowledge
distillation. arXiv preprint arXiv:2006.01683, 2020.
[467]
Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets
improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
[468] Shan You, Chang Xu, Chao Xu, and Dacheng Tao. Learning from multiple teacher networks. In Proceedings of the
23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1285–1294, 2017.
[469]
Bharat Bhusan Sau and Vineeth N Balasubramanian. Deep model compression: Distilling knowledge from noisy
teachers. arXiv preprint arXiv:1610.09650, 2016.
[470]
Guocong Song and Wei Chai. Collaborative learning for deep neural networks.
Advances in neural information
processing systems, 31, 2018.
[471]
Ze Yang, Linjun Shou, Ming Gong, Wutao Lin, and Daxin Jiang. Model compression with two-stage multi-teacher
knowledge distillation for web question answering system. In
Proceedings of the 13th International Conference on
Web Search and Data Mining, pages 690–698, 2020.
[472]
Xiatian Zhu, Shaogang Gong, et al. Knowledge distillation by on-the-fly native ensemble.
Advances in neural
information processing systems, 31, 2018.
[473]
Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata, Samuel Thomas, Jia Cui, and Bhuvana Ramabhadran. Efficient
knowledge distillation from an ensemble of teachers. In Interspeech, pages 3697–3701, 2017.
[474]
Liuyu Xiang, Guiguang Ding, and Jungong Han. Learning from multiple experts: Self-paced knowledge distillation
for long-tailed classification. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
23–28, 2020, Proceedings, Part V 16, pages 247–263. Springer, 2020.
[475]
Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In
Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 4320–4328, 2018.
[476]
Elliot J Crowley, Gavin Gray, and Amos J Storkey. Moonshine: Distilling with cheap convolutions.
Advances in
Neural Information Processing Systems, 31, 2018.
[477]
Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher:
Improve the performance of convolutional neural networks via self distillation. In
Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 3713–3722, 2019.
[478]
Hossein Mobahi, Mehrdad Farajtabar, and Peter Bartlett. Self-distillation amplifies regularization in hilbert space.
Advances in Neural Information Processing Systems, 33:3351–3361, 2020.
[479]
Sukmin Yun, Jongjin Park, Kimin Lee, and Jinwoo Shin. Regularizing class-wise predictions via self-knowledge
distillation. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
, pages 13876–
13885, 2020.
[480]
Mingi Ji, Seungjae Shin, Seunghyun Hwang, Gibeom Park, and Il-Chul Moon. Refine myself by teaching myself:
Feature refinement via self-knowledge distillation. In
Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, pages 10664–10673, 2021.
[481] Yixiao Ge, Xiao Zhang, Ching Lam Choi, Ka Chun Cheung, Peipei Zhao, Feng Zhu, Xiaogang Wang, Rui Zhao, and
Hongsheng Li. Self-distillation with batch knowledge ensembling improves imagenet classification.
arXiv preprint
arXiv:2104.13298, 2021.
[482]
Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: Similarity control and knowledge transfer.
Journal of Machine Learning Research, 16(61):2023–2049, 2015.
[483]
David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged
information. In International Conference on Learning Representations, 2016.
[484]
Peisen Zhao, Lingxi Xie, Jiajie Wang, Ya Zhang, and Qi Tian. Progressive privileged knowledge distillation for online
action detection. Pattern Recognition, 129:108741, 2022.
[485]
Fengyi Tang, Cao Xiao, Fei Wang, Jiayu Zhou, and Li-wei H Lehman. Retaining privileged information for multi-
task learning. In
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining, pages 1369–1377, 2019.
[486]
Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. Adversarial distillation for learning with privileged provisions.
IEEE transactions on pattern analysis and machine intelligence, 43(3):786–797, 2019.
[487]
Chen Xu, Quan Li, Junfeng Ge, Jinyang Gao, Xiaoyong Yang, Changhua Pei, Fei Sun, Jian Wu, Hanxiao Sun, and
Wenwu Ou. Privileged features distillation at taobao recommendations. In
Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, pages 2590–2598, 2020.
[488]
Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian.
Data-free learning of student networks. In
Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 3514–3522, 2019.
[489]
Gongfan Fang, Jie Song, Chengchao Shen, Xinchao Wang, Da Chen, and Mingli Song. Data-free adversarial distillation.
arXiv preprint arXiv:1912.11006, 2019.
[490]
Xiaoyang Qu, Jianzong Wang, and Jing Xiao. Enhancing data-free adversarial distillation with activation regularization
and virtual interpolation. In
ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 3340–3344. IEEE, 2021.
[491]
Haoran Zhao, Xin Sun, Junyu Dong, Milos Manic, Huiyu Zhou, and Hui Yu. Dual discriminator adversarial distillation
for data-free model compression. International Journal of Machine Learning and Cybernetics, pages 1–18, 2022.
[492]
Yuanxin Zhuang, Lingjuan Lyu, Chuan Shi, Carl Yang, and Lichao Sun. Data-free adversarial knowledge distillation
for graph neural networks. arXiv preprint arXiv:2205.03811, 2022.
[493]
Yiman Zhang, Hanting Chen, Xinghao Chen, Yiping Deng, Chunjing Xu, and Yunhe Wang. Data-free knowledge
distillation for image super-resolution. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 7852–7861, 2021.
[494]
Gongfan Fang, Jie Song, Xinchao Wang, Chengchao Shen, Xingen Wang, and Mingli Song. Contrastive model
inversion for data-free knowledge distillation. arXiv preprint arXiv:2105.08584, 2021.
[495]
Mandar Kulkarni, Kalpesh Patil, and Shirish Karande. Knowledge distillation using unlabeled mismatched images.
arXiv preprint arXiv:1703.07131, 2017.
[496]
Qing Liu, Lingxi Xie, Huiyu Wang, and Alan L Yuille. Semantic-aware knowledge preservation for zero-shot
sketch-based image retrieval. In
Proceedings of the IEEE/CVF International Conference on Computer Vision
, pages
3662–3671, 2019.
[497]
Tianhong Li, Jianguo Li, Zhuang Liu, and Changshui Zhang. Few sample knowledge distillation for efficient network
compression. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pages
14639–14647, 2020.
[498]
Akisato Kimura, Zoubin Ghahramani, Koh Takeuchi, Tomoharu Iwata, and Naonori Ueda. Few-shot learning of
neural networks from scratch by pseudo example optimization. arXiv preprint arXiv:1802.03039, 2018.
[499]
Haoli Bai, Jiaxiang Wu, Irwin King, and Michael Lyu. Few shot network compression via cross distillation. In
Proceedings of the AAAI Conference on Artificial Intelligence, pages 3203–3210, 2020.
[500]
Huanyu Wang, Junjie Liu, Xin Ma, Yang Yong, Zhenhua Chai, and Jianxin Wu. Compressing models with few
samples: Mimicking then replacing. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 701–710, 2022.
[501]
Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In
Proceedings of the 12th
ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541, 2006.
[502]
Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection
models with knowledge distillation. Advances in neural information processing systems, 30, 2017.
[503]
Xiangzhong Luo, HK Luan Duong, and Weichen Liu. Person re-identification via pose-aware multi-semantic learning.
In 2020 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2020.
[504]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[505]
Yu Liu, Xuhui Jia, Mingxing Tan, Raviteja Vemulapalli, Yukun Zhu, Bradley Green, and Xiaogang Wang. Search to
distill: Pearls are everywhere but not the eyes. In
Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 7539–7548, 2020.
[506]
Peijie Dong, Lujun Li, and Zimian Wei. Diswot: Student architecture search for distillation without training.
arXiv
preprint arXiv:2303.15678, 2023.
[507]
Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Kuan Wang, Tianzhe Wang, Ligeng Zhu, and Song Han. Automl for architecting
efficient and specialized neural networks. IEEE Micro, 40(1):75–82, 2019.
[508]
Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon,
Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, et al. Towards federated learning at scale: System design.
Proceedings of machine learning and systems, pages 374–388, 2019.
[509]
Rundong Li, Yan Wang, Feng Liang, Hongwei Qin, Junjie Yan, and Rui Fan. Fully quantized network for object
detection. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pages 2810–
2819, 2019.
[510]
Zihao Xie, Li Zhu, Lin Zhao, Bo Tao, Liman Liu, and Wenbing Tao. Localization-aware channel pruning for object
detection. Neurocomputing, 403:400–408, 2020.
[511]
PyTorch. Automatic mixed precision. https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-
automatic-mixed-precision/, 2021.
[512]
Colby Banbury, Chuteng Zhou, Igor Fedorov, Ramon Matas, Urmish Thakker, Dibakar Gope, Vijay Janapa Reddi,
Matthew Mattina, and Paul Whatmough. Micronets: Neural network architectures for deploying tinyml applications
on commodity microcontrollers. Proceedings of machine learning and systems, 3:517–532, 2021.
[513]
Ji Lin, Wei-Ming Chen, Yujun Lin, Chuang Gan, Song Han, et al. Mcunet: Tiny deep learning on iot devices.
Advances
in Neural Information Processing Systems, 33:11711–11722, 2020.
[514]
Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, and Song Han. Memory-efficient patch-based inference for tiny deep
learning. Advances in Neural Information Processing Systems, 34:2346–2358, 2021.
[515]
Kunran Xu, Yishi Li, Huawei Zhang, Rui Lai, and Lin Gu. Etinynet: Extremely tiny network for tinyml. In
Proceedings
of the AAAI conference on artificial intelligence, 2022.
[516]
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost.
arXiv
preprint arXiv:1604.06174, 2016.
[517]
Jianwei Feng and Dong Huang. Optimal gradient checkpoint search for arbitrary computation graphs. In
Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11433–11442, 2021.
[518]
Mucong Ding, Tahseen Rabbani, Bang An, Evan Wang, and Furong Huang. Sketch-gnn: Scalable graph neural
networks with sublinear training complexity.
Advances in Neural Information Processing Systems
, 35:2930–2943,
2022.
[519]
Xucheng Ye, Pengcheng Dai, Junyu Luo, Xin Guo, Yingjie Qi, Jianlei Yang, and Yiran Chen. Accelerating cnn training
by pruning activation gradients. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
23–28, 2020, Proceedings, Part XXV 16, pages 322–338. Springer, 2020.
[520]
Yuedong Yang, Guihong Li, and Radu Marculescu. Efficient on-device training via gradient filtering.
arXiv preprint
arXiv:2301.00330, 2023.
[521]
Liu Liu, Lei Deng, Xing Hu, Maohua Zhu, Guoqi Li, Yufei Ding, and Yuan Xie. Dynamic sparse graph for efficient
deep learning. International Conference on Learning Representations, 2019.
[522]
Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical
precision. In International conference on machine learning, pages 1737–1746. PMLR, 2015.
[523]
Qihua Zhou, Song Guo, Zhihao Qu, Jingcai Guo, Zhenda Xu, Jiewei Zhang, Tao Guo, Boyuan Luo, and Jingren Zhou.
Octo: Int8 training with loss-aware compensation and backward quantization for tiny on-device learning. In
USENIX
Annual Technical Conference, pages 177–191, 2021.
[524]
Leonardo Ravaglia, Manuele Rusci, Davide Nadalini, Alessandro Capotondi, Francesco Conti, and Luca Benini. A
tinyml platform for on-device continual learning with quantized latent replays.
IEEE Journal on Emerging and
Selected Topics in Circuits and Systems, pages 789–802, 2021.
[525]
Tyler L Hayes and Christopher Kanan. Online continual learning for embedded devices.
arXiv preprint
arXiv:2203.10681, 2022.
[526]
Lorenzo Pellegrini, Vincenzo Lomonaco, Gabriele Graffieti, and Davide Maltoni. Continual learning at the edge:
Real-time training on smartphone devices. arXiv preprint arXiv:2105.13127, 2021.
[527]
Giorgos Demosthenous and Vassilis Vassiliades. Continual learning on the edge with tensorflow lite.
arXiv preprint
arXiv:2105.01946, 2021.
[528]
Yang Xiao, Xubo Liu, James King, Arshdeep Singh, Eng Siong Chng, Mark D Plumbley, and Wenwu Wang. Continual
learning for on-device environmental sound classification. arXiv preprint arXiv:2207.07429, 2022.
[529]
Young D Kwon, Jagmohan Chauhan, Abhishek Kumar, Pan Hui HKUST, and Cecilia Mascolo. Exploring system
performance of continual learning for mobile and embedded sensing applications. In
2021 IEEE/ACM Symposium
on Edge Computing (SEC), pages 319–332. IEEE, 2021.
[530]
Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, and Abdelrahman
Mohamed. Continual learning for on-device speech recognition using disentangled conformers. In
ICASSP 2023-2023
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[531]
Alberto Dequino, Francesco Conti, and Luca Benini. Vit-lr: Pushing the envelope for transformer-based on-device
embedded continual learning. In
2022 IEEE 13th International Green and Sustainable Computing Conference (IGSC)
,
pages 1–6. IEEE, 2022.
[532]
Jaekang Shin, Seungkyu Choi, Yeongjae Choi, and Lee-Sup Kim. A pragmatic approach to on-device incremental
learning system with selective weight updates. In
2020 57th ACM/IEEE Design Automation Conference (DAC)
, pages
1–6. IEEE, 2020.
[533]
Ze-Han Wang, Zhenli He, Hui Fang, Yi-Xiong Huang, Ying Sun, Yu Yang, Zhi-Yuan Zhang, and Di Liu. Efficient on-
device incremental learning by weight freezing. In
2022 27th Asia and South Pacific Design Automation Conference
(ASP-DAC), pages 538–543. IEEE, 2022.
[534]
Prahalathan Sundaramoorthy, Gautham Krishna Gudur, Manav Rajiv Moorthy, R Nidhi Bhandari, and Vineeth
Vijayaraghavan. Harnet: Towards on-device incremental learning using deep ensembles on constrained devices. In
Proceedings of the 2nd International Workshop on Embedded and Mobile Deep Learning, pages 31–36, 2018.
[535]
Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge Belongie. Large scale fine-grained categorization and
domain-specific transfer learning. In
Proceedings of the IEEE conference on computer vision and pattern recognition
,
pages 4109–4118, 2018.
[536]
Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In
Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition, pages 2661–2671, 2019.
[537]
Pramod Kaushik Mudrakarta, Mark Sandler, Andrey Zhmoginov, and Andrew Howard. K for the price of 1: Parameter-
efficient multi-task and transfer learning. International Conference on Learning Representations, 2019.
[538]
Jonathan Frankle, David J Schwab, and Ari S Morcos. Training batchnorm and only batchnorm: On the expressive
power of random features in cnns. International Conference on Learning Representations, 2021.
[539]
Fahdi Kanavati and Masayuki Tsuneki. Partial transfusion: on the expressive influence of trainable batch norm
parameters for transfer learning. In Medical Imaging with Deep Learning, pages 338–353. PMLR, 2021.
[540]
Moslem Yazdanpanah, Aamer Abdul Rahman, Muawiz Chaudhary, Christian Desrosiers, Mohammad Havaei, Eugene
Belilovsky, and Samira Ebrahimi Kahou. Revisiting learnable affines for batch norm in few-shot transfer learning. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9109–9118, 2022.
[541]
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient
learning of deep networks from decentralized data. In
Artificial intelligence and statistics
, pages 1273–1282. PMLR,
2017.
[542]
Sebastian Caldas, Jakub Konečný, H Brendan McMahan, and Ameet Talwalkar. Expanding the reach of federated
learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210, 2018.
[543]
Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communi-
cation bandwidth for distributed training. International Conference on Learning Representations, 2018.
[544]
Ligeng Zhu, Hongzhou Lin, Yao Lu, Yujun Lin, and Song Han. Delayed gradient averaging: Tolerate the communication
latency for federated learning. Advances in Neural Information Processing Systems, 34:29995–30007, 2021.
[545]
Tien-Ju Yang, Dhruv Guliani, Françoise Beaufays, and Giovanni Motta. Partial variable training for efficient on-device
federated learning. In
ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 4348–4352. IEEE, 2022.
[546]
Shuai Zhu, Thiemo Voigt, JeongGil Ko, and Fatemeh Rahimian. On-device training: A first overview on existing
systems. arXiv preprint arXiv:2212.00824, 2022.
[547]
Han Cai, Chuang Gan, Ji Lin, and Song Han. Network augmentation for tiny deep learning. In
International
Conference on Learning Representations, 2022.
[548]
Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving
deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
[549]
Shuo Huai, Di Liu, Hao Kong, Xiangzhong Luo, Weichen Liu, Ravi Subramaniam, Christian Makaya, and Qian Lin.
Collate: Collaborative neural network learning for latency-critical edge systems. In
2022 IEEE 40th International
Conference on Computer Design (ICCD), pages 627–634. IEEE, 2022.
[550]
Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Françoise Beaufays, Sean Augenstein, Hubert
Eichner, Chloé Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction.
arXiv preprint
arXiv:1811.03604, 2018.
[551]
Mohammed Adnan, Shivam Kalra, Jesse C Cresswell, Graham W Taylor, and Hamid R Tizhoosh. Federated learning
and differential privacy for medical image analysis. Scientific reports, 2022.
[552]
Rodolfo Stoffel Antunes, Cristiano André da Costa, Arne Küderle, Imrana Abdullahi Yari, and Björn Eskofier. Federated
learning for healthcare: Systematic review and architecture proposal.
ACM Transactions on Intelligent Systems and
Technology (TIST), 2022.
[553]
Yujin Huang, Han Hu, and Chunyang Chen. Robustness of on-device models: Adversarial attack to deep learning
models on android apps. In
2021 IEEE/ACM 43rd International Conference on Software Engineering: Software
Engineering in Practice (ICSE-SEIP), pages 101–110. IEEE, 2021.
[554]
Qun Song, Zhenyu Yan, and Rui Tan. Deepmtd: Moving target defense for deep visual sensing against adversarial
examples. ACM Transactions on Sensor Networks (TOSN), 18(1):1–32, 2021.
[555]
Qun Song, Zhenyu Yan, and Rui Tan. Moving target defense for embedded deep visual sensing against adversarial
examples. In Proceedings of the 17th Conference on Embedded Networked Sensor Systems, pages 124–137, 2019.
[556]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li,
and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.
Journal of machine
learning research, 21(140):1–67, 2020.
[557]
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and
Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation,
and comprehension. arXiv preprint arXiv:1910.13461, 2019.
[558]
Apoorv Vyas, Angelos Katharopoulos, and François Fleuret. Fast transformers with clustered attention.
Advances in
Neural Information Processing Systems, 33:21665–21674, 2020.
[559]
Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh.
Nyströmformer: A nyström-based algorithm for approximating self-attention. In
Proceedings of the AAAI Conference
on Artificial Intelligence, volume 35, pages 14138–14148, 2021.
[560]
Amir Zandieh, Insu Han, Majid Daliri, and Amin Karbasi. Kdeformer: Accelerating transformers via kernel density
estimation. In International Conference on Machine Learning, pages 40605–40623. PMLR, 2023.
[561]
Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke
Zettlemoyer. Mega: moving average equipped gated attention. arXiv preprint arXiv:2209.10655, 2022.
[562]
Silas Alberti, Niclas Dern, Laura Thesing, and Gitta Kutyniok. Sumformer: Universal approximation for ecient
transformers. In Topological, Algebraic and Geometric Learning Workshops 2023, pages 72–86. PMLR, 2023.
[563]
Ahan Gupta, Yueming Yuan, Yanqi Zhou, and Charith Mendis. Flurka: Fast fused low-rank & kernel attention.
arXiv
preprint arXiv:2306.15799, 2023.
[564]
Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee
Lee, Kyoung Park, Jae W Lee, et al. A^3: Accelerating attention mechanisms in neural networks with approximation.
In
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
, pages 328–341. IEEE,
2020.
[565]
Tae Jun Ham, Yejin Lee, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, and Jae W Lee. Elsa: Hardware-
software co-design for efficient, lightweight self-attention mechanism in neural networks. In
2021 ACM/IEEE 48th
Annual International Symposium on Computer Architecture (ISCA), pages 692–705. IEEE, 2021.
[566]
Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with
hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
[567]
Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In
International Conference on Machine Learning, pages 10323–10337. PMLR, 2023.
[568]
Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt:
Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024, 2024.
[569]
Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad
Rastegari, and Mehrdad Farajtabar. Relu strikes back: Exploiting activation sparsity in large language models.
arXiv
preprint arXiv:2310.04564, 2023.
[570]
Vithursan Thangarasa, Abhay Gupta, William Marshall, Tianda Li, Kevin Leong, Dennis DeCoste, Sean Lie, and
Shreyas Saxena. Spdf: Sparse pre-training and dense fine-tuning for large language models. In
Uncertainty in
Artificial Intelligence, pages 2134–2146. PMLR, 2023.
[571]
Alan Ansell, Ivan Vulić, Hannah Sterz, Anna Korhonen, and Edoardo M Ponti. Scaling sparse fine-tuning to large
language models. arXiv preprint arXiv:2401.16405, 2024.
[572]
Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, and Dan Alistarh. Sparse finetuning for inference
acceleration of large language models. arXiv preprint arXiv:2310.06927, 2023.
[573]
Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Fluctuation-based adaptive structured pruning for large
language models. In
Proceedings of the AAAI Conference on Artificial Intelligence
, volume 38, pages 10865–10873,
2024.
[574]
Eldar Kurtić, Elias Frantar, and Dan Alistarh. Ziplm: Inference-aware structured pruning of language models.
Advances in Neural Information Processing Systems, 36, 2024.
[575]
Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, and Luming Liang. Lorashear: Efficient large language model
structured pruning and knowledge recovery. arXiv preprint arXiv:2310.18356, 2023.
[576]
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model pre-training
via structured pruning. arXiv preprint arXiv:2310.06694, 2023.
[577]
Xiaodong Chen, Yuxuan Hu, and Jing Zhang. Compressing large language models by streamlining the unimportant
layer. arXiv preprint arXiv:2403.19135, 2024.
[578]
Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen.
Shortgpt: Layers in large language models are more redundant than you expect.
arXiv preprint arXiv:2403.03853
,
2024.
[579]
Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song.
Shortened llama: A simple depth pruning for large language models. arXiv preprint arXiv:2402.02834, 2024.
[580]
Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander
Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized representation for near-lossless llm weight
compression. arXiv preprint arXiv:2306.03078, 2023.
[581]
Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier
suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling.
arXiv
preprint arXiv:2304.09145, 2023.
[582]
Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, and Eunhyeok Park. Owq: Lessons learned from activation
outliers for weight quantization in large language models. arXiv preprint arXiv:2306.02272, 2023.
[583] Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher M De Sa. Quip: 2-bit quantization of large language
models with guarantees. Advances in Neural Information Processing Systems, 36, 2024.
[584]
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao,
and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models.
arXiv preprint
arXiv:2308.13137, 2023.
[585]
Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang,
and Yu Wang. Evaluating quantized large language models. arXiv preprint arXiv:2402.18158, 2024.
[586]
Cong Guo, Jiaming Tang, Weiming Hu, Jingwen Leng, Chen Zhang, Fan Yang, Yunxin Liu, Minyi Guo, and Yuhao Zhu.
Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization. In
Proceedings of
the 50th Annual International Symposium on Computer Architecture, pages 1–15, 2023.
[587]
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4.
arXiv
preprint arXiv:2304.03277, 2023.
[588]
Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. Lamini-lm: A diverse
herd of distilled models from large-scale instructions. arXiv preprint arXiv:2304.14402, 2023.
[589]
Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei Wang. Lion: Adversarial distillation of closed-source large
language model. arXiv preprint arXiv:2305.12870, 2023.
[590]
Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, and Tuo Zhao. Less is more: Task-aware
layer-wise distillation for language model compression. In
International Conference on Machine Learning
, pages
20852–20867. PMLR, 2023.
[591]
Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier
Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In
The Twelfth
International Conference on Learning Representations, 2024.
[592]
Minsoo Kim, Sihwa Lee, Janghwan Lee, Sukjin Hong, Du-Seong Chang, Wonyong Sung, and Jungwook Choi. Token-
scaled logit distillation for ternary weight generative language models.
Advances in Neural Information Processing
Systems, 36, 2024.
[593]
Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase,
Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: enabling efficient inference of transformer
models at unprecedented scale. In
SC22: International Conference for High Performance Computing, Networking,
Storage and Analysis, pages 1–15. IEEE, 2022.
[594]
Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving
for large language models. arXiv preprint arXiv:2305.05920, 2023.
[595]
Yunho Jin, Chun-Feng Wu, David Brooks, and Gu-Yeon Wei. S^3: Increasing gpu utilization during generative inference
for higher throughput. Advances in Neural Information Processing Systems, 36, 2024.
[596]
Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, and Ricardo Bianchini.
Splitwise: Efficient generative llm inference using phase splitting. arXiv preprint arXiv:2311.18677, 2023.
[597]
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disag-
gregating prefill and decoding for goodput-optimized large language model serving.
arXiv preprint arXiv:2401.09670
,
2024.
[598]
Jiangsu Du, Jinhui Wei, Jiazhi Jiang, Shenggan Cheng, Dan Huang, Zhiguang Chen, and Yutong Lu. Liger: Interleaving
intra-and inter-operator parallelism for distributed large model inference. In
Proceedings of the 29th ACM SIGPLAN
Annual Symposium on Principles and Practice of Parallel Programming, pages 42–54, 2024.
[599]
Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, and Chinmay Hegde. On the computational complexity of
self-attention. In International Conference on Algorithmic Learning Theory, pages 597–619. PMLR, 2023.
[600]
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[601]
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael
Chung, Matteo Grella, Kranthi Kiran GV, et al. Rwkv: Reinventing rnns for the transformer era.
arXiv preprint
arXiv:2305.13048, 2023.
[602]
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.
arXiv preprint
arXiv:2312.00752, 2023.
[603]
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive
network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
[604]
Jinhwan Park, Yoonho Boo, Iksoo Choi, Sungho Shin, and Wonyong Sung. Fully neural network based speech
recognition on mobile and embedded devices. Advances in neural information processing systems, 31, 2018.
[605]
Yongqiang He and Xiguang Dong. Real time speech recognition algorithm on embedded system based on continuous
markov model. Microprocessors and Microsystems, 75:103058, 2020.
[606]
Xiaowei Xu, Xinyi Zhang, Bei Yu, Xiaobo Sharon Hu, Christopher Rowen, Jingtong Hu, and Yiyu Shi. Dac-sdc
low power object detection challenge for uav applications.
IEEE transactions on pattern analysis and machine
intelligence, 43(2):392–403, 2019.
[607]
Xiaofan Zhang, Haoming Lu, Cong Hao, Jiachen Li, Bowen Cheng, Yuhong Li, Kyle Rupnow, Jinjun Xiong, Thomas
Huang, Honghui Shi, et al. Skynet: a hardware-efficient method for object detection and tracking on embedded
systems. Proceedings of Machine Learning and Systems, 2:216–229, 2020.
[608]
Sabur Baidya, Yu-Jen Ku, Hengyu Zhao, Jishen Zhao, and Sujit Dey. Vehicular and edge computing for emerging
connected and autonomous vehicle applications. In
2020 57th ACM/IEEE Design Automation Conference (DAC)
,
pages 1–6. IEEE, 2020.
[609]
Xiaoming Zeng, Zhendong Wang, and Yang Hu. Enabling efficient deep convolutional neural network-based sensor
fusion for autonomous driving. In
Proceedings of the 59th ACM/IEEE Design Automation Conference
, pages 283–288,
2022.
[610]
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama,
and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM
international conference on Multimedia, pages 675–678, 2014.
[611]
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and
Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems.
arXiv
preprint arXiv:1512.01274, 2015.
[612]
Nikhil Ketkar and Nikhil Ketkar. Introduction to keras.
Deep learning with python: a hands-on introduction
, pages
97–111, 2017.
[613] Mohit Thakkar. Introduction to core ml framework. In Beginning Machine Learning in iOS, 2019.
[614]
Yanjun Ma, Dianhai Yu, Tian Wu, and Haifeng Wang. Paddlepaddle: An open-source deep learning platform from
industrial practice. Frontiers of Data and Computing, 1(1):105–115, 2019.
[615]
Jason Jinquan Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry Li Zhang, Yan
Wan, Zhichao Li, Jiao Wang, Shengsheng Huang, Zhongyuan Wu, Yang Wang, Yuhao Yang, Bowen She, Dongjie Shi,
Qi Lu, Kai Huang, and Guoqiong Song. Bigdl: A distributed deep learning framework for big data. In
Proceedings of
the ACM Symposium on Cloud Computing (SoCC), page 50–60, 2019.
[616] Google. Google coral dev board. https://coral.ai/products/dev-board/.
[617] Huawei. Huawei hikey 970. https://www.96boards.org/product/hikey970.
[618]
Shenzhen Xunlong Software Co., Limited. Orange pi ai stick lite. http://www.orangepi.org/html/hardWare/
computerAndMicrocontrollers/details/Orange-Pi-AI-Stick-Lite.html.
[619]
Hanrui Wang, Jiaqi Gu, Yongshan Ding, Zirui Li, Frederic T Chong, David Z Pan, and Song Han. Quantumnat:
quantum noise-aware training with noise injection, quantization and normalization. In
Proceedings of the 59th
ACM/IEEE Design Automation Conference, pages 1–6, 2022.
[620]
Hanrui Wang, Pengyu Liu, Jinglei Cheng, Zhiding Liang, Jiaqi Gu, Zirui Li, Yongshan Ding, Weiwen Jiang, Yiyu
Shi, Xuehai Qian, et al. Quest: Graph transformer for quantum circuit reliability estimation.
arXiv preprint
arXiv:2210.16724, 2022.
[621]
Hanrui Wang, Zirui Li, Jiaqi Gu, Yongshan Ding, David Z Pan, and Song Han. Qoc: quantum on-chip training with
parameter shift and gradient pruning. In
Proceedings of the 59th ACM/IEEE Design Automation Conference
, pages
655–660, 2022.
[622]
Nur Ahmed and Muntasir Wahed. The de-democratization of ai: Deep learning and the compute divide in artificial
intelligence research. arXiv preprint arXiv:2010.15581, 2020.
[623]
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the
2020s. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pages 11976–11986,
2022.
[624]
Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext
v2: Co-designing and scaling convnets with masked autoencoders. arXiv preprint arXiv:2301.00808, 2023.
[625]
Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Mykola Pechenizkiy, Decebal Mocanu,
and Zhangyang Wang. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity.
arXiv preprint
arXiv:2207.03620, 2022.
[626]
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. Efficient processing of deep neural networks: A tutorial
and survey. Proceedings of the IEEE, 105(12):2295–2329, 2017.
[627]
Han Cai, Tianzhe Wang, Zhanghao Wu, Kuan Wang, Ji Lin, and Song Han. On-device image classification with prox-
yless neural architecture search and quantization-aware fine-tuning. In
Proceedings of the IEEE/CVF International
Conference on Computer Vision Workshops, pages 0–0, 2019.
[628]
Abhijeet Boragule, Kin Choong Yow, and Moongu Jeon. On-device face authentication system for atms and privacy
preservation. In 2023 IEEE International Conference on Consumer Electronics (ICCE), pages 1–4. IEEE, 2023.
[629]
George Sung, Kanstantsin Sokal, Esha Uboweja, Valentin Bazarevsky, Jonathan Baccash, Eduard Gabriel Baza-
van, Chuo-Ling Chang, and Matthias Grundmann. On-device real-time hand gesture recognition.
arXiv preprint
arXiv:2111.00038, 2021.
[630]
Xiangsheng Shi, Xuefei Ning, Lidong Guo, Tianchen Zhao, Enshu Liu, Yi Cai, Yuhan Dong, Huazhong Yang, and
Yu Wang. Memory-oriented structural pruning for efficient image restoration. In
Proceedings of the AAAI Conference
on Artificial Intelligence, volume 37, pages 2245–2253, 2023.
[631]
Ivan Grishchenko, Valentin Bazarevsky, Andrei Zanfir, Eduard Gabriel Bazavan, Mihai Zanfir, Richard Yee, Karthik
Raveendran, Matsvei Zhdanovich, Matthias Grundmann, and Cristian Sminchisescu. Blazepose ghum holistic:
Real-time 3d human landmarks and pose estimation. arXiv preprint arXiv:2206.11678, 2022.
[632]
Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. Boosting image captioning with attributes. In
Proceedings
of the IEEE international conference on computer vision, pages 4894–4902, 2017.
[633]
Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651–4659, 2016.
[634]
Yonggan Fu, Zhifan Ye, Jiayi Yuan, Shunyao Zhang, Sixu Li, Haoran You, and Yingyan Lin. Gen-nerf: Efficient
and generalizable neural radiance fields via algorithm-hardware co-design. In
Proceedings of the 50th Annual
International Symposium on Computer Architecture, pages 1–12, 2023.
[635]
Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A
large-scale dataset for comprehensive instructional video analysis. In
Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 1207–1216, 2019.
[636]
Wilko Schwarting, Javier Alonso-Mora, and Daniela Rus. Planning and decision-making for autonomous vehicles.
Annual Review of Control, Robotics, and Autonomous Systems, 1:187–210, 2018.
[637]
Omar Elharrouss, Noor Almaadeed, and Somaya Al-Maadeed. A review of video surveillance systems.
Journal of
Visual Communication and Image Representation, 77:103116, 2021.
[638]
Ge Wang, Andreu Badal, Xun Jia, Jonathan S Maltz, Klaus Mueller, Kyle J Myers, Chuang Niu, Michael Vannier,
Pingkun Yan, Zhou Yu, et al. Development of metaverse for intelligent healthcare.
Nature Machine Intelligence
,
4(11):922–929, 2022.
[639]
Utsav Drolia, Katherine Guo, and Priya Narasimhan. Precog: Prefetching for image recognition applications at the
edge. In Proceedings of the Second ACM/IEEE Symposium on Edge Computing, pages 1–13, 2017.
[640] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[641]
Muhammad Waseem Anwar, Imran Ahsan, Farooque Azam, Wasi Haider Butt, and Muhammad Rashid. A natural
language processing (nlp) framework for embedded systems to automatically extract verification aspects from textual
design requirements. In
Proceedings of the 2020 12th International Conference on Computer and Automation
Engineering, pages 7–12, 2020.
[642]
Jin Zhou and Meiyu Zhou. Sentiment analysis of elderly wearable device users based on text mining. In
Advances in Usability, User Experience, Wearable and Assistive Technology: Proceedings of the AHFE 2021 Virtual
Conferences on Usability and User Experience, Human Factors and Wearable Technologies, Human Factors in
Virtual Environments and Game Design, and Human Factors and Assistive Technology, July 25-29, 2021, USA
, pages
360–365. Springer, 2021.
[643]
Aagam Shah, Rohan Shah, Praneeta Desai, and Chirag Desai. Mental health monitoring using sentiment analysis.
International Research Journal of Engineering and Technology (IRJET), 7(07):2395–0056, 2020.
[644]
Peiyan Dong, Siyue Wang, Wei Niu, Chengming Zhang, Sheng Lin, Zhengang Li, Yifan Gong, Bin Ren, Xue Lin,
and Dingwen Tao. Rtmobile: Beyond real-time mobile acceleration of rnns for speech recognition. In
2020 57th
ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2020.
[645]
Pani Prithvi Raj, Pakala Akhil Reddy, and Nitin Chandrachoodan. Reduced memory viterbi decoding for hardware-
accelerated speech recognition. ACM Transactions on Embedded Computing Systems (TECS), 21(3):1–18, 2022.
[646]
Minji Cho, Sang-su Lee, and Kun-Pyo Lee. Once a kind friend is now a thing: Understanding how conversational
agents at home are forgotten. In
Proceedings of the 2019 on Designing Interactive Systems Conference
, pages
1557–1569, 2019.
[647] Apple Incorporation. Siri, 2010.
[648]
Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, and Juan Pino. Fairseq s2t: Fast
speech-to-text modeling with fairseq. arXiv preprint arXiv:2010.05171, 2020.
[649]
Jiahui Hou, Xiang-Yang Li, Peide Zhu, Zefan Wang, Yu Wang, Jianwei Qian, and Panlong Yang. Signspeaker: A
real-time, high-precision smartwatch-based sign language translator. In
The 25th Annual International Conference
on Mobile Computing and Networking, pages 1–15, 2019.
[650]
Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, and Tie-Yan Liu. Simulspeech: End-to-end simulta-
neous speech to text translation. In
Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pages 3787–3796, 2020.
[651] Sauhaarda Chowdhuri, Tushar Pankaj, and Karl Zipser. Multinet: Multi-modal multi-task learning for autonomous
driving. In
2019 IEEE Winter Conference on Applications of Computer Vision (WACV)
, pages 1496–1504. IEEE, 2019.