
Latency optimized Deep Neural Networks (DNNs): An Artificial Intelligence approach at the Edge using Multiprocessor System on Chip (MPSoC)



In almost every heavily computation-dependent application, from 6G communication systems to autonomous driving platforms, a large portion of the computing should take place close to the client side. Edge computing (AI at the Edge) on mobile devices is one approach to meeting this requirement. Therefore, this work examines the possibilities and challenges of implementing a low-latency, power-optimized smart mobile system. Utilizing Field Programmable Gate Array (FPGA) based solutions at the edge leads to bandwidth-optimized designs and, as a consequence, can boost computational effectiveness under a system-level deadline. Moreover, various performance aspects and the implementation feasibility of Neural Networks (NNs) on both embedded FPGA edge devices (using a Xilinx Multiprocessor System on Chip (MPSoC)) and the Cloud are discussed. The main goal of this work is to demonstrate a hybrid system that uses the deep learning programmable engine developed by Xilinx Inc. as the main component of the hardware accelerator. Based on this design, an efficient embedded system for mobile edge computing is presented.
Seyed Nima Omidsajedi, Rekha Reddy, Jianming Yi, Jan Herbst, Christoph Lipps and Hans Dieter Schotten∗†
Intelligent Networks Research Group, German Research Center for Artificial Intelligence
D-67663 Kaiserslautern, Email: {firstname.lastname}
Institute for Wireless Communication and Navigation, University of Kaiserslautern
D-67663 Kaiserslautern, mail: {lastname}
Index Terms—Edge computing, Multiprocessor System on
Chip (MPSoC), Edge Artificial Intelligence (Edge AI), Deep
Neural Networks (DNN), Deep learning Processing Unit (DPU).
With new emerging technologies in the field of communication and mobile systems, demand for smart and secure platforms is arising in real-world situations. Moreover, the increasing complexity of target applications with specialized purposes has led to a wide spectrum of research on constructing smart systems that employ Artificial Intelligence (AI) solutions. For large-scale application models, the need for parallel computational capabilities with higher performance raises many design challenges. For instance, Raina, Madhavan, and Ng [1] discuss the implementation of unsupervised learning methods on modern graphics processors, which run up to 70 times faster than a dual-core Central Processing Unit (CPU). Developments in Cloud computing technologies have enabled the provision of higher computational capabilities. Thus, the customary approach to creating powerful AI systems is to use Cloud services because of their extensive computation power. However, traditional Cloud solutions cannot provide the desired efficiency for the ever-growing number of mobile systems [2][3]. As a result, a new computing architecture based on end devices (also known as IoT devices), edge devices, and the Cloud has been introduced and is used to improve the performance of systems that are extremely computationally dependent [4].
Cloud servers can serve as a very powerful computing tool, particularly for running AI inference systems; nevertheless, the significant data transmission between the end device and the Cloud raises the issue of reliability. Moreover, the available bandwidth in a network is restricted; expanding the network may therefore further reduce the bandwidth available along the longer data path. In addition, data transfer within a network can compromise data security and, as a result, endanger the security of the system.
A common choice for edge computing is the Graphics Processing Unit (GPU), owing to the strong computation capabilities and high memory bandwidth of GPU devices. However, these systems pose many issues for battery-dependent systems, especially in mobile applications. Two of the most important design parameters for every edge device are i) the energy efficiency and ii) the security of the design. GPU systems are designed for graphical purposes and would not provide a fully reliable smart system. One notable solution for implementing bandwidth-optimized, energy-efficient, and secure devices is the ARM-FPGA hybrid system. Such systems consist of ARM processors at the heart of software processing, acting as the Central Control Unit (CCU), alongside a dedicated hardware accelerator on FPGA blocks that boosts the desired Machine Learning (ML) algorithm through parallelism. As early as 1992, [6] demonstrated that a large neural network application running on a single ANNA (Analog Neural Network Arithmetic and logic unit) chip can achieve a speed advantage of 50 to 500 times over conventional hardware evaluated with Floating-Point (FP) precision.
In the current paper, an FPGA-based edge solution is presented and compared with GPU-based methods. The remainder of this work is organized as follows. Section II reviews the current state of the art for embedded FPGAs using MPSoCs. The detailed structure of our proposed system with a dedicated hardware accelerator is presented in Section III. The implementation comparison of embedded FPGA and GPU (Edge vs. Cloud) is discussed in Section IV. Section V concludes this work and outlines future work.
First ideas for reducing network traffic through edge computation are provided in [4]. The authors tackle the problem of crowdsourced applications that are geo-distributed globally and produce highly heterogeneous data, and in particular address the challenges of crowdsourced deep learning applications. They claim to reduce network traffic by 80% and running time by 69% in comparison to state-of-the-art cloud solutions. One of the big advantages is the lower network latency between end-user devices and edge servers compared to cloud servers. [7] and [8] address the problem of running computation-intensive Deep Neural Network (DNN) based tasks on mobile devices by developing a dedicated framework that exploits device-edge synergy. To this end, they use algorithms that adaptively partition computations between device and edge. Furthermore, they reduce computing latency via early-exit inference, provided by the use of BranchyNet [9] at the edge.
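The early-exit idea mentioned above can be illustrated with a minimal sketch. It assumes an entropy threshold as the confidence test, in the spirit of BranchyNet [9]; the function names, the toy stages, and the threshold value are hypothetical and are not taken from [7], [8], or [9].

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats); low entropy means a confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit_inference(x, stages, threshold):
    """Run stages in order and stop at the first confident exit.

    stages is a list of (feature_fn, classifier_fn) pairs; both are
    hypothetical callables standing in for network layers and exit branches.
    Returns (predicted_class, index_of_exit_taken).
    """
    features = x
    last = len(stages) - 1
    for i, (feature_fn, classifier_fn) in enumerate(stages):
        features = feature_fn(features)
        probs = softmax(classifier_fn(features))
        # The final stage always answers; earlier ones only if confident.
        if i == last or entropy(probs) < threshold:
            return probs.index(max(probs)), i

# Toy two-stage "network" (assumption): each stage only rescales its input.
demo_stages = [
    (lambda v: [2.0 * a for a in v], lambda f: f),
    (lambda v: [10.0 * a for a in v], lambda f: f),
]

easy = early_exit_inference([5.0, 0.0, 0.0], demo_stages, threshold=0.5)
hard = early_exit_inference([0.0, 0.05, 0.1], demo_stages, threshold=0.5)
print(easy, hard)
```

In this toy setting, a clearly separable input leaves at the first exit, while an ambiguous input is passed on to the later stage, which is where the latency saving of early exiting comes from.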
[10] examines latency and power optimizations of heavily computation-dependent applications on the client side with the use of FPGAs. The authors investigate the efficient hardware implementation of Cellular Neural Networks (CeNNs) on FPGAs with the help of different compression techniques for image processing tasks. By using so-called incremental quantization, which includes parameter partition, parameter quantization, and re-training, they quantize the numbers in CeNN templates to powers of two and can thus use the full potential of embedded FPGAs. This addresses the very large number of multiplications required in CeNNs, which becomes a bottleneck in an FPGA because of the limited number of embedded multipliers it contains. Once the numbers are quantized to powers of two, the multiplications can be carried out as logic shifts, which require only logic elements and registers. In the present work, CNNs are used instead of CeNNs; moreover, it is the Xilinx deep learning IP (Xilinx DPU) that delivers sufficient speedup for inference tasks.
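To make the shift-based multiplication concrete, the following minimal sketch rounds a weight to the nearest power of two in the log domain and then multiplies an integer activation using shifts only. The rounding rule and the function names are assumptions for illustration; this is not the incremental quantization procedure of [10], which additionally involves parameter partition and re-training.

```python
import math

def quantize_to_power_of_two(w):
    """Approximate w by sign * 2**exponent; returns (sign, exponent).

    Rounding is done on log2(|w|), which is an illustrative choice.
    A zero weight is encoded as sign 0.
    """
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exponent = round(math.log2(abs(w)))
    return sign, exponent

def shift_multiply(x, sign, exponent):
    """Multiply integer x by sign * 2**exponent without a multiplier.

    Negative exponents use a right shift, which truncates toward
    negative infinity for values that are not exact multiples.
    """
    if sign == 0:
        return 0
    shifted = x << exponent if exponent >= 0 else x >> -exponent
    return shifted if sign > 0 else -shifted

activation = 40
for weight in (0.23, -3.7, 1.0):
    sign, exponent = quantize_to_power_of_two(weight)
    print(weight, "->", sign * 2.0 ** exponent,
          "product:", shift_multiply(activation, sign, exponent))
```

Here 0.23 is approximated by 0.25 and -3.7 by -4, so each product reduces to one shift and an optional negation.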
[11] also worked on more efficient ways of implementing DNNs on FPGAs. The authors developed a framework named DNNWEAVER that automatically generates synthesizable accelerators for a given pair of DNN and FPGA. In performance tests, they compared three different FPGAs (Xilinx Zynq, Altera Stratix V, Altera Arria 10) against different many-core GPUs. With the accelerators synthesized by the framework, the Xilinx Zynq and Altera Arria 10 FPGAs achieved higher performance per Watt than all tested many-core GPUs.
In [12], the authors went one step further and claim to have created the first hybrid GPU-FPGA system for training a network. They propose a new framework that focuses on effective DNN training on GPU-FPGA hybrids. They use different energy-efficient schemes for DNN training procedures to execute individual training operations on either high-performance hardware (GPUs) or power-efficient hardware (FPGAs). With that, they claim to reduce the average power consumption by 44.3%. All in all, the higher adaptability of FPGAs and the advanced processing power of embedded FPGAs, compared to conventional GPUs, result in lower cost and relatively lower power consumption, which makes them a serious competitor to the pure GPU solutions still used in many classical cloud-based solutions.
This Section describes the proposed design with a dedicated hardware accelerator on the FPGA, using the Programmable Logic (PL) of the Xilinx MPSoC evaluation board (Xilinx ZCU102). The heart of this accelerator is the Xilinx Deep Learning Processing Unit (DPU) [13], which is responsible for computing the given inference tasks of a DNN system. The main function of this unit is to calculate massive series of Multiply-Accumulate (MAC) operations on the network parameters of DNNs. The initialization tasks, as well as the data coordination of a DNN, are handled by the ARM processors in the Processing System (PS) of the MPSoC.
The overall workflow for preparing and implementing a Neural Network (NN) on a target embedded FPGA board (Xilinx SoC/MPSoC) is depicted in Figure 1. This diagram consists of three sub-workflows. The leftmost path (depicted in dark gray) explains the steps required for preparing a dedicated hardware accelerator on FPGA logic gates using the Xilinx DPU soft IP. This path also includes all the required pre-processing at the hardware level. Using the design implemented on the PL side, the computational processes of an NN, which are mainly multiplications and additions, are then performed on the FPGA. The design strategy behind the DPU IP is to utilize Digital Signal Processor (DSP) tiles for conducting MAC operations.
[Figure 1: workflow diagram. Box labels: Hardware accelerator design on FPGA (DPU, Zynq+ processor, ...); Export hardware; PetaLinux on target embedded FPGA, Configure Linux Kernel; Board Support Package (BSP); Export boot files; Machine Learning (ML) framework; Neural Network (NN); Float inference graph / Frozen input; Calibration dataset / Input function; Quantization of model (Convert FP to INT); Quantized model; Compilation of model (Xilinx Docker tool: Vitis-AI); Network xmodel; Json file for NN; Example: ResNet50 (Classification network); Application in Python / C++; Cross compile (PetaLinux SDK); Provide application dependencies; Running embedded Linux with a Neural Network (NN) on SoC/MPSoC.]
Fig. 1: Overall workflow for implementing NNs on the target SoC/MPSoC board
The middle sub-workflow (depicted in light gray) describes the steps required for converting the desired NN from its initial form to an equivalent FPGA-compatible form. This step involves model quantization (converting floating-point weights and biases to equivalent integer numbers). The quantization step is very important because it not only reduces the required storage memory but also optimizes the bandwidth required for data transmission between the off-chip memory (DDR memory connected to the PS side) and the hardware accelerator on the FPGA side. After the quantization step, the graph-based NN must be converted to a binary form that is executable on the logic gates of the PL subsystem. In this step, the quantized model is fed into the Xilinx compiler to generate the xmodel file of the initial NN.
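The effect of the FP-to-INT conversion can be sketched with a generic symmetric quantizer. This is an illustrative assumption, not the algorithm of the Xilinx quantizer: the 8-bit integer width, the 32-bit float baseline, the per-tensor scale, and all function names are hypothetical choices made for the example.

```python
def calibrate_scale(calibration_values, num_bits=8):
    """Choose a scale so the largest calibration magnitude maps to qmax."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(values, scale, num_bits=8):
    """Map floats to signed integers, clamping to the representable range."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in q_values]

weights = [0.5, -1.27, 0.0, 1.0]          # toy weights (assumption)
scale = calibrate_scale(weights)
q_weights = quantize(weights, scale)
restored = dequantize(q_weights, scale)

bytes_fp32 = 4 * len(weights)             # assuming 32-bit floats
bytes_int8 = 1 * len(weights)             # assuming 8-bit integers
print(q_weights, restored, bytes_fp32 / bytes_int8)
```

Under these assumptions each parameter shrinks from four bytes to one, which is the source of the storage and memory-bandwidth savings described above, at the cost of a rounding error of at most half a quantization step per value.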
A user-defined Linux application should be used to control the data transmission among the various parts of the SoC/MPSoC and to service interrupts. This part is shown in the rightmost sub-workflow (depicted in white). Depending on the requirements of the project, the application may need to be cross-compiled (C++ application). Finally, once the embedded Linux (PetaLinux) is booted with a given NN on the target board, further experiments are possible. In this case, the classic ResNet50 [14] DNN is used, a classification network pre-trained on 224 × 224 × 3 images (224 × 224 pixels with 3 color channels) from the ImageNet dataset [15]. The current version of the ResNet50 network can classify given input images into 1000 different classes.
The overall design scheme for the hardware accelerator used in this paper is depicted in Figure 2. This design includes the DPU core(s) used for the inference computation of the target NN. The DPU core is responsible for the computation of pixels in the feature maps of input images. The number of Processing Elements (PEs) inside the convolution engine is equivalent to the pixel parallelism of the feature map. The parameters of the implemented NN are stored in off-chip memory (DDR4 memory) and are transmitted to the PL through the Full Power Domain (FPD) AXI connections. To boost the performance of the accelerator, the connections between the hardware logic in the FPGA and the off-chip memory are built through a Direct Memory Access (DMA) system, so that the PL can transfer data directly to or from external memory. The ARM processors in the APU subsystem are responsible for data coordination as well as the initialization required for the tasks given to the DPU core(s).
The DNN implementations on the Edge and in the Cloud are presented in this Section. Moreover, the environment setup for the Edge and Cloud components, the FPGA and GPU features, and the model used for executing the DNN are described in the related parts.
A. Deep Neural Network Implementation over Edge
The implementation details of a DNN on an edge-based device are described in this Subsection.
1) Physical Infrastructure: The edge device used in this set-up is the ZCU102 evaluation board, an embedded FPGA from Xilinx Inc. This board is an MPSoC with the Zynq UltraScale+ architecture. However, the overall workflow presented in this paper can be replicated on any Zynq or Zynq UltraScale+ based evaluation or custom board (for example, Zynq-7000 ARM-FPGA SoC boards). The main reason for selecting a Zynq UltraScale+ system is the scalability of the MPSoC family of boards, which is an important factor in creating smart networks. Moreover, the ZCU102 provides more logic resources in the FPGA (including DSP blocks, which are the main computational units) for performing the computations required by the inference graph of DNNs. Another reason for selecting the Zynq UltraScale+ over the Zynq architecture is the possibility of implementing the SoftMax activation function at the hardware level. The hardware implementation of this activation function can boost the performance of the system considerably compared with the equivalent function at the software level. The ZCU102 contains ARM Cortex processors in the Application Processing Unit (APU) and Real-time Processing Unit (RPU) sub-systems, which are implemented in the PS. Moreover, a mobile GPU is included on the PS side of the board. The system specifications of the Xilinx ZCU102 MPSoC are given in Table I.
TABLE I: Embedded FPGA Hardware Specification
Processing System (PS):
  APU                  Quad-core ARM Cortex-A53
  RPU                  Dual-core ARM Cortex-R5
  GPU                  Mali-400
Programmable Logic (PL):
  System Logic Cells   600 K
  Block RAM (BRAM)     32.1 MB
  DSP Slices           2520
Memory types:
  PS-Side DDR4         4 GB, 64-bit
  PL-Side DDR4         512 MB, 16-bit
[Figure 2: block diagram. Block labels: Programmable Logic (PL); DPU Cores (Computing Engine, Instruction Fetcher, DPU Scheduler); High Performance AXI Masters; Processing System (PS); Quad cores ARM Cortex-A53 with L1 & L2 Cache; Interrupt Unit; DDR Memory Controller; High Performance AXI Slave; Full Power Domain (FPD); Low Power Domain (LPD); Off-chip DDR4 Memory (storage for weights, biases, and intermediate features of the DNN). Connection labels: DPU Interrupts; Read / Write DNN parameters and DPU Instructions; Data Coordination; DPU Task Initialization.]
Fig. 2: The hardware accelerator design for implementing a DNN on the target MPSoC
2) DNN Implementation over Edge device: The implementation results for the ResNet50 DNN on the Xilinx ZCU102 MPSoC board are shown in Table II. In this work, several test scenarios are implemented on the target board using different sizes and numbers of DPU cores. The achievable throughput and power consumption of the system depend on the actual requirements of a project; these test scenarios assume that a power budget of around 10 to 20 Watts is available and that the target throughput of the system is around 100 to 200 Images per Second (Img/S). Values above this range can be achieved by utilizing more logic resources on the FPGA, by using more expensive and advanced embedded FPGAs, or by providing a bigger power supply with a proper cooling system. One important design parameter for these experiments is the convolution architecture of the DPU core. According to the Xilinx documentation [13], this design parameter determines the number of operations the DPU core can perform in one Clock Cycle (CC). For example, the 1/B4096 configuration denotes 1 DPU core capable of performing 4096 operations in one CC. The bigger the convolution architecture, the more logic resources are used to create that DPU core. The maximum number of implementable DPU cores is limited by the high DSP block usage in the internal structure of this AI unit. The current version of this soft IP core supports up to 4 homogeneous cores. The results of these experiments show a direct relationship between the power consumption and the DPU core size. The number of test images used for each scenario is 2000.
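As a quick consistency check on the figures reported in Table II, the sketch below computes the peak operations per clock cycle of each configuration and derives the throughput from the number of test images and the latency. It assumes that the latency column is the total time for all 2000 images; this reading is inferred from the numbers, not stated in the text, and the variable and function names are hypothetical.

```python
NUM_IMAGES = 2000  # test images per scenario, as stated in the text

# (DPU cores, ops per clock cycle per core, power in W,
#  latency in s, reported throughput in Img/S), copied from Table II
TABLE_II = [
    (1, 4096, 9.97, 22.59, 88.5),
    (2, 4096, 16.29, 12.40, 161.2),
    (3, 4096, 22.86, 9.73, 205.4),
    (4, 2304, 21.98, 11.68, 171.1),
]

def peak_ops_per_cycle(cores, ops_per_core):
    """Upper bound on operations per clock cycle across all DPU cores."""
    return cores * ops_per_core

def throughput_from_latency(num_images, latency_s):
    """Images per second if latency_s covers the whole image set."""
    return num_images / latency_s

derived_rows = []
for cores, arch, power_w, latency_s, reported in TABLE_II:
    derived = throughput_from_latency(NUM_IMAGES, latency_s)
    derived_rows.append((cores, arch, derived, reported))
    print(f"{cores}/B{arch}: {peak_ops_per_cycle(cores, arch)} ops/CC, "
          f"derived {derived:.1f} Img/S, reported {reported} Img/S")
```

Under this assumption the derived and reported throughputs agree to within rounding, and the 4/B2304 configuration, with 9216 peak operations per cycle, falls between the 2/B4096 and 3/B4096 configurations in both peak operations and measured throughput.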
B. Deep Neural Network Implementation over Cloud
A detailed description of the DNN implementation on a Cloud-based GPU cluster infrastructure is given in this Subsection.
TABLE II: Power, Latency and Throughput of the Edge device
DPU Cores/Arch   Power (W)   Latency (s)   Throughput (Img/S)
1/B4096           9.97       22.59          88.5
2/B4096          16.29       12.40         161.2
3/B4096          22.86        9.73         205.4
4/B2304          21.98       11.68         171.1
1) Physical Infrastructure: The experimental setup consists of a Cloud computing cluster with GPU servers, namely GTX1080Ti and RTX A6000. Table III provides the overall specifications of the GPUs used in the implementation. Slurm [16] is used for managing compute resources, scheduling jobs, and executing them on worker nodes; it provides fault-tolerant and highly scalable cluster management.
TABLE III: GPU Cluster Hardware Specification
Specifications     GTX1080Ti   RTX A6000
Architecture       Pascal      Ampere
GPU Memory (GB)    11          48
GPU per node       8           8
CPU per GPU        5-9         12
2) DNN Implementation over GPU Cluster: On the deployed cluster, the DNN is run in a containerized environment created with enroot, which is administrated and monitored through the Slurm cluster. The framework is built using Keras with the TensorFlow [17] backend to run the image classification model based on ResNet50. The ResNet50 model employed here is the same pre-trained model used in the Edge implementation. The GPUs were subjected to iterative image prediction runs, and the throughput as well as the total power consumption are recorded in Table IV. The implementation is tested on 2 different GPU servers, using GPUs in sets of 2 and 4 along with 2 CPUs. The CPU used alongside the GTX1080Ti GPU is the Intel(R) Xeon(R) CPU E5-2630 v4, and the CPU used with the RTXA6000 GPU is the AMD(R) EPYC 7F72 24-Core Processor. The overall latency is the total time taken for resource allocation, creation of the containerized environment, initialization, and the final image prediction. The GPUs are run with a batch size of 8 to obtain a comparison that is close to the edge device. A larger batch size yields higher throughput through higher GPU utilization, at the cost of much higher power consumption. Thus, the experiments performed here use a lower share of the GPU capacity and provide a lower throughput. The total number of test images for each experiment is 2000.
First, the system is tested with 2 GPUs and 2 CPUs allocated on the RTXA6000 and GTX1080Ti servers; the throughputs obtained are 180 Img/S and 175 Img/S, with latencies of 65 s and 46 s, respectively. Then, the number of allocated GPUs is increased to 4, and throughputs of 187 Img/S and 137 Img/S are obtained. One important factor in the power consumption of these systems is the GPU utilization: the test scenario with 2/RTXA6000 has higher utilization than 4/RTXA6000, which is the reason for its higher power consumption.
TABLE IV: Power, Latency and Throughput of the GPU
Num/GPU        Power (W)   Latency (s)   Throughput (Img/S)
2/RTXA6000     215.8       65            180
4/RTXA6000     174.4       83            187
2/GTX1080Ti     95.9       46            175
4/GTX1080Ti    138.4       68            137
C. Comparison results of DNN Implementation over Edge versus GPU Cluster
The results obtained from implementing our target DNN (classic ResNet50) on the Edge (embedded FPGA device) versus the GPU cluster (as the Cloud tool) show drastically lower latency for the Edge solution. Figure 3 shows the latency comparison of these two methodologies. The main reason for the lower latency of embedded FPGAs is the negligible additional overhead for data transmission and initialization on the target board. Therefore, on an Edge device, the latency equals the time it takes to compute the input data on the board (computation latency). In a Cloud implementation, however, there are further latency sources: data transmission over the Cloud, creation of new instances, and initialization of values. In this work, all the additional latency sources in the Cloud are called “Transmission and initialization latency”. Our experiments show the high flexibility and low latency of Edge devices using hardware accelerators built on FPGA logic units. Figure 4 shows the detailed latency sources for the Cloud tool.
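The split between the two latency sources can be approximated from Table IV alone. The sketch below assumes that the reported throughput was measured over the computation phase only, so that the computation latency is roughly the number of test images divided by the throughput and the remainder of the overall latency is transmission and initialization. This assumption is made for illustration; the exact values plotted in Figure 4 are not available here, and the function and variable names are hypothetical.

```python
NUM_IMAGES = 2000  # test images per experiment, as stated in the text

# name -> (overall latency in s, throughput in Img/S), copied from Table IV
TABLE_IV = {
    "2/RTXA6000": (65.0, 180.0),
    "4/RTXA6000": (83.0, 187.0),
    "2/GTX1080Ti": (46.0, 175.0),
    "4/GTX1080Ti": (68.0, 137.0),
}

def split_latency(total_latency_s, throughput_img_s, num_images=NUM_IMAGES):
    """Return (computation, transmission_and_initialization) in seconds.

    Assumes throughput covers only the computation phase.
    """
    computation = num_images / throughput_img_s
    overhead = total_latency_s - computation
    return computation, overhead

latency_split = {
    name: split_latency(latency_s, throughput)
    for name, (latency_s, throughput) in TABLE_IV.items()
}

for name, (computation, overhead) in latency_split.items():
    print(f"{name}: computation {computation:.1f} s, "
          f"transmission and initialization {overhead:.1f} s")
```

Under this reading, the transmission and initialization share exceeds the computation share in every Cloud configuration, which is consistent with the explanation given above for the lower Edge latency.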
[Figure 3: plot; axis label: Overall latency (Second).]
Fig. 3: Latency comparison between the Edge and Cloud
[Figure 4: plot for 2/RTXA6000, 4/RTXA6000, 2/GTX1080Ti, and 4/GTX1080Ti; series: Computation latency (Second), Transmission and Initialization latency (Second).]
Fig. 4: Latency sources in Cloud implementations
Moreover, embedded FPGAs are more energy-efficient than GPUs, and this design parameter is decisive for mobile projects with limited power sources. Notably, GPU clusters are in principle able to achieve much higher throughput than in our test scenarios, but this comes with a much higher power consumption of the system. Increasing the batch size leads to higher GPU utilization, which results in higher throughput. For instance, in our experiment, increasing the batch size from 8 to 200 raises the throughput from 137 Img/S to 751.49 Img/S on the 4/GTX1080Ti configuration; however, the total power consumption rises from 138.4 Watts to 747 Watts. Replicating the same experiment on the 4/RTXA6000 configuration, the throughput jumps from 187 Img/S to 1886.13 Img/S, while the overall consumed power rises from 174.4 Watts to 899 Watts. Considering the power budget as well as the practically required throughput of the DNN, our experiments show better energy efficiency for the Edge device in all test cases.
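The energy-efficiency comparison can be made explicit by dividing throughput by power, which gives images per Joule (Img/S per Watt). The sketch below uses only the values reported in Tables II and IV and the two batch-size-200 results quoted in the text; the metric choice and the variable names are assumptions for illustration.

```python
# name -> (throughput in Img/S, power in W)
EDGE = {  # from Table II
    "1/B4096": (88.5, 9.97),
    "2/B4096": (161.2, 16.29),
    "3/B4096": (205.4, 22.86),
    "4/B2304": (171.1, 21.98),
}
GPU = {  # from Table IV (batch size 8) and the batch-size-200 runs in the text
    "2/RTXA6000": (180.0, 215.8),
    "4/RTXA6000": (187.0, 174.4),
    "2/GTX1080Ti": (175.0, 95.9),
    "4/GTX1080Ti": (137.0, 138.4),
    "4/GTX1080Ti, batch 200": (751.49, 747.0),
    "4/RTXA6000, batch 200": (1886.13, 899.0),
}

def images_per_joule(throughput_img_s, power_w):
    """Energy efficiency: images processed per Joule of energy."""
    return throughput_img_s / power_w

edge_eff = {name: images_per_joule(*row) for name, row in EDGE.items()}
gpu_eff = {name: images_per_joule(*row) for name, row in GPU.items()}

for name, value in {**edge_eff, **gpu_eff}.items():
    print(f"{name}: {value:.2f} images per Joule")
```

With these numbers, every Edge configuration processes several times more images per Joule than any of the GPU configurations, including the large-batch runs.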
In this paper, we proposed a Deep Neural Network implementation strategy on Edge devices using a Multiprocessor System on Chip. Our experiments show that using Edge devices for Artificial Intelligence inference is superior to the Cloud implementation of the same network, not only in latency but also in the energy efficiency of the system. One of the main components of the hardware accelerator in this work is the dedicated Xilinx soft IP core for deep learning (Xilinx DPU). The presented system accelerates the computation of DNNs using Digital Signal Processor tiles and logic gates on the FPGA side, while ARM processors handle control and coordination. In future research, the possibility of implementing multiple DNNs on MPSoCs will be examined to observe the effect of co-implementing DNNs. Another possibility for future work is using Xilinx Versal boards [18], the next generation of AI devices with dedicated AI engines instead of soft IP cores.
[1] R. Raina, A. Madhavan, and A. Y. Ng, “Large-scale deep unsupervised learning using graphics processors,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 873–880. DOI:
[2] “IMT Traffic estimates for the years 2020 to 2030,” International Telecommunication Union, M Series Mobile, radiodetermination, amateur and related satellite services, Report ITU-R M.2370-0, 2015.
[3] W. Jiang, B. Han, M. A. Habibi, and H. D. Schotten, “The Road Towards 6G: A Comprehensive Survey,” IEEE Open Journal of the Communications Society, vol. 2, pp. 334–366, 2021. DOI: 10.1109/OJCOMS.
[4] Y. Huang, X. Ma, X. Fan, J. Liu, and W. Gong, “When deep learning meets edge computing,” in 2017 IEEE 25th international conference on network protocols (ICNP), IEEE, Toronto, ON, Canada, 2017, pp. 1–2. DOI: 10.1109/ICNP.2017.8117585.
[5] C. Lipps, P. Ahr, M. Strufe, and H. D. Schotten, “The PhySec Thing: About Trust and Security in Industrial IoT Systems,” Journal of Information Warfare (JIW), vol. 19, no. 3, pp. 35–49, 2020, ISSN Print: 1445-3312; ISSN Online: 1445-3347.
[6] E. Säckinger, B. E. Boser, J. M. Bromley, Y. LeCun, and L. D. Jackel, “Application of the ANNA neural network chip to high-speed character recognition,” IEEE Transactions on Neural Networks, vol. 3, no. 3, pp. 498–505, 1992. DOI: 10.1109/72.129422.
[7] E. Li, Z. Zhou, and X. Chen, “Edge intelligence: On-demand deep learning model co-inference with device-edge synergy,” MECOMM 2018 - Proceedings of the 2018 Workshop on Mobile Edge Communications, Part of SIGCOMM 2018, pp. 31–36, 2018. DOI: 10.1145/3229556.3229562. arXiv: 1806.07840.
[8] E. Li, L. Zeng, Z. Zhou, and X. Chen, “Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing,” IEEE Transactions on Wireless Communications, vol. 19, no. 1, pp. 447–457, 2020. DOI: 10.1109/TWC.2019.2946140. arXiv: 1910.05316.
[9] S. Teerapittayanon, B. McDanel, and H.-T. Kung, “Branchynet: Fast inference via early exiting from deep neural networks,” in 2016 23rd International Conference on Pattern Recognition (ICPR), IEEE, Cancun, Mexico, 2016, pp. 2464–2469. DOI: 10.1109/ICPR.
[10] X. Xu, Q. Lu, T. Wang, Y. Hu, C. Zhuo, J. Liu, and Y. Shi, “Efficient hardware implementation of cellular neural networks with incremental quantization and early exit,” ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 14, no. 4, pp. 1–20, 2018. DOI: 10.1145/3264817.
[11] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, “From high-level deep neural models to FPGAs,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE, Oct. 2016, pp. 1–12. DOI: 10.1109/MICRO.2016.7783720. [Online]. Avail-
[12] X. He, J. Liu, Z. Xie, H. Chen, G. Chen, W. Zhang, and D. Li, “Enabling energy-efficient DNN training on hybrid GPU-FPGA accelerators,” in Proceedings of the ACM International Conference on Supercomputing, 2021, pp. 227–241. DOI: 10.1145/3447818.3460371.
[13] Zynq DPU v3.3: Product Guide, PG338 (v3.3), Xilinx,
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition, IEEE, 2009, pp. 248–255.
[16] Slurm workload manager Version 20.02.7. [Online].
[17] TensorFlow, https://www.tensorflow.org/, 2015. DOI:
[18] K. Vissers, “Versal: The Xilinx adaptive compute acceleration platform (ACAP),” in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2019, pp. 83–83.
... Depending on how many server clusters (especially from different wind parks) are working collaboratively, the requirements for the global model architecture and model management can vary widely: For communication within a cluster, there is most likely a high-performance bandwidth, regardless of whether individual components can be connected by cable or exchange their parameters and data using new 6G technology such as THz communication for further distributed systems [14] [15]. Therefore for the intra-cluster communication message compressing can be mostly dismissed. ...
Full-text available
Main design criteria within the development of 6G are sustainability and resource-efficient application. With advances in Artificial Intelligence (AI) methods towards increasingly larger and more complex Neural Network (NN) architectures, the question arises of how the ever-greater computational effort and the associated power consumption can be compatible with sustainability and environmental requirements. Therefore, in this work, the utilization of excess renewable energy sources to operate server clusters for the computation of AI in the context of autonomous processing in upcoming 6G frameworks is discussed. The current trend toward larger AI models and the necessity of collaborative work is highlighted. In particular, a vision of incorporating small-scale computational units via wireless communication into holistic collaborative schemes in upcoming wireless networks is presented. Since wind power is, analogous to solar energy unreliable and the production depends on natural factors, the requirements and challenges for existing frameworks are elaborated. In this context, synthetic data of the Baltic Eagle windfarm in the northeast of Germany is utilized to illustrate the energy fluctuations of surplus energy and then derive the requirements for predictive AI and cluster organization. For that purpose, a heuristic simulation framework is designed investigating the impact of prediction accuracy for server failure due to such fluctuations utilizing Autoregressive Integrated Moving Average (ARIMA) as a baseline for future research.
Full-text available
As of today, the fifth generation (5G) mobile communication system has been rolled out in many countries and the number of 5G subscribers already reaches a very large scale. It is time for academia and industry to shift their attention towards the next generation. At this crossroad, an overview of the current state of the art and a vision of future communications are definitely of interest. This article thus aims to provide a comprehensive survey to draw a picture of the sixth generation (6G) system in terms of drivers, use cases, usage scenarios, requirements, key performance indicators (KPIs), architecture, and enabling technologies. First, we attempt to answer the question of “Is there any need for 6G?" by shedding light on its key driving factors, in which we predict the explosive growth of mobile traffic until 2030, and envision potential use cases and usage scenarios. Second, the technical requirements of 6G are discussed and compared with those of 5G with respect to a set of KPIs in a quantitative manner. Third, the state-of-the-art 6G research efforts and activities from representative institutions and countries are summarized, and a tentative roadmap of definition, specification, standardization, and regulation is projected. Then, we identify a dozen of potential technologies and introduce their principles, advantages, challenges, and open research issues. Finally, the conclusions are drawn to paint a picture of “What 6G may look like?". This survey is intended to serve as an enlightening guideline to spur interests and further investigations for subsequent research and development of 6G communications systems.
Cellular neural networks (CeNNs) have been widely adopted in image processing tasks. Recently, various hardware implementations of CeNNs have emerged in the literature, with Field Programmable Gate Array (FPGA) being one of the most popular choices due to its high flexibility and low time-to-market. However, CeNNs typically involve extensive computations in a recursive manner. As an example, simply processing an image of 1,920 × 1,080 pixels requires 4–8 Giga floating point multiplications (for 3 × 3 templates and 50–100 iterations), which need to be done in a timely manner for real-time applications. To address this issue, in this article, we propose a compressed CeNN framework for efficient FPGA implementations. It involves various techniques, such as incremental quantization and early exit, which significantly reduce computation demands while maintaining an acceptable performance. In particular, incremental quantization quantizes the numbers in CeNN templates to powers of two, so that complex and expensive multiplications can be converted to simple and cheap shift operations, which only require a minimum number of registers and logical elements (LEs). While a similar concept has been explored in hardware implementations of Convolutional Neural Networks (CNNs), CeNNs have completely different computation patterns, which require different quantization and implementation strategies. Experimental results on FPGAs show that incremental quantization and early exit can achieve a speedup of up to 7.8× and 8.3×, respectively, compared with the state-of-the-art implementations, with almost no performance loss on four widely adopted applications. We also discover that, unlike for CNNs, the optimal quantization strategies of CeNNs depend heavily on the applications. We hope that our work can serve as a pioneer in the hardware optimization of CeNNs.
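The power-of-two idea in the abstract above can be illustrated with a minimal Python sketch. It shows only the final rounding step and the resulting shift-based multiply, not the incremental (staged) quantization procedure of the cited work. The function names `quantize_pow2` and `shift_multiply` are hypothetical, and rounding the exponent in the log domain is an assumption made for simplicity.

```python
import math
from typing import Tuple


def quantize_pow2(w: float) -> Tuple[int, int]:
    """Approximate w by sign * 2**exponent (assumed log-domain rounding)."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exponent = round(math.log2(abs(w)))
    return sign, exponent


def shift_multiply(x: int, sign: int, exponent: int) -> int:
    """Multiply integer x by sign * 2**exponent using only shifts."""
    if exponent >= 0:
        return sign * (x << exponent)
    # Negative exponents become right shifts (truncating division).
    return sign * (x >> -exponent)


# A template weight of 0.25 becomes a right shift by two bits.
sign, exponent = quantize_pow2(0.25)
pixel = 12
product = shift_multiply(pixel, sign, exponent)
```

In hardware the same replacement removes the multiplier entirely, which is the source of the register and logic-element savings the abstract mentions.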
As a key technology for enabling Artificial Intelligence (AI) applications in the 5G era, Deep Neural Networks (DNNs) have quickly attracted widespread attention. However, it is challenging to run computation-intensive DNN-based tasks on mobile devices due to the limited computation resources. What’s worse, traditional cloud-assisted DNN inference is heavily hindered by the significant wide-area network latency, leading to poor real-time performance as well as low quality of user experience. To address these challenges, in this paper, we propose Edgent, a framework that leverages edge computing for DNN collaborative inference through device-edge synergy. Edgent exploits two design knobs: (1) DNN partitioning that adaptively partitions computation between device and edge for the purpose of coordinating the powerful cloud resource and the proximal edge resource for real-time DNN inference; (2) DNN right-sizing that further reduces computing latency via early exiting inference at an appropriate intermediate DNN layer. In addition, considering the potential network fluctuation in real-world deployment, Edgent is designed to handle both static and dynamic network environments. Specifically, in a static environment where the bandwidth changes slowly, Edgent derives the best configurations with the assistance of regression-based prediction models, while in a dynamic environment where the bandwidth varies dramatically, Edgent generates the best execution plan through an online change point detection algorithm that maps the current bandwidth state to the optimal configuration. We implement an Edgent prototype based on a Raspberry Pi and a desktop PC, and extensive experimental evaluations demonstrate Edgent’s effectiveness in enabling on-demand low-latency edge intelligence.
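The DNN partitioning knob described above can be sketched as a search over split points. This is an illustrative sketch, not Edgent's actual algorithm: the function `best_partition`, the convention that the first `k` layers run on the device and the rest on the edge, and the per-layer latency and output-size inputs are all assumptions. In the abstract, such latencies come from regression-based prediction models.

```python
from typing import List, Tuple


def best_partition(
    device_ms: List[float],      # assumed per-layer latency on the device
    edge_ms: List[float],        # assumed per-layer latency on the edge server
    out_bytes: List[float],      # assumed output size of each layer
    input_bytes: float,          # size of the raw input
    bandwidth_bytes_per_ms: float,
) -> Tuple[int, float]:
    """Return (k, latency): run layers [0, k) on the device, [k, n) on the edge."""
    n = len(device_ms)
    best_k, best_latency = 0, float("inf")
    for k in range(n + 1):
        compute = sum(device_ms[:k]) + sum(edge_ms[k:])
        if k == n:
            transfer = 0.0  # everything runs locally, nothing is sent
        else:
            payload = input_bytes if k == 0 else out_bytes[k - 1]
            transfer = payload / bandwidth_bytes_per_ms
        total = compute + transfer
        if total < best_latency:
            best_k, best_latency = k, total
    return best_k, best_latency


# Hypothetical three-layer network on a slow link and on a fast link.
slow = best_partition([10, 20, 30], [1, 2, 3], [1000, 100, 10], 5000, 100)
fast = best_partition([10, 20, 30], [1, 2, 3], [1000, 100, 10], 5000, 10000)
```

With these made-up numbers, the slow link favours running the first layer locally so that less data is sent, while the fast link favours offloading the whole network.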
Conference Paper
In this presentation I will present the new Adaptive Compute Acceleration Platform. I will show the overall system architecture of the family of devices including the Arm cores (scalar engines), the programmable logic (Adaptable Engines) and the new vector processor cores (AI engines). I will focus on the new AI engines in more detail and show the architecture, the integration in the total device, the programming environment and some applications, including Machine Learning and 5G wireless applications.
Conference Paper
As the backbone technology of machine learning, deep neural networks (DNNs) have quickly ascended to the spotlight. Running DNNs on resource-constrained mobile devices is, however, by no means trivial, since it incurs high performance and energy overhead. Offloading DNNs to the cloud for execution, meanwhile, suffers from unpredictable performance due to the uncontrolled long wide-area network latency. To address these challenges, in this paper, we propose Edgent, a collaborative and on-demand DNN co-inference framework with device-edge synergy. Edgent pursues two design knobs: (1) DNN partitioning that adaptively partitions DNN computation between device and edge, in order to leverage hybrid computation resources in proximity for real-time DNN inference; (2) DNN right-sizing that accelerates DNN inference through early exit at a proper intermediate DNN layer to further reduce the computation latency. The prototype implementation and extensive evaluations based on Raspberry Pi demonstrate Edgent's effectiveness in enabling on-demand low-latency edge intelligence.
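The right-sizing knob can be read as choosing the most accurate early-exit point that still meets a latency deadline. The sketch below assumes that reading. The function `choose_exit`, the tuple layout, and the example numbers are hypothetical and are not taken from the Edgent paper.

```python
from typing import List, Optional, Tuple

# Each exit point: (name, estimated latency in ms, estimated accuracy).
ExitPoint = Tuple[str, float, float]


def choose_exit(exits: List[ExitPoint], deadline_ms: float) -> Optional[ExitPoint]:
    """Pick the most accurate exit whose latency fits the deadline."""
    feasible = [e for e in exits if e[1] <= deadline_ms]
    if not feasible:
        return None  # no exit point can meet the deadline
    return max(feasible, key=lambda e: e[2])


# Made-up exit points for illustration only.
exits = [("exit1", 5.0, 0.70), ("exit2", 12.0, 0.80), ("exit3", 30.0, 0.90)]
selected = choose_exit(exits, deadline_ms=15.0)
```

A tighter deadline pushes the choice to an earlier, less accurate exit, which is the latency-for-accuracy trade-off the abstract describes.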