Convolutional neural networks (CNNs) have achieved impressive results in image classification and object detection. On hardware with limited resources, it is difficult to run inference for CNNs with large numbers of parameters without external storage. Model parallelism is an effective way to reduce per-device resource usage by distributing CNN inference across several devices. However, parallelizing a CNN model is not straightforward, because CNN models are inherently tightly coupled. In this work, we propose a novel model parallelism method that decouples the CNN structure using group convolution and a new channel shuffle procedure. Our method eliminates inter-device synchronization while reducing the memory footprint of each device. Using the proposed model parallelism method, we designed a parallel FPGA accelerator for the classic CNN model ShuffleNet. This accelerator was further optimized with features such as aggregate read and kernel vectorization to fully exploit the hardware-level parallelism of the FPGA. We conducted experiments with ShuffleNet on two FPGA boards, each equipped with an Intel Arria 10 GX1150 and 16 GB of DDR3 memory. The results show that, using two devices, ShuffleNet achieved a 1.42× speedup and reduced its memory footprint by 34% compared to its non-parallel counterpart, while maintaining accuracy.
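As a rough illustration of the decoupling idea, the PyTorch sketch below combines a grouped convolution, whose channel groups are independent and could therefore be placed on different devices, with the standard ShuffleNet channel shuffle; it is not the paper's modified, synchronization-free shuffle, and the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Standard ShuffleNet shuffle: reshape to (N, g, C/g, H, W), swap the
    # group and per-group channel axes, then flatten back to (N, C, H, W).
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# A grouped convolution keeps each group's channels independent, which is
# what allows different groups to be mapped onto different devices.
conv = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3,
                 padding=1, groups=2, bias=False)

x = torch.randn(1, 32, 56, 56)          # arbitrary input size (assumption)
y = channel_shuffle(conv(x), groups=2)  # shuffle re-mixes information across groups
print(y.shape)                          # torch.Size([1, 32, 56, 56])
```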
This study addresses the problem of real-time tracking of high-speed ballistic targets. Particle filters (PFs) can handle the nonlinearity of the motion and measurement models of ballistic targets. However, applying PFs to real-time systems is challenging because they generally require significant computation time. Most existing methods that accelerate PFs on a graphics processing unit (GPU) for target tracking focus on the weight-computation and resampling steps. However, the computation time of each step varies between applications; in this work, we show that the model propagation step dominates the computation time and propose an accelerated PF that parallelizes this step. The real-time performance of the proposed method was tested and analyzed on an embedded system. Compared with a conventional PF on a central processing unit (CPU), the proposed method reduces computation time by at least a factor of 10, improving real-time performance.
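A minimal numpy sketch of the propagation (time-update) step identified above as the bottleneck: all particles are advanced in one vectorized call instead of a Python loop, which is the same data-parallel structure a GPU kernel exploits. The constant-velocity motion model, noise level, and particle count are assumptions for illustration, not the paper's ballistic model.

```python
import numpy as np

def propagate(particles: np.ndarray, dt: float, accel_std: float,
              rng: np.random.Generator) -> np.ndarray:
    """Advance all particles through an assumed constant-velocity model.

    particles: (N, 4) array of [x, y, vx, vy] states. The whole batch is
    updated at once, mirroring the per-particle work a GPU kernel would do.
    """
    pos = particles[:, :2] + particles[:, 2:] * dt
    vel = particles[:, 2:] + rng.normal(0.0, accel_std, particles[:, 2:].shape) * dt
    return np.hstack([pos, vel])

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=(100_000, 4))  # assumed particle count
particles = propagate(particles, dt=0.01, accel_std=5.0, rng=rng)
```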
Satellite communication, especially via low Earth orbit (LEO) satellites, is an important part of the space-air-ground integrated network and a promising way to provide global coverage and reliable communication. However, links that are initially connected are switched off when satellites move into the polar regions, and the number of broken paths grows dramatically as the number of satellites increases. Designing an efficient and stable inter-satellite routing algorithm therefore remains a considerable challenge. This paper presents a waypoint segment routing algorithm that maintains persistent routes when satellites move into the polar regions. The waypoint selection problem is modelled and solved in stages. Segment routing, a source-based routing approach, is applied to the LEO satellite network to provide stable, efficient, and reliable packet forwarding. By adding the selected waypoints to the segment list, links that will break in the next snapshot can be avoided, so the proposed waypoint segment routing algorithm effectively resolves the path-breakage problem. Comparisons of packet loss, delay, and delay jitter in simulated satellite networks show that the proposed algorithm outperforms the popular optimised link-state routing algorithm for satellite mega-constellation networks.
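The waypoint idea can be sketched as stitching together per-segment shortest paths through the selected waypoints. The hop-count BFS and toy topology below are illustrative assumptions and say nothing about how the waypoints themselves are chosen.

```python
from collections import deque

def shortest_path(adj, src, dst):
    """Hop-count BFS shortest path on an adjacency dict (illustrative only)."""
    prev, queue, seen = {src: None}, deque([src]), {src}
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                prev[v] = u
                queue.append(v)
    if dst not in prev:
        return []                      # unreachable under the assumed topology
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def segment_route(adj, src, dst, segment_list):
    """Concatenate shortest paths through the waypoints in the segment list."""
    hops, route = [src, *segment_list, dst], []
    for a, b in zip(hops, hops[1:]):
        leg = shortest_path(adj, a, b)
        route.extend(leg if not route else leg[1:])  # avoid duplicating joints
    return route

# Toy topology (assumption): forcing the route through waypoint 'C'.
adj = {'A': ['B', 'C'], 'B': ['A', 'D'], 'C': ['A', 'D'],
       'D': ['B', 'C', 'E'], 'E': ['D']}
print(segment_route(adj, 'A', 'E', segment_list=['C']))  # ['A', 'C', 'D', 'E']
```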
This paper addresses the problem of real-time model predictive control (MPC) for integrated guidance and control (IGC) of missile systems. When the primal-dual interior point method (PD-IPM), a convex optimization method, is used as the optimization solver for MPC, its real-time performance degrades because of the time spent checking the Karush–Kuhn–Tucker (KKT) conditions. This paper proposes a graphics processing unit (GPU)-based method to parallelize and accelerate PD-IPM for real-time MPC. The real-time performance of the proposed method was tested and analyzed on a widely used embedded system. Comparisons with the conventional PD-IPM and other methods show that the proposed method significantly reduces computation time and thereby improves real-time performance.
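For an inequality-constrained quadratic program (minimize 0.5·xᵀQx + qᵀx subject to Gx + s = h, s ≥ 0), the KKT check identified above as the bottleneck amounts to evaluating a few independent residual vectors, which is what makes it amenable to GPU parallelization. The numpy sketch below assumes this QP form and an arbitrary tolerance; it is not the paper's implementation.

```python
import numpy as np

def kkt_residuals(Q, q, G, h, x, lam, s, mu):
    """Residuals of the perturbed KKT system for the assumed QP form.

    The three matrix-vector products are independent of one another, so each
    can be evaluated in parallel (e.g. one GPU kernel per residual).
    """
    r_dual = Q @ x + q + G.T @ lam   # stationarity
    r_prim = G @ x + s - h           # primal feasibility
    r_comp = lam * s - mu            # perturbed complementarity
    return r_dual, r_prim, r_comp

def kkt_satisfied(residuals, tol=1e-8):
    # Converged when every residual is small in the infinity norm.
    return all(np.linalg.norm(r, np.inf) <= tol for r in residuals)
```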
Artificial intelligence based on deep learning has gained popularity in a broad range of applications. Software libraries and frameworks for deep learning provide developers with tools for fast deployment, hiding the algorithmic complexity of training and inference for large neural networks. These frameworks mitigate the computational cost of such algorithms by interfacing with parallel computing libraries for specific graphics processing units, which are not available on all platforms, especially embedded ones. The framework we propose in this paper enables fast prototyping of custom hardware accelerators for deep learning. In particular, we describe how to design, evaluate, and deploy accelerators for PyTorch applications written in Python and running on PYNQ-compatible platforms, which are based on Xilinx Zynq Systems on Chip. This approach does not require traditional ASIC-style design tools; rather, it simplifies the interfacing between the hardware and software components of the neural network and includes support for deployment on embedded platforms. As an example, we use this framework to design hardware accelerators for a complex sound synthesis algorithm based on a recurrent neural network.
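As a rough indication of the target workload, the PyTorch sketch below defines a small GRU-based model standing in for a sound-synthesis network; it is a generic module with assumed layer sizes, not the proposed framework or its accelerator interface.

```python
import torch
import torch.nn as nn

class TinySynthRNN(nn.Module):
    """Small recurrent model standing in for a sound-synthesis network.

    In an accelerator flow, the compute-heavy recurrent layer is the natural
    candidate to offload to programmable logic, while the rest stays in Python.
    """
    def __init__(self, in_dim=16, hidden=64, out_dim=1):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, x):                    # x: (batch, time, in_dim)
        h, _ = self.rnn(x)
        return self.head(h)                  # one output sample per time step

model = TinySynthRNN()
audio = model(torch.randn(1, 128, 16))       # (1, 128, 1) synthesized frames
```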
The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether this harms performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation. We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.
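The change Double DQN makes to the bootstrap target can be stated in a few lines: the online network selects the next action and the target network evaluates it. A minimal PyTorch sketch, assuming batched tensors, a float `dones` mask, and an arbitrary discount factor:

```python
import torch

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """r + gamma * Q_target(s', argmax_a Q_online(s', a)), zeroed at terminal states."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
    return rewards + gamma * next_q * (1.0 - dones)
```

Decoupling action selection from action evaluation is what removes the upward bias of taking a max over noisy estimates from a single network.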
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
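For contrast with the Double DQN sketch above, the original target takes a single max over the same network's own estimates of the next state. A minimal PyTorch sketch under the same tensor-shape assumptions (later DQN variants add a separate target network, omitted here):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """Squared error between Q(s, a) and the bootstrapped target r + gamma * max_a' Q(s', a')."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * q_net(next_states).max(dim=1).values * (1.0 - dones)
    return F.mse_loss(q_sa, target)
```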