Towards Resilient yet Efficient Parallel Execution of Convolutional Neural Networks
Qingchuan Shi, Hamza Omar, Omer Khan
University of Connecticut, Storrs, CT, USA
Abstract—Machine learning is increasingly being deployed in many safety-critical and real-time applications, such as unmanned aerial vehicles (UAVs) [1] and self-driving cars [2], where the systems can be exposed to harsh environments. Due to the unique structure and computational behavior of machine learning workloads, research has been done on relaxing their accuracy for performance benefits. We observe that not all transient errors affect program correctness; some errors only affect program accuracy, i.e., the program completes with certain acceptable deviations from the error-free outcome. In this paper we illustrate the idea of cross-layer soft-error resilience [3] using convolutional neural networks, where program accuracy is introduced as a tradeoff to deliver resilient yet efficient execution on futuristic large-scale multicores.
I. INTRODUCTION
Popular machine learning algorithms work on massive data
and perform perception computations. They can be used in
many applications, including image recognition, video analysis
and natural language processing. Such applications have the
potential to be deployed in safety-critical systems, where they
face harsh conditions that lead to vulnerability against transient
perturbations in the hardware system. Resiliency solutions,
such as Hardware Redundant Execution (HaRE) [4], deliver up
to 100% coverage. It relies on a local per-core checkpoint and
rollback mechanism to recover from detected errors. Moreover,
all code is redundantly (dual) executed, and thus has very high
performance overhead, which can be reduced by eliminating
unnecessary protection for certain code regions. On the other
hand, due to the inherently heuristic nature of machine learning
algorithms, individual floating-point calculations rarely impact the
program outcome. Thus, it is practical to improve processor
efficiency by trading off resilience overheads with program
accuracy.
At the program level, crucial and non-crucial code is
identified based on its impact on the program outcome. Crucial
code affects program correctness, which means the program
should be able to complete without crashing, deadlocking, aborting,
or similar failures due to transient errors, and its outcome must
remain explicable. Non-crucial code only affects program accuracy,
which refers to how far the result deviates from the error-free
outcome. We developed a cross-layer resilient architecture [3],
through which hardware collaborates with software support to
enable efficient resilience with 100% error coverage. Crucial
code is executed under HaRE and hence suffers from redun-
dancy’s performance overheads. However, non-crucial code is
protected with resilience schemes that have minimal impact
on performance. Only program accuracy is compromised in
the worst-case scenario during non-crucial code execution. Our
evaluation of several popular convolutional neural networks shows
that the cross-layer resilient architecture [3] enables significant
performance gains while maintaining the error coverage of
HaRE.
II. CROSS-LAYER RESILIENCE
The idea is to protect different code regions with different
resilience schemes. We assume the Hardware Redundant Execution
(HaRE) [4] architecture, which provides strong protection
for the crucial code regions. For non-crucial code regions,
based on the notion of trading off resilience overheads with
program accuracy, coarse-grain hardware/software resilience
(e.g., a software-level bound checker) is applied, which results in
low performance overhead. The programmer is responsible for
composing code regions as crucial versus non-crucial. Tuning
techniques, such as loop unrolling and control-flow elimination,
are used to further improve the efficiency (see [3] for details).
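To make the mechanism concrete, the following C sketch shows how a programmer might compose a non-crucial region. The hare_off()/hare_on() intrinsics are hypothetical stand-ins for the hardware/software interface of [3] (the real interface may differ); unrolling amortizes the cost of toggling protection over several accumulations, while the loop counter update remains fully protected.

/* Hypothetical intrinsics toggling Hardware Redundant Execution;
 * the actual interface in [3] may differ. */
void hare_off(void);   /* relax protection for non-crucial code */
void hare_on(void);    /* restore full redundant execution      */

/* Dot-product fragment composed as a non-crucial region. The loop is
 * unrolled by 4 so that protection is toggled once per four
 * multiply-accumulates instead of once per iteration. */
float noncrucial_dot(const float *x, const float *w, int n, float acc)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {     /* counter update stays protected */
        hare_off();
        acc += x[i]     * w[i];
        acc += x[i + 1] * w[i + 1];
        acc += x[i + 2] * w[i + 2];
        acc += x[i + 3] * w[i + 3];
        hare_on();
    }
    for (; i < n; i++)               /* remainder runs fully protected */
        acc += x[i] * w[i];
    return acc;
}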
After identifying the non-crucial code regions, program-level
fault injection and accuracy analysis are applied to verify the
acceptability of non-crucial code regions. The accuracy loss of
each region is obtained by applying fault injection to that region
alone, where the ratio of correct classifications to the number of
tests is defined as the accuracy. For example, when applying 100
handwritten digits through the CNN-MNIST workload, if 95 digits
are classified correctly, the accuracy is defined as 95%.
Multiple simulations (1000 times in this paper) are performed
for each configuration of non-crucial regions to obtain the
average program accuracy. Whether the aggregate accuracy loss due
to errors in non-crucial regions is acceptable depends on a
programmer-defined accuracy threshold, as well as on the error rate
arising from the system and environmental conditions.
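As a rough illustration of this methodology, the sketch below estimates the accuracy of one non-crucial-region configuration by repeated fault-injection trials. The run_inference_with_faults() and golden_label() helpers are hypothetical placeholders for the program-level fault injector and the reference labels; only the averaging structure mirrors the description above.

/* Hypothetical helpers: classify one test input with transient bit
 * flips injected into the selected non-crucial region at the given
 * error rate, and return the reference (error-free) label. */
extern int run_inference_with_faults(int test_idx, double error_rate);
extern int golden_label(int test_idx);

/* Average accuracy over a number of fault-injection trials, each
 * classifying num_tests inputs (e.g., 1000 trials of 100 MNIST
 * digits in this paper). */
double estimate_accuracy(int num_tests, int trials, double error_rate)
{
    double sum = 0.0;
    for (int t = 0; t < trials; t++) {
        int correct = 0;
        for (int i = 0; i < num_tests; i++)
            if (run_inference_with_faults(i, error_rate) == golden_label(i))
                correct++;                   /* correct classification   */
        sum += (double)correct / num_tests;  /* e.g., 95/100 = 95%       */
    }
    return sum / trials;                     /* average program accuracy */
}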
A. Application Illustration
This paper evaluates four commonly used convolutional
neural networks (CNNs): AlexNet (ALEXNET) [5], VGG [6],
handwritten digit recognition (MNIST) [7], and recognition
of German traffic signs (GTSRB) [8]. All CNNs consist of
several layers, such as input, convolutional, fully connected,
and output layers. The computation within each layer can be
identified as a candidate non-crucial region. Convolutional and
fully connected layers are illustrated in this section, as they
contribute most of the workload execution time.
Algorithm 1 CNN Convolutional Layer Pseudo Code
 1: ConvolutionLayer(input, conv_out, tid, threads) {
 2:   for each neuron in the thread do
 3:     /* The following 3-level loop is unrolled */
 4:     for (number of kernels k, kernel height h, kernel width w) do
 5:       /* Assign temp_k/h/w */
 6:       HaRE Off
 7:       conv_out += do_conv(input, temp_k/h/w)
 8:       /* Update temp variables */
 9:       conv_out += do_conv(input, temp_k/h/w)
10:       /* Update temp variables */
11:       :
12:       HaRE On
13:       /* Update k, h, w */
14:   Bound_Checker(conv_out)
15: }
Each convolutional layer takes the input feature map and
convolves it with the given kernels to produce an output
feature map. Each such convolution produces a single cell value of
the output feature matrix. These computations (shown in Algorithm 1,
lines 7-9) can be considered non-crucial, since the effect of an
individual cell value on the program outcome is limited. Furthermore, each
kernel produces an output feature matrix of its own. Cell values
are compared with each other to find the maximum one, which
is later used to construct a single output feature map. When
exposed to errors, the output map can get affected only if
the maximum cell values are perturbed into larger ones, since
smaller values are masked out. Note that out-of-bound values
are dropped, in which case the second-largest value would be
used for the corresponding cell in the output feature map. The
loops (shown in Algorithm 1, line 4) are unrolled, and the
loop counters (k, h, w) are updated with HaRE protection. Only
temporary variables (temp_k/h/w in Algorithm 1) are written in
the non-crucial region.
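The interaction between the bound checker and the maximum selection can be sketched as follows. The bounds and the helper name are illustrative assumptions rather than the paper's actual values; the point is that an out-of-bound (likely corrupted) cell is skipped, so the next-largest valid cell feeds the output feature map.

#include <float.h>

/* Illustrative per-layer bounds supplied by the programmer; a cell
 * value outside them is assumed corrupted and is dropped. */
#define CONV_MIN  (-1.0e6f)
#define CONV_MAX  ( 1.0e6f)

/* Select the maximum in-bound cell value produced by the kernels for
 * one output position; out-of-bound values are skipped, so the
 * second-largest (valid) value is used in their place. */
float max_inbound_cell(const float *cell_per_kernel, int num_kernels)
{
    float best = -FLT_MAX;
    for (int k = 0; k < num_kernels; k++) {
        float v = cell_per_kernel[k];
        if (v < CONV_MIN || v > CONV_MAX)
            continue;                 /* drop the perturbed value    */
        if (v > best)
            best = v;                 /* keep the largest valid one  */
    }
    return best;
}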
Algorithm 2 CNN Fully Connected Layer Pseudo Code
 1: FullyConnectedLayer(input, fully_out, tid, threads) {
 2:   for each layer do
 3:     for each neuron do
 4:       /* The following loop is unrolled */
 5:       for each input i do
 6:         HaRE Off
 7:         O += (input(i) * weights(i))
 8:         i = 1
 9:         O += (input(i) * weights(i))
10:         :
11:         i = 2, 3, ...
12:         O += (input(i) * weights(i))
13:         HaRE On
14:         Temp = I
15:       Bound_Checker(O)
16:       fully_out = Sigmoid(O)
17:       Barrier
18: }
Each fully connected layer (as shown in Algorithm 2) is a
feed-forward network in which all the neurons in one layer are
connected to the neurons in the next layer. The neuron count
reduces towards the end of each fully connected layer. The first
layer of this feed-forward network provides the output data set
of the previous layer to the neurons as input. The later layers
perform accumulations and multiplications of the inputs with
their respective weights to compute the sigmoid. The result is
further propagated to the next layer. The computations done in
the fully connected layer can be considered non-crucial, since
the remaining unperturbed accumulations can outweigh a perturbed one. To
ensure correctness of the program, bound checkers (shown in
Algorithm 1 line 14 and Algorithm 2 line 15) are introduced in
the code so that the effects of the perturbations can be reduced.
Statically determined bounds based on programmer guidance
are used. For fully connected layers, the accumulated results
are used to compute the sigmoid. By definition, the sigmoid
computation always results in a value within the range of 0 to 1.
For the sigmoid to effectively output 0 or 1, the respective input
values (O) are -90 and 10. Thus, to keep the result within this
range of 0 to 1 (excluding 0 and 1 themselves), we limit the
accumulations to the range of -90 to 10.
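A minimal sketch of this bound check and activation is given below; the function names are illustrative, and the -90/10 limits simply follow from the discussion above (sigmoid(-90) is numerically 0, while sigmoid(10) is effectively 1).

#include <math.h>

/* Clamp the accumulated neuron input to the programmer-specified
 * bounds before the activation, so that a perturbed accumulation
 * cannot push the sigmoid outside its meaningful operating range. */
static float bound_checker(float o)
{
    if (o < -90.0f) return -90.0f;    /* sigmoid(-90) ~ 0 */
    if (o >  10.0f) return  10.0f;    /* sigmoid(10)  ~ 1 */
    return o;
}

/* Fully connected neuron: accumulate, bound-check, then activate. */
static float fc_neuron(const float *input, const float *weights, int n)
{
    float o = 0.0f;
    for (int i = 0; i < n; i++)
        o += input[i] * weights[i];   /* non-crucial accumulations    */
    o = bound_checker(o);             /* crucial: limit the damage    */
    return 1.0f / (1.0f + expf(-o));  /* sigmoid activation in (0, 1) */
}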
TABLE I: Selected non-crucial regions at 0.1% error rate and 10%
accuracy threshold.

              Non-crucial Time (%)   Accuracy Loss (%)
CNN-ALEXNET           91                   6.9
CNN-VGG               92.5                 7.3
CNN-MNIST             87                   3.9
CNN-GTSRB             88                   6.2
Fig. 1: Completion Time Breakdown (per-benchmark completion time,
normalized to BASELINE, with bars for the baseline, HaRE, and CL
configurations).
III. EVALUATION
We use our modified Graphite multicore simulator [9] to evaluate
the cross-layer resilient architecture. Results in Figure 1 are
normalized to a BASELINE multicore system with no
resilience protection schemes. When applying CL (Cross-
Layer resilience), all CNNs show remarkable performance
improvement over HaRE. For example, completion time
overhead of ALEXNET improves from 1.83× for HaRE over
BASELINE to 1.15× for CL over BASELINE. This is because
HaRE is not able to hide its resilience overhead, whereas
significant portions of the computation are identified as non-crucial in CL,
as shown in Table I. The accuracy loss in different regions
depends on the code structure and functionality. For CNNs, our
analysis shows that apart from the input and output layers, all other
convolutional and fully connected layers can tolerate acceptable
accuracy loss. Overall, this paper shows significant performance
improvement of CL (at 1.10× normalized to BASELINE) over
HaRE (at 1.67×) using CNN benchmarks. This is accomplished
while maintaining reasonable accuracy and high error coverage.
REFERENCES
[1] V. Roberge, M. Tarbouchi, and G. Labonte, "Comparison of parallel genetic algorithm and particle swarm optimization for real-time UAV path planning," IEEE Transactions on Industrial Informatics, vol. 9, Feb 2013.
[2] J. Kim, H. Kim, K. Lakshmanan, and R. Rajkumar, "Parallel scheduling for cyber-physical systems: Analysis and case study on a self-driving car," in ACM/IEEE ICCPS, pp. 31–40, April 2013.
[3] Q. Shi, H. Hoffmann, and O. Khan, "A cross-layer multicore architecture to tradeoff program accuracy and resilience overheads," IEEE Computer Architecture Letters, vol. 14, pp. 85–89, July 2015.
[4] Q. Shi and O. Khan, "Toward holistic soft-error-resilient shared-memory multicores," Computer, vol. 46, pp. 56–64, October 2013.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 1097–1105, Curran Associates, Inc., 2012.
[6] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[7] M. A. Nielsen, "Neural networks and deep learning," Determination Press, 2015.
[8] P. Sermanet and Y. LeCun, "Traffic sign recognition with multi-scale convolutional networks," in The 2011 International Joint Conference on Neural Networks (IJCNN), pp. 2809–2813, July 2011.
[9] J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal, "Graphite: A distributed parallel simulator for multicores," in HPCA-16 2010, The Sixteenth International Symposium on High-Performance Computer Architecture, pp. 1–12, Jan 2010.