Towards Resilient yet Efficient Parallel Execution of Convolutional Neural Networks
Qingchuan Shi, Hamza Omar, Omer Khan
University of Connecticut, Storrs, CT, USA
Abstract—Machine learning is increasingly being deployed in many safety-critical and real-time applications, such as unmanned aerial vehicles (UAVs) [1] and self-driving cars [2], where the systems can be exposed to harsh environments. Due to the unique structure and computational behavior of machine learning algorithms, prior research has explored relaxing their accuracy for performance benefits. We observe that not all transient errors affect program correctness; some errors only affect program accuracy, i.e., the program completes with acceptable deviations from the error-free outcome. In this paper we illustrate the idea of cross-layer soft-error resilience [3] using convolutional neural networks, where program accuracy is introduced as a tradeoff to deliver resilient yet efficient execution on futuristic large-scale multicores.
I. INTRODUCTION
Popular machine learning algorithms work on massive data
and perform perception computations. They can be used in
many applications, including image recognition, video analysis
and natural language processing. Such applications have the
potential to be deployed in safety-critical systems, where they
face harsh conditions that lead to vulnerability against transient
perturbations in the hardware system. Resiliency solutions, such as Hardware Redundant Execution (HaRE) [4], deliver up to 100% coverage. HaRE relies on a local per-core checkpoint and rollback mechanism to recover from detected errors. However, all code is redundantly (dual) executed, which incurs very high performance overhead; this overhead can be reduced by eliminating unnecessary protection for certain code regions. On the other hand, due to the inherent heuristic nature of machine learning algorithms, individual floating point calculations rarely impact the program outcome. It is therefore practical to improve processor efficiency by trading off resilience overheads with program accuracy.
At the program level, crucial and non-crucial code is identified based on its impact on the program outcome. Crucial code affects program correctness: the program must complete without crashing, deadlocking, or aborting due to transient errors, and its outcome must remain explicable. Non-crucial code only affects program accuracy, i.e., how far the result deviates from the error-free outcome. We developed a cross-layer resilient architecture [3], in which hardware collaborates with software support to enable efficient resilience with 100% error coverage. Crucial code is executed under HaRE and hence suffers from redundancy's performance overheads. Non-crucial code, however, is protected with resilience schemes that have minimal impact on performance; in the worst case, only program accuracy is compromised during non-crucial code execution. Our evaluation of several popular convolutional neural networks shows that the cross-layer resilient architecture [3] enables significant performance gains while maintaining the error coverage of HaRE.
II. CROSS-LAYER RESILIENCE
The idea is to protect different code regions with different resilience schemes. We assume the Hardware Redundant Execution (HaRE) [4] architecture, which provides strong protection for the crucial code regions. For non-crucial code regions, based on the notion of trading off resilience overheads with program accuracy, coarse-grain hardware/software resilience (e.g., a software-level bound checker) is applied, which results in low performance overhead. The programmer is responsible for classifying code regions as crucial versus non-crucial. Tuning techniques, such as loop unrolling and control flow elimination, are used to further improve efficiency (see [3] for details).
After identifying the non-crucial code regions, program-level fault injection and accuracy analysis are applied to verify the acceptability of the non-crucial code regions. The accuracy loss of each region is obtained by applying fault injection to that region alone, where accuracy is defined as the number of correct classifications divided by the number of tests. For example, when 100 handwritten digits are run through the CNN-MNIST workload and 95 digits are classified correctly, the accuracy is 95%. Multiple simulations (1000 in this paper) are performed for each configuration of non-crucial regions to obtain the average program accuracy. Whether the aggregate accuracy loss due to errors in the non-crucial regions is acceptable depends on a programmer-defined accuracy threshold, as well as on the error rate arising from system and environmental conditions.
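The analysis loop itself is simple; the following C++ sketch shows one way to compute the average program accuracy, with classify_with_faults() and expected_label() as hypothetical stand-ins for a CNN run under fault injection in one non-crucial region and for the reference labels.

#include <random>

int classify_with_faults(int test_id, double error_rate, std::mt19937& rng);  // hypothetical
int expected_label(int test_id);                                              // hypothetical

double region_accuracy(int num_tests, int num_trials, double error_rate) {
  std::mt19937 rng(42);
  double sum = 0.0;
  for (int t = 0; t < num_trials; ++t) {    // e.g., 1000 simulations per configuration
    int correct = 0;
    for (int i = 0; i < num_tests; ++i)     // e.g., 100 handwritten digits for CNN-MNIST
      if (classify_with_faults(i, error_rate, rng) == expected_label(i))
        ++correct;
    sum += static_cast<double>(correct) / num_tests;
  }
  return sum / num_trials;                  // average program accuracy
}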
A. Application Illustration
This paper evaluates four commonly used convolutional neural networks (CNNs): AlexNet (ALEXNET) [5], VGG [6], handwritten digit recognition (MNIST) [7], and German traffic sign recognition (GTSRB) [8]. All CNNs consist of several layers, such as input, convolutional, fully connected, and output layers. The computation within each layer can be identified as a candidate non-crucial region. The convolutional and fully connected layers are illustrated in this section, as they contribute most of the workload execution time.
Algorithm 1 CNN Convolutional Layer Pseudo Code
1: ConvolutionLayer(input, conv_out, tid, threads) {
2:   for each neuron in the thread do
3:     /* The following 3-level loop is unrolled */
4:     for (number of kernels k, kernel height h, kernel width w) do
5:       /* Assign temp_k/h/w */
6:       HaRE Off
7:       conv_out += do_conv(input, temp_k/h/w)
8:       /* Update temp variables */
9:       conv_out += do_conv(input, temp_k/h/w)
10:      /* Update temp variables */
11:      ...
12:      HaRE On
13:      /* Update k, h, w */
14:  Bound_Checker(conv_out)
15: }
Each convolutional layer takes the input feature map and convolves it with the given kernels; each convolution produces one cell value of an output feature matrix. These computations (shown in Algorithm 1, lines 7-9) can be considered non-crucial, since the effect of an individual cell value on the program outcome is limited. Furthermore, each kernel produces an output feature matrix of its own. Cell values are compared with each other to find the maximum one, which is later used to construct a single output feature map. When exposed to errors, the output map can be affected only if the maximum cell values are perturbed into larger ones, since smaller values are masked out. Note that out-of-bound values are dropped, in which case the second largest value is used for the corresponding cell in the output feature map. The loops (shown in Algorithm 1, line 4) are unrolled, and the loop counters (k, h, w) are updated with HaRE protection. Only temporary variables (temp_k/h/w in Algorithm 1) are written in the non-crucial region.
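The masking argument can be made concrete with a short C++ sketch (our illustration, not the authors' code, assuming a single static upper bound): a perturbed cell value above the bound is dropped, so the next-largest in-bound value survives into the output feature map.

#include <limits>

// Select the maximum in-bound cell value across the per-kernel outputs.
float max_cell_with_bound(const float* cells, int n, float upper_bound) {
  float best = -std::numeric_limits<float>::infinity();
  for (int i = 0; i < n; ++i) {
    if (cells[i] > upper_bound) continue;  // out-of-bound value: dropped
    if (cells[i] > best) best = cells[i];  // smaller values are masked anyway
  }
  return best;
}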
Algorithm 2 CNN Fully Connected Layer Pseudo Code
1: FullyConnectedLayer(input, fully_out, tid, threads) {
2:   for each layer do
3:     for each neuron do
4:       /* The following loop is unrolled */
5:       for each input i do
6:         HaRE Off
7:         O += (input(i) * weights(i))
8:         i = 1
9:         O += (input(i) * weights(i))
10:        ...
11:        i = 2, 3, ...
12:        O += (input(i) * weights(i))
13:        HaRE On
14:        Temp = i
15:      Bound_Checker(O)
16:      fully_out = Sigmoid(O)
17:   Barrier
18: }
Each fully connected layer (shown in Algorithm 2) is a feed-forward network in which all the neurons in one layer are connected to the neurons in the next layer. The neuron count reduces towards the end of the fully connected layers. The first layer of this feed-forward network provides the output data set of the previous layer to the neurons as input. The later layers multiply the inputs with their respective weights and accumulate the products to compute the sigmoid; the result is propagated to the next layer. The computations done in the fully connected layer can be considered non-crucial, since the remaining unperturbed accumulations can outweigh a perturbed one. To ensure correctness of the program, bound checkers (shown in Algorithm 1, line 14 and Algorithm 2, line 15) are introduced in the code so that the effects of the perturbations are contained. Statically determined bounds based on programmer guidance are used. For fully connected layers, the accumulated result is used to compute the sigmoid. By definition, the sigmoid always produces a value in the range 0 to 1; it effectively saturates to 0 when its input O reaches -90 and to 1 when O reaches 10. Thus, to keep the result strictly between 0 and 1, we limit the accumulated value to the range [-90, 10].
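A minimal C++ sketch of this bound checker, assuming the limits stated above, simply clamps the accumulated value before the sigmoid is applied (std::clamp requires C++17):

#include <algorithm>
#include <cmath>

// Clamp the accumulation to [-90, 10], where the sigmoid saturates.
double bound_checker(double o) {
  return std::clamp(o, -90.0, 10.0);
}

double sigmoid(double o) {
  return 1.0 / (1.0 + std::exp(-o));
}

// Usage: fully_out = sigmoid(bound_checker(O));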
TABLE I: Selected non-crucial regions at 0.1% error rate and 10% accuracy threshold.

              Non-crucial Time (%)   Accuracy Loss (%)
CNN-ALEXNET          91                    6.9
CNN-VGG              92.5                  7.3
CNN-MNIST            87                    3.9
CNN-GTSRB            88                    6.2
!"
!#$"
%"
%#$"
&"
'()*+,-*"
./0*"
1+"
'()*+,-*"
./0*"
1+"
'()*+,-*"
./0*"
1+"
'()*+,-*"
./0*"
1+"
'()*+,-*"
./0*"
1+"
1--2"
(+*3-*4"
1--2"
54)0'"
1--2"
6-,)4"
1--2755" (89:/;9"
09<=>=9?@9"
")A?@B:C?=D/EC?"
F=G9+=?9"H>I<B"
"':/?@B"
)G9@I>/EC?"
"69JC:A")K/>><"
"1CJGIK9")K/>><"
"+%2,"H9K@B")K/>><"
",?<K:I@EC?<"
Fig. 1: Completion Time Breakdown.
III. EVALUATION
We use our modified Graphite multicore simulator [9] to evaluate the cross-layer resilient architecture. Results in Figure 1 are normalized to the BASELINE multicore system with no resilience protection schemes. When applying CL (Cross-Layer resilience), all CNNs show remarkable performance improvement over HaRE. For example, the completion time overhead of ALEXNET improves from 1.83× for HaRE over BASELINE to 1.15× for CL over BASELINE. This is because HaRE is not able to hide its resilience overhead, whereas CL identifies significant computation as non-crucial, as shown in Table I. The accuracy loss in different regions depends on the code structure and functionality. For CNNs, our analysis shows that apart from the input and output layers, all convolutional and fully connected layers can tolerate acceptable accuracy loss. Overall, this paper shows significant performance improvement of CL (at 1.10× normalized to BASELINE) over HaRE (at 1.67×) using CNN benchmarks. This is accomplished while maintaining reasonable accuracy and high error coverage.
REFERENCES
[1] V. Roberge, M. Tarbouchi, and G. Labonte, "Comparison of parallel genetic algorithm and particle swarm optimization for real-time UAV path planning," IEEE Transactions on Industrial Informatics, vol. 9, Feb 2013.
[2] J. Kim, H. Kim, K. Lakshmanan, and R. Rajkumar, "Parallel scheduling for cyber-physical systems: Analysis and case study on a self-driving car," in ACM/IEEE ICCPS, pp. 31–40, April 2013.
[3] Q. Shi, H. Hoffmann, and O. Khan, "A cross-layer multicore architecture to tradeoff program accuracy and resilience overheads," IEEE Computer Architecture Letters, vol. 14, pp. 85–89, July 2015.
[4] Q. Shi and O. Khan, "Toward holistic soft-error-resilient shared-memory multicores," Computer, vol. 46, pp. 56–64, October 2013.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 1097–1105, Curran Associates, Inc., 2012.
[6] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[7] M. A. Nielsen, "Neural networks and deep learning," Determination Press, 2015.
[8] P. Sermanet and Y. LeCun, "Traffic sign recognition with multi-scale convolutional networks," in The 2011 International Joint Conference on Neural Networks (IJCNN), pp. 2809–2813, July 2011.
[9] J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal, "Graphite: A distributed parallel simulator for multicores," in HPCA-16: The Sixteenth International Symposium on High-Performance Computer Architecture, pp. 1–12, Jan 2010.