Training Modern Deep Neural Networks for Memory-Fault Robustness



Ghouthi Boukli Hacene1,2, François Leduc-Primeau3, Amal Ben Soussia2, Vincent Gripon1,2 and François Gagnon4
1Université de Montréal, MILA   2IMT Atlantique, Lab-STICC
3Dept. of Electrical Engineering, École Polytechnique de Montréal
4Dept. of Electrical Engineering, École de technologie supérieure, Montréal
Abstract—Because deep neural networks (DNNs) rely on a large number of parameters and computations, their implementation in energy-constrained systems is challenging. In this paper, we investigate the solution of reducing the supply voltage of the memories used in the system, which results in bit-cell faults. We explore the robustness of state-of-the-art DNN architectures towards such defects and propose a regularizer meant to mitigate their effects on accuracy. Our experiments clearly demonstrate the benefit of operating the system in a faulty regime to save energy without reducing accuracy.
I. Introduction

Deep Neural Networks (DNNs) [1] are the gold standard for many challenges in machine learning. Thanks to their large number of trainable parameters, DNNs can capture the complexity of large training datasets and generalize to previously unseen examples.
Among the many applications for DNNs, many are in
the field of embedded systems. Examples include monitor-
ing of health signals, human-machine interfaces, autonomous
drones, and smartphone applications. Many such embedded
applications cannot rely on cloud-based processing because
of stringent latency constraints or privacy issues. Even when cloud processing is an option, processing on-device or at the network edge can be useful to save network bandwidth.
The energy consumption of the inference task is thus a
major concern. Unfortunately, because state-of-the-art DNN
architectures are composed of a large number of trained
parameters, the inference step typically requires significant
energy to achieve accurate results on challenging tasks, with a
large part of the energy complexity being associated with the
memory accesses required to retrieve the parameters and save
temporary results.
Since off-chip memory accesses consume significant energy,
a first step for reducing energy consumption consists in storing
all parameters and temporary results on the same chip as
the hardware accelerator, using static random access mem-
ories (SRAMs), or potentially embedded dynamic RAMs [2].
However, even in this case, the energy consumed by memory
accesses still represents 30–60% of the total energy [3].
An effective way of lowering energy consumption of both
memory and logic circuits is to reduce the supply voltage,
but this has the effect of increasing the sensitivity of the
circuits to fabrication variations, causing bit-cell failures in
SRAMs. When approaching the minimum energy operating
point of SRAMs, the failure rates increase by several orders
of magnitude compared to operating at the nominal supply [4].
However, even such large bit-cell failure rates are not neces-
sarily catastrophic if appropriate mechanisms are in place to
safeguard the operation of the system.
DNNs naturally exhibit a limited amount of fault tolerance,
as noted for instance in [5], [6], and there is a growing body
of work that studies the operation of DNN inference hardware
built using faulty memories. We review several contributions
in Section II. The aim of this paper is to investigate the ability
to decrease the energy consumption of DNN accelerators by
allowing the memories used for storing weights and activations
to operate in a faulty regime, thus introducing deviations on
the stored values. We rely on simple but realistic energy-
deviation models to explore the impact of memory failures on
classification accuracy, and ultimately on energy consumption.
We quantify the impact on robustness of several design
aspects of state-of-the-art deep architectures in order to iden-
tify whether these aspects should be targeted when designing
robust architectures. Specifically, we consider the choice of
general architecture, how the depth of a layer impacts its
robustness, and the impact of faults occurring in the storage
of weights or of neuron activations. Interestingly, we find that
different architectures provide varying degrees of robustness.
We then consider whether faulty operation can lead to a
reduction in power consumption. Importantly, we compare the
energy consumption with a reliable reference implementation
that achieves the same application performance. We show that
using a faulty implementation to reduce energy consumption at
the cost of a reduction in accuracy is not necessarily beneficial,
even when the loss in accuracy appears small. Indeed, for
state-of-the-art architectures, accepting even a 1% reduction
in accuracy can significantly reduce the number of parameters
required by a reliable implementation. It is thus essential to
evaluate the improvement provided by a faulty architecture at
the same accuracy. Nonetheless, we show that faulty operation
can reduce energy consumption when the fault statistics are
taken into account during training.
The outline of the paper is as follows. Section II briefly
reviews related work. Section III introduces the deviation
models, which represent the impact of circuit faults on the
algorithm. Section IV presents an exploration of the design
space for faulty-memory implementations of modern DNNs.
Section V proposes a regularizer to increase the robustness of
DNNs to deviations. Section VI provides some conclusions.
II. Related Work

The idea of exploiting fault tolerance to improve the energy efficiency of neural networks has attracted a significant number of contributions. An early investigation of the effect
of transistor-level defects on neural networks was performed
in [7]. More recently, circuit-level methods for improving
the application performance of faulty implementations have
been proposed. One approach consists in using razor flip-
flops to detect faults and selectively apply a compensation
mechanism. When memory faults can be detected at the
bit level, a bit masking technique can be applied to ensure
that errors always reduce the magnitude of weights, helping
to decrease the impact of the errors on performance [8],
[9]. Similarly, razor flip-flops can be used to compensate
timing violations occurring in the datapath by dropping the
next operation, which effectively sets its weight parameter to
zero [10]. Finally, a low-precision replica can be added to computation units to bound the maximum error that can be introduced by a faulty processing unit [11].
To the best of our knowledge, few papers investigate the ef-
fect of training deep architectures to increase fault robustness.
One notable exception is [3], which proposes modifying the training procedure to take into account bit flips occurring in SRAMs, and presents results on the MNIST benchmark [12].
The effect of faults occurring in the storage of the input is
also considered in [13], and [14] proposes on-chip learning for
support-vector machines, while decreasing the learning effort
using active learning. Finally, a slightly different problem
is considered in [15], [16], where the network is trained to
compensate for known defect locations.
Another line of work consists in compressing models to
reduce memory usage and number of computations. There are
mainly three ways to achieve this. A first one is to quantize
weights, using in the extreme case only one bit per weight
and per activation [17]–[19]. While the process has proven
very efficient on old and somewhat redundant architectures,
it can drastically affect accuracy when performed on already
compressed architectures. A second way to compress DNNs
is to prune the weights, significantly reducing the number of
parameters to be stored [20]. A last line of work consists in
factorizing weights, so that they can be used to perform multi-
ple computations throughout the processing of an input [21]–
[23]. However, in modern architectures the weights account for only a small portion of the memory: the neuron activations can require as much storage, or even more when the batch size is large, i.e., when several inputs are processed in parallel.
III. Energy and Deviation Models

We focus on the energy consumed by memory accesses, and assume that the amount of energy required to perform an inference task is proportional to the number of accesses. We thus define a base energy metric E_o that is the sum of the number of parameters and of the number of activation values generated during the inference.
To decrease the energy consumption of on-chip memories,
we consider reducing the supply voltage, which in turn causes
some bit cells to fail. In order to investigate the general
behavior of DNNs implemented with faulty memory, we need
a model linking the bit-cell fault probability p and the energy consumed by memory accesses. We denote by η, with 0 ≤ η ≤ 1, the normalized energy consumption of the memory, where the normalization is with respect to the energy consumption of the reliable memory (such that the energy is given by ηE_o). Note that we can obtain a simple upper bound for p from the fact that instead of using a faulty memory, we could store only a fraction η of the data while declaring the missing bit-cells as faulty, which yields a linearly decreasing p(η).
Based on reliability data published in [4, Fig. 7], we will assume that the energy-reliability function takes the exponential form

p(η) = e^(−aη).    (1)

In order to obtain a specific value of the parameter a for illustrative purposes, we select a to fit the energy data reported in [24, Fig. 1] and the reliability from [4, Fig. 7] for 65 nm CMOS SRAM cells at VDD ∈ {0.5, 1.1} V. Performing the fit by minimizing the sum of the relative squared errors yields a = 12.8. Specific energy gains will vary based on the value of a, but in this paper we are only interested in identifying general trends.
The manner in which memory faults introduce deviations
during inference depends on the strategy being used to cope
with faults. We consider the case where bit-cell faults can be
detected, and use the bit masking (BM) approach proposed
in [8]. When a fault is detected on the sign bit of a value,
this value is replaced with zero. In the case of failures on any
other bits, the affected bit values are replaced with the sign
bit, causing the value to deviate towards zero. We consider
that all bit cells have an equal fault probability p. When using
the deviation model in simulations, we assume that values are
quantized on 8 bits. However, for a fair comparison with the
reliable implementations that use a floating-point representa-
tion, we compute the deviation in the quantized domain, but
apply it on the floating-point representation. Unless otherwise
mentioned, we consider that faults affect both the weights
and the neuron activations. Note that activations are known
to be positive since they are generated by a ReLU function.
Therefore, we assume that their sign bit cannot be affected.
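To make the deviation model concrete, here is a minimal Python sketch of how BM faults could be injected into 8-bit two's-complement words. This is an illustration of the scheme described above, not the authors' implementation; `bm_word`, `to_signed` and `inject_faults` are our own names:

```python
import random

def to_signed(word: int) -> int:
    """Interpret an 8-bit pattern (0..255) as a two's-complement value."""
    return word - 256 if word >= 128 else word

def bm_word(word: int, fault_bits, sign_known_zero: bool = False) -> int:
    """Apply bit-mask (BM) deviations to one 8-bit word.

    fault_bits lists the positions (0..7, with 7 the sign bit) of faulty
    cells. A sign-bit fault replaces the whole value with zero; any other
    faulty bit is overwritten with the sign bit, so the stored value can
    only deviate towards zero. For ReLU activations the sign bit is known
    to be zero (sign_known_zero=True), so it cannot be affected.
    """
    sign = 0 if sign_known_zero else (word >> 7) & 1
    for b in fault_bits:
        if b == 7 and not sign_known_zero:
            return 0  # sign-bit fault: value replaced with zero
        word = (word & ~(1 << b)) | (sign << b)
    return word

def inject_faults(values, p, activations=False, rng=random.Random(0)):
    """Each of the 8 bit cells of every value fails independently with
    probability p."""
    out = []
    for v in values:
        faults = [b for b in range(8) if rng.random() < p]
        out.append(to_signed(bm_word(v & 0xFF, faults, activations)))
    return out
```

A useful sanity check on this sketch is that the magnitude of a stored value can never increase under BM deviations, which is the property the bit-masking technique is designed to guarantee.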
In this work, we use our deviation model during the training phase to increase the robustness of networks and thus their energy efficiency. Because training is computationally intensive, we propose to simplify the BM deviation model used during training to speed up the process. Since the BM approach always causes values to deviate towards zero, we propose to approximate it using a deviation model that will be referred to as the erasure model, for which each value has a probability p_e of being set to zero. We then need to choose p_e to best approximate the effect of the BM model. We can first note that in the case of weight parameters, the BM model sets the faulty value to zero in case of a sign-bit fault, which occurs with probability p. Therefore, we clearly need p_e > p.
During training, this process is similar to dropout [25], but it is
Architecture Parameters Activations Accuracy
PreActResNet18 [27] 11.2×1060.55 ×10694.87%
MobileNetV2 [28] 2.30 ×1061.53 ×10693.80%
SENet18 [29] 11.3×1060.86 ×10694.77%
ResNet18 [30] 11.2×1060.56 ×10694.86%
Test set accuracy (%)
Fig. 1. Impact of the architecture on the robustness under BM deviations.
used to increase the robustness of networks, and not to prevent
overfitting. To find the best choice of peto approximate the
BM model, we evaluate the performance of both models on
the test set and choose the value of pethat best predicts the
accuracy of the network under the BM model.
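The erasure model described above can be sketched in a few lines of NumPy (our own illustrative code; in practice it would be applied to the weights and activations in the forward pass of each training step). Unlike dropout, surviving values are not rescaled by 1/(1 − p_e), since the goal is to mimic inference-time faults rather than to prevent overfitting:

```python
import numpy as np

def erase(x: np.ndarray, p_e: float, rng: np.random.Generator) -> np.ndarray:
    """Erasure regularizer: each value is independently set to zero with
    probability p_e during the forward pass of training."""
    if p_e == 0.0:
        return x
    keep = rng.random(x.shape) >= p_e  # True where the value survives
    return x * keep

rng = np.random.default_rng(0)
w = np.ones((1000, 100))
w_faulty = erase(w, p_e=0.02, rng=rng)
```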
IV. Design Space Exploration

A. Choice of architecture and dataset
We perform experiments using the CIFAR10 dataset [26]
made of tiny color images of 32×32 pixels. We first com-
pare four architectures, namely PreActResNet18 [27], Mo-
bileNetV2 [28], SENet18 [29] and ResNet18 [30], which
are all modern architectures achieving good accuracy on
CIFAR10. Table I shows for each architecture the number
of weights (parameters) and activation values of neurons that
must be retrieved from memory for processing one input, and
the accuracy achieved by that architecture.
In Fig. 1, we compare the robustness of the above-
mentioned architectures when the parameters and activations
are affected by the BM deviation model. We observe that some
architectures are inherently more robust than others, and that
this does not depend solely on the global number of parameters. In Fig. 2, we plot the accuracy as a function of the energy ηE_o per inference, where the base energy E_o corresponds to the sum of the parameter and activation columns of Table I, and the fault probability is obtained from the normalized energy η using (1). We observe that PreActResNet18 provides a very interesting trade-off between accuracy, memory accesses and robustness to BM. We therefore focus on this architecture for the remaining experiments.
Fig. 2. Energy consumption of different architectures under BM deviations.
Fig. 3. Impact of memory faults on accuracy for different deviation models (erasure vs. bit mask, applied either to all values or to weights only).
B. Comparison of the BM and erasure models
As motivated in Section III, we are interested in comparing
the effects of BM and erasures on the chosen architecture.
Results are depicted in Fig. 3. Since the BM model affects
weights and activations differently and since PreActResNet18
has about 20×more weights than activation values, we focus
on matching the accuracy of the two models when only
weights are affected by deviations. We observe for this case
that the BM and erasure models have a similar effect, provided
that pe= 2p, suggesting that using erasures as a proxy to
model the deviations induced by BM is a reasonable option.
This relation will be used in Section V to train networks to
be more resilient to BM deviations.
C. Relative importance of layer depth
In a new series of experiments, we aim at identifying
the relative robustness of various parts of the architecture
under BM deviations. To this end, we introduce deviations
on only a portion of the network. Since PreActResNet18
is composed of 4 sequential blocks (made of convolutional
layers and shortcuts), we apply BM deviations to the weights
and activations of only one block at a time. Results are
depicted in Fig. 4. We observe that all parts of the network
are sensitive to deviations. Interestingly, in the region of small
accuracy degradation shown in Fig. 4, robustness increases
monotonically with the depth of the block.

Fig. 4. Impact on accuracy of BM deviations applied to different stages of the network, “Block 1” being the first and “Block 4” the last.

We thus consider
exploiting the varying robustness of the layers to improve
energy consumption by assigning different operating points
to each block. Denoting by p_Bi the fault probability assigned to block i, we note from Fig. 4 that at a high accuracy of 94.8%, p_B4 = 5 p_B3 = 5 p_B2 = 10 p_B1. The number of parameters associated with each block, which in order of block is [1.5, 5.2, 21, 84] × 10^5, also varies over a wide range.
Following intuition, blocks that are more robust also have
more parameters. As shown in Fig. 5 (curves labeled “Diff.
Fault.”), applying this fault-rate policy significantly improves
the energy efficiency of the standard network.
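A back-of-the-envelope sketch of this blockwise policy, inverting the exponential energy-reliability model of Section III to obtain the normalized energy needed for a target fault rate (our own illustrative code; the example fault rates follow the ratios quoted above, with p_B4 = 0.01 chosen arbitrarily):

```python
import math

A = 12.8  # exponent of the energy-reliability model (Section III)
# Parameters per block of PreActResNet18: [1.5, 5.2, 21, 84] x 10^5
SIZES = [1.5e5, 5.2e5, 21e5, 84e5]

def normalized_energy(p: float, a: float = A) -> float:
    """Invert p(eta) = exp(-a * eta): energy needed to reach fault rate p."""
    return -math.log(p) / a

def blockwise_energy(fault_rates, sizes=SIZES) -> float:
    """Average normalized memory energy when each block runs at its own
    fault rate, weighted by the number of values stored per block."""
    total = sum(sizes)
    return sum(n * normalized_energy(p) for n, p in zip(sizes, fault_rates)) / total

policy = [0.001, 0.002, 0.002, 0.01]   # p_B4 = 5 p_B3 = 5 p_B2 = 10 p_B1
uniform = [0.001] * 4                  # every block at the most reliable rate
# Allowing higher fault rates in the large, robust deep blocks lowers the
# overall memory energy compared to a uniform fault rate.
```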
D. Impact of the number of parameters in the architecture
The number of parameters can be easily adapted by modi-
fying the number of feature maps in the convolutional layers.
If the number of feature maps of each convolutional layer is
multiplied by k, then the total number of parameters will be
roughly multiplied by k2, as the number of parameters in a
convolutional layer increases linearly with both the number of
input feature maps and the number of output feature maps.
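For instance (illustrative numbers and our own helper name), a 3×3 convolution carrying c_in input maps into c_out output maps stores c_in · c_out · 9 weights, so scaling both feature-map counts by k scales the weight count by k²:

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Number of weights in a k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

# Halving every feature-map count divides the weight count by 4:
full = conv_params(64, 128)   # 73728 weights
half = conv_params(32, 64)    # 18432 weights
```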
We train two variants of the PreActResNet architecture in which the original number F of feature maps is multiplied by 1/√2 and 1/2. These networks are used to provide a reference for the performance achieved with faulty implementations. The F/2 and F/√2 networks achieve respectively an accuracy of 93.45% and 94.41% under reliable implementations, illustrating the fact that significant energy reductions can be obtained easily if a reduced accuracy is acceptable.
V. Training for Robustness

All previous experiments confirm that modern DNN architectures can tolerate some amount of deviations. However,
in all the scenarios considered, we observe a sharp drop in
performance as soon as the fault probability p becomes too large or the energy too small. To improve the robustness
to deviations, we consider training the networks in the same
conditions they are used in, which means that we apply
erasures during the forward pass of the training phase. We
call this method the erasure regularizer. Note that the reason
that we use erasure rather than BM deviations is to speed up
the training process.
Fig. 5. Energy consumption of the PreActResNet18 architecture under BM deviations. Each faulty implementation curve corresponds to a fixed network size, with the number of feature maps shown within parentheses; curves with and without the erasure regularizer (“reg.”) and the blockwise fault-rate policy (“Diff. Fault.”) are compared against reliable implementations.
In Fig. 5, we plot the accuracy of the networks as a
function of the energy they use. We compare reliable imple-
mentations of networks with varying number of parameters
with the performance obtained when reducing the supply
voltage of memories. For the specific energy model discussed
in Section III, the best energy reduction obtained by the
faulty implementations with F feature maps is 1.5× for the network with standard training, achieved at an accuracy of 94.76% and a fault rate of p = 0.001, while the best energy reduction obtained using the erasure regularizer is 2.3× at an accuracy of 94.8% and p = 0.01. Furthermore, additional
gains can be obtained by combining the erasure regularizer
with blockwise reliability assignment of Sect. IV-C. We thus
see that training the network for robustness using the erasure
regularizer can significantly improve the energy reduction
obtained from faulty operation under the bit-masking model,
at equal accuracy. As discussed in Sect. IV-B, it is important
to perform the training with the appropriate p_e parameter: using an erasure regularizer with p_e = p did not yield an improvement in robustness.
VI. Conclusion

In this work, we explored the possibility of exploiting
the fault tolerance of deep neural networks to reduce the
energy consumption of on-chip memories. We showed that
in some conditions, reducing the supply voltage can result in
better accuracy for the same energy consumption compared
to reducing the number of parameters. We showed that a
deviation model corresponding to detectable bit-cell faults
combined with a bit masking technique can be replaced by
a simpler erasure model to speed up the training, and that using this erasure model as a regularizer during the training phase makes it possible to further reduce the energy with no impact on accuracy.
Finding the architecture that achieves the best accuracy for a given energy budget remains an open question, considering the very large number of possible solutions. As such,
a more systematic study of the combined impact of pruning,
quantizing, factorizing, reducing the number of parameters,
tweaking hyperparameters and reducing supply voltage is a
very promising direction for future work.
References

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[2] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen,
Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning supercomputer,” in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2014, pp. 609–622.
[3] S. Kim, P. Howe, T. Moreau, A. Alaghi, L. Ceze, and V. S. Sathe,
“Energy-efficient neural network acceleration in the presence of bit-level memory errors,” IEEE Trans. on Circuits and Systems I: Regular Papers, pp. 1–14, 2018.
[4] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge,
“Near-threshold computing: Reclaiming Moore’s law through energy efficient integrated circuits,” Proc. of the IEEE, vol. 98, no. 2, pp. 253–266, Feb. 2010.
[5] J.-C. Vialatte and F. Leduc-Primeau, “A study of deep learning robust-
ness against computation failures,” in Proc. 9th Int. Conf. on Advanced
Cognitive Technologies and Applications, Feb. 2017.
[6] X. Jiao, M. Luo, J. Lin, and R. K. Gupta, “An assessment of vulnera-
bility of hardware neural networks to dynamic voltage and temperature
variations,” in 2017 IEEE/ACM International Conference on Computer-
Aided Design (ICCAD), Nov 2017, pp. 945–950.
[7] O. Temam, “A defect-tolerant accelerator for emerging high-performance
applications,” in 39th Annual Int. Symp. on Computer Architecture
(ISCA), June 2012, pp. 356–367.
[8] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, “Minerva: Enabling low-power, highly-accurate deep neural network accelerators,” in Proc. 43rd
Int. Symp. on Computer Architecture (ISCA’16). Piscataway, NJ, USA:
IEEE Press, 2016, pp. 267–278.
[9] P. N. Whatmough, S. K. Lee, H. Lee, S. Rama, D. Brooks, and G. Wei,
“A 28nm SoC with a 1.2GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications,” in
2017 IEEE International Solid-State Circuits Conference (ISSCC), Feb
2017, pp. 242–243.
[10] J. Zhang, K. Rangineni, Z. Ghodsi, and S. Garg, “Thundervolt: Enabling
aggressive voltage underscaling and timing error resilience for energy
efficient deep neural network accelerators,” CoRR, vol. abs/1802.03806, 2018.
[11] Y. Lin, S. Zhang, and N. R. Shanbhag, “Variation-tolerant architectures
for convolutional neural networks in the near threshold voltage regime,”
in 2016 IEEE International Workshop on Signal Processing Systems
(SiPS), Oct 2016, pp. 17–22.
[12] Y. LeCun, C. Cortes, and C. J. Burges, “The MNIST database of
handwritten digits,” 1998.
[13] L. Yang and B. Murmann, “SRAM voltage scaling for energy-efficient
convolutional neural networks,” in 18th Int. Symp. on Quality Electronic
Design (ISQED), March 2017, pp. 7–12.
[14] Z. Wang, K. H. Lee, and N. Verma, “Overcoming computational errors
in sensing platforms through embedded machine-learning kernels,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23,
no. 8, pp. 1459–1470, Aug 2015.
[15] C. Liu, M. Hu, J. P. Strachan, and H. Li, “Rescuing memristor-based
neuromorphic design with high defects,” in 2017 54th ACM/EDAC/IEEE
Design Automation Conference (DAC), June 2017, pp. 1–6.
[16] L. Xia, M. Liu, X. Ning, K. Chakrabarty, and Y. Wang, “Fault-tolerant
training enabled by on-line fault detection for RRAM-based neural
computing systems,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, pp. 1–1, 2018.
[17] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training
deep neural networks with binary weights during propagations,” in
Advances in neural information processing systems, 2015, pp. 3123–
[18] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training neural networks with weights and activations constrained to +1 or −1,” arXiv preprint arXiv:1602.02830, 2016.
[19] G. Soulié, V. Gripon, and M. Robert, “Compression of deep neural networks on the fly,” in International Conference on Artificial Neural Networks. Springer, 2016, pp. 153–160.
[20] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and con-
nections for efficient neural network,” in Advances in neural information
processing systems, 2015, pp. 1135–1143.
[21] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing
deep neural networks with pruning, trained quantization and huffman
coding,” arXiv preprint arXiv:1510.00149, 2015.
[22] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression
of deep convolutional neural networks for fast and low power mobile
applications,” arXiv preprint arXiv:1511.06530, 2015.
[23] L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep
networks,” arXiv preprint arXiv:1611.01600, 2016.
[24] G. Chen, D. Sylvester, D. Blaauw, and T. Mudge, “Yield-driven near-
threshold SRAM design,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 18, no. 11, pp. 1590–1598, Nov 2010.
[25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-
dinov, “Dropout: a simple way to prevent neural networks from over-
fitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp.
1929–1958, 2014.
[26] A. Krizhevsky, V. Nair, and G. Hinton, “The CIFAR-10 dataset,” online: http://www.cs.toronto.edu/kriz/cifar.html, 2014.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual
networks,” in European conference on computer vision. Springer, 2016,
pp. 630–645.
[28] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
“Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
2018, pp. 4510–4520.
[29] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv
preprint arXiv:1709.01507, vol. 7, 2017.
[30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
... The robustness to unreliability in computation operations and memories has been investigated for several signal processing and machine-learning applications, including binary recursive estimation [10], binary linear transformation [11], deep neural networks [12,13], multi-agent systems [14] and distributed logistic regression [15]. Moreover, several techniques have been proposed to compensate for faults introduced by unreliable systems. ...
... where δt = 1 and σ x = 0.01 and σ y = 10. The factor a in (9) is taken as a = 12.8 as in [13]. This section is divided into two parts. ...
Full-text available
This paper presents a quantized Kalman filter implemented using unreliable memories. We consider that both the quantization and the unreliable memories introduce errors in the computations, and we develop an error propagation model that takes into account these two sources of errors. In addition to providing updated Kalman filter equations, the proposed error model accurately predicts the covariance of the estimation error and gives a relation between the performance of the filter and its energy consumption, depending on the noise level in the memories. Then, since memories are responsible for a large part of the energy consumption of embedded systems, optimization methods are introduced to minimize the memory energy consumption under the desired estimation performance of the filter. The first method computes the optimal energy levels allocated to each memory bank individually, and the second one optimizes the energy allocation per groups of memory banks. Simulations show a close match between the theoretical analysis and experimental results. Furthermore, they demonstrate an important reduction in energy consumption of more than 50%.
... A growing body of work is therefore being devoted to the study of digital systems built out of unreliable circuit components. For instance, [2] considers faulttolerant linear computing, [3] addresses logistic regression on unreliable hardware, [4] considers noisy binary recursive estimation, and [5] proposes to train deep neural networks for robustness to hardware faults. Error-correction codes are closely related to the development of such systems, November 17, 2020 DRAFT first because of the interest of developing more energy-efficient decoder implementations, but also because they can be used within a computing system to restore the fully-reliable operation abstraction when it might be required. ...
... We also confirm these results on finite-length simulations. For the (5,6) and the (5, 10) code, we compare the BER performance obtained with both symmetric and asymmetric decoder iterations. We observe that asymmetric parameters clearly improve the BER performance of the Gallager B decoder under asymmetric deviations. ...
... In contrast, it is more flexible and cost-effective to take advantage of the inherent fault tolerance of the neural networks and alleviate the fault-tolerant design overhead at least. The authors in [8] [45] [12] explored fault tolerance of neural networks with retraining. Prasenjit Dey et al. [7] proposed to penalize the system errors with regularizing terms to make MLP robust to various errors such as link failures, multiplicative noise, and additive noise. ...
Full-text available
Winograd convolution is originally proposed to reduce the computing overhead by converting multiplication in neural network (NN) with addition via linear transformation. Other than the computing efficiency, we observe its great potential in improving NN fault tolerance and evaluate its fault tolerance comprehensively for the first time. Then, we explore the use of fault tolerance of winograd convolution for either fault-tolerant or energy-efficient NN processing. According to our experiments, winograd convolution can be utilized to reduce fault-tolerant design overhead by 27.49\% or energy consumption by 7.19\% without any accuracy loss compared to that without being aware of the fault tolerance
Memristors enable the computation of matrix-vector multiplications (MVM) in memory and, therefore, show great potential in highly increasing the energy efficiency of deep neural network (DNN) inference accelerators. However, computations in memristors suffer from hardware non-idealities and are subject to different sources of noise that may negatively impact system performance. In this work, we theoretically analyze the mean squared error of DNNs that use memristor crossbars to compute MVM. We take into account both the quantization noise, due to the necessity of reducing the DNN model size, and the programming noise, stemming from the variability during the programming of the memristance value. Simulations on pre-trained DNN models showcase the accuracy of the analytical prediction. Furthermore the proposed method is almost two order of magnitude faster than Monte-Carlo simulation, thus making it possible to optimize the implementation parameters to achieve minimal error for a given power constraint.
With the rapid advancements of deep learning in the past decade, it can be foreseen that deep learning will be continuously deployed in more and more safety-critical applications such as autonomous driving and robotics. In this context, reliability turns out to be critical to the deployment of deep learning in these applications and gradually becomes a first-class citizen among the major design metrics like performance and energy efficiency. Nevertheless, black-box deep learning models combined with the diverse underlying hardware faults make resilient deep learning extremely challenging. In this special session, we conduct a comprehensive survey of fault-tolerant deep learning design approaches from a hierarchical perspective, investigating these approaches at the model, architecture, and circuit layers, as well as across layers.
As deep learning algorithms are widely adopted, an increasing number of them are positioned in embedded application domains with strict reliability constraints. The expenditure of significant resources to satisfy performance requirements in deep neural network accelerators has thinned out the margins for delivering safety in embedded deep learning applications, thus precluding the adoption of conventional fault tolerance methods. The potential of exploiting the inherent resilience characteristics of deep neural networks nevertheless remains unexplored, offering a promising low-cost path towards safety in embedded deep learning applications. This work demonstrates the possibility of such exploitation by juxtaposing the reduction of the vulnerability surface through the proper design of the quantization schemes with shaping the parameter distributions at each layer through the guidance offered by appropriate training methods, thus delivering deep neural networks of high resilience merely through algorithmic modifications. Unequaled error resilience characteristics can thus be injected into safety-critical deep learning applications, tolerating substantial bit error rates at absolutely zero hardware, energy, and performance cost while improving the error-free model accuracy even further.
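The bit-level fault injection such resilience studies rely on can be sketched as follows: quantize the weights to int8, then flip each stored bit independently with a given bit error rate. The symmetric quantizer and the BER value are illustrative assumptions, not the quantization schemes of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize_int8(w):
    """Symmetric linear quantization of a float tensor to int8."""
    scale = np.abs(w).max() / 127.0
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale

def inject_bit_flips(q, ber, rng):
    """Flip each of the 8 bits of every int8 word independently with prob. ber."""
    bits = q.view(np.uint8)                  # reinterpret words as raw bytes
    flip_mask = np.zeros_like(bits)
    for b in range(8):
        flip_mask |= (rng.random(bits.shape) < ber).astype(np.uint8) << b
    return (bits ^ flip_mask).view(np.int8)  # XOR flips the selected bits

w = rng.standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)
q_faulty = inject_bit_flips(q, ber=0.01, rng=rng)
mean_pert = np.abs(q_faulty.astype(float) - q.astype(float)).mean() * scale
print(f"mean absolute weight perturbation: {mean_pert:.4f}")
```

Because the most significant bit carries the sign and the largest magnitude step, the damage per flip is highly position-dependent, which is why quantization-scheme design affects resilience.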
Hardware accelerators are being increasingly deployed to boost the performance and energy efficiency of deep neural network (DNN) inference. In this paper we propose Thundervolt, a new framework that enables aggressive voltage underscaling of high-performance DNN accelerators without compromising classification accuracy even in the presence of high timing error rates. Using post-synthesis timing simulations of a DNN accelerator modeled on the Google TPU, we show that Thundervolt enables between 34% and 57% energy savings on state-of-the-art speech and image recognition benchmarks with less than 1% loss in classification accuracy and no performance loss. Further, we show that Thundervolt is synergistic with and can further increase the energy efficiency of commonly used run-time DNN pruning techniques like Zero-Skip.
Memristor-based synaptic networks have been widely investigated and applied to neuromorphic computing systems for their fast computation and low design cost. As memristors continue to mature and achieve higher density, bit failures within crossbar arrays can become a critical issue that significantly degrades computation accuracy. In this work, we propose a defect-rescuing design to restore computation accuracy. In our proposed design, significant weights in a given network are first identified, and retraining and remapping algorithms are then described. For a two-layer neural network with 92.64% classification accuracy on MNIST digit recognition, our evaluation based on real device testing shows that our design can recover almost the full performance when 20% random defects are present.
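A toy version of the "identify significant weights, then remap" idea might look like the following greedy heuristic, which permutes weight-matrix rows across crossbar rows so that large-magnitude weights tend to avoid defective cells. The 10% significance threshold and the greedy assignment are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
rows, cols = 16, 16
W = rng.standard_normal((rows, cols))       # weights of one layer
defects = rng.random((rows, cols)) < 0.1    # stuck bit-cells in the crossbar

# Mark the 10% largest-magnitude weights as "significant".
k = int(0.1 * W.size)
thresh = np.partition(np.abs(W).ravel(), -k)[-k]
significant = np.abs(W) >= thresh

def greedy_remap(significant, defects):
    """Assign each weight row to a crossbar row, greedily keeping
    significant weights off defective cells."""
    n = len(significant)
    order = np.argsort(-significant.sum(axis=1))  # busiest weight rows first
    free = list(range(n))                         # unassigned crossbar rows
    assign = np.empty(n, dtype=int)
    for i in order:
        costs = [(significant[i] & defects[r]).sum() for r in free]
        assign[i] = free.pop(int(np.argmin(costs)))
    return assign

def n_collisions(mapping):
    """Count significant weights that land on defective cells."""
    return sum(int((significant[i] & defects[mapping[i]]).sum())
               for i in range(rows))

assign = greedy_remap(significant, defects)
print(n_collisions(np.arange(rows)), n_collisions(assign))
```

Any significant weights that still collide with defects after remapping would then be handled by the retraining step the abstract mentions.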
An RRAM-based computing system (RCS) is an attractive hardware platform for implementing neural computing algorithms. On-line training for RCS enables hardware-based learning for a given application and reduces the additional error caused by device parameter variations. However, a high occurrence rate of hard faults due to immature fabrication processes and limited write endurance restricts the applicability of on-line training for RCS. We propose a fault-tolerant on-line training method that alternates between a fault-detection phase and a fault-tolerant training phase. In the fault-detection phase, a quiescent-voltage comparison method is utilized. In the training phase, a threshold-training method and a re-mapping scheme are proposed. Our results show that, compared to neural computing without fault tolerance, the recognition accuracy for the Cifar-10 dataset improves from 37% to 83% when using low-endurance RRAM cells, and from 63% to 76% when using RRAM cells with high endurance but a high percentage of initial faults.
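The alternating detect-then-train idea can be sketched on a toy linear model: an (idealized) detection phase marks stuck cells, after which training re-imposes the stuck values after every update so that only healthy weights learn. This is a hedged illustration of fault-aware training, not the quiescent-voltage comparison or threshold-training methods of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
w_true = rng.standard_normal(n)
X = rng.standard_normal((500, n))
y = X @ w_true

# Fault-detection phase (idealized): writing and reading back each cell
# would reveal stuck weights; here we simply draw a stuck-at-zero pattern.
stuck = rng.random(n) < 0.15
stuck_val = np.zeros(n)

def apply_faults(w):
    """What the faulty device actually stores for weight vector w."""
    return np.where(stuck, stuck_val, w)

# Fault-tolerant training phase: compute the loss with the faulty weights
# and re-impose the stuck values after every gradient step.
w = np.zeros(n)
lr = 0.01
for _ in range(300):
    w_dev = apply_faults(w)
    grad = 2.0 * X.T @ (X @ w_dev - y) / len(y)
    w = apply_faults(w - lr * grad)

final_loss = np.mean((X @ apply_faults(w) - y) ** 2)
print(final_loss)  # well below the initial loss, limited by the stuck cells
```

The healthy weights converge to the best fit achievable given the stuck cells; the residual loss is exactly the contribution of the components the stuck cells can no longer represent.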
As a result of the increasing demand for deep neural network (DNN)-based services, efforts to develop hardware accelerators for DNNs are growing rapidly. However, while highly efficient accelerators for convolutional DNNs (Conv-DNNs) have been developed, less progress has been made with regard to fully-connected DNNs. Based on an analysis of bit-level SRAM errors, we propose memory-adaptive training with in-situ canaries (MATIC), a methodology that enables aggressive voltage scaling of accelerator weight memories to improve the energy efficiency of DNN accelerators. To enable accurate operation with voltage overscaling, MATIC combines characteristics of SRAM bit failures with the error resilience of neural networks in a memory-adaptive training (MAT) process. Furthermore, PVT-related voltage margins are eliminated by using bit-cells from synaptic weights as in-situ canaries to track runtime environmental variation. Demonstrated on a low-power DNN accelerator fabricated in 65 nm CMOS, MATIC enables up to 3.3x energy reduction versus the nominal voltage, or 18.6x application error reduction. We also perform a simulation study that extends MAT to Conv-DNNs, and characterize the accuracy impact of bit failure statistics. Finally, we develop a weight refinement algorithm to improve the performance of MAT, and show that it improves absolute accuracy by 0.8-1.3% or reduces training time by 5-10x.
Neural networks are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce a three-stage pipeline: pruning, quantization, and Huffman encoding, which work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing; finally, we apply Huffman encoding. After the first two steps we retrain the network to fine-tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy, and the size of VGG-16 by 49x, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache, which has 180x less access energy, rather than off-chip DRAM memory.
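The first two pipeline stages can be sketched in a few lines of NumPy: magnitude pruning followed by weight sharing via a tiny 1-D k-means. The pruning threshold of 0.5 and k = 16 clusters are illustrative choices; Huffman coding of the resulting cluster indices would form the third stage.

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.standard_normal(2000).astype(np.float32)

# Stage 1 -- pruning: keep only large-magnitude ("important") connections.
mask = np.abs(w) > 0.5
w_pruned = w * mask
sparsity = 1.0 - mask.mean()

# Stage 2 -- quantization via weight sharing: cluster the surviving
# weights into k shared centroids with a tiny 1-D k-means.
k = 16
vals = w_pruned[mask]
centroids = np.linspace(vals.min(), vals.max(), k)
for _ in range(20):
    idx = np.argmin(np.abs(vals[:, None] - centroids[None, :]), axis=1)
    for c in range(k):
        if np.any(idx == c):
            centroids[c] = vals[idx == c].mean()
idx = np.argmin(np.abs(vals[:, None] - centroids[None, :]), axis=1)
w_shared = w_pruned.copy()
w_shared[mask] = centroids[idx]
# Stage 3 would Huffman-code the highly skewed cluster indices.

print(f"sparsity: {sparsity:.2f}, "
      f"distinct weight values: {len(np.unique(w_shared))}")
```

After quantization the layer stores only the small codebook plus a 4-bit index per surviving weight, which is where the 32-to-5-bit reduction cited in the abstract comes from.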
Although the latest high-end smartphones have powerful CPUs and GPUs, running deeper convolutional neural networks (CNNs) for complex tasks such as ImageNet classification on mobile devices is challenging. To deploy deep CNNs on mobile devices, we present a simple and effective scheme to compress the entire CNN, which we call one-shot whole-network compression. The proposed scheme consists of three steps: (1) rank selection with variational Bayesian matrix factorization, (2) Tucker decomposition on the kernel tensor, and (3) fine-tuning to recover the accumulated loss of accuracy, and each step can be easily implemented using publicly available tools. We demonstrate the effectiveness of the proposed scheme by testing the performance of various compressed CNNs (AlexNet, VGG-S, GoogLeNet, and VGG-16) on a smartphone. Significant reductions in model size, runtime, and energy consumption are obtained, at the cost of a small loss in accuracy. In addition, we address an important implementation-level issue with 1×1 convolution, which is a key operation of the inception module of GoogLeNet as well as of CNNs compressed by our proposed scheme.
Convolutional neural networks are built upon the convolution operation, which extracts informative features by fusing spatial and channel-wise information together within local receptive fields. In order to boost the representational power of a network, much existing work has shown the benefits of enhancing spatial encoding. In this work, we focus on channels and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We demonstrate that by stacking these blocks together, we can construct SENet architectures that generalise extremely well across challenging datasets. Crucially, we find that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at slight computational cost. SENets formed the foundation of our ILSVRC 2017 classification submission which won first place and significantly reduced the top-5 error to 2.251%, achieving a 25% relative improvement over the winning entry of 2016.
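A minimal NumPy sketch of the SE block's squeeze-excitation-scale pipeline follows; the feature-map sizes and the reduction ratio of 4 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, W1, W2):
    """Squeeze-and-Excitation on an (N, C, H, W) feature map.

    Squeeze: global average pooling over H and W -> (N, C).
    Excitation: bottleneck FC -> ReLU -> FC -> sigmoid per-channel gates.
    Scale: reweight each channel of the input by its gate.
    """
    z = x.mean(axis=(2, 3))                  # squeeze
    s = sigmoid(np.maximum(z @ W1, 0) @ W2)  # excitation, gates in (0, 1)
    return x * s[:, :, None, None]           # scale

N, C, H, W = 2, 8, 4, 4
reduction = 4
x = rng.standard_normal((N, C, H, W))
W1 = 0.1 * rng.standard_normal((C, C // reduction))
W2 = 0.1 * rng.standard_normal((C // reduction, C))
y = se_block(x, W1, W2)
print(y.shape)  # same shape as the input: (2, 8, 4, 4)
```

Because the gates lie strictly in (0, 1), the block can only attenuate channels, which is how it recalibrates channel-wise responses at a slight computational cost.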