
Training Modern Deep Neural Networks for Memory-Fault Robustness

Ghouthi Boukli Hacene¹,², François Leduc-Primeau³, Amal Ben Soussia², Vincent Gripon¹,² and François Gagnon⁴

¹Université de Montréal, MILA   ²IMT Atlantique, Lab-STICC   ³Dept. of Electrical Engineering, École Polytechnique de Montréal   ⁴Dept. of Electrical Engineering, École de technologie supérieure de Montréal

Abstract—Because deep neural networks (DNNs) rely on a large number of parameters and computations, their implementation in energy-constrained systems is challenging. In this paper, we investigate the solution of reducing the supply voltage of the memories used in the system, which results in bit-cell faults. We explore the robustness of state-of-the-art DNN architectures to such faults and propose a regularizer meant to mitigate their effect on accuracy. Our experiments clearly demonstrate the benefit of operating the system in a faulty regime to save energy without reducing accuracy.

I. INTRODUCTION

Deep Neural Networks (DNNs) [1] are the gold standard for many challenges in machine learning. Thanks to their large number of trainable parameters, DNNs can capture the complexity of large training datasets and generalize to previously unseen examples.

Many applications of DNNs are in the field of embedded systems. Examples include the monitoring of health signals, human-machine interfaces, autonomous drones, and smartphone applications. Many such embedded applications cannot rely on cloud-based processing because of stringent latency constraints or privacy issues. Even when cloud processing is an option, processing on-device or at the network edge can be useful to save network bandwidth. The energy consumption of the inference task is thus a major concern. Unfortunately, because state-of-the-art DNN architectures are composed of a large number of trained parameters, the inference step typically requires significant energy to achieve accurate results on challenging tasks, with a large part of the energy complexity being associated with the memory accesses required to retrieve the parameters and save temporary results.

Since off-chip memory accesses consume significant energy, a first step for reducing energy consumption consists in storing all parameters and temporary results on the same chip as the hardware accelerator, using static random access memories (SRAMs), or potentially embedded dynamic RAMs [2]. However, even in this case, the energy consumed by memory accesses still represents 30–60% of the total energy [3].

An effective way of lowering the energy consumption of both memory and logic circuits is to reduce the supply voltage, but this has the effect of increasing the sensitivity of the circuits to fabrication variations, causing bit-cell failures in SRAMs. When approaching the minimum energy operating point of SRAMs, the failure rates increase by several orders of magnitude compared to operating at the nominal supply [4]. However, even such large bit-cell failure rates are not necessarily catastrophic if appropriate mechanisms are in place to safeguard the operation of the system.

DNNs naturally exhibit a limited amount of fault tolerance, as noted for instance in [5], [6], and there is a growing body of work that studies the operation of DNN inference hardware built using faulty memories. We review several contributions in Section II. The aim of this paper is to investigate the ability to decrease the energy consumption of DNN accelerators by allowing the memories used for storing weights and activations to operate in a faulty regime, thus introducing deviations in the stored values. We rely on simple but realistic energy-deviation models to explore the impact of memory failures on classification accuracy, and ultimately on energy consumption.

We quantify the impact on robustness of several design aspects of state-of-the-art deep architectures in order to identify whether these aspects should be targeted when designing robust architectures. Specifically, we consider the choice of general architecture, how the depth of a layer impacts its robustness, and the impact of faults occurring in the storage of weights or of neuron activations. Interestingly, we find that different architectures provide varying degrees of robustness.

We then consider whether faulty operation can lead to a reduction in power consumption. Importantly, we compare the energy consumption with that of a reliable reference implementation achieving the same application performance. We show that using a faulty implementation to reduce energy consumption at the cost of a reduction in accuracy is not necessarily beneficial, even when the loss in accuracy appears small. Indeed, for state-of-the-art architectures, accepting even a 1% reduction in accuracy can significantly reduce the number of parameters required by a reliable implementation. It is thus essential to evaluate the improvement provided by a faulty architecture at the same accuracy. Nonetheless, we show that faulty operation can reduce energy consumption when the fault statistics are taken into account during training.

The outline of the paper is as follows. Section II briefly reviews related work. Section III introduces the deviation models, which represent the impact of circuit faults on the algorithm. Section IV presents an exploration of the design space for faulty-memory implementations of modern DNNs. Section V proposes a regularizer to increase the robustness of DNNs to deviations. Section VI provides some conclusions.

II. RELATED WORK

The idea of exploiting fault tolerance to improve the energy efficiency of neural networks has attracted a significant number of contributions. An early investigation of the effect of transistor-level defects on neural networks was performed in [7]. More recently, circuit-level methods for improving the application performance of faulty implementations have been proposed. One approach consists in using Razor flip-flops to detect faults and selectively apply a compensation mechanism. When memory faults can be detected at the bit level, a bit masking technique can be applied to ensure that errors always reduce the magnitude of weights, helping to decrease the impact of the errors on performance [8], [9]. Similarly, Razor flip-flops can be used to compensate for timing violations occurring in the datapath by dropping the next operation, which effectively sets its weight parameter to zero [10]. Finally, a low-precision replica can be added to computation units to bound the maximum error that can be introduced by a faulty processing unit [11].

To the best of our knowledge, few papers investigate the effect of training deep architectures to increase fault robustness. One notable exception is [3], which proposes modifying the training procedure to take into account bit flips occurring in SRAMs, and presents results on the MNIST benchmark [12]. The effect of faults occurring in the storage of the input is also considered in [13], and [14] proposes on-chip learning for support-vector machines while decreasing the learning effort using active learning. Finally, a slightly different problem is considered in [15], [16], where the network is trained to compensate for known defect locations.

Another line of work consists in compressing models to reduce memory usage and the number of computations. There are mainly three ways to achieve this. The first is to quantize weights, using in the extreme case only one bit per weight and per activation [17]–[19]. While this process has proven very efficient on older and somewhat redundant architectures, it can drastically affect accuracy when performed on already compressed architectures. A second way to compress DNNs is to prune the weights, significantly reducing the number of parameters to be stored [20]. A last line of work consists in factorizing weights, so that they can be reused to perform multiple computations throughout the processing of an input [21]–[23]. However, in modern architectures the weights account for only part of the memory usage, since the neuron activations can be just as numerous, or even more numerous if the batch size is large, that is, if several inputs are processed in parallel.

III. ENERGY-DEVIATION MODEL

We focus on the energy consumed by memory accesses, and assume that the amount of energy required to perform an inference task is proportional to the number of accesses. We thus define a base energy metric E_o that is the sum of the number of parameters and of the number of activation values generated during the inference.
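For instance, using the memory-access counts of Table I, this metric reduces to a simple sum (a trivial sketch; counts are in units of 10^6 accesses):

```python
# Base energy metric E_o = (#parameters + #activations) per inference,
# using the memory-access counts of Table I (units of 10^6 accesses).
accesses = {
    "PreActResNet18": (11.2, 0.55),
    "MobileNetV2":    (2.30, 1.53),
    "SENet18":        (11.3, 0.86),
    "ResNet18":       (11.2, 0.56),
}
E_o = {net: weights + acts for net, (weights, acts) in accesses.items()}
print(E_o["PreActResNet18"])  # 11.75 (x 10^6 accesses)
```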

To decrease the energy consumption of on-chip memories, we consider reducing the supply voltage, which in turn causes some bit cells to fail. In order to investigate the general behavior of DNNs implemented with faulty memory, we need a model linking the bit-cell fault probability p and the energy consumed by memory accesses. We denote by 0 ≤ η ≤ 1 the normalized energy consumption of the memory, where the normalization is with respect to the energy consumption of the reliable memory (such that the energy is given by ηE_o). Note that we can obtain a simple upper bound for p from the fact that instead of using a faulty memory, we could store only a fraction η of the data while declaring the missing bit-cells as faulty, which yields a linearly decreasing p(η).

Based on reliability data published in [4, Fig. 7], we will assume that the energy-reliability function takes the exponential form

p(η) = exp(−aη).  (1)

In order to obtain a specific value of parameter a for illustrative purposes, we select a to fit the energy data reported in [24, Fig. 1] and the reliability from [4, Fig. 7] for 65 nm CMOS SRAM cells at VDD ∈ {0.5, 1.1} V. Performing the fit by minimizing the sum of the relative squared errors yields a = 12.8. Specific energy gains will vary based on the value of a, but in this paper we are only interested in identifying general trends.
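As a concrete illustration, here is a minimal Python sketch of this energy-reliability model; the value a = 12.8 is the fit reported above, and the example fault rate is arbitrary:

```python
import numpy as np

A = 12.8  # fitted model parameter from Section III

def fault_probability(eta, a=A):
    """Bit-cell fault probability under the model p(eta) = exp(-a * eta)."""
    return np.exp(-a * eta)

def normalized_energy(p, a=A):
    """Inverse of (1): normalized memory energy needed to keep the
    bit-cell fault probability at p."""
    return -np.log(p) / a

# At full energy (eta = 1), the model gives p ~ 2.8e-6.
print(fault_probability(1.0))
# Tolerating p = 1e-2 would require only eta ~ 0.36, i.e. roughly a
# 2.8x reduction of the memory energy under this model.
print(normalized_energy(1e-2))
```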

The manner in which memory faults introduce deviations during inference depends on the strategy being used to cope with faults. We consider the case where bit-cell faults can be detected, and use the bit masking (BM) approach proposed in [8]. When a fault is detected on the sign bit of a value, this value is replaced with zero. In the case of failures on any other bits, the affected bit values are replaced with the sign bit, causing the value to deviate towards zero. We consider that all bit cells have an equal fault probability p. When using the deviation model in simulations, we assume that values are quantized on 8 bits. However, for a fair comparison with the reliable implementations that use a floating-point representation, we compute the deviation in the quantized domain, but apply it on the floating-point representation. Unless otherwise mentioned, we consider that faults affect both the weights and the neuron activations. Note that activations are known to be positive since they are generated by a ReLU function. Therefore, we assume that their sign bit cannot be affected.
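The following is a minimal sketch of how this BM deviation model can be simulated. It is a hypothetical implementation, assuming an 8-bit two's complement quantization over the dynamic range of the tensor, and is not the authors' exact code:

```python
import numpy as np

def bm_deviation(x, p, rng=None):
    """Simulate the bit-masking (BM) deviation model of [8] on a float
    array x: values are quantized on 8 bits (two's complement), each bit
    cell fails independently with probability p, a faulty sign bit zeroes
    the value, and any other faulty bit is replaced by the sign bit,
    pulling the value towards zero. The deviation is computed in the
    quantized domain but applied to the floating-point values."""
    if rng is None:
        rng = np.random.default_rng()
    scale = np.max(np.abs(x)) / 127.0  # assumes x is not all-zero
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    bits = q.view(np.uint8)                     # raw 8-bit patterns
    sign = (bits >> np.uint8(7)) & np.uint8(1)  # sign bit of each value
    for b in range(7):                          # magnitude bits 0..6
        mask = np.uint8(1 << b)
        faulty = rng.random(bits.shape) < p
        # replace the faulty bit with the sign-bit value
        forced = (bits & ~mask) | (sign * mask)
        bits = np.where(faulty, forced, bits)
    out = bits.view(np.int8).astype(np.float64) * scale
    # a faulty sign bit replaces the whole value with zero
    return np.where(rng.random(bits.shape) < p, 0.0, out)
```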

In this work, we use our deviation model during the training phase to increase the robustness of networks and thus their energy efficiency. Because training is computationally intensive, we propose to simplify the BM deviation model used during training to speed up the process. Since the BM approach always causes values to deviate towards zero, we propose to approximate it using a deviation model that will be referred to as the erasure model, in which each value has a probability p_e of being set to zero. We then need to choose p_e to best approximate the effect of the BM model. We can first note that in the case of weight parameters, the BM model sets the faulty value to zero in case of a sign-bit fault, which occurs with probability p. Therefore, we clearly need p_e > p.

TABLE I
NUMBER OF MEMORY ACCESSES AND ACCURACY BY ARCHITECTURE

Architecture         Parameters   Activations  Accuracy
PreActResNet18 [27]  11.2×10^6    0.55×10^6    94.87%
MobileNetV2 [28]     2.30×10^6    1.53×10^6    93.80%
SENet18 [29]         11.3×10^6    0.86×10^6    94.77%
ResNet18 [30]        11.2×10^6    0.56×10^6    94.86%

[Figure: test set accuracy (%) versus bit-cell fault probability p (log scale, 10^-3 to 10^-2), for PreActResNet18, MobileNetV2, SENet18 and ResNet18.]
Fig. 1. Impact of the architecture on the robustness under BM deviations.

During training, this erasure process is similar to dropout [25], but it is used to increase the robustness of networks rather than to prevent overfitting. To find the best choice of p_e to approximate the BM model, we evaluate the performance of both models on the test set and choose the value of p_e that best predicts the accuracy of the network under the BM model.
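A minimal sketch of the erasure model follows; the mask-based formulation is one natural way to realize it, not necessarily the authors' implementation:

```python
import numpy as np

def erasure_deviation(x, p_e, rng=None):
    """Erasure deviation model: each value of x is independently set to
    zero with probability p_e. Section IV-B suggests p_e ~ 2p as a proxy
    for the BM model when only the weights are affected."""
    if rng is None:
        rng = np.random.default_rng()
    return x * (rng.random(x.shape) >= p_e)
```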

IV. DESIGN-SPACE EXPLORATION FOR FAULTY IMPLEMENTATIONS

A. Choice of architecture and dataset

We perform experiments using the CIFAR10 dataset [26], made of tiny color images of 32×32 pixels. We first compare four architectures, namely PreActResNet18 [27], MobileNetV2 [28], SENet18 [29] and ResNet18 [30], which are all modern architectures achieving good accuracy on CIFAR10. Table I shows, for each architecture, the number of weights (parameters) and activation values of neurons that must be retrieved from memory for processing one input, and the accuracy achieved by that architecture.

In Fig. 1, we compare the robustness of the above-mentioned architectures when the parameters and activations are affected by the BM deviation model. We observe that some architectures are inherently more robust than others, and that this does not depend solely on the global number of parameters. In Fig. 2, we plot the accuracy in terms of the energy ηE_o per inference, where the base energy E_o corresponds to the sum of the parameter and activation columns of Table I, and the fault probability is obtained from the normalized energy η using (1). We observe that PreActResNet18 provides a very interesting trade-off between accuracy, memory accesses and robustness to BM. Therefore, we choose to focus on this architecture for the remaining experiments.

[Figure: test set accuracy (%) versus normalized energy (0.2 to 1), for PreActResNet18, MobileNetV2, SENet18 and ResNet18.]
Fig. 2. Energy consumption of different architectures under BM deviations.

[Figure: test set accuracy (%) versus fault probability p (log scale, 10^-3 to 10^-2), comparing the erasure and bit-mask models, both overall and with only the weights affected.]
Fig. 3. Impact of memory faults on accuracy for different deviation models.

B. Comparison of the BM and erasure models

As motivated in Section III, we are interested in comparing the effects of BM and erasures on the chosen architecture. Results are depicted in Fig. 3. Since the BM model affects weights and activations differently and since PreActResNet18 has about 20× more weights than activation values, we focus on matching the accuracy of the two models when only the weights are affected by deviations. We observe for this case that the BM and erasure models have a similar effect, provided that p_e = 2p, suggesting that using erasures as a proxy to model the deviations induced by BM is a reasonable option. This relation will be used in Section V to train networks to be more resilient to BM deviations.

C. Relative importance of layer depth

In a new series of experiments, we aim at identifying the relative robustness of various parts of the architecture under BM deviations. To this end, we introduce deviations on only a portion of the network. Since PreActResNet18 is composed of 4 sequential blocks (made of convolutional layers and shortcuts), we apply BM deviations to the weights and activations of only one block at a time. Results are depicted in Fig. 4. We observe that all parts of the network are sensitive to deviations.

[Figure: test set accuracy (%, 94 to 94.8) versus fault probability p (log scale, 10^-3 to 10^-2), with BM deviations applied to Block 1, 2, 3 or 4 only.]
Fig. 4. Impact on accuracy of BM deviations applied to different stages of the network, "Block 1" being the first and "Block 4" the last.

Interestingly, in the region of small accuracy degradation shown in Fig. 4, robustness increases monotonically with the depth of the block. We thus consider exploiting the varying robustness of the layers to improve energy consumption by assigning different operating points to each block. Denoting by p_Bi the fault probability assigned to block i, we note from Fig. 4 that at a high accuracy of 94.8%, p_B4 = 5p_B3 = 5p_B2 = 10p_B1. The number of parameters associated with each block, which in order of block is [1.5, 5.2, 21, 84]×10^5, also varies over a wide range. Following intuition, blocks that are more robust also have more parameters. As shown in Fig. 5 (curves labeled "Diff. Fault."), applying this fault-rate policy significantly improves the energy efficiency of the standard network.
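To make the policy concrete, the following sketch computes the overall normalized memory energy for a given p_B4, under two assumptions: the per-block parameter counts above dominate each block's memory traffic, and the energy model of Section III applies per block:

```python
import numpy as np

A = 12.8  # energy-reliability model parameter from Section III

def blockwise_energy(p_b4, a=A):
    """Overall normalized memory energy for PreActResNet18 under the
    blockwise policy p_B4 = 5*p_B3 = 5*p_B2 = 10*p_B1, weighting each
    block by its parameter count."""
    params = np.array([1.5, 5.2, 21.0, 84.0]) * 1e5    # per block
    p = p_b4 * np.array([0.1, 0.2, 0.2, 1.0])          # p_B1 .. p_B4
    eta = -np.log(p) / a        # invert p_i = exp(-a * eta_i)
    return float(np.sum(params * eta) / np.sum(params))

# Example: overall eta when the last (most robust) block runs at p_B4 = 0.01
print(blockwise_energy(0.01))
```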

D. Impact of the number of parameters in the architecture

The number of parameters can easily be adapted by modifying the number of feature maps in the convolutional layers. If the number of feature maps of each convolutional layer is multiplied by k, then the total number of parameters will be roughly multiplied by k², as the number of parameters in a convolutional layer increases linearly with both the number of input feature maps and the number of output feature maps.
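This quadratic scaling is easy to verify directly (a trivial sketch; the filter size and channel counts are illustrative):

```python
def conv_params(c_in, c_out, kh=3, kw=3):
    """Weight count of a 2-D convolutional layer (biases ignored)."""
    return c_in * c_out * kh * kw

# Halving every layer's feature maps halves both c_in and c_out,
# dividing the parameter count by roughly 2**2 = 4:
print(conv_params(64, 64))   # 36864
print(conv_params(32, 32))   # 9216
```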

We train two variants of the PreActResNet architecture in which the original number F of feature maps is multiplied by 1/2 and 1/√2. These networks are used to provide a reference for the performance achieved with faulty implementations. The F/2 and F/√2 networks achieve respectively an accuracy of 93.45% and 94.41% under reliable implementations, illustrating the fact that significant energy reductions can be obtained easily if a reduced accuracy is acceptable.

V. PROPOSED REGULARIZER

All previous experiments confirm that modern DNN architectures can tolerate some amount of deviations. However, in all the scenarios considered, we observe a sharp drop in performance as soon as the defect probability p becomes too large or the energy too small. To improve the robustness to deviations, we consider training the networks under the same conditions in which they are used, which means that we apply erasures during the forward pass of the training phase. We call this method the erasure regularizer. Note that the reason we use erasure rather than BM deviations is to speed up the training process.
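A minimal sketch of what such training could look like is given below: a hypothetical PyTorch-style module, not the authors' code; the wrapped layer and the p_e value are placeholders.

```python
import torch

class ErasureRegularizer(torch.nn.Module):
    """Applies the erasure model to the output of a layer during the
    forward pass of training: each activation is independently set to
    zero with probability p_e. Unlike dropout, no 1/(1-p) rescaling is
    applied here, since the goal is to emulate memory faults rather
    than to prevent overfitting."""
    def __init__(self, layer, p_e):
        super().__init__()
        self.layer = layer
        self.p_e = p_e

    def forward(self, x):
        y = self.layer(x)
        if self.training:
            y = y * (torch.rand_like(y) >= self.p_e)
        return y

# Hypothetical usage: wrap a convolution so its activations see erasures
# during training (weights could be treated similarly).
layer = ErasureRegularizer(torch.nn.Conv2d(64, 64, 3, padding=1), p_e=0.02)
```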

[Figure: test set accuracy (%) versus normalized energy (0 to 1), comparing a reliable implementation, faulty implementations with F feature maps (with and without the erasure regularizer, and with and without differentiated blockwise fault rates, "Diff. Fault."), and faulty implementations with F/2 and F/√2 feature maps.]
Fig. 5. Energy consumption of the PreActResNet18 architecture under BM deviations. Each faulty implementation curve corresponds to a fixed network size, with the number of feature maps shown within parentheses.

In Fig. 5, we plot the accuracy of the networks as a function of the energy they use. We compare reliable implementations of networks with varying numbers of parameters with the performance obtained when reducing the supply voltage of memories. For the specific energy model discussed in Section III, the best energy reduction obtained by the faulty implementations with F feature maps is 1.5× for the network with standard training, achieved at an accuracy of 94.76% and a fault rate of p = 0.001, while the best energy reduction obtained using the erasure regularizer is 2.3× at an accuracy of 94.8% and p = 0.01. Furthermore, additional gains can be obtained by combining the erasure regularizer with the blockwise reliability assignment of Sect. IV-C. We thus see that training the network for robustness using the erasure regularizer can significantly improve the energy reduction obtained from faulty operation under the bit-masking model, at equal accuracy. As discussed in Sect. IV-B, it is important to perform the training with the appropriate p_e parameter: using an erasure regularizer with p_e = p did not yield an improvement in robustness.

VI. CONCLUSION

In this work, we explored the possibility of exploiting the fault tolerance of deep neural networks to reduce the energy consumption of on-chip memories. We showed that in some conditions, reducing the supply voltage can result in better accuracy for the same energy consumption compared to reducing the number of parameters. We showed that a deviation model corresponding to detectable bit-cell faults combined with a bit masking technique can be replaced by a simpler erasure model to speed up the training, and that the use of this regularizer during the training phase makes it possible to further reduce the energy with no impact on accuracy.

Finding the architecture that achieves the best accuracy for a given energy budget remains a largely open question, considering the very large number of possible solutions. As such, a more systematic study of the combined impact of pruning, quantizing, factorizing, reducing the number of parameters, tweaking hyperparameters and reducing the supply voltage is a very promising direction for future work.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[2] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2014, pp. 609–622.
[3] S. Kim, P. Howe, T. Moreau, A. Alaghi, L. Ceze, and V. S. Sathe, "Energy-efficient neural network acceleration in the presence of bit-level memory errors," IEEE Trans. on Circuits and Systems I: Regular Papers, pp. 1–14, 2018.
[4] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, "Near-threshold computing: Reclaiming Moore's law through energy efficient integrated circuits," Proc. of the IEEE, vol. 98, no. 2, pp. 253–266, Feb. 2010.
[5] J.-C. Vialatte and F. Leduc-Primeau, "A study of deep learning robustness against computation failures," in Proc. 9th Int. Conf. on Advanced Cognitive Technologies and Applications, Feb. 2017.
[6] X. Jiao, M. Luo, J. Lin, and R. K. Gupta, "An assessment of vulnerability of hardware neural networks to dynamic voltage and temperature variations," in 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2017, pp. 945–950.
[7] O. Temam, "A defect-tolerant accelerator for emerging high-performance applications," in 39th Annual Int. Symp. on Computer Architecture (ISCA), June 2012, pp. 356–367.
[8] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in Proc. 43rd Int. Symp. on Computer Architecture (ISCA'16). Piscataway, NJ, USA: IEEE Press, 2016, pp. 267–278.
[9] P. N. Whatmough, S. K. Lee, H. Lee, S. Rama, D. Brooks, and G. Wei, "A 28nm SoC with a 1.2GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2017, pp. 242–243.
[10] J. Zhang, K. Rangineni, Z. Ghodsi, and S. Garg, "ThunderVolt: Enabling aggressive voltage underscaling and timing error resilience for energy efficient deep neural network accelerators," CoRR, vol. abs/1802.03806, 2018. [Online]. Available: http://arxiv.org/abs/1802.03806
[11] Y. Lin, S. Zhang, and N. R. Shanbhag, "Variation-tolerant architectures for convolutional neural networks in the near threshold voltage regime," in 2016 IEEE International Workshop on Signal Processing Systems (SiPS), Oct. 2016, pp. 17–22.
[12] Y. LeCun, C. Cortes, and C. J. Burges, "The MNIST database of handwritten digits," 1998.
[13] L. Yang and B. Murmann, "SRAM voltage scaling for energy-efficient convolutional neural networks," in 18th Int. Symp. on Quality Electronic Design (ISQED), Mar. 2017, pp. 7–12.
[14] Z. Wang, K. H. Lee, and N. Verma, "Overcoming computational errors in sensing platforms through embedded machine-learning kernels," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 8, pp. 1459–1470, Aug. 2015.
[15] C. Liu, M. Hu, J. P. Strachan, and H. Li, "Rescuing memristor-based neuromorphic design with high defects," in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), June 2017, pp. 1–6.
[16] L. Xia, M. Liu, X. Ning, K. Chakrabarty, and Y. Wang, "Fault-tolerant training enabled by on-line fault detection for RRAM-based neural computing systems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1–1, 2018.
[17] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.
[18] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training neural networks with weights and activations constrained to +1 or −1," arXiv preprint arXiv:1602.02830, 2016.
[19] G. Soulié, V. Gripon, and M. Robert, "Compression of deep neural networks on the fly," in International Conference on Artificial Neural Networks. Springer, 2016, pp. 153–160.
[20] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[21] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[22] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications," arXiv preprint arXiv:1511.06530, 2015.
[23] L. Hou, Q. Yao, and J. T. Kwok, "Loss-aware binarization of deep networks," arXiv preprint arXiv:1611.01600, 2016.
[24] G. Chen, D. Sylvester, D. Blaauw, and T. Mudge, "Yield-driven near-threshold SRAM design," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 11, pp. 1590–1598, Nov. 2010.
[25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[26] A. Krizhevsky, V. Nair, and G. Hinton, "The CIFAR-10 dataset," online: http://www.cs.toronto.edu/~kriz/cifar.html, 2014.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
[28] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[29] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," arXiv preprint arXiv:1709.01507, vol. 7, 2017.
[30] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.