Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing
the Early Softmax Saturation
Binghui Chen¹, Weihong Deng¹, Junping Du²
¹ School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China
² School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
chenbinghui@bupt.edu.cn, whdeng@bupt.edu.cn, junpingd@bupt.edu.cn
Abstract
Over the past few years, softmax and SGD have become a commonly used component and the default training strategy in CNN frameworks, respectively. However, when optimizing CNNs with SGD, the saturation behavior behind softmax often gives us an illusion that training is going well, and is therefore overlooked. In this paper, we first emphasize that the early saturation behavior of softmax impedes the exploration of SGD, which is sometimes the reason a model converges to a bad local minimum, and then propose Noisy Softmax to mitigate this early saturation issue by injecting annealed noise into softmax during each iteration. This noise-injection operation aims at postponing the early saturation and producing continuous gradient propagation, which significantly encourages the SGD solver to be more exploratory and helps it find a better local minimum. This paper empirically verifies the superiority of early softmax desaturation, and our method indeed improves the generalization ability of CNN models through regularization. We experimentally find that this early desaturation helps optimization in many tasks, yielding state-of-the-art or competitive results on several popular benchmark datasets.
1. Introduction
Recently, deep convolutional neural networks (DCNNs)
have taken the computer vision field by storm, significantly
improving the state-of-the-art performances in many visual
tasks, such as face recognition [43, 44, 33, 36], large-scale
image classification [22, 39, 46, 11, 13], and fine-grained
object classification [30, 17, 20, 48]. Meanwhile, the softmax layer and the training strategy of SGD together with back-propagation (BP) have become the default components, and are generally applied in most of the aforementioned works.
It is widely observed that when optimizing with SGD and BP, smooth and unimpeded gradient propagation is crucial to improving the training of DCNNs. For example, replacing the sigmoid activation function with piecewise-linear activation functions such as ReLU and PReLU [12] handles the gradient-vanishing problem caused by sigmoid saturation and allows the training of much deeper networks. Interestingly, the softmax activation function (illustrated in Figure 1) implicitly behaves like the sigmoid function due to their similar formulations (shown in Sec. 3), and exhibits the same saturation behavior when its input is large. However, many take the softmax activation for granted, and the problem behind its saturation behavior is overlooked amid the illusion created by the performance improvements of DCNNs.

Figure 1. Decomposition of a typical softmax layer in a DCNN. It can be rewritten into three parts: a fully-connected component, the softmax activation and the cross-entropy loss.
In standard SGD, the saturation behavior of softmax turns up when its output is very close to the ground truth, which is certainly the goal of model training. However, in some ways it is a barrier to improving the generalization ability of CNNs, especially when it shows up early (inopportunely). Concretely, an input instance stops contributing gradients to BP early once its softmax output is prematurely saturated, yielding short-lived gradient propagation that is not enough for robust learning. In this case, the learning process with SGD and BP can hardly explore further due to poor gradient propagation and parameter updates. We define this saturation behavior as individual saturation and the corresponding individual as a saturated one. As training goes on, the number of non-saturated contributing training samples gradually decreases and robust learning of the network is impeded. This is sometimes a reason for the algorithm to fall into a bad local minimum¹ and have difficulty escaping, and the problem of over-fitting then turns up. To this end, we need to give SGD chances to explore more of the parameter space, and early individual saturation is undesired.

¹ For simplicity, we use a local or 'global' minimum to denote a neighbouring region rather than a single point.
In this paper, we propose Noisy Softmax, a novel technique of early softmax desaturation, to address the aforementioned issue. This is mainly achieved by injecting annealed noise directly into the softmax activations during each iteration. In other words, Noisy Softmax allows SGD to escape from a bad local minimum and explore more by postponing the early individual saturation. Furthermore, it improves the generalization ability of the system by reducing over-fitting as a direct consequence of more exploration. The main contributions of this work are summarized as follows:

- We provide an insight into softmax saturation, interpreted as individual saturation: early individual saturation produces short-lived gradient propagation, which is poor for the robust exploration of SGD and further causes over-fitting unintentionally.

- We propose Noisy Softmax to produce rich and continuous gradient propagation by injecting annealed noise into the softmax activations. It allows the 'global' convergence of the SGD solver and aids generalization by reducing over-fitting. To our knowledge, it is the first attempt to address the early saturation behavior of softmax by adding noise.

- Noisy Softmax can be easily used as a drop-in replacement for standard softmax and optimized with standard SGD. It can also be combined with other performance-improving techniques, such as neural activation functions and network architectures.

- Extensive experiments have been performed on several datasets, including MNIST [25], CIFAR10/100 [21], LFW [16], FGLFW [54] and YTF [49]. The impressive results demonstrate the effectiveness of Noisy Softmax.
2. Related Work
Many promising techniques have been developed, such
as novel network structures [29, 13, 41], non-linear activa-
tion functions [12, 7, 6, 38], pooling strategies [11, 8, 53]
and objective loss functions [43, 36], etc.
These approaches are mostly optimized with SGD and back-propagation. In standard SGD, we use the chain rule to compute and propagate the gradients. Thus, any saturation behavior of neuron units or layer components² is undesired, because the training of deeper frameworks relies on a smooth and free flow of gradient information. An early solution was to replace the sigmoid function with a non-linear piecewise activation function [22]; this is neuron desaturation. Skip connections between different layers exponentially expand the paths of propagation [41, 11, 13, 15, 14]; these belong to layer desaturation, since the forward and backward information can be directly propagated from one layer to any other layer without gradient vanishing. In contrast, since only the early saturation behavior is harmful rather than saturation in general, we focus on the early desaturation of softmax, which has not been investigated, and we achieve this by injecting noise explicitly into the softmax activations.

² The saturation of a layer refers to gradient vanishing at a certain layer during back-propagation.
There are some other works related to noise injection. Adding noise to ReLU has been used to encourage units to explore more in Boltzmann machines and feed-forward networks [4, 1]. Adding noise to sigmoid makes it possible to train with a much wider family of activation functions than before [10]. Adding weight noise [42], adaptive weight noise [9, 3] and gradient noise [32] also improves learning. Adding annealed noise can help the solver escape from a bad local minimum and find a better one. We follow these inspiring ideas to address individual saturation and encourage SGD to explore more. The main differences are that we apply noise injection to CNNs and impose the noise on the loss layer instead of on earlier layers. Different from adding noise on the loss layer as in DisturbLabel [51], a method that, counter-intuitively, disturbs the labels yet indeed improves model performance, our work has the clear objective of delaying early softmax saturation by explicitly injecting noise into the softmax activations.

Another way of injecting noise is to randomly transform the input data, which is commonly referred to as data augmentation, such as randomly cropping, flipping [22, 50], rotating [24, 23] and jittering the input data [34, 35]. Our work can also be interpreted as a form of data augmentation, which will be discussed in Sec. 5.4.
3. Early Individual Saturation
In this section, we give a toy example to describe the early individual saturation of softmax, which is often overlooked, and analyse its impact on generalization. Define the i-th input data $x_i$ with the corresponding label $y_i$, $y_i \in [1 \cdots C]$. Processing training images with a standard DCNN, we obtain the cross-entropy loss and its partial derivative as follows:

$$L = -\frac{1}{N}\sum_i \log P(y_i|x_i) = -\frac{1}{N}\sum_i \log \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \quad (1)$$

$$\frac{\partial L}{\partial f_j} = P(y_i = j|x_i) - 1\{y_i = j\} = \frac{e^{f_j}}{\sum_k e^{f_k}} - 1\{y_i = j\} \quad (2)$$

where $f_j$ refers to the j-th element of the softmax input vector $f$, $j \in [1 \cdots C]$, and $N$ is the number of training images. $1\{\text{condition}\} = 1$ if the condition is satisfied and $1\{\text{condition}\} = 0$ otherwise.
To simplify our analysis, we consider the problem of binary classification³, where $y_i \in \{1, 2\}$. Under the binary scenario, we plot the softmax activation for class 1 in Figure 2. Intuitively, the softmax activation behaves just like the sigmoid function. The standard softmax encourages $f_1 > f_2$ in order to classify class 1 correctly, and it can be regarded as successful when its output $P(y_i = 1|x_i) = \frac{1}{1 + e^{-(f_1 - f_2)}}$ is very close to 1. In this case, the softmax output of data $x_i$ is saturated, and we define this as individual saturation. Of course, making the softmax output close to 1 is the ultimate goal of CNN training. However, we would like to achieve it at the end of the SGD exploration, not at the beginning or in the middle stage. When optimizing a CNN with gradient-based methods such as SGD, a prematurely saturated individual stops contributing gradients to back-propagation early because its gradients are negligible, i.e. $P(y_i = 1|x_i) \to 1$, $\frac{\partial L}{\partial f_{y_i}} \to 0$ (see Eq. 2). As the number of saturated individuals rises, the amount of contributing data decreases; SGD has few chances to move around and is more likely to converge at a local minimum, so the model easily over-fits and requires extra data to recover. In short, early saturated instances introduce short-lived gradient propagation, which is not enough to help the system converge at a 'global minimum' (i.e. a better local minimum), so early individual saturation is undesired.

³ Multi-class classification complicates our analysis but has the same mechanism as the binary scenario.
Figure 2. The softmax activation function $\frac{1}{1 + e^{-(f_1 - f_2)}}$. The x-axis represents the difference between $f_1$ and $f_2$; the y-axis is the posterior probability.
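To make the saturation effect concrete, the following NumPy sketch (our own illustration, not code from the paper) evaluates the binary softmax output and the gradient of Eq. 2 for growing margins $f_1 - f_2$; once the output saturates near 1, the gradient with respect to $f_{y_i}$ effectively vanishes.

```python
import numpy as np

# Binary softmax output for class 1 and the gradient of the cross-entropy
# loss w.r.t. f_1 (Eq. 2 with y_i = 1): dL/df_1 = P(y=1|x) - 1.
def softmax_binary(f1, f2):
    z = np.array([f1, f2], dtype=np.float64)
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e[0] / e.sum()            # equals 1 / (1 + exp(-(f1 - f2)))

for margin in [0.0, 2.0, 4.0, 8.0, 16.0]:
    p = softmax_binary(margin, 0.0)
    grad = p - 1.0                   # gradient w.r.t. f_{y_i}, Eq. 2
    print(f"f1 - f2 = {margin:5.1f}  P(y=1|x) = {p:.6f}  dL/df1 = {grad:.6f}")
# As the margin grows, the probability saturates at 1 and the gradient
# shrinks toward 0, i.e. the sample stops contributing to back-propagation.
```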
4. Noisy Softmax
Based on the analysis in Section 3, the short-lived gradient propagation caused by early individual saturation cannot guide robust learning. Thus the intuitive solution is to set up a 'barrier' along the way to saturation, so as to postpone the early saturation behavior and produce rich and continuous gradient propagation. Particularly, for a training data point $(x_i, y_i)$, a simple way to achieve this is to artificially reduce its softmax input $f_{y_i}$ (note that it is theoretically equivalent to enlarging $f_j, j \neq y_i$, but that is more complex to operate). Moreover, many research works point out that adding noise gives the system chances to find a 'global minimum', such as injecting noise into sigmoid [10]. We follow this inspiring idea to address the problem of early individual saturation. Therefore, our technique for slowing down the early saturation is to inject appropriate noise into the softmax input $f_{y_i}$, and the resulting noise-associated input is as follows:
$$f^{noise}_{y_i} = f_{y_i} - n \quad (3)$$

where $n = \mu + \sigma\xi$, $\xi \sim \mathcal{N}(0, 1)$, and $\mu$ and $\sigma$ are used to generate a wider family of noise from $\xi$. Intuitively, we would prefer $f^{noise}_{y_i}$ to be less than $f_{y_i}$ (because $f^{noise}_{y_i} > f_{y_i}$ would speed up saturation). Thus, we simply require the noise $n$ to always be positive, and we have the following form:

$$f^{noise}_{y_i} = f_{y_i} - \sigma|\xi| \quad (4)$$

where the noise $n = \sigma|\xi|$ follows a half-normal distribution with scale parameter $\sigma$.

Moreover, we would like to make our noise annealed by controlling the parameter $\sigma$. Considering our initial motivation, we intend to postpone the early saturation of $x_i$ rather than prevent its saturation altogether, implying that a larger noise is required initially to boost the exploration ability, and a relatively smaller noise is required later for model convergence.
In the standard softmax layer (Figure 1), $f_{y_i}$ is also the output of the fully-connected component and can be written as $f_{y_i} = W^T_{y_i}X_i + b_{y_i}$, where $W_{y_i}$ is the $y_i$-th column of $W$, $X_i$ is the input feature of this layer for training data $x_i$, and $b_{y_i}$ is the bias. Since $b_{y_i}$ is a constant and $f_{y_i}$ mostly depends on $W^T_{y_i}X_i$, we construct our annealed noise by making $\sigma$ a function of $W^T_{y_i}X_i$. In consideration of the fact that $W^T_{y_i}X_i = \|W_{y_i}\|\|X_i\|\cos\theta_{y_i}$ (as shown in [31]), where $\theta_{y_i}$ is the angle between the vectors $W_{y_i}$ and $X_i$, $\sigma$ should be a joint function of $\|W_{y_i}\|\|X_i\|$ and $\theta_{y_i}$, which hold amplitude and angular information respectively. The parameter $W_{y_i}$ followed by a loss function can be regarded as a linear classifier for class $y_i$, and this linear classifier uses cosine similarity to form an angular decision boundary. As a result, as the system converges, the angle $\theta_{y_i}$ between $W_{y_i}$ and $X_i$ will gradually decrease. Therefore, our annealed-noise-associated softmax input is formulated as:

$$f^{noise}_{y_i} = f_{y_i} - \alpha\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i})|\xi| \quad (5)$$

where $\alpha\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i}) = \sigma$, and the hyper-parameter $\alpha$ is used to adjust the scale of the noise. In our annealed noise, we leverage $\|W_{y_i}\|\|X_i\|$ to make the magnitude of the noise comparable to $f_{y_i}$, and use $(1 - \cos\theta_{y_i})$ to adaptively anneal the noise. Notably, our early-desaturation approach makes softmax saturate later rather than not at all. We experimented with various functional forms of $\sigma$ and empirically found that this surprisingly simple formulation performs best. Putting Eq. 5 into the original softmax, the Noisy Softmax loss is defined as:

$$L = -\frac{1}{N}\sum_i \log \frac{e^{f_{y_i} - \alpha\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i})|\xi|}}{\sum_{j \neq y_i} e^{f_j} + e^{f_{y_i} - \alpha\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i})|\xi|}} \quad (6)$$
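As a rough illustration of Eqs. 5 and 6, the following NumPy sketch computes the Noisy Softmax loss for one mini-batch; the array names (X, W, b, y) and the single noise draw per sample are our own assumptions, not the authors' Caffe implementation.

```python
import numpy as np

def noisy_softmax_loss(X, W, b, y, alpha, rng=np.random.default_rng(0)):
    """Minimal forward pass of Eqs. 5-6 for a batch.
    X: (N, D) features, W: (D, C) weights, b: (C,) biases, y: (N,) labels."""
    f = X @ W + b                                   # (N, C) softmax inputs
    Wy = W[:, y].T                                  # (N, D) class-wise weight vectors
    norm_prod = np.linalg.norm(Wy, axis=1) * np.linalg.norm(X, axis=1)
    cos_theta = np.sum(Wy * X, axis=1) / (norm_prod + 1e-12)
    xi = np.abs(rng.standard_normal(len(y)))        # |xi|, one draw per sample
    noise = alpha * norm_prod * (1.0 - cos_theta) * xi   # annealed noise (Eq. 5)
    f[np.arange(len(y)), y] -= noise                # f_{y_i} -> f_{y_i}^{noise}
    f -= f.max(axis=1, keepdims=True)               # numerical stability
    log_prob = f - np.log(np.exp(f).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y)), y].mean()   # Eq. 6
```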
Optimization. We use Eq. 6 throughout our experiments and optimize our model with the commonly used SGD. Thus we need to compute the forward and backward propagation, and $\cos\theta_{y_i}$ is replaced with $\frac{W^T_{y_i}X_i}{\|W_{y_i}\|\|X_i\|}$.

Figure 3. Saturation status vs. iteration with different formulations of noise. 'Normal' and 'Neg' represent normal noise and negative noise respectively. $\alpha^2$ is set to 0.1 in our experiments.

Figure 4. CIFAR100 testing error vs. iteration with different formulations of noise. 'Normal' and 'Neg' represent normal noise and negative noise respectively. $\alpha^2$ is set to 0.1 in our experiments.
For forward and backward propagation, the only difference between the Noisy Softmax loss and the standard softmax loss lies in $f_{y_i}$. For example, in forward propagation, $f_j$ for $j \neq y_i$ is computed exactly as in the original softmax, while $f_{y_i}$ is replaced with $f^{noise}_{y_i}$. In backward propagation, $\frac{\partial L}{\partial X_i} = \sum_j \frac{\partial L}{\partial f_j}\frac{\partial f_j}{\partial X_i}$ and $\frac{\partial L}{\partial W_{y_i}} = \sum_j \frac{\partial L}{\partial f_j}\frac{\partial f_j}{\partial W_{y_i}}$; only when $j = y_i$ do the computations of $\frac{\partial f_j}{\partial X_i}$ and $\frac{\partial f_j}{\partial W_{y_i}}$ differ from the original softmax, as listed below:

$$\frac{\partial f^{noise}_{y_i}}{\partial X_i} = W_{y_i} - \alpha|\xi|\left(\frac{X_i\|W_{y_i}\|}{\|X_i\|} - W_{y_i}\right) \quad (7)$$

$$\frac{\partial f^{noise}_{y_i}}{\partial W_{y_i}} = X_i - \alpha|\xi|\left(\frac{W_{y_i}\|X_i\|}{\|W_{y_i}\|} - X_i\right) \quad (8)$$

For simplicity, we leave out $\frac{\partial L}{\partial f_j}$ and $\frac{\partial f_j}{\partial X_i}, \frac{\partial f_j}{\partial W_{y_i}}$ ($j \neq y_i$), since they are the same for both Noisy Softmax and the original softmax. In short, except for the case $j = y_i$, the overall computation of Noisy Softmax is similar to that of the original softmax.
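The gradients in Eqs. 7 and 8 can be checked numerically; the sketch below (our own, with a fixed noise draw $|\xi|$ so that finite differences are meaningful) compares the analytic expressions with central differences for a single sample.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal(8)                    # feature X_i
W = rng.standard_normal(8)                    # class weight W_{y_i}
alpha, xi = 0.1, abs(rng.standard_normal())   # fixed noise draw |xi|

def f_noise(X, W):
    # f_{y_i}^{noise} = W^T X - alpha * ||W|| ||X|| (1 - cos theta) |xi|  (bias omitted)
    return W @ X - alpha * xi * (np.linalg.norm(W) * np.linalg.norm(X) - W @ X)

# Analytic gradients, Eqs. 7 and 8.
gX = W - alpha * xi * (X * np.linalg.norm(W) / np.linalg.norm(X) - W)
gW = X - alpha * xi * (W * np.linalg.norm(X) / np.linalg.norm(W) - X)

# Finite-difference check.
eps = 1e-6
numX = np.array([(f_noise(X + eps * e, W) - f_noise(X - eps * e, W)) / (2 * eps)
                 for e in np.eye(8)])
numW = np.array([(f_noise(X, W + eps * e) - f_noise(X, W - eps * e)) / (2 * eps)
                 for e in np.eye(8)])
print(np.allclose(gX, numX, atol=1e-5), np.allclose(gW, numW, atol=1e-5))
```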
5. Discussion
5.1. The Effect of Noise Scale α
In Noisy Softmax, the scale of the annealed noise is largely determined by the hyper-parameter $\alpha$. When $\alpha = 0$, i.e. in the zero-noise limit, Noisy Softmax is identical to the ordinary softmax. Individual saturation will then turn up and the SGD solver has a high chance of converging at a poor local minimum; without extra training data, the model will easily over-fit. However, when $\alpha$ is too large, large gradients are obtained since back-propagating through $f^{noise}_{y_i}$ gives rise to large derivatives, so the algorithm just sees the noise instead of the real signal and moves around blindly. Hence, a relatively small $\alpha$ is required to aid the generalization of the model.

We evaluate the performance of Noisy Softmax with different $\alpha$ on several datasets. Note that the value of $\alpha$ is not carefully tuned, and we use $\alpha = 0$ (i.e. softmax) as our baseline. These comparison results are listed in Tables 3, 4 and 5. One can observe that Noisy Softmax with a relatively appropriate $\alpha$ (e.g. $\alpha^2 = 0.1$) obtains better recognition accuracy than the ordinary softmax on all of the datasets. To be intuitive, we summarize the results on CIFAR100 in Figure 6. When $\alpha^2 = 0.1$, our method outperforms the original softmax. This demonstrates that Noisy Softmax does improve the generalization ability of the CNN by encouraging the SGD solver to be more exploratory and to converge at a 'global minimum'. When $\alpha$ rises to 1, the large noise causes the network to converge more slowly and produces worse performance than the baseline, since the large noise drowns out the helpful signal and the solver just sees the noise.
5.2. Saturation Study
To illuminate the significance of early softmax desaturation based on non-negative noise injection, we investigate the impact of different formulations of noise on individual saturation, namely normal noise $n = \sigma\xi$ and negative noise $n = -\sigma|\xi|$ ($\sigma$ is the same as in Eq. 5). From these formulations, we can expect that there will be more saturated instances when training with negative noise (which serves as a counterexample). To intuitively analyse the saturation state, we compute the average probability prediction over the entire training set as follows:

$$P = \frac{1}{N}\sum_{j=1}^{C}\sum_{i=1}^{N_j} P(y_i|x_i) \quad (9)$$

where $C$ is the number of classes and $N_j$ is the number of images within the j-th class. Figures 3 and 4 show the saturation status with different noise and the testing error rates on CIFAR100, respectively.
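A minimal sketch of how Eq. 9 can be computed from stored softmax outputs is given below; the array layout (an N×C probability matrix plus an N-vector of labels) is our own assumption.

```python
import numpy as np

def average_prediction(probs, labels):
    """Eq. 9: mean posterior probability assigned to the ground-truth class.
    probs: (N, C) softmax outputs over the training set, labels: (N,) ints."""
    return probs[np.arange(len(labels)), labels].mean()

# Example with dummy numbers: a nearly saturated model scores close to 1.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.02, 0.95, 0.03],
                  [0.10, 0.20, 0.70]])
labels = np.array([0, 1, 2])
print(average_prediction(probs, labels))   # (0.98 + 0.95 + 0.70) / 3 = 0.876...
```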
From the results in Figure 3, one can observe that when training with the original softmax or with negative noise, the average prediction rises quickly to a relatively high level, almost 0.9, implying that early individual saturation is serious, and finally goes up to nearly 1. Moreover, the average prediction with negative noise is higher than that of softmax, implying that early individual saturation is aggravated, since many instances are artificially turned into saturated ones. From the results in Figure 4, one can observe that the testing error with negative noise drops slowly and finally settles at a relatively high level, nearly 37%, verifying that the exploration of SGD is seriously impeded by early individual saturation. In the normal-noise case, the testing error and the trend of the average prediction are similar to those of the original softmax, shown in Figures 4 and 3 respectively, since the expectation $E(n)$ is close to zero.

Figure 5. CIFAR100 training error vs. iteration with different $\alpha$.

Figure 6. CIFAR100 testing error vs. iteration with different $\alpha$.
In contrast, when training with Noisy Softmax, the average prediction rises slowly and is much lower than that of the original softmax in the early training stage (Figure 3), verifying that the early individual saturation behavior is largely avoided. From the results in Figure 4, one can observe that Noisy Softmax outperforms the baseline and significantly improves the performance to a 28.48% testing error rate. Note that after 3,000 iterations our method achieves a better testing error but a lower average prediction, demonstrating that the early desaturation of softmax gives the SGD solver chances to traverse more of the parameter space in search of an optimal solution. As the noise level decreases, the solver will prefer a better local minimum where the signal gives a strong response, and it will then spend more time exploring this region and converge, in a finite number of steps, to what can be regarded as a 'global minimum'.

In summary, injecting non-negative noise $n = \sigma|\xi|$ into softmax does postpone early individual saturation and further improves the generalization ability of the CNN when optimized with standard SGD.
5.3. Annealed Noise Study
When addressing early individual saturation, the key idea is to add annealed noise. In order to highlight the superiority of our annealed noise described in Sec. 4, we compare it to free noise $n = \alpha|\xi|$ and amplitude noise $n = \alpha\|W_{y_i}\|\|X_i\||\xi|$. We evaluate them on CIFAR100 and the results are listed in Table 1. From the results, it can be observed that our Noisy Softmax outperforms the other two formulations of noise. In the free-noise case, where $\sigma$ (described in Sec. 4) is set to a fixed value $\alpha$ and the noise is totally independent of the input, although adding this noise is a desaturation operation, the accuracy gain over the baseline is small, since it desaturates softmax without regard to the magnitude of the softmax input; in other words, the remedy does not suit the case. In the amplitude-noise case, where $\sigma$ is set to $\alpha\|W_{y_i}\|\|X_i\|$, the subtractive noise is more prudent because it considers the level of the softmax input, thus yielding a better accuracy gain than free noise, yet it is still worse than Noisy Softmax. In both the Noisy Softmax and amplitude-noise cases, as the exploration goes on, the 'globally better' region of parameter space has been seen by SGD and it is time to explore this region patiently; in other words, smaller noise is required. Noisy Softmax realizes this idea by annealing the noise, whereas in the amplitude-noise case the level-unchanging noise is somewhat too large at this stage and causes difficulty in fine-grained learning. Reviewing the formulation of our annealed noise, one can observe that it is constructed by combining $\theta_{y_i}$, which acts as a time identifier, with the amplitude noise. With the time function $1 - \cos\theta_{y_i}$ injected, the noise is adaptively decreased.

α²     Noisy Softmax   free    amplitude
0      31.77           31.77   31.77
0.05   29.99           31.43   30.96
0.1    28.48           31.04   29.97
0.5    30.22           30.88   fail
1      35.23           31.20   fail

Table 1. Testing error rates (%) with different noise formulations on CIFAR100.
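The sketch below (our own toy illustration with made-up vectors) contrasts the magnitudes of the three noise formulations for one sample as the angle $\theta_{y_i}$ shrinks, which is the behaviour Table 1 probes.

```python
import numpy as np

def noise_levels(W, X, alpha, xi):
    """Standard deviations of the three formulations compared in Table 1."""
    amp = np.linalg.norm(W) * np.linalg.norm(X)
    cos_theta = (W @ X) / amp
    return {
        "free":      alpha * xi,                          # n = alpha |xi|
        "amplitude": alpha * amp * xi,                    # n = alpha ||W|| ||X|| |xi|
        "annealed":  alpha * amp * (1 - cos_theta) * xi,  # Noisy Softmax (Eq. 5)
    }

rng = np.random.default_rng(0)
X = rng.standard_normal(16)
xi = abs(rng.standard_normal())
# Simulate the angle shrinking as training converges: rotate W towards X.
for t, w_mix in enumerate([0.1, 0.5, 0.9]):
    W = w_mix * X + (1 - w_mix) * rng.standard_normal(16)
    print(t, {k: round(v, 3) for k, v in noise_levels(W, X, 0.1, xi).items()})
# Free noise is constant, amplitude noise tracks ||W|| ||X|| only, and the
# annealed noise decays as cos(theta) -> 1, which is the desired behaviour.
```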
5.4. Regularization Ability
We find experimentally that Noisy Softmax can regularize the CNN model by preventing over-fitting. Figures 5 and 6 show the recognition error results on the CIFAR100 dataset with different $\alpha$. One can observe that without noise injection (i.e. $\alpha = 0$), the training recognition error drops quickly to a very low level, almost 0%, while the testing recognition error stops falling at a relatively high level, nearly 31.77%. Conversely, when $\alpha^2$ is set to an appropriate value such as 0.1, the training error drops more slowly and stays much higher than the baseline, but the testing error reaches a lower level, nearly 28.48%, and still shows a decreasing trend. Even when $\alpha^2 = 0.5$, the training error is higher but the testing error also becomes lower, nearly 30.22%. This demonstrates that encouraging SGD to converge at a better local minimum indeed prevents over-fitting, and that Noisy Softmax has a strong regularization ability.

Figure 7. Geometric interpretation of data augmentation.

As analysed above, our Noisy Softmax can be regarded as a regularization technique that prevents over-fitting by making SGD more exploratory. Here we analyse this regularization ability from another perspective, data augmentation, which has a clear physical interpretation. In the original case, the softmax input coming from data point $(x_i, y_i)$ is $f_{y_i} = \|W_{y_i}\|\|X_i\|\cos\theta_{y_i}$ (we omit the constant $b_{y_i}$ for simplicity). Now we consider a new input $(x'_i, y_i)$, where $\|X'_i\| = \|X_i\|$ and the angle $\theta'_{y_i}$ between the vectors $W_{y_i}$ and $X'_i$ is $\arccos((1 + \alpha|\xi|)\cos\theta_{y_i} - \alpha|\xi|)$. Thus we have $f'_{y_i} = \|W_{y_i}\|\|X'_i\|\cos\theta'_{y_i} = W^T_{y_i}X_i - \alpha\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i})|\xi| = f^{noise}_{y_i}$, implying that $f^{noise}_{y_i}$ can be regarded as coming from a new data point $(x'_i, y_i)$. Notably, since $\theta'_{y_i} > \theta_{y_i}$, the generated data include many boundary examples, which are very helpful for discriminative feature learning, as illustrated in Figure 7. In summary, generating the noisy input $f^{noise}_{y_i}$ is equivalent to generating new training data, which is an efficient form of data augmentation.
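The equivalence claimed above can be checked numerically; the following sketch (our own toy example, bias omitted) recovers $f^{noise}_{y_i}$ from the enlarged angle $\theta'_{y_i}$, clipping the cosine only as a numerical safeguard against very large noise.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal(10)
W = rng.standard_normal(10)
alpha, xi = 0.1, abs(rng.standard_normal())

nW, nX = np.linalg.norm(W), np.linalg.norm(X)
cos_theta = (W @ X) / (nW * nX)

# Noisy input as defined in Eq. 5 (bias omitted).
f_noise = W @ X - alpha * nW * nX * (1 - cos_theta) * xi

# Angle of the equivalent "virtual" data point x'_i with ||X'|| = ||X||.
cos_theta_new = np.clip((1 + alpha * xi) * cos_theta - alpha * xi, -1.0, 1.0)
theta_new = np.arccos(cos_theta_new)                        # theta' >= theta
f_from_new_point = nW * nX * np.cos(theta_new)

print(np.isclose(f_noise, f_from_new_point))                # True: same softmax input
print(theta_new >= np.arccos(np.clip(cos_theta, -1, 1)))    # angle is enlarged
```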
To verify the discussion above, we evaluate Noisy Softmax on two subsets of the MNIST dataset, which contain only 600 (1%) and 6,000 (10%) training instances respectively. Our CNN configuration is shown in Table 2. With the same training strategy as in Section 6.1, we achieve 3.82% and 1.30% testing error rates on the original testing set, respectively. Meanwhile, in both cases the training error rates quickly drop to nearly 0%, showing that over-fitting occurs. However, when training with Noisy Softmax ($\alpha^2 = 0.5$), we obtain 2.46% and 0.93% testing error rates, respectively. This demonstrates that Noisy Softmax improves the generalization ability of the CNN through implicit data augmentation. Judging from the accuracy improvements on these two subsets and on CIFAR100 (which has only 500 instances per class), our method is especially effective when the amount of training data is limited.
5.5. Relationship to Other Methods
Multi-task learning: combining several tasks in one system does improve generalization ability [37]. Considering a multi-task learning system with input data $x_i$, the overall objective loss function is a combination of several sub-objective loss functions, written as $L = \sum_j L_j(\vartheta_0, \vartheta_j, x_i)$, where $\vartheta_0$ and $\vartheta_j, j \in [1, 2, \cdots]$ are the generic parameters and the task-specific parameters respectively. When optimizing with standard SGD, the generic parameters $\vartheta_0$ are updated as $\vartheta_0 = \vartheta_0 - \gamma(\sum_j \frac{\partial L_j}{\partial \vartheta_0})$, where $\gamma$ is the learning rate. In Noisy Softmax, from an overall training perspective, our loss function can also be regarded as a combination of many noise-dependent changing losses $L^{noise_k} = -\log \frac{e^{f_{y_i} - \alpha|\xi|_k\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i})}}{\sum_{c \neq y_i} e^{f_c} + e^{f_{y_i} - \alpha|\xi|_k\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i})}}$, $k \in [1, m]$, i.e. $L = \sum_{k=1}^{m} L^{noise_k}$, where $m$ is an uncertain number related to the noise scale and the iteration number. Thus, the overall contribution to the system can be regarded as $\vartheta = \vartheta - \gamma(\sum_{k=1}^{m} \frac{\partial L^{noise_k}}{\partial \vartheta})$. So our method can be regarded as a special case of multi-task learning, where the task-specific parameters are shared across tasks.

However, in a multi-task learning system, hand-designing task-specific losses is prohibitively expensive and the number of tasks is limited and small. In the Noisy Softmax training procedure, by contrast, the model is constrained by many randomly generated tasks (quantified by $L^{noise_k}$). Thus, training a model with Noisy Softmax can be regarded as training with a massive number of tasks that would be very costly and often infeasible to design in an original multi-task learning system.
Noise injection: some research works inject noise into earlier layers of neural networks, such as into the neuron activation functions ReLU [4, 1] and sigmoid [10], into weights [9, 3] and into gradients [32]. We emphasize that Noisy Softmax adds noise to a single loss layer instead of to many earlier layers, which is more convenient and efficient for implementation and model training, and is applied to DCNNs. Intrinsically different from DisturbLabel [51], where the noise is produced by disturbing labels and also acts on the loss layer, Noisy Softmax starts from the clear objective of early softmax desaturation, and its noise is adaptively annealed and injected in an explicit manner.

Desaturation: many other desaturation works, such as replacing sigmoid with ReLU [6] and building skip connections between layers [41, 11, 13, 14], solve the gradient-vanishing problems that occur in the bottom layers, whereas our Noisy Softmax solves the problem of early gradient vanishing in the top layer (i.e. the loss layer), which is caused by early individual saturation. We emphasize that solving early gradient vanishing in the top layer is crucial for parameter updates and model optimization, since the top layer is the source of gradient propagation. In summary, by postponing the early individual saturation we obtain continuous gradient propagation from the top layer and further encourage SGD to be more exploratory.
6. Experiments and Results
We evaluate our proposed Noisy Softmax algorithm on
several benchmark datasets, including MNIST [25], CI-
FAR10/100 [21], LFW [16], FGLFW [54] and YTF [49].
Note that, in all of our experiments, we only use a single model for the evaluation of Noisy Softmax, and both softmax and Noisy Softmax use the same CNN architectures, shown in Table 2.

Layer            MNIST (for Sec. 5.4)  MNIST        CIFAR10/10+   CIFAR100      LFW/FGLFW/YTF
Block1           [3x3,40]x2            [3x3,64]x3   [3x3,64]x4    [3x3,96]x4    [3x3,64]x1
Pool1            Max [2x2], stride 2
Block2           [3x3,60]x1            [3x3,64]x3   [3x3,128]x4   [3x3,192]x4   [3x3,128]x1
Pool2            Max [2x2], stride 2
Block3           [3x3,60]x1            [3x3,64]x3   [3x3,256]x4   [3x3,384]x4   [3x3,256]x2
Pool3            Max [2x2], stride 2
Block4           -                     -            -             -             [3x3,512]x3, padding 0
Fully Connected  100                   256          512           512           3000

Table 2. CNN architectures for the different benchmark datasets. Blockx denotes a container of several convolution components with the same configuration, e.g. [3x3, 64]x4 denotes 4 cascaded convolution layers with 64 filters of size 3x3.
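The original models are implemented in Caffe; as a reading aid only, the following PyTorch sketch instantiates the first column of Table 2 (the small MNIST network of Sec. 5.4) with the PReLU activations and batch normalization described in Sec. 6.1. The final 10-way classifier that follows the 100-d feature is not shown.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, num_layers):
    """num_layers cascaded 3x3 convolutions (stride 1, padding 1),
    each followed by batch normalization and PReLU, as in Sec. 6.1."""
    layers = []
    for i in range(num_layers):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, stride=1, padding=1),
                   nn.BatchNorm2d(out_ch),
                   nn.PReLU(out_ch)]
    return nn.Sequential(*layers)

# First column of Table 2: [3x3,40]x2 - pool - [3x3,60]x1 - pool - [3x3,60]x1 - pool - FC 100.
small_mnist_net = nn.Sequential(
    conv_block(1, 40, 2), nn.MaxPool2d(2, stride=2),   # 28x28 -> 14x14
    conv_block(40, 60, 1), nn.MaxPool2d(2, stride=2),  # 14x14 -> 7x7
    conv_block(60, 60, 1), nn.MaxPool2d(2, stride=2),  # 7x7 -> 3x3
    nn.Flatten(),
    nn.Linear(60 * 3 * 3, 100),                        # 100-d fully connected feature
)
```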
6.1. Architecture Settings and Implementation
Since VGG [39] became a commonly used CNN architecture, cascades of layers with small filters have gradually taken the place of single layers with large filters, because the cascaded layers have fewer parameters, lower computational complexity and stronger representation ability than a single layer; e.g. a single 5x5 convolution layer is replaced with 2 cascaded 3x3 convolution layers. Inspired by [39, 31], we design the architectures shown in Table 2. In the convolution layers, both the stride and the padding are set to 1 if not specified. In the pooling layers, we use 2x2 max-pooling with stride 2. We adopt the piecewise-linear function PReLU [12] as our neuron activation function, and use the weight initialization of [12] and batch normalization [18] in our networks. All of our experiments are implemented with the Caffe library [47] with our own modifications. We use standard SGD to optimize our CNNs, and the batch sizes are 256 and 200 for the object experiments and the face experiments, respectively. For data preprocessing, we only perform mean subtraction.
Training. For the object recognition tasks, the initial learning rate is 0.1 and is divided by 10 at 12k iterations; the total number of iterations is 16k. Note that, although we train our CNNs with a coarsely adjusted learning rate, the results of all experiments are consistent and impressive, verifying the effectiveness of our method. For the face recognition tasks, we start with a learning rate of 0.01 and divide it by 5 whenever the training loss stops dropping.
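A minimal sketch of the step schedule for the object-recognition runs is given below; the helper name is hypothetical.

```python
def object_task_lr(iteration, base_lr=0.1):
    """Learning-rate schedule for the object experiments:
    start at 0.1 and divide by 10 at 12k iterations (16k total)."""
    return base_lr if iteration < 12000 else base_lr / 10.0

assert object_task_lr(0) == 0.1 and object_task_lr(15000) == 0.01
```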
Testing. We use the original softmax to classify the testing data in the object datasets. On the face datasets, we evaluate face recognition with the cosine-distance rule after PCA reduction.
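As a sketch of this verification-time scoring (our own helper names and threshold, not the authors' evaluation code), PCA is fitted on gallery features and pairs are compared by cosine similarity:

```python
import numpy as np

def pca_fit(features, dim):
    """Fit a PCA projection (mean + top `dim` principal directions) on gallery features."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:dim].T                      # (D,), (D, dim)

def cosine_score(f1, f2, mean, proj):
    a = (f1 - mean) @ proj
    b = (f2 - mean) @ proj
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Usage: verify a face pair by thresholding the cosine score.
rng = np.random.default_rng(0)
gallery = rng.standard_normal((500, 1500))       # e.g. 1,500-d deep features
mean, proj = pca_fit(gallery, dim=128)
same = cosine_score(gallery[0], gallery[0], mean, proj) > 0.5   # trivially True
```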
6.2. Evaluation on MNIST Dataset
MNIST [25] contains 60,000 training samples and 10,000 testing samples, uniformly distributed over 10 classes; all samples are 28x28 gray-scale images. Our CNN architecture is shown in Table 2 and we use a weight decay of 0.001. The results of the state-of-the-art methods and of our proposed Noisy Softmax with different $\alpha$ are listed in Table 3.

From the results, our Noisy Softmax ($\alpha^2 = 0.1$) not only outperforms the original softmax with the same architecture, but also achieves competitive performance compared to the state-of-the-art methods. It can also be observed that Noisy Softmax produces consistent accuracy gains with coarsely adjusted $\alpha^2$, such as 0.05, 0.1 and 0.5, and that our method matches the accuracy of DisturbLabel [51], which additionally applies dropout to several layers, demonstrating the effectiveness of our technique.
6.3. Evaluation on CIFAR Datasets
CIFAR [21] has two evaluation protocols, over 10 and 100 classes respectively. CIFAR10/100 contains 50,000 training samples and 10,000 testing samples; all samples are 32x32 RGB images, uniformly distributed over the 10 or 100 classes. We use different CNN architectures in the CIFAR10 and CIFAR100 experiments; the network configurations are shown in Table 2.

We evaluate our method on CIFAR10 and CIFAR100, with the results shown in Table 4. For data augmentation, we use a simple scheme: randomly cropping a 30x30 patch. From our experimental results, one can observe that Noisy Softmax ($\alpha^2 = 0.1$) outperforms all of the other methods on these two datasets, improving accuracy over the baseline by nearly 1% on CIFAR10 and by more than 3% on CIFAR100.
6.4. Evaluation on Face Datasets
LFW [16] contains 13,233 images of 5,749 celebrities. Under the unrestricted setting, it provides 6,000 face pairs for the verification protocol, as well as the closed-set and open-set identification protocols adopted in [2].

FGLFW [54] is a derivative of LFW: the images all come from LFW, but the face pairs are difficult to classify as belonging to the same person or not. It is a lightweight yet challenging dataset for performance evaluation, thanks to its simple verification protocol but hard face pairs.
YTF [49] provides 5,000 video pairs for face verification. We use the average representation of 100 randomly selected samples from each video for evaluation.

Method                       MNIST
CNN [19]                     0.53
NiN [29]                     0.47
Maxout [7]                   0.45
DSN [27]                     0.39
R-CNN [28]                   0.31
GenPool [26]                 0.31
DisturbLabel [51]            0.33
Softmax                      0.43
Noisy Softmax (α² = 1)       0.42
Noisy Softmax (α² = 0.5)     0.33
Noisy Softmax (α² = 0.1)     0.33
Noisy Softmax (α² = 0.05)    0.37

Table 3. Recognition error rates (%) on MNIST.

Method                       CIFAR10   CIFAR10+   CIFAR100
NiN [29]                     10.47     8.81       35.68
Maxout [7]                   11.68     9.38       38.57
DSN [27]                     9.69      7.97       34.57
All-CNN [40]                 9.08      7.25       33.71
R-CNN [28]                   8.69      7.09       31.75
ResNet [13]                  N/A       6.43       N/A
DisturbLabel [51]            9.45      6.98       32.99
Softmax                      8.11      6.98       31.77
Noisy Softmax (α² = 1)       9.09      8.77       35.23
Noisy Softmax (α² = 0.5)     7.84      7.13       30.22
Noisy Softmax (α² = 0.1)     7.39      6.36       28.48
Noisy Softmax (α² = 0.05)    7.58      6.61       29.99

Table 4. Recognition error rates (%) on the CIFAR datasets. + denotes data augmentation.

Method                       Images     Models   LFW     Rank-1   DIR@FAR=1%   FGLFW   YTF
FaceNet [36]                 200M*      1        99.65   -        -            -       95.18
DeepID2+ [44]                300k*      1        98.7    -        -            -       91.90
DeepID2+ [44]                300k*      25       99.47   95.00    80.70        -       93.20
Sparse [45]                  300k*      1        99.30   -        -            -       92.70
VGG [33]                     2.6M       1        97.27   74.10    52.01        88.13   92.80
WebFace [52]                 WebFace    1        97.73   -        -            -       90.60
Robust FR [5]                WebFace    1        98.43   -        -            -       -
Lightened CNN [50]           WebFace    1        98.13   89.21    69.46        91.22   91.60
Softmax                      WebFace+   1        98.83   91.68    69.51        92.95   94.22
Noisy Softmax (α² = 0.1)     WebFace+   1        99.18   92.68    78.43        94.50   94.88
Noisy Softmax (α² = 0.05)    WebFace+   1        99.02   92.24    75.67        94.02   94.51

Table 5. Recognition accuracies (%) on the LFW, FGLFW and YTF datasets. * denotes images that are not publicly available and + denotes data expansion. On LFW, closed-set and open-set accuracies are evaluated by Rank-1 and DIR@FAR=1% respectively.
For data preprocessing, we align and crop the images based on the eye and mouth centers, yielding 104×96 RGB images. Our CNN configuration is shown in Table 2; here we add an element-wise maxout layer [50] after the 3,000-dimensional fully connected layer, yielding a 1,500-dimensional output, and a contrastive loss is applied to this output as in DeepID2 [43]. We then train a single CNN model with outside data from the publicly available CASIA-WebFace dataset [52] and our own collected data (about 400k images from 14k identities). We extract the features of each image and of its horizontally flipped counterpart and compute their mean as the final representation. From the results shown in Table 5, one can observe that Noisy Softmax ($\alpha^2 = 0.1$) improves the performance over the baseline, and that the result is also comparable to the current state-of-the-art methods, which use private data and even model ensembles. In addition, we further improve our results to 99.31%, 94.43%, 82.50%, 94.88% and 95.37% (listed in the same protocol order as in Table 5) with a two-model ensemble.
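The flip-and-average representation described above can be sketched as follows; extract_features stands in for the trained CNN's forward pass and is hypothetical.

```python
import numpy as np

def face_representation(image, extract_features):
    """Average the deep features of an image and of its horizontal flip,
    as used in the face experiments. `extract_features` is a stand-in
    for the trained CNN's forward pass."""
    flipped = image[:, ::-1, :]                 # flip the width axis of an HxWxC image
    feats = extract_features(image) + extract_features(flipped)
    return feats / 2.0

# Usage with a dummy feature extractor on a 104x96 RGB crop.
dummy_extractor = lambda img: np.asarray([img.mean(), img.std()])
rep = face_representation(np.random.rand(104, 96, 3), dummy_extractor)
```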
7. Conclusion
In this paper, we propose Noisy Softmax to address early individual saturation by injecting annealed noise into the softmax input; it desaturates softmax early by postponing the early individual saturation. We show that our method can be easily used as a drop-in replacement for the standard softmax and is easy to optimize. It significantly improves the performance of CNN models, since the early desaturation indeed strengthens the parameter updates during back-propagation and furthermore improves the generalization ability of DCNNs. Empirical studies verify the superiority of softmax desaturation. Meanwhile, our method achieves state-of-the-art or competitive results on several datasets.
8. Acknowledgments
This work was partially supported by the National
Natural Science Foundation of China (Project 61573068,
61471048, 61375031 and 61532006), Beijing Nova Pro-
gram under Grant No. Z161100004916088, the Fundamen-
tal Research Funds for the Central Universities under Grant
No. 2014ZD03-01, and the Program for New Century Excellent Talents in University (NCET-13-0683).
References
[1] Y. Bengio. Estimating or propagating gradients through
stochastic neurons. Computer Science, 2013.
[2] L. Best-Rowden, H. Han, C. Otto, B. F. Klare, and A. K.
Jain. Unconstrained face recognition: Identifying a person
of interest from a media collection. IEEE Transactions on
Information Forensics and Security, 9(12):2144–2157, 2014.
[3] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra.
Weight uncertainty in neural networks. Computer Science,
2015.
[4] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, pages 807–814, 2010.
[5] C. Ding and D. Tao. Robust face recognition via multimodal
deep face representation for multimedia applications. IEEE
Transactions on Multimedia, 17(11):2049–2058, 2015.
[6] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier
neural networks. Journal of Machine Learning Research, 15,
2010.
[7] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville,
and Y. Bengio. Maxout networks. Computer Science, pages
1319–1327, 2013.
[8] B. Graham. Fractional max-pooling. Eprint Arxiv, 2014.
[9] A. Graves. Practical variational inference for neural net-
works. Advances in Neural Information Processing Systems,
pages 2348–2356, 2011.
[10] C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio. Noisy
activation functions. 2016.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. Computer Science, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into
rectifiers: Surpassing human-level performance on imagenet
classification. pages 1026–1034, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in
deep residual networks. 2016.
[14] G. Huang, Z. Liu, and K. Weinberger. Densely connected
convolutional networks. 2016.
[15] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep
networks with stochastic depth. 2016.
[16] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. 2008.
[17] S. Huang, Z. Xu, D. Tao, and Y. Zhang. Part-stacked cnn for
fine-grained visual categorization. Computer Science, 2015.
[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
Computer Science, 2015.
[19] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. Lecun. What
is the best multi-stage architecture for object recognition?
pages 2146–2153, 2009.
[20] J. Krause, H. Jin, J. Yang, and F. F. Li. Fine-grained recogni-
tion without part annotations. In IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 5546–5555,
2015.
[21] A. Krizhevsky. Learning multiple layers of features from
tiny images. 2012.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Im-
agenet classification with deep convolutional neural net-
works. Advances in Neural Information Processing Systems,
25(2):2012, 2012.
[23] D. Laptev and J. M. Buhmann. Transformation-invariant
convolutional jungles. In IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 3043–3051, 2015.
[24] D. Laptev, N. Savinov, J. M. Buhmann, and M. Pollefeys. Ti-
pooling: transformation-invariant pooling for feature learn-
ing in convolutional neural networks. 2016.
[25] Y. Lecun and C. Cortes. The mnist database of handwritten
digits.
[26] C. Y. Lee, P. W. Gallagher, and Z. Tu. Generalizing pooling
functions in convolutional neural networks: Mixed, gated,
and tree. Computer Science, 2015.
[27] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-
supervised nets. In AISTATS, volume 2, page 6, 2015.
[28] M. Liang and X. Hu. Recurrent convolutional neural network
for object recognition. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, pages
3367–3375, 2015.
[29] M. Lin, Q. Chen, and S. Yan. Network in network. Computer
Science, 2013.
[30] T. Y. Lin, A. Roychowdhury, and S. Maji. Bilinear cnn mod-
els for fine-grained visual recognition. In IEEE International
Conference on Computer Vision, pages 1449–1457, 2015.
[31] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin soft-
max loss for convolutional neural networks. In International
Conference on International Conference on Machine Learn-
ing, pages 507–516, 2016.
[32] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser,
K. Kurach, and J. Martens. Adding gradient noise improves
learning for very deep networks. Computer Science, 2015.
[33] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face
recognition. In British Machine Vision Conference, 2015.
[34] R. Reed, R. J. Marks, and S. Oh. An equivalence between
sigmoidal gain scaling and training with noisy (jittered) in-
put data. In Neuroinformatics and Neurocomputers, 1992.,
RNNS/IEEE Symposium on, pages 120 – 127, 1992.
[35] R. Reed, S. Oh, and R. J. Marks. Regularization using jit-
tered training data. In International Joint Conference on
Neural Networks, pages 147–152 vol.3, 1992.
[36] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A uni-
fied embedding for face recognition and clustering. pages
815–823, 2015.
[37] M. L. Seltzer and J. Droppo. Multi-task learning in deep
neural networks for improved phoneme recognition. pages
6965–6969, 2013.
[38] W. Shang, K. Sohn, D. Almeida, and H. Lee. Understanding
and improving convolutional neural networks via concate-
nated rectified linear units. 2016.
[39] K. Simonyan and A. Zisserman. Very deep convolutional
networks for large-scale image recognition. Computer Sci-
ence, 2015.
[40] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Ried-
miller. Striving for simplicity: The all convolutional net.
Eprint Arxiv, 2014.
[41] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway
networks. Computer Science, 2015.
[42] M. Steijvers and P. Grunwald. A recurrent network that per-
forms a context-sensitive prediction task. In Conference of
the Cognitive Science, 2000.
[43] Y. Sun, X. Wang, and X. Tang. Deep learning face represen-
tation by joint identification-verification. Advances in Neural
Information Processing Systems, 27:1988–1996, 2014.
[44] Y. Sun, X. Wang, and X. Tang. Deeply learned face represen-
tations are sparse, selective, and robust. Computer Science,
pages 2892–2900, 2014.
[45] Y. Sun, X. Wang, and X. Tang. Sparsifying neural network
connections for face recognition. Computer Science, 2015.
[46] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.
Going deeper with convolutions. pages 1–9, 2015.
[47] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678, 2014.
[48] X. S. Wei, C. W. Xie, and J. Wu. Mask-cnn: Localizing parts
and selecting descriptors for fine-grained image recognition.
2016.
[49] L. Wolf, T. Hassner, and I. Maoz. Face recognition in
unconstrained videos with matched background similarity.
42(7):529–534, 2011.
[50] X. Wu, R. He, and Z. Sun. A lightened cnn for deep face
representation. Computer Science, 2015.
[51] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. Disturblabel:
Regularizing cnn on the loss layer. 2016.
[52] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face represen-
tation from scratch. Computer Science, 2014.
[53] M. D. Zeiler and R. Fergus. Stochastic pooling for regu-
larization of deep convolutional neural networks. Computer
Science, 2013.
[54] N. Zhang and W. Deng. Fine-grained lfw database. In 2016
International Conference on Biometrics (ICB), 2016.