Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing
the Early Softmax Saturation
Binghui Chen1, Weihong Deng1, Junping Du2
1School of Information and Communication Engineering, Beijing University of Posts and Telecommunications,
2School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China.
chenbinghui@bupt.edu.cn, whdeng@bupt.edu.cn, junpingd@bupt.edu.cn
Abstract
Over the past few years, softmax and SGD have become a commonly used component and the default training strategy in CNN frameworks, respectively. However, when optimizing CNNs with SGD, the saturation behavior behind softmax always gives us an illusion of training well and is then overlooked. In this paper, we first emphasize that the early saturation behavior of softmax will impede the exploration of SGD, which is sometimes a reason for the model converging at a bad local minimum, and then propose Noisy Softmax to mitigate this early saturation issue by injecting annealed noise into softmax during each iteration. This noise-injection operation aims at postponing the early saturation and thereby providing continuous gradient propagation, so as to encourage the SGD solver to be more exploratory and help it find a better local minimum. This paper empirically verifies the superiority of early softmax desaturation, and our method indeed improves the generalization ability of the CNN model through regularization. We experimentally find that this early desaturation helps optimization in many tasks, yielding state-of-the-art or competitive results on several popular benchmark datasets.
1. Introduction
Recently, deep convolutional neural networks (DCNNs)
have taken the computer vision field by storm, significantly
improving the state-of-the-art performances in many visual
tasks, such as face recognition [43, 44, 33, 36], large-scale
image classification [22, 39, 46, 11, 13], and fine-grained
object classification [30, 17, 20, 48]. Meanwhile, the softmax layer and the training strategy of SGD together with back-propagation (BP) have become default components that are applied in most of the aforementioned works.
Figure 1. Decomposition of the typical softmax layer in a DCNN. It can be rewritten into three parts: a fully connected component, the softmax activation and the cross-entropy loss.

It is widely observed that, when optimizing with SGD and BP, smooth and unimpeded gradient propagation is crucial for improving the training of DCNNs. For example, replacing the sigmoid activation function with piecewise-linear activation functions such as ReLU and PReLU [12] handles the gradient-vanishing problem caused by sigmoid saturation and allows the training of much deeper networks.
Interestingly, the softmax activation function (illustrated in Figure 1) behaves much like the sigmoid function due to their similar formulations (shown in Sec. 3), and it also saturates when its input is large. However, many take the softmax activation for granted, and the problem behind its saturation behavior is overlooked because of the illusion created by the performance improvements of DCNNs.

In standard SGD, the saturation behavior of softmax appears when its output is very close to the ground truth, which is certainly the goal of model training. However, it can become a barrier to improving the generalization ability of CNNs, especially when it shows up early (inopportunely). Concretely, an input instance stops contributing gradients to BP once its softmax output is prematurely saturated, yielding short-lived gradient propagation that is not enough for robust learning. In this case, the learning process with SGD and BP can hardly explore further due to poor gradient propagation and parameter updates. We define this saturation behavior as individual saturation and the corresponding instance as a saturated one. As training proceeds, the number of non-saturated contributing training samples gradually decreases and robust learning of the network is impeded. This is sometimes a reason for the algorithm converging at a bad local minimum¹ and having difficulty escaping it. Furthermore, the problem of over-fitting arises. To this end, we need to give SGD chances to explore more of the parameter space, and early individual saturation is undesired.

¹For simplicity, we use local or 'global' minimum to denote a neighbouring region rather than a single point.
In this paper, we propose Noisy Softmax, a novel technique of early softmax desaturation, to address the aforementioned issue. This is mainly achieved by injecting annealed noise directly into the softmax activations during each iteration. In other words, Noisy Softmax allows SGD to escape from a bad local minimum and explore more by postponing early individual saturation. Furthermore, it improves the generalization ability of the system by reducing over-fitting as a direct consequence of this increased exploration. The main contributions of this work are summarized as follows:

• We provide an insight into softmax saturation, interpreted as individual saturation: early individual saturation produces short-lived gradient propagation, which hampers the robust exploration of SGD and further causes over-fitting unintentionally.

• We propose Noisy Softmax, which aims at producing rich and continuous gradient propagation by injecting annealed noise into the softmax activations. It allows the 'global' convergence of the SGD solver and aids generalization by reducing over-fitting. To our knowledge, it is the first attempt to address the early saturation behavior of softmax by adding noise.

• Noisy Softmax can easily be used as a drop-in replacement for standard softmax and optimized with standard SGD. It can also be combined with other performance-improving techniques, such as novel neural activation functions and network architectures.

• Extensive experiments have been performed on several datasets, including MNIST [25], CIFAR10/100 [21], LFW [16], FGLFW [54] and YTF [49]. The impressive results demonstrate the effectiveness of Noisy Softmax.
2. Related Work
Many promising techniques have been developed, such as novel network structures [29, 13, 41], non-linear activation functions [12, 7, 6, 38], pooling strategies [11, 8, 53] and objective loss functions [43, 36], etc.

These approaches are mostly optimized with SGD and back-propagation. In standard SGD, the chain rule is used to compute and propagate the gradients. Thus, any saturation of neuron units or layer components² is undesired, because training a deeper framework relies on a smooth and unimpeded flow of gradient information. An early solution was to replace the sigmoid function with non-linear piecewise activation functions [22]; this is neuron desaturation. Skip connections between different layers exponentially expand the paths of propagation [41, 11, 13, 15, 14]; these belong to layer desaturation, since the forward and backward information can be propagated directly from one layer to any other layer without gradient vanishing. In contrast, since only the early saturation behavior is harmful rather than saturation in general, we focus on the early desaturation of softmax, which has not been investigated before, and we achieve it by explicitly injecting noise into the softmax activations.

²The saturation of a layer refers to gradient vanishing at a certain layer during back-propagation.
There are some other works related to noise injection. Adding noise to ReLU has been used to encourage components to explore more in Boltzmann machines and feed-forward networks [4, 1]. Adding noise to sigmoid makes it possible to train with a much wider family of activation functions than before [10]. Adding weight noise [42], adaptive weight noise [9, 3] and gradient noise [32] also improves learning. Adding annealed noise can help the solver escape from a bad local minimum and find a better one. We follow these inspiring ideas to address individual saturation and encourage SGD to explore more. The main differences are that we apply noise injection to CNNs and impose the noise on the loss layer instead of on previous layers. Unlike adding noise on the loss layer in DisturbLabel [51], a method that, counter-intuitively, disturbs labels yet indeed improves model performance, our work has the clear objective of delaying early softmax saturation by explicitly injecting noise into the softmax activation.

Another way of injecting noise is to randomly transform the input data, which is commonly referred to as data augmentation, such as randomly cropping, flipping [22, 50], rotating [24, 23] and jittering the input data [34, 35]. Our work can also be interpreted as a form of data augmentation, which will be covered in the discussion below.
3. Early Individual Saturation
In this section, we give a toy example to describe the early individual saturation of softmax, which is usually overlooked, and analyse its impact on generalization. Define the i-th input data x_i with its corresponding label y_i, y_i ∈ [1 · · · C]. Processing the training images with a standard DCNN, we obtain the cross-entropy loss and its partial derivative as follows:
L = -\frac{1}{N}\sum_i \log P(y_i|x_i) = -\frac{1}{N}\sum_i \log \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}    (1)

\frac{\partial L}{\partial f_j} = P(y_i = j|x_i) - 1\{y_i = j\} = \frac{e^{f_j}}{\sum_k e^{f_k}} - 1\{y_i = j\}    (2)

where f_j refers to the j-th element of the softmax input vector f, j ∈ [1 · · · C], and N is the number of training images. 1\{condition\} = 1 if the condition is satisfied and 1\{condition\} = 0 otherwise.
To simplify our analysis, we consider the problem of binary classification³, where y_i ∈ [1, 2]. Under the binary scenario, we plot the softmax activation for class 1 in Figure 2. Intuitively, the softmax activation looks exactly like the sigmoid function. The standard softmax encourages f_1 > f_2 in order to classify class 1 correctly, and its job can be regarded as done when its output P(y_i = 1|x_i) = \frac{1}{1+e^{-(f_1-f_2)}} is very close to 1. In this case, the softmax output of data x_i is saturated, and we define this as individual saturation. Of course, making the softmax output close to 1 is the ultimate goal of CNN training. However, we would like to achieve it at the end of the SGD exploration, not at the beginning or middle stage. When optimizing a CNN with gradient-based methods such as SGD, a prematurely saturated individual stops contributing gradients to back-propagation early because its gradients are negligible, i.e. P(y_i = 1|x_i) ≈ 1 implies \frac{\partial L}{\partial f_{y_i}} ≈ 0 (see Eq. 2). As the number of saturated individuals rises, the amount of contributing data decreases; SGD has fewer chances to move around and is more likely to converge at a local minimum, so the model easily over-fits and extra data would be required to recover. In short, early-saturated samples produce short-lived gradient propagation that is not enough to help the system converge at a 'global minimum' (i.e. a better local minimum), so early individual saturation is undesired.

³Multi-class classification complicates our analysis but has the same mechanism as the binary scenario.
Figure 2. Softmax activation function: \frac{1}{1+e^{-(f_1-f_2)}}. The x-axis represents the difference between f_1 and f_2; the y-axis is the posterior probability.
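The link between individual saturation and vanishing gradients in the binary case can be reproduced with a few lines of NumPy. This is an illustrative sketch of Eq. 2 under the binary setting above, not part of the authors' released code; the function name is ours.

```python
import numpy as np

def binary_softmax_grad(f1, f2):
    """Posterior for class 1 and the gradient dL/df1 for a class-1 sample.
    With cross-entropy, dL/df_{y_i} = P(y_i|x_i) - 1 (Eq. 2), so the gradient
    magnitude shrinks as the posterior saturates towards 1."""
    p1 = 1.0 / (1.0 + np.exp(-(f1 - f2)))  # softmax posterior = sigmoid of the logit gap
    return p1, p1 - 1.0

for gap in [0.5, 2.0, 5.0, 10.0]:
    p1, g = binary_softmax_grad(gap, 0.0)
    print(f"f1-f2={gap:5.1f}  P(y=1|x)={p1:.4f}  dL/df1={g:.2e}")
# As f1-f2 grows, the posterior saturates and the gradient vanishes, so the
# sample stops contributing to back-propagation (individual saturation).
```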
4. Noisy Softmax
Based on the analysis in Section 3, the short-lived gradient propagation caused by early individual saturation does not guide robust learning. The intuitive solution is thus to set up a 'barrier' on the way to saturation, so as to postpone the early saturation behavior and produce rich and continuous gradient propagation. Particularly, for a training data point (x_i, y_i), a simple way to achieve this is to artificially reduce its softmax input f_{y_i} (it is theoretically equivalent to enlarge f_j, ∀j ≠ y_i, but that is more complex to operate). Moreover, many research works point out that adding noise gives the system a chance to find a 'global minimum', e.g. injecting noise into sigmoid [10]. We follow this inspiring idea to address the problem of early individual saturation. Our technique for slowing down the early saturation is therefore to inject appropriate noise into the softmax input f_{y_i}, and the resulting noise-associated input is:

f^{noise}_{y_i} = f_{y_i} - n    (3)
where n = μ + σξ, ξ ∼ N(0, 1), and μ and σ are used to generate a wider family of noise from ξ. Intuitively, we prefer f^{noise}_{y_i} to be less than f_{y_i} (because f^{noise}_{y_i} > f_{y_i} would speed up saturation). Thus, we simply require the noise n to always be positive, which gives the following form:

f^{noise}_{y_i} = f_{y_i} - σ|ξ|    (4)

where the noise n = σ|ξ| is built from the zero-mean, unit-variance Gaussian ξ and its scale is controlled by σ.
Moreover, we would like to make our noise annealed by controlling the parameter σ. Following our initial motivation, we intend to postpone the early saturation of x_i rather than prevent its saturation altogether, implying that a larger noise is required initially to boost the exploration ability, while a relatively smaller noise is required later for model convergence.

In the standard softmax layer (Figure 1), f_{y_i} is also the output of the fully connected component and can be written as f_{y_i} = W_{y_i}^T X_i + b_{y_i}, where W_{y_i} is the y_i-th column of W, X_i is the input feature of this layer computed from the training data x_i, and b_{y_i} is the bias. Since b_{y_i} is a constant and f_{y_i} mostly depends on W_{y_i}^T X_i, we construct our annealed noise by relating σ to W_{y_i}^T X_i. Considering that W_{y_i}^T X_i = \|W_{y_i}\|\|X_i\|\cos\theta_{y_i} (as shown in [31]), where θ_{y_i} is the angle between the vectors W_{y_i} and X_i, σ should be a joint function of \|W_{y_i}\|\|X_i\| and θ_{y_i}, which carry amplitude and angular information respectively. The parameter W_{y_i} followed by a loss function can be regarded as a linear classifier for class y_i, and this linear classifier uses cosine similarity to form an angular decision boundary. As a result, as the system converges, the angle θ_{y_i} between W_{y_i} and X_i will gradually decrease. Therefore, our annealed-noise-associated softmax input is formulated as:

f^{noise}_{y_i} = f_{y_i} - α\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i})|ξ|    (5)

where α\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i}) = σ, and the hyper-parameter α is used to adjust the scale of the noise. In our annealed noise, we leverage \|W_{y_i}\|\|X_i\| to make the magnitudes of the noise and of f_{y_i} comparable, and use (1 - \cos\theta_{y_i}) to anneal the noise adaptively. Notably, our early desaturation makes softmax saturate later, not never. We experimented with various function types for σ and empirically found that this surprisingly simple formulation performs better. Putting Eq. 5 into the original softmax, the Noisy Softmax loss is defined as:
L = -\frac{1}{N}\sum_i \log \frac{e^{f_{y_i} - α\|W_{y_i}\|\|X_i\|(1-\cos\theta_{y_i})|ξ|}}{\sum_{j \neq y_i} e^{f_j} + e^{f_{y_i} - α\|W_{y_i}\|\|X_i\|(1-\cos\theta_{y_i})|ξ|}}    (6)
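For concreteness, a minimal NumPy sketch of the forward computation of Eq. 6 is given below (bias omitted, as in the derivation above). The function and variable names are our own and the snippet is not the authors' Caffe implementation; the α passed in corresponds to the α of Eq. 5, whereas the experiments later report α².

```python
import numpy as np

def noisy_softmax_loss(X, W, y, alpha, rng):
    """Minimal sketch of the Noisy Softmax loss (Eq. 6).
    X: (N, D) features entering the last fully connected layer,
    W: (D, C) classifier weights (bias omitted), y: (N,) integer labels."""
    f = X @ W                                           # (N, C) softmax inputs
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(W[:, y], axis=0)
    cos_theta = f[np.arange(len(y)), y] / (norms + 1e-12)
    xi = np.abs(rng.standard_normal(len(y)))            # |xi|, xi ~ N(0, 1)
    noise = alpha * norms * (1.0 - cos_theta) * xi      # annealed noise (Eq. 5)
    f[np.arange(len(y)), y] -= noise                    # f_{y_i} -> f_{y_i}^{noise}
    f -= f.max(axis=1, keepdims=True)                   # numerical stability
    log_prob = f - np.log(np.exp(f).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
X, W = rng.standard_normal((4, 8)), rng.standard_normal((8, 10))
print(noisy_softmax_loss(X, W, np.array([0, 3, 7, 9]), alpha=0.3, rng=rng))
```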
Figure 3. Saturation status vs. iteration with different formulations of noise. 'Normal' and 'Neg' represent normal noise and negative noise respectively. α² is set to 0.1 in our experiments.

Figure 4. CIFAR100 testing error vs. iteration with different formulations of noise. 'Normal' and 'Neg' represent normal noise and negative noise respectively. α² is set to 0.1 in our experiments.

Optimization. We use Eq. 6 throughout our experiments and optimize our model with the commonly used SGD. Thus we need to compute the forward and backward propagation, with \cos\theta_{y_i} replaced by \frac{W_{y_i}^T X_i}{\|W_{y_i}\|\|X_i\|}.
For forward and backward propagation, the only difference between the Noisy Softmax loss and the standard softmax loss lies in f_{y_i}. For example, in forward propagation, f_j for all j ≠ y_i is computed in the same way as in the original softmax, while f_{y_i} is replaced with f^{noise}_{y_i}. In backward propagation, \frac{\partial L}{\partial X_i} = \sum_j \frac{\partial L}{\partial f_j}\frac{\partial f_j}{\partial X_i} and \frac{\partial L}{\partial W_{y_i}} = \sum_j \frac{\partial L}{\partial f_j}\frac{\partial f_j}{\partial W_{y_i}}; only when j = y_i do the computations of \frac{\partial f_j}{\partial X_i} and \frac{\partial f_j}{\partial W_{y_i}} differ from the original softmax, and they are listed as follows:

\frac{\partial f^{noise}_{y_i}}{\partial X_i} = W_{y_i} - α|ξ|\left(\frac{X_i\|W_{y_i}\|}{\|X_i\|} - W_{y_i}\right)    (7)

\frac{\partial f^{noise}_{y_i}}{\partial W_{y_i}} = X_i - α|ξ|\left(\frac{W_{y_i}\|X_i\|}{\|W_{y_i}\|} - X_i\right)    (8)

For simplicity, we leave out \frac{\partial L}{\partial f_j} and \frac{\partial f_j}{\partial X_i}, \frac{\partial f_j}{\partial W_{y_i}} (∀j ≠ y_i) since they are the same for both Noisy Softmax and the original softmax. In short, except for the case j = y_i, the overall computation of Noisy Softmax is similar to that of the original softmax.
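The derivatives in Eq. 7 and Eq. 8 treat the sampled |ξ| as a constant within an iteration. A quick finite-difference check (our own verification sketch, not from the paper) confirms the reconstructed forms:

```python
import numpy as np

def f_noise(X_i, W_yi, alpha, xi_abs):
    """f_{y_i}^{noise} = W^T X - alpha*|xi|*(||W||*||X|| - W^T X); bias omitted."""
    return W_yi @ X_i - alpha * xi_abs * (
        np.linalg.norm(W_yi) * np.linalg.norm(X_i) - W_yi @ X_i)

rng = np.random.default_rng(1)
X_i, W_yi = rng.standard_normal(6), rng.standard_normal(6)
alpha, xi_abs, eps = 0.3, 0.7, 1e-6

# Analytic gradients of Eq. 7 and Eq. 8 (the sampled |xi| is held fixed).
gX = W_yi - alpha * xi_abs * (X_i * np.linalg.norm(W_yi) / np.linalg.norm(X_i) - W_yi)
gW = X_i - alpha * xi_abs * (W_yi * np.linalg.norm(X_i) / np.linalg.norm(W_yi) - X_i)

# Central finite differences should agree with the analytic forms.
numX = np.array([(f_noise(X_i + eps * e, W_yi, alpha, xi_abs)
                  - f_noise(X_i - eps * e, W_yi, alpha, xi_abs)) / (2 * eps)
                 for e in np.eye(6)])
numW = np.array([(f_noise(X_i, W_yi + eps * e, alpha, xi_abs)
                  - f_noise(X_i, W_yi - eps * e, alpha, xi_abs)) / (2 * eps)
                 for e in np.eye(6)])
print(np.max(np.abs(gX - numX)), np.max(np.abs(gW - numW)))  # both ~1e-9
```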
5. Discussion
5.1. The Effect of Noise Scale α
In Noisy Softmax, the scale of the annealed noise is largely determined by the hyper-parameter α. When α = 0, in the zero-noise limit, Noisy Softmax is identical to ordinary softmax; individual saturation then appears and the SGD solver has a high chance of converging at a local minimum. Without extra training data, the model easily over-fits. However, when α is too large, back-propagating through f^{noise}_{y_i} gives rise to large derivatives, so the algorithm sees only the noise instead of the real signal and moves around blindly. Hence, a relatively small α is required to aid the generalization of the model.

We evaluate the performance of Noisy Softmax with different α on several datasets. Note that the value of α is not carefully tuned, and we take α = 0 (i.e. softmax) as our baseline. The comparison results are listed in Tables 3, 4 and 5. One can observe that Noisy Softmax with an appropriate α (e.g. α² = 0.1) obtains better recognition accuracy than ordinary softmax on all of the datasets. For intuition, we summarize the results on CIFAR100 in Figure 6. When α² = 0.1, our method outperforms the original softmax, demonstrating that Noisy Softmax does improve the generalization ability of the CNN by encouraging the SGD solver to be more exploratory and to converge at a 'global minimum'. When α rises to 1, the large noise causes the network to converge more slowly and produces worse performance than the baseline, since the large noise drowns out the helpful signal and the solver sees only the noise.
5.2. Saturation Study
To illustrate the significance of early softmax desaturation based on non-negative noise injection, we investigate the impact of different formulations of noise on individual saturation, such as normal noise n = σξ and negative noise n = -σ|ξ| (σ is the same as in Eq. 5). From these formulations, one can expect more saturated instances when training with negative noise (which serves as a counterexample). To analyse the saturation state intuitively, we compute the average probability prediction over the entire training set as follows:

P = \frac{1}{N}\sum_{j=1}^{C}\sum_{i=1}^{N_j} P(y_i|x_i)    (9)

where C is the number of classes and N_j is the number of images within the j-th class. Figures 3 and 4 show the saturation status for the different kinds of noise and the testing error rates on CIFAR100, respectively.
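The metric in Eq. 9 is simply the mean posterior assigned to the ground-truth class over the training set; a minimal NumPy helper (our own, hypothetical monitoring utility, not the authors' code) could look like this:

```python
import numpy as np

def average_prediction(logits, labels):
    """Mean ground-truth posterior over a dataset (Eq. 9), used as a proxy for
    how strongly the training samples are already saturated."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(2)
logits = 5.0 * rng.standard_normal((1000, 100))   # sharp but mostly wrong logits
labels = rng.integers(0, 100, size=1000)
print(average_prediction(logits, labels))          # low value: little saturation yet
```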
From the results in Figure 3, one can observe that when training with the original softmax or with negative noise, the average prediction rises quickly to a relatively high level, almost 0.9, implying that early individual saturation is serious, and finally goes up to nearly 1. Moreover, the average prediction with negative noise is higher than that of softmax, implying that early individual saturation is exacerbated, since many instances are artificially mapped to saturated ones.

Figure 5. CIFAR100 training error vs. iteration with different α.

Figure 6. CIFAR100 testing error vs. iteration with different α.
From the results in Figure 4, one can observe that the testing error of negative noise drops slowly and finally settles at a relatively high level, nearly 37%, verifying that the exploration of SGD is seriously impeded by early individual saturation. In the normal-noise case, the testing error and the trend of the average prediction are similar to those of the original softmax, as shown in Figures 4 and 3 respectively, since the expectation E(n) is close to zero.

In contrast, when training with Noisy Softmax, the average prediction rises slowly and is much lower than that of the original softmax at the early training stage, as shown in Figure 3, verifying that the early individual saturation behavior is largely avoided. From the results in Figure 4, one can observe that Noisy Softmax outperforms the baseline and significantly improves the performance, reaching a 28.48% testing error rate. Note that after 3,000 iterations our method achieves a better testing error yet a lower average prediction, demonstrating that the early desaturation of softmax gives the SGD solver chances to traverse more of the parameter space in search of an optimal solution. As the noise level decreases, the solver will prefer a better local minimum where the signal gives a strong response to SGD; it will then spend more time exploring this region and converge there, which can be regarded as a 'global minimum', in a finite number of steps.

In summary, injecting the non-negative noise n = σ|ξ| into softmax does prevent early individual saturation and further improves the generalization ability of the CNN when it is optimized by standard SGD.
5.3. Annealed Noise Study
When addressing early individual saturation, the key idea is to add annealed noise. In order to highlight the superiority of our annealed noise described in Sec. 4, we compare it to free noise n = α|ξ| and amplitude noise n = α\|W_{y_i}\|\|X_i\||ξ|. We evaluate them on CIFAR100 and the results are listed in Table 1.

α²      Noisy Softmax   free    amplitude
0       31.77           31.77   31.77
0.05    29.99           31.43   30.96
0.1     28.48           31.04   29.97
0.5     30.22           30.88   fail
1       35.23           31.20   fail
Table 1. Testing error rates (%) vs. different formulations of noise on CIFAR100.
From the results, it can be observed that our Noisy Softmax outperforms the other two formulations of noise. In the free-noise case, where σ (described in Sec. 4) is set to a fixed value α and the noise is fully independent of the input, adding this noise is still a desaturation operation, but the accuracy gain over the baseline is small, because the desaturation does not adapt to the magnitude of the softmax input; in other words, the remedy does not suit the case. In the amplitude-noise case, where σ is set to α\|W_{y_i}\|\|X_i\|, the subtracted noise is more prudent because it takes the level of the softmax input into account, thus yielding a better accuracy gain than free noise, although it is still worse than Noisy Softmax. In both the Noisy Softmax and amplitude-noise cases, as the exploration proceeds, the 'globally better' region of the parameter space has already been seen by SGD, and it is then time to explore this region patiently; in other words, smaller noise is required. Noisy Softmax realizes this idea by annealing the noise, whereas in the amplitude-noise case the noise level does not shrink, remains somewhat large at this stage, and thereby hinders the detailed learning. Reviewing the formulation of our annealed noise, one can observe that it is constructed by combining θ_{y_i}, which acts as a time identifier, into the amplitude noise. With the factor 1 - \cos\theta_{y_i} injected, the noise is decreased adaptively.
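The difference between the three formulations reduces to how σ is chosen. The short sketch below (our own illustration with arbitrary norms, not code from the paper) shows that only the annealed scale shrinks as cos θ_{y_i} approaches 1.

```python
import numpy as np

def noise_scale(kind, alpha, w_norm, x_norm, cos_theta):
    """Noise scale sigma for the three variants compared in Table 1 (our naming)."""
    if kind == "free":        # fixed, input-independent scale
        return alpha
    if kind == "amplitude":   # scaled by the magnitude of the softmax input
        return alpha * w_norm * x_norm
    if kind == "annealed":    # Noisy Softmax: additionally decays as theta_yi shrinks
        return alpha * w_norm * x_norm * (1.0 - cos_theta)
    raise ValueError(kind)

# As training converges, cos(theta_yi) -> 1, so only the annealed scale vanishes.
for c in (0.2, 0.6, 0.9, 0.99):
    print(c, [round(noise_scale(k, 0.3, 4.0, 3.0, c), 3)
              for k in ("free", "amplitude", "annealed")])
```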
5.4. Regularization Ability
We find experimentally that Noisy Softmax can regularize the CNN model by preventing over-fitting. Figures 5 and 6 show the training and testing error curves on the CIFAR100 dataset with different α. One can observe that without noise injection (i.e. α = 0), the training recognition error drops quickly to quite a low level, almost 0%, whereas the testing recognition error stops falling at a relatively high level, nearly 31.77%. Conversely, when α² is set to an appropriate value such as 0.1, the training error drops more slowly and stays much higher than the baseline, but the testing error reaches a lower level, nearly 28.48%, and still shows a decreasing trend. Even when α² = 0.5, the training error is higher yet the testing error is also lower, nearly 30.22%. This demonstrates that encouraging SGD to converge at a better local minimum indeed prevents over-fitting, and that Noisy Softmax has a strong regularization ability.

Figure 7. Geometric interpretation of data augmentation.
As analysed above, our Noisy Softmax can be regarded as a regularization technique that prevents over-fitting by making SGD more exploratory. Here we analyse this regularization ability from another perspective, that of data augmentation, which has an intuitive physical interpretation. In the original case, the softmax input coming from the data point (x_i, y_i) is f_{y_i} = \|W_{y_i}\|\|X_i\|\cos\theta_{y_i} (we omit the constant b_{y_i} for simplicity). Now consider a new input (x'_i, y_i), where \|X'_i\| = \|X_i\| and the angle θ'_{y_i} between the vectors W_{y_i} and X'_i is \arccos((1 + α|ξ|)\cos\theta_{y_i} - α|ξ|). Then f'_{y_i} = \|W_{y_i}\|\|X'_i\|\cos\theta'_{y_i} = W_{y_i}^T X_i - α\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i})|ξ| = f^{noise}_{y_i}, implying that f^{noise}_{y_i} can be regarded as coming from a new data point (x'_i, y_i). Notably, since θ'_{y_i} > θ_{y_i}, these generated data include many boundary examples, which are very helpful for discriminative feature learning, as illustrated in Figure 7. In summary, generating the noisy input f^{noise}_{y_i} is equivalent to generating new training data, which is an efficient form of data augmentation.
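The algebra above is easy to check numerically; the sketch below (our own, with random vectors standing in for W_{y_i} and X_i) confirms that the noisy input coincides with the softmax input of the virtual sample x'_i.

```python
import numpy as np

rng = np.random.default_rng(3)
X_i, W_yi = rng.standard_normal(5), rng.standard_normal(5)
alpha, xi_abs = 0.2, 0.5

wn, xn = np.linalg.norm(W_yi), np.linalg.norm(X_i)
cos_t = (W_yi @ X_i) / (wn * xn)
f_noisy = W_yi @ X_i - alpha * wn * xn * (1.0 - cos_t) * xi_abs   # Eq. 5, bias omitted

# Virtual sample x'_i: same norm as X_i, but a larger angle to W_yi such that
# cos(theta') = (1 + alpha*|xi|)*cos(theta) - alpha*|xi|.
cos_t_prime = (1.0 + alpha * xi_abs) * cos_t - alpha * xi_abs
f_prime = wn * xn * cos_t_prime            # ||W_yi|| * ||x'_i|| * cos(theta')

print(abs(f_noisy - f_prime))              # ~0: the noisy input equals the new sample's input
```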
To verify the discussion above, we evaluate Noisy Softmax on two subsets of the MNIST dataset, which contain only 600 (1%) and 6,000 (10%) training instances respectively. Our CNN configuration is shown in Table 2. With the same training strategy as in Section 6.1, we obtain 3.82% and 1.30% testing error rates on the original testing set, respectively. Meanwhile, in both cases, the training error rates quickly drop to nearly 0%, showing that over-fitting occurs. However, when training with Noisy Softmax (α² = 0.5), we obtain 2.46% and 0.93% testing error rates, respectively. This demonstrates that Noisy Softmax improves the generalization ability of the CNN through implicit data augmentation. Judging from the accuracy improvements on these two subsets and on CIFAR100 (which has 500 instances per class), Noisy Softmax is especially effective when the amount of training data is limited.
5.5. Relationship to Other Methods
Multi-task learning: combining several tasks in one system improves the generalization ability [37]. Considering a multi-task learning system with input data x_i, the overall objective loss function is a combination of several sub-objective loss functions, written as L = \sum_j L_j(ϑ_0, ϑ_j, x_i), where ϑ_0 and ϑ_j, j ∈ [1, 2, ···], are the generic parameters and the task-specific parameters, respectively. When optimizing with standard SGD, the generic parameters ϑ_0 are updated as ϑ_0 = ϑ_0 - γ(\sum_j \frac{\partial L_j}{\partial ϑ_0}), where γ is the learning rate. In Noisy Softmax, from an overall training perspective, our loss function can also be regarded as a combination of many noise-dependent, changing losses L^{noise_k} = -\log\frac{e^{f_{y_i}}}{\sum_{c \neq y_i} e^{f_c} + e^{f^{noise}_{y_i}}} + α|ξ|_k\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i}), k ∈ [1, m], i.e. L = \sum_{k=1}^{m} L^{noise_k}, where m is an uncertain number related to the noise scale and the number of iterations. Thus, the overall contribution to the system can be regarded as ϑ = ϑ - γ(\sum_{k=1}^{m} \frac{\partial L^{noise_k}}{\partial ϑ}). So our method can be regarded as a special case of multi-task learning in which the task-specific parameters are shared across tasks.

However, in a multi-task learning system, manually designing task-specific losses is prohibitively expensive, so the number of tasks is limited and small. In the Noisy Softmax training procedure, by contrast, the model is constrained by many randomly generated tasks (quantified by L^{noise_k}). Thus, training a model with Noisy Softmax can be regarded as training with a massive number of tasks that would be very costly and often infeasible to design in an ordinary multi-task learning system.
Noise injection: some research works inject noise into earlier layers of neural networks, such as into the neuron activation functions ReLU [4, 1] and sigmoid [10], into the weights [9, 3], and into the gradients [32]. We emphasize that Noisy Softmax adds noise to a single loss layer instead of to many previous layers, which is more convenient and efficient for implementation and model training, and that it is applied to DCNNs. It is intrinsically different from DisturbLabel [51], where the noise is produced by disturbing labels and also takes effect on the loss layer: Noisy Softmax starts from the clear objective of early softmax desaturation, and the noise is adaptively annealed and injected in an explicit manner.

Desaturation: many other desaturation works, such as replacing sigmoid with ReLU [6] and building skip connections between layers [41, 11, 13, 14], solve gradient-vanishing problems that occur in the bottom layers. Our Noisy Softmax instead solves the problem of early gradient vanishing in the top layer (i.e. the loss layer), which is caused by early individual saturation. We emphasize that solving early gradient vanishing in the top layer is crucial for parameter updates and model optimization, since the top layer is the source of gradient propagation. In summary, by postponing early individual saturation we obtain continuous gradient propagation from the top layer and further encourage SGD to be more exploratory.
6. Experiments and Results
We evaluate our proposed Noisy Softmax algorithm on several benchmark datasets, including MNIST [25], CIFAR10/100 [21], LFW [16], FGLFW [54] and YTF [49]. Note that in all of our experiments we use only a single model for the evaluation of Noisy Softmax, and both softmax and Noisy Softmax use the same CNN architectures, shown in Table 2.

Layer            MNIST (Sec. 5.4)  MNIST        CIFAR10/10+   CIFAR100      LFW/FGLFW/YTF
Block1           [3x3,40]x2        [3x3,64]x3   [3x3,64]x4    [3x3,96]x4    [3x3,64]x1
Pool1            Max [2x2], stride 2
Block2           [3x3,60]x1        [3x3,64]x3   [3x3,128]x4   [3x3,192]x4   [3x3,128]x1
Pool2            Max [2x2], stride 2
Block3           [3x3,60]x1        [3x3,64]x3   [3x3,256]x4   [3x3,384]x4   [3x3,256]x2
Pool3            Max [2x2], stride 2
Block4           -                 -            -             -             [3x3,512]x3, padding 0
Fully Connected  100               256          512           512           3000
Table 2. CNN architectures for the different benchmark datasets. Blockx denotes a container of several convolution components with the same configuration, e.g. [3x3,64]x4 denotes 4 cascaded convolution layers with 64 filters of size 3x3.
6.1. Architecture Settings and Implementation
As VGG [39] becomes a commonly used CNN archi-
tecture, the cascaded layers with small size filters gradually
take the place of single layer with large size filters. Since
these cascaded layers have less parameters, lower computa-
tional complexity and stronger representation ability com-
pared to the single layer. E.g. a single 5x5 convolution
layer is replaced with 2 cascaded 3x3 convolution layers.
Inspired by [39, 31], we design the architectures as shown in
Table 2. In convolution layers, both the stride and padding
are set to 1 if not specified. In pooling layers, we use 2x2
max-pooling filter with stride 2. We adopt the piece-wise
linear functions PReLU [12] as our neuron activation func-
tions. Then we use the weight initialization [12] and batch
normalization [18] in our networks. All of our experiments
are implemented by Caffe library [47] with our own modi-
fications. We use standard SGD to optimize our CNNs and
the batch sizes are 256 and 200 for object experiments and
face experiments, respectively. For data preprocessing, we
just perform the mean substraction.
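To make the architecture description concrete, the sketch below rebuilds the CIFAR100 column of Table 2 in PyTorch under the conventions stated above (3x3 convolutions with stride 1 and padding 1, batch normalization and PReLU, 2x2 max-pooling with stride 2). This is a hypothetical re-implementation for illustration, not the authors' Caffe configuration; the Noisy Softmax loss itself would sit on top of the final 512-dimensional layer.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, n_layers):
    """n_layers cascaded 3x3 convolutions (stride 1, padding 1), each followed by
    batch normalization and PReLU, mirroring a 'Blockx' container of Table 2."""
    layers = []
    for i in range(n_layers):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.BatchNorm2d(out_ch), nn.PReLU()]
    return layers

# Rough re-implementation of the CIFAR100 column of Table 2:
# 32x32 input -> 4x4x384 feature map before the fully connected layer.
cifar100_net = nn.Sequential(
    *conv_block(3, 96, 4),    nn.MaxPool2d(2, 2),
    *conv_block(96, 192, 4),  nn.MaxPool2d(2, 2),
    *conv_block(192, 384, 4), nn.MaxPool2d(2, 2),
    nn.Flatten(), nn.Linear(384 * 4 * 4, 512))
```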
Training. In the object recognition tasks, the initial learning rate is 0.1 and is divided by 10 at 12k iterations; the total number of iterations is 16k. Note that although we train our CNNs with a coarsely adjusted learning rate schedule, the results of all experiments are impressive and consistent, verifying the effectiveness of our method. For the face recognition tasks, we start with a learning rate of 0.01 and divide it by 5 whenever the training loss stops dropping.

Testing. We use the original softmax to classify the testing data in the object datasets. On the face datasets, we evaluate with the cosine distance rule after PCA reduction.
6.2. Evaluation on MNIST Dataset
MNIST [25] contains 60,000 training samples and 10,000 testing samples, uniformly distributed over 10 classes; all samples are 28x28 gray-scale images. Our CNN architecture is shown in Table 2 and we use a weight decay of 0.001. The results of the state-of-the-art methods and of our proposed Noisy Softmax with different α are listed in Table 3.

From the results, our Noisy Softmax (α² = 0.1) not only outperforms the original softmax over the same architecture, but also achieves competitive performance compared to the state-of-the-art methods. It can also be observed that Noisy Softmax produces consistent accuracy gains with coarsely adjusted α², such as 0.05, 0.1 and 0.5, and that our method achieves the same accuracy as DisturbLabel [51], which adds dropout to several layers, demonstrating the effectiveness of our technique.
6.3. Evaluation on CIFAR Datasets
CIFAR [21] has two evaluation protocols, over 10 and 100 classes respectively. CIFAR10/100 contains 50,000 training samples and 10,000 testing samples; all samples are 32x32 RGB images, uniformly distributed over the 10 or 100 classes. We use different CNN architectures for the CIFAR10 and CIFAR100 experiments; these network configurations are shown in Table 2.

We evaluate our method on CIFAR10 and CIFAR100, and the results are shown in Table 4. For data augmentation, we use a simple scheme: randomly cropping a 30x30 patch. From our experimental results, one can observe that Noisy Softmax (α² = 0.1) outperforms all of the other methods on these two datasets, improving accuracy over the baseline by nearly 1% on CIFAR10 and by more than 3% on CIFAR100.
6.4. Evaluation on Face Datasets
LFW [16] contains 13,233 images of 5,749 celebrities. Under unrestricted conditions, it provides 6,000 face pairs for the verification protocol, and closed-set and open-set settings for the identification protocol adopted in [2].

FGLFW [54] is a derivative of LFW: the images all come from LFW, but the face pairs are difficult to classify as belonging to the same person or not. It is a convenient dataset for performance evaluation due to its simple verification protocol and challenging face pairs.

YTF [49] provides 5,000 video pairs for face verification. We use the average representation of 100 randomly selected samples from each video for evaluation.

Method                        MNIST
CNN [19]                      0.53
NiN [29]                      0.47
Maxout [7]                    0.45
DSN [27]                      0.39
R-CNN [28]                    0.31
GenPool [26]                  0.31
DisturbLabel [51]             0.33
Softmax                       0.43
Noisy Softmax (α² = 1)        0.42
Noisy Softmax (α² = 0.5)      0.33
Noisy Softmax (α² = 0.1)      0.33
Noisy Softmax (α² = 0.05)     0.37
Table 3. Recognition error rates (%) on MNIST.

Method                        CIFAR10   CIFAR10+   CIFAR100
NiN [29]                      10.47     8.81       35.68
Maxout [7]                    11.68     9.38       38.57
DSN [27]                      9.69      7.97       34.57
All-CNN [40]                  9.08      7.25       33.71
R-CNN [28]                    8.69      7.09       31.75
ResNet [13]                   N/A       6.43       N/A
DisturbLabel [51]             9.45      6.98       32.99
Softmax                       8.11      6.98       31.77
Noisy Softmax (α² = 1)        9.09      8.77       35.23
Noisy Softmax (α² = 0.5)      7.84      7.13       30.22
Noisy Softmax (α² = 0.1)      7.39      6.36       28.48
Noisy Softmax (α² = 0.05)     7.58      6.61       29.99
Table 4. Recognition error rates (%) on CIFAR datasets. + denotes data augmentation.

Method                        Images     Models   LFW     Rank-1   DIR@FAR=1%   FGLFW   YTF
FaceNet [36]                  200M*      1        99.65   -        -            -       95.18
DeepID2+ [44]                 300k*      1        98.7    -        -            -       91.90
DeepID2+ [44]                 300k*      25       99.47   95.00    80.70        -       93.20
Sparse [45]                   300k*      1        99.30   -        -            -       92.70
VGG [33]                      2.6M       1        97.27   74.10    52.01        88.13   92.80
WebFace [52]                  WebFace    1        97.73   -        -            -       90.60
Robust FR [5]                 WebFace    1        98.43   -        -            -       -
Lightened CNN [50]            WebFace    1        98.13   89.21    69.46        91.22   91.60
Softmax                       WebFace+   1        98.83   91.68    69.51        92.95   94.22
Noisy Softmax (α² = 0.1)      WebFace+   1        99.18   92.68    78.43        94.50   94.88
Noisy Softmax (α² = 0.05)     WebFace+   1        99.02   92.24    75.67        94.02   94.51
Table 5. Recognition accuracies (%) on LFW, FGLFW and YTF datasets. * denotes images that are not publicly available and + denotes data expansion. On LFW, closed-set and open-set identification accuracies are evaluated by Rank-1 and DIR@FAR=1% respectively.
For data preprocessing, we align and crop the images based on the eye and mouth centers, yielding 104×96 RGB images. Our CNN configuration is shown in Table 2; here we add an element-wise maxout layer [50] after the 3,000-dimensional fully connected layer, yielding a 1,500-dimensional output, and a contrastive loss is applied to this output as in DeepID2 [43]. We then train a single CNN model with outside data from the publicly available CASIA-WebFace dataset [52] and our own collected data (about 400k images of 14k identities). We extract the features of each image and its horizontally flipped version and compute their mean feature vector as the representation. From the results shown in Table 5, one can observe that Noisy Softmax (α² = 0.1) improves the performance over the baseline, and the result is also comparable to current state-of-the-art methods that use private data and even model ensembles. In addition, we further improve our results to 99.31%, 94.43%, 82.50%, 94.88% and 95.37% (listed in the same protocol order as in Table 5) with a two-model ensemble.
7. Conclusion
In this paper, we propose Noisy Softmax to address early individual saturation by injecting annealed noise into the softmax input. It is a form of early softmax desaturation achieved by postponing early individual saturation. We show that our method can easily be used as a drop-in replacement for standard softmax and is easy to optimize. It significantly improves the performance of CNN models, since the early desaturation operation has a substantial effect on parameter updates during back-propagation and thereby improves the generalization ability of DCNNs. Empirical studies verify the superiority of softmax desaturation. Meanwhile, our method achieves state-of-the-art or competitive results on several datasets.
8. Acknowledgments
This work was partially supported by the National
Natural Science Foundation of China (Project 61573068,
61471048, 61375031 and 61532006), Beijing Nova Pro-
gram under Grant No. Z161100004916088, the Fundamen-
tal Research Funds for the Central Universities under Grant
No. 2014ZD03-01, and the Program for New Century Excellent Talents in University (NCET-13-0683).
References
[1] Y. Bengio. Estimating or propagating gradients through
stochastic neurons. Computer Science, 2013.
[2] L. Best-Rowden, H. Han, C. Otto, B. F. Klare, and A. K.
Jain. Unconstrained face recognition: Identifying a person
of interest from a media collection. IEEE Transactions on
Information Forensics and Security, 9(12):2144–2157, 2014.
[3] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra.
Weight uncertainty in neural networks. Computer Science,
2015.
V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, pages 807–814, 2010.
[5] C. Ding and D. Tao. Robust face recognition via multimodal
deep face representation for multimedia applications. IEEE
Transactions on Multimedia, 17(11):2049–2058, 2015.
[6] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier
neural networks. Journal of Machine Learning Research, 15,
2010.
[7] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville,
and Y. Bengio. Maxout networks. Computer Science, pages
1319–1327, 2013.
[8] B. Graham. Fractional max-pooling. Eprint Arxiv, 2014.
[9] A. Graves. Practical variational inference for neural net-
works. Advances in Neural Information Processing Systems,
pages 2348–2356, 2011.
[10] C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio. Noisy
activation functions. 2016.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. Computer Science, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into
rectifiers: Surpassing human-level performance on imagenet
classification. pages 1026–1034, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in
deep residual networks. 2016.
[14] G. Huang, Z. Liu, and K. Weinberger. Densely connected
convolutional networks. 2016.
[15] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep
networks with stochastic depth. 2016.
G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. 2008.
[17] S. Huang, Z. Xu, D. Tao, and Y. Zhang. Part-stacked cnn for
fine-grained visual categorization. Computer Science, 2015.
[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
Computer Science, 2015.
[19] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. Lecun. What
is the best multi-stage architecture for object recognition?
pages 2146–2153, 2009.
[20] J. Krause, H. Jin, J. Yang, and F. F. Li. Fine-grained recogni-
tion without part annotations. In IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 5546–5555,
2015.
[21] A. Krizhevsky. Learning multiple layers of features from
tiny images. 2012.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Im-
agenet classification with deep convolutional neural net-
works. Advances in Neural Information Processing Systems,
25(2):2012, 2012.
[23] D. Laptev and J. M. Buhmann. Transformation-invariant
convolutional jungles. In IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 3043–3051, 2015.
[24] D. Laptev, N. Savinov, J. M. Buhmann, and M. Pollefeys. Ti-
pooling: transformation-invariant pooling for feature learn-
ing in convolutional neural networks. 2016.
[25] Y. Lecun and C. Cortes. The mnist database of handwritten
digits.
[26] C. Y. Lee, P. W. Gallagher, and Z. Tu. Generalizing pooling
functions in convolutional neural networks: Mixed, gated,
and tree. Computer Science, 2015.
[27] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-
supervised nets. In AISTATS, volume 2, page 6, 2015.
[28] M. Liang and X. Hu. Recurrent convolutional neural network
for object recognition. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, pages
3367–3375, 2015.
[29] M. Lin, Q. Chen, and S. Yan. Network in network. Computer
Science, 2013.
[30] T. Y. Lin, A. Roychowdhury, and S. Maji. Bilinear cnn mod-
els for fine-grained visual recognition. In IEEE International
Conference on Computer Vision, pages 1449–1457, 2015.
[31] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin soft-
max loss for convolutional neural networks. In International
Conference on International Conference on Machine Learn-
ing, pages 507–516, 2016.
[32] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser,
K. Kurach, and J. Martens. Adding gradient noise improves
learning for very deep networks. Computer Science, 2015.
[33] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face
recognition. In British Machine Vision Conference, 2015.
[34] R. Reed, R. J. Marks, and S. Oh. An equivalence between
sigmoidal gain scaling and training with noisy (jittered) in-
put data. In Neuroinformatics and Neurocomputers, 1992.,
RNNS/IEEE Symposium on, pages 120 – 127, 1992.
[35] R. Reed, S. Oh, and R. J. Marks. Regularization using jit-
tered training data. In International Joint Conference on
Neural Networks, pages 147–152 vol.3, 1992.
[36] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A uni-
fied embedding for face recognition and clustering. pages
815–823, 2015.
[37] M. L. Seltzer and J. Droppo. Multi-task learning in deep
neural networks for improved phoneme recognition. pages
6965–6969, 2013.
[38] W. Shang, K. Sohn, D. Almeida, and H. Lee. Understanding
and improving convolutional neural networks via concate-
nated rectified linear units. 2016.
[39] K. Simonyan and A. Zisserman. Very deep convolutional
networks for large-scale image recognition. Computer Sci-
ence, 2015.
[40] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Ried-
miller. Striving for simplicity: The all convolutional net.
Eprint Arxiv, 2014.
[41] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway
networks. Computer Science, 2015.
[42] M. Steijvers and P. Grunwald. A recurrent network that per-
forms a context-sensitive prediction task. In Conference of
the Cognitive Science, 2000.
[43] Y. Sun, X. Wang, and X. Tang. Deep learning face represen-
tation by joint identification-verification. Advances in Neural
Information Processing Systems, 27:1988–1996, 2014.
[44] Y. Sun, X. Wang, and X. Tang. Deeply learned face represen-
tations are sparse, selective, and robust. Computer Science,
pages 2892–2900, 2014.
[45] Y. Sun, X. Wang, and X. Tang. Sparsifying neural network
connections for face recognition. Computer Science, 2015.
[46] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.
Going deeper with convolutions. pages 1–9, 2015.
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678, 2014.
[48] X. S. Wei, C. W. Xie, and J. Wu. Mask-cnn: Localizing parts
and selecting descriptors for fine-grained image recognition.
2016.
[49] L. Wolf, T. Hassner, and I. Maoz. Face recognition in
unconstrained videos with matched background similarity.
42(7):529–534, 2011.
[50] X. Wu, R. He, and Z. Sun. A lightened cnn for deep face
representation. Computer Science, 2015.
[51] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. Disturblabel:
Regularizing cnn on the loss layer. 2016.
[52] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face represen-
tation from scratch. Computer Science, 2014.
[53] M. D. Zeiler and R. Fergus. Stochastic pooling for regu-
larization of deep convolutional neural networks. Computer
Science, 2013.
[54] N. Zhang and W. Deng. Fine-grained lfw database. In 2016
International Conference on Biometrics (ICB), 2016.