

Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing the Early Softmax Saturation

Binghui Chen1, Weihong Deng1, Junping Du2
1School of Information and Communication Engineering, Beijing University of Posts and Telecommunications,
2School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China.
chenbinghui@bupt.edu.cn, whdeng@bupt.edu.cn, junpingd@bupt.edu.cn

Abstract

Over the past few years, softmax and SGD have become a commonly used component and the default training strategy in CNN frameworks, respectively. However, when optimizing CNNs with SGD, the saturation behavior behind softmax always gives us an illusion of training well and is then overlooked. In this paper, we first emphasize that the early saturation behavior of softmax will impede the exploration of SGD, which is sometimes a reason for a model converging at a bad local minimum, and we then propose Noisy Softmax to mitigate this early saturation issue by injecting annealed noise into softmax during each iteration. This noise-injection operation aims at postponing the early saturation and further producing continuous gradient propagation, so as to significantly encourage the SGD solver to be more exploratory and help it find a better local minimum. This paper empirically verifies the superiority of early softmax desaturation, and our method indeed improves the generalization ability of CNN models through regularization. We experimentally find that this early desaturation helps optimization in many tasks, yielding state-of-the-art or competitive results on several popular benchmark datasets.

1. Introduction

Recently, deep convolutional neural networks (DCNNs) have taken the computer vision field by storm, significantly improving the state-of-the-art performance in many visual tasks, such as face recognition [43, 44, 33, 36], large-scale image classification [22, 39, 46, 11, 13], and fine-grained object classification [30, 17, 20, 48]. Meanwhile, the softmax layer and the training strategy of SGD together with back-propagation (BP) have become the default components, and they are generally applied in most of the aforementioned works.

Figure 1. Decomposition of the typical softmax layer in a DCNN. It can be rewritten into three parts: the fully connected component, the softmax activation, and the cross-entropy loss.

It is widely observed that when optimizing with SGD and BP, smooth and free gradient propagation is crucial to improving the training of DCNNs. For example, replacing the sigmoid activation function with piecewise-linear activation functions such as ReLU and PReLU [12] handles the problem of gradient vanishing caused by sigmoid saturation and allows the training of much deeper networks. Interestingly, the softmax activation function (illustrated in Figure 1) is implicitly like the sigmoid function due to their similar formulations (shown in Sec. 3), and it exhibits the same saturation behavior when its input is large. However, many take the softmax activation for granted, and the problem behind its saturation behavior is overlooked as a result of the illusion created by the performance improvements of DCNNs.

In standard SGD, the saturation behavior of softmax turns up when its output is very close to the ground truth, which is certainly the goal of model training. However, in some ways it is a barrier to improving the generalization ability of CNNs, especially when it shows up early (inopportunely). Concretely, an input instance stops contributing gradients to BP early when its softmax output is prematurely saturated, yielding short-lived gradient propagation that is not enough for robust learning. In this case, the learning process with SGD and BP can hardly explore further, due to poor gradient propagation and parameter updates. We define this saturation behavior as individual saturation and the corresponding instance as a saturated one. As training goes on, the number of non-saturated contributing training samples gradually decreases and the robust learning of the network is impeded. This is sometimes a reason for the algorithm going to a bad local minimum1 and having difficulty escaping it. Furthermore, the problem of over-fitting turns up. To this end, we need to give SGD chances to explore more of the parameter space, and early individual saturation is undesired.

1For simplicity, we use local or 'global' minimum to represent a neighbouring region, not a single point.

In this paper, we propose Noisy Softmax, a novel technique of early softmax desaturation, to address the aforementioned issue. This is mainly achieved by injecting annealed noise directly into the softmax activations during each iteration. In other words, Noisy Softmax allows SGD to escape from a bad local minimum and explore more by postponing the early individual saturation. Furthermore, it improves the generalization ability of the system by reducing over-fitting as a direct consequence of more exploration. The main contributions of this work are summarized as follows:

• We provide an insight into softmax saturation, interpreted as individual saturation: early individual saturation produces short-lived gradient propagation, which is poor for the robust exploration of SGD and further causes over-fitting unintentionally.

• We propose Noisy Softmax, which aims at producing rich and continuous gradient propagation by injecting annealed noise into the softmax activations. It allows the 'global' convergence of the SGD solver and aids generalization by reducing over-fitting. To our knowledge, it is the first attempt to address the early saturation behavior of softmax by adding noise.

• Noisy Softmax can easily be used as a drop-in replacement for standard softmax and optimized with standard SGD. It can also be combined with other performance-improving techniques, such as neural activation functions and network architectures.

• Extensive experiments have been performed on several datasets, including MNIST [25], CIFAR10/100 [21], LFW [16], FGLFW [54] and YTF [49]. The impressive results demonstrate the effectiveness of Noisy Softmax.

2. Related Work

Many promising techniques have been developed, such as novel network structures [29, 13, 41], non-linear activation functions [12, 7, 6, 38], pooling strategies [11, 8, 53] and objective loss functions [43, 36], etc.

These approaches are mostly optimized with SGD and back-propagation. In standard SGD, we use the chain rule to compute and propagate the gradients. Thus, any saturation behavior of neuron units or layer components2 is undesired, because the training of deeper frameworks relies on the smooth and free flow of gradient information. An early solution was to replace the sigmoid function with a non-linear piecewise activation function [22]; this is neuron desaturation. Skip connections between different layers exponentially expand the paths of propagation [41, 11, 13, 15, 14]; these belong to layer desaturation, since the forward and backward information can be directly propagated from one layer to any other layer without gradient vanishing. In contrast, since only the early saturation behavior is harmful rather than all of it, we focus on the early desaturation of softmax, which has not been investigated before, and we achieve this by injecting noise explicitly into the softmax activations.

2The saturation of a layer refers to gradient vanishing at a certain layer in back-propagation.

There are some other works related to noise injection. Adding noise to ReLU has been developed to encourage components to explore more in Boltzmann machines and feed-forward networks [4, 1]. Adding noise to sigmoid provides possibilities for training with a much wider family of activation functions than before [10]. Adding weight noise [42], adaptive weight noise [9, 3] and gradient noise [32] also improve learning. Adding annealed noise can help the solver escape from a bad local minimum and find a better one. We follow these inspiring ideas to address individual saturation and encourage SGD to explore more. The main differences are that we apply noise injection to CNNs and impose the noise on the loss layer instead of on preceding layers. Moreover, unlike adding noise on the loss layer as in DisturbLabel [51], a method that seems odd in disturbing labels but indeed improves model performance, our work has a clear objective, the delay of early softmax saturation, achieved by explicitly injecting noise into the softmax activation.

Another way of injecting noise is to randomly transform the input data, which is commonly referred to as data augmentation, such as randomly cropping, flipping [22, 50], rotating [24, 23] and jittering the input data [34, 35]. Our work can also be interpreted as a form of data augmentation, which will be covered in the discussion section below.

3. Early Individual Saturation

In this section, we give a toy example to describe the early individual saturation of softmax, which is usually overlooked, and analyse its impact on generalization. Define the i-th input data x_i with the corresponding label y_i, y_i ∈ [1 · · · C]. Then, processing training images with a standard DCNN, we can obtain the cross-entropy loss and partial derivative as follows:

$$L = -\frac{1}{N}\sum_i \log P(y_i|x_i) = -\frac{1}{N}\sum_i \log \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \quad (1)$$

$$\frac{\partial L}{\partial f_j} = P(y_i = j|x_i) - 1\{y_i = j\} = \frac{e^{f_j}}{\sum_k e^{f_k}} - 1\{y_i = j\} \quad (2)$$

where f_j refers to the j-th element of the softmax input vector f, j ∈ [1 · · · C], and N is the number of training images. 1{condition} = 1 if the condition is satisfied and 1{condition} = 0 otherwise.

To simplify our analysis, we consider the problem of binary classification3, where y_i ∈ [1, 2]. Under the binary scenario, we plot the softmax activation for class 1 in Figure 2. Intuitively, the softmax activation behaves just like a sigmoid function. The standard softmax encourages f_1 > f_2 in order to classify class 1 correctly, and the classification can be regarded as essentially solved when its output P(y_i = 1|x_i) = 1/(1 + e^{−(f_1−f_2)}) is very close to 1. In this case, the softmax output of data x_i is saturated, and we define this as individual saturation. Of course, making the softmax output close to 1 is the ultimate goal of CNN training. However, we would like to achieve it at the end of SGD exploration, not in the beginning or middle stage, because, when optimizing a CNN with gradient-based methods such as SGD, a prematurely saturated individual stops contributing gradients to back-propagation early on, since its gradients become negligible, i.e. P(y_i = 1|x_i) ≈ 1 implies ∂L/∂f_{y_i} ≈ 0 (see Eq. 2). As the number of saturated individuals rises, the amount of contributing data decreases; SGD has few chances to move around and is more likely to converge at a poor local minimum, so the model easily over-fits and requires extra data to recover. In short, the early saturated instances introduce short-lived gradient propagation, which is not enough to help the system converge at a 'global minimum' (i.e. a better local minimum), so early individual saturation is undesired.

3Multi-class classification complicates our analysis but has the same mechanism as the binary scenario.

Figure 2. The softmax activation function 1/(1 + e^{−(f_1−f_2)}). The x-axis represents the difference between f_1 and f_2; the y-axis is the posterior probability.
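To make the saturation effect concrete, the following minimal NumPy sketch (our illustration, not code from the paper) evaluates Eq. 2 on the binary toy example: as the margin f_1 − f_2 grows, the posterior saturates toward 1 and the gradient the instance contributes to back-propagation vanishes.

```python
import numpy as np

def softmax_xent_grad(f, y):
    """Per-example gradient of the cross-entropy loss w.r.t. the
    softmax input f (Eq. 2): p_j - 1{y = j}."""
    p = np.exp(f - f.max())  # numerically stable softmax
    p /= p.sum()
    grad = p.copy()
    grad[y] -= 1.0
    return p, grad

# As f1 - f2 grows, P(y=1|x) -> 1 and the gradient for this instance
# becomes negligible, i.e. the individual saturates.
for margin in [0.0, 2.0, 5.0, 10.0]:
    p, g = softmax_xent_grad(np.array([margin, 0.0]), y=0)
    print(f"f1-f2 = {margin:4.1f}  P(y=1|x) = {p[0]:.5f}  dL/df_1 = {g[0]:.5f}")
```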

4. Noisy Softmax

Based on the analysis in Section 3, the short-lived gradient propagation caused by early individual saturation does not guide robust learning. Thus the intuitive solution is to set up a 'barrier' along the way to saturation, so as to postpone the early saturation behavior and produce rich and continuous gradient propagation. Particularly, for a training data point (x_i, y_i), a simple way to achieve this is to artificially reduce its softmax input f_{y_i} (note that it is theoretically equivalent to enlarge f_j, ∀j ≠ y_i, but that is much more complex to operate). Moreover, many research works point out that adding noise gives the system chances to find a 'global minimum', such as injecting noise into sigmoid [10]. We follow this inspiring idea to address the problem of early individual saturation. Therefore, our technique for slowing down the early saturation is to inject appropriate noise into the softmax input f_{y_i}, and the resulting noise-associated input is as follows:

$$f^{noise}_{y_i} = f_{y_i} - n \quad (3)$$

where n = µ + σξ, ξ ∼ N(0, 1), and µ and σ are used to generate a wider family of noise from ξ. Intuitively, we would prefer f^{noise}_{y_i} to be less than f_{y_i} (because f^{noise}_{y_i} > f_{y_i} would speed up saturation). Thus, we simply require the noise n to always be positive, and we have the following form:

$$f^{noise}_{y_i} = f_{y_i} - \sigma|\xi| \quad (4)$$

where ξ has zero mean and unit variance, so the scale of the noise n = σ|ξ| is controlled by the parameter σ.

Moreover, we would like to make our noise annealed by controlling the parameter σ. Recalling our initial motivation, we intend to postpone the early saturation of x_i rather than prevent its saturation entirely, implying that an initially larger noise is required to boost the exploration ability, while a relatively smaller noise is required later for model convergence.

In the standard softmax layer (Figure 1), f_{y_i} is also the output of the fully connected component and can be written as f_{y_i} = W^T_{y_i}X_i + b_{y_i}, where W_{y_i} is the y_i-th column of W, X_i is the input feature of this layer from training data x_i, and b_{y_i} is the bias. Since b_{y_i} is a constant and f_{y_i} mostly depends on W^T_{y_i}X_i, we construct our annealed noise by making σ related to W^T_{y_i}X_i. Considering the fact that W^T_{y_i}X_i = ‖W_{y_i}‖‖X_i‖ cos θ_{y_i} (as shown in [31]), where θ_{y_i} is the angle between the vectors W_{y_i} and X_i, σ should be a joint function of ‖W_{y_i}‖‖X_i‖ and θ_{y_i}, which hold the amplitude and angular information respectively. The parameter W_{y_i} followed by a loss function can be regarded as a linear classifier for class y_i, and this linear classifier uses cosine similarity to form an angular decision boundary. As a result, as the system converges, the angle θ_{y_i} between W_{y_i} and X_i will gradually decrease. Therefore, our annealed-noise-associated softmax input is formulated as:

$$f^{noise}_{y_i} = f_{y_i} - \alpha\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i})|\xi| \quad (5)$$

where α‖W_{y_i}‖‖X_i‖(1 − cos θ_{y_i}) = σ, and the hyper-parameter α is used to adjust the scale of the noise. In our annealed noise, we leverage ‖W_{y_i}‖‖X_i‖ to make the magnitudes of the noise and of f_{y_i} comparable, and use (1 − cos θ_{y_i}) to adaptively anneal the noise. Notably, our early desaturation work makes softmax saturate later, rather than preventing saturation altogether. We experimented with various functional forms of σ and empirically found that this surprisingly simple formulation performs best. Putting Eq. 5 into the original softmax, the Noisy Softmax loss is defined as:

$$L = -\frac{1}{N}\sum_i \log \frac{e^{f_{y_i} - \alpha\|W_{y_i}\|\|X_i\|(1-\cos\theta_{y_i})|\xi|}}{\sum_{j\neq y_i} e^{f_j} + e^{f_{y_i} - \alpha\|W_{y_i}\|\|X_i\|(1-\cos\theta_{y_i})|\xi|}} \quad (6)$$
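For concreteness, here is a minimal NumPy sketch (ours, not the authors' Caffe implementation) of the Eq. 5 forward computation; it assumes the bias b_{y_i} is omitted, so that f_{y_i} = W^T_{y_i}X_i exactly.

```python
import numpy as np

def noisy_softmax_logits(W, X_i, y_i, alpha, rng):
    """Forward pass of Eq. 5 (sketch): only the target logit f_{y_i} is
    perturbed; all other logits are those of plain softmax.
    W: (D, C) weight matrix, X_i: (D,) feature, y_i: ground-truth class."""
    f = W.T @ X_i                                    # logits, bias omitted
    w_norm = np.linalg.norm(W[:, y_i])
    x_norm = np.linalg.norm(X_i)
    cos_theta = f[y_i] / (w_norm * x_norm + 1e-12)   # cos of angle(W_{y_i}, X_i)
    xi_abs = abs(rng.standard_normal())              # |xi|, xi ~ N(0, 1)
    f[y_i] -= alpha * w_norm * x_norm * (1.0 - cos_theta) * xi_abs
    return f, xi_abs                                 # keep |xi| for the backward pass
```

Note that the noise term vanishes as θ_{y_i} → 0, which is exactly the annealing behavior described above.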

Optimization. We use Eq. 6 throughout our experiments and optimize our models with the commonly used SGD, so we need to compute the forward and backward propagation, with cos θ_{y_i} replaced by W^T_{y_i}X_i / (‖W_{y_i}‖‖X_i‖). For both forward and backward propagation, the only difference between the Noisy Softmax loss and the standard softmax loss lies in f_{y_i}. For example, in forward propagation, f_j for all j ≠ y_i is computed the same as in the original softmax, while f_{y_i} is replaced with f^{noise}_{y_i}. In backward propagation,

$$\frac{\partial L}{\partial X_i} = \sum_j \frac{\partial L}{\partial f_j}\frac{\partial f_j}{\partial X_i}, \qquad \frac{\partial L}{\partial W_{y_i}} = \sum_j \frac{\partial L}{\partial f_j}\frac{\partial f_j}{\partial W_{y_i}},$$

and only for j = y_i do the computations of ∂f_j/∂X_i and ∂f_j/∂W_{y_i} differ from the original softmax; they are listed as follows:

$$\frac{\partial f^{noise}_{y_i}}{\partial X_i} = W_{y_i} - \alpha|\xi|\left(\frac{\|W_{y_i}\|}{\|X_i\|}X_i - W_{y_i}\right) \quad (7)$$

$$\frac{\partial f^{noise}_{y_i}}{\partial W_{y_i}} = X_i - \alpha|\xi|\left(\frac{\|X_i\|}{\|W_{y_i}\|}W_{y_i} - X_i\right) \quad (8)$$

For simplicity, we leave out ∂L/∂f_j and ∂f_j/∂X_i, ∂f_j/∂W_{y_i} (∀j ≠ y_i), since they are the same for both Noisy Softmax and the original softmax. In short, except for the case j = y_i, the overall computation of Noisy Softmax is similar to that of the original softmax.

Figure 3. Saturation status (average prediction) vs. iteration (×10³) with different formulations of noise. Normal and Neg represent normal noise and negative noise respectively. α² is set to 0.1 in our experiments.

Figure 4. CIFAR100 testing error rate vs. iteration (×100) with different formulations of noise. Normal and Neg represent normal noise and negative noise respectively. α² is set to 0.1 in our experiments.
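Continuing the sketch above (ours, with the same caveats), the two extra Jacobians of Eqs. 7 and 8 reuse the |ξ| sample drawn in the forward pass; all other partial derivatives are those of plain softmax:

```python
def noisy_softmax_target_grads(W, X_i, y_i, alpha, xi_abs):
    """Jacobians of f_noise_{y_i} w.r.t. X_i (Eq. 7) and W_{y_i} (Eq. 8),
    where xi_abs is the |xi| sampled in the forward pass."""
    w = W[:, y_i]
    w_norm = np.linalg.norm(w)
    x_norm = np.linalg.norm(X_i)
    d_Xi = w - alpha * xi_abs * (X_i * w_norm / x_norm - w)      # Eq. 7
    d_Wyi = X_i - alpha * xi_abs * (w * x_norm / w_norm - X_i)   # Eq. 8
    return d_Xi, d_Wyi
```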

5. Discussion

5.1. The Effect of Noise Scale α

In Noisy Softmax, the scale of the annealed noise is largely determined by the hyper-parameter α. One can see that when α = 0, in the zero-noise limit, Noisy Softmax is the same as ordinary softmax. Then individual saturation will turn up, the SGD solver has a high chance of converging at a poor local minimum, and without extra training data the model will easily over-fit. However, when α is too large, large gradients are obtained, since back-propagating through f^{noise}_{y_i} gives rise to large derivatives; the algorithm then just sees the noise instead of the real signal and moves around blindly. Hence, a relatively small α is required to aid the generalization of the model.

We evaluate the performance of Noisy Softmax with different α on several datasets. Note that the value of α is not carefully tuned, and we use α = 0 (i.e. softmax) as our baseline. These comparison results are listed in Tables 3, 4 and 5. One can observe that Noisy Softmax with a relatively appropriate α (e.g. α² = 0.1) obtains better recognition accuracy than ordinary softmax on all of the datasets. For intuition, we summarize the results on CIFAR100 in Figure 6. When α² = 0.1, our method outperforms the original softmax. This demonstrates that Noisy Softmax does improve the generalization ability of CNNs by encouraging the SGD solver to be more exploratory and to converge at a 'global minimum'. When α rises to 1, the large noise causes the network to converge more slowly and also produces worse performance than the baseline, since the large noise drowns the helpful signal and the solver just sees the noise.

5.2. Saturation Study

To illuminate the significance of early softmax desaturation based on non-negative noise injection, we investigate the impact of different formulations of noise on individual saturation, such as normal noise n = σξ and negative noise n = −σ|ξ| (σ is the same as in Eq. 5). From these formulations, we can expect that there will be more saturated instances when training with negative noise (which serves as a counterexample). To intuitively analyse the saturation state, we compute the average probability prediction over the entire training set as follows:

$$P = \frac{1}{N}\sum_{j=1}^{C}\sum_{i=1}^{N_j} P(y_i|x_i) \quad (9)$$

where C is the number of classes and N_j is the number of images within the j-th class. Figures 3 and 4 show the saturation status under different noise formulations and the testing error rates on CIFAR100, respectively.
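Eq. 9 is simply the mean ground-truth posterior over all training images; a one-function NumPy sketch (ours), assuming probs holds the softmax outputs one row per image:

```python
import numpy as np

def average_prediction(probs, labels):
    """Eq. 9: mean posterior probability of the ground-truth class over
    the whole training set, used as a proxy for saturation.
    probs: (N, C) softmax outputs, labels: (N,) class indices."""
    return probs[np.arange(len(labels)), labels].mean()
```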

From the results in Figure 3, one can observe that when training with the original softmax or with negative noise, the average prediction rises quickly to a relatively high level, almost 0.9, implying that early individual saturation is serious, and finally goes up to nearly 1. Moreover, the average prediction with negative noise is higher than that of softmax, implying that the early individual saturation is worsened, since many instances are artificially mapped to saturated ones. From the results in Figure 4, one can observe that the testing error of negative noise drops slowly and finally settles at a relatively high level, nearly 37%, verifying that the exploration of SGD is seriously impeded by early individual saturation. In the normal-noise case, the testing error and the trend of the average-prediction rise are similar to those of the original softmax, shown in Figures 4 and 3 respectively, since the expectation E(n) is close to zero.

In contrast, when training with Noisy Softmax, the average prediction rises slowly and is much lower than that of the original softmax at the early training stage, shown in Figure 3, verifying that the early individual saturation behavior is significantly avoided. From the results in Figure 4, one can observe that Noisy Softmax outperforms the baseline and significantly improves the performance to a 28.48% testing error rate. Note that after 3,000 iterations our method achieves a better testing error but a lower average prediction, demonstrating that the early desaturation of softmax gives the SGD solver chances to traverse more portions of parameter space in search of an optimal solution. As the noise level decreases, the solver will prefer a better local minimum where the signal gives a strong response to SGD; it will then spend more time exploring this region and converge there, which can be regarded as a 'global minimum', in a finite number of steps.

In summary, injecting non-negative noise n = σ|ξ| into softmax does prevent early individual saturation and further improves the generalization ability of CNNs when optimized by standard SGD.

Figure 5. CIFAR100 training error vs. iteration (×100) with different α.

Figure 6. CIFAR100 testing error vs. iteration (×100) with different α.

5.3. Annealed Noise Study

When addressing the early individual saturation, the key idea is to add annealed noise. In order to highlight the superiority of our annealed noise described in Sec. 4, we compare it to free noise n = α|ξ| and amplitude noise n = α‖W_{y_i}‖‖X_i‖|ξ|. We evaluate them on CIFAR100, and the results are listed in Table 1.

α²   | Noisy Softmax | free  | amplitude
0    | 31.77         | 31.77 | 31.77
0.05 | 29.99         | 31.43 | 30.96
0.1  | 28.48         | 31.04 | 29.97
0.5  | 30.22         | 30.88 | fail
1    | 35.23         | 31.20 | fail

Table 1. Testing error rates (%) vs. different noise on CIFAR100.

From the results, it can be observed that our Noisy Softmax outperforms the other two formulations of noise. In the free-noise case, where σ (described in Sec. 4) is set to a fixed value α and the noise is totally independent, adding this noise is still a desaturation operation, but the accuracy gain over the baseline is small, since it desaturates softmax without regard to the magnitude of the softmax input; in other words, the remedy does not suit the case. In the amplitude-noise case, where σ is set to α‖W_{y_i}‖‖X_i‖, the subtractive noise is more prudent because it considers the level of the softmax input, thus yielding a better accuracy gain than free noise, while still being worse than Noisy Softmax. In both the Noisy Softmax and amplitude-noise cases, as the exploration proceeds, the 'globally better' region of parameter space has already been seen by SGD and it is time to explore this region patiently; in other words, smaller noise is required. Noisy Softmax realizes this idea by annealing the noise, whereas in the amplitude-noise case the level-unchanging noise is somewhat too large at this stage and further causes difficulty in detailed learning. Reviewing the formulation of our annealed noise, one can observe that it is constructed by combining θ_{y_i}, a time identifier, into the amplitude noise. With the time function 1 − cos θ_{y_i} injected, the noise is adaptively decreased.
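The three subtractive-noise variants compared in Table 1 differ only in how σ is chosen; a small sketch (ours) makes the distinction explicit:

```python
def noise_sigma(kind, alpha, w_norm, x_norm, cos_theta):
    """Scale sigma of the subtractive noise n = sigma * |xi| for the
    three variants compared in Sec. 5.3."""
    if kind == "free":        # fixed scale, independent of the signal
        return alpha
    if kind == "amplitude":   # tracks the logit magnitude only
        return alpha * w_norm * x_norm
    if kind == "annealed":    # Eq. 5: additionally decays as theta_{y_i} -> 0
        return alpha * w_norm * x_norm * (1.0 - cos_theta)
    raise ValueError(kind)
```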

5.4. Regularization Ability

We find experimentally that Noisy Softmax can regularize the CNN model by preventing over-fitting. Figures 5 and 6 show the recognition error results on the CIFAR100 dataset with different α. One can observe that without noise injection (i.e. α = 0), the training recognition error drops quickly to a very low level, almost 0%, while the testing recognition error stops falling at a relatively high level, nearly 31.77%. Conversely, when α² is set to an appropriate value such as 0.1, the training error drops more slowly and stays much higher than the baseline, but the testing error reaches a lower level, nearly 28.48%, and still shows a decreasing trend. Even when α² = 0.5, the training error is higher but the testing error is lower as well, nearly 30.22%. This demonstrates that encouraging SGD to converge at a better local minimum indeed prevents over-fitting, and that Noisy Softmax has a strong regularization ability.

Figure 7. Geometric interpretation of the data augmentation.

As analysed above, our Noisy Softmax can be regarded as a kind of regularization technique that prevents over-fitting by making SGD more exploratory. Here we analyse this regularization ability from another perspective, data augmentation, which has a clear physical interpretation. In the original case, the softmax input coming from the data point (x_i, y_i) is f_{y_i} = ‖W_{y_i}‖‖X_i‖ cos θ_{y_i} (we omit the constant b_{y_i} for simplicity). Now consider a new input (x'_i, y_i), where ‖X'_i‖ = ‖X_i‖ and the angle θ'_{y_i} between the vectors W_{y_i} and X'_i is arccos((1 + α|ξ|) cos θ_{y_i} − α|ξ|). Thus we have

$$f'_{y_i} = \|W_{y_i}\|\|X'_i\|\cos\theta'_{y_i} = W^T_{y_i}X_i - \alpha\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i})|\xi| = f^{noise}_{y_i},$$

implying that f^{noise}_{y_i} can be regarded as coming from a new data point (x'_i, y_i). Notably, since θ'_{y_i} > θ_{y_i}, these generated data contain many boundary examples, which are very helpful for discriminative feature learning, as illustrated in Figure 7. In summary, generating the noisy input f^{noise}_{y_i} is equivalent to generating new training data, which is an efficient way of data augmentation.
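The equivalence can be checked numerically. In the sketch below (ours, bias omitted), x' is constructed with the same norm as x at the angle θ' given in the text, and its logit matches f^{noise}_{y_i} exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
w, x = rng.standard_normal(8), rng.standard_normal(8)
alpha, xi_abs = 0.1, abs(rng.standard_normal())
w_n, x_n = np.linalg.norm(w), np.linalg.norm(x)
cos_t = (w @ x) / (w_n * x_n)

# Eq. 5 noisy logit, and the rotated angle theta' from the text.
f_noise = w @ x - alpha * w_n * x_n * (1 - cos_t) * xi_abs
cos_t_new = (1 + alpha * xi_abs) * cos_t - alpha * xi_abs

# Build x' with ||x'|| = ||x|| at angle theta' to w (in the w-x plane).
u = w / w_n
perp = x - (x @ u) * u
perp /= np.linalg.norm(perp)
x_new = x_n * (cos_t_new * u + np.sqrt(1 - cos_t_new**2) * perp)

assert np.isclose(w @ x_new, f_noise)  # f'_{y_i} equals f_noise_{y_i}
```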

To verify the discussion above, we evaluate Noisy Softmax on two subsets of the MNIST dataset, which contain only 600 (1%) and 6,000 (10%) training instances respectively. Our CNN configuration is shown in Table 2. With the same training strategy as in Section 6.1, we achieve 3.82% and 1.30% testing error rates on the original testing set, respectively. Meanwhile, in both cases, the training error rates quickly drop to nearly 0%, which shows that over-fitting turns up. However, when training with Noisy Softmax (α² = 0.5), we obtain 2.46% and 0.93% testing error rates, respectively. This demonstrates that Noisy Softmax improves the generalization ability of CNNs through implicit data augmentation. Judging from the accuracy improvements on these two subsets and on CIFAR100 (which has 500 instances per class), it acts as an especially effective algorithm when the amount of training data is limited.

5.5. Relationship to Other Methods

Multi-task learning: combining several tasks in one system does improve generalization ability [37]. Considering a multi-task learning system with input data x_i, the overall objective loss function is a combination of several sub-objective loss functions, written as L = Σ_j L_j(ϑ_0, ϑ_j, x_i), where ϑ_0 and ϑ_j, j ∈ [1, 2, · · · ] are generic parameters and task-specific parameters respectively. Optimizing with standard SGD, the generic parameters ϑ_0 are updated as ϑ_0 = ϑ_0 − γ(Σ_j ∂L_j/∂ϑ_0), where γ is the learning rate. In Noisy Softmax, from an overall training perspective, our loss function can also be regarded as a combination of many noise-dependent changing losses

$$L_{noise_k} = -\log\frac{e^{f_{y_i}}}{\sum_{c\neq y_i} e^{f_c} + e^{f^{noise}_{y_i}}} + \alpha|\xi|_k\|W_{y_i}\|\|X_i\|(1 - \cos\theta_{y_i}), \quad k \in [1, m],$$

i.e. L = Σ_{k=1}^{m} L_{noise_k}, where m is an uncertain number related to the noise scale and the iteration number. Thus the overall contribution to the system can be regarded as ϑ = ϑ − γ(Σ_{k=1}^{m} ∂L_{noise_k}/∂ϑ). So our method can be regarded as a special case of multi-task learning, where the task-specific parameters are shared across tasks.

However, in a multi-task learning system, artificially designing task-specific losses is prohibitively expensive, so the number of tasks is limited and small. In the Noisy Softmax training procedure, by contrast, the model is constrained by many randomly generated tasks (quantified by L_{noise_k}). Thus, training a model with Noisy Softmax can be regarded as training with a massive number of tasks that would be very costly and often infeasible to design in an original multi-task learning system.

Noise injection: some research works inject noise into earlier layers of neural networks, such as into the neuron activation functions ReLU [4, 1] and sigmoid [10], into the weights [9, 3] and into the gradients [32]. We emphasize that Noisy Softmax adds noise to a single loss layer instead of to many preceding layers, which is more convenient and efficient for implementation and model training, and it is applied in DCNNs. Intrinsically different from DisturbLabel [51], where noise is produced by disturbing labels and also exerts its effect on the loss layer, Noisy Softmax starts from a clear objective, early softmax desaturation, and the noise is adaptively annealed and injected in an explicit manner.

Desaturation: many other desaturation works, such as replacing sigmoid with ReLU [6] and building skip connections between layers [41, 11, 13, 14], solve the problem of gradient vanishing that happens in the bottom layers, while our Noisy Softmax solves the problem of early gradient vanishing in the top layer (i.e. the loss layer), which is caused by early individual saturation. We emphasize that solving the early gradient vanishing in the top layer is crucial to parameter updates and model optimization, since the top layer is the source of gradient propagation. In summary, by postponing the early individual saturation we obtain continuous gradient propagation from the top layer and further encourage SGD to be more exploratory.

6. Experiments and Results

We evaluate our proposed Noisy Softmax algorithm on several benchmark datasets, including MNIST [25], CIFAR10/100 [21], LFW [16], FGLFW [54] and YTF [49]. Note that in all of our experiments we only use a single model for the evaluation of Noisy Softmax, and both softmax and Noisy Softmax in our experiments use the same CNN architectures, shown in Table 2.

Layer           | MNIST (for Sec. 5.4) | MNIST      | CIFAR10/10+ | CIFAR100    | LFW/FGLFW/YTF
Block1          | [3x3,40]x2           | [3x3,64]x3 | [3x3,64]x4  | [3x3,96]x4  | [3x3,64]x1
Pool1           | Max [2x2], stride 2 (all columns)
Block2          | [3x3,60]x1           | [3x3,64]x3 | [3x3,128]x4 | [3x3,192]x4 | [3x3,128]x1
Pool2           | Max [2x2], stride 2 (all columns)
Block3          | [3x3,60]x1           | [3x3,64]x3 | [3x3,256]x4 | [3x3,384]x4 | [3x3,256]x2
Pool3           | Max [2x2], stride 2 (all columns)
Block4          | -                    | -          | -           | -           | [3x3,512]x3, padding 0
Fully Connected | 100                  | 256        | 512         | 512         | 3000

Table 2. CNN architectures for the different benchmark datasets. Blockx denotes a container of several convolution components with the same configuration, e.g. [3x3, 64]x4 denotes 4 cascaded convolution layers with 64 filters of size 3x3.

6.1. Architecture Settings and Implementation

As VGG [39] has become a commonly used CNN architecture, cascaded layers with small filters have gradually taken the place of single layers with large filters, since these cascaded layers have fewer parameters, lower computational complexity and stronger representation ability than a single layer; e.g. a single 5x5 convolution layer is replaced with 2 cascaded 3x3 convolution layers. Inspired by [39, 31], we design the architectures shown in Table 2. In the convolution layers, both the stride and the padding are set to 1 if not specified. In the pooling layers, we use a 2x2 max-pooling filter with stride 2. We adopt the piecewise-linear function PReLU [12] as our neuron activation function, and we use the weight initialization of [12] and batch normalization [18] in our networks. All of our experiments are implemented with the Caffe library [47] with our own modifications. We use standard SGD to optimize our CNNs, and the batch sizes are 256 and 200 for the object experiments and the face experiments, respectively. For data preprocessing, we only perform mean subtraction.
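As a readability aid only (the paper's models are implemented in Caffe; this PyTorch-style transcription is our assumption), the small MNIST column of Table 2 could be expressed as:

```python
import torch.nn as nn

def conv_block(c_in, c_out, n):
    """n cascaded 3x3 convolutions (stride 1, padding 1), each followed by
    batch normalization and PReLU, as described in Sec. 6.1."""
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                   nn.BatchNorm2d(c_out),
                   nn.PReLU()]
    return layers

# MNIST column of Table 2 used in Sec. 5.4: [3x3,40]x2 / [3x3,60]x1 / [3x3,60]x1.
mnist_net = nn.Sequential(
    *conv_block(1, 40, 2),  nn.MaxPool2d(2, 2),  # 28x28 -> 14x14
    *conv_block(40, 60, 1), nn.MaxPool2d(2, 2),  # 14x14 -> 7x7
    *conv_block(60, 60, 1), nn.MaxPool2d(2, 2),  # 7x7 -> 3x3
    nn.Flatten(),
    nn.Linear(60 * 3 * 3, 100),                  # fully connected, 100-d
)
```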

Training. In the object recognition tasks, the initial learning rate is 0.1 and is divided by 10 at 12k iterations; the total number of iterations is 16k. Note that although we train our CNNs with a coarsely adjusted learning rate, the results of all experiments are impressive and consistent, verifying the effectiveness of our method. For the face recognition tasks, we start with a learning rate of 0.01 and divide it by 5 whenever the training loss stops dropping.

Testing. We use the original softmax to classify the testing data in the object datasets. On the face datasets, we evaluate with the cosine distance rule after PCA reduction for face recognition.
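A sketch of this test-time scoring (ours; the PCA mean and basis are assumed to be estimated from training features beforehand):

```python
import numpy as np

def verification_score(feat_a, feat_b, pca_mean, pca_basis):
    """Project two face features with PCA, then score the pair with cosine
    similarity, as in the verification protocol above. pca_basis: (D, d)
    matrix whose columns are the leading principal components."""
    a = pca_basis.T @ (feat_a - pca_mean)
    b = pca_basis.T @ (feat_b - pca_mean)
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
```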

6.2. Evaluation on MNIST Dataset

MNIST [25] contains 60,000 training samples and 10,000 testing samples, uniformly distributed over 10 classes; all samples are 28x28 gray-scale images. Our CNN architecture is shown in Table 2 and we use a weight decay of 0.001. The results of the state-of-the-art methods and of our proposed Noisy Softmax with different α are listed in Table 3.

From the results, our Noisy Softmax (α² = 0.1) not only outperforms the original softmax over the same architecture, but also achieves competitive performance compared to the state-of-the-art methods. It can also be observed that Noisy Softmax produces consistent accuracy gains with coarsely adjusted α² values such as 0.05, 0.1 and 0.5, and our method achieves the same accuracy as DisturbLabel [51], which adds dropout to several layers, demonstrating the effectiveness of our technique.

6.3. Evaluation on CIFAR Datasets

CIFAR [21] has two evaluation protocols, over 10 and 100 classes respectively. CIFAR10/100 has 50,000 training samples and 10,000 testing samples; all samples are 32x32 RGB images, uniformly distributed over the 10 or 100 classes. We use different CNN architectures in the CIFAR10 and CIFAR100 experiments; these network configurations are shown in Table 2.

We evaluate our method on CIFAR10 and CIFAR100, and the results are shown in Table 4. For data augmentation, we perform a simple method: randomly cropping a 30x30 image. From our experimental results, one can observe that Noisy Softmax (α² = 0.1) outperforms all of the other methods on these two datasets, improving accuracy over the baseline by nearly 1% on CIFAR10 and by more than 3% on CIFAR100.

6.4. Evaluation on Face Datasets

LFW [16] contains 13,233 images of 5,749 celebrities. Under unrestricted conditions, it provides 6,000 face pairs for the verification protocol and, as adopted in [2], closed-set and open-set settings for the identification protocol.

FGLFW [54] is a derivative of LFW, meaning that the images all come from LFW but the face pairs are difficult to classify as belonging to the same person or not. It is a light and appealing dataset for performance evaluation, due to its simple verification protocol but challenging face pairs.

YTF [49] provides 5,000 video pairs for face verification. We use the average representation of 100 randomly selected samples from each video for evaluation.

Method                     | MNIST
CNN [19]                   | 0.53
NiN [29]                   | 0.47
Maxout [7]                 | 0.45
DSN [27]                   | 0.39
R-CNN [28]                 | 0.31
GenPool [26]               | 0.31
DisturbLabel [51]          | 0.33
Softmax                    | 0.43
Noisy Softmax (α² = 1)     | 0.42
Noisy Softmax (α² = 0.5)   | 0.33
Noisy Softmax (α² = 0.1)   | 0.33
Noisy Softmax (α² = 0.05)  | 0.37

Table 3. Recognition error rates (%) on MNIST.

Method                     | CIFAR10 | CIFAR10+ | CIFAR100
NiN [29]                   | 10.47   | 8.81     | 35.68
Maxout [7]                 | 11.68   | 9.38     | 38.57
DSN [27]                   | 9.69    | 7.97     | 34.57
All-CNN [40]               | 9.08    | 7.25     | 33.71
R-CNN [28]                 | 8.69    | 7.09     | 31.75
ResNet [13]                | N/A     | 6.43     | N/A
DisturbLabel [51]          | 9.45    | 6.98     | 32.99
Softmax                    | 8.11    | 6.98     | 31.77
Noisy Softmax (α² = 1)     | 9.09    | 8.77     | 35.23
Noisy Softmax (α² = 0.5)   | 7.84    | 7.13     | 30.22
Noisy Softmax (α² = 0.1)   | 7.39    | 6.36     | 28.48
Noisy Softmax (α² = 0.05)  | 7.58    | 6.61     | 29.99

Table 4. Recognition error rates (%) on CIFAR datasets. + denotes data augmentation.

Method                     | Images   | Models | LFW   | Rank-1 | DIR@FAR=1% | FGLFW | YTF
FaceNet [36]               | 200M*    | 1      | 99.65 | -      | -          | -     | 95.18
DeepID2+ [44]              | 300k*    | 1      | 98.7  | -      | -          | -     | 91.90
DeepID2+ [44]              | 300k*    | 25     | 99.47 | 95.00  | 80.70      | -     | 93.20
Sparse [45]                | 300k*    | 1      | 99.30 | -      | -          | -     | 92.70
VGG [33]                   | 2.6M     | 1      | 97.27 | 74.10  | 52.01      | 88.13 | 92.80
WebFace [52]               | WebFace  | 1      | 97.73 | -      | -          | -     | 90.60
Robust FR [5]              | WebFace  | 1      | 98.43 | -      | -          | -     | -
Lightened CNN [50]         | WebFace  | 1      | 98.13 | 89.21  | 69.46      | 91.22 | 91.60
Softmax                    | WebFace+ | 1      | 98.83 | 91.68  | 69.51      | 92.95 | 94.22
Noisy Softmax (α² = 0.1)   | WebFace+ | 1      | 99.18 | 92.68  | 78.43      | 94.50 | 94.88
Noisy Softmax (α² = 0.05)  | WebFace+ | 1      | 99.02 | 92.24  | 75.67      | 94.02 | 94.51

Table 5. Recognition accuracies (%) on the LFW, FGLFW and YTF datasets. * denotes that the images are not publicly available and + denotes data expansion. On LFW, closed-set and open-set accuracies are evaluated by Rank-1 and DIR@FAR=1% respectively.

For data preprocessing, we align and crop the images based on the eye and mouth centers, yielding 104x96 RGB images. Our CNN configuration is shown in Table 2; here we add an element-wise maxout layer [50] after the 3,000-dimensional fully connected layer, yielding a 1,500-dimensional output, and a contrastive loss is applied on this output as in DeepID2 [43]. We then train a single CNN model with outside data from the publicly available CASIA-WebFace dataset [52] and our own collected data (about 400k images of 14k identities). We extract the features for each image and its horizontally flipped copy, then compute the mean feature vector as the representation. From the results shown in Table 5, one can observe that Noisy Softmax (α² = 0.1) improves the performance over the baseline, and the result is also comparable to the current state-of-the-art methods that use private data and even model ensembles. In addition, we further improve our results to 99.31%, 94.43%, 82.50%, 94.88% and 95.37% (listed in the same protocol order as in Table 5) with a two-model ensemble.

7. Conclusion

In this paper, we propose Noisy Softmax to address early individual saturation by injecting annealed noise into the softmax input. It is a way of achieving early softmax desaturation by postponing the early individual saturation. We show that our method can easily be used as a drop-in replacement for standard softmax and is easy to optimize. It significantly improves the performance of CNN models, since the early desaturation operation indeed exerts a considerable effect on the parameter updates during back-propagation and furthermore improves the generalization ability of DCNNs. Empirical studies verify the superiority of softmax desaturation. Meanwhile, our method achieves state-of-the-art or competitive results on several datasets.

8. Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Projects 61573068, 61471048, 61375031 and 61532006), the Beijing Nova Program under Grant No. Z161100004916088, the Fundamental Research Funds for the Central Universities under Grant No. 2014ZD03-01, and the Program for New Century Excellent Talents in University (NCET-13-0683).

References

[1] Y. Bengio. Estimating or propagating gradients through stochastic neurons. Computer Science, 2013.
[2] L. Best-Rowden, H. Han, C. Otto, B. F. Klare, and A. K. Jain. Unconstrained face recognition: Identifying a person of interest from a media collection. IEEE Transactions on Information Forensics and Security, 9(12):2144–2157, 2014.
[3] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. Computer Science, 2015.
[4] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, pages 807–814, 2010.
[5] C. Ding and D. Tao. Robust face recognition via multimodal deep face representation for multimedia applications. IEEE Transactions on Multimedia, 17(11):2049–2058, 2015.
[6] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. Journal of Machine Learning Research, 15, 2010.
[7] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. Computer Science, pages 1319–1327, 2013.
[8] B. Graham. Fractional max-pooling. Eprint Arxiv, 2014.
[9] A. Graves. Practical variational inference for neural networks. Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
[10] C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio. Noisy activation functions. 2016.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Computer Science, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. pages 1026–1034, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. 2016.
[14] G. Huang, Z. Liu, and K. Weinberger. Densely connected convolutional networks. 2016.
[15] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep networks with stochastic depth. 2016.
[16] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. 2008.
[17] S. Huang, Z. Xu, D. Tao, and Y. Zhang. Part-stacked CNN for fine-grained visual categorization. Computer Science, 2015.
[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Computer Science, 2015.
[19] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? pages 2146–2153, 2009.
[20] J. Krause, H. Jin, J. Yang, and F. F. Li. Fine-grained recognition without part annotations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5546–5555, 2015.
[21] A. Krizhevsky. Learning multiple layers of features from tiny images. 2012.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
[23] D. Laptev and J. M. Buhmann. Transformation-invariant convolutional jungles. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3043–3051, 2015.
[24] D. Laptev, N. Savinov, J. M. Buhmann, and M. Pollefeys. TI-pooling: transformation-invariant pooling for feature learning in convolutional neural networks. 2016.
[25] Y. LeCun and C. Cortes. The MNIST database of handwritten digits.
[26] C. Y. Lee, P. W. Gallagher, and Z. Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. Computer Science, 2015.
[27] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, volume 2, page 6, 2015.
[28] M. Liang and X. Hu. Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3367–3375, 2015.
[29] M. Lin, Q. Chen, and S. Yan. Network in network. Computer Science, 2013.
[30] T. Y. Lin, A. Roychowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In IEEE International Conference on Computer Vision, pages 1449–1457, 2015.
[31] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning, pages 507–516, 2016.
[32] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens. Adding gradient noise improves learning for very deep networks. Computer Science, 2015.
[33] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
[34] R. Reed, R. J. Marks, and S. Oh. An equivalence between sigmoidal gain scaling and training with noisy (jittered) input data. In Neuroinformatics and Neurocomputers, RNNS/IEEE Symposium on, pages 120–127, 1992.
[35] R. Reed, S. Oh, and R. J. Marks. Regularization using jittered training data. In International Joint Conference on Neural Networks, pages 147–152 vol. 3, 1992.
[36] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. pages 815–823, 2015.
[37] M. L. Seltzer and J. Droppo. Multi-task learning in deep neural networks for improved phoneme recognition. pages 6965–6969, 2013.
[38] W. Shang, K. Sohn, D. Almeida, and H. Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. 2016.
[39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. Computer Science, 2015.
[40] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. Eprint Arxiv, 2014.
[41] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. Computer Science, 2015.
[42] M. Steijvers and P. Grunwald. A recurrent network that performs a context-sensitive prediction task. In Conference of the Cognitive Science Society, 2000.
[43] Y. Sun, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. Advances in Neural Information Processing Systems, 27:1988–1996, 2014.
[44] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. Computer Science, pages 2892–2900, 2014.
[45] Y. Sun, X. Wang, and X. Tang. Sparsifying neural network connections for face recognition. Computer Science, 2015.
[46] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. pages 1–9, 2015.
[47] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. pages 675–678, 2014.
[48] X. S. Wei, C. W. Xie, and J. Wu. Mask-CNN: Localizing parts and selecting descriptors for fine-grained image recognition. 2016.
[49] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. 42(7):529–534, 2011.
[50] X. Wu, R. He, and Z. Sun. A lightened CNN for deep face representation. Computer Science, 2015.
[51] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. DisturbLabel: Regularizing CNN on the loss layer. 2016.
[52] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. Computer Science, 2014.
[53] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. Computer Science, 2013.
[54] N. Zhang and W. Deng. Fine-grained LFW database. In 2016 International Conference on Biometrics (ICB), 2016.