
Deep Learners Benefit More from Out-of-Distribution Examples.


Abstract

Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because they can be shared across tasks and examples from different but related distributions, can yield even more benefits. Comparative experiments were performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits), using both a multi-task setting and perturbed examples in order to obtain out-of-distribution examples. The results agree with the hypothesis, and show that a deep learner did beat previously published results and reached human-level performance.
Yoshua Bengio and Frédéric Bastien and Arnaud Bergeron and Nicolas Boulanger-Lewandowski and
Thomas Breuel and Youssouf Chherawala and Moustapha Cisse and Myriam Côté and Dumitru Erhan
and Jeremy Eustache and Xavier Glorot and Xavier Muller and Sylvain Pannetier Lebeuf
and Razvan Pascanu and Salah Rifai and Francois Savard and Guillaume Sicard
Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada
Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.
1 Introduction
Deep Learning has emerged as a promising new area
of research in statistical machine learning [Hinton et al.,
2006, Ranzato et al., 2007, Bengio et al., 2007, Vincent
et al., 2008, Ranzato et al., 2008, Taylor and Hinton, 2009,
Larochelle et al., 2009, Salakhutdinov and Hinton, 2009,
Lee et al., 2009a,b, Jarrett et al., 2009, Taylor et al., 2010].
See Bengio [2009] for a review. Learning algorithms for
deep architectures are centered on the learning of useful
representations of data, which are better suited to the
task at hand, and are organized in a hierarchy with multiple
levels. This is in part inspired by observations of
the mammalian visual cortex, which consists of a chain
of processing elements, each of which is associated with a
different representation of the raw visual input. In fact, it
was found recently that the features learnt in deep archi-
tectures resemble those observed in the first two of these
stages (in areas V1 and V2 of visual cortex) [Lee et al.,
2008], and that they become more and more invariant to
factors of variation (such as camera movement) in higher
layers [Goodfellow et al., 2009]. It has been hypothesized
that learning a hierarchy of features increases the ease and
practicality of developing representations that are at once
tailored to specific tasks, yet are able to borrow statistical
strength from other related tasks (e.g., modeling different
kinds of objects). Finally, learning the feature representa-
tion can lead to higher-level (more abstract, more general)
features that are more robust to unanticipated sources of
variance extant in real data.
Whereas a deep architecture can in principle be more
powerful than a shallow one in terms of representation,
depth appears to render the training problem more diffi-
cult in terms of optimization and local minima. It is also
only recently that successful algorithms were proposed to
overcome some of these difficulties. All are based on unsupervised
learning, often in a greedy layer-wise “unsupervised
pre-training” stage [Bengio, 2009]. The principle
is that each layer starting from the bottom is trained
to represent its input (the output of the previous layer).
After this unsupervised initialization, the stack of lay-
ers can be converted into a deep supervised feedforward
neural network and fine-tuned by stochastic gradient de-
scent. One of these layer initialization techniques, ap-
plied here, is the Denoising Auto-encoder (DA) [Vincent
et al., 2008] (see Figure 2), which performed similarly or
better [Vincent et al., 2008] than previously proposed Re-
stricted Boltzmann Machines (RBM) [Hinton et al., 2006]
in terms of unsupervised extraction of a hierarchy of fea-
tures useful for classification. Each layer is trained to
denoise its input, creating a layer of features that can be
Bengio et. al.
used as input for the next layer, forming a Stacked Denois-
ing Auto-encoder (SDA). Note that training a Denoising
Auto-encoder can actually be seen as training a particular
RBM by an inductive principle different from maximum
likelihood [Vincent, 2010], namely by Score Matching
[Hyvärinen, 2005, Hyvärinen, 2008].
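The greedy layer-wise procedure described above can be sketched as a short loop. This is only an illustration under stated assumptions: `greedy_layerwise_pretrain` and the `train_layer` callable are hypothetical names, and `train_layer` stands in for whatever unsupervised learner (for example a denoising auto-encoder) is fitted at each level.

```python
def greedy_layerwise_pretrain(inputs, layer_sizes, train_layer):
    """Train one layer at a time, feeding each layer the previous layer's output.

    `train_layer(inputs, n_hidden)` is a hypothetical callable that fits one
    unsupervised layer and returns `(params, encode)`, where `encode` maps a
    single input vector to that layer's hidden representation.
    """
    stack = []
    for n_hidden in layer_sizes:
        params, encode = train_layer(inputs, n_hidden)
        stack.append(params)
        # The hidden representation becomes the training data of the next layer.
        inputs = [encode(x) for x in inputs]
    return stack
```

The returned list of per-layer parameters would then initialize a deep feedforward network before supervised fine-tuning by stochastic gradient descent.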
Previous comparative experimental results with stacking
of RBMs and DAs to build deep supervised predictors
had shown that they could outperform shallow architec-
tures in a variety of settings, especially when the data
involves complex interactions between many factors of
variation [Larochelle et al., 2007, Bengio, 2009]. Other
experiments have suggested that the unsupervised layer-
wise pre-training acted as a useful prior [Erhan et al.,
2010] that allows one to initialize a deep neural network
in a much smaller region of parameter space,
corresponding to better generalization.
To further the understanding of the reasons for the good
performance observed with deep learners, we focus here on
the following hypothesis: intermediate levels of representa-
tion, especially when there are more such levels, can be ex-
ploited to share statistical strength across different
but related types of examples, such as examples com-
ing from other tasks than the task of interest (the multi-
task setting [Caruana, 1997]), or examples coming from
an overlapping but different distribution (images with dif-
ferent kinds of perturbations and noises, here). This is
consistent with the hypotheses discussed in Bengio [2009]
regarding the potential advantage of deep learning and
the idea that more levels of representation can give rise
to more abstract, more general features of the raw input.
This hypothesis is related to a learning setting called self-
taught learning [Raina et al., 2007], which combines
principles of semi-supervised and multi-task learning: in
addition to the labeled examples from the target distribu-
tion, the learner can exploit examples that are unlabeled
and possibly come from a distribution different from the
target distribution, e.g., from other classes than those of
interest. It has already been shown that deep learners
can clearly take advantage of unsupervised learning and
unlabeled examples [Bengio, 2009, Weston et al., 2008]
in order to improve performance on a supervised task,
but more needed to be done to explore the impact of
out-of-distribution examples and of the multi-task setting
(two exceptions are Collobert and Weston [2008], which
shares and uses unsupervised pre-training only with the
first layer, and Mobahi et al. [2009] in the case of video
data). In particular the relative advantage of deep learn-
ing for these settings has not been evaluated.
The main claim of this paper is that deep learners (with
several levels of representation) can benefit more from
out-of-distribution examples than shallow learners
(with a single level), both in the context of the multi-
task setting and from perturbed examples. Because we
are able to improve on state-of-the-art performance and
reach human-level performance on a large-scale task, we
consider that this paper is also a contribution to advance
the application of machine learning to handwritten char-
acter recognition. More precisely, we ask and answer the
following questions:
Do the good results previously obtained with deep ar-
chitectures on the MNIST digit images generalize to the
setting of a similar but much larger and richer dataset,
the NIST special database 19, with 62 classes and around
800k examples?
To what extent does the perturbation of input images
(e.g. adding noise, affine transformations, background
images) make the resulting classifiers better not only on
similarly perturbed images but also on the original clean
examples? We study this question in the context of the
62-class and 10-class tasks of the NIST special database 19.
Do deep architectures benefit more from such out-of-
distribution examples, in particular do they benefit more
from examples that are perturbed versions of the exam-
ples from the task of interest?
Similarly, does the feature learning step in deep learning
algorithms benefit more from training with moderately
different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architec-
ture? We train on 62 classes and test on 10 (digits) or 26
(upper case or lower case) to answer this question.
Our experimental results provide positive evidence to-
wards all of these questions, as well as classifiers that
reach human-level performance on 62-class iso-
lated character recognition and beat previously
published results on the NIST dataset (special
database 19). To achieve these results, we introduce
in the next section a sophisticated system for stochas-
tically transforming character images and then explain
the methodology, which is based on training with or
without these transformed images and testing on clean
ones. Code for generating these transformations as well
as for the deep learning algorithms are made available at
2 Perturbed and Transformed Character Images
Figure 1 shows the different transformations we used
to stochastically transform 32×32 source images (such
as the one in Fig. 1(a)) in order to obtain data from
a larger distribution which covers a domain substan-
tially larger than the clean characters distribution from
which we start. Although character transformations
have been used before to improve character recogniz-
ers, this effort is on a large scale both in number of
classes and in the complexity of the transformations,
hence in the complexity of the learning task. The code
for these transformations (mostly Python) is available at All the modules in
the pipeline (Figure 1) share a global control parameter
(0 ≤ complexity ≤ 1) that allows one to modulate the
amount of deformation or noise introduced. There are
two main parts in the pipeline. The first one, from thick-
ness to pinch, performs transformations. The second part,
from blur to contrast, adds different kinds of noise. More
details can be found in Bastien et al. [2010].
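As a rough illustration of how a single global complexity value can modulate every module of such a pipeline, here is a minimal sketch. The two modules (`shift_rows`, `salt_and_pepper`), their scaling constants, and the `apply_prob` parameter are hypothetical stand-ins, not the transformations or settings of the actual system.

```python
import random

def shift_rows(image, complexity, rng):
    """Stand-in deformation: shift each row by up to round(3 * complexity) pixels."""
    max_shift = int(round(3 * complexity))
    out = []
    for row in image:
        s = rng.randint(-max_shift, max_shift)
        n = len(row)
        out.append([row[j - s] if 0 <= j - s < n else 0.0 for j in range(n)])
    return out

def salt_and_pepper(image, complexity, rng):
    """Stand-in noise: set a fraction (at most 20%) of pixels to 0 or 1."""
    p = 0.2 * complexity
    return [[rng.choice((0.0, 1.0)) if rng.random() < p else v for v in row]
            for row in image]

# Deformation modules come first, noise modules second.
PIPELINE = [shift_rows, salt_and_pepper]

def perturb(image, complexity, rng, apply_prob=0.5):
    """Apply a random subset of modules, all sharing one complexity in [0, 1]."""
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must lie in [0, 1]")
    for module in PIPELINE:
        if rng.random() < apply_prob:
            image = module(image, complexity, rng)
    return image
```

With `complexity` set to 0 both stand-in modules leave the image unchanged, which mirrors the role of the control parameter as a dial on the amount of deformation and noise.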
3 Experimental Setup
Much previous work on deep learning had been performed
on the MNIST digits task [Hinton et al., 2006, Ranzato
et al., 2007, Bengio et al., 2007, Salakhutdinov and Hin-
ton, 2009], with 60,000 examples, and variants involving
10,000 examples [Larochelle et al., 2009, Vincent et al.,
2008] (see footnote 1). The focus here is on much larger training sets,
from 10 times to 1000 times larger, and 62 classes.
The first step in constructing the larger datasets (called
NISTP and P07) is to sample from a data source: NIST
(NIST database 19), Fonts, Captchas, and OCR data
(scanned machine printed characters). See more in Sec-
tion 3.1 below. Once a character is sampled from one of
these sources (chosen randomly), the second step is to ap-
ply a pipeline of transformations and/or noise processes
outlined in section 2.
To provide a baseline of error rate comparison we also
estimate human performance on both the 62-class task
and the 10-class digits task. We compare the best Multi-
Layer Perceptrons (MLP) against the best Stacked De-
noising Auto-encoders (SDA), when both models’ hyper-
parameters are selected to minimize the validation set er-
ror. We also provide a comparison against a precise esti-
mate of human performance obtained via Amazon’s Me-
chanical Turk (AMT) service ( AMT
users are paid small amounts of money to perform tasks
for which human intelligence is required. Mechanical Turk
has been used extensively in natural language processing
and vision. AMT users were presented with 10 charac-
ter images (from a test set) on a screen and asked to
label them. They were forced to choose a single char-
acter class (either among the 62 or 10 character classes)
for each image. 80 subjects classified 2500 images per
(dataset, task) pair. Different human labelers sometimes
provided a different label for the same example, and we
were able to estimate the error variance due to this effect
because each image was classified by 3 different persons.
The average error of humans on the 62-class task
NIST test set is 18.2%, with a standard error of 0.1%.
We controlled noise in the labelling process by (1) requiring
AMT workers with a higher than normal average of
accepted responses (>95%) on other tasks, (2) discarding
responses that were not complete (10 predictions), (3) discarding
responses for which the time to predict
was smaller than 3 seconds for NIST (the mean response
time was 20 seconds) and 6 seconds for NISTP
(average response time of 45 seconds), and (4) discarding
responses which were obviously wrong (10 identical ones, or
“12345...”). Overall, after such filtering, we kept approximately
95% of the AMT workers’ responses.
Footnote 1: Fortunately, there are more and more exceptions of course,
such as Raina et al. [2009] using a million examples.
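The response-level filters (2) to (4) can be written down as a small predicate. This is a sketch only: the function name, the argument layout, and the exact test used for the “12345...” pattern are assumptions, and filter (1), which concerns worker qualification, is not modelled.

```python
# Minimum plausible answer time in seconds, per dataset (values from the text).
MIN_SECONDS = {"NIST": 3, "NISTP": 6}

def keep_response(labels, seconds, dataset, expected=10):
    """Return True if one response (a screen of `expected` labels) passes the filters."""
    if len(labels) != expected:               # (2) incomplete response
        return False
    if seconds < MIN_SECONDS[dataset]:        # (3) answered implausibly fast
        return False
    if len(set(labels)) == 1:                 # (4) the same label repeated 10 times
        return False
    if "".join(labels).startswith("12345"):   # (4) assumed check for "12345..."
        return False
    return True
```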
3.1 Data Sources
NIST. Our main source of characters is the NIST Spe-
cial Database 19 [Grother, 1995], widely used for training
and testing character recognition systems [Granger et al.,
2007, Pérez-Cortes et al., 2000, Oliveira et al., 2002, Milgram
et al., 2005]. The dataset is composed of 814,255
digits and characters (upper and lower cases), with hand-checked
classifications, extracted from handwritten sample
forms of 3600 writers. The characters are labelled
by one of the 62 classes corresponding to “0”-“9”,“A”-
“Z” and “a”-“z”. The dataset contains 8 parts (parti-
tions) of varying complexity. The fourth partition (called
hsf4, 82,587 examples), experimentally recognized to be
the most difficult one, is the one recommended by NIST
as a testing set and is used in our work as well as some
previous work [Granger et al., 2007, Pérez-Cortes et al.,
2000, Oliveira et al., 2002, Milgram et al., 2005] for that
purpose. We randomly split the remainder (731,668 ex-
amples) into a training set and a validation set for model
selection. The performances reported by previous work
on that dataset mostly use only the digits. Here we use
all the classes both in the training and testing phase. This
is especially useful to estimate the effect of a multi-task
setting. The distribution of the classes in the NIST train-
ing and test sets differs substantially, with relatively many
more digits in the test set, and a more uniform distribu-
tion of letters in the test set (whereas in the training set
they are distributed more like in natural text).
Fonts. In order to have a good variety of sources we
downloaded a large number of free fonts from:
Including an operating system’s (Windows 7) fonts,
we uniformly chose from 9817 different fonts. The chosen
ttf file is either used as input of the Captcha generator
(see next item) or, by producing a corresponding image,
directly as input to our models.
Captchas. The Captcha data source is an adaptation
of the pycaptcha library (a Python-based captcha genera-
tor library) for generating characters of the same format
as the NIST dataset. This software is based on a random
character class generator and various kinds of transformations
similar to those described in the previous sections.
In order to increase the variability of the data generated,
many different fonts are used for generating the characters.
Transformations (slant, distortions, rotation, translation)
are applied to each randomly generated character
with a complexity depending on the value of the complexity
parameter provided by the user of the data source.
Figure 1 panels: (a) Original, (b) Thickness, (c) Slant, (d) Affine Transformation, (e) Local Elastic, (f) Pinch, (g) Motion Blur, (h) Occlusion, (i) Gaussian, (j) Pixels Permutation, (k) Gaussian Noise, (l) Background Image Addition, (m) Salt & Pepper, (n) Scratches, (o) Grey Level & Contrast.
Figure 1: Top left (a): example original image. Others (b-o): examples of the effect of each transformation module
taken separately. Actual perturbed examples are obtained by a pipeline of these, with random choices about which
module to apply and how much perturbation to apply.
OCR data. A large set (2 million) of scanned,
OCRed and manually verified machine-printed characters
were included as an additional source. This set
is part of a larger corpus being collected by the Image
Understanding Pattern Recognition Research group
led by Thomas Breuel at University of Kaiserslautern
3.2 Data Sets
All data sets contain 32×32 grey-level images (values in
[0,1]) associated with one of 62 character labels.
NIST. This is the raw NIST special database 19 [Grother,
1995]. It has {651,668 / 80,000 / 82,587} {training /
validation / test} examples.
P07. This dataset is obtained by taking raw characters
from the above 4 sources and sending them through the
transformation pipeline described in section 2. For each
generated example, a data source is selected with prob-
ability 10% from the fonts, 25% from the captchas, 25%
from the OCR data and 40% from NIST. The transforma-
tions are applied in the order given above, and for each
of them we sample uniformly a complexity in the range
[0, 0.7]. It has {81,920,000 / 80,000 / 20,000} {training /
validation / test} examples obtained from the corresponding
NIST sets plus other sources.
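The P07 sampling recipe (pick a source with the stated probabilities, then draw one complexity per transformation uniformly from [0, 0.7]) could be sketched as follows. The function name and the default of 14 modules, taken here as one per panel (b) to (o) of Figure 1, are assumptions of this sketch.

```python
import random

SOURCES = ["fonts", "captchas", "ocr", "nist"]
WEIGHTS = [0.10, 0.25, 0.25, 0.40]   # source probabilities stated for P07
MAX_COMPLEXITY = 0.7

def sample_p07_recipe(rng, n_modules=14):
    """Draw the data source and one complexity per module for a single example."""
    source = rng.choices(SOURCES, weights=WEIGHTS, k=1)[0]
    complexities = [rng.uniform(0.0, MAX_COMPLEXITY) for _ in range(n_modules)]
    return source, complexities
```

For example, `sample_p07_recipe(random.Random(0))` returns one source name and a list of 14 complexities, each between 0 and 0.7.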
NISTP. This one is equivalent to P07 (complexity pa-
rameter of 0.7 with the same proportions of data sources)
except that we only apply transformations from slant to
pinch (see Fig. 1(b-f)). Therefore, the character is transformed
but without added noise, yielding images closer to
the NIST dataset. It has {81,920,000 / 80,000 / 20,000}
{training / validation / test} examples obtained from the
corresponding NIST sets plus other sources.
3.3 Models and their Hyper-parameters
The experiments are performed using MLPs (with a sin-
gle hidden layer) and deep SDAs. Hyper-parameters are
selected based on the NISTP validation set error.
Multi-Layer Perceptrons (MLP). The MLP output
estimates the class-conditional probabilities
$$P(\mathrm{class} \mid \mathrm{input} = x) = \mathrm{softmax}(b_2 + W_2 \tanh(b_1 + W_1 x)),$$
i.e., two layers, where $p = \mathrm{softmax}(a)$ means that $p_i(x) =
\exp(a_i) / \sum_j \exp(a_j)$ representing the probability for class
$i$, tanh is the element-wise hyperbolic tangent, $b_i$ are parameter
vectors, and $W_i$ are parameter matrices (one per
layer). The number of rows of $W_1$ is called the number of
hidden units (of the single hidden layer, here), and is one
way to control capacity (the main other ways to control
capacity are the number of training iterations and option-
ally a regularization penalty on the parameters, not used
here because it did not help). Whereas previous work had
compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the
very large datasets used (making the use of SVMs com-
putationally challenging because of their quadratic scal-
ing behavior). Preliminary experiments on training SVMs
(libSVM) with subsets of the training set allowing the pro-
gram to fit in memory yielded substantially worse results
than those obtained with MLPs (see footnote 2). For training on nearly
a hundred million examples (with the perturbed data),
the MLPs and SDA are much more convenient than clas-
sifiers based on kernel methods. The MLP has a single
hidden layer with tanh activation functions, and softmax
(normalized exponentials) on the output layer for estimat-
ing P(class|input). The number of hidden units is taken
in {300,500,800,1000,1500}. Training examples are pre-
sented in minibatches of size 20, i.e., the parameters are
iteratively updated in the direction of the mean gradient
of the next 20 examples. A constant learning rate was
chosen among {0.001,0.01,0.025,0.075,0.1,0.5}.
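For concreteness, the single-hidden-layer forward pass defined above (tanh hidden units followed by a softmax output) can be sketched in plain Python. The function names are illustrative, weights are stored as lists of rows so that `W1` has one row per hidden unit, and subtracting the maximum inside `softmax` is a standard numerical-stability device added here, not something stated in the text.

```python
import math

def softmax(a):
    """p_i = exp(a_i) / sum_j exp(a_j), shifted by max(a) for numerical stability."""
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def affine(W, b, x):
    """Return b + W x, with W given as a list of rows."""
    return [bi + sum(w * xj for w, xj in zip(row, x)) for row, bi in zip(W, b)]

def mlp_forward(x, W1, b1, W2, b2):
    """P(class | input = x) = softmax(b2 + W2 tanh(b1 + W1 x))."""
    hidden = [math.tanh(v) for v in affine(W1, b1, x)]
    return softmax(affine(W2, b2, hidden))
```

With all output weights and biases at zero, the 62 class probabilities are uniform, which is also the state of the top layer when it is initialized at 0.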
Stacked Denoising Auto-encoders (SDA). Various
auto-encoder variants and Restricted Boltzmann Ma-
chines (RBMs) can be used to initialize the weights of
each layer of a deep MLP (with many hidden layers) [Hin-
ton et al., 2006, Ranzato et al., 2007, Bengio et al., 2007],
apparently setting parameters in the basin of attraction
of supervised gradient descent yielding better generaliza-
tion [Erhan et al., 2010]. This initial unsupervised pre-
training phase does not use the training labels. Each
layer is trained in turn to produce a new representation
of its input (starting from the raw pixels). It is hypothesized
that the advantage brought by this procedure stems
from a better prior, on the one hand taking advantage
of the link between the input distribution P(x) and the
conditional distribution of interest P(y|x) (like in semi-supervised
learning), and on the other hand taking advantage
of the expressive power and bias implicit in the deep
architecture (whereby complex concepts are expressed as
compositions of simpler ones through a deep hierarchy).
Footnote 2: RBF SVMs trained with a subset of NISTP or NIST, 100k
examples, to fit in memory, yielded 64% test error or worse;
online linear SVMs trained on the whole of NIST or 800k from
NISTP yielded no better than 42% error; slightly better results
were obtained by sparsifying the pixel intensities and projecting
to a second-order polynomial (a very sparse vector), still
41% error. We expect that better results could be obtained
with a better implementation allowing for training with more
examples and a higher-order non-linear projection.
Here we chose to use the Denoising Auto-encoder [Vin-
cent et al., 2008] as the building block for these deep
hierarchies of features, as it is simple to train and
explain (see Figure 2, as well as tutorial and code
there:, provides
efficient inference, and yielded results comparable to or better
than RBMs in a series of experiments [Vincent et al.,
2008]. Some denoising auto-encoders correspond to a
Gaussian RBM trained by a Score Matching criterion
[Vincent, 2010]. During its unsupervised training, a Denoising
Auto-encoder is presented with a stochastically corrupted
version $\tilde{x}$ of the input $x$ and trained to produce
a reconstruction $z$ of the uncorrupted input $x$. Because
the network has to denoise, it is forcing the hidden
units $y$ to represent the leading regularities in the data.
In a slight departure from Vincent et al. [2008], the hidden
units output $y$ is obtained through the tanh-affine encoder
$y = \tanh(c + V x)$ and the reconstruction is obtained
through the transposed transformation $z = \tanh(d + V' y)$.
The training set average of the cross-entropy reconstruction
loss (after mapping back numbers in (-1,1) into (0,1))
$$L_H(x, z) = -\sum_i \frac{(z_i + 1)}{2} \log \frac{(x_i + 1)}{2} + \frac{z_i}{2} \log \frac{x_i}{2}$$
is minimized. Here we use the random binary masking
corruption (which in $\tilde{x}$ sets to 0 a random subset of the
elements of x, and copies the rest). Once the first denois-
ing auto-encoder is trained, its parameters can be used
to set the first layer of the deep MLP. The original data
are then processed through that first layer, and the out-
put of the hidden units form a new representation that
can be used as input data for training a second denois-
ing auto-encoder, still in a purely unsupervised way. This
is repeated for the desired number of hidden layers. Af-
ter this unsupervised pre-training stage, the parameters
are used to initialize a deep MLP (similar to the above,
but with more layers), which is fine-tuned by the same
standard procedure (stochastic gradient descent) used to
train MLPs in general (see above). The top layer pa-
rameters of the deep MLP (the one which outputs the
class probabilities and takes the top hidden layer as in-
put) can be initialized at 0. The SDA hyper-parameters
are the same as for the MLP, with the addition of the
amount of corruption noise (we used the masking noise
process, whereby a fixed proportion of the input values,
randomly selected, are zeroed), and a separate learning
rate for the unsupervised pre-training stage (selected from
the same above set). The fraction of inputs corrupted
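
One denoising auto-encoder layer, as described in this section, can be sketched as masking corruption followed by a tanh encoder, a decoder with transposed (tied) weights, and a reconstruction loss. The function names are illustrative. The loss below is the standard binary cross-entropy between the clean input and the reconstruction after mapping both from (-1, 1) to (0, 1); that choice, and the `eps` clipping, are assumptions of this sketch and may not match the printed expression term for term.

```python
import math
import random

def mask_corrupt(x, fraction, rng):
    """Zero a fixed proportion of randomly selected inputs and copy the rest."""
    n_zero = int(round(fraction * len(x)))
    zeroed = set(rng.sample(range(len(x)), n_zero))
    return [0.0 if i in zeroed else v for i, v in enumerate(x)]

def encode(x, V, c):
    """y = tanh(c + V x), with V stored as one row per hidden unit."""
    return [math.tanh(ci + sum(v * xj for v, xj in zip(row, x)))
            for row, ci in zip(V, c)]

def decode(y, V, d):
    """z = tanh(d + V' y), reusing the encoder weights transposed."""
    return [math.tanh(dj + sum(V[i][j] * y[i] for i in range(len(y))))
            for j, dj in enumerate(d)]

def cross_entropy(x, z, eps=1e-12):
    """Binary cross-entropy after mapping x and z from (-1, 1) to (0, 1)."""
    total = 0.0
    for xi, zi in zip(x, z):
        target = (xi + 1.0) / 2.0
        pred = min(max((zi + 1.0) / 2.0, eps), 1.0 - eps)
        total -= target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred)
    return total

def da_loss(x, V, c, d, fraction, rng):
    """Corrupt x, reconstruct it, and score the result against the clean x."""
    x_tilde = mask_corrupt(x, fraction, rng)
    z = decode(encode(x_tilde, V, c), V, d)
    return cross_entropy(x, z)
```

Note that the loss compares the reconstruction with the uncorrupted input, which is what forces the hidden units to capture regularities rather than copy the input.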
Conclusions and Future Work
Pre-training adds robustness to a deep architecture.
Pre-training is a type of regularization: in the sense of restricting the start-
ing points of the optimization to a data-dependent manifold.
It is not simply a way of getting a good initial marginal distribution: it
captures more intricate dependencies.
Pre-training seems more effective for lower layers than for higher layers.
Visualizations confirmed that the solutions corresponding to the two initial-
ization strategies are qualitatively different.
Is there a pre-training advantage for very large (“infinite”) datasets? I.e., does
pre-training help with optimization in a deep architecture?
Future work: “InfiniteMNIST”, non-MNIST data, DBNs.
[1] BENGIO, Y. Learning deep architectures for AI. Tech. Rep. 1312, Université de Montréal, dept. IRO, 2007.
[2] BENGIO, Y., LAMBLIN, P., POPOVICI, D., AND LAROCHELLE, H. Greedy layer-wise training of deep networks. In NIPS 19 (2007), B. Schölkopf, J. Platt, and T. Hoffman, Eds., MIT Press, pp. 153–160.
[3] HINTON, G. E., OSINDERO, S., AND TEH, Y. A fast learning algorithm for deep belief nets. Neural Computation 18 (2006), 1527–1554.
[4] RANZATO, M., POULTNEY, C., CHOPRA, S., AND LECUN, Y. Efficient learning of sparse representations with an energy-based model. In NIPS 19 (2007), B. Schölkopf, J. Platt, and T. Hoffman, Eds., MIT Press.
[5] VAN DER MAATEN, L., AND HINTON, G. E. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research (2008).
[6] VINCENT, P., LAROCHELLE, H., BENGIO, Y., AND MANZAGOL, P.-A. Extracting and composing robust features with denoising autoencoders. In Proc. ICML 2008 (2008), pp. 1096–1103.
Functional space approximation
2D approximation of the outputs of the 2-layer networks during supervised
training. Outputs were projected using t-SNE [5].
1. The pre-trained and not pre-trained models start and stay in different re-
gions of function space.
2. All trajectories of a given type (with pre-training or without) initially move
together, but at some point (after about 7 epochs), different trajectories di-
verge and never get back close to each other. This suggests that each tra-
jectory moves into a different local minimum.
Error landscape analysis
Training errors obtained on Shapeset when stepping in parameter space
around a converged model in 7 random gradient directions (step size of 0.1).
Top/Bottom: no / with pre-training. Left–Right: 1–3 hidden layers.
We seem to be near a local minimum in all directions investigated, as opposed
to a saddle point or a plateau. Figures also suggest that the error landscape is
a bit flatter in the case of pre-training, and flatter for deeper architectures.
Pre-Training Different Layers
Hybrid initialization: some layers are
taken from a pre-trained model and
others are initialized randomly in the
usual way.
Results are consistent with the hypothesis [1] that training the lower
layers is harder because gradient information becomes less informative
as it is backpropagated through more layers.
Effect of Layer Size
We measure the effect of layer size on the changes brought by pre-training.
Experiments on MNIST. Error bars have a height of two standard devia-
tions (over initialization seed). Pre-training hurts for smaller layer sizes
and shallower networks, but it helps for all depths for larger networks.
In this scenario, pre-training acts like an additional regularizer: for smaller
networks, it constrains the capacity even more and hurts performance.
A Better Random Initialization?
Alternative hypothesis: pre-training provides a better marginal distribution
of weights compared to random initialization (thus, it is data-independent).
We measured the effect of various initialization strategies (MNIST):
Initialization    Uniform        Histogram      Unsup. pre-tr.
1 layer           1.81 ± 0.07    1.94 ± 0.09    1.41 ± 0.07
2 layers          1.77 ± 0.10    1.69 ± 0.11    1.37 ± 0.09
1. independent uniform densities (one per parameter)
2. independent densities from the marginals after pre-training
3. unsupervised pre-training (which samples the parameters in a highly dependent
way so that they collaborate to make up good denoising auto-encoders)
Clearly, we can’t simply replace the unsupervised initialization with sampling
from the marginal distribution induced by it.
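To make the difference between strategies 1 and 2 concrete, here is a minimal sketch. The function names are hypothetical, and modelling the “histogram” strategy as independent draws with replacement from the pool of a pre-trained layer's weights is an assumption about how the marginal was sampled.

```python
import random

def uniform_init(n_rows, n_cols, scale, rng):
    """Strategy 1: every weight drawn independently from uniform(-scale, scale)."""
    return [[rng.uniform(-scale, scale) for _ in range(n_cols)]
            for _ in range(n_rows)]

def histogram_init(pretrained, rng):
    """Strategy 2 (assumed form): draw each weight independently from the
    empirical marginal of a pre-trained layer's weights."""
    pool = [w for row in pretrained for w in row]
    return [[rng.choice(pool) for _ in row] for row in pretrained]
```

Because each weight is drawn independently, `histogram_init` reproduces the marginal distribution of the pre-trained weights while discarding the dependencies between them, which is the property the comparison is meant to isolate.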
Pre-training as Regularization
For 2 and 3-layer networks, pre-training seems to act like a regularizer:
It hurts the training error, yet it helps with generalization.
Pre-training with denoising auto-encoders can be seen as decreasing the
variance and introducing a bias (towards parameter configurations suit-
able for performing denoising).
Unlike ordinary regularizers, pre-training changes the distribution of pa-
rameter values before training and does not constrain them during train-
ing (“prior”).
Unlike ordinary regularizers, pre-training with denoising auto-encoders
does so in a data-dependent manner.
Better Optimization or Generalization?
Evolution without pre-training (blue) and with pre-training (red) on
MNIST of the log of the test NLL plotted against the log of the train NLL
as training proceeds. Each of the 2×400 curves represents a different initialization.
Since training error tends to decrease during training, the trajectories run
from right to left. Trajectories moving up (as we go leftward) indicate a
form of overfitting. Note that:
Pretrained networks start in a better region.
For 2 and 3-layer networks, pretrained networks converge to a lower testing
error, but a higher training error (implying a regularization effect).
Effect of depth and pre-training
Effect of depth on performance for 400 models trained (left) without pre-
training and (right) with pre-training, for 1 to 5 hidden layers, using 400
different initialization seeds:
Increasing depth seems to increase the probability of finding poor local
minima (not so for pretrained models).
Histograms presenting the test errors obtained on MNIST using models
trained with or without pre-training. Left: 1 hidden layer. Right: 4 hidden layers.
Experimental setup
Two datasets:
Shapeset: 10×10 triangles and squares (50k/10k/10k train/valid/test)
MNIST: 28×28 digit images (50k/10k/10k train/valid/test)
Training procedure for pretrained networks:
50 epochs of unsupervised pre-training all layers at the same time
followed by 50 epochs of supervised training
In both cases, initial weights are sampled independently from a uniform distribution on $[-1/\sqrt{k}, 1/\sqrt{k}]$.
Hyperparameters: number of hidden layers, units per layer, unsupervised
and supervised learning rates, L2 weight decay rate. For the optimal hyperparameters
(as determined by the validation error), we launched experiments
using an additional 400 initializations.
(Stacked) Denoising Auto-Encoders
A denoising auto-encoder [6]:
with ˆ
x= sigmoid(c+WTh(C(x))), where C(x)is a stochastic corrup-
tion of x. A simple modification of the auto-encoder that
improves upon the classical auto-encoder and
can be used to pretrain a deep network
In our case, $\mathrm{KL}(x \,\|\, \hat{x})$ is used to learn $(b, c, W)$ and, as done by [6], we
set $C_i(x) = x_i$ or $0$, with a random subset (of a fixed size) selected for
Unanswered questions
Why is it more difficult to train deep architectures?
What does the cost function landscape of deep architectures look like?
Is the advantage of unsupervised pre-training related to optimization,
or perhaps some form of regularization?
What is the effect of random initialization on the learning trajectories?
Is pretraining certain layers more important than others?
Answering such questions could lead us into further improving the
strategies employed for training deep architectures.
Deep Architectures
Efficient training of deep neural networks (more than 2 hidden layers)
did not seem possible before the Deep Belief Nets (DBN) by [3].
DBNs use greedy layer-wise unsupervised pre-training via Restricted
Boltzmann Machines to initialize a deep neural network.
This principle can be extended to auto-associators and related mod-
els [2, 4]
Applied successfully in classification tasks, regression, dimensional-
ity reduction, modeling textures, information retrieval, robotics, nat-
ural language processing and collaborative filtering
Introduction and Motivation
Automatic learning of deep hierarchies of features is an emerging area
of research in the Machine Learning community.
Most current approaches are neural-network-based and use unsuper-
vised learning (pre-training) to initialize parameters.
This approach gives state-of-the-art results for a variety of character
recognition, vision and some NLP problems.
Nonetheless, training deep architectures is a difficult problem and un-
supervised pre-training is relatively poorly understood.
Goal: large-scale empirical evaluations of deep architectures in order
to get further insights into the effect of depth and pre-training.
One-line summary: pre-training acts like a clever data-dependent reg-
ularizer, in the broad sense of the word.
Dumitru Erhan (UMontreal)
Pierre-Antoine Manzagol (UMontreal)
Yoshua Bengio (UMontreal)
Pascal Vincent (UMontreal)
Samy Bengio (Google)
The Difficulty of Training Deep Architectures and
the Effect of Unsupervised Pre-Training
Figure 2: Illustration of the computations and training criterion for the denoising auto-encoder used to pre-train each
layer of the deep architecture. Input $x$ of the layer (i.e. raw input or output of previous layer) is corrupted into $\tilde{x}$
and encoded into code $y$ by the encoder $f_\theta(\cdot)$. The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which is compared to
the uncorrupted input $x$ through the loss function $L_H(x, z)$, whose expected value is approximately minimized during
training by tuning $\theta$ and $\theta'$.
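The computation in the Figure 2 caption can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes sigmoid encoder and decoder units, masking noise that zeroes a fixed number of input components, and a cross-entropy form for the loss. All function and parameter names (`corrupt`, `encode`, `decode`, `W_dec`, `n_zero`) are hypothetical.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def corrupt(x, n_zero, rng):
    """Masking noise: set a random subset of n_zero components of x to 0."""
    zeroed = set(rng.sample(range(len(x)), n_zero))
    return [0.0 if i in zeroed else xi for i, xi in enumerate(x)]

def encode(x_tilde, W, b):
    """Code y = sigmoid(b + W x_tilde); W has one row per hidden unit."""
    return [sigmoid(b_j + sum(w * v for w, v in zip(row, x_tilde)))
            for row, b_j in zip(W, b)]

def decode(y, W_dec, c):
    """Reconstruction z = sigmoid(c + W_dec y); W_dec has one row per input."""
    return [sigmoid(c_i + sum(w * v for w, v in zip(row, y)))
            for row, c_i in zip(W_dec, c)]

def cross_entropy(x, z, eps=1e-12):
    """-sum_i [x_i log z_i + (1 - x_i) log(1 - z_i)], assumed form of the loss."""
    return -sum(xi * math.log(zi + eps) + (1.0 - xi) * math.log(1.0 - zi + eps)
                for xi, zi in zip(x, z))

def dae_loss(x, W, b, W_dec, c, n_zero, rng):
    """Corrupt x, encode, decode, then compare with the UNCORRUPTED x."""
    x_tilde = corrupt(x, n_zero, rng)
    z = decode(encode(x_tilde, W, b), W_dec, c)
    return cross_entropy(x, z)
```

The point the sketch tries to make concrete is that the reconstruction is scored against the clean input, so the layer has to learn to undo the corruption rather than copy its input. Setting `W_dec` to the transpose of `W` would give a tied-weights variant.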
was selected among {10%,20%,50%}. Another hyper-
parameter is the number of hidden layers but it was fixed
to 3 for our experiments, based on previous work with
SDAs on MNIST [Vincent et al., 2008]. We also compared
against 1 and against 2 hidden layers, to disentangle the
effect of depth from that of unsupervised pre-training.
The size of each hidden layer was kept constant across
hidden layers, and the best results were obtained with
the largest values that we tried (1000 hidden units).
4 Experimental Results
The models are either trained on NIST (MLP0 and
SDA0), NISTP (MLP1 and SDA1), or P07 (MLP2 and
SDA2), and tested on either NIST, NISTP or P07 (regard-
less of the data set used for training), either on the 62-class
task or on the 10-digits task. Training time (including
about half for unsupervised pre-training, for DAs) on the
larger datasets is around one day on a GPU (GTX 285).
Figure 3 summarizes the results obtained, comparing hu-
mans, the three MLPs (MLP0, MLP1, MLP2) and the
three SDAs (SDA0, SDA1, SDA2), along with the previ-
ous results on the digits NIST special database 19 test set
from the literature, respectively based on ARTMAP neu-
ral networks [Granger et al., 2007], fast nearest-neighbor
search [Pérez-Cortes et al., 2000], MLPs [Oliveira et al.,
2002], and SVMs [Milgram et al., 2005]. The deep learner
not only outperformed the shallow ones and previously
published performance (in a statistically and qualitatively
significant way) but when trained with perturbed data
reaches human performance on both the 62-class task and
the 10-class (digits) task. 17% error (SDA1) or 18% error
(humans) may seem large but a large majority of the er-
rors from humans and from SDA1 are from out-of-context
confusions (e.g. a vertical bar can be a “1”, an “l” or an
“L”, and a “c” and a “C” are often indistinguishable).
Regarding shallower networks pre-trained with unsupervised
denoising auto-encoders, we find that the NIST test error
is 21% with one hidden layer and 20% with two hidden
layers (vs 17% in the same conditions with 3 hidden lay-
ers). Compare this with the 23% error achieved by the
MLP, i.e. a single hidden layer and no unsupervised pre-
training. As found in previous work [Erhan et al., 2010,
Larochelle et al., 2009], these results show that both depth
and unsupervised pre-training need to be combined in or-
der to achieve the best results.
In addition, as shown on the left of Figure 4, the relative
improvement in error rate brought by out-of-distribution
examples is greater for the deep SDA, and these differ-
ences with the shallow MLP are statistically and quali-
tatively significant. The left side of the figure shows the
improvement to the clean NIST test set error brought by
the use of out-of-distribution examples (i.e. the perturbed
examples from NISTP or P07), over the models
trained exclusively on NIST (respectively SDA0 and
MLP0). Relative percent change is measured by taking
100%×(original model’s error / perturbed-data model’s
error - 1). The right side of Figure 4 shows the relative
improvement brought by the use of a multi-task setting,
in which the same model is trained for more classes than
the target classes of interest (i.e. training with all 62
classes when the target classes are respectively the dig-
its, lower-case, or upper-case characters). Again, whereas
the gain from the multi-task setting is marginal or nega-
tive for the MLP, it is substantial for the SDA. Note that
to simplify these multi-task experiments, only the origi-
nal NIST dataset is used. For example, the MLP-digits
bar shows the relative percent improvement in MLP er-
ror rate on the NIST digits test set as 100%×(single-task
model’s error / multi-task model’s error - 1). The single-
task model is trained with only 10 outputs (one per digit),
seeing only digit examples, whereas the multi-task model
is trained with 62 outputs, with all 62 character classes
as examples. Hence the hidden units are shared across
all tasks. For the multi-task model, the digit error rate is
measured by comparing the correct digit class with the
output class associated with the maximum conditional
probability among only the digit classes outputs. The
Bengio et. al.
Figure 3: SDAx are the deep models. Error bars indicate a 95% confidence interval. 0 indicates that the model was
trained on NIST, 1 on NISTP, and 2 on P07. Left: overall results of all models, on NIST and NISTP test sets. Right:
error rates on NIST test digits only, along with the previous results from literature [Granger et al., 2007, Pérez-Cortes
et al., 2000, Oliveira et al., 2002, Milgram et al., 2005] respectively based on ART, nearest neighbors, MLPs, and SVMs.
setting is similar for the other two target classes (lower
case characters and upper case characters). Note however
that some types of perturbations (NISTP) help more than
others (P07) when testing on the clean images.
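The two evaluation conventions described in this section, the relative percent change and the digit prediction restricted to the digit outputs of a 62-output model, can be sketched as follows. This is an illustration only: the function names are hypothetical, the error values are made up, and the layout of the 62 outputs (digits at indices 0 to 9) is an assumption.

```python
def relative_improvement(baseline_error, new_error):
    """100% x (baseline model's error / new model's error - 1).
    Positive when the new model (e.g. trained on perturbed data or in the
    multi-task setting) has the lower error, negative when it is worse."""
    return 100.0 * (baseline_error / new_error - 1.0)

def predict_within(probs, allowed):
    """Class with the highest conditional probability among `allowed` only."""
    return max(allowed, key=lambda k: probs[k])

def error_rate(all_probs, labels, allowed):
    """Fraction of examples whose restricted prediction differs from the label."""
    wrong = sum(predict_within(p, allowed) != y
                for p, y in zip(all_probs, labels))
    return wrong / len(labels)

# Made-up errors: baseline 20%, new model 16%, i.e. roughly a 25% relative gain.
gain = relative_improvement(0.20, 0.16)

# Made-up 62-way output where a non-digit class (index 21) has the highest
# probability overall, but the digit "1" wins among the digit outputs.
DIGITS = range(10)            # assumed index layout, not taken from the paper
probs = [0.0] * 62
probs[1], probs[7], probs[21] = 0.3, 0.1, 0.6
digit_prediction = predict_within(probs, DIGITS)
```

Restricting the argmax in this way lets a model trained on all 62 classes be scored on the digit task without counting confusions with letters as errors.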
5 Conclusions and Discussion
We have found that out-of-distribution examples (multi-
task learning and perturbed examples) are more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely, the answers are posi-
tive for all the questions asked in the introduction.
Do the good results previously obtained with
deep architectures on the MNIST digits generalize
to a much larger and richer (but similar) dataset,
the NIST special database 19, with 62 classes and
around 800k examples? Yes, the SDA systematically
outperformed the MLP and all the previously published
results on this dataset (the ones that we are aware of),
in fact reaching human-level performance at around 17%
error on the 62-class task and 1.4% on the digits, and
beating previously published results on the same data.
To what extent do out-of-distribution examples
help deep learners, and do they help them more
than shallow supervised ones? We found that dis-
torted training examples not only made the resulting clas-
sifier better on similarly perturbed images but also on
the original clean examples. More importantly, and this is the
more novel finding, deep architectures benefit more from
such out-of-distribution examples. Shallow MLPs were
helped by perturbed training examples when tested on
perturbed input images (65% relative improvement on
NISTP) but only marginally helped (5% relative improve-
ment on all classes) or even hurt (10% relative loss on dig-
its) with respect to clean examples. On the other hand,
the deep SDAs were significantly boosted by these out-
of-distribution examples. Similarly, whereas the improve-
ment due to the multi-task setting was marginal or nega-
tive for the MLP (from +5.6% to -3.6% relative change),
it was quite significant for the SDA (from +13% to +27%
relative change), which may be explained by the argu-
ments below. Since out-of-distribution data (perturbed
or from other related classes) is very common, this con-
clusion is of practical importance.
In the original self-taught learning framework [Raina
et al., 2007], the out-of-sample examples were used as a
source of unsupervised data, and experiments showed their
positive effects in a limited labeled data scenario. How-
ever, many of the results by Raina et al. [2007] (who used
a shallow, sparse coding approach) suggest that the rel-
ative gain of self-taught learning vs ordinary supervised
learning diminishes as the number of labeled examples
increases. We note instead that, for deep architectures,
our experiments show that such a positive effect is ac-
complished even in a scenario with a large number of la-
beled examples, i.e., here, the relative gain of self-taught
learning and out-of-distribution examples is probably pre-
served in the asymptotic regime. However, note that in
our perturbation experiments (but not in our multi-task
experiments), even the out-of-distribution examples are
Figure 4: Relative improvement in error rate due to out-of-distribution examples. Left: Improvement (or loss, when
negative) induced by out-of-distribution examples (perturbed data). Right: Improvement (or loss, when negative)
induced by multi-task learning (training on all classes and testing only on either digits, upper case, or lower-case). The
deep learner (SDA) benefits more from out-of-distribution examples, compared to the shallow MLP.
labeled, unlike in the earlier self-taught learning experi-
ments [Raina et al., 2007].
Why would deep learners benefit more from
the self-taught learning framework and out-of-
distribution examples? The key idea is that the lower
layers of the predictor compute a hierarchy of features
that can be shared across tasks or across variants of the
input distribution. A theoretical analysis of generaliza-
tion improvements due to sharing of intermediate fea-
tures across tasks already points towards that explana-
tion [Baxter, 1995]. Intermediate features that can be
used in different contexts can be estimated in a way that
allows statistical strength to be shared. Features extracted
through many levels are more likely to be more abstract
and more invariant to some of the factors of variation in
the underlying distribution (as the experiments in Good-
fellow et al. [2009] suggest), increasing the likelihood that
they would be useful for a larger array of tasks and input
conditions. Therefore, we hypothesize that both depth
and unsupervised pre-training play a part in explaining
the advantages observed here, and future experiments
could attempt to tease apart these factors. And why
would deep learners benefit from the self-taught learn-
ing scenarios even when the number of labeled exam-
ples is very large? We hypothesize that this is related to
the hypotheses studied in Erhan et al. [2010]. In Erhan
et al. [2010] it was found that online learning on a huge
dataset did not make the advantage of the deep learning
bias vanish, and a similar phenomenon may be happen-
ing here. We hypothesize that unsupervised pre-training
of a deep hierarchy with out-of-distribution examples ini-
tializes the model in the basin of attraction of supervised
gradient descent that corresponds to better generaliza-
tion. Furthermore, such good basins of attraction are
not discovered by pure supervised learning (with or without
out-of-distribution examples) from random initialization,
and more labeled examples do not allow the shallow
or purely supervised models to discover the kind of
better basins associated with deep learning and out-of-
distribution examples.
A Java demo of the recognizer (where both the MLP and
the SDA can be compared) can be executed on-line at
References
Frédéric Bastien, Yoshua Bengio, Arnaud Bergeron, Nicolas
Boulanger-Lewandowski, Thomas Breuel, Youssouf
Chherawala, Moustapha Cisse, Myriam Côté, Dumitru
Erhan, Jeremy Eustache, Xavier Glorot, Xavier Muller,
Sylvain Pannetier Lebeuf, Razvan Pascanu, Salah Rifai,
François Savard, and Guillaume Sicard. Deep self-
taught learning for handwritten character recognition.
Technical Report 1353, University of Montréal, 2010.
Jonathan Baxter. Learning internal representations.
In Proceedings of the 8th International Conference
on Computational Learning Theory (COLT’95), pages
311–320, Santa Cruz, California, 1995. ACM Press.
Yoshua Bengio. Learning deep architectures for AI. Foun-
dations and Trends in Machine Learning, 2(1):1–127,
2009. Also published as a book. Now Publishers, 2009.
Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo
Larochelle. Greedy layer-wise training of deep net-
works. In NIPS 19, pages 153–160. MIT Press, 2007.
Rich Caruana. Multitask learning. Machine Learning, 28
(1):41–75, 1997.
Ronan Collobert and Jason Weston. A unified architec-
ture for natural language processing: Deep neural net-
works with multitask learning. In ICML 2008, pages
160–167, 2008.
Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-
Antoine Manzagol, Pascal Vincent, and Samy Bengio.
Why does unsupervised pre-training help deep learn-
ing? JMLR, 11:625–660, 2010.
Ian Goodfellow, Quoc Le, Andrew Saxe, and Andrew Ng.
Measuring invariances in deep networks. In NIPS’09,
pages 646–654. 2009.
Eric Granger, Robert Sabourin, Luiz S. Oliveira, and
Catolica Parana. Supervised learning of fuzzy artmap
neural networks through particle swarm optimization.
JPRR, 2(1):27–60, 2007.
P.J. Grother. Handprinted forms and character database,
NIST special database 19. In National Institute of Stan-
dards and Technology (NIST) Intelligent Systems Divi-
sion (NISTIR), 1995.
Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh.
A fast learning algorithm for deep belief nets. Neural
Computation, 18:1527–1554, 2006.
Aapo Hyvärinen. Estimation of non-normalized statistical
models using score matching. JMLR, 6:695–709, 2005.
Aapo Hyvärinen. Optimal approximation of signal priors.
Neural Computation, 20(12):3087–3110, 2008.
Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ran-
zato, and Yann LeCun. What is the best multi-stage
architecture for object recognition? In ICCV’09. IEEE,
Hugo Larochelle, Dumitru Erhan, Aaron Courville, James
Bergstra, and Yoshua Bengio. An empirical evaluation
of deep architectures on problems with many factors of
variation. In ICML’07, pages 473–480. ACM, 2007.
Hugo Larochelle, Yoshua Bengio, Jerome Louradour, and
Pascal Lamblin. Exploring strategies for training deep
neural networks. JMLR, 10:1–40, 2009.
Honglak Lee, Chaitanya Ekanadham, and Andrew Ng.
Sparse deep belief net model for visual area V2. In
NIPS’07, pages 873–880. MIT Press, Cambridge, MA,
Honglak Lee, Roger Grosse, Rajesh Ranganath, and An-
drew Y. Ng. Convolutional deep belief networks for
scalable unsupervised learning of hierarchical represen-
tations. In ICML 2009. Montreal (Qc), Canada, 2009a.
Honglak Lee, Peter Pham, Yan Largman, and Andrew Ng.
Unsupervised feature learning for audio classification
using convolutional deep belief networks. In NIPS’09,
pages 1096–1104. 2009b.
J. Milgram, M. Cheriet, and R. Sabourin. Estimating ac-
curate multi-class probabilities with support vector ma-
chines. In Int. Joint Conf. on Neural Networks, pages
1906–1911, 2005.
Hossein Mobahi, Ronan Collobert, and Jason Weston.
Deep learning from temporal coherence in video. In
ICML 2009, pages 737–744, Montreal, June 2009. Om-
L.S. Oliveira, R. Sabourin, F. Bortolozzi, and C.Y.
Suen. Automatic recognition of handwritten numer-
ical strings: a recognition and verification strategy.
IEEE Trans. Pattern Analysis and Mach. Intelli., 24
(11):1438–1454, 2002.
Juan Carlos Pérez-Cortes, Rafael Llobet, and Joaquim
Arlandis. Fast and accurate handwritten charac-
ter recognition using approximate nearest neighbours
search on large databases. In IAPR, pages 767–776,
London, UK, 2000. Springer-Verlag.
Rajat Raina, Alexis Battle, Honglak Lee, Benjamin
Packer, and Andrew Y. Ng. Self-taught learning: trans-
fer learning from unlabeled data. In ICML 2007, pages
759–766, 2007.
Rajat Raina, Anand Madhavan, and Andrew Y. Ng.
Large-scale deep unsupervised learning using graphics
processors. In ICML 2009, pages 873–880, New York,
NY, USA, 2009. ISBN 978-1-60558-516-1.
M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Effi-
cient learning of sparse representations with an energy-
based model. In NIPS’06, 2007.
Marc’Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun.
Sparse feature learning for deep belief networks. In
NIPS’07, pages 1185–1192, Cambridge, MA, 2008. MIT
Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep
Boltzmann machines. In AISTATS’2009, volume 5,
pages 448–455, 2009.
Graham Taylor and Geoffrey Hinton. Factored condi-
tional restricted Boltzmann machines for modeling mo-
tion style. In ICML 2009, pages 1025–1032, Montreal,
June 2009. Omnipress.
Graham Taylor, Leonid Sigal, David Fleet, and Geoffrey
Hinton. Dynamic binary latent variable models for 3D
pose tracking. In Proc. CVPR’10, 2010.
P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol.
Extracting and composing robust features with denois-
ing autoencoders. In ICML 2008, 2008.
Pascal Vincent. A connection between Score Matching
and Denoising Autoencoders. Technical Report 1359,
Université de Montréal, 2010.
J. Weston, F. Ratle, and R. Collobert. Deep learning via
semi-supervised embedding. In ICML 2008, 2008.
... Through nonlinear processing in several layers, the feature maps are finally mapped into multiple independent neurons of the last layer, which are then linked to probabilities of specific classes to which the inputs belong. Several CNN models such as VGGnet, Resnet, Densenet, and efficient net, have been designed [24][25][26][27][28] and the performance capabilities have been shown to increase in general as the number of layers and the numbers of neurons increase 24 . ...
... Here y n is the true value and y n is the predicted value for the n th input. When y n is either 0 or 1, during the training, the mean squared error between the output value and the true value is monitored: Data augmentation is known to be effective to prevent neural networks from overfitting [26][27][28][29] . We added white noise to the input images and applied a circular shift along the time axis as well as a cutout 28 during the training. ...
Full-text available
This paper proposes a method that automatically measures non-invasive blood pressure (BP) based on an auscultatory approach using Korotkoff sounds (K-sounds). There have been methods utilizing K-sounds that were more accurate in general than those using cuff pressure signals only under well-controlled environments, but most were vulnerable to the measurement conditions and to external noise because blood pressure is simply determined based on threshold values in the sound signal. The proposed method enables robust and precise BP measurements by evaluating the probability that each sound pulse is an audible K-sound based on a deep learning using a convolutional neural network (CNN). Instead of classifying sound pulses into two categories, audible K-sounds and others, the proposed CNN model outputs probability values. These values in a Korotkoff cycle are arranged in time order, and the blood pressure is determined. The proposed method was tested with a dataset acquired in practice that occasionally contains considerable noise, which can degrade the performance of the threshold-based methods. The results demonstrate that the proposed method outperforms a previously reported CNN-based classification method using K-sounds. With larger amounts of various types of data, the proposed method can potentially achieve more precise and robust results.
... Step 5: Data augmentation Data augmentation involves creating multiple copies of the same images, but with transformations such as flipping, rotating, scaling and cropping. Data augmentations can help reduce overfitting in deep convolutional neural networks [23], improve performance [23,24], model convergence [25], generalisation and robustness on out-of-distribution samples [26,27]. Depending on the method of computer vision used, data augmentation steps will differ. ...
Full-text available
Citizen science platforms, social media and smart phone applications enable the collection of large amounts of georeferenced images. This provides a huge opportunity in biodiversity and ecological research, but also creates challenges for efficient data handling and processing. Recreational and small-scale fisheries is one of the fields that could be revolutionised by efficient, widely accessible and machine learning-based processing of georeferenced images. Most non-commercial inland and coastal fisheries are considered data poor and are rarely assessed, yet they provide multiple societal benefits and can have substantial ecological impacts. Given that large quantities of georeferenced fish images are being collected by fishers every day, artificial intelligence (AI) and computer vision applications offer a great opportunity to automate their analyses by providing species identification, and potentially also fish size estimation. This would deliver data needed for fisheries management and fisher engagement. To date, however, many AI image analysis applications in fisheries are focused on the commercial sector, limited to specific species or settings, and are not publicly available. In addition, using AI and computer vision tools often requires a strong background in programming. In this study, we aim to facilitate broader use of computer vision tools in fisheries and ecological research by compiling an open-source user friendly and modular framework for large-scale image storage, handling, annotation and automatic classification, using cost- and labour-efficient methodologies. The tool is based on TensorFlow Lite Model Maker library, and includes data augmentation and transfer learning techniques applied to different convolutional neural network models. We demonstrate the potential application of this framework using a small example dataset of fish images taken through a recreational fishing smartphone application. 
The framework presented here can be used to develop region-specific species identification models, which could potentially be combined into a larger hierarchical model.
... The extracted features can then feed a classifier to perform gait recognition. Cross-dataset gait recognition can potentially be formulated as an out-of-distribution (OOD) testing problem, where the generalization ability of a deep model beyond the biases of the training set is evaluated [232]. We expect that OOD tests [233] become increasingly popular for evaluating the generalization ability of gait recognition methods. ...
Full-text available
Gait recognition is an appealing biometric modality which aims to identify individuals based on the way they walk. Deep learning has reshaped the research landscape in this area since 2015 through the ability to automatically learn discriminative representations. Gait recognition methods based on deep learning now dominate the state-of-the-art in the field and have fostered real-world applications. In this paper, we present a comprehensive overview of breakthroughs and recent developments in gait recognition with deep learning, and cover broad topics including datasets, test protocols, state-of-the-art solutions, challenges, and future research directions. We first review the commonly used gait datasets along with the principles designed for evaluating them. We then propose a novel taxonomy made up of four separate dimensions namely body representation, temporal representation, feature representation, and neural architecture, to help characterize and organize the research landscape and literature in this area. Following our proposed taxonomy, a comprehensive survey of gait recognition methods using deep learning is presented with discussions on their performances, characteristics, advantages, and limitations. We conclude this survey with a discussion on current challenges and mention a number of promising directions for future research in gait recognition.
Object detection and specifically face detection are challenging computer vision problems. The purpose of this study is to explore the effect of data augmentation and image decluttering technique on performance of YoloV4 model. In this work, we proposed the idea of image decluttering technique and evaluated its effect on the face detection. We have also investigated Mosaic augmentation technique and identified some drawbacks of using that and suggested an enhancement to the existing Mosaic augmentation to address the drawbacks and showed the impact of the new proposed mosaic augmentation technique on performance of face detection using YoloV4 model. This study is structured to find the effect of the proposed techniques on various images with diverse backgrounds, illumination, occlusions, and viewpoints. We achieved promising results that prove the effectiveness of the proposed techniques on detection probability, specifically in the challenging conditions.
Full-text available
Transfer learning attempts to use the knowledge learned from one task and apply it to improve the learning of a separate but similar task. This article proposes to evaluate this technique’s effectiveness in classifying images from the medical domain. The article presents a model TrFEMNet (Transfer Learning with Feature Extraction Modules Network), for classifying medical images. Feature representations from General Feature Extraction Module (GFEM) and Specific Feature Extraction Module (SFEM) are input to a projection head and the classification module to learn the target data. The aim is to extract representations at different levels of hierarchy and use them for the final representation learning. To compare with TrFEMNet, we have trained three other models with transfer learning. Experiments on the COVID-19 dataset, brain MRI binary classification, and brain MRI multiclass data show that TrFEMNet performs comparably to the other models. Pretrained model ResNet50 trained on a large image dataset, the ImageNet, is used as the base model.
Power control in massive multiple input multiple output (MIMO) systems is an appealing technique to improve network performance and reliability. The traditional methods to solve such problems are based on the convex optimization theory, which incurs high computational complexity. In contrast, this work leverages deep neural networks to maximize the minimum data rate of the downlink users in the massive MIMO network. Based on the different setups in the practical systems, namely in-distribution case , weakly out-of-distribution case , and out-of-distribution case , we propose three approaches to solve them. Specifically, first, we establish a densely connected neural network for solving the power control problem. Then, for the case where the test data and training data come from the same distributions (the in-distribution case ), we use the established neural network to obtain an approximately optimal power control strategy, and propose a self-training-based algorithm to use the unlabeled samples collected in the actual system to further improve the system performance. For the case that the test data and training data come from different distributions and a small amount of labeled target domain data can be obtained (the weakly out-of-distribution case ), we propose a pre-training and fine-tuning algorithm to solve the problem. For the issue that we can not obtain the labels of the target domain data (the out-of-distribution case ), we transform the problem into a covariate shift problem and propose an algorithm by weighting the loss function. Numerical results demonstrate that the proposed schemes not only match the optimal solution well but also have low online inference complexity.
Artificial intelligence (AI) of IoE technology is also playing an important role in the industrial fields. Past a lot of problems we are dependent on the technology of artificial, lag, intelligent deployment based on machine learning/depth of IoT, automation of the air companies through automatic device information sent from the designed itself will be able to understand the problem in time, predict costly downtime, then execute remote troubleshooting, and technical staff working hours and cost saving. Firstly, Mobile edge computing (MEC) can effectively overcome the shortcomings of high-latency in mobile cloud computing (MCC) by deploying the cloud resources, e.g., storage and computational capability, to the edge. However, the limited computation capability of the MEC restricts the scalability of offloading. Therefore, the basic requirements of the MEC system are to explore effective offloading decisions and resource allocation methods. To address it, we develop a collaborative computing system composed of local computing (mobile device), MEC (edge cloud) and MCC (central cloud). Based on the proposed collaborative computing system, we design a novel Q-learning based computation offloading (QLCOF) policy to achieve the optimal resource allocation and offloading scheme by prescheduling the computation side for each task from a global perspective. Specifically, we first model the offloading decision process as a Markov decision process (MDP) and design a state loss function (STLF) to measure the quality of experience (QoE). After that, we define the cumulation of STLFs as the system loss function (SYLF) and formulate an SYLF minimization problem. Due to the difficulty to directly solve the formulated problem, we decompose it into multiple subproblems and preferentially optimize the transmission power and computation frequency of the edge cloud by the quasi-convex bisection and polynomial analysis method, respectively. 
Based on the precalculated offline transmission power and edge cloud computation frequency, we develop a Q-learning based offloading (QLOF) scheme to minimize the SYLF by optimizing offloading decisions. Finally, the numeral results show that the proposed QLOF scheme effectively reduces the SYLF under different parameters. Then, we investigate a machine learning-based power allocation design for secure transmission in a cognitive radio (CR) network. In particular, a neural network (NN)-based approach is proposed to maximize the secrecy rate of the secondary receiver under the constraints of total transmit power of secondary transmitter, and the interference leakage to the primary receiver, within which three different regularization schemes are developed. The key advantage of the proposed algorithm over conventional approaches is the capability to solve the power allocation problem with both perfect and imperfect channel state information. In a conventional setting, two completely different optimization frameworks have to be designed, namely the robust and nonrobust designs. Furthermore, conventional algorithms are often based on iterative techniques, and hence, they require a considerable number of iterations, rendering them less suitable in future wireless networks where there are very stringent delay constraints. To meet the unprecedented requirements of future ultra-reliable low-latency networks, we propose an NN-based approach that can determine the power allocation in a CR network with significantly reduced computational time and complexity. As this trained NN only requires a small number of linear operations to yield the required power allocations, the approach can also be extended to different delay sensitive applications and services in future wireless networks. 
When evaluated against conventional approaches on a suitable test set, the proposed approach achieves more than 94% of the secrecy rate performance with less than 1% of the computation time and more than 93% satisfaction of the interference leakage constraints. These results are obtained with a significant reduction in computational time, which we believe makes the approach suitable for future real-time wireless applications.
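The Q-learning based offloading idea described in this abstract (choosing, for each task, among local, edge cloud and central cloud computation so that an accumulated loss is minimized) can be illustrated with a minimal tabular sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the three-state toy environment, the `cost` table standing in for the state loss function, the uniformly random state transitions, and the function names `q_learning_offload` and `greedy_policy`.

```python
import random

ACTIONS = ("local", "edge", "cloud")  # where a task may be computed


def q_learning_offload(cost, episodes=2000, steps=10,
                       alpha=0.1, gamma=0.5, eps=0.1, seed=0):
    """Tabular Q-learning that minimizes an accumulated loss.

    cost[s][a] is a hypothetical per-step loss for taking action a in
    state s (a stand-in for a state loss function). The next state is
    drawn uniformly at random, which is a toy assumption.
    """
    rng = random.Random(seed)
    n_states = len(cost)
    q = [[0.0] * len(ACTIONS) for _ in range(n_states)]
    for _ in range(episodes):
        s = rng.randrange(n_states)
        for _ in range(steps):
            # Epsilon-greedy: explore sometimes, otherwise pick the
            # action with the lowest estimated cumulative loss.
            if rng.random() < eps:
                a = rng.randrange(len(ACTIONS))
            else:
                a = min(range(len(ACTIONS)), key=lambda i: q[s][i])
            s_next = rng.randrange(n_states)
            # Loss-minimizing Bellman target (min instead of max).
            target = cost[s][a] + gamma * min(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q


def greedy_policy(q):
    """Return the lowest-loss action name for every state."""
    return [ACTIONS[min(range(len(ACTIONS)), key=lambda i: row[i])]
            for row in q]


# Hypothetical losses: state 0 favors local, 1 favors edge, 2 favors cloud.
cost = [[1.0, 4.0, 6.0],
        [5.0, 2.0, 6.0],
        [7.0, 5.0, 3.0]]
print(greedy_policy(q_learning_offload(cost)))
```

Because the objective is a loss rather than a reward, the update uses a minimum over next-state values; a reward-maximizing formulation would use a maximum instead.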
Artificial intelligence approaches are considered promising techniques for enabling smart communications in future wireless networks. In this paper, we investigate a deep learning based resource allocation approach for secure transmission in a simultaneous wireless information and power transfer (SWIPT) network. In particular, we design the resource allocations to maximize the minimum achievable secrecy rate of the legitimate user under the constraints of the energy harvesting requirements of the energy receivers (ERs). Conventionally, optimal or suboptimal solutions of resource allocation problems are obtained with convex optimization approaches, which are often based on iterative algorithms and result in long computational times. To satisfy ultra-low latency demands and achieve physical layer security for future SWIPT systems, we develop a deep neural network (DNN) based approach that optimizes the power allocations for a SWIPT network with significantly reduced computational time and complexity. Numerical results illustrate the effectiveness of the proposed DNN-based approach, which achieves near-optimal secrecy rate performance compared with the convex optimization approach.
Machine learning algorithms, including recent advances in deep learning, are promising tools for the detection and classification of broadband high frequency signals in passive acoustic recordings. However, these methods are generally data-hungry, and progress has been limited by challenges related to the lack of labeled datasets adequate for training and testing. Large quantities of known and as yet unidentified broadband signal types mingle in marine recordings, with variability introduced by acoustic propagation, source depths and orientations, and interacting signals. Manual classification of these datasets is unmanageable without an in-depth knowledge of the acoustic context of each recording location. A signal classification pipeline is presented which combines unsupervised and supervised learning phases with opportunities for expert oversight to label signals of interest. The method is illustrated with a case study using unsupervised clustering to identify five toothed whale echolocation click types and two anthropogenic signal categories. These categories are used to train a deep network to classify detected signals in either averaged time bins or as individual detections, in two independent datasets. Bin-level classification achieved higher overall precision (>99%) than click-level classification. However, click-level classification had the advantage of providing a label for every signal, and achieved higher overall recall, with overall precision from 92 to 94%. The results suggest that unsupervised learning is a viable solution for efficiently generating the large, representative training sets needed for applications of deep learning in passive acoustics.
In this paper, the impact on fuzzy ARTMAP performance of decisions taken for batch supervised learning is assessed through computer simulation. By learning different real-world and synthetic data, using different learning strategies, training set sizes, and hyper-parameter values, the generalization error and resource requirements of this neural network are compared. In particular, the degradation of fuzzy ARTMAP performance due to overtraining is shown to depend on factors such as the training set size and the number of training epochs, and to occur for pattern recognition problems in which class distributions overlap. Although the hold-out learning strategy is commonly employed to avoid overtraining, results indicate that it is not necessarily justified. As an alternative, a new Particle Swarm Optimization (PSO) learning strategy, based on the concept of neural network evolution, has been introduced. It jointly determines the weights, architecture and hyper-parameters such that generalization error is minimized. Through a comprehensive set of simulations, it has been shown that when fuzzy ARTMAP uses this strategy, it produces a significantly lower generalization error and mitigates the degradation of error due to overtraining. Overall, the results reveal the importance of optimizing all fuzzy ARTMAP parameters for a given problem, using a consistent objective function.
Conference Paper
Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or "deep," structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented, edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear ("contour") features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex "corner" features matches well with the results from Ito & Komatsu's study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling higher-order features.
Conference Paper
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data. In this paper, we apply convolutional deep belief networks to audio data and empirically evaluate them on various audio classification tasks. In the case of speech data, we show that the learned features correspond to phones/phonemes. In addition, our feature representations learned from unlabeled audio data show very good performance for multiple audio classification tasks. We hope that this paper will inspire more research on deep learning approaches applied to a wide range of audio recognition tasks.
Conference Paper
For many pattern recognition tasks, the ideal input feature would be invariant to multiple confounding properties (such as illumination and viewing angle, in computer vision applications). Recently, deep architectures trained in an unsupervised manner have been proposed as an automatic method for extracting useful features. However, it is difficult to evaluate the learned features by any means other than using them in a classifier. In this paper, we propose a number of empirical tests that directly measure the degree to which these learned features are invariant to different input transformations. We find that stacked autoencoders learn modestly increasingly invariant features with depth when trained on natural images. We find that convolutional deep belief networks learn substantially more invariant features in each layer. These results further justify the use of "deep" vs. "shallower" representations, but suggest that mechanisms beyond merely stacking one autoencoder on top of another may be important for achieving invariance. Our evaluation metrics can also be used to evaluate future work in deep learning, and thus help the development of future algorithms.
Conference Paper
We introduce a new class of probabilistic latent variable model called the Implicit Mixture of Conditional Restricted Boltzmann Machines (imCRBM) for use in human pose tracking. Key properties of the imCRBM are as follows: (1) learning is linear in the number of training exemplars so it can be learned from large datasets; (2) it learns coherent models of multiple activities; (3) it automatically discovers atomic "movemes"; and (4) it can infer transitions between activities, even when such transitions are not present in the training set. We describe the model and how it is learned and we demonstrate its use in the context of Bayesian filtering for multi-view and monocular pose tracking. The model handles difficult scenarios including multiple activities and transitions among activities. We report state-of-the-art results on the HumanEva dataset.
Conference Paper
Previous work has shown that the difficulties in learning deep generative or discriminative models can be overcome by an initial unsupervised learning step that maps inputs to useful intermediate representations. We introduce and motivate a new training principle for unsupervised learning of a representation based on the idea of making the learned representations robust to partial corruption of the input pattern. This approach can be used to train autoencoders, and these denoising autoencoders can be stacked to initialize deep architectures. The algorithm can be motivated from a manifold learning and information theoretic perspective or from a generative model perspective. Comparative experiments clearly show the surprising advantage of corrupting the input of autoencoders on a pattern classification benchmark suite.
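The training principle in this abstract (reconstruct the clean input from a partially corrupted copy) can be illustrated with a minimal NumPy sketch. The abstract only says "partial corruption", so the specific choices below are assumptions for illustration: masking noise that zeroes a random fraction of components, a single sigmoid hidden layer with tied weights, a cross-entropy reconstruction loss, and the names `corrupt` and `train_dae`.

```python
import numpy as np


def corrupt(x, frac, rng):
    """Masking noise: zero out roughly `frac` of the input components."""
    keep = rng.random(x.shape) >= frac
    return x * keep


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_dae(x, n_hidden=8, frac=0.25, lr=0.5, epochs=300, seed=0):
    """Train a tied-weight denoising autoencoder with batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = rng.normal(0.0, 0.1, size=(d, n_hidden))
    b_hid = np.zeros(n_hidden)
    b_out = np.zeros(d)
    losses = []
    for _ in range(epochs):
        x_tilde = corrupt(x, frac, rng)          # encoder sees corrupted input
        h = sigmoid(x_tilde @ w + b_hid)
        x_hat = sigmoid(h @ w.T + b_out)
        # The loss compares the reconstruction with the CLEAN input.
        loss = -np.mean(np.sum(x * np.log(x_hat + 1e-9)
                               + (1 - x) * np.log(1 - x_hat + 1e-9), axis=1))
        losses.append(float(loss))
        d_out = (x_hat - x) / n                  # gradient at output pre-activation
        d_hid = (d_out @ w) * h * (1.0 - h)      # gradient at hidden pre-activation
        # Tied weights: sum the encoder and decoder contributions.
        w -= lr * (x_tilde.T @ d_hid + d_out.T @ h)
        b_hid -= lr * d_hid.sum(axis=0)
        b_out -= lr * d_out.sum(axis=0)
    return w, b_hid, b_out, losses


# Toy binary data: 100 copies of 4 random prototype patterns (hypothetical).
data_rng = np.random.default_rng(1)
protos = (data_rng.random((4, 16)) > 0.5).astype(float)
x = np.repeat(protos, 25, axis=0)
w, b_hid, b_out, losses = train_dae(x)
print(f"reconstruction loss: {losses[0]:.2f} -> {losses[-1]:.2f}")
```

The only difference from an ordinary autoencoder is the call to `corrupt` before encoding; setting `frac=0.0` recovers the plain autoencoder for comparison.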
Conference Paper
The Conditional Restricted Boltzmann Machine (CRBM) is a recently proposed model for time series that has a rich, distributed hidden state and permits simple, exact inference. We present a new model, based on the CRBM, that preserves its most important computational properties and includes multiplicative three-way interactions that allow the effective interaction weight between two units to be modulated by the dynamic state of a third unit. We factorize the three-way weight tensor implied by the multiplicative model, reducing the number of parameters from O(N^3) to O(N^2). The result is an efficient, compact model whose effectiveness we demonstrate by modeling human motion. Like the CRBM, our model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve the model's ability to blend motion styles or to transition smoothly between them.
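The parameter reduction mentioned in this abstract can be made concrete with a small sketch. The abstract does not state the exact factorization, so the form assumed here is the common one in which each tensor entry is a sum over F factors, W[i, j, k] = sum_f A[i, f] * B[j, f] * C[k, f]; the names `full_tensor`, `factored_interaction` and `param_counts` are hypothetical. With F on the order of N, the three factor matrices hold 3NF parameters, i.e. O(N^2) instead of O(N^3).

```python
import numpy as np


def full_tensor(a, b, c):
    """Materialize W[i, j, k] = sum_f a[i, f] * b[j, f] * c[k, f]."""
    return np.einsum("if,jf,kf->ijk", a, b, c)


def factored_interaction(a, b, c, x, y, z):
    """Compute sum_ijk W[i,j,k] x[i] y[j] z[k] without building W.

    Each unit vector is projected onto the F factors first, so the cost
    is O(N * F) rather than O(N^3).
    """
    return float(np.sum((x @ a) * (y @ b) * (z @ c)))


def param_counts(n, f):
    """Return (full tensor parameters, factored parameters)."""
    return n ** 3, 3 * n * f


rng = np.random.default_rng(0)
n, f = 5, 3
a, b, c = (rng.normal(size=(n, f)) for _ in range(3))
x, y, z = (rng.normal(size=n) for _ in range(3))

# The factored form agrees with the explicit three-way tensor.
explicit = np.einsum("ijk,i,j,k->", full_tensor(a, b, c), x, y, z)
print(np.isclose(factored_interaction(a, b, c, x, y, z), explicit))
print(param_counts(100, 100))
```

The third vector `z` plays the role of the modulating unit states: fixing `z` rescales each factor, which changes the effective pairwise weights between the other two groups of units.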