Maxout Networks
Ian J. Goodfellow goodfeli@iro.umontreal.ca
David Warde-Farley wardefar@iro.umontreal.ca
Mehdi Mirza mirzamom@iro.umontreal.ca
Aaron Courville aaron.courville@umontreal.ca
Yoshua Bengio yoshua.bengio@umontreal.ca
Département d'Informatique et de Recherche Opérationnelle, Université de Montréal
2920, chemin de la Tour, Montréal, Québec, Canada, H3T 1J8
Abstract
We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.
1. Introduction
Dropout (Hinton et al., 2012) provides an inexpensive and simple means of both training a large ensemble of models that share parameters and approximately averaging together these models' predictions. Dropout applied to multilayer perceptrons and deep convolutional networks has improved the state of the art on tasks ranging from audio classification to very large scale object recognition (Hinton et al., 2012; Krizhevsky et al., 2012). While dropout is known to work well in practice, it has not previously been demonstrated to actually perform model averaging for deep architectures.¹
¹ Between submission and publication of this paper, we have learned that Srivastava (2013) performed experiments on this subject similar to ours.
Dropout is generally viewed as an indiscriminately ap-
plicable tool that reliably yields a modest improvement
in performance when applied to almost any model.
We argue that rather than using dropout as a slight
performance enhancement applied to arbitrary mod-
els, the best performance may be obtained by directly
designing a model that enhances dropout’s abilities as
a model averaging technique. Training using dropout
differs significantly from previous approaches such as
ordinary stochastic gradient descent. Dropout is most
effective when taking relatively large steps in param-
eter space. In this regime, each update can be seen
as making a significant update to a different model on
a different subset of the training set. The ideal oper-
ating regime for dropout is when the overall training
procedure resembles training an ensemble with bag-
ging under parameter sharing constraints. This differs
radically from the ideal stochastic gradient operating
regime in which a single model makes steady progress
via small steps. Another consideration is that dropout
model averaging is only an approximation when ap-
plied to deep models. Explicitly designing models to
minimize this approximation error may thus enhance
dropout’s performance as well.
We propose a simple model that we call maxout that
has beneficial characteristics both for optimization and
model averaging with dropout. We use this model in
conjunction with dropout to set the state of the art on
four benchmark datasets.²

² Code and hyperparameters available at http://www-etud.iro.umontreal.ca/~goodfeli/maxout.html
2. Review of dropout
Dropout is a technique that can be applied to deter-
ministic feedforward architectures that predict an out-
put y given input vector v. These architectures contain
a series of hidden layers h = {h^(1), . . . , h^(L)}. Dropout
trains an ensemble of models consisting of the set of
all models that contain a subset of the variables in
both v and h. The same set of parameters θ is used to parameterize a family of distributions p(y|v; θ, µ), where µ ∈ M is a binary mask determining which vari-
ables to include in the model. On each presentation of
a training example, we train a different sub-model by
following the gradient of log p(y|v;θ, µ) for a different
randomly sampled µ. For many parameterizations of p
(such as most multilayer perceptrons) the instantiation
of different sub-models p(y|v; θ, µ) can be obtained by elementwise multiplication of v and h with the mask
µ. Dropout training is similar to bagging (Breiman,
1994), where many different models are trained on dif-
ferent subsets of the data. Dropout training differs
from bagging in that each model is trained for only
one step and all of the models share parameters. For
this training procedure to behave as if it is training an
ensemble rather than a single model, each update must
have a large effect, so that it makes the sub-model in-
duced by that µ fit the current input v well.
The functional form becomes important when it comes
time for the ensemble to make a prediction by aver-
aging together all the sub-models’ predictions. Most
prior work on bagging averages with the arithmetic
mean, but it is not obvious how to do so with the
exponentially many models trained by dropout. For-
tunately, some model families yield an inexpensive ge-
ometric mean. When p(y|v; θ) = softmax(v^T W + b), the predictive distribution defined by renormalizing the geometric mean of p(y|v; θ, µ) over M is simply given by softmax(v^T W/2 + b). In other words, the aver-
age prediction of exponentially many sub-models can
be computed simply by running the full model with
the weights divided by 2. This result holds exactly
in the case of a single layer softmax model. Previous
work on dropout applies the same scheme in deeper ar-
chitectures, such as multilayer perceptrons, where the
W/2 method is only an approximation to the geometric
mean. The approximation has not been characterized
mathematically, but performs well in practice.
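This single-layer exactness is easy to verify numerically. The following minimal numpy sketch (ours, not from the paper; it assumes an inclusion probability of 1/2 and enumerates every mask of a small input) compares the renormalized geometric mean over all sub-models with the softmax(v^T W/2 + b) prediction.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
d, n_classes = 6, 4                      # small input so we can enumerate all 2^d masks
v = rng.normal(size=d)
W = rng.normal(size=(d, n_classes))
b = rng.normal(size=n_classes)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Renormalized geometric mean of the predictions of all 2^d sub-models (one per mask).
log_probs = []
for mask in product([0.0, 1.0], repeat=d):
    mu = np.array(mask)
    log_probs.append(np.log(softmax((mu * v) @ W + b)))
geo = np.exp(np.mean(log_probs, axis=0))
geo /= geo.sum()

# Dropout's cheap approximation: run the full model with the weights halved.
halved = softmax(v @ W / 2 + b)

print(np.max(np.abs(geo - halved)))      # effectively zero: exact for a single softmax layer
```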
3. Description of maxout
The maxout model is simply a feed-forward architec-
ture, such as a multilayer perceptron or deep convo-
lutional neural network, that uses a new type of ac-
tivation function: the maxout unit. Given an input
x ∈ R^d (x may be v, or may be a hidden layer's state), a maxout hidden layer implements the function

    h_i(x) = max_{j ∈ [1, k]} z_ij

where z_ij = x^T W_{···ij} + b_ij, and W ∈ R^{d×m×k} and b ∈ R^{m×k} are learned parameters. In a convolutional
network, a maxout feature map can be constructed
by taking the maximum across k affine feature maps (i.e., pool across channels, in addition to spatial loca-
tions). When training with dropout, we perform the
elementwise multiplication with the dropout mask im-
mediately prior to the multiplication by the weights in
all cases; we do not drop inputs to the max operator.
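As a concrete reference, here is a minimal numpy sketch (ours, not the paper's implementation) of the dense maxout layer defined above, with the dropout mask applied to the input before the affine transform and never inside the max.

```python
import numpy as np

def maxout_layer(x, W, b, dropout_mask=None):
    """Forward pass of a dense maxout layer.

    x: input of shape (d,); W: weights of shape (d, m, k); b: biases of shape (m, k).
    Returns m maxout activations, each the max over its k affine pieces."""
    if dropout_mask is not None:
        # Drop inputs *before* the affine transform, never inside the max.
        x = x * dropout_mask
    z = np.einsum('d,dmk->mk', x, W) + b   # z[i, j] = x . W[:, i, j] + b[i, j]
    return z.max(axis=1)                   # h[i] = max_j z[i, j]

rng = np.random.default_rng(0)
d, m, k = 8, 5, 3
x = rng.normal(size=d)
W = rng.normal(size=(d, m, k))
b = rng.normal(size=(m, k))
mask = (rng.random(d) < 0.5).astype(x.dtype)
print(maxout_layer(x, W, b, dropout_mask=mask))   # shape (5,)
```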
A single maxout unit can be interpreted as making a
piecewise linear approximation to an arbitrary convex
function. Maxout networks learn not just the rela-
tionship between hidden units, but also the activation
function of each hidden unit. See Fig. 1for a graphical
depiction of how this works.
Figure 1. Graphical depiction of how the maxout activa-
tion function can implement the rectified linear, absolute
value rectifier, and approximate the quadratic activation
function. This diagram is 2D and only shows how max-
out behaves with a 1D input, but in multiple dimensions a
maxout unit can approximate arbitrary convex functions.
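The special cases in Figure 1 can be written out explicitly. In the sketch below (numpy; the helper maxout_unit and the choice of tangent-line pieces are ours), specific affine pieces make a single maxout unit reproduce the rectifier and the absolute value exactly, and approximate a quadratic with a modest number of pieces.

```python
import numpy as np

def maxout_unit(x, w, b):
    # One maxout unit with k affine pieces: h(x) = max_j (w[j] * x + b[j])
    return np.max(np.outer(x, w) + b, axis=1)

x = np.linspace(-3, 3, 7)

relu_like = maxout_unit(x, w=np.array([0.0, 1.0]), b=np.zeros(2))   # max(0, x)
abs_like  = maxout_unit(x, w=np.array([-1.0, 1.0]), b=np.zeros(2))  # |x|

# A crude piecewise-linear fit to x**2 using k tangent lines y = 2c*x - c**2.
centers = np.linspace(-3, 3, 9)
quad_like = maxout_unit(x, w=2 * centers, b=-centers**2)

print(np.allclose(relu_like, np.maximum(0, x)))   # True
print(np.allclose(abs_like, np.abs(x)))           # True
print(np.max(np.abs(quad_like - x**2)))           # small; shrinks as the number of pieces grows
```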
Maxout abandons many of the mainstays of traditional
activation function design. The representation it pro-
duces is not sparse at all (see Fig. 2), though the
gradient is highly sparse and dropout will artificially
sparsify the effective representation during training.
While maxout may learn to saturate on one side or
the other, this is a measure zero event (so it is almost
never bounded from above). While a significant pro-
portion of parameter space corresponds to the function
being bounded from below, maxout is not constrained
to learn to be bounded at all. Maxout is locally lin-
ear almost everywhere, while many popular activation
functions have significant curvature. Given all of these
departures from standard practice, it may seem sur-
prising that maxout activation functions work at all,
but we find that they are very robust and easy to train
with dropout, and achieve excellent performance.
4. Maxout is a universal approximator
A standard MLP with enough hidden units is a uni-
versal approximator. Similarly, maxout networks are
universal approximators. Provided that each individ-
ual maxout unit may have arbitrarily many affine com-
ponents, we show that a maxout model with just two
hidden units can approximate, arbitrarily well, any
[Figure 2 plot: histogram of maxout responses (number of occurrences vs. activation).]
Figure 2. The activations of maxout units are not sparse.
[Figure 3 diagram: the input v feeds two groups of affine units z1,· and z2,·, which are pooled into maxout units h1 and h2; a final layer with weights W1 and W2 combines them into the output g.]
Figure 3. An MLP containing two maxout units can arbi-
trarily approximate any continuous function. The weights
in the final layer can set g to be the difference of h1 and h2. If z1 and z2 are allowed to have arbitrarily high cardinality, h1 and h2 can approximate any convex function. g can
thus approximate any continuous function due to being a
difference of approximations of arbitrary convex functions.
continuous function of v ∈ R^n. A diagram illustrating
the basic idea of the proof is presented in Fig. 3.
Consider the continuous piecewise linear (PWL) func-
tion g(v) consisting of k locally affine regions on R^n.
Proposition 4.1 (From Theorem 2.1 in (Wang, 2004)) For any positive integers m and n, there exist two groups of (n + 1)-dimensional real-valued parameter vectors [W_1j, b_1j], j ∈ [1, k] and [W_2j, b_2j], j ∈ [1, k] such that:

    g(v) = h1(v) − h2(v)        (1)

That is, any continuous PWL function can be expressed as a difference of two convex PWL functions. The proof is given in (Wang, 2004).
Proposition 4.2 From the Stone–Weierstrass approximation theorem, let C ⊂ R^n be a compact domain, f : C → R be a continuous function, and ε > 0 be any positive real number. Then there exists a continuous PWL function g (depending upon ε), such that for all v ∈ C, |f(v) − g(v)| < ε.
Theorem 4.3 Universal approximator theorem. Any
continuous function f can be approximated arbitrarily well on a compact domain C ⊂ R^n by a maxout network with two maxout hidden units.
Table 1. Test set misclassification rates for the best methods on the permutation invariant MNIST dataset. Only methods that are regularized by modeling the input distribution outperform the maxout MLP.

Method                                              Test error
Rectifier MLP + dropout (Srivastava, 2013)          1.05%
DBM (Salakhutdinov & Hinton, 2009)                  0.95%
Maxout MLP + dropout                                0.94%
MP-DBM (Goodfellow et al., 2013)                    0.91%
Deep Convex Network (Yu & Deng, 2011)               0.83%
Manifold Tangent Classifier (Rifai et al., 2011)    0.81%
DBM + dropout (Hinton et al., 2012)                 0.79%
Sketch of Proof By Proposition 4.2, any continuous
function can be approximated arbitrarily well (up to ε) by a piecewise linear function. We now note that the representation of piecewise linear functions given in Proposition 4.1 exactly matches a maxout network with two hidden units h1(v) and h2(v), with sufficiently large k to achieve the desired degree of approximation ε. Combining these, we conclude that a two hidden unit maxout network can approximate any continuous function f(v) arbitrarily well on the compact domain C. In general, as ε → 0, we have k → ∞.
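The construction behind the proof can be made concrete in one dimension. The sketch below (numpy; all helper names and the choice of target function are ours) interpolates a continuous function with a PWL function g, splits g into two convex PWL functions by separating the positive and negative slope changes, and realizes each as a single maxout unit, so that the two-unit difference reproduces g.

```python
import numpy as np

def maxout_unit(x, a, b):
    # One maxout unit: the max over its affine pieces a_j * x + b_j.
    return np.max(np.outer(x, a) + b, axis=1)

def convex_pwl_to_pieces(xk, yk):
    # A convex piecewise-linear function equals the max of its segment lines.
    a = np.diff(yk) / np.diff(xk)
    return a, yk[:-1] - a * xk[:-1]

# Target: an arbitrary continuous function on a compact interval.
f, x_min, x_max = np.sin, 0.0, 2 * np.pi
xk = np.linspace(x_min, x_max, 60)            # knots of the PWL interpolant g of f
yk = f(xk)
s = np.diff(yk) / np.diff(xk)                 # slopes of g's segments
c = np.diff(s)                                # slope changes at the interior knots

def convex_part(coeffs, base_slope, base_y):
    # base_y + base_slope*(x - x0) + sum_i coeffs_i * max(0, x - x_i); convex if coeffs >= 0.
    def h(x):
        return base_y + base_slope * (x - xk[0]) + np.maximum(0.0, x[:, None] - xk[1:-1]) @ coeffs
    return h

h1 = convex_part(np.maximum(c, 0.0), s[0], yk[0])   # convex part of g
h2 = convex_part(np.maximum(-c, 0.0), 0.0, 0.0)     # negated concave part of g (also convex)

# Realize h1 and h2 each as a single maxout unit, so the network output is h1 - h2 = g.
a1, b1 = convex_pwl_to_pieces(xk, h1(xk))
a2, b2 = convex_pwl_to_pieces(xk, h2(xk))

grid = np.linspace(x_min, x_max, 1000)
g_hat = maxout_unit(grid, a1, b1) - maxout_unit(grid, a2, b2)
print(np.max(np.abs(g_hat - f(grid))))                # small; shrinks as the number of knots grows
print(np.max(np.abs(g_hat - (h1(grid) - h2(grid)))))  # ~0: the maxout units reproduce h1 and h2
```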
Figure 4. Example filters learned by a maxout MLP
trained with dropout on MNIST. Each row contains the
filters whose responses are pooled to form a maxout unit.
5. Benchmark results
We evaluated the maxout model on four benchmark
datasets and set the state of the art on all of them.
Table 2. Test set misclassification rates for the best methods on the general MNIST dataset, excluding methods that augment the training data.

Method                                            Test error
2-layer CNN + 2-layer NN (Jarrett et al., 2009)   0.53%
Stochastic pooling (Zeiler & Fergus, 2013)        0.47%
Conv. maxout + dropout                            0.45%
5.1. MNIST
The MNIST (LeCun et al., 1998) dataset consists of 28 × 28 pixel greyscale images of handwritten digits 0-9,
with 60,000 training and 10,000 test examples. For the
permutation invariant version of the MNIST task, only
methods unaware of the 2D structure of the data are
permitted. For this task, we trained a model consist-
ing of two densely connected maxout layers followed
by a softmax layer. We regularized the model with
dropout and by imposing a constraint on the norm of
each weight vector, as in (Srebro & Shraibman, 2005).
Apart from the maxout units, this is the same archi-
tecture used by Hinton et al. (2012). We selected the
hyperparameters by minimizing the error on a valida-
tion set consisting of the last 10,000 training examples.
To make use of the full training set, we recorded the
value of the log likelihood on the first 50,000 exam-
ples at the point of minimal validation error. We then
continued training on the full 60,000 example train-
ing set until the validation set log likelihood matched
this number. We obtained a test set error of 0.94%,
which is the best result we are aware of that does not
use unsupervised pretraining. We summarize the best
published results on permutation invariant MNIST in
Table 1.
We also considered the MNIST dataset without the
permutation invariance restriction. In this case, we
used three convolutional maxout hidden layers (with
spatial max pooling on top of the maxout layers) fol-
lowed by a densely connected softmax layer. We were
able to rapidly explore hyperparameter space thanks
to the extremely fast GPU convolution library devel-
oped by Krizhevsky et al. (2012). We obtained a test
set error rate of 0.45%, which sets a new state of the
art in this category. (It is possible to get better results
on MNIST by augmenting the dataset with transfor-
mations of the standard set of images (Ciresan et al.,
2010).) A summary of the best methods on the general
MNIST dataset is provided in Table 2.
Table 3. Test set misclassification rates for the best methods on the CIFAR-10 dataset.

Method                                                      Test error
Stochastic pooling (Zeiler & Fergus, 2013)                  15.13%
CNN + Spearmint (Snoek et al., 2012)                        14.98%
Conv. maxout + dropout                                      11.68%
CNN + Spearmint + data augmentation (Snoek et al., 2012)     9.50%
Conv. maxout + dropout + data augmentation                   9.38%
5.2. CIFAR-10
The CIFAR-10 dataset (Krizhevsky & Hinton, 2009) consists of 32 × 32 color images drawn from 10 classes
split into 50,000 train and 10,000 test images. We pre-
process the data using global contrast normalization
and ZCA whitening.
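For reference, a minimal numpy sketch of this preprocessing is given below; it is our own approximation of a standard global contrast normalization and ZCA whitening pipeline, and the scale and regularization constants are conventional choices rather than values taken from the paper.

```python
import numpy as np

def global_contrast_normalize(X, scale=55.0, eps=1e-8):
    """Subtract each image's mean and rescale it to (roughly) unit norm.
    X: (n_examples, n_pixels) flattened images. scale=55 is a common CIFAR-10
    convention, not necessarily the value used in the paper."""
    X = X - X.mean(axis=1, keepdims=True)
    norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
    return scale * X / np.maximum(norms, eps)

def zca_whiten(X, eps=1e-2):
    """ZCA whitening: rotate into the PCA basis, rescale, rotate back,
    so that whitened images remain in pixel space."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return Xc @ W, (mean, W)

# Example with random stand-in data shaped like small color patches (8 * 8 * 3 pixels).
rng = np.random.default_rng(0)
X = rng.random((512, 192))
Xw, _ = zca_whiten(global_contrast_normalize(X))
```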
We follow a similar procedure as with the MNIST
dataset, with one change. On MNIST, we find the
best number of training epochs in terms of validation
set error, then record the training set log likelihood
and continue training using the entire training set un-
til the validation set log likelihood has reached this
value. On CIFAR-10, continuing training in this fash-
ion is infeasible because the final value of the learn-
ing rate is very small and the validation set error is
very high. Training until the validation set likelihood
matches the cross-validated value of the training like-
lihood would thus take prohibitively long. Instead, we
retrain the model from scratch, and stop when the new
likelihood matches the old one.
Our best model consists of three convolutional maxout
layers, a fully connected maxout layer, and a fully con-
nected softmax layer. Using this approach we obtain a
test set error of 11.68%, which improves upon the state
of the art by over two percentage points. (If we do not
train on the validation set, we obtain a test set error
of 13.2%, which also improves over the previous state
of the art). If we additionally augment the data with
translations and horizontal reflections, we obtain the
absolute state of the art on this task at 9.35% error.
In this case, the likelihood during the retrain never
reaches the likelihood from the validation run, so we
retrain for the same number of epochs as the valida-
tion run. A summary of the best CIFAR-10 methods
is provided in Table 3.
[Figure 5 plot: CIFAR-10 train and validation error vs. number of training examples, with and without dropout.]
Figure 5. When training maxout, the improvement in val-
idation set error that results from using dropout is dra-
matic. Here we find a greater than 25% reduction in our
validation set error on CIFAR-10.
Table 4. Test set misclassification rates for the best methods on the CIFAR-100 dataset.

Method                                        Test error
Learned pooling (Malinowski & Fritz, 2013)    43.71%
Stochastic pooling (Zeiler & Fergus, 2013)    42.51%
Conv. maxout + dropout                        38.57%
5.3. CIFAR-100
The CIFAR-100 (Krizhevsky & Hinton, 2009) dataset
is the same size and format as the CIFAR-10 dataset,
but contains 100 classes, with only one tenth as many
labeled examples per class. Due to lack of time we
did not extensively cross-validate hyperparameters on
CIFAR-100 but simply applied hyperparameters we
found to work well on CIFAR-10. We obtained a test
set error of 38.57%, which is state of the art. If we do
not retrain using the entire training set, we obtain a
test set error of 41.48%, which also surpasses the cur-
rent state of the art. A summary of the best methods
on CIFAR-100 is provided in Table 4.
5.4. Street View House Numbers
The SVHN (Netzer et al., 2011) dataset consists of
color images of house numbers collected by Google
Street View. The dataset comes in two formats. We
consider the second format, in which each image is of
size 32 × 32 and the task is to classify the digit in
the center of the image. Additional digits may appear
beside it but must be ignored. There are 73,257 digits
in the training set, 26,032 digits in the test set and
531,131 additional, somewhat less difficult examples,
Table 5. Test set misclassification rates for the best methods on the SVHN dataset.

Method                                                            Test error
(Sermanet et al., 2012a)                                          4.90%
Stochastic pooling (Zeiler & Fergus, 2013)                        2.80%
Rectifiers + dropout (Srivastava, 2013)                           2.78%
Rectifiers + dropout + synthetic translation (Srivastava, 2013)   2.68%
Conv. maxout + dropout                                            2.47%
to use as an extra training set. Following Sermanet
et al. (2012b), to build a validation set, we select 400
samples per class from the training set and 200 sam-
ples per class from the extra set. The remaining digits
of the train and extra sets are used for training.
For SVHN, we did not train on the validation set at
all. We used it only to find the best hyperparameters.
We applied local contrast normalization preprocessing
the same way as Zeiler & Fergus (2013). Otherwise,
we followed the same approach as on MNIST. Our best
model consists of three convolutional maxout hidden
layers and a densely connected maxout layer followed
by a densely connected softmax layer. We obtained a
test set error rate of 2.47%, which sets the state of the
art. A summary of comparable methods is provided
in Table 5.
6. Comparison to rectifiers
One obvious question about our results is whether we
obtained them by improved preprocessing or larger
models, rather than by the use of maxout. For MNIST
we used no preprocessing, and for SVHN, we used the
same preprocessing as Zeiler & Fergus (2013). How-
ever on the CIFAR datasets we did use a new form of
preprocessing. We therefore compare maxout to rec-
tifiers run with the same preprocessing and a variety of
model sizes on this dataset.
By running a large cross-validation experiment (see
Fig. 6) we found that maxout offers a clear im-
provement over rectifiers. We also found that our
preprocessing and size of models improves rectifiers
and dropout beyond the previous state of the art re-
sult. Cross-channel pooling is a method for reducing
the size of state and number of parameters needed to
have a given number of filters in the model. Perfor-
mance seems to correlate well with the number of fil-
ters for maxout but with the number of output units
for rectifiers, i.e., rectifier units do not benefit much
from cross-channel pooling. Rectifier units do best
without cross-channel pooling but with the same num-
ber of filters, meaning that the size of the state and the
number of parameters must be about ktimes higher
for rectifiers to obtain generalization performance ap-
proaching that of maxout.
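The bookkeeping behind this comparison can be made explicit. The sketch below (Python; the layer sizes are illustrative, not the paper's) counts parameters and output state for one dense layer in each of the four configurations compared in Figure 6.

```python
# Illustrative parameter and state counts for one dense layer in each of the four
# configurations cross-validated in Figure 6 (sizes are ours, not the paper's).
d = 784      # layer input dimension
m = 240      # number of pooled output units
k = 5        # pieces per maxout unit / cross-channel pool size

def dense_params(n_in, n_out):
    return n_in * n_out + n_out                     # weights + biases

configs = {
    "maxout":                      (dense_params(d, m * k), m),      # m*k filters -> m outputs
    "rectifier + channel pooling": (dense_params(d, m * k), m),      # same filters, ReLU then pool
    "rectifier, same #units":      (dense_params(d, m),     m),      # k times fewer filters
    "large rectifier, no pooling": (dense_params(d, m * k), m * k),  # same filters, k times more outputs
}
for name, (params, state) in configs.items():
    print(f"{name:30s} params={params:9d} state={state:5d}")
# The large rectifier's k-fold larger output also multiplies the next layer's input
# (and hence its parameter count) by roughly k, as noted in Figure 6's caption.
```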
[Figure 6 plot: validation set error for the best experiment vs. training epochs, for maxout, rectifier with channel pooling, rectifier without channel pooling, and a large rectifier without channel pooling.]
Figure 6. We cross-validated the momentum and learn-
ing rate for four model architectures: 1) Medium-sized
maxout network. 2) Rectifier network with cross-channel
pooling, and exactly the same number of parameters and
units as the maxout network. 3) Rectifier network without
cross-channel pooling, and the same number of units as
the maxout network (thus fewer parameters). 4) Rectifier
network without cross-channel pooling, but with k times as many units as the maxout network. Because making layer i have k times more outputs increases the number of inputs to layer i + 1, this network has roughly k times
more parameters than the maxout network, and requires
significantly more memory and runtime. We sampled 10
learning rate and momentum schedules and random seeds
for dropout, then ran each configuration for all 4 architec-
tures. Each curve terminates after failing to improve the
validation error in the last 100 epochs.
7. Model averaging
Having demonstrated that maxout networks are effec-
tive models, we now analyze the reasons for their suc-
cess. We first identify reasons that maxout is highly
compatible with dropout’s approximate model averag-
ing technique.
The intuitive justification for averaging sub-models by
dividing the weights by 2 given by (Hinton et al., 2012)
is that this does exact model averaging for a single
layer model, softmax regression. To this characteriza-
tion, we add the observation that the model averaging
remains exact if the model is extended to multiple lin-
ear layers. While this has the same representational
power as a single layer, the expression of the weights
as a product of several matrices could have a differ-
[Figure 7 plot: MNIST test error vs. number of sampled sub-models, for sampling-based averaging and the W/2 approximation, with maxout and tanh networks.]
Figure 7. The error rate of the prediction obtained by sam-
pling several sub-models and taking the geometric mean of
their predictions approaches the error rate of the predic-
tion made by dividing the weights by 2. However, the
divided weights still obtain the best test error, suggesting
that dropout is a good approximation to averaging over a
very large number of models. Note that the correspondence
is more clear in the case of maxout.
ent inductive bias. More importantly, it indicates that
dropout does exact model averaging in deeper archi-
tectures provided that they are locally linear among
the space of inputs to each layer that are visited by
applying different dropout masks.
We argue that dropout training encourages maxout
units to have large linear regions around inputs that
appear in the training data. Because each sub-model
must make a good prediction of the output, each unit
should learn to have roughly the same activation re-
gardless of which inputs are dropped. In a maxout
network with arbitrarily selected parameters, varying
the dropout mask will often move the effective in-
puts far enough to escape the local region surround-
ing the clean inputs in which the hidden units are lin-
ear, i.e., changing the dropout mask could frequently
change which piece of the piecewise function an input
is mapped to. Maxout trained with dropout may have
the identity of the maximal filter in each unit change
relatively rarely as the dropout mask changes. Net-
works of linear operations and max(·) may learn to
exploit dropout’s approximate model averaging tech-
nique well.
Many popular activation functions have significant
curvature nearly everywhere. These observations sug-
gest that the approximate model averaging of dropout
will not be as accurate for networks incorporating such
activation functions. To test this, we compared the
best maxout model trained on MNIST with dropout
to a hyperbolic tangent network trained on MNIST
[Figure 8 plot: KL divergence between the two model averaging strategies vs. number of samples, for maxout and tanh networks.]
Figure 8. The KL divergence between the distribution pre-
dicted using the dropout technique of dividing the weights
by 2 and the distribution obtained by taking the geomet-
ric mean of the predictions of several sampled models de-
creases as the number of samples increases. This suggests
that dropout does indeed do model averaging, even for deep
networks. The approximation is more accurate for maxout
units than for tanh units.
with dropout. We sampled several subsets of each
model and compared the geometric mean of these sam-
pled models’ predictions to the prediction made using
the dropout technique of dividing the weights by 2.
We found evidence that dropout is indeed performing
model averaging, even in multilayer networks, and that
it is more accurate in the case of maxout. See Fig. 7
and Fig. 8 for details.
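The two averaging procedures being compared can be sketched directly. The code below (numpy; a small randomly initialized tanh MLP, so it illustrates the computation rather than the accuracy results) forms the renormalized geometric mean over sampled dropout masks and compares it, via KL divergence, to the halved-weight prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# A small two-layer tanh MLP with random weights, used only to illustrate the two
# averaging procedures compared in Figures 7 and 8 (it is not a trained model).
d, h, c = 20, 15, 10
W1 = rng.normal(size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(size=(h, c)); b2 = np.zeros(c)
v = rng.normal(size=d)

def predict(v_in, hidden_factor):
    # hidden_factor is either a sampled dropout mask or the scalar 1/2; scaling the
    # inputs to a layer by 1/2 is equivalent to halving its incoming weights.
    a = np.tanh(v_in @ W1 + b1) * hidden_factor
    return softmax(a @ W2 + b2)

p_halved = predict(0.5 * v, 0.5)                     # the W/2 approximation

# Renormalized geometric mean over sampled dropout masks on the inputs and hidden units.
n_samples = 1000
log_p = np.zeros(c)
for _ in range(n_samples):
    in_mask = rng.random(d) < 0.5
    hid_mask = rng.random(h) < 0.5
    log_p += np.log(predict(v * in_mask, hid_mask))
p_sampled = np.exp(log_p / n_samples)
p_sampled /= p_sampled.sum()

print(np.sum(p_sampled * np.log(p_sampled / p_halved)))   # KL divergence, as in Figure 8
```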
8. Optimization
The second key reason that maxout performs well is
that it improves the bagging style training phase of
dropout. Note that the arguments in section 7 motivating the use of maxout also apply equally to rectified linear units (Salinas & Abbott, 1996; Hahnloser, 1998; Glorot et al., 2011). The only difference between max-
out and max pooling over a set of rectified linear units
is that maxout does not include a 0 in the max. Su-
perficially, this seems to be a small difference, but we
find that including this constant 0 is very harmful to
optimization in the context of dropout. For instance,
on MNIST our best validation set error with an MLP
is 1.04%. If we include a 0 in the max, this rises to
over 1.2%. We argue that, when trained with dropout,
maxout is easier to optimize than rectified linear units
with cross-channel pooling.
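The difference is easy to state in code. The sketch below (numpy; the pre-activation statistics are synthetic and the negative bias is our illustrative choice) contrasts maxout's max over learned pieces with a cross-channel pool of rectifier units, where a constant 0 joins the max and, when it wins, blocks the gradient to every filter in the group.

```python
import numpy as np

rng = np.random.default_rng(0)
# Pre-activations for 1000 examples, 80 units, k = 5 pieces per unit (synthetic data,
# with a negative mean so that saturation at 0 actually occurs).
z = rng.normal(loc=-0.5, size=(1000, 80, 5))

maxout_act = z.max(axis=2)                          # max over the learned pieces only
pooled_relu_act = np.maximum(z, 0.0).max(axis=2)    # a constant 0 participates in the max

# When the constant 0 wins the max, the unit outputs 0 and, because that 0 is not a
# function of the parameters, no gradient flows to any of the unit's k filters.
print(np.mean(pooled_relu_act == 0.0))   # fraction of unit activations passing no gradient
print(np.mean(maxout_act == 0.0))        # 0.0: some learned piece always wins for maxout
```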
8.1. Optimization experiments
To verify that maxout yields better optimization per-
formance than max pooled rectified linear units when
training with dropout, we carried out two experiments.
First, we stressed the optimization capabilities of the
training algorithm by training a small (two hidden
convolutional layers with k= 2 and sixteen kernels)
model on the large (600,000 example) SVHN dataset.
When training with rectifier units the training error
gets stuck at 7.3%. If we train instead with max-
out units, we obtain 5.1% training error. As another
optimization stress test, we tried training very deep
and narrow models on MNIST, and found that max-
out copes better with increasing depth than pooled
rectifiers. See Fig. 9 for details.
[Figure 9 plot: MNIST train and test error vs. number of layers, for maxout and pooled rectifier networks.]
Figure 9. We trained a series of models with increasing
depth on MNIST. Each layer contains only 80 units (k=5)
to make fitting the training set difficult. Maxout optimiza-
tion degrades gracefully with depth but pooled rectifier
units worsen noticeably at 6 layers and dramatically at 7.
8.2. Saturation
[Figure 10 plot: proportion of training set h0 activation signs that switch each epoch, for maxout (positive to negative and back) and pooled rectifiers (positive to zero and back).]
Figure 10. During dropout training, rectifier units transi-
tion from positive to 0 activation more frequently than they
make the opposite transition, resulting in a preponderance of
0 activations. Maxout units freely move between positive
and negative signs at roughly equal rates.
Optimization proceeds very differently when using
dropout than when using ordinary stochastic gradi-
ent descent. SGD usually works best with a small
learning rate that results in a smoothly decreasing
objective function, while dropout works best with a
large learning rate, resulting in a constantly fluctuat-
ing objective function. Dropout rapidly explores many
different directions and rejects the ones that worsen
performance, while SGD moves slowly and steadily
in the most promising direction. We find empirically
that these different operating regimes result in differ-
ent outcomes for rectifier units. When training with
SGD, we find that the rectifier units saturate at 0 less
than 5% of the time. When training with dropout, we
initialize the units to saturate rarely but training grad-
ually increases their saturation rate to 60%. Because
the 0 in the max(0, z) activation function is a con-
stant, this blocks the gradient from flowing through
the unit. In the absence of gradient through the unit,
it is difficult for training to change this unit to be-
come active again. Maxout does not suffer from this
problem because gradient always flows through every
maxout unit: even when a maxout unit is 0, this 0 is a function of the parameters and may be adjusted. Units
that take on negative activations may be steered to
become positive again later. Fig. 10 illustrates how
active rectifier units become inactive at a greater rate
than inactive units become active when training with
dropout, but maxout units, which are always active,
transition between positive and negative activations at
about equal rates in each direction. We hypothesize
that the high proportion of zeros and the difficulty of
escaping them impairs the optimization performance
of rectifiers relative to maxout.
To test this hypothesis, we trained two MLPs on
MNIST, both with two hidden layers and 1200 fil-
ters per layer pooled in groups of 5. When we in-
clude a constant 0 in the max pooling, the resulting
trained model fails to make use of 17.6% of the filters in one hidden layer and 39.2% of the filters in the other. A small minority of the filters usually took
on the maximal value in the pool, and the rest of the
time the maximal value was a constant 0. Maxout, on
the other hand, used all but 2 of the 2400 filters in
the network. Each filter in each maxout unit in the
network was maximal for some training example. All
filters had been utilised and tuned.
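The unused-filter statistic can be computed as follows. This numpy sketch (ours; the pre-activations are random stand-ins, so both fractions come out near zero here; the reported gap is produced by training dynamics rather than by the measurement itself) records which filter wins each pooling group and flags filters that never win.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_units, k = 2000, 240, 5
z = rng.normal(size=(n_examples, n_units, k))        # stand-in pre-activations

def unused_filter_fraction(z, include_zero_in_max):
    # A filter is "used" if it attains the maximum of its pooling group for at least
    # one example (and, for pooled rectifiers, also beats the constant 0).
    winners = z.argmax(axis=2)                               # (n_examples, n_units)
    if include_zero_in_max:
        winners = np.where(z.max(axis=2) > 0, winners, -1)   # -1: the constant 0 won
    used = np.zeros((n_units, k), dtype=bool)
    for j in range(k):
        used[:, j] = (winners == j).any(axis=0)
    return 1.0 - used.mean()

print(unused_filter_fraction(z, include_zero_in_max=False))  # maxout-style pooling
print(unused_filter_fraction(z, include_zero_in_max=True))   # rectifier + cross-channel pooling
```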
8.3. Lower layer gradients and bagging
To behave differently from SGD, dropout requires the
gradient to change noticeably as the choice of which
units to drop changes. If the gradient is approximately
constant with respect to the dropout mask, then
dropout simplifies to SGD training. We tested the
hypothesis that rectifier networks suffer from dimin-
ished gradient flow to the lower layers of the network
by monitoring the variance with respect to dropout
masks for fixed data during training of two different
MLPs on MNIST. The variance of the gradient on the
output weights was 1.4 times larger for maxout on an
average training step, while the variance on the gradi-
ent of the first layer weights was 3.4 times larger for
maxout than for rectifiers. Combined with our previ-
ous result showing that maxout allows training deeper
networks, this greater variance suggests that maxout
better propagates varying information downward to
the lower layers and helps dropout training to better
resemble bagging for the lower-layer parameters. Rec-
tifier networks, with more of their gradient lost to satu-
ration, presumably cause dropout training to resemble
regular SGD toward the bottom of the network.
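The monitored quantity can be sketched as follows: for a fixed input, compute the first-layer gradient of a small maxout model under many dropout masks and measure its variance. The code below (numpy; random untrained weights, a scalar output in place of the full classification loss, and manual backprop through the max) illustrates the measurement only, not the reported 1.4x and 3.4x ratios.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 50, 30, 5
x = rng.normal(size=d)
W1 = rng.normal(size=(d, m, k)) * 0.1
w2 = rng.normal(size=m) * 0.1

def grad_W1_of_output(x, input_mask, hidden_mask):
    """Gradient of the scalar output s = (dropout(h) . w2) w.r.t. W1, where
    h_i = max_j (dropout(x) . W1[:, i, j]). Backprop through max routes the
    gradient to the winning piece of each maxout unit."""
    xd = x * input_mask
    z = np.einsum('d,dmk->mk', xd, W1)
    winners = z.argmax(axis=1)                       # winning piece per unit
    g = np.zeros_like(W1)
    for i in range(m):
        g[:, i, winners[i]] = xd * w2[i] * hidden_mask[i]
    return g

# Variance of the first-layer gradient across dropout masks, for fixed data:
# the quantity monitored in this section (here with random, untrained weights).
grads = []
for _ in range(200):
    in_mask = (rng.random(d) < 0.5).astype(float)
    hid_mask = (rng.random(m) < 0.5).astype(float)
    grads.append(grad_W1_of_output(x, in_mask, hid_mask))
grads = np.stack(grads)
print(grads.var(axis=0).mean())
```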
9. Conclusion
We have proposed a new activation function called
maxout that is particularly well suited for training
with dropout, and for which we have proven a univer-
sal approximation theorem. We have shown empirical
evidence that dropout attains a good approximation
to model averaging in deep models. We have shown
that maxout exploits this model averaging behavior
because the approximation is more accurate for max-
out units than for tanh units. We have demonstrated
that optimization behaves very differently in the con-
text of dropout than in the pure SGD case. By de-
signing the maxout gradient to avoid pitfalls such as
failing to use many of a model’s filters, we are able to
train deeper networks than is possible using rectifier
units. We have also shown that maxout propagates
variations in the gradient due to different choices of
dropout masks to the lowest layers of a network, en-
suring that every parameter in the model can enjoy
the full benefit of dropout and more faithfully emulate
bagging training. The state of the art performance of
our approach on five different benchmark tasks moti-
vates the design of further models that are explicitly
intended to perform well when combined with inex-
pensive approximations to model averaging.
Acknowledgements
The authors would like to thank the developers of
Theano (Bergstra et al., 2010; Bastien et al., 2012), in particular Frédéric Bastien and Pascal Lamblin for
their assistance with infrastructure development and
performance optimization. We would also like to thank
Yann Dauphin for helpful discussions.
References
Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan,
Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud,
Bouchard, Nicolas, and Bengio, Yoshua. Theano: new
features and speed improvements. Deep Learning and
Unsupervised Feature Learning NIPS 2012 Workshop,
2012.
Bergstra, James, Breuleux, Olivier, Bastien, Frédéric,
Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guil-
laume, Turian, Joseph, Warde-Farley, David, and Ben-
gio, Yoshua. Theano: a CPU and GPU math expression
compiler. In Proceedings of the Python for Scientific
Computing Conference (SciPy), June 2010. Oral Pre-
sentation.
Breiman, Leo. Bagging predictors. Machine Learning, 24
(2):123–140, 1994.
Ciresan, D. C., Meier, U., Gambardella, L. M., and
Schmidhuber, J. Deep big simple neural nets for hand-
written digit recognition. Neural Computation, 22:1–14,
2010.
Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua.
Deep sparse rectifier neural networks. In JMLR W&CP:
Proceedings of the Fourteenth International Confer-
ence on Artificial Intelligence and Statistics (AISTATS
2011), April 2011.
Goodfellow, Ian J., Courville, Aaron, and Bengio, Yoshua.
Joint training of deep Boltzmann machines for classifi-
cation. In International Conference on Learning Repre-
sentations: Workshops Track, 2013.
Hahnloser, Richard H. R. On the piecewise analysis of
networks of linear threshold neurons. Neural Networks,
11(4):691–697, 1998.
Hinton, Geoffrey E., Srivastava, Nitish, Krizhevsky, Alex,
Sutskever, Ilya, and Salakhutdinov, Ruslan. Improving
neural networks by preventing co-adaptation of feature
detectors. Technical report, arXiv:1207.0580, 2012.
Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato,
Marc’Aurelio, and LeCun, Yann. What is the best
multi-stage architecture for object recognition? In
Proc. International Conference on Computer Vision
(ICCV’09), pp. 2146–2153. IEEE, 2009.
Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple
layers of features from tiny images. Technical report,
University of Toronto, 2009.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey.
ImageNet classification with deep convolutional neural
networks. In Advances in Neural Information Processing
Systems 25 (NIPS’2012). 2012.
LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner,
Patrick. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324,
November 1998.
Malinowski, Mateusz and Fritz, Mario. Learnable pooling
regions for image classification. In International Con-
ference on Learning Representations: Workshop track,
2013.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B.,
and Ng, A. Y. Reading digits in natural images with
unsupervised feature learning. Deep Learning and Un-
supervised Feature Learning Workshop, NIPS, 2011.
Rifai, Salah, Dauphin, Yann, Vincent, Pascal, Bengio,
Yoshua, and Muller, Xavier. The manifold tangent clas-
sifier. In NIPS’2011, 2011. Student paper award.
Salakhutdinov, R. and Hinton, G.E. Deep Boltzmann
machines. In Proceedings of the Twelfth International
Conference on Artificial Intelligence and Statistics (AIS-
TATS 2009), volume 8, 2009.
Salinas, E. and Abbott, L. F. A model of multiplicative
neural responses in parietal cortex. Proc Natl Acad Sci
U S A, 93(21):11956–11961, October 1996.
Sermanet, Pierre, Chintala, Soumith, and LeCun, Yann.
Convolutional neural networks applied to house numbers
digit classification. CoRR, abs/1204.3968, 2012a.
Sermanet, Pierre, Chintala, Soumith, and LeCun, Yann.
Convolutional neural networks applied to house num-
bers digit classification. In International Conference on
Pattern Recognition (ICPR 2012), 2012b.
Snoek, Jasper, Larochelle, Hugo, and Adams,
Ryan Prescott. Practical bayesian optimization of
machine learning algorithms. In Neural Information
Processing Systems, 2012.
Srebro, Nathan and Shraibman, Adi. Rank, trace-norm
and max-norm. In Proceedings of the 18th Annual
Conference on Learning Theory, pp. 545–560. Springer-
Verlag, 2005.
Srivastava, Nitish. Improving neural networks with
dropout. Master’s thesis, U. Toronto, 2013.
Wang, Shuning. General constructive representations for
continuous piecewise-linear functions. IEEE Trans. Cir-
cuits Systems, 51(9):1889–1896, 2004.
Yu, Dong and Deng, Li. Deep convex net: A scalable ar-
chitecture for speech pattern classification. In INTER-
SPEECH, pp. 2285–2288, 2011.
Zeiler, Matthew D. and Fergus, Rob. Stochastic pooling for
regularization of deep convolutional neural networks. In
International Conference on Learning Representations,
2013.