Visualizing Higher-Layer Features of a Deep Network

Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent
Dept. IRO, Université de Montréal
P.O. Box 6128, Downtown Branch, Montreal, H3C 3J7, QC, Canada
first.last@umontreal.ca
Technical Report 1341
Département d’Informatique et Recherche Opérationnelle
June 9th, 2009
Abstract
Deep architectures have demonstrated state-of-the-art results in a variety of
settings, especially with vision datasets. Beyond the model definitions and the
quantitative analyses, there is a need for qualitative comparisons of the solutions
learned by various deep architectures. The goal of this paper is to find good qualitative
interpretations of high level features represented by such models. To this end,
we contrast and compare several techniques applied on Stacked Denoising Auto-
encoders and Deep Belief Networks, trained on several vision datasets. We show
that, perhaps counter-intuitively, such interpretation is possible at the unit level,
that it is simple to accomplish and that the results are consistent across various
techniques. We hope that such techniques will allow researchers in deep architectures
to understand more of how and why deep architectures work.
1 Introduction
Until 2006, it was not known how to efficiently learn deep hierarchies of features with
a densely-connected neural network of many layers. The breakthrough, by Hinton
et al. (2006a), came with the realization that unsupervised models such as Restricted
Boltzmann Machines (RBMs) can be used to initialize the network in a region of the
parameter space that makes it easier to subsequently find a good minimum of the
supervised objective. The greedy, layer-wise unsupervised initialization of a network can
also be carried out by using auto-associators and related models, as shown by Bengio
et al. (2007) and Ranzato et al. (2007). Recently, there has been a surge in research on
training deep architectures: Bengio (2009) gives a comprehensive review.
While quantitative analyses and comparisons of such models exist, and visualizations
of the first layer representations are common in the literature, one area where more
work needs to be done is the qualitative analysis of representations learned beyond the
first level.
Some of the deep architectures (such as Deep Belief Nets (Hinton et al., 2006a)) are
associated with a generative procedure, and one could potentially use such a procedure
to gain insight into what an individual hidden unit represents. We explore one such
sampling technique here. However, it is sometimes difficult to obtain samples that
cover the modes of a Boltzmann or RBM distribution well, and these sampling-based
visualizations cannot be applied to other deep architectures such as those based
on auto-encoders (Bengio et al., 2007; Ranzato et al., 2007; Larochelle et al., 2007;
Ranzato et al., 2008; Vincent et al., 2008) or on semi-supervised learning of similarity-
preserving embeddings at each level (Weston et al., 2008).
A typical qualitative way of comparing features extracted by a first layer of a deep
architecture is by looking at the “filters” learned by the model, that is the linear weights
in the input-to-first layer weight matrix, represented in input space. This is particularly
convenient when the inputs are images or waveforms, which can be visualized. Often,
these filters take the shape of stroke detectors, when trained on digit data, or edge
detectors (Gabor filters) when trained on natural image patches (Hinton et al., 2006a; Hinton
et al., 2006b; Osindero & Hinton, 2008; Larochelle et al., 2009). The techniques we
study here also suppose that the input patterns can be displayed and are meaningful for
humans, and we evaluate all of them on image data.
Our aim was to explore ways of visualizing what a unit computes in an arbitrary
layer of a deep network. The goal was to have this visualization in the input space (of
images), to have an efficient way of computing it, and to make it as general as possible
(in the sense of it being applicable to a large class of neural-network-like models). To
this end, we explore several visualization methods that allow us to gain insight into
what a particular unit of a neural network represents. We compare and contrast them
qualitatively on two image datasets, and we also explore connections between all of
them.
The main experimental finding of this investigation is very surprising: the response
of an internal unit to input images, as a function in image space, appears to be unimodal,
or at least its maximum is found reliably and consistently for all the random
initializations tested. This is interesting because finding this dominant mode is relatively
easy, and displaying it then provides a good characterization of what the unit does.
2 The models
We shall consider two deep architectures as representatives of two families of
models encountered in the deep learning literature. The first model is a Deep Belief Net
(DBN) (Hinton et al., 2006a), obtained by training and stacking three layers as
Restricted Boltzmann Machines (RBMs) in a greedy manner. This means that we trained
an RBM with Contrastive Divergence (Hinton, 2002) on the training data, we fixed the
parameters of this RBM, and then trained another RBM to model the hidden layer
representations of the first level RBM. This process can be repeated to yield a deep
architecture that is an unsupervised model of the training distribution. Note that it is
also a generative model of the data and one can easily obtain samples from a trained
model. DBNs have been described numerous times in the literature and we use them
as described by Bengio et al. (2007) and Hinton et al. (2006a); we omit more details
in favor of describing the other deep architecture.
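As an illustration only, the greedy stacking procedure described above could be sketched in NumPy as follows. The function names (`train_rbm_cd1`, `train_dbn`), the learning rate, the number of epochs, the use of a single Gibbs step (CD-1) and the use of hidden probabilities as the input of the next RBM are assumptions made for this sketch; they are not details reported for the experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_rbm_cd1(data, n_hidden, lr=0.1, epochs=5, rng=None):
    """Train a binary RBM with one step of Contrastive Divergence (CD-1)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_hidden, n_visible))
    b = np.zeros(n_hidden)   # hidden biases
    c = np.zeros(n_visible)  # visible biases
    for _ in range(epochs):
        for v0 in data:
            # Positive phase: hidden probabilities and a binary sample.
            ph0 = sigmoid(b + W @ v0)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            # Negative phase: one Gibbs step down to the visibles and back up.
            pv1 = sigmoid(c + W.T @ h0)
            v1 = (rng.random(n_visible) < pv1).astype(float)
            ph1 = sigmoid(b + W @ v1)
            # CD-1 update: data statistics minus reconstruction statistics.
            W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
            b += lr * (ph0 - ph1)
            c += lr * (v0 - v1)
    return W, b, c

def train_dbn(data, layer_sizes, **kwargs):
    """Greedy stacking: each RBM models the hidden representation of the one below."""
    layers, rep = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm_cd1(rep, n_hidden, **kwargs)
        layers.append((W, b, c))
        # The lower layer is now frozen; its hidden probabilities
        # (an assumption of this sketch) become the next RBM's training data.
        rep = sigmoid(b + rep @ W.T)
    return layers
```

For instance, `train_dbn(data, [500, 500, 500])` would return three `(W, b, c)` triples, one per RBM, ordered from the input upwards; the layer sizes here are placeholders.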
The second model, by Vincent et al. (2008), is the so-called Stacked Denoising
Auto-Encoder (SDAE). It borrows the greedy principle from DBNs, but uses denoising
auto-encoders as a building block for unsupervised modeling. An auto-encoder
learns an encoder $h(\cdot)$ and a decoder $g(\cdot)$ whose composition approaches the identity
for examples in the training set, i.e., $g(h(x)) \approx x$ for $x$ in the training set. The
denoising auto-encoder is a stochastic variant of the ordinary auto-encoder with the property
that even with a high capacity model, it cannot learn the identity. Furthermore, its
training criterion is a variational lower bound on the likelihood of a generative model.
It is explicitly trained to denoise a corrupted version of its input. It has been shown
on an array of datasets to perform significantly better than ordinary auto-encoders and
similarly or better than RBMs when stacked into a deep supervised architecture
(Vincent et al., 2008). Another way to prevent regular auto-encoders with more code units
than inputs from learning the identity is to impose sparsity on the code (Ranzato et al., 2007;
Ranzato et al., 2008). The activation maximization technique presented below is
applicable to any trained deep neural network, and we evaluate it on networks obtained by
stacking RBMs and denoising auto-encoders.
We now summarize the training algorithm of the Stacked Denoising Auto-Encoders.
More details are given by Vincent et al. (2008). Each denoising auto-encoder operates
on its inputs $x$, either the raw inputs or the outputs of the previous layer. The denoising
auto-encoder is trained to reconstruct $x$ from a stochastically corrupted (noisy)
transformation of it. The output of each denoising auto-encoder is the “code vector” $h(x)$.
In our experiments $h(x) = \mathrm{sigmoid}(b + Wx)$ is an ordinary neural network layer,
with hidden unit biases $b$, weight matrix $W$, and $\mathrm{sigmoid}(a) = 1/(1 + \exp(-a))$
(applied element-wise on a vector $a$). Let $C(x)$ represent a stochastic corruption of
$x$. As done by Vincent et al. (2008), we set $C_i(x) = x_i$ or $0$, with a random subset
(of a fixed size) selected for zeroing. We have also considered a salt and pepper noise,
where we select a random subset of a fixed size and set $C_i(x) = \mathrm{Bernoulli}(0.5)$. The
“reconstruction” is obtained from the noisy input with
$\hat{x} = \mathrm{sigmoid}(c + W^\top h(C(x)))$,
using biases $c$ and the transpose of the feed-forward weights $W$. In the experiments
on images, both the raw input $x_i$ and its reconstruction $\hat{x}_i$ for a particular pixel $i$ can
be interpreted as a Bernoulli probability for that pixel: the probability of painting the
pixel as black at that location. We denote by $\mathrm{KL}(x \| \hat{x}) = \sum_i \mathrm{KL}(x_i \| \hat{x}_i)$ the sum
of component-wise KL divergences between the Bernoulli probability distributions
associated with each element of $x$ and its reconstruction probabilities $\hat{x}$:
$\mathrm{KL}(x \| \hat{x}) = -\sum_i \left( x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right)$
(this expression is the cross-entropy, which differs from the KL divergence only by the
entropy of $x$, a term that does not depend on the parameters). The Bernoulli model only
makes sense when the input components and their reconstruction are in [0, 1]; another
option is to use a Gaussian model, which corresponds to a Mean Squared Error (MSE) criterion.
For each unlabeled example $x$, a stochastic gradient estimator is then obtained by
computing $\partial \mathrm{KL}(x \| \hat{x}) / \partial\theta$ for $\theta = (b, c, W)$. The gradient is stochastic because of
sampling example $x$ and because of the stochastic corruption $C(x)$. Stochastic gradient
descent $\theta \leftarrow \theta - \epsilon \cdot \partial \mathrm{KL}(x \| \hat{x}) / \partial\theta$ is then performed with learning rate $\epsilon$, for a fixed
number of pre-training iterations.
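The training step summarized above can be sketched as follows for a single denoising auto-encoder with tied weights. This is a minimal NumPy illustration, not the code used for the experiments: the function names, the clipping constant `eps` and the in-place parameter update are assumptions of the sketch, and the gradients are written out by hand for the Bernoulli (cross-entropy) criterion.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(x, n_corrupt, rng, mode="zero"):
    """Corrupt a random subset of fixed size: zero-masking or salt-and-pepper."""
    x_tilde = x.copy()
    idx = rng.choice(x.size, size=n_corrupt, replace=False)
    if mode == "zero":
        x_tilde[idx] = 0.0
    else:  # salt-and-pepper: each selected component is set to Bernoulli(0.5)
        x_tilde[idx] = rng.integers(0, 2, size=n_corrupt)
    return x_tilde

def cross_entropy(x, x_hat, eps=1e-12):
    x_hat = np.clip(x_hat, eps, 1 - eps)  # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

def dae_sgd_step(x, W, b, c, lr, n_corrupt, rng, mode="zero"):
    """One stochastic gradient step; W, b and c are updated in place."""
    x_tilde = corrupt(x, n_corrupt, rng, mode)
    h = sigmoid(b + W @ x_tilde)       # code vector h(C(x))
    x_hat = sigmoid(c + W.T @ h)       # reconstruction with tied weights
    loss = cross_entropy(x, x_hat)
    d_out = x_hat - x                  # gradient w.r.t. output pre-activation
    d_hid = (W @ d_out) * h * (1 - h)  # back-propagated to hidden pre-activation
    # W appears in both the encoder and the decoder, hence two terms.
    grad_W = np.outer(d_hid, x_tilde) + np.outer(h, d_out)
    W -= lr * grad_W
    b -= lr * d_hid
    c -= lr * d_out
    return loss
```

The loss is measured against the clean input `x`, while the encoder only sees the corrupted `x_tilde`; this is what prevents the model from simply learning the identity.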
3 Maximizing the activation
The first idea is simple: we look for input patterns of bounded norm which maximize
the activation of a given hidden unit¹; since the activation function of a unit in the first
layer is a linear function of the input, in the case of the first layer, this input pattern is
proportional to the filter itself.
The reasoning behind this idea is that a pattern to which the unit is responding
maximally could be a good first-order representation of what a unit is doing. One
simple way of doing this is to find, for a given unit, the input sample(s) (from either the
training or the test set) that give rise to the highest activation of the unit. Unfortunately,
this still leaves us with the problem of choosing how many samples to keep for each unit
and the problem of how to “combine” these samples. Ideally, we would like to find out
what these samples have in common. Furthermore, it may be that only some subsets of
the input vector contribute to the high activation, and it is not easy to determine which
by inspection.
Note that we restricted ourselves needlessly to searching for an input pattern from
the training or test sets. We can take a more general view and see our idea (maximizing
the activation of a unit) as an optimization problem. Let $\theta$ denote our neural network
parameters (weights and biases) and let $h_{ij}(\theta, x)$ be the activation of a given unit $i$
from a given layer $j$ in the network; $h_{ij}$ is a function of both $\theta$ and the input sample
$x$. Assuming a fixed $\theta$ (for instance, the parameters after training the network), we can
view our idea as looking for
$x^* = \arg\max_{x \ \mathrm{s.t.}\ \|x\| = \rho} h_{ij}(\theta, x).$
This is, in general, a non-convex optimization problem. But it is a problem for which
we can at least try to find a local maximum. This can be done most easily by performing
simple gradient ascent in the input space, i.e. computing the gradient of $h_{ij}(\theta, x)$ and
moving $x$ in the direction of this gradient².
Two scenarios are possible: the same (qualitative) maximum is found when starting
from different random initializations, or two or more local maxima are found. In both
cases, the unit can then be characterized by the maximum or set of maxima found. In
the latter case, one can either average the results, or choose the one which maximizes
the activation, or display all the local maxima obtained to characterize that unit.
This optimization technique (we will call it “activation maximization”) is
applicable to any network in which we can compute the above gradients. Like any
gradient-based technique, it does involve a choice of hyperparameters: the learning rate and
a stopping criterion (the maximum number of gradient ascent updates, in our
experiments).
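To make the procedure concrete, here is a minimal NumPy sketch of activation maximization for a unit in the second layer of a small sigmoid network, with the norm constraint enforced by rescaling after each step and several random restarts. The network shape, the function names and the default hyperparameters are assumptions chosen for illustration; the gradient is written by hand for this two-layer case only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def unit_activation(x, W1, b1, v, b2):
    """Pre-sigmoid activation of a second-layer unit with incoming weights v."""
    h1 = sigmoid(b1 + W1 @ x)
    return v @ h1 + b2

def activation_gradient(x, W1, b1, v):
    """Gradient of the unit's activation with respect to the input x."""
    h1 = sigmoid(b1 + W1 @ x)
    return W1.T @ (v * h1 * (1 - h1))

def maximize_activation(W1, b1, v, b2, rho, lr=0.1, n_steps=200,
                        n_restarts=9, seed=0):
    """Projected gradient ascent on the sphere ||x|| = rho, from random starts."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_restarts):
        x = rng.random(W1.shape[1])          # uniform random initialization
        x *= rho / np.linalg.norm(x)
        for _ in range(n_steps):             # fixed number of ascent updates
            x = x + lr * activation_gradient(x, W1, b1, v)
            x *= rho / np.linalg.norm(x)     # rescale back to norm rho
        results.append((unit_activation(x, W1, b1, v, b2), x))
    best = max(results, key=lambda r: r[0])  # restart with highest activation
    return best, results
```

Returning all restarts alongside the best one makes it possible to inspect whether they agree, which corresponds to the two scenarios discussed above.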
4 Sampling from a unit of a Deep Belief Network
Consider a Deep Belief Network with $j$ layers, as described in Section 2. In particular,
layers $j-1$ and $j$ form an RBM from which we can sample using block Gibbs sampling,
which successively samples from $p(h_{j-1} | h_j)$ and $p(h_j | h_{j-1})$, denoting by $h_j$ the
binary vector of units from layer $j$. Along this Markov chain, we propose to “clamp”
unit $h_{ij}$, and only this unit, to 1. We can then sample inputs $x$ by performing ancestral
top-down sampling in the directed belief network going from layer $j-1$ to the input,
in the DBN. This will produce a distribution that we shall denote by $p_j(x | h_{ij} = 1)$
where $h_{ij}$ is the unit that is clamped, and $p_j$ denotes the depth-$j$ DBN containing only
the first $j$ layers. This procedure is similar to and inspired by experiments by Hinton
et al. (2006a), where the top layer RBM is trained on the representations learned by
the previous RBM and the label as a one-hot vector; in that case, one can “clamp” the
label vector to a particular configuration and sample from a particular class distribution
$p(x | \mathrm{class} = k)$.
¹ The total sum of the input to the unit from the previous layer plus its bias.
² Since we are trying to maximize $h_{ij}$.
In essence, we use the distribution $p_j(x | h_{ij} = 1)$ to characterize $h_{ij}$. In analogy to
Section 3, we can characterize the unit by many samples from this distribution or
summarize the information by computing the expectation $E[x | h_{ij} = 1]$. This method has,
essentially, no hyperparameters except the number of samples that we use to estimate
the expectation. It is relatively efficient provided the Markov chain at layer $j$ mixes
well (which is not always the case, unfortunately).
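The clamped sampling procedure can be sketched as follows, under the assumption that the DBN is stored as a list of `(W, b, c)` triples (weights of shape hidden by visible, hidden biases, visible biases) ordered from the input upwards. The function name, this storage layout, the random initialization of the top layer and the choice of returning Bernoulli probabilities rather than binary samples at the input layer are all assumptions of this illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def sample_clamped(layers, unit, n_samples, gibbs_steps=100, seed=0):
    """Sample inputs from a DBN while clamping one top-layer unit to 1."""
    rng = np.random.default_rng(seed)
    W_top, b_top, c_top = layers[-1]
    lower = layers[:-1]
    h_top = bernoulli(np.full(W_top.shape[0], 0.5), rng)  # random chain start
    h_top[unit] = 1.0
    samples = []
    for _ in range(n_samples):
        # Block Gibbs sampling in the top RBM, re-clamping after each step.
        for _ in range(gibbs_steps):
            h_below = bernoulli(sigmoid(c_top + W_top.T @ h_top), rng)
            h_top = bernoulli(sigmoid(b_top + W_top @ h_below), rng)
            h_top[unit] = 1.0
        # Ancestral top-down pass through the directed layers below.
        state = h_below
        for k, (W, _, c) in enumerate(reversed(lower)):
            p = sigmoid(c + W.T @ state)
            # Keep probabilities at the input layer, sample binary states above it.
            state = p if k == len(lower) - 1 else bernoulli(p, rng)
        samples.append(state)
    return np.array(samples)
```

Averaging the returned array over its first axis gives a Monte Carlo estimate of $E[x | h_{ij} = 1]$; the quality of that estimate depends on how well the chain mixes, as noted above.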
There is an interesting link between the method of maximizing the activation and
$E[x | h_{ij} = 1]$. By definition, $E[x | h_{ij} = 1] = \int x \, p_j(x | h_{ij} = 1) \, dx$. If we consider the
extreme case where the distribution concentrates at $x^+$, $p_j(x | h_{ij} = 1) \approx \delta_{x^+}(x)$, then
the expectation is $E[x | h_{ij} = 1] = x^+$.
On the other hand, when applying the activation maximization technique to a DBN,
we are approximately³ looking for $\arg\max_x p(h_{ij} = 1 | x)$, since this probability
is monotonic in the activation of unit $h_{ij}$. Using Bayes’ rule and the concentration
assumption about $p(x | h_{ij} = 1)$, we find that
$p(h_{ij} = 1 | x) = \frac{p(x | h_{ij} = 1) \, p(h_{ij} = 1)}{p(x)} = \frac{\delta_{x^+}(x) \, p(h_{ij} = 1)}{p(x)}.$
This is zero everywhere except at $x^+$ so under our assumption,
$\arg\max_x p(h_{ij} = 1 | x) = x^+$.
More generally, one can show that if $p(x | h_{ij} = 1)$ concentrates sufficiently around
$x^+$ compared to $p(x)$, then the two methods (expected value over samples vs. activation
maximization) should produce very similar results. Generally speaking, it is easy
to imagine how such an assumption could be untrue because of the nonlinearities
involved. In fact, what we observe is that although the samples or their average may look
like training examples, the images obtained by activation maximization look more like
image parts, which may be a more accurate representation of what the particular unit
does (as opposed to all the other units involved in the sampled patterns).
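One way to read the "more general" statement, offered here only as a sketch and assuming $p(h_{ij} = 1) > 0$ and that both densities are positive in the region of interest, is to rewrite the maximization with Bayes' rule and take logarithms:

```latex
\begin{align*}
\arg\max_x \, p(h_{ij} = 1 \mid x)
  &= \arg\max_x \, \frac{p(x \mid h_{ij} = 1)\, p(h_{ij} = 1)}{p(x)} \\
  &= \arg\max_x \, \bigl[ \log p(x \mid h_{ij} = 1) - \log p(x) \bigr].
\end{align*}
```

The factor $p(h_{ij} = 1)$ does not depend on $x$ and drops out. If $\log p(x | h_{ij} = 1)$ is sharply peaked around $x^+$ while $\log p(x)$ varies slowly there, the first term dominates and the maximizer stays close to $x^+$; when $p(x)$ varies as quickly as the conditional, the two methods can disagree.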
5 Linear combination of previous layers’ filters
Lee et al. (2008) showed one way of visualizing what the units in the second hidden
layer of a network are responding to. They made the assumption that a unit can be
characterized by the filters of the previous layer to which it is most strongly connected⁴.
By taking a weighted linear combination of the previous layer filters, where each filter
is weighted by its connection to the unit considered, they show that a Deep Belief Network
with sparsity constraints on the activations, trained on natural images, will tend to learn
“corner detectors” at the second layer. Lee et al. (2009) used an extended version of
this method for visualizing units of the third layer: by simply weighing the “filters”
found at the second layer by their connections to the third layer, and choosing again
the largest weights.
³ Because of the approximate optimization and because the true posteriors are intractable for higher layers,
and only approximated by the corresponding neural network unit outputs.
Such a technique is simple and efficient. One disadvantage is that it is not clear
how to automatically choose the appropriate number of filters to keep at each layer.
Moreover, by selecting only the very few most strongly connected filters from the first
layer, one can potentially get a misleading picture, since one is essentially ignoring the
rest of the previous layer units. Finally, this method also bypasses the nonlinearities
between layers, which may be an important part of the model. One motivation for this
paper is to validate whether the patterns obtained by Lee et al. (2008) are similar to
those obtained by the other methods explored here.
One should note that there is indeed a link between the gradient updates for
maximizing the activation of a unit and finding the linear combination of weights as
described by Lee et al. (2009). Take, for instance, $h_{i2}$, i.e. the activation of unit $i$
from layer 2, with $h_{i2} = v^\top \mathrm{sigmoid}(Wx)$, with $v$ being the unit’s weights and $W$
being the first layer weight matrix. Then
$\partial h_{i2} / \partial x = v^\top \mathrm{diag}\left(\mathrm{sigmoid}(Wx) * (\mathbf{1} - \mathrm{sigmoid}(Wx))\right) W$,
where $*$ is the element-wise multiplication, diag is the operator
that creates a diagonal matrix from a vector, and $\mathbf{1}$ is a vector filled with ones. If the
units of the first layer do not saturate, then $\partial h_{i2} / \partial x$ points roughly in the direction of
$v^\top W$, which can be approximated by taking the terms with the largest absolute value
of $v_i$.
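The relationship between the gradient and the linear combination of filters can be checked numerically. The sketch below uses NumPy with hypothetical function names; the `top_k` truncation is one possible reading of "taking the terms with the largest absolute value", not necessarily the exact selection rule used by Lee et al.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient_wrt_input(x, W, v):
    """d h_i2 / dx = v^T diag(s * (1 - s)) W, with s = sigmoid(W x)."""
    s = sigmoid(W @ x)
    return (v * s * (1 - s)) @ W

def linear_combination(W, v, top_k=None):
    """v^T W, optionally restricted to the top_k weights largest in magnitude."""
    if top_k is None:
        return v @ W
    keep = np.argsort(np.abs(v))[-top_k:]
    return v[keep] @ W[keep]

def cosine(a, b):
    """Cosine similarity between two direction vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

When the first-layer units are far from saturation, `s * (1 - s)` is close to 0.25 for every unit, so the gradient is nearly a rescaled copy of `v @ W` and the cosine between the two is close to 1. With saturated units the diagonal term re-weights the filters unevenly and the two directions can diverge.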
6 Experiments
6.1 Data and setup
We used two datasets: the first is an extended version of the MNIST digit classification
dataset, by Loosli et al. (2007), in which elastic deformations of digits are generated
stochastically. We used 2.5 million examples as training data, where each example is
a 28 × 28 gray-scale image. The second is a collection of 100,000 12 × 12 patches
of natural images, generated from the collection of whitened natural image patches by
Olshausen and Field (1996).
The visualization procedures were tested on the models described in Section 2:
Deep Belief Nets (DBNs) and Stacked Denoising Auto-encoders (SDAE). The
hyperparameters are: unsupervised and supervised learning rates, number of hidden units
per layer, and the amount of noise in the case of SDAE; they were chosen to minimize
the classification error on MNIST⁵ or the reconstruction error⁶ on natural images, for
a given validation set. For MNIST, we show the results obtained after unsupervised
training only; this allows us to compare all the methods (since we cannot sample from
a DBN after supervised fine-tuning). For the SDAE, we used salt and pepper noise
as a corruption technique, as opposed to the zero-masking noise described by Vincent
et al. (2008): such noise seems to better model natural images. For both SDAE and
DBN we used a Gaussian input layer when modeling natural images; these are more
appropriate than the standard Bernoulli units, given the distribution of pixel grey levels
in such patches (Bengio et al., 2007; Larochelle et al., 2009).
⁴ I.e. whose weight to the upper unit is large in magnitude.
In the case of activation maximization (Section 3), the procedure is as follows for a
given unit from either the second or the third layer: we initialize $x$ to a vector of 28 × 28
or 12 × 12 dimensions in which each pixel is sampled independently from a uniform
distribution over [0, 1]. We then compute the gradient of the activation of the unit w.r.t. $x$
and make a step in the gradient direction. The gradient updates are continued until convergence,
i.e. until the activation function does not increase by much anymore. Note that after
each gradient update, the current estimate of $x$ is re-normalized to the average norm
of examples from the respective dataset⁷. Interestingly, the same optimal value (i.e. the
one that seems to maximize activation) for the learning rate of the gradient ascent works
for all the units from the same layer.
Sampling from a DBN is done as described in Section 4, by running the randomly-
initialized Markov chain and top-down sampling every 100 iterations. In the case of
the method described in Section 5, the (subjective) optimal number of previous layer
filters was taken to be 100.
6.2 Activation Maximization
We begin with the analysis of the activation maximization method. Figures 1 and 2
contain the results of the optimization of units from the 2nd and 3rd layers of a DBN
and an SDAE, along with the first layer filters. Figure 1 shows such an analysis for
MNIST and Figure 2 shows it for the natural image data.
To test the dependence of this gradient ascent on the initial conditions, 9 different
random initializations were tried. The retained “filter” corresponding to each unit is
the one (out of the 9 random initializations) which maximizes the activation. In the
same figures we also show the variations found by the different random initializations
for a given unit from the 3rd layer. Surprisingly, most random initializations yield
roughly the same prominent input pattern. Moreover, we measured the maximum
values for the activation function to be quite close to each other (not shown). Such
results are relatively surprising, given that, generally speaking, the activation function
of a third layer unit is a highly non-convex function of its input. Therefore, either we are
consistently lucky or, at least in this particular case (a network trained on MNIST digits
or natural images), the activation functions of the units tend to be more “unimodal”.
Figure 1: Activation maximization applied on MNIST. On the left side: visualization of 36
units from the first (1st column), second (2nd column) and third (3rd column) hidden layers of a
DBN (top) and SDAE (bottom), using the technique of maximizing the activation of the hidden
unit. On the right side: 4 examples of the solutions to the optimization problem for units in the
3rd layer of the SDAE, from 9 random initializations.
⁵ We are indeed choosing our hyperparameters based on the supervised objective. This objective is
computed by using the unsupervised networks as initial parameters for supervised backpropagation. We chose to
select the hyperparameters based on the classification error because for this problem we do have an objective
criterion for comparing networks, which is not the case for the natural image data.
⁶ For RBMs, the reconstruction error is obtained by treating the RBM as an auto-encoder and computing
a deterministic value using either the KL divergence or the MSE, as appropriate. The reconstruction error of
the first layer RBM is used for model selection.
⁷ There is no constraint that the resulting values in $x$ be in the domain of the training/test set values.
For instance, we experimented with making sure that the values of $x$ are in [0, 1] (for MNIST), but this
produced worse results. On the other hand, the goal is to find a “filter”-like result and a constraint that this
“filter” is strictly in the same domain as the input image may not be necessary.
To further test the robustness of the activation maximization
method, we perform a sensitivity analysis in order to test
whether the units are selective to these patterns found by the
optimization routine, and whether these patterns strongly
activate other units as well. The figure on the right shows the
post-sigmoidal activation of unit $j$ (columns) when the input
to the network is the “optimal” pattern $i$ (rows), found by
our gradient procedure for unit $i$, normalized across columns
in order to eliminate the effect of units that are activated for
very many patterns in general. The strong values on the
diagonal suggest that the results of the optimization have
uncovered patterns that are mostly specific to a particular unit.
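A sensitivity matrix of this kind could be computed as in the following NumPy sketch for second-layer units. The exact normalization is not spelled out in the text; dividing each column by its sum is an assumption made here, as are the function names and the two-layer network layout.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def layer2_output(X, W1, b1, W2, b2):
    """Post-sigmoid activations of all second-layer units for each row of X."""
    H1 = sigmoid(b1 + X @ W1.T)
    return sigmoid(b2 + H1 @ W2.T)

def sensitivity_matrix(patterns, W1, b1, W2, b2):
    """Entry (i, j): response of unit j to the pattern found for unit i.

    Each column is divided by its sum (an assumed normalization), so that
    units responding strongly to every pattern do not dominate the picture.
    """
    A = layer2_output(patterns, W1, b1, W2, b2)
    return A / A.sum(axis=0, keepdims=True)
```

Here `patterns` is expected to hold one optimized input per unit, in unit order, so that a strongly diagonal result indicates unit-specific patterns.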
One important point is that, qualitatively speaking, the filters at the 3rd layer look
interpretable and quite complex. For MNIST, some look like pseudo-digits. In the
case of natural images, we can observe grating filters at the second layer of DBNs
and complicated units that detect, for instance, corners at the second and third layer of
SDAE; some of the units have the same characteristics that we would associate with
so-called complex cells. It suggests that higher level units did indeed learn meaningful
combinations of lower level features.
Note that the first layer filters obtained by the SDAE when trained on natural
images are Gabor-like features. It is interesting that in the case of DBN, the filters that
minimized the reconstruction error⁸, i.e. those that are pictured in Figure 2 (top-left
corner), do not have the same low-frequency and sparsity properties as the ones found
by the first-level denoising auto-encoder⁹. Yet at the second layer, the filters found by
activation maximization are a mixture of Gabor-like features and grating filters. This
shows that appearances can be deceiving: we might have dismissed the RBM whose
weights are shown in Figure 2 as a bad model of natural images had we looked only at
the first layer filters, but the global qualitative assessment of this model, which includes
the visualization of the second and third layers, points to the fact that the 3-layer DBN
is in effect learning something quite interesting.
⁸ Which is only a proxy for the actual objective function that is minimized by a stack of RBMs.
Figure 2: On the left side: Visualization of 144 units from the first (1st column), second (2nd
column) and third (3rd column) hidden layers of a DBN (top) and an SDAE (bottom), using the
technique of maximizing the activation of the hidden unit. On the right side: 4 examples of the
solutions to the optimization problem for units in the 3rd layer of the SDAE, subject to 9 random
initializations.
⁹ It is possible to obtain Gabor-like features with RBMs (work by Osindero and Hinton (2008) shows
this), but in our case these filters were never those that minimized the reconstruction error of an RBM. This
points to a larger issue: it appears that using different learning rates for Contrastive Divergence learning will
induce features that are qualitatively different, depending on the value of the learning rate.
6.3 Sampling a unit
We now turn to the sampling technique described in Section 4. Figure 3 shows
samples obtained by clamping a second layer unit to 1; both MNIST and natural
image patches are considered. In the case of natural image patches, the distributions are
roughly unimodal, in that the samples are of the same pattern, for a given unit. For
MNIST, the situation is slightly more delicate: there seem to be one or two modes for
each unit. The average input (the expectation of the distribution), as seen in Figure 4,
then looks like a digit or a superposition of two digits.
Figure 3: Visualization of 6 units from the second hidden layer of a DBN trained on MNIST
(left) and natural image patches (right). The visualizations are produced by sampling from the
DBN and clamping the respective unit to 1. Each unit’s distribution is a row of samples; the mean
of each row is in the first column of Figure 4 (left).
Of note is the fact that unlike the results of the activation maximization method, the
samples are much more likely to be part of the underlying distribution of examples
(digits or patches). The activation maximization method seems to produce features
and it is up to us to decide which examples would “fit” these features; the sampling
method produces examples and it lets us decide which features these examples have in
common. In this respect, the two techniques serve complementary purposes.
6.4 Comparison of methods
In Figure 4, we can see a comparison of the three techniques, including the linear
combination method. The methods are tested on the second layer of a DBN trained
on MNIST. In the above, we noted links between the three techniques. The
experiments show that many of the filters found by the three methods share some features,
but have a different nature. Unfortunately, we do not have an objective measure that
would allow us to compare the three methods, but visually the activation maximization
method seems to produce more interesting results: by comparison, the average
samples from the DBN are almost always in the shape of a digit (for MNIST), while the
linear combination method seems to find only parts of the features that are found by
activation maximization, which tends to find sharper patterns.
6.5 Limitations
We tested the activation maximization procedure on image patches of 20 × 20 pixels
(instead of 12 × 12) and found that the optimization does not converge to a single global
maximum. Moreover, the input distribution that is sampled with the units clamped to
1 has many different modes and its expectation is not meaningful or interpretable
anymore. We posit that these methods break down because of the complexity of the input
distribution: both MNIST and 12 × 12 image patches are relatively simple
distributions to model and this could be the reason these methods work in the first place. It is
perhaps unrealistic to expect that as we scale the datasets to larger and larger images,
one could still find a simple representation of a higher layer unit. We should note,
however, that there is a recent trend of developing convolutional versions of deep
architectures (Kavukcuoglu et al., 2009; Lee et al., 2009; Desjardins & Bengio, 2008): it
is likely that one will be able to apply the same techniques in that scenario and still be
able to recover good visualizations, even with large inputs.
7 Conclusions and Future Work
We started from a simple premise: to better understand the solution that is learned and
represented by a deep architecture, by investigating the response of individual units in
the network. Like the analysis of individual neurons in the brain by neuroscientists,
this approach has limitations, but we hope that such visualization techniques can help
understand the nature of the functions learned by the network.
We presented three simple techniques: activation maximization and sampling from
a unit are both new (to the best of our knowledge), while the linear combination
technique had been previously published. We showed the intuitive similarities between
them and compared and contrasted them on two well-known datasets. Our results
confirm the intuitions that we had about the hierarchical representations learned by deep
architectures: namely that the higher layer units represent features that are
(meaningfully) more complicated and that correspond to combinations of features of the lower
layers. We have also found that the two deep architectures considered learn quite
different features.
Figure 4: Visualization of 36 units from the second hidden layer of a DBN trained on MNIST
(top) and 144 units from the second hidden layer of a DBN trained on natural image patches
(bottom). Left: sampling with clamping, Centre: linear combination of previous layer filters,
Right: maximizing the activation of the unit. Black is negative, white is positive and gray is
zero.
The same procedures can be applied to the weights obtained after supervised
learning and the observations are similar: convergence occurs and features seem more
complicated at higher layers. In the future, we would like to use such visualization tools
to compare the features learned by these networks after supervised learning, in order
to better understand the differences in test error and to further understand the influence
of unsupervised initialization in training deep models. We will also extend our results
by performing experiments with other datasets and models, such as Convolutional
Networks applied to higher-resolution natural images. Finally, we would like to compare
the behaviour of higher level units in a deep network to features that are presumed to
be encoded by the higher levels of the visual cortex.
References
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in
Machine Learning, 2, Issue 1.
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise
training of deep networks. Advances in Neural Information Processing Systems 19
(NIPS’06) (pp. 153–160). MIT Press.
Desjardins, G., & Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for
vision (Technical Report 1327). Dept. IRO, U. Montreal.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive
divergence. Neural Computation, 14, 1771–1800.
Hinton, G. E., Osindero, S., & Teh, Y. (2006a). A fast learning algorithm for deep
belief nets. Neural Computation, 18, 1527–1554.
Hinton, G. E., Osindero, S., Welling, M., & Teh, Y. (2006b). Unsupervised discovery
of non-linear structure using contrastive backpropagation. Cognitive Science, 30.
Kavukcuoglu, K., Ranzato, M., Fergus, R., & LeCun, Y. (2009). Learning invariant
features through topographic filter maps. Proceedings of the Computer Vision and
Pattern Recognition Conference (CVPR’09). IEEE.
Larochelle, H., Bengio, Y., Louradour, J., & Lamblin, P. (2009). Exploring strategies
for training deep neural networks. Journal of Machine Learning Research, 10, 1–40.
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An
empirical evaluation of deep architectures on problems with many factors of variation.
Proceedings of the Twenty-fourth International Conference on Machine Learning
(ICML’07) (pp. 473–480). Corvallis, OR: ACM.
Lee, H., Ekanadham, C., & Ng, A. (2008). Sparse deep belief net model for visual
area V2. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in Neural
Information Processing Systems 20 (NIPS’07). Cambridge, MA: MIT Press.
Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In L.
Bottou and M. Littman (Eds.), Proceedings of the Twenty-sixth International Conference
on Machine Learning (ICML’09). Montreal (Qc), Canada: ACM.
Loosli, G., Canu, S., & Bottou, L. (2007). Training invariant support vector machines
using selective sampling. In L. Bottou, O. Chapelle, D. DeCoste and J. Weston
(Eds.), Large scale kernel machines, 301–320. Cambridge, MA: MIT Press.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field
properties by learning a sparse code for natural images. Nature, 381, 607–609.
Osindero, S., & Hinton, G. E. (2008). Modeling image patches with a directed
hierarchy of Markov random fields. Advances in Neural Information Processing Systems
20 (NIPS’07) (pp. 1121–1128). Cambridge, MA: MIT Press.
Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2008). Sparse feature learning for deep
belief networks. Advances in Neural Information Processing Systems 20 (NIPS’07)
(pp. 1185–1192). Cambridge, MA: MIT Press.
Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of
sparse representations with an energy-based model. Advances in Neural Information
Processing Systems 19 (NIPS’06) (pp. 1137–1144). MIT Press.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and
composing robust features with denoising autoencoders. Proceedings of the Twenty-
fifth International Conference on Machine Learning (ICML’08) (pp. 1096–1103).
ACM.
Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised
embedding. Proceedings of the Twenty-fifth International Conference on Machine
Learning (ICML’08) (pp. 1168–1175). New York, NY, USA: ACM.