Extracting and Composing Robust Features with
Denoising Autoencoders
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol
Dept. IRO, Université de Montréal
C.P. 6128, Montreal, Qc, H3C 3J7, Canada
http://www.iro.umontreal.ca/lisa
Technical Report 1316, February 2008
Abstract
Previous work has shown that the difficulties in learning deep genera-
tive or discriminative models can be overcome by an initial unsupervised
learning step that maps inputs to useful intermediate representations. We
introduce and motivate a new training principle for unsupervised learning
of a representation based on the idea of making the learned representa-
tions robust to partial corruption of the input pattern. This approach can
be used to train autoencoders, and these denoising autoencoders can be
stacked to initialize deep architectures. The algorithm can be motivated
from a manifold learning and information theoretic perspective or from a
generative model perspective. Comparative experiments clearly show the
surprising advantage of corrupting the input of autoencoders on a pattern
classification benchmark suite.
1 Introduction
Recent theoretical studies indicate that deep architectures (Bengio & Le Cun,
2007; Bengio, 2007) may be needed to efficiently model complex distributions
and achieve better generalization performance on challenging recognition tasks.
The belief that additional levels of functional composition will yield increased
representational and modeling power is not new (McClelland et al., 1986; Hin-
ton, 1989; Utgoff & Stracuzzi, 2002). However, in practice, learning in deep
architectures has proven to be difficult. One needs only to ponder the diffi-
cult problem of inference in deep directed graphical models, due to “explaining
away”. Also looking back at the history of multi-layer neural networks, their
difficult optimization (Bengio et al., 2007; Bengio, 2007) has long prevented
reaping the expected benefits of going beyond one or two hidden layers. How-
ever this situation has recently changed with the successful approach of (Hinton
et al., 2006; Hinton & Salakhutdinov, 2006; Bengio et al., 2007; Ranzato et al.,
2007; Lee et al., 2008) for training Deep Belief Networks and stacked autoen-
coders.
One key ingredient to this success appears to be the use of an unsupervised
training criterion to perform a layer-by-layer initialization: each layer is at first
trained to produce a higher level (hidden) representation of the observed pat-
terns, based on the representation it receives as input from the layer below, by
optimizing a local unsupervised criterion. Each level produces a representation
of the input pattern that is more abstract than the previous level’s, because it
is obtained by composing more operations. This initialization yields a starting
point, from which a global fine-tuning of the model’s parameters is then per-
formed using another training criterion appropriate for the task at hand. This
technique has been shown empirically to avoid getting stuck in the kind of poor
solutions one typically reaches with random initializations. While unsupervised
learning of a mapping that produces “good” intermediate representations of
the input pattern seems to be key, little is understood regarding what consti-
tutes “good” representations for initializing deep architectures, or what explicit
criteria may guide learning such representations. We know of only a few algo-
rithms that seem to work well for this purpose: Restricted Boltzmann Machines
(RBMs) trained with contrastive divergence on one hand, and various types of
autoencoders on the other.
The present research begins with the question of what explicit criteria a good
intermediate representation should satisfy. Obviously, it should at a minimum
retain a certain amount of “information” about its input, while at the same time
being constrained to a given form (e.g. a real-valued vector of a given size in the
case of an autoencoder). A supplemental criterion that has been proposed for
such models is sparsity of the representation (Ranzato et al., 2008; Lee et al.,
2008). Here we hypothesize and investigate an additional specific criterion:
robustness to partial destruction of the input, i.e., partially destroyed
inputs should yield almost the same representation. It is motivated by the
following informal reasoning: a good representation is expected to capture stable
structures in the form of dependencies and regularities characteristic of the
(unknown) distribution of its observed input. For high dimensional redundant
input (such as images) at least, such structures are likely to depend on evidence
gathered from a combination of many input dimensions. They should thus be
recoverable from partial observation only. A hallmark of this is our human
ability to recognize partially occluded or corrupted images. Further evidence is
our ability to form a high-level concept associated with multiple modalities (such
as image and sound) and recall it even when some of the modalities are missing.
To validate our hypothesis and assess its usefulness as one of the guiding
principles in learning deep architectures, we propose a modification to the au-
toencoder framework to explicitly integrate robustness to partially destroyed
inputs. Section 2 describes the algorithm in detail. Section 3 discusses links
with other approaches in the literature. Section 4 is devoted to a closer inspec-
tion of the model from different theoretical standpoints. In section 5 we verify
empirically whether the algorithm leads to a difference in performance. Section 6
concludes the study.
2 Description of the Algorithm
2.1 Notation and Setup
Let $X$ and $Y$ be two random variables with joint probability density $p(X,Y)$,
with marginal distributions $p(X)$ and $p(Y)$. Throughout the text, we will
use the following notation. Expectation: $\mathbb{E}_{p(X)}[f(X)] = \int p(x) f(x)\,dx$.
Entropy: $\mathbb{H}(X) = \mathbb{H}(p) = \mathbb{E}_{p(X)}[-\log p(X)]$.
Conditional entropy: $\mathbb{H}(X|Y) = \mathbb{E}_{p(X,Y)}[-\log p(X|Y)]$.
Kullback-Leibler divergence: $D_{\mathrm{KL}}(p \| q) = \mathbb{E}_{p(X)}[\log \frac{p(X)}{q(X)}]$.
Cross-entropy: $\mathbb{H}(p \| q) = \mathbb{E}_{p(X)}[-\log q(X)] = \mathbb{H}(p) + D_{\mathrm{KL}}(p \| q)$.
Mutual information: $I(X;Y) = \mathbb{H}(X) - \mathbb{H}(X|Y)$.
Sigmoid: $s(x) = \frac{1}{1+e^{-x}}$ and $s(\mathbf{x}) = (s(x_1), \ldots, s(x_d))^T$.
Bernoulli distribution with mean $\mu$: $\mathcal{B}_{\mu}(x)$, and by extension
$\mathcal{B}_{\boldsymbol{\mu}}(\mathbf{x}) = (\mathcal{B}_{\mu_1}(x_1), \ldots, \mathcal{B}_{\mu_d}(x_d))$.
The setup we consider is the typical supervised learning setup with a training
set of $n$ (input, target) pairs $D_n = \{(\mathbf{x}^{(1)}, t^{(1)}), \ldots, (\mathbf{x}^{(n)}, t^{(n)})\}$, that we suppose
to be an i.i.d. sample from an unknown distribution $q(X,T)$ with corresponding
marginals $q(X)$ and $q(T)$.
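To make the notation concrete, the following is a minimal NumPy sketch (our illustration, not code from the paper) of the sigmoid and of the Bernoulli cross-entropy that serves as reconstruction loss later on; the function names are our own.

```python
import numpy as np

def sigmoid(a):
    # Element-wise logistic sigmoid s(a) = 1 / (1 + exp(-a)).
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_cross_entropy(x, z, eps=1e-12):
    # H(B_x || B_z) = -sum_k [ x_k log z_k + (1 - x_k) log(1 - z_k) ]
    # for a single example with d components; eps guards against log(0).
    z = np.clip(z, eps, 1.0 - eps)
    return -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z))
```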
2.2 The Basic Autoencoder
We begin by recalling the traditional autoencoder model such as the one used
in (Bengio et al., 2007) to build deep networks. An autoencoder takes an input
vector $\mathbf{x} \in [0,1]^d$, and first maps it to a hidden representation $\mathbf{y} \in [0,1]^{d'}$
through a deterministic mapping $\mathbf{y} = f_\theta(\mathbf{x}) = s(\mathbf{W}\mathbf{x} + \mathbf{b})$, parameterized by
$\theta = \{\mathbf{W}, \mathbf{b}\}$. $\mathbf{W}$ is a $d' \times d$ weight matrix and $\mathbf{b}$ is a bias vector. The resulting
latent representation $\mathbf{y}$ is then mapped back to a "reconstructed" vector $\mathbf{z} \in [0,1]^d$
in input space: $\mathbf{z} = g_{\theta'}(\mathbf{y}) = s(\mathbf{W}'\mathbf{y} + \mathbf{b}')$ with $\theta' = \{\mathbf{W}', \mathbf{b}'\}$. The weight
matrix $\mathbf{W}'$ of the reverse mapping may optionally be constrained by $\mathbf{W}' = \mathbf{W}^T$,
in which case the autoencoder is said to have tied weights. Each training $\mathbf{x}^{(i)}$ is
thus mapped to a corresponding $\mathbf{y}^{(i)}$ and a reconstruction $\mathbf{z}^{(i)}$. The parameters
of this model are optimized to minimize the average reconstruction error:
$$\theta^\star, \theta'^\star = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L\left(\mathbf{x}^{(i)}, \mathbf{z}^{(i)}\right) = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L\left(\mathbf{x}^{(i)}, g_{\theta'}(f_\theta(\mathbf{x}^{(i)}))\right) \quad (1)$$
where $L$ is a loss function such as the traditional squared error $L(\mathbf{x}, \mathbf{z}) = \|\mathbf{x} - \mathbf{z}\|^2$.
An alternative loss, suggested by the interpretation of $\mathbf{x}$ and $\mathbf{z}$ as either bit
vectors or vectors of bit probabilities (Bernoullis), is the reconstruction cross-entropy:
$$L_{\mathbb{H}}(\mathbf{x}, \mathbf{z}) = \mathbb{H}(\mathcal{B}_{\mathbf{x}} \| \mathcal{B}_{\mathbf{z}}) = -\sum_{k=1}^{d} \left[ x_k \log z_k + (1 - x_k) \log(1 - z_k) \right] \quad (2)$$
Figure 1: An example $\mathbf{x}$ is corrupted to $\tilde{\mathbf{x}}$ by $q_{\mathcal{D}}$. The autoencoder then maps $\tilde{\mathbf{x}}$ to $\mathbf{y}$
(via $f_\theta$) and attempts to reconstruct $\mathbf{x}$ (via $g_{\theta'}$). [Figure: schematic of the corruption, encoding and decoding steps.]
Note that if $\mathbf{x}$ is a binary vector, $L_{\mathbb{H}}(\mathbf{x}, \mathbf{z})$ is a negative log-likelihood for the
example $\mathbf{x}$, given the Bernoulli parameters $\mathbf{z}$. Equation 1 with $L = L_{\mathbb{H}}$ can be
written
$$\theta^\star, \theta'^\star = \arg\min_{\theta, \theta'} \mathbb{E}_{q^0(X)}\left[ L_{\mathbb{H}}\left(X, g_{\theta'}(f_\theta(X))\right) \right] \quad (3)$$
where $q^0(X)$ denotes the empirical distribution associated to our $n$ training
inputs. This optimization will typically be carried out by stochastic gradient
descent.
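For illustration only, a minimal NumPy sketch of such an autoencoder's forward pass and reconstruction cross-entropy is given below; the layer sizes and initialization are our own choices, and the gradient step for eq. 1 / eq. 3, typically obtained by backpropagation, is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

d, d_hidden = 784, 256                            # input and hidden dimensions (illustrative values)
W = 0.01 * rng.standard_normal((d_hidden, d))     # encoder parameters theta = {W, b}
b = np.zeros(d_hidden)
W_prime = W.T                                     # tied weights: W' = W^T (optional constraint)
b_prime = np.zeros(d)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x):
    # y = f_theta(x) = s(W x + b)
    return sigmoid(W @ x + b)

def decode(y):
    # z = g_theta'(y) = s(W' y + b')
    return sigmoid(W_prime @ y + b_prime)

def reconstruction_cross_entropy(x, z, eps=1e-12):
    z = np.clip(z, eps, 1.0 - eps)
    return -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z))

x = rng.random(d)                                 # a stand-in input in [0, 1]^d
loss = reconstruction_cross_entropy(x, decode(encode(x)))
```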
2.3 The Denoising Autoencoder
To test our hypothesis and enforce robustness to partially destroyed inputs we
modify the basic autoencoder we just described. We will now train it to reconstruct
a clean "repaired" input from a corrupted, partially destroyed one. This
is done by first corrupting the initial input $\mathbf{x}$ to get a partially destroyed version
$\tilde{\mathbf{x}}$ by means of a stochastic mapping $\tilde{\mathbf{x}} \sim q_{\mathcal{D}}(\tilde{\mathbf{x}}|\mathbf{x})$. In our experiments, we
considered the following corrupting process, parameterized by the desired proportion
$\nu$ of "destruction": for each input $\mathbf{x}$, a fixed number $\nu d$ of components
are chosen at random, and their value is forced to 0, while the others are left
untouched. The procedure can be viewed as replacing a component considered
missing by a default value, which is a common technique. A motivation for
zeroing the destroyed components is that it simulates the removal of these components
from the input. For images on a white (0) background, this corresponds
to "salt noise". Note that alternative corrupting noises could be considered (the
approach we describe and our analysis are not specific to a particular kind of
corrupting noise).
The corrupted input $\tilde{\mathbf{x}}$ is then mapped, as with the basic autoencoder, to a
hidden representation $\mathbf{y} = f_\theta(\tilde{\mathbf{x}}) = s(\mathbf{W}\tilde{\mathbf{x}} + \mathbf{b})$ from which we reconstruct a
$\mathbf{z} = g_{\theta'}(\mathbf{y}) = s(\mathbf{W}'\mathbf{y} + \mathbf{b}')$ (see figure 1 for a schematic representation of the
process). As before the parameters are trained to minimize the average reconstruction
error $L_{\mathbb{H}}(\mathbf{x}, \mathbf{z}) = \mathbb{H}(\mathcal{B}_{\mathbf{x}} \| \mathcal{B}_{\mathbf{z}})$ over a training set, i.e. to have $\mathbf{z}$ as close
as possible to the uncorrupted input $\mathbf{x}$. But the key difference is that $\mathbf{z}$ is now
a deterministic function of $\tilde{\mathbf{x}}$ rather than $\mathbf{x}$, and thus the result of a stochastic
mapping of $\mathbf{x}$.
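A minimal sketch of this particular corruption process, zeroing a fixed number νd of randomly chosen components, might look as follows (the function name and NumPy usage are our own illustration, not the authors' code).

```python
import numpy as np

def corrupt(x, nu, rng=None):
    # Stochastic mapping x_tilde ~ q_D(x_tilde | x): choose a fixed number
    # nu * d of components at random and force their value to 0.
    if rng is None:
        rng = np.random.default_rng()
    d = x.shape[0]
    idx = rng.choice(d, size=int(nu * d), replace=False)
    x_tilde = x.copy()
    x_tilde[idx] = 0.0
    return x_tilde
```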
Let us define the joint distribution
$$q^0(X, \tilde{X}, Y) = q^0(X)\, q_{\mathcal{D}}(\tilde{X}|X)\, \delta_{f_\theta(\tilde{X})}(Y) \quad (4)$$
where $\delta_u(v)$ puts mass 0 when $u \neq v$. Thus $Y$ is a deterministic function of
$\tilde{X}$, and $q^0(X, \tilde{X}, Y)$ is parameterized by $\theta$. The objective function minimized by
stochastic gradient descent becomes:
$$\arg\min_{\theta, \theta'} \mathbb{E}_{q^0(X, \tilde{X})}\left[ L_{\mathbb{H}}\left(X, g_{\theta'}(f_\theta(\tilde{X}))\right) \right]. \quad (5)$$
So from the point of view of the stochastic gradient descent algorithm, in addition
to picking an input sample from the training set, we will also produce a
random corrupted version of it, and take a gradient step towards reconstructing
the uncorrupted version from the corrupted version. Note that in this way, the
autoencoder cannot learn the identity, unlike the basic autoencoder, thus removing
the constraint that $d' < d$ or the need to regularize specifically to avoid
such a trivial solution.
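Putting the pieces together, one stochastic gradient step on the criterion of eq. 5 could be sketched as follows. This is a hedged, self-contained NumPy example with untied weights; the gradient expressions are the standard ones for a sigmoid output paired with cross-entropy, and all sizes and the learning rate are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden, lr, nu = 784, 256, 0.1, 0.25       # illustrative sizes, learning rate, destruction level

W  = 0.01 * rng.standard_normal((d_hidden, d)); b  = np.zeros(d_hidden)   # theta  = {W, b}
Wp = 0.01 * rng.standard_normal((d, d_hidden)); bp = np.zeros(d)          # theta' = {W', b'}

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def denoising_sgd_step(x):
    """One stochastic gradient step towards reconstructing x from a corrupted x_tilde."""
    global W, b, Wp, bp
    # Corrupt: zero nu * d randomly chosen components (the q_D of the text).
    x_tilde = x.copy()
    x_tilde[rng.choice(d, size=int(nu * d), replace=False)] = 0.0
    # Forward pass on the corrupted input.
    y = sigmoid(W @ x_tilde + b)        # y = f_theta(x_tilde)
    z = sigmoid(Wp @ y + bp)            # z = g_theta'(y)
    # Backpropagate the cross-entropy loss L_H(x, z); for a sigmoid output
    # with cross-entropy, the error at the output pre-activation is (z - x).
    dz = z - x
    dWp, dbp = np.outer(dz, y), dz
    dy = (Wp.T @ dz) * y * (1.0 - y)
    dW, db = np.outer(dy, x_tilde), dy
    # Gradient descent update of all parameters.
    W  -= lr * dW;  b  -= lr * db
    Wp -= lr * dWp; bp -= lr * dbp

x = rng.random(d)                       # a stand-in training input in [0, 1]^d
denoising_sgd_step(x)
```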
2.4 Layer-wise Initialization and Fine Tuning
The basic autoencoder has been used as a building block to train deep net-
works (Bengio et al., 2007), with the representation of the k-th layer used as
input for the (k+1)-th, and the (k+1)-th layer trained after the k-th has been
trained. After a few layers have been trained, the parameters are used as initial-
ization for a network optimized with respect to a supervised training criterion.
This greedy layer-wise procedure has been shown to yield significantly better
local minima than random initialization of deep networks (Bengio et al., 2007),
achieving better generalization on a number of tasks (Larochelle et al., 2007).
The procedure to train a deep network using the denoising autoencoder is
similar. The only difference is how each layer is trained, i.e., to minimize the
criterion in eq. 5 instead of eq. 3. Note that the corruption process $q_{\mathcal{D}}$ is only
used during training, but not for propagating representations from the raw input
to higher-level representations. Note also that when layer $k$ is trained, it receives
as input the uncorrupted output of the previous layers.
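Below is a hedged sketch (our own, not the authors' code) of this greedy layer-wise procedure: each layer is a small denoising autoencoder trained on the clean outputs of the layers below, and the resulting encoders would then initialize a deep network for supervised fine-tuning, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class DenoisingAutoencoderLayer:
    """Minimal denoising autoencoder used as one building block of a stack."""
    def __init__(self, n_in, n_hidden, nu):
        self.W = 0.01 * rng.standard_normal((n_hidden, n_in))
        self.b = np.zeros(n_hidden)
        self.Wp = 0.01 * rng.standard_normal((n_in, n_hidden))
        self.bp = np.zeros(n_in)
        self.nu = nu

    def encode(self, x):
        return sigmoid(self.W @ x + self.b)

    def train_step(self, x, lr=0.1):
        # Corrupt, reconstruct, and take one gradient step on L_H(x, z).
        x_tilde = x.copy()
        x_tilde[rng.choice(x.shape[0], size=int(self.nu * x.shape[0]), replace=False)] = 0.0
        y = self.encode(x_tilde)
        z = sigmoid(self.Wp @ y + self.bp)
        dz = z - x
        dy = (self.Wp.T @ dz) * y * (1.0 - y)
        self.Wp -= lr * np.outer(dz, y); self.bp -= lr * dz
        self.W  -= lr * np.outer(dy, x_tilde); self.b -= lr * dy

def pretrain_stack(X, layer_sizes, nu, n_epochs=10):
    """Greedy layer-wise pre-training. X has shape (n_examples, layer_sizes[0]).
    Layer k+1 is trained on the *uncorrupted* outputs of the already-trained
    layers below; corruption is only used inside each layer's own training."""
    layers, H = [], X
    for n_in, n_hidden in zip(layer_sizes[:-1], layer_sizes[1:]):
        layer = DenoisingAutoencoderLayer(n_in, n_hidden, nu)
        for _ in range(n_epochs):
            for h in H:
                layer.train_step(h)
        layers.append(layer)
        H = np.array([layer.encode(h) for h in H])   # propagate clean representations upward
    return layers
```

In an actual experiment, the parameters returned by such a procedure would initialize the hidden layers of a deep network, which is then fine-tuned with a supervised criterion.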
3 Relationship to Other Approaches
Our training procedure for the denoising autoencoder involves learning to re-
cover a clean input from a corrupted version, a task known as denoising. The
problem of image denoising, in particular, has been extensively studied in the
image processing community and many recent developments rely on machine
learning approaches (see e.g. Roth and Black (2005); Elad and Aharon (2006);
Hammond and Simoncelli (2007)). A particular form of gated autoencoders has
also been used for denoising in Memisevic (2007). Denoising using autoencoders
was actually introduced much earlier (LeCun, 1987; Gallinari et al., 1987), as
an alternative to Hopfield models (Hopfield, 1982). Our objective however is
fundamentally different from that of developing a competitive image denoising
algorithm. We investigate explicit robustness to corrupting noise only as a cri-
terion to guide the learning of suitable intermediate representations, with the
goal to build a better general purpose learning algorithm. Thus our corrup-
tion+denoising procedure is applied not only on the input, but also recursively
to intermediate representations. It is not specific to images and does not use
prior knowledge of image topology.
Whereas the proposed approach does not rely on prior knowledge, it bears
resemblance to the well known technique of augmenting the training data with
stochastically “transformed” patterns. But again we do not rely on prior knowl-
edge. Moreover we only use the corrupted patterns to optimize an unsupervised
criterion, as an initialization step.
There are also similarities with the work of (Doi et al., 2006) on robust
coding over noisy channels. In their framework, a linear encoder is to encode
a clean input for optimal transmission over a noisy channel to a decoder that
reconstructs the input. This work was later extended to robustness to noise in
the input, in a proposal for a model of retinal coding (Doi & Lewicki, 2007).
Though some of the inspiration behind our work comes from neural coding
and computation, our goal is not to account for experimental data of neuronal
activity as in (Doi & Lewicki, 2007). Also, the non-linearity of our denoising
autoencoder is crucial for its use in initializing a deep neural network.
It may be objected that, if our goal is to handle missing values correctly,
we could have more naturally defined a proper latent variable generative model,
and infer the posterior over the latent (hidden) representation in the presence
of missing inputs. But this usually requires a costly marginalization (as in the
case of RBMs, where it is exponential in the number of missing values), which has
to be carried out for each new example. By contrast, our approach tries to learn
a fast and robust deterministic mapping $f_\theta$ from examples of already corrupted
inputs. The burden is on learning such a constrained mapping during training,
rather than on unconstrained inference at use time. We expect this may force
the model to capture implicit invariances in the data, and result in interesting
features. Also note that in section 4.4 we will see how our learning algorithm
for the denoising autoencoder can be viewed as a form of variational inference
in a particular generative model.
4 Analysis of the Denoising Autoencoder
The above intuitive motivation for the denoising autoencoder was given with the
perspective of discovering robust representations. In the following, which can
be skipped without hurting the remainder of the paper, we try to gain insight
by considering several alternative perspectives on the algorithm.
4.1 Manifold Learning Perspective
The process of mapping a corrupted example to an uncorrupted one can be
visualized in Figure 2, with a low-dimensional manifold near which the data
concentrate. We learn a stochastic operator $p(X|\tilde{X})$ that maps an $\tilde{X}$ to an $X$,
$p(X|\tilde{X}) = \mathcal{B}_{g_{\theta'}(f_\theta(\tilde{X}))}(X)$. The corrupted examples will be much more likely to
be outside and farther from the manifold than the uncorrupted ones. Hence the
stochastic operator $p(X|\tilde{X})$ learns a map that tends to go from lower probability
points $\tilde{X}$ to high probability points $X$, generally on or near the manifold. Note
that when $\tilde{X}$ is farther from the manifold, $p(X|\tilde{X})$ should learn to make bigger
steps, to reach the manifold. At the limit we see that the operator should map
even far away points to a small volume near the manifold.

Figure 2: Illustration of what the denoising autoencoder is trying to learn. Suppose
training data (crosses) concentrate near a low-dimensional manifold. A corrupted
example (circle) is obtained by applying a corruption process $q_{\mathcal{D}}(\tilde{X}|X)$
(left side). Corrupted examples (circles) are typically outside and farther from
the manifold, hence the model learns with $p(X|\tilde{X})$ to map points to more likely
points (right side). Mapping from more corrupted examples requires bigger
jumps (longer dashed arrows).
The denoising autoencoder can thus be seen as a way to define and learn a
manifold. The intermediate representation $Y = f(X)$ can be interpreted as a
coordinate system for points on the manifold (this is most clear if we force the
dimension of $Y$ to be smaller than the dimension of $X$). More generally, one can
think of $Y = f(X)$ as a representation of $X$ which is well suited to capture the
main variations in the data, i.e., on the manifold. When additional criteria (such
as sparsity) are introduced in the learning model, one can no longer directly view
$Y = f(X)$ as an explicit low-dimensional coordinate system for points on the
manifold, but it retains the property of capturing the main factors of variation
in the data.
4.2 The Stochastic Operator Perspective
The denoising autoencoder can be seen as corresponding to a semi-parametric
model from which we can sample. Let us augment the set of modeled random
variables to include the corrupted example $\tilde{X}$ in addition to the corresponding
uncorrupted example $X$, and let us perform maximum likelihood training on
a model of their joint. We consider here the simpler case where $X$ is discrete,
but the approach can be generalized. We define a joint distribution
$p(X, \tilde{X}) = p(\tilde{X})\, p(X|\tilde{X})$ from the stochastic operator $p(X|\tilde{X})$, with marginal
$p(\tilde{X}) = q^0(\tilde{X})$ set by construction.
We now have an empirical distribution $q^0$ and a model $p$ on $(X, \tilde{X})$ pairs.
Performing maximum likelihood on them, or minimizing $D_{\mathrm{KL}}(q^0(X, \tilde{X}) \| p(X, \tilde{X}))$,
is a reasonable training objective, again yielding the denoising criterion in eq. 5.
As an additional motivation for minimizing $D_{\mathrm{KL}}(q^0(X, \tilde{X}) \| p(X, \tilde{X}))$, note
that as we minimize it (i.e., $D_{\mathrm{KL}}(q^0(X, \tilde{X}) \| p(X, \tilde{X})) \to 0$), the marginals of $p$
approach those of $q^0$, hence in particular
$$p(X) \to q^0(X),$$
i.e., training in this way corresponds to a semi-parametric model $p(X)$ which
approaches the empirical distribution $q^0(X)$. By applying the marginalization
definition for $p(X)$, we see what the corresponding model is:
$$p(X) = \frac{1}{n} \sum_{i=1}^{n} \sum_{\tilde{\mathbf{x}}} p(X | \tilde{X} = \tilde{\mathbf{x}})\, q_{\mathcal{D}}(\tilde{\mathbf{x}} | \mathbf{x}_i) \quad (6)$$
where $\mathbf{x}_i$ is one of the $n$ training examples. Note that only the parameters of
$p(X|\tilde{X})$ are optimized in this model. Note also that sampling from the model is
easy. We have thus seen that the denoising autoencoder learns a semi-parametric
model which can be sampled from, based on the stochastic operator $p(X|\tilde{X})$.
What would happen if we were to apply this operator repeatedly? That
would define a chain $p_k(X)$ where $p_0(X = \mathbf{x}) = q^0(\tilde{X} = \mathbf{x})$, $p_1(X) = p(X)$
and $p_k(X) = \sum_{\tilde{\mathbf{x}}} p(X | \tilde{X} = \tilde{\mathbf{x}})\, p_{k-1}(\tilde{\mathbf{x}})$. If the operator were ergodic (which
is plausible, following the above argumentation), its fixed point would define
yet another distribution $\pi(X) = \lim_{k \to \infty} p_k(X)$, of which the semi-parametric
$p(X)$ would be a first-order approximation. The advantage of this formulation
is that $\pi(X)$ is purely parametric: it does not explicitly depend on the empirical
distribution $q^0$.
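As a hedged illustration of this sampling procedure (our own sketch, not code from the paper; `encode`, `decode` and `corrupt` stand for the trained mappings f_θ, g_θ' and q_D, e.g. the helpers sketched in section 2):

```python
import numpy as np

def sample_chain(train_X, encode, decode, corrupt, n_steps=1, rng=None):
    """Sample from the semi-parametric model p(X) of eq. 6 (n_steps=1), or from
    the chain p_k(X) obtained by applying the operator p(X | X_tilde) repeatedly."""
    if rng is None:
        rng = np.random.default_rng()
    # p_0 is the distribution of corrupted training examples q^0(X_tilde):
    # pick a training example and corrupt it.
    x_tilde = corrupt(train_X[rng.integers(len(train_X))])
    for _ in range(n_steps):
        mean = decode(encode(x_tilde))                      # Bernoulli parameters g_theta'(f_theta(x_tilde))
        x = (rng.random(mean.shape) < mean).astype(float)   # draw X ~ B_mean
        x_tilde = x                                         # the sample becomes the next X_tilde
    return x
```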
4.3 Bottom-up Filtering, Information Theoretic Perspective
In this section we adopt a bottom-up filtering viewpoint, and an information-theoretic
perspective. Let $X, \tilde{X}, Y$ be random variables representing respectively
an input sample, a corrupted version of it, and the corresponding hidden representation,
i.e. $X \sim q(X)$ (the true generating process for $X$), $\tilde{X} \sim q_{\mathcal{D}}(\tilde{X}|X)$,
$Y = f_\theta(\tilde{X})$, with the associated joint $q(X, \tilde{X}, Y)$. Notice that it is defined with
the same dependency structure as $q^0(X, \tilde{X}, Y)$, and the same conditionals $\tilde{X}|X$
and $Y|\tilde{X}$.
The role of the greedy layer initialization is to learn a non-linear filter that
yields a good representation of its input for the next layer. A good repre-
sentation should retain a sufficient amount of “information” about its input,
while we might at the same time encourage its marginal distribution to dis-
play certain desirable properties. From this high level perspective, the filter
we learn, which has a fixed parameterized form with parameters θ, should be
optimized to realize a tradeoff between yielding a somewhat easier to model
marginal distribution and retaining as much information as possible about the
input. This can be formalized as optimizing an objective function of the form
$\arg\max_\theta \{ I(X;Y) + \lambda J(Y) \}$, where $J$ is a functional that induces some preference
over the marginal $q(Y)$, and hyper-parameter $\lambda$ controls the tradeoff. $J$
could for example encourage $D_{\mathrm{KL}}$ closeness to some reference prior distribution,
or be some measure of sparsity or independence. In the present study, however,
we shall suppose that $Y$ is only constrained by its dimensionality. This
corresponds to the case where $J(Y)$ is constant for all distributions (i.e. no
preference).
Let us thus examine only the first term. We have $I(X;Y) = \mathbb{H}(X) - \mathbb{H}(X|Y)$.
If we suppose the unknown input distribution $q(X)$ fixed, $\mathbb{H}(X)$ is a constant.
Maximizing mutual information then amounts to maximizing $-\mathbb{H}(X|Y)$, i.e.
$$\arg\max_{\theta} I(X;Y) = \arg\max_{\theta} -\mathbb{H}(X|Y)$$
$$\max_{\theta} -\mathbb{H}(X|Y) = \max_{\theta} \mathbb{E}_{q(X,Y)}[\log q(X|Y)] = \max_{\theta, p^\star} \mathbb{E}_{q(X,Y)}[\log p^\star(X|Y)]$$
where we optimize over all possible distributions $p^\star$. It is easy to show that the
maximum is obtained for $p^\star(X|Y) = q(X|Y)$. If instead we constrain $p^\star(X|Y)$
to a given parametric form $p(X|Y)$ parameterized by $\theta'$, we get a lower bound:
$$\max_{\theta} -\mathbb{H}(X|Y) \geq \max_{\theta, \theta'} \mathbb{E}_{q(X,Y)}[\log p(X|Y)]$$
where we have an equality if there exists a $\theta'$ such that $q(X|Y) = p(X|Y)$.
In the case of an autoencoder with binary vector variable $X$, we can view the
top-down reconstruction as representing a $p(X|Y) = \mathcal{B}_{g_{\theta'}(Y)}(X)$. Optimizing
for the lower bound leads to:
$$\max_{\theta, \theta'} \mathbb{E}_{q(X,Y)}[\log \mathcal{B}_{g_{\theta'}(Y)}(X)] = \max_{\theta, \theta'} \mathbb{E}_{q(X, \tilde{X})}[\log \mathcal{B}_{g_{\theta'}(f_\theta(\tilde{X}))}(X)] = \min_{\theta, \theta'} \mathbb{E}_{q(X, \tilde{X})}\left[ L_{\mathbb{H}}\left(X, g_{\theta'}(f_\theta(\tilde{X}))\right) \right]$$
where in the second equality we use the fact that $Y = f_\theta(\tilde{X})$ deterministically. This
is the criterion we use for training the autoencoder (eq. 5), but replacing the
true generating process $q$ by the empirical $q^0$.
This shows that minimizing the expected reconstruction error amounts to
maximizing a lower bound on $-\mathbb{H}(X|Y)$ and at the same time on the mutual
information between input $X$ and the hidden representation $Y$. Note that this
reasoning holds equally for the basic autoencoder, but with the denoising autoencoder,
we maximize the mutual information between $X$ and $Y$ even as $Y$
is a function of corrupted input.
4.4 Top-down, Generative Model Perspective
In this section we try to recover the training criterion for our denoising autoen-
coder (eq. 5) from a generative model perspective. Specifically we show that
training the denoising autoencoder as described in section 2.3 is equivalent to
maximizing a variational bound on a particular generative model.
Consider the generative model $p(X, \tilde{X}, Y) = p(Y)\, p(X|Y)\, p(\tilde{X}|X)$ where
$p(X|Y) = \mathcal{B}_{s(\mathbf{W}'Y + \mathbf{b}')}(X)$ and $p(\tilde{X}|X) = q_{\mathcal{D}}(\tilde{X}|X)$. $p(Y)$ is a uniform prior over
$Y$. This defines a generative model with parameter set $\theta' = \{\mathbf{W}', \mathbf{b}'\}$. We will
use the previously defined $q^0(X, \tilde{X}, Y) = q^0(X)\, q_{\mathcal{D}}(\tilde{X}|X)\, \delta_{f_\theta(\tilde{X})}(Y)$ (equation 4)
as an auxiliary model in the context of a variational approximation of the
log-likelihood of $p(\tilde{X})$. Note that we abuse notation to make it lighter, and use
the same letters $X$, $\tilde{X}$ and $Y$ for different sets of random variables representing
the same quantity under different distributions: $p$ or $q^0$. Keep in mind that
whereas we had the dependency structure $X \to \tilde{X} \to Y$ for $q$ or $q^0$, we have
$Y \to X \to \tilde{X}$ for $p$.
Since $p$ contains a corruption operation at the last generative stage, we
propose to fit $p(\tilde{X})$ to corrupted training samples. Performing maximum likelihood
fitting for samples drawn from $q^0(\tilde{X})$ corresponds to minimizing the
cross-entropy, or maximizing
$$\mathcal{H} = \max_{\theta'} \{ -\mathbb{H}(q^0(\tilde{X}) \| p(\tilde{X})) \} = \max_{\theta'} \{ \mathbb{E}_{q^0(\tilde{X})}[\log p(\tilde{X})] \}. \quad (7)$$
Let $q^\star(X, Y | \tilde{X})$ be a conditional density; the quantity
$$\mathcal{L}(q^\star, \tilde{X}) = \mathbb{E}_{q^\star(X, Y | \tilde{X})}\left[ \log \frac{p(X, \tilde{X}, Y)}{q^\star(X, Y | \tilde{X})} \right]$$
is a lower bound on $\log p(\tilde{X})$ since the following can be shown to be true for
any $q^\star$:
$$\log p(\tilde{X}) = \mathcal{L}(q^\star, \tilde{X}) + D_{\mathrm{KL}}\left( q^\star(X, Y | \tilde{X}) \,\|\, p(X, Y | \tilde{X}) \right).$$
Also it is easy to verify that the bound is tight when $q^\star(X, Y | \tilde{X}) = p(X, Y | \tilde{X})$,
where the $D_{\mathrm{KL}}$ becomes 0. We can thus write $\log p(\tilde{X}) = \max_{q^\star} \mathcal{L}(q^\star, \tilde{X})$, and
consequently rewrite equation 7 as
$$\mathcal{H} = \max_{\theta'} \{ \mathbb{E}_{q^0(\tilde{X})}[\max_{q^\star} \mathcal{L}(q^\star, \tilde{X})] \} = \max_{\theta', q^\star} \{ \mathbb{E}_{q^0(\tilde{X})}[\mathcal{L}(q^\star, \tilde{X})] \} \quad (8)$$
where we moved the maximization outside of the expectation because an unconstrained
$q^\star(X, Y | \tilde{X})$ can in principle perfectly model the conditional distribution
needed to maximize $\mathcal{L}(q^\star, \tilde{X})$ for any $\tilde{X}$. Now if we replace the maximization
over an unconstrained $q^\star$ by the maximization over the parameters $\theta$ of our $q^0$
(appearing in $f_\theta$ that maps an $\mathbf{x}$ to a $\mathbf{y}$), we get a lower bound on $\mathcal{H}$:
$$\mathcal{H} \geq \max_{\theta, \theta'} \{ \mathbb{E}_{q^0(\tilde{X})}[\mathcal{L}(q^0, \tilde{X})] \} \quad (9)$$
Maximizing this lower bound, we find
$$\arg\max_{\theta, \theta'} \{ \mathbb{E}_{q^0(\tilde{X})}[\mathcal{L}(q^0, \tilde{X})] \} = \arg\max_{\theta, \theta'} \mathbb{E}_{q^0(X, \tilde{X}, Y)}\left[ \log \frac{p(X, \tilde{X}, Y)}{q^0(X, Y | \tilde{X})} \right]$$
$$= \arg\max_{\theta, \theta'} \mathbb{E}_{q^0(X, \tilde{X}, Y)}\left[ \log p(X, \tilde{X}, Y) \right] + \mathbb{E}_{q^0(\tilde{X})}\left[ \mathbb{H}[q^0(X, Y | \tilde{X})] \right]$$
$$= \arg\max_{\theta, \theta'} \mathbb{E}_{q^0(X, \tilde{X}, Y)}\left[ \log p(X, \tilde{X}, Y) \right].$$
Note that $\theta$ only occurs in $Y = f_\theta(\tilde{X})$, and $\theta'$ only occurs in $p(X|Y)$. The last
line is therefore obtained because $q^0(X|\tilde{X}) \propto q_{\mathcal{D}}(\tilde{X}|X)\, q^0(X)$ (none of which
depends on $(\theta, \theta')$), and $q^0(Y|\tilde{X})$ is deterministic, i.e., its entropy is constant,
irrespective of $(\theta, \theta')$. Hence the entropy of $q^0(X, Y | \tilde{X}) = q^0(Y|\tilde{X})\, q^0(X|\tilde{X})$
does not vary with $(\theta, \theta')$. Finally, following from above, we obtain our training
criterion (eq. 5):
$$\arg\max_{\theta, \theta'} \mathbb{E}_{q^0(\tilde{X})}[\mathcal{L}(q^0, \tilde{X})] = \arg\max_{\theta, \theta'} \mathbb{E}_{q^0(X, \tilde{X}, Y)}\left[ \log[p(Y)\, p(X|Y)\, p(\tilde{X}|X)] \right]$$
$$= \arg\max_{\theta, \theta'} \mathbb{E}_{q^0(X, \tilde{X}, Y)}[\log p(X|Y)] = \arg\max_{\theta, \theta'} \mathbb{E}_{q^0(X, \tilde{X})}[\log p(X | Y = f_\theta(\tilde{X}))]$$
$$= \arg\min_{\theta, \theta'} \mathbb{E}_{q^0(X, \tilde{X})}\left[ L_{\mathbb{H}}\left(X, g_{\theta'}(f_\theta(\tilde{X}))\right) \right]$$
where the second equality is obtained because $(\theta, \theta')$ have no influence on
$\mathbb{E}_{q^0(X, \tilde{X}, Y)}[\log p(Y)]$ because we chose $p(Y)$ uniform, i.e. constant, nor on
$\mathbb{E}_{q^0(X, \tilde{X})}[\log p(\tilde{X}|X)]$, and the last equality is obtained by inspection of the
definition of $L_{\mathbb{H}}$ in eq. 2, when $p(X | Y = f_\theta(\tilde{X}))$ is a $\mathcal{B}_{g_{\theta'}(f_\theta(\tilde{X}))}(X)$.
5 Experiments
We performed experiments with the proposed algorithm on the same benchmark
of classification problems used in (Larochelle et al., 2007) (all the datasets for
these problems were taken from http://www.iro.umontreal.ca/lisa/icml2007). It
contains different variations of the MNIST digit classification problem, with added
factors of variation such as rotation (rot), addition of a background composed of
random pixels (bg-rand) or made from patches extracted from a set of images
(bg-img), or combinations of these factors (rot-bg-img). These variations render
the problems particularly challenging for current generic learning algorithms. Each
problem is divided into a training, validation, and test set (10000, 2000, 50000
examples respectively). A subset of the original MNIST problem is also included
with the same example set sizes (problem basic). The benchmark also contains
additional binary classification problems: discriminating between convex and
non-convex shapes (convex), and between wide and long rectangles (rect, rect-img).

Table 1: Comparison of stacked denoising autoencoders (SdA-3) with other models.
Test error rate on all considered classification problems is reported together with
a 95% confidence interval. Best performer is in bold, as well as those for which
confidence intervals overlap. SdA-3 appears to achieve performance superior or
equivalent to the best other model on all problems except bg-rand. For SdA-3,
we also indicate the fraction ν of destroyed input components, as chosen by
proper model selection. Note that SAA-3 is equivalent to SdA-3 with ν = 0%.

Dataset      SVMrbf        SVMpoly       DBN-1         SAA-3         DBN-3         SdA-3 (ν)
basic        3.03±0.15     3.69±0.17     3.94±0.17     3.46±0.16     3.11±0.15     2.80±0.14 (10%)
rot          11.11±0.28    15.42±0.32    14.69±0.31    10.30±0.27    10.30±0.27    10.29±0.27 (10%)
bg-rand      14.58±0.31    16.62±0.33    9.80±0.26     11.28±0.28    6.73±0.22     10.38±0.27 (40%)
bg-img       22.61±0.37    24.01±0.37    16.15±0.32    23.00±0.37    16.31±0.32    16.68±0.33 (25%)
rot-bg-img   55.18±0.44    56.41±0.43    52.21±0.44    51.93±0.44    47.39±0.44    44.49±0.44 (25%)
rect         2.15±0.13     2.15±0.13     4.71±0.19     2.41±0.13     2.60±0.14     1.99±0.12 (10%)
rect-img     24.04±0.37    24.05±0.37    23.69±0.37    24.05±0.37    22.50±0.37    21.59±0.36 (25%)
convex       19.13±0.34    19.82±0.35    19.92±0.35    18.41±0.34    18.63±0.34    19.06±0.34 (10%)
Neural networks with 3 hidden layers initialized by stacking denoising au-
toencoders (SdA-3), and fine tuned on the classification tasks, were evaluated
on all the problems in this benchmark. Model selection was conducted following
a procedure similar to that of Larochelle et al. (2007). Several values of the
hyper-parameters (destruction fraction ν, layer sizes, number of unsupervised training
epochs) were tried, combined with early stopping in the fine tuning phase. For
each task, the best model was selected based on its classification performance
on the validation set.
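The following sketch conveys the flavour of this model-selection loop; all names, candidate values, and the placeholder evaluation function are our own assumptions, not the authors' protocol or settings.

```python
from itertools import product
import random

random.seed(0)

candidate_nu     = [0.0, 0.10, 0.25, 0.40]   # fraction of destroyed input components
candidate_layers = [1000, 2000]              # hidden layer size (illustrative)
candidate_epochs = [10, 20]                  # unsupervised training epochs per layer

def validation_error(nu, layer_size, epochs):
    # Placeholder standing in for: pre-train an SdA-3 with these settings,
    # fine-tune with early stopping, and measure error on the validation set.
    return random.random()

best_config = min(product(candidate_nu, candidate_layers, candidate_epochs),
                  key=lambda cfg: validation_error(*cfg))
```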
Table 1 reports the resulting classification error on the test set for the new
model (SdA-3), together with the performance reported in Larochelle et al.
(2007) for SVMs with Gaussian and polynomial kernels, deep belief networks with
1 and 3 hidden layers (DBN-1 and DBN-3), and a 3 hidden layer deep network
initialized by stacking basic autoencoders (SAA-3). (Note that rot and rot-bg-img,
as reported on the website from which they are available, have been regenerated
since Larochelle et al. (2007), to fix a problem in the initial data generation
process; we used the updated data and corresponding benchmark results given on
this website.) Note that SAA-3 is equivalent to a SdA-3 with ν = 0% destruction.
As can be seen in the table, the corruption+denoising training works remarkably
well as an initialization step, and in most cases yields significantly better
classification performance than basic autoencoder stacking with no noise. On all
but one task the SdA-3 algorithm performs on par with or better than the best
other algorithms, including deep belief nets. This suggests that our proposed
procedure was indeed able to produce more useful feature detectors.

Figure 3: Filters obtained after training the first denoising autoencoder.
(a-c) show some of the filters obtained after training a denoising autoencoder
on MNIST samples, with increasing destruction levels ν. The filters at the same
position in the three images are related only by the fact that the autoencoders
were started from the same random initialization point.
(d) and (e) zoom in on the filters obtained for two of the neurons, again for
increasing destruction levels.
As can be seen, with no noise, many filters remain similarly uninteresting
(undistinctive, almost uniform grey patches). As we increase the noise level,
denoising training forces the filters to differentiate more, and capture more
distinctive features. Higher noise levels tend to induce less local filters, as
expected. One can distinguish different kinds of filters, from local blob detectors,
to stroke detectors, and some full character detectors at the higher noise levels.
[Panels: (a) No destroyed inputs; (b) 25% destruction; (c) 50% destruction;
(d) Neuron A (0%, 10%, 20%, 50% destruction); (e) Neuron B (0%, 10%, 20%, 50% destruction).]
Next, we wanted to understand qualitatively the effect of the corruption+denoising
training. To this end we display the filters obtained after initial training of the
first denoising autoencoder on MNIST digits. Figure 3 shows a few of these
filters as little image patches, for different noise levels. Each patch corresponds
to a row of the learnt weight matrix W, i.e. the incoming weights of one of the
hidden layer neurons. The beneficial effect of the denoising training can clearly
be seen. Without the denoising procedure, many filters appear to have learnt
no interesting feature. They look like the filters obtained after random initial-
ization. But when increasing the level of destructive corruption, an increasing
number of filters resemble sensible feature detectors. As we move to higher
noise levels, we observe a phenomenon that we expected: filters become less
local, they appear sensitive to larger structures spread out across more input
dimensions.
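Such filter images can be produced by reshaping each row of W into an image patch; a small matplotlib sketch of this visualization (our own illustration, assuming 28x28 MNIST-sized inputs) is given below.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_filters(W, img_shape=(28, 28), n_rows=4, n_cols=8):
    """Display the first n_rows * n_cols rows of the learnt weight matrix W
    (the incoming weights of individual hidden units) as image patches."""
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    for k, ax in enumerate(axes.ravel()):
        ax.imshow(W[k].reshape(img_shape), cmap="gray")
        ax.axis("off")
    plt.tight_layout()
    plt.show()
```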
6 Conclusion and Future Work
We have introduced a very simple training principle for autoencoders, based on
the objective of undoing a corruption process. This is motivated by the goal of
learning representations of the input that are robust to small irrelevant changes
in input. The algorithm can also be motivated from a manifold learning perspective,
an information theoretic perspective, and a generative model perspective.
This principle can be used to train and stack autoencoders to initialize a
deep neural network. A series of image classification experiments were per-
formed to evaluate this new training principle. The empirical results support
the following conclusions: unsupervised initialization of layers with an explicit
denoising criterion helps to capture interesting structure in the input distribu-
tion. This in turn leads to intermediate representations much better suited for
subsequent learning tasks such as supervised classification. The experimental
results with Deep Belief Networks (whose layers are initialized as RBMs) sug-
gest that RBMs may also encapsulate a form of robustness in the representations
they learn, possibly because of their stochastic nature, which introduces noise
in the representation during training. Future work inspired by this observation
should investigate other types of corruption processes, applied not only to the
input but also to the representation itself.
References
Bengio, Y. (2007). Learning deep architectures for AI (Technical Report 1312).
Université de Montréal, dept. IRO.
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-
wise training of deep networks. Advances in Neural Information Processing
Systems 19 (pp. 153–160). MIT Press.
Bengio, Y., & Le Cun, Y. (2007). Scaling learning algorithms towards AI. In
L. Bottou, O. Chapelle, D. DeCoste and J. Weston (Eds.), Large scale kernel
machines. MIT Press.
Doi, E., Balcan, D. C., & Lewicki, M. S. (2006). A theoretical analysis of
robust coding over noisy overcomplete channels. In Y. Weiss, B. Schölkopf
and J. Platt (Eds.), Advances in neural information processing systems 18,
307–314. Cambridge, MA: MIT Press.
Doi, E., & Lewicki, M. S. (2007). A theory of retinal population coding. NIPS
(pp. 353–360). MIT Press.
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant
representations over learned dictionaries. IEEE Transactions on Image Processing,
15, 3736–3745.
Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulie, F. (1987). Mémoires
associatives distribuées. Proceedings of COGNITIVA 87. Paris, La Villette.
Hammond, D., & Simoncelli, E. (2007). A machine learning framework for adap-
tive combination of signal denoising methods. 2007 International Conference
on Image Processing (pp. VI: 29–32).
Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence,
40, 185–234.
Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality of data
with neural networks. Science, 313, 504–507.
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for
deep belief nets. Neural Computation, 18, 1527–1554.
Hopfield, J. (1982). Neural networks and physical systems with emergent collec-
tive computational abilities. Proceedings of the National Academy of Sciences,
USA, 79.
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007).
An empirical evaluation of deep architectures on problems with many factors
of variation. Twenty-fourth International Conference on Machine Learning
(ICML’2007).
LeCun, Y. (1987). Modèles connexionistes de l'apprentissage. Doctoral dissertation,
Université de Paris VI.
Lee, H., Ekanadham, C., & Ng, A. (2008). Sparse deep belief net model for visual
area V2. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in
neural information processing systems 20. Cambridge, MA: MIT Press.
McClelland, J., Rumelhart, D., & the PDP Research Group (1986). Parallel
distributed processing: Explorations in the microstructure of cognition, vol. 2.
Cambridge: MIT Press.
Memisevic, R. (2007). Non-linear latent factor models for revealing structure
in high-dimensional data. Doctoral dissertation, Department of Computer
Science, University of Toronto, Toronto, Ontario, Canada.
Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2008). Sparse feature learning for
deep belief networks. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.),
Advances in neural information processing systems 20. Cambridge, MA: MIT
Press.
Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning
of sparse representations with an energy-based model. Advances in Neural
Information Processing Systems (NIPS 2006). MIT Press.
Roth, S., & Black, M. (2005). Fields of experts: a framework for learning image
priors. IEEE Conference on Computer Vision and Pattern Recognition (pp.
860–867).
Utgoff, P., & Stracuzzi, D. (2002). Many-layered learning. Neural Computation,
14, 2497–2539.