An approximate inference with Gaussian process to latent functions from
uncertain data
Patrick Dallaire*, Camille Besse, Brahim Chaib-draa
DAMAS Laboratory, Computer Science and Software Engineering Department, Laval University, Canada
a r t i c l e i n f o
Available online 19 February 2011
Keywords:
Supervised learning
Gaussian processes
Data uncertainty
Dynamical systems
a b s t r a c t
Most formulations of supervised learning are based on the assumption that only the output data
are uncertain. However, this assumption might be too strong for some learning tasks. This paper
investigates the use of Gaussian processes to infer latent functions from a set of uncertain input–output
examples. By assuming Gaussian distributions with known variances over the inputs and outputs, and
using the expectation of the covariance function, it is possible to analytically compute the expected
covariance matrix of the data to obtain a posterior distribution over functions. The method is evaluated
on a synthetic problem and on a more realistic one, which consists in learning the dynamics of a cart–
pole balancing task. The results indicate an improvement of the mean squared error and the likelihood
of the posterior Gaussian process when the data uncertainty is significant.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
Supervised learning refers to the class of problems where a
learner has to infer a function f : X → Y given a set of labelled
examples. A typical underlying assumption concerning the train-
ing examples is that the inputs are known precisely and that only
the output variables are affected by noise. However, this assump-
tion can be unrealistic in many applications. There might be
uncontrollable effects that corrupt the inputs in the data gather-
ing process, which can affect the resulting quality of the esti-
mated model if this fact is ignored in the learning formulation.
The problem of learning in the context of noisy input and
output has been tackled in several ways. In computational
mathematics and engineering, one generally uses a method
known as total least-squares which consists in finding a minimal
correction to apply on all data points such that the modified data
satisfies a linear relation [1]. For their part, statisticians have
investigated the error-in-variables models to deal with errors on
the dependent variables as well as on the independent ones [2].
The general error-in-variables approach aims to create latent
variables that correspond to the true observations which follow
the underlying sought relation. There is a close link between the
error-in-variables models and the total least-squares methods.
The statistical model which corresponds to the basic total least-
squares approach is the error-in-variables model with the
restrictive condition that the measurement errors are zero mean,
independent and identically distributed [3].
Researchers in machine learning have also addressed the
problem of learning from uncertain data. Tresp et al. [4] proposed
incorporating deficient data into the training of a neural network
by integrating over the uncertain input using an estimated
probability distribution for these inputs. Moreover, they showed
that the expectation of the learned function will be biased if the
inputs are altered with Gaussian noise along with increased error
bars on predictions. Wright et al. [5] presented a framework for
Bayesian neural networks where they infer the regression over
the noiseless input by using Markov chain Monte Carlo sampling
over the hidden variables. More recently, Ting et al. [6] proposed a
Bayesian linear regression model including a precision parameter
which enforces interdependency between the input and output
noise models. Their algorithm then attempts to clean the uncer-
tain data by using the expectation-maximization principle.
Kernel methods have also been used to learn from uncertain
inputs. For instance, in classification tasks, Bi and Zhang [7]
proposed a statistical model that extends Support Vector Classi-
fication methods in order to deal with uncertain inputs. For this
purpose, they considered that each unobserved input is asso-
ciated to exactly one component of a Gaussian mixture model.
Then, by estimating the parameters of their Gaussian noise model,
they were able to take input uncertainty into consideration.
The virtues of Gaussian processes to learn from uncertain
inputs have also been investigated by some researchers. First,
Girard and Murray-Smith [8] showed that the covariance of the
data can be approximated using a new correlated process. This
process uses a second order Taylor expansion to obtain a
doi:10.1016/j.neucom.2010.09.024
*Corresponding author.
E-mail addresses: dallaire@damas.ift.ulaval.ca (P. Dallaire),
besse@damas.ift.ulaval.ca (C. Besse), chaib@damas.ift.ulaval.ca (B. Chaib-draa).
Neurocomputing 74 (2011) 1945–1955
corrected covariance function accounting for input noise. Quiñonero-Candela and Roweis [9] stated that computing the marginal
likelihood analytically is generally not possible. However, opti-
mizing a lower bound of this quantity can lead to denoised inputs
while learning the model. On the other hand, instead of approx-
imating the covariance matrix, both previous authors mentioned
that taking the expectation could lead to better estimation of the
marginal likelihood [9,10].
In the same vein, this paper investigates an approach in which
we not only make predictions for uncertain inputs with Gaussian
processes, but also learn from uncertain training sets. To do so,
we marginalize out the inputs’ uncertainty to obtain the expected
covariance matrix and keep an analytical posterior distribution
over functions. The main contribution of this paper is a formula-
tion which allows learning from uncertain inputs and outputs by
using Gaussian processes exactly as in the noise-free case. Results
show that taking into account the uncertainty with this method
improves the mean squared error and the likelihood of the sought
function. As the noise decreases, the method tends towards its
noise-free counterpart and therefore cannot do worse than
standard Gaussian process regression. To show that, the method
is applied to the well-known task of balancing a pole over a cart,
where the problem is to learn the four-dimensional (+ control)
nonlinear dynamics of the system.
This paper is structured as follows. First, we formalize the
problem of learning with Gaussian processes and the regression
model. Then, in Section 3, we present insights to account for
uncertain data. In Section 4, we present the experimental results
on a difficult synthetic problem and on a more realistic
problem. Section 5 discusses the results before Section 7 opens
research avenues and concludes the paper.
2. Gaussian processes
Gaussian Processes (GPs) are stochastic processes which are
used in machine learning to describe distributions directly in
the space of functions. In supervised learning, we have to make
the mandatory assumption that training examples are informative
for further predictions. This information can be formalized in
many ways. Some methods use correlation between the examples
to express this information. When using Gaussian processes, it is
assumed that the joint distribution of the data is a multivariate
Gaussian. Consequently, the problem is to find a covariance
function which explains the data properly. One interesting prop-
erty of GPs is that they provide a probabilistic approach to the
learning task and give uncertainty estimates while making
predictions.
2.1. Gaussian processes for regression
As mentioned above, it is assumed that the joint distribution of
a finite set of output observations given their inputs is multi-
variate Gaussian. Thus, a GP is fully specified by its mean function
m(x) and covariance function C(x, x'). First, we assume that a set of
training data D = {x_i, y_i}_{i=1}^N is available, where x_i ∈ R^D and y_i is a scalar
observation such that

y_i = f(x_i) + ε_i   (1)

where ε_i is white Gaussian noise with variance σ_ε². For
convenience, we use the notation X = [x_1, ..., x_N] for inputs and
y = [y_1, ..., y_N] for outputs. According to the definition of a GP, the
joint distribution of the training set is N(m, K + σ_ε² I), where the
vector m and the matrix K are computed with the mean and
covariance functions. The components of m are computed such
that m_i = m(x_i). However, for the sake of notational simplicity, we
make the weak assumption that the mean function is m(x) = 0.¹
Then, the joint distribution of the training set becomes

y | X ~ N(0, K + σ_ε² I)   (2)

where K is the covariance matrix whose entries K_ij are given by
the covariance function C(x_i, x_j). This equation is fundamental,
since it corresponds to the Gaussian process prior assumption,
which is used hereafter to make predictions.
There is a variety of covariance functions, each of them making
different forms of functions more probable than others for a
Gaussian process. Selecting the covariance function is an impor-
tant aspect of the learning task. In particular, using Gaussian
processes with certain covariance functions corresponds to using
a neural network having an infinite number of sigmoidal or
Gaussian hidden units [11]. This choice has to be made a priori
and reflects the prior knowledge concerning the function to
estimate directly in the space of functions, the so called Gaussian
process prior. The multivariate Gaussian prior (2) defined over the
training observations can be used to compute the posterior
distribution over functions. Since this posterior distribution is
obtained by computing the conditional distribution of a Gaussian,
the resulting posterior remains a Gaussian process.
Accordingly, making predictions can be done by using the
posterior Gaussian process’ mean and its associated measure of
uncertainty, given by the posterior covariance. For a single test
input x*, the predictive distribution of the noise-free output f* is
obtained by first forming the joint distribution with (2)

[y, f*]ᵀ ~ N(0, [[K + σ_ε² I, k*], [k*ᵀ, k**]])   (3)

where k* is the N × 1 vector of cross-covariances C(x*, x_i) between
the test input x* and the training inputs X, and k** is the prior
variance given by C(x*, x*). By computing the conditional
distribution of (3) such that

f* | x*, X, y ~ N(m(x*), σ²(x*))

we obtain the Gaussian predictive distribution for any single
input x*, with mean and variance given by

m(x*) = k*ᵀ (K + σ_ε² I)⁻¹ y   (4)

σ²(x*) = k** − k*ᵀ (K + σ_ε² I)⁻¹ k*   (5)

It should be pointed out that Eqs. (4) and (5) produce an
estimation of the latent function f and not its noisy version y.
The interested reader is invited to refer to [12] for more details on
different formulations of Gaussian processes.
2.2. Gaussian process prior
Although many covariance functions can be used to define the
Gaussian process prior in (2), this paper mainly focuses on
the squared exponential (SE) covariance function

C_se(x_i, x_j) = σ_f² exp(−½ (x_i − x_j)ᵀ W⁻¹ (x_i − x_j))   (6)

which is one of the most widely used kernel functions. In
particular, using this covariance function is equivalent to a linear
combination of an infinite number of Gaussian basis functions [13].
Since the outputs might be corrupted by noise, we add the
covariance function of an independent noise process:

C_n(x_i, x_j) = σ_ε² δ_ij   (7)
¹ It is common to find the bias of the training set and to subtract it from the
data, making the zero-mean prior the ideal choice.
where δ_ij is the Kronecker delta. The result is a covariance function
parameterized by the vector of hyperparameters

θ = [W, σ_f², σ_ε²]   (8)

where W is the diagonal matrix of characteristic length-scales,
which accounts for a different covariance measure for each input
dimension, σ_f² is the signal variance influencing the order of
magnitude of the functions, and σ_ε² represents the variance of the
noise process.
Varying these hyperparameters affects the interpretation of
the training data by changing the shape of the functions the GP
prior expects. It might be difficult to fix the hyperparameters of a
covariance function a priori and expect them to fit the
observed data correctly. A common way to estimate the hyperparameters
is to maximize the log marginal likelihood of the
observations [12]. In that case, it corresponds to taking the logarithm
of (2) and maximizing this quantity with respect to the hyperparameters θ:

log p(y | X, θ) = −½ yᵀ(K + σ_ε² I)⁻¹ y − ½ log|K + σ_ε² I| − (N/2) log 2π   (9)

since the joint distribution of the observations is a multivariate
Gaussian under a zero-mean prior. The optimization problem can
be tackled using any gradient method to find an acceptable local
maximum, or the global maximum if possible.
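The log marginal likelihood of Eq. (9) is straightforward to evaluate once the covariance matrix is formed. The sketch below uses a Cholesky factorization for numerical stability, a standard implementation choice that the paper does not prescribe.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    """Log marginal likelihood of Eq. (9) under a zero-mean GP prior."""
    N = len(y)
    Ky = K + noise_var * np.eye(N)
    L = np.linalg.cholesky(Ky)                    # Ky = L L^T, assumed positive definite
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                   # -1/2 y^T (K + s^2 I)^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))      # -1/2 log|K + s^2 I|
    return data_fit + complexity - 0.5 * N * np.log(2 * np.pi)
```

Gradient-based hyperparameter optimization would wrap this function, differentiating it with respect to θ.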
3. Learning with uncertain data
In this paper, we consider the difficult problem of learning
from uncertain examples. The ideal Bayesian solution to this task
would be to make predictions by integrating over the uncertainty
of the training set such that

p(f* | x*) = ∫∫ p(f* | x*, X, y) p(X) p(y) dX dy   (10)

but these integrals are analytically intractable in general. Thus,
the following sections present an approach to approximate this
predictive distribution, which deals with Gaussian input and
output distributions.
3.1. Learning with uncertain inputs
As stated in the Introduction, the assumption that
only the outputs are noisy does not hold for some learning
tasks. Consider the case where the inputs are uncertain and
where each input distribution is known. In this case, using
Gaussian process regression as previously formulated will not
handle the uncertainty explicitly. Nevertheless, one can apply the
standard method naively by using only one representative per
input, such as the mean, and ignoring other information about its
distribution.
For Gaussian processes, accounting for uncertain examples
implies computing uncertain covariance between these examples.
In fact, the covariance of any pair of examples has a distribution
that depends on the covariance function and the probability
distribution of each input. Consequently, the covariance matrix
itself has a complex distribution, which makes the computation of the
posterior over functions impossible in general. To deal with this
issue, one generally relies on an approximation of the covariance
matrix distribution. In this paper, we investigate a case where the
expected covariance matrix is analytically computable.
Consider the case where inputs are a set of Gaussian distribu-
tions and where the true input value xiis not observable, but we
have access to the parameters of its distribution. Thus, the
expected covariance between uncertain data is obtained by
computing expectations with respect to xiand xj
ZZ
Kij¼
Cðxi,xjÞpðxiÞpðxjÞ dxidxj, if iaj
ð11Þ
where pðxiÞ ¼ Nðui,RiÞ and pðxjÞ ¼ Nðuj,RjÞ. Note that this equa-
tion is used to compute the elements of the covariance matrix K
which are off diagonal. Since the diagonal corresponds to var-
iances, Eq. (11) cannot be used, because the two variables are in
fact the same random variable. Therefore, the case of expected
variance requires the computation of a single expectation:
Kii¼
Z
Cðxi,xiÞpðxiÞ dxi
ð12Þ
which is only useful for non-stationary covariance function. For
stationary covariance functions, such as the squared exponential,
this integral is simplified to s2
Before discussing the expected squared exponential, we first
derive the simpler case of the expected covariance of the linear
covariance function Clin. Basically, this covariance function can be
expressed as
fby definition.2
Clinðxi,xjÞ ¼ x>
ixjþs2
b
ð13Þ
where σ_b² = 0 makes the covariance function homogeneous and a
non-zero value makes it inhomogeneous. In practice, homogeneous
functions will force the estimated functions to pass through the
origin, and increasing σ_b² allows functions to lie farther from the origin. To
obtain the expected version C_Elin of this function, we have to
compute Eqs. (11) and (12), which results in (see Appendix A for
derivation)

C_Elin((u_i, Σ_i), (u_j, Σ_j)) = u_iᵀ u_j + σ_b² + δ_ij Tr(Σ_i)   (14)

where Tr denotes the trace of a matrix. The effect of the trace
term here is to increase the output variance of training examples
according to their input variance. Thus, the contribution of highly
uncertain examples is automatically decreased.
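Eq. (14) translates almost directly into code. In this hypothetical sketch, the flag `same` plays the role of the Kronecker delta δ_ij, i.e. it is set when computing an expected variance:

```python
import numpy as np

def expected_linear_cov(u_i, S_i, u_j, S_j, sb2=0.0, same=False):
    """Expected linear covariance of Eq. (14).

    u_i, u_j : input means; S_i, S_j : input covariance matrices;
    `same` stands for delta_ij (True when i == j)."""
    c = u_i @ u_j + sb2
    if same:
        c += np.trace(S_i)  # input variance inflates the expected variance
    return c
```

The trace term only appears on the diagonal, which is exactly the mechanism by which highly uncertain examples are down-weighted.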
where Tr denotes the trace of a matrix. The effect of the trace
term here is to increase the output variance of training examples
according to their input variance. Thus, the contribution of highly
uncertain examples is automatically decreased.
It is also possible to compute the expectation for polynomial
covariance functions. However, as the degree increases, so does the
complexity of the resulting expected covariance function. For
instance, in the case of the quadratic covariance function

C_qua(x_i, x_j) = (x_iᵀ x_j + σ_b²)²   (15)

the computation of the expectations involves expectations over
quadratic forms of Gaussian variables, which yields the
following expected quadratic covariance function (see Appendix
B for derivation)

C_Equa((u_i, Σ_i), (u_j, Σ_j)) = Tr([Σ_i + u_i u_iᵀ][Σ_j + u_j u_jᵀ])(1 + δ_ij) + 2σ_b² u_iᵀ u_j + σ_b⁴
                                + δ_ij [Tr(Σ_i + u_i u_iᵀ)² + 2σ_b² Tr(Σ_i) − 2(u_iᵀ u_i)²]   (16)

where the last bracketed term is used only to compute expected variances.
Using a linear or quadratic covariance function is an arguably
strong hypothesis on the space of functions. Therefore, one could
look for a more general function.
It has been shown by Girard [10] that, for normally distributed
inputs and using the squared exponential covariance function,
integrating over independent input distributions is analytically
feasible since it involves integrations over products of Gaussians.
However, to obtain the correct expected squared exponential
covariance function, we had to incorporate the variance case into
the function. The resulting expected covariance C_Ese is computed
² For a stationary covariance function, C(x, x) = C(0, 0) is constant.
exactly with (see Appendix C for derivation)

C_Ese((u_i, Σ_i), (u_j, Σ_j)) = σ_f² exp(−½ (u_i − u_j)ᵀ (W + Σ_i + Σ_j)⁻¹ (u_i − u_j))
                               / |I + W⁻¹(Σ_i + Σ_j)(1 − δ_ij)|^{1/2}   (17)

which is again a squared exponential form, where the covariance
decreases as the uncertainty over the inputs increases, but farther
examples now become correlated, thus making the resulting
process smoother when examples are more uncertain. In
Eq. (17), we introduced the inverse Kronecker delta (1 − δ_ij) in
order to eliminate the reweighting effect of the denominator when
computing a variance.
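A sketch of Eq. (17), with the `same` flag standing in for the Kronecker delta δ_ij; the helper name and argument layout are our own. As a sanity check, evaluating it with the settings of Fig. 1 (x_i ~ N(0, 1), x_j ~ N(1, 1), all parameters set to one) gives exp(−1/6)/√3 ≈ 0.49, consistent with the expectation of roughly 0.5 reported there.

```python
import numpy as np

def expected_se_cov(u_i, S_i, u_j, S_j, W, sf2=1.0, same=False):
    """Expected squared exponential covariance of Eq. (17).

    `same` stands for the Kronecker delta delta_ij (True when i == j)."""
    d = u_i - u_j
    M = W + S_i + S_j
    num = sf2 * np.exp(-0.5 * d @ np.linalg.solve(M, d))
    if same:
        # (1 - delta_ij) removes the denominator when computing a variance
        return num
    D = np.eye(len(u_i)) + np.linalg.solve(W, S_i + S_j)
    return num / np.sqrt(np.linalg.det(D))
```

With Σ_i = Σ_j = 0 the expression reduces to the standard squared exponential (6), which is the noise-free limit mentioned in the Introduction.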
While computing the expected covariance matrix allows
learning with Gaussian processes exactly as in the noise-free
input case, this matrix can also be unlikely according to the full
distribution over covariance matrices. For instance, Fig. 1 shows
the case where the expected covariance between two training
examples is exactly the worst choice. In that particular case, a
better estimate would be to either say that the examples
are covarying or not covarying at all. This setup is actually
the worst possible, but it clearly indicates that the expected
Gaussian process might not explain the training data properly in
some cases.
A solution to this problem would be to clean the data by
finding the maximum a posteriori or to compute a complete
posterior Gaussian distribution on inputs (this avenue is dis-
cussed in Section 6), both modifying the covariance of the
examples. However,among all
p(C(xi,xj)), only a few training examples suffer from this problem
and the underlying function can still be inferred adequately for
sufficiently large data sets.
possible densityfunctions
3.2. Learning with uncertain outputs
So far, only the difficulty concerning uncertain inputs has been
covered. Dealing with uncertain outputs having a Gaussian
distribution is more manageable. Generally, the noise on the
output is assumed stationary and has to be estimated from
data. By observing each noisy output along with its variance,
one can define a non-stationary noise process only known for
the training examples. This noise process is introduced to reflect
the outputs’ uncertainty and decrease the influence of each
example according to its respective variance. Formally, it entails
a slight modification of the prior distribution y | X ~ N(0, K + σ_ε² I),
which consists in adding a diagonal covariance matrix containing
the variances σ_{y_i}².
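In code, this modification amounts to a one-line change to the prior covariance; the function below is a hypothetical helper illustrating the diagonal correction.

```python
import numpy as np

def prior_covariance(K, output_vars, noise_var=0.0):
    """Prior covariance with per-example output noise:
    K + diag(sigma_{y_i}^2) + sigma_e^2 I."""
    return K + np.diag(output_vars) + noise_var * np.eye(len(output_vars))
```

Each example's influence is thus reduced in proportion to its own output variance, a non-stationary noise process known only at the training points.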
3.3. Learning the covariance function hyperparameters
Theoretically, learning with uncertain data is as difficult as
in the noise-free case when using the covariance function
(17), although it might require more data. The posterior dis-
tribution over functions is found by using Eqs. (4) and (5) with
the expected covariance matrix. The hyperparameters of the
covariance function can be learned with the log marginal likelihood
of the expected process, but this likelihood is often
riddled with many local maxima. Using standard conjugate
gradient methods will often lead to a local maximum that might
not explain the data properly. In the case of a squared exponential
covariance function, the matrix W would tend to have large
values on its diagonal, meaning that most dimensions are either
very smooth or even irrelevant, and the value of σ_ε² would be
overestimated to transpose the input error into the output
dimensions.
A solution to this difficulty is to find a maximum a posteriori
(MAP) estimate of the hyperparameters instead of a maximum
likelihood estimate. Specifying a prior on the hyperparameters will thus act
as a regularization term preventing improper local maxima. As
stated before, it is easy for Gaussian processes to explain everything
as noise. To avoid such an inappropriate explanation, large
values of noise and characteristic length-scales can be penalized.
However, a bad local maximum can be an indication that there is no
apparent structure in the data or not enough data to find
correlations.
4. Experiments
The following experiments compare the performance of a Gaussian
process accounting for data uncertainty through the computation
of the expected covariance matrix (uncertain-GP) and a
Gaussian process which only uses the mean of the inputs (classic-
GP). Both Gaussian processes are based on the squared exponen-
tial covariance function, where the classic-GP has only access
to (6) and the uncertain-GP has access to its expected version (17).
The purpose of these experiments is to determine whether using
the expected covariance matrix leads to any improvement over
the naive method that only uses the inputs’ mean.
Indeed, the standard Gaussian process does not usually deal
with uncertainty over the inputs. The expected behaviour of this
method would be to explain Gaussian input uncertainty as extra
Gaussian noise over the outputs. While this extra noise is in
general non-Gaussian, it can still be approximated as Gaussian
noise.
First, we evaluate the behaviour of each method on a one-
dimensional synthetic problem and then compare their perfor-
mances on a harder problem which consists in learning the
nonlinear dynamics of a cart–pole system.
4.1. Synthetic problem: sincsig
To visualize easily the behaviour of both learning methods, we
have chosen a one-dimensional task where the problem consists
in learning a function composed by the unnormalized sinc
Fig. 1. Probability density function of the covariance between examples at
x_i ~ N(0, 1) and x_j ~ N(1, 1) under a squared exponential covariance with all parameters set
to one. The expectation given by C_Ese for these examples is 0.5.
function and a biased sigmoid function such that

f(x) = sin(x)/x                           if x ≥ 0
f(x) = 0.5/(1 + exp(−10x − 5)) + 0.5      otherwise     (18)

which we refer to as the sincsig function hereafter.
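A direct NumPy transcription of Eq. (18), with the removable singularity at x = 0 of the unnormalized sinc handled explicitly (sin(x)/x → 1):

```python
import numpy as np

def sincsig(x):
    """The sincsig test function of Eq. (18)."""
    x = np.asarray(x, dtype=float)
    # unnormalized sinc, with sin(x)/x -> 1 at x = 0
    sinc = np.where(x == 0.0, 1.0, np.sin(x) / np.where(x == 0.0, 1.0, x))
    # biased sigmoid used on the negative side
    sig = 0.5 / (1.0 + np.exp(-10.0 * x - 5.0)) + 0.5
    return np.where(x >= 0.0, sinc, sig)
```

Note that the sigmoid branch evaluates to roughly 1 at x = 0⁻, so the two pieces join almost continuously.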
All evaluations have been done on randomly drawn training
sets. Each data set is constructed by initially sampling
N points uniformly in [−10, 10]. These points are the noise-free inputs
{x_i}_{i=1}^N. To obtain the associated noisy outputs, each of them is
sampled according to y_i ~ N(sincsig(x_i), σ_y²). The set of uncertain
inputs is constructed by first sampling the noise σ_{x_i}² to be applied
on each input. Then, the sampled noise is used to corrupt the
noise-free input x_i according to u_i ~ N(x_i, σ_{x_i}²). It is easy to see that
x_i | u_i, σ_{x_i}² ~ N(u_i, σ_{x_i}²). Finally, the complete training set is defined as
D = {(u_i, σ_{x_i}²), y_i}_{i=1}^N.
Fig. 2 shows a typical example of a training data set (crosses),
with the true function to infer (solid line) and the resulting
regression (thin line) for the uncertain-GP (top) and the classic-
GP (bottom). Error bars around the thin lines indicate the
confidence of the Gaussian process about its prediction and
correspond to two standard deviations. Note here that whenever
the solid line goes far from the error bars, the true underlying
function becomes unlikely according to the prediction, which can
be considered inconsistent learning.
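The sampling procedure above can be sketched as follows; the function signature and seed handling are our own conventions, and the target function f is passed in rather than fixed to sincsig.

```python
import numpy as np

def make_uncertain_dataset(f, n, sigma_y=0.01, noise_bound=2.5, seed=0):
    """Sample n uncertain examples of f as described in Section 4.1.

    Returns observed input means u_i, input variances sigma_{x_i}^2,
    and noisy outputs y_i."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10.0, 10.0, n)         # noise-free inputs x_i
    y = rng.normal(f(x), sigma_y)           # y_i ~ N(f(x_i), sigma_y^2)
    sx = rng.uniform(0.0, noise_bound, n)   # input noise std dev sigma_{x_i}
    u = rng.normal(x, sx)                   # observed means u_i ~ N(x_i, sigma_{x_i}^2)
    return u, sx ** 2, y
```

The learner then sees only (u_i, σ_{x_i}², y_i), never the true x_i.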
The first experiment was designed to evaluate the learning
performance on uncertain inputs only. Therefore, we conducted
it with a small output noise of standard deviation σ_y = 0.01
and training sets of different sizes. The input noise standard deviations
σ_{x_i} were sampled uniformly in [0, 2.5], which causes the
data set to contain some very good examples on average. The
standard deviations have been chosen high enough so that the
noise over the inputs can be explained by adding an independent
output noise during the optimization procedure, and to measure
the impact of highly uncertain examples on the posterior Gaussian
process.
All performance evaluations of the classic-GP and the uncertain-GP
have been done by training both with the same random
data sets. For comparison purpose, we also trained a standard
multilayer perceptron (MLP) having two hidden layers of five
neurons. The weights of the network were found with the
Levenberg–Marquardt algorithm [14]. Fig. 3(a) shows the aver-
aged mean squared error over 25 randomly chosen training sets
for various sizes. Results show that when very few data are
available, both processes tend to explain the data with higher
noise variance σ_ε² over the outputs. As expected, when the size of
the data set increases, the classic-GP gives greater importance to
the explanation that observed outputs are mostly noise. On the
other hand, the uncertain-GP discriminates the most uncertain
data and privileges the ones having less uncertainty in order to
infer the function from the right data.
Fig. 2. Typical learning of the sincsig function with uncertain-GP (top) and classic-
GP (bottom). Error bars represent two standard deviations. The crosses are the
means of uncertain inputs and outputs.
Fig. 3. Mean squared error results on the sincsig problem. Squared red line, uncertain-GP; circled blue line, classic-GP; triangled green line, MLP. (a) Output uncertainty is
constant with σ_y = 0.1 and (b) output uncertainty is random with σ_{y_i} ~ U(0.2, 0.5). (For interpretation of the references to colour in this figure legend, the reader is referred
to the web version of this article.)
Fig. 4. The cart–pole balancing problem.
In the second experiment, we assumed that the Gaussian
processes now know the noise variance on the observations, in order to
evaluate the performance on (completely) uncertain data sets.
Therefore, the noise hyperparameter σ_ε² is set to zero since the
processes know exactly the noise matrix to add when computing
the covariance matrix. For each output, the standard deviation σ_{y_i} is
then uniformly sampled in [0, 0.5] and the complete training set is
defined as D = {(u_i, σ_{x_i}²), (z_i, σ_{y_i}²)}_{i=1}^N. Fig. 3(b) shows the performance
of the MLP, the classic-GP and the uncertain-GP on such training
sets. Removing the independent noise process (7) from the prior
has two effects: first, it prevents the classic-GP from explaining
uncertain inputs by increasing the noise (hyperparameter) over the
outputs, and second, it forces the uncertain-GP to explain the data with
the available information on the input variance.
The third experiment is set to validate the improvements
brought by the expected covariance method. We thus let the
optimization procedure choose by itself the right kernel to use
according to the uncertainty over the inputs. Consequently, we
constructed two combinations of covariance functions. The first
one (classic-GP) is based on the standard squared exponential
covariance function (6) added to the known non-stationary noise
process representing the outputs' uncertainty and an independent
noise process, which gives

C(x_i, x_j) = C_se(x_i, x_j) + δ_ij σ_{y_i}² + δ_ij σ_ε²   (19)

where the first two terms are expected to explain the data and the
last one to compensate for the error due to the inputs' noise.
The second one (uncertain-GP) is based on the expected
squared exponential covariance function (17) added to the known
non-stationary noise process representing the outputs' uncertainty
and an independent noise process, which is given by

C((u_i, σ_{x_i}²), (u_j, σ_{x_j}²)) = C_Ese((u_i, σ_{x_i}²), (u_j, σ_{x_j}²)) + C_se(u_i, u_j) + δ_ij σ_{y_i}² + δ_ij σ_ε²   (20)

where we expect the optimization process to balance the effect of
each covariance function.
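For the classic-GP, Eq. (19) assembles into a covariance matrix as sketched below for one-dimensional inputs; the function name and vectorized layout are illustrative, not the paper's implementation.

```python
import numpy as np

def classic_gp_cov(U, out_vars, sf2, w, se2):
    """Covariance matrix of Eq. (19): SE kernel on the input means plus the
    known per-example output noise and an independent noise process."""
    D = (U[:, None] - U[None, :]) ** 2        # pairwise squared distances
    K = sf2 * np.exp(-0.5 * D / w)            # C_se on the input means, Eq. (6)
    return K + np.diag(out_vars) + se2 * np.eye(len(U))
```

The uncertain-GP variant of Eq. (20) would replace the SE term with the expected SE covariance (17) and add a second, standard SE kernel on the means.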
The experiments were conducted on various levels of noise
and training set sizes. For these experiments, the uncertainty over
the data is upper bounded by a standard deviation v. Like the
previous experiments, noises are sampled independently from
uniform distributions and are then applied to the data. In this
case, σ_{x_i} ~ U(0, v) and σ_{y_i} ~ U(0, v). The performance of the classic-
GP and the uncertain-GP has been measured with the mean
squared error and the log-likelihood. The results are averaged
over 10 trials for the same value of v and data set size. The
likelihood is approximated with 100 equidistant noise-free points
from the true function. The averaged mean squared error and
averaged log-likelihood results specific to this third experiment
are shown in Figs. 5 and 6 respectively.3
As stated in Section 3.3, it is possible to learn the hyperpara-
meters given a training set. Since conjugate gradient methods
Fig. 5. Mean squared error results on sincsig. The input and output uncertainties are random with σ_{x_i} ~ U(0, v) and σ_{y_i} ~ U(0, v). The variable v is the upper bound on
uncertainty. Squared red line, uncertain-GP; circled blue line, classic-GP. (a) With upper bound v = 0.1; (b) with upper bound v = 0.4; (c) with upper bound v = 1;
(d) with upper bound v = 3. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
³ Due to computational precision, likelihoods reached infinity, making
Fig. 6(a) incomplete.
performed poorly for the optimization of the log-likelihood
in the uncertain-GP cases, we preferred stochastic optimization
methods for this task. For these experiments, we did not use any
prior on the hyperparameters to avoid helping the learning
procedure by introducing additional information on the function.
4.2. The cart–pole problem
We now consider the harder problem of learning the cart–pole
dynamics, which is also known as the inverted pendulum. Fig. 4
gives a picture of the system from which we try to learn the
dynamics. The state is defined by the position (\varphi) of the cart, its velocity (\dot{\varphi}), the pole's angle (\alpha) and its angular velocity (\dot{\alpha}). There is also a control input which is used to apply lateral forces on the cart. Following the equations in Florian [15] governing the dynamics, we used Euler's method to update the system's state:

\ddot{\alpha} = \frac{g \sin\alpha + \cos\alpha \left( \frac{-F - m_p l \dot{\alpha}^2 \sin\alpha}{m_c + m_p} \right)}{l \left( \frac{4}{3} - \frac{m_p \cos^2\alpha}{m_c + m_p} \right)}

\ddot{\varphi} = \frac{F + m_p l \left( \dot{\alpha}^2 \sin\alpha - \ddot{\alpha} \cos\alpha \right)}{m_c + m_p}

where g is the gravitational acceleration, F is the force associated with the action, l is the half-length of the pole, m_p is the mass of the pole and m_c is the mass of the cart.
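To make the update concrete, the Euler integration described above can be sketched as follows. The physical constants and the time step are illustrative assumptions, since the paper does not report the values it used:

```python
import math

# Illustrative constants (assumed; the paper does not list its values)
G = 9.81       # gravitational acceleration g
M_CART = 1.0   # mass of the cart, m_c
M_POLE = 0.1   # mass of the pole, m_p
L_HALF = 0.5   # half-length of the pole, l
DT = 0.02      # Euler integration time step (assumed)

def euler_step(phi, phi_dot, alpha, alpha_dot, force):
    """Advance the cart-pole state by one Euler step using Florian's equations."""
    total_mass = M_CART + M_POLE
    sin_a, cos_a = math.sin(alpha), math.cos(alpha)
    # Angular acceleration of the pole
    alpha_ddot = (G * sin_a
                  + cos_a * ((-force - M_POLE * L_HALF * alpha_dot**2 * sin_a)
                             / total_mass)) \
                 / (L_HALF * (4.0 / 3.0 - (M_POLE * cos_a**2) / total_mass))
    # Linear acceleration of the cart
    phi_ddot = (force
                + M_POLE * L_HALF * (alpha_dot**2 * sin_a - alpha_ddot * cos_a)) \
               / total_mass
    # Euler update of the four state variables
    return (phi + DT * phi_dot,
            phi_dot + DT * phi_ddot,
            alpha + DT * alpha_dot,
            alpha_dot + DT * alpha_ddot)
```

Iterating `euler_step` while sampling forces produces the state transitions from which training examples can be drawn.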
For this problem, the training sets were sampled exactly as in
the sincsig case. State-action pairs were uniformly sampled on
their respective domains. The outputs were obtained with the
true dynamical system and then perturbed by noise with known
random variance. Since the output variances are also known, the
training set can be seen as Gaussian input distributions that map
to Gaussian output distributions. Therefore, one might use a
sequence of Gaussian belief states as its training set in order to
learn a partially observable dynamical system. Following this
idea, there is no reason for the output distributions to have a
significantly smaller variance than the input distribution.
In this experiment, the input and output noise standard deviations were uniformly sampled in [0, 2.5] for each dimension. Every output dimension was treated independently by placing a Gaussian process prior on each of them. Fig. 7 shows the averaged mean squared error over 25 randomly chosen training sets for different values of N for each dimension.
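The sampling scheme described above (uniformly drawn inputs, outputs from the true system, both perturbed by Gaussian noise with known, randomly drawn standard deviations) can be sketched as below; the function and variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_uncertain_dataset(true_fn, n, input_dim, max_std=2.5, low=-1.0, high=1.0):
    """Sample inputs uniformly, evaluate the true function, and perturb both
    inputs and outputs by Gaussian noise with known, randomly drawn std devs."""
    means_x = rng.uniform(low, high, size=(n, input_dim))
    std_x = rng.uniform(0.0, max_std, size=(n, input_dim))  # known input uncertainty
    std_y = rng.uniform(0.0, max_std, size=n)               # known output uncertainty
    noisy_x = means_x + rng.normal(size=(n, input_dim)) * std_x
    noisy_y = np.array([true_fn(x) for x in means_x]) + rng.normal(size=n) * std_y
    # Each example is thus a Gaussian input distribution mapped to a Gaussian output
    return noisy_x, std_x, noisy_y, std_y
```

Each returned row, paired with its known standard deviations, plays the role of one uncertain input–output example.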
5. Discussion
Learning from uncertain data is known to be hard. Under the
Gaussian process assumption, a data set having a prior distribu-
tion has uncertain covariances and consequently a distribution
over its covariance matrix. Considering the whole distribution on
covariance matrices to compute a posterior Gaussian process is
not feasible in general, even for the squared exponential covariance function with Gaussian input uncertainty. However, this particular case allows computing the expectation of the covariance matrix and makes the inference procedure no harder than with noise-free data. The experiments were conducted to determine whether using the expected covariance
Fig. 6. Log-likelihood results on sincsig. The input and output uncertainties are random with \sigma_{x_i} \sim U(0, v) and \sigma_{y_i} \sim U(0, v), where v is the upper bound on the uncertainty. Squared red line, uncertain-GP; circled blue line, classic-GP. (a) With upper bound v = 0.1; (b) with upper bound v = 0.4; (c) with upper bound v = 1; (d) with upper bound v = 3. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
leads to a performance improvement when dealing with input uncertainties.
The first results on the synthetic problem are presented
in Fig. 3(a) and (b). It shows that using the expected covariance
(uncertain-GP) leads to a lower mean squared error than using the squared exponential on the input means (classic-GP). Fig. 2 shows the typical behaviour of each method. We observed that the classic-GP method often learns something closer to a convolution of the sincsig function than the function itself, which is the expected behaviour. The uncertain-GP method, on the other hand, generally leads to a better estimate of the function. Besides providing smaller mean squared errors, it has the interesting property that the sincsig function lies entirely within its error bounds, meaning that most estimates are consistent with the true function, thereby increasing the likelihood.
These experiments were done on data sets containing large input uncertainty, with standard deviations up to 2.5. Since very few near-certain training examples are available to guide the inference, the learning task is very hard. For such fuzzy training sets, using the expected covariance is a clear advantage over its standard counterpart.
A second set of experiments has been conducted where this
time we measure the likelihood of the estimated function. To
reinforce the relevance of using the expectation of the covariance
matrix, the uncertain-GP method is now allowed to balance
between the standard squared exponential covariance function
and its expectation. The balancing procedure is part of the log marginal likelihood maximization. The results are shown in Figs. 5 and 6. Despite having a more expressive covariance
function, the uncertain-GP method shows little improvement.
We observed that the mean squared error often stems from learning a biased version of the function, and the log-likelihood decreases as the bias increases. Fig. 8 shows the typical biased
posterior learned by uncertain-GP for the sinc function and the
convolution learned by the classic-GP. Although performance
measures show similar curves for both methods, one can arguably
prefer to obtain a biased function having the shape of the function
rather than some estimations which capture less structure in
the data.
In the context of learning the model of a partially observable
Markov decision process [16], one has to identify first the
dynamics, or state transition function, to be able to infer
the state of the system. A reinforcement learning algorithm uses
the inferred belief state to learn a value function by mapping the
uncertain state to its expected reward. Learning the optimal
policy can still be achieved if the learned dynamics are biased,
since the reinforcement learning algorithm learns over this bias.
Moreover, even a model far from the true dynamics could lead the algorithm to learn a good policy, provided the estimated model's shape is close enough to the dynamics' true behaviour. Getting closer to
this goal, experiments have been conducted on the identification
of a dynamical system.
Fig. 7. Mean squared error results on the cart–pole problem. Squared red line, uncertain-GP; circled blue line, classic-GP; triangled green line, MLP. (a) Position of the cart; (b) velocity of the cart; (c) pole angle; (d) angular velocity of the pole. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
The cart–pole problem is well known in the reinforcement
learning community. The last experiments aim to learn its
dynamics from uncertain state transition examples. The results
are exhibited in Fig. 7. The performances of the uncertain-GP and
classic-GP methods are compared under the mean squared error
measure. Again, using the expected covariance provided better
estimates of the true function. However, much more data would have been required to correctly identify the dynamics; the complexity of computing a posterior Gaussian process is O(n³), and with hyperparameter optimization this cost is incurred at each optimization step. Moreover, computing the covariance matrix with the covariance function (17) takes more time, since it now involves determinant computations.
The optimization procedure is not only computationally cumbersome: the log marginal likelihood is riddled with local maxima into which vanilla gradient methods are quickly drawn. The experiments were therefore carried out using stochastic optimization methods to move among the maxima and find the best one. An interesting avenue would be to look at natural gradient approaches, which are well suited for this task [17].
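The paper does not specify which stochastic optimizer was used; a minimal stand-in that captures the idea of moving among local maxima is a perturbation-based random search over the log-hyperparameters. All names here are our own:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_search_maximize(log_likelihood, dim, n_iter=500, scale=1.0):
    """Stochastic hyperparameter search: propose Gaussian perturbations of the
    current best point in log-space and keep any proposal that improves the
    objective. A simple stand-in for the unspecified stochastic method."""
    best = rng.normal(size=dim)          # initial log-hyperparameters
    best_val = log_likelihood(best)
    for _ in range(n_iter):
        cand = best + rng.normal(size=dim) * scale
        val = log_likelihood(cand)
        if val > best_val:               # greedy accept; no gradients involved
            best, best_val = cand, val
    return best, best_val
```

Because proposals are global perturbations rather than gradient steps, the search can escape the shallow local maxima that trap conjugate gradient methods.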
6. Future work
Many previously proposed approaches for learning from
uncertain data relied on the estimation of the hidden true inputs
and outputs. Albeit not reported in this work, we conducted some
experiments on computing a posterior Gaussian distribution over
the training set. Since computing such a posterior is not analytically tractable and might not be Gaussian, one has to look for
approximations. By using Jensen’s inequality, we obtained a lower
bound approximation of the log marginal likelihood of the data
and estimated the posterior distribution of the data with Gaussian
distributions. The resulting quantity to maximize is

\log p(y \mid X, \theta) - D_{\mathrm{KL}}(q(X) \,\|\, p(X))     (21)

where p(X) is the prior distribution over the training set and q(X) the posterior. Eq. (21) corresponds to the common log marginal likelihood (9) with a penalty given by the Kullback–Leibler divergence of the estimated posterior over inputs with respect to its prior. The covariance matrix is computed with the expected squared exponential covariance function (17) with respect to the learned posterior over inputs. We obtained no significant results with this method, mostly because the optimization task is very hard, each input mean and variance becoming a variable to optimize. Further work is currently ongoing to address this issue.
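For completeness, a bound of this form follows from applying Jensen's inequality to the log marginal likelihood with the inputs marginalized out; in this sketch, the expectation term is the quantity the method approximates via the expected covariance under the learned posterior q(X):

```latex
\log p(y \mid \theta)
  = \log \int p(y \mid X, \theta)\, p(X)\, \mathrm{d}X
  = \log \int q(X)\, \frac{p(y \mid X, \theta)\, p(X)}{q(X)}\, \mathrm{d}X \\
\geq \int q(X) \log \frac{p(y \mid X, \theta)\, p(X)}{q(X)}\, \mathrm{d}X
  = \mathbb{E}_{q(X)}\!\left[\log p(y \mid X, \theta)\right]
    - D_{\mathrm{KL}}\!\left(q(X) \,\middle\|\, p(X)\right)
```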
7. Conclusion
To conclude, we investigated the use of the expected covar-
iance matrix in Gaussian processes when learning from uncertain
data. Results indicate that in many cases, depending on the
amount of uncertainty, the proposed approach yields better
inference. An interesting property is that even when the improve-
ment on the mean squared error is negligible, the method will
often improve the likelihood of the posterior over functions. We
also stated that a bias may occur when using the expected
covariance without denoising, but such biased models can still
be useful for certain applications. Finally, the method finds its
best performance when used on significantly uncertain data sets.
In other cases, the approach provides performances similar to
standard Gaussian process regression.
Appendix A. Expected linear covariance function
The expected linear covariance with respect to Gaussian input distributions p(x_i) = \mathcal{N}_{x_i}(u_i, \Sigma_i) and p(x_j) = \mathcal{N}_{x_j}(u_j, \Sigma_j) is given by:

K_{ij} = \iint C_{\mathrm{lin}}(x_i, x_j)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, \mathcal{N}_{x_j}(u_j, \Sigma_j)\, \mathrm{d}x_i\, \mathrm{d}x_j
       = \mathbb{E}_i[\mathbb{E}_j[x_i^\top x_j + \sigma_b^2]]
       = \mathbb{E}_i[x_i]^\top \mathbb{E}_j[x_j] + \sigma_b^2
       = u_i^\top u_j + \sigma_b^2

Computing the expected variance is done with:

K_{ii} = \int C_{\mathrm{lin}}(x_i, x_i)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, \mathrm{d}x_i
       = \mathbb{E}_i[x_i^\top x_i + \sigma_b^2]
       = \mathbb{E}_i[x_i^\top x_i] + \sigma_b^2
       = \mathrm{Tr}(\Sigma_i) + u_i^\top u_i + \sigma_b^2

By observing that the only difference between the variance and covariance cases is the presence of a trace term, we can express the expected linear covariance function as

C_{\mathrm{Elin}}((u_i, \Sigma_i), (u_j, \Sigma_j)) = u_i^\top u_j + \sigma_b^2 + \delta_{ij}\, \mathrm{Tr}(\Sigma_i)

by the introduction of a Kronecker delta.
Appendix B. Expected quadratic covariance function
The expected quadratic covariance with respect to Gaussian input distributions p(x_i) = \mathcal{N}_{x_i}(u_i, \Sigma_i) and p(x_j) = \mathcal{N}_{x_j}(u_j, \Sigma_j) is
Fig. 8. Typical posterior distribution over functions when learning the sinc function with uncertain inputs and outputs. (a) Using the expected covariance matrix (uncertain-GP); (b) input noise treated as extra output noise (classic-GP).
given by:

K_{ij} = \iint C_{\mathrm{qua}}(x_i, x_j)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, \mathcal{N}_{x_j}(u_j, \Sigma_j)\, \mathrm{d}x_i\, \mathrm{d}x_j
       = \mathbb{E}_i[\mathbb{E}_j[(x_i^\top x_j + \sigma_b^2)^2]]
       = \mathbb{E}_i[\mathbb{E}_j[x_i^\top x_j x_j^\top x_i]] + 2\sigma_b^2\, \mathbb{E}_i[\mathbb{E}_j[x_i^\top x_j]] + \sigma_b^4
       = \mathbb{E}_i[x_i^\top \mathbb{E}_j[x_j x_j^\top] x_i] + 2\sigma_b^2\, u_i^\top u_j + \sigma_b^4
       = \mathbb{E}_i[x_i^\top (\Sigma_j + u_j u_j^\top) x_i] + 2\sigma_b^2\, u_i^\top u_j + \sigma_b^4
       = \mathrm{Tr}((\Sigma_j + u_j u_j^\top)\Sigma_i) + u_i^\top (\Sigma_j + u_j u_j^\top) u_i + 2\sigma_b^2\, u_i^\top u_j + \sigma_b^4

where we use some properties of the trace to obtain a more intuitive form for the first two terms:

\mathrm{Tr}((\Sigma_j + u_j u_j^\top)\Sigma_i) + u_i^\top (\Sigma_j + u_j u_j^\top) u_i
  = \mathrm{Tr}((\Sigma_j + u_j u_j^\top)\Sigma_i) + \mathrm{Tr}(u_i^\top (\Sigma_j + u_j u_j^\top) u_i)
  = \mathrm{Tr}((\Sigma_j + u_j u_j^\top)\Sigma_i) + \mathrm{Tr}((\Sigma_j + u_j u_j^\top) u_i u_i^\top)
  = \mathrm{Tr}((\Sigma_j + u_j u_j^\top)\Sigma_i + (\Sigma_j + u_j u_j^\top) u_i u_i^\top)
  = \mathrm{Tr}((\Sigma_j + u_j u_j^\top)(\Sigma_i + u_i u_i^\top))

By substituting back in the expected covariance equation, we have

K_{ij} = \mathrm{Tr}((\Sigma_j + u_j u_j^\top)(\Sigma_i + u_i u_i^\top)) + 2\sigma_b^2\, u_i^\top u_j + \sigma_b^4

Computing the expected variance is done with:

K_{ii} = \int C_{\mathrm{qua}}(x_i, x_i)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, \mathrm{d}x_i
       = \mathbb{E}_i[(x_i^\top x_i + \sigma_b^2)^2]
       = \mathbb{E}_i[x_i^\top x_i x_i^\top x_i] + 2\sigma_b^2\, \mathbb{E}_i[x_i^\top x_i] + \sigma_b^4
       = 2\,\mathrm{Tr}(\Sigma_i^2) + 4\, u_i^\top \Sigma_i u_i + \left(\mathrm{Tr}(\Sigma_i + u_i u_i^\top)\right)^2 + 2\sigma_b^2\, \mathrm{Tr}(\Sigma_i + u_i u_i^\top) + \sigma_b^4

By observing the differences between variance and covariance, we can express the expected quadratic covariance function as:

C_{\mathrm{Equa}}((u_i, \Sigma_i), (u_j, \Sigma_j)) = \mathrm{Tr}((\Sigma_i + u_i u_i^\top)(\Sigma_j + u_j u_j^\top))(1 + \delta_{ij}) + 2\sigma_b^2\, u_i^\top u_j + \sigma_b^4 + \delta_{ij}\left[2\sigma_b^2\, \mathrm{Tr}(\Sigma_i) + \left(\mathrm{Tr}(\Sigma_i + u_i u_i^\top)\right)^2 - 2\,(u_i^\top u_i)^2\right]

by the introduction of a Kronecker delta.
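The combined formula above can likewise be implemented directly; the names and the boolean flag standing in for the Kronecker delta are our own:

```python
import numpy as np

def expected_quadratic_cov(u_i, S_i, u_j, S_j, sigma_b2, same_point):
    """Expected quadratic covariance E[(x_i^T x_j + sigma_b^2)^2] under
    Gaussian input distributions, following the Appendix B closed form."""
    A = S_i + np.outer(u_i, u_i)        # Sigma_i + u_i u_i^T
    B = S_j + np.outer(u_j, u_j)        # Sigma_j + u_j u_j^T
    k = np.trace(A @ B) * (2.0 if same_point else 1.0)   # (1 + delta_ij) factor
    k += 2.0 * sigma_b2 * float(u_i @ u_j) + sigma_b2**2
    if same_point:                      # delta_ij correction terms
        k += (2.0 * sigma_b2 * np.trace(S_i)
              + np.trace(A) ** 2
              - 2.0 * (u_i @ u_i) ** 2)
    return float(k)
```

With degenerate (zero-covariance) inputs the expectation collapses to the deterministic quadratic kernel, which gives a quick consistency check.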
Appendix C. Expected squared exponential covariance
The expected squared exponential covariance with respect to Gaussian input distributions p(x_i) = \mathcal{N}_{x_i}(u_i, \Sigma_i) and p(x_j) = \mathcal{N}_{x_j}(u_j, \Sigma_j) is given by:

K_{ij} = \iint C_{\mathrm{se}}(x_i, x_j)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, \mathcal{N}_{x_j}(u_j, \Sigma_j)\, \mathrm{d}x_i\, \mathrm{d}x_j

Using the Gaussian form of the squared exponential covariance function, C_{\mathrm{se}}(x_i, x_j) can be rewritten as the normalized Gaussian distribution c\, \mathcal{N}_{x_i}(x_j, W), where c = \sigma_f^2 (2\pi)^{D/2} |W|^{1/2} is its normalization constant:

K_{ij} = c \iint \mathcal{N}_{x_i}(x_j, W)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, \mathcal{N}_{x_j}(u_j, \Sigma_j)\, \mathrm{d}x_i\, \mathrm{d}x_j

Using the fact that the product of Gaussians results in another Gaussian, which is no longer normalized:

K_{ij} = c \iint Z^{-1}\, \mathcal{N}_{x_i}(a, B)\, \mathcal{N}_{x_j}(u_j, \Sigma_j)\, \mathrm{d}x_i\, \mathrm{d}x_j

where the resulting Gaussian with mean vector a and covariance matrix B will be eliminated by integration. It turns out that the factor has a Gaussian form such that Z^{-1} = \mathcal{N}_{x_j}(u_i, W + \Sigma_i). Therefore, by integrating over x_i, we obtain:

K_{ij} = c \int \mathcal{N}_{x_j}(u_i, W + \Sigma_i)\, \mathcal{N}_{x_j}(u_j, \Sigma_j)\, \mathrm{d}x_j

For the second variable, we use the same reasoning. Thus, by applying the product of Gaussians again and this time integrating over x_j, we obtain:

K_{ij} = c\, \mathcal{N}_{u_i}(u_j, W + \Sigma_i + \Sigma_j)
       = \sigma_f^2 (2\pi)^{D/2} |W|^{1/2}\, \mathcal{N}_{u_i}(u_j, W + \Sigma_i + \Sigma_j)
       = \frac{\sigma_f^2}{|I + W^{-1}(\Sigma_i + \Sigma_j)|^{1/2}} \exp\left(-\tfrac{1}{2}(u_i - u_j)^\top (W + \Sigma_i + \Sigma_j)^{-1} (u_i - u_j)\right)

The expected squared exponential variance is easier to compute. First, we have:

K_{ii} = \int C_{\mathrm{se}}(x_i, x_i)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, \mathrm{d}x_i

By the definition of a stationary covariance function, C(x, x) is constant, which leads directly to

K_{ii} = C_{\mathrm{se}}(x_i, x_i)

By the definition of the squared exponential covariance function, C_{\mathrm{se}}(x_i, x_i) = \sigma_f^2.

Finally, combining the variance and covariance equations in a single function can be done by introducing an inverted Kronecker delta:

C_{\mathrm{Ese}}((u_i, \Sigma_i), (u_j, \Sigma_j)) = \frac{\sigma_f^2}{|I + W^{-1}(\Sigma_i + \Sigma_j)(1 - \delta_{ij})|^{1/2}} \exp\left(-\tfrac{1}{2}(u_i - u_j)^\top (W + \Sigma_i + \Sigma_j)^{-1} (u_i - u_j)\right)
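The closed form above, which the main text refers to as covariance function (17), can be sketched as follows; the function name and the boolean flag standing in for the inverted Kronecker delta are our own:

```python
import numpy as np

def expected_se_cov(u_i, S_i, u_j, S_j, sigma_f2, W, same_point):
    """Expected squared exponential covariance under Gaussian input
    distributions (Appendix C). W is the matrix of squared length-scales."""
    if same_point:
        return sigma_f2                 # stationary kernel: C(x, x) = sigma_f^2
    S = S_i + S_j                       # combined input uncertainty
    d = u_i - u_j
    # Determinant correction |I + W^{-1}(Sigma_i + Sigma_j)|^{-1/2}
    norm = np.linalg.det(np.eye(len(u_i)) + np.linalg.solve(W, S)) ** -0.5
    # Quadratic form with the inflated covariance W + Sigma_i + Sigma_j
    quad = float(d @ np.linalg.solve(W + S, d))
    return sigma_f2 * norm * np.exp(-0.5 * quad)
```

With zero input covariances the expression reduces to the ordinary squared exponential kernel, while growing input uncertainty shrinks the off-diagonal covariances toward zero.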
References
[1] G.H. Golub, C.F. Van Loan, An analysis of the total least squares problem, SIAM Journal on Numerical Analysis 17 (6) (1980) 883–893.
[2] R.J. Carroll, D. Ruppert, L.A. Stefanski, Measurement Error in Nonlinear Models, Chapman and Hall/CRC, 1995.
[3] I. Markovsky, S. Van Huffel, Overview of total least-squares methods, Signal Processing 87 (10) (2007) 2283–2302.
[4] V. Tresp, S. Ahmad, R. Neuneier, Training neural networks with deficient data, in: Advances in Neural Information Processing Systems (NIPS'94), Morgan Kaufmann, 1994, pp. 128–135.
[5] W.A. Wright, G. Ramage, D. Cornford, I.T. Nabney, Neural network modelling with input uncertainty: theory and application, Journal of VLSI Signal Processing Systems 26 (1/2) (2000) 169–188.
[6] J.-A. Ting, A. D'Souza, S. Schaal, Bayesian regression with input noise for high dimensional data, in: Proceedings of the International Conference on Machine Learning (ICML'06), ACM, 2006, pp. 937–944.
[7] J. Bi, T. Zhang, Support vector classification with input data uncertainty, in: Advances in Neural Information Processing Systems (NIPS'04), vol. 17, MIT Press, 2004, pp. 161–168.
[8] A. Girard, R. Murray-Smith, Learning a Gaussian process model with uncertain inputs, Technical Report TR-2003-144, University of Glasgow, Glasgow, UK, 2003.
[9] J. Quiñonero-Candela, S.T. Roweis, Data imputation and robust training with Gaussian processes, 2003.
[10] A. Girard, Approximate methods for propagation of uncertainty with Gaussian process models, Ph.D. Thesis, 2004.
[11] C.K.I. Williams, Computation with infinite neural networks, Neural Computation 10 (5) (1998) 1203–1216.
[12] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
[13] D.J.C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.
[14] J. Moré, The Levenberg–Marquardt algorithm: implementation and theory, in: G.A. Watson (Ed.), Numerical Analysis, Lecture Notes in Mathematics, vol. 630, Springer, Berlin, Heidelberg, 1978, pp. 105–116 (Chapter 10).
[15] R. Florian, Correct equations for the dynamics of the cart–pole system, Center for Cognitive and Neural Studies (Coneural), Romania, 2007.
[16] P. Dallaire, C. Besse, S. Ross, B. Chaib-draa, Bayesian reinforcement learning in continuous POMDPs with Gaussian processes, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'09), IEEE Press, 2009, pp. 2604–2609.
[17] N.L. Roux, P.-A. Manzagol, Y. Bengio, Topmoumoute online natural gradient algorithm, in: Advances in Neural Information Processing Systems (NIPS'07), MIT Press, 2008, pp. 849–856.
Brahim Chaib-draa received a Diploma in Computer Engineering from the École Supérieure d'Électricité (SUPELEC) de Paris, Paris, France, in 1978 and a Ph.D. degree in Computer Science from the Université du Hainaut-Cambrésis, Valenciennes, France, in 1990. In 1990, he joined the Department of Computer Science and Software Engineering at Laval University, Quebec, QC, Canada, where he is a Professor and Group Leader of the Decision for Agents and Multi-Agent Systems (DAMAS) Group. His research interests include agent and multiagent technologies, natural language for interaction, formal systems for agents and multiagent systems, distributed practical reasoning, and real-time and distributed systems. He is the author of several technical publications in these areas. He is on the Editorial Boards of IEEE Transactions on SMC, Computational Intelligence and The International Journal of Grids and Multiagent Systems. Dr. Chaib-draa is a member of ACM and AAAI and senior member of the IEEE Computer Society.
Camille Besse was born in 1981. He received the B.Sc.
degree in Intelligent Systems in 2002 and his M.Sc.
degree in Artificial Intelligence in 2005 from Paul
Sabatier University, Toulouse, France. He is currently
a Ph.D. student under the supervision of Prof. Brahim
Chaib-draa and member of the DAMAS laboratory
research group. His current research interests include
partially observable Markov decision processes for
single and multiple agents, for planning and reinforce-
ment learning.
Patrick Dallaire was born in 1982. He received the
B.Sc. degree in Computer Science in 2008 from Laval
University, Canada. He is currently a M.Sc. student
under the supervision of Prof. Brahim Chaib-draa and
member of the DAMAS laboratory research group. His
current research interests include Gaussian processes,
partially observable Markov decision processes, Baye-
sian learning and reinforcement learning.