# An approximate inference with Gaussian process to latent functions from uncertain data.

**ABSTRACT** Most formulations of supervised learning assume that only the output data are uncertain. However, this assumption might be too strong for some learning tasks. This paper investigates the use of Gaussian processes to infer latent functions from a set of uncertain input–output examples. By assuming Gaussian distributions with known variances over the inputs and outputs and using the expectation of the covariance function, it is possible to analytically compute the expected covariance matrix of the data and obtain a posterior distribution over functions. The method is evaluated on a synthetic problem and on a more realistic one, which consists in learning the dynamics of a cart–pole balancing task. The results indicate an improvement in the mean squared error and the likelihood of the posterior Gaussian process when the data uncertainty is significant.



Patrick Dallaire*, Camille Besse, Brahim Chaib-draa

DAMAS Laboratory, Computer Science and Software Engineering Department, Laval University, Canada

Article info

Available online 19 February 2011

Keywords: Supervised learning; Gaussian processes; Data uncertainty; Dynamical systems


© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Supervised learning refers to the class of problems where a learner has to infer a function $f : X \to Y$ given a set of labelled examples. A typical underlying assumption concerning the training examples is that the inputs are known precisely and that only the output variables are affected by noise. However, this assumption can be unrealistic in many applications. There might be uncontrollable effects that corrupt the inputs in the data gathering process, which can affect the resulting quality of the estimated model if this fact is ignored in the learning formulation.

The problem of learning in the context of noisy input and output has been tackled in several ways. In computational mathematics and engineering, one generally uses a method known as total least-squares which consists in finding a minimal correction to apply on all data points such that the modified data satisfies a linear relation [1]. For their part, statisticians have investigated the error-in-variables models to deal with errors on the dependent variables as well as on the independent ones [2]. The general error-in-variables approach aims to create latent variables that correspond to the true observations which follow the underlying sought relation. There is a close link between the error-in-variables models and the total least-squares methods. The statistical model which corresponds to the basic total least-squares approach is the error-in-variables model with the restrictive condition that the measurement errors are zero mean, independent and identically distributed [3].

Researchers in machine learning have also addressed the problem of learning from uncertain data. Tresp et al. [4] proposed incorporating deficient data into the training of a neural network by integrating over the uncertain input using an estimated probability distribution for these inputs. Moreover, they showed that the expectation of the learned function will be biased if the inputs are altered with Gaussian noise along with increased error bars on predictions. Wright et al. [5] presented a framework for Bayesian neural networks where they infer the regression over the noiseless input by using Markov chain Monte Carlo sampling over the hidden variables. More recently, Ting et al. [6] proposed a Bayesian linear regression model including a precision parameter which enforces interdependency between the input and output noise models. Their algorithm then attempts to clean the uncertain data by using the expectation-maximization principle.

Kernel methods have also been used to learn from uncertain inputs. For instance, in classification tasks, Bi and Zhang [7] proposed a statistical model that extends Support Vector Classification methods in order to deal with uncertain inputs. For this purpose, they considered that each unobserved input is associated to exactly one component of a Gaussian mixture model. Then, by estimating the parameters of their Gaussian noise model, they were able to take input uncertainty into consideration.

The virtues of Gaussian processes to learn from uncertain inputs have also been investigated by some researchers. First, Girard and Murray-Smith [8] showed that the covariance of the data can be approximated using a new correlated process. This process uses a second order Taylor expansion to obtain a corrected covariance function accounting for input noise. Quiñonero-Candela and Roweis [9] stated that computing the marginal likelihood analytically is generally not possible. However, optimizing a lower bound of this quantity can lead to denoised inputs while learning the model. On the other hand, instead of approximating the covariance matrix, both previous authors mentioned that taking the expectation could lead to better estimation of the marginal likelihood [9,10].

Neurocomputing 74 (2011) 1945–1955. Journal homepage: www.elsevier.com/locate/neucom. 0925-2312/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2010.09.024

*Corresponding author. E-mail addresses: dallaire@damas.ift.ulaval.ca (P. Dallaire), besse@damas.ift.ulaval.ca (C. Besse), chaib@damas.ift.ulaval.ca (B. Chaib-draa).

In the same vein, this paper investigates an approach in which we not only make predictions for uncertain inputs with Gaussian processes, but also learn from uncertain training sets. To do so, we marginalize out the inputs' uncertainty to obtain the expected covariance matrix and keep an analytical posterior distribution over functions. The main contribution of this paper is a formulation which allows learning from uncertain inputs and outputs by using Gaussian processes exactly as in the noise-free case. Results show that taking into account the uncertainty with this method improves the mean squared error and the likelihood of the sought function. As the noise decreases, the method tends towards its noise-free counterpart and therefore cannot do worse than standard Gaussian process regression. To show that, the method is applied to the well-known task of balancing a pole over a cart, where the problem is to learn the four-dimensional (+ control) nonlinear dynamics of the system.

This paper is structured as follows. First, we formalize the problem of learning with Gaussian processes and the regression model. Then, in Section 3, we present insights to account for uncertain data. In Section 4, we present the experimental results on a difficult synthetic problem and on a more realistic problem. Section 5 discusses the results before Section 7 opens research avenues and concludes the paper.

2. Gaussian processes

Gaussian Processes (GPs) are stochastic processes which are used in machine learning to describe distributions directly in the space of functions. In supervised learning, we have to make the mandatory assumption that training examples are informative for further predictions. This information can be formalized in many ways. Some methods use correlation between the examples to express this information. When using Gaussian processes, it is assumed that the joint distribution of the data is a multivariate Gaussian. Consequently, the problem is to find a covariance function which explains the data properly. One interesting property of GPs is that they provide a probabilistic approach to the learning task and give uncertainty estimates while making predictions.

2.1. Gaussian processes for regression

As mentioned above, it is assumed that the joint distribution of a finite set of output observations given their inputs is multivariate Gaussian. Thus, a GP is fully specified by its mean function $m(x)$ and covariance function $C(x, x')$. First, we assume that a set of training data $D = \{x_i, y_i\}_{i=1}^{N}$ is available where $x_i \in \mathbb{R}^D$, $y_i$ is a scalar observation such that

$$y_i = f(x_i) + \epsilon_i \qquad (1)$$

and where $\epsilon_i$ is a white Gaussian noise with variance $\sigma_\epsilon^2$. For convenience, we use the notation $X = [x_1, \ldots, x_N]$ for inputs and $y = [y_1, \ldots, y_N]$ for outputs. According to the definition of a GP, the joint distribution of the training set is $\mathcal{N}(m, K + \sigma_\epsilon^2 I)$, where the vector $m$ and the matrix $K$ are computed with the mean and covariance functions. The components of $m$ are computed such that $m_i = m(x_i)$. However, for the sake of notational simplicity, we make the weak assumption that the mean function is $m(x) = 0$.¹ Then, the joint distribution of the training set becomes

$$y \mid X \sim \mathcal{N}(0, K + \sigma_\epsilon^2 I) \qquad (2)$$

where $K$ is the covariance matrix whose entries $K_{ij}$ are given by the covariance function $C(x_i, x_j)$. This equation is fundamental, since it corresponds to the Gaussian process prior assumption, which is used hereafter to make predictions.

There is a variety of covariance functions, each of them making different forms of functions more probable than others for a Gaussian process. Selecting the covariance function is an important aspect of the learning task. In particular, using Gaussian processes with certain covariance functions corresponds to using a neural network having an infinite number of sigmoidal or Gaussian hidden units [11]. This choice has to be made a priori and reflects the prior knowledge concerning the function to estimate directly in the space of functions, the so-called Gaussian process prior. The multivariate Gaussian prior (2) defined over the training observations can be used to compute the posterior distribution over functions. Since this posterior distribution is obtained by computing the conditional distribution of a Gaussian, the resulting posterior remains a Gaussian process.

Accordingly, making predictions can be done by using the posterior Gaussian process' mean and its associated measure of uncertainty, given by the posterior covariance. For a single test input $x_*$, the predictive distribution of the noise-free output $f_*$ is obtained by first forming the joint distribution with (2)

$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K + \sigma_\epsilon^2 I & k_* \\ k_*^\top & k_{**} \end{bmatrix} \right) \qquad (3)$$

where $k_*$ is the $N \times 1$ vector of cross-covariance $C(x_*, x_i)$ between the test input $x_*$ and the training inputs $X$, and $k_{**}$ is the prior variance given by $C(x_*, x_*)$. By computing the conditional distribution of (3) such that

$$f_* \mid x_*, X, y \sim \mathcal{N}(\mu(x_*), \sigma^2(x_*))$$

we obtain the Gaussian predictive distribution for any single input $x_*$ with mean and variance given by

$$\mu(x_*) = k_*^\top (K + \sigma_\epsilon^2 I)^{-1} y \qquad (4)$$

$$\sigma^2(x_*) = k_{**} - k_*^\top (K + \sigma_\epsilon^2 I)^{-1} k_* \qquad (5)$$

It should be pointed out that Eqs. (4) and (5) produce an estimation of the latent function $f$ and not its noisy version $y$. The interested reader is invited to refer to [12] for more details on different formulations of Gaussian processes.
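As an illustration of Eqs. (4) and (5), the following minimal sketch computes the predictive mean and variance with NumPy. The function names `se_kernel` and `gp_predict`, the toy data and the hyperparameter values are assumptions made for this example and do not come from the paper; a squared exponential covariance with diagonal $W$ is assumed, and a Cholesky factorization replaces the explicit matrix inverse.

```python
import numpy as np

def se_kernel(A, B, w_diag, signal_var):
    """Squared exponential covariance between rows of A (n, d) and B (m, d),
    with W = diag(w_diag) and signal variance signal_var."""
    diff = A[:, None, :] - B[None, :, :]           # shape (n, m, d)
    sq = np.sum(diff ** 2 / w_diag, axis=-1)       # (a - b)^T W^{-1} (a - b)
    return signal_var * np.exp(-0.5 * sq)

def gp_predict(X, y, X_star, w_diag, signal_var, noise_var):
    """Eqs. (4)-(5): predictive mean and variance of the latent f at X_star."""
    K = se_kernel(X, X, w_diag, signal_var) + noise_var * np.eye(len(X))
    k_star = se_kernel(X, X_star, w_diag, signal_var)     # shape (N, M)
    k_ss = np.full(len(X_star), signal_var)               # C(x*, x*) for this stationary kernel
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + noise I)^{-1} y
    v = np.linalg.solve(L, k_star)                        # L^{-1} k_*
    mean = k_star.T @ alpha
    var = k_ss - np.sum(v ** 2, axis=0)
    return mean, var

# Toy data (hypothetical): noisy samples of sin(x)
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(-3.0, 3.0, 5)[:, None]
mean, var = gp_predict(X, y, X_star, w_diag=np.array([1.0]),
                       signal_var=1.0, noise_var=0.01)
```

The returned variance is that of the latent function; adding `noise_var` to it would give the variance of a noisy observation instead.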

2.2. Gaussian process prior

Although many covariance functions can be used to define the Gaussian process prior in (2), this paper mainly focuses on the squared exponential (se) covariance function:

$$C_{se}(x_i, x_j) = \sigma_f^2 \exp\left(-\tfrac{1}{2}(x_i - x_j)^\top W^{-1} (x_i - x_j)\right) \qquad (6)$$

which is one of the most widely used kernel functions. In particular, using this covariance function is equivalent to a linear combination of an infinite number of Gaussian basis functions [13]. Since the outputs might be corrupted by noise, we add the covariance function of an independent noise process:

$$C_{n}(x_i, x_j) = \sigma_\epsilon^2 \delta_{ij} \qquad (7)$$

¹ It is common to find the bias of the training set and to subtract it from the data, making the zero-mean prior the ideal choice.


where $\delta_{ij}$ is the Kronecker delta. It results in a covariance function parameterized by the vector of hyperparameters

$$\theta = [W, \sigma_f^2, \sigma_\epsilon^2], \qquad (8)$$

where $W$ is the diagonal matrix of characteristic length-scales, which accounts for a different covariance measure for each input dimension, $\sigma_f^2$ is the signal variance influencing the order of magnitude of the functions and $\sigma_\epsilon^2$ represents the variance of the noise process.

Varying these hyperparameters affects the interpretation of the training data by changing the shape of the functions that the GP prior expects. It might be difficult a priori to fix the hyperparameters of a covariance function and to expect these to fit the observed data correctly. A common way to estimate the hyperparameters is to maximize the log marginal likelihood of the observations [12]. In that case, it corresponds to taking the logarithm of (2) and maximizing this quantity with respect to the hyperparameters $\theta$:

$$\log p(y \mid X, \theta) = -\tfrac{1}{2} y^\top (K + \sigma_\epsilon^2 I)^{-1} y - \tfrac{1}{2} \log |K + \sigma_\epsilon^2 I| - \tfrac{N}{2} \log 2\pi \qquad (9)$$

since the joint distribution of the observations is a multivariate Gaussian under a zero-mean prior. The optimization problem can be tackled using any gradient method to find an acceptable local maximum or the global maximum if possible.
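The log marginal likelihood of Eq. (9) can be evaluated as in the sketch below. This is only an illustration: the helper names `log_marginal_likelihood` and `K_for`, the toy data and the small grid of candidate length-scale values are assumptions, and the grid search merely stands in for the gradient-based optimization mentioned in the text.

```python
import numpy as np

def log_marginal_likelihood(K_noisy, y):
    """Eq. (9), given K_noisy = K + sigma_e^2 I and the output vector y."""
    n = len(y)
    L = np.linalg.cholesky(K_noisy)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K_noisy^{-1} y
    data_fit = -0.5 * y @ alpha
    half_log_det = np.sum(np.log(np.diag(L)))             # equals 0.5 * log|K_noisy|
    return data_fit - half_log_det - 0.5 * n * np.log(2.0 * np.pi)

# Toy one-dimensional data (hypothetical)
x = np.linspace(-3.0, 3.0, 15)
y = np.sin(x)

def K_for(w, signal_var=1.0, noise_var=0.01):
    """Squared exponential Gram matrix plus independent noise, for scalar W = w."""
    sq = (x[:, None] - x[None, :]) ** 2 / w
    return signal_var * np.exp(-0.5 * sq) + noise_var * np.eye(len(x))

candidates = [0.01, 1.0, 100.0]
scores = {w: log_marginal_likelihood(K_for(w), y) for w in candidates}
best_w = max(scores, key=scores.get)   # candidate with the highest likelihood
```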

3. Learning with uncertain data

In this paper, we consider the difficult problem of learning from uncertain examples. The ideal Bayesian solution to this task would be to make predictions by integrating over the uncertainty of the training set such that

$$p(f_* \mid x_*) = \iint p(f_* \mid x_*, X, y)\, p(X)\, p(y)\, dX\, dy \qquad (10)$$

but these integrals are analytically intractable in general. Thus, the following sections present an approach to approximate this predictive distribution, which deals with Gaussian input and output distributions.

3.1. Learning with uncertain inputs

As stated in the Introduction, the assumption that only the outputs are noisy is not enough for some learning tasks. Consider the case where the inputs are uncertain and where each input distribution is known. In this case, using Gaussian process regression as previously formulated will not handle the uncertainty explicitly. Nevertheless, one can apply the standard method naively by using only one representative per input, such as the mean, and ignoring other information about its distribution.

For Gaussian processes, accounting for uncertain examples implies computing uncertain covariance between these examples. In fact, the covariance of any pair of examples has a distribution that depends on the covariance function and the probability distribution of each input. Consequently, the covariance matrix has a very large distribution which makes the computation of the posterior over functions impossible in general. To deal with this issue, one generally relies on an approximation of the covariance matrix distribution. In this paper, we investigate a case where the expected covariance matrix is analytically computable.

Consider the case where inputs are a set of Gaussian distributions and where the true input value $x_i$ is not observable, but we have access to the parameters of its distribution. Thus, the expected covariance between uncertain data is obtained by computing expectations with respect to $x_i$ and $x_j$

$$K_{ij} = \iint C(x_i, x_j)\, p(x_i)\, p(x_j)\, dx_i\, dx_j, \quad \text{if } i \neq j \qquad (11)$$

where $p(x_i) = \mathcal{N}(u_i, \Sigma_i)$ and $p(x_j) = \mathcal{N}(u_j, \Sigma_j)$. Note that this equation is used to compute the off-diagonal elements of the covariance matrix $K$. Since the diagonal corresponds to variances, Eq. (11) cannot be used, because the two variables are in fact the same random variable. Therefore, the case of expected variance requires the computation of a single expectation:

$$K_{ii} = \int C(x_i, x_i)\, p(x_i)\, dx_i \qquad (12)$$

which is only useful for non-stationary covariance functions. For stationary covariance functions, such as the squared exponential, this integral simplifies to $\sigma_f^2$ by definition.²

Before discussing the expected squared exponential, we first derive the simpler case of the expected covariance of the linear covariance function $C_{lin}$. Basically, this covariance function can be expressed as

$$C_{lin}(x_i, x_j) = x_i^\top x_j + \sigma_b^2 \qquad (13)$$

where $\sigma_b^2 = 0$ makes the covariance function homogeneous and a non-zero value makes it inhomogeneous. In practice, homogeneous functions will force the estimated functions to pass through the origin and increasing $\sigma_b^2$ allows functions to be further from the origin. To obtain the expected version $C_{Elin}$ of this function, we have to compute Eqs. (11) and (12), which results in (see Appendix A for derivation)

$$C_{Elin}((u_i, \Sigma_i), (u_j, \Sigma_j)) = u_i^\top u_j + \sigma_b^2 + \delta_{ij} \mathrm{Tr}(\Sigma_i) \qquad (14)$$

where $\mathrm{Tr}$ denotes the trace of a matrix. The effect of the trace term here is to increase the output variance of training examples according to their input variance. Thus, the contribution of highly uncertain examples is automatically decreased.
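Eq. (14) can be checked numerically by comparing the closed form with a Monte Carlo estimate of the expectations in Eqs. (11) and (12). The sketch below is an illustration only: the function name `expected_linear`, the mean vector, the covariance matrix, the bias variance and the sample size are all values assumed for this example.

```python
import numpy as np

def expected_linear(u_i, u_j, S_i, bias_var, same_input):
    """Eq. (14): u_i^T u_j + sigma_b^2 + delta_ij * Tr(Sigma_i)."""
    delta = 1.0 if same_input else 0.0
    return float(u_i @ u_j + bias_var + delta * np.trace(S_i))

rng = np.random.default_rng(0)
u = np.array([1.0, 2.0])          # input mean (hypothetical)
S = np.diag([0.5, 0.2])           # input covariance (hypothetical)
bias_var = 0.3

# Diagonal case, Eq. (12): a single random input x ~ N(u, S)
x = rng.multivariate_normal(u, S, size=200_000)
mc_variance = np.mean(np.sum(x * x, axis=1) + bias_var)
closed_variance = expected_linear(u, u, S, bias_var, same_input=True)

# Off-diagonal case, Eq. (11): two independent inputs sharing the same distribution
x_other = rng.multivariate_normal(u, S, size=200_000)
mc_covariance = np.mean(np.sum(x * x_other, axis=1) + bias_var)
closed_covariance = expected_linear(u, u, S, bias_var, same_input=False)
```

With these values the trace term adds $\mathrm{Tr}(\Sigma) = 0.7$ to the diagonal entry only, which is how the input variance inflates the variance of an uncertain example.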

It is also possible to compute the expectation for polynomial covariance functions. However, as the degree increases, so does the complexity of the resulting expected covariance function. For instance, in the case of the quadratic covariance function:

$$C_{qua}(x_i, x_j) = (x_i^\top x_j + \sigma_b^2)^2 \qquad (15)$$

the computation of the expectations involves expectation over quadratic forms of Gaussian variables, which yields the following expected quadratic covariance function (see Appendix B for derivation)

$$\begin{aligned} C_{Equa}((u_i, \Sigma_i), (u_j, \Sigma_j)) = {} & \mathrm{Tr}\left([\Sigma_i + u_i u_i^\top][\Sigma_j + u_j u_j^\top]\right)(1 + \delta_{ij}) + 2\sigma_b^2 u_i^\top u_j + \sigma_b^4 \\ & + \delta_{ij}\left[\left(\mathrm{Tr}(\Sigma_i + u_i u_i^\top)\right)^2 + 2\sigma_b^2 \mathrm{Tr}(\Sigma_i) - 2(u_i^\top u_i)^2\right] \end{aligned} \qquad (16)$$

where the last line is used only to compute expected variances. Using a linear or quadratic covariance function is an arguably strong hypothesis on the space of functions. Therefore, one could look for a more general function.

It has been shown by Girard [10] that, for normally distributed inputs and using the squared exponential covariance function, integrating over independent input distributions is analytically feasible since it involves integrations over products of Gaussians. However, to obtain the correct expected squared exponential covariance function, we had to incorporate the variance case into the function. The resulting expected covariance $C_{Ese}$ is computed

² For a stationary covariance function, $C(x, x) = C(0, 0)$ is constant.


exactly with (see Appendix C for derivation)

$$C_{Ese}((u_i, \Sigma_i), (u_j, \Sigma_j)) = \frac{\sigma_f^2 \exp\left(-\tfrac{1}{2}(u_i - u_j)^\top (W + \Sigma_i + \Sigma_j)^{-1} (u_i - u_j)\right)}{\left|I + W^{-1}(\Sigma_i + \Sigma_j)(1 - \delta_{ij})\right|^{1/2}} \qquad (17)$$

which is again a squared exponential form, where the covariance decreases as the uncertainty over the inputs increases, but farther examples now become correlated, thus making the resulting process smoother when examples are more uncertain. In Eq. (17), we introduced the inverse Kronecker delta $(1 - \delta_{ij})$ in order to eliminate the reweighting effect of the denominator when computing a variance.
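A direct transcription of Eq. (17) is sketched below, together with a Monte Carlo sanity check on a one-dimensional pair of inputs distributed as $\mathcal{N}(0,1)$ and $\mathcal{N}(1,1)$ with all kernel parameters set to one. The function name `expected_se`, its argument layout and the sample size are assumptions made for this example.

```python
import numpy as np

def expected_se(u_i, S_i, u_j, S_j, W, signal_var, same_input=False):
    """Eq. (17) for Gaussian inputs N(u_i, S_i) and N(u_j, S_j).

    When same_input is True (delta_ij = 1) the means coincide and the
    determinant reduces to |I| = 1, so the value is simply signal_var.
    """
    if same_input:
        return float(signal_var)
    d = u_i - u_j
    S = S_i + S_j
    quad = d @ np.linalg.solve(W + S, d)                        # d^T (W + S)^{-1} d
    det = np.linalg.det(np.eye(len(d)) + np.linalg.solve(W, S)) # |I + W^{-1} S|
    return float(signal_var * np.exp(-0.5 * quad) / np.sqrt(det))

one = np.eye(1)
closed = expected_se(np.array([0.0]), one, np.array([1.0]), one,
                     W=one, signal_var=1.0)      # exp(-1/6) / sqrt(3), about 0.489

# Monte Carlo estimate of E[C_se(x_i, x_j)] for the same pair of inputs
rng = np.random.default_rng(0)
xi = rng.normal(0.0, 1.0, size=200_000)
xj = rng.normal(1.0, 1.0, size=200_000)
monte_carlo = np.mean(np.exp(-0.5 * (xi - xj) ** 2))
```

For this pair the closed form gives $\exp(-1/6)/\sqrt{3} \approx 0.489$, and the sampled average should agree with it up to Monte Carlo error.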

While computing the expected covariance matrix allows learning with Gaussian processes exactly as in the noise-free input case, this matrix can also be unlikely according to the full distribution over covariance matrices. For instance, Fig. 1 shows the case where the expected covariance between two training examples is exactly the worst choice. In that particular case, a better estimate would be to either say that the examples are covarying or not covarying at all. This setup is actually the worst possible, but it clearly indicates that the expected Gaussian process might not explain the training data properly in some cases.

A solution to this problem would be to clean the data by finding the maximum a posteriori or to compute a complete posterior Gaussian distribution on inputs (this avenue is discussed in Section 6), both modifying the covariance of the examples. However, among all possible density functions $p(C(x_i, x_j))$, only a few training examples suffer from this problem and the underlying function can still be inferred adequately for sufficiently large data sets.

3.2. Learning with uncertain outputs

So far, only the difficulty concerning uncertain inputs has been covered. Dealing with uncertain outputs having a Gaussian distribution is more manageable. Generally, the noise on the output is assumed stationary and has to be estimated from data. By observing each noisy output along with its variance, one can define a non-stationary noise process only known for the training examples. This noise process is introduced to reflect the outputs' uncertainty and decrease the influence of each example according to its respective variance. Formally, it entails a slight modification of the prior distribution $y \mid X \sim \mathcal{N}(0, K + \sigma_\epsilon^2 I)$, which consists in adding a diagonal covariance matrix containing the variances $\sigma_{y_i}^2$.
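The effect of this diagonal term can be seen on a deliberately tiny example, sketched below under stated assumptions: two conflicting observations share the same input, and the predictive mean of Eq. (4) is evaluated with the diagonal matrix of per-example output variances in place of $\sigma_\epsilon^2 I$. The helper `posterior_mean_at` and all numerical values are hypothetical.

```python
import numpy as np

def posterior_mean_at(x_star, x, y, output_vars, w=1.0, signal_var=1.0):
    """Predictive mean of Eq. (4) with diag(output_vars) as the noise matrix."""
    K = signal_var * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / w)
    k_star = signal_var * np.exp(-0.5 * (x - x_star) ** 2 / w)
    return float(k_star @ np.linalg.solve(K + np.diag(output_vars), y))

# Two conflicting observations of the same input x = 0
x = np.array([0.0, 0.0])
y = np.array([0.0, 1.0])

# Equal output variances: the prediction sits roughly halfway between the outputs
mean_equal = posterior_mean_at(0.0, x, y, np.array([0.01, 0.01]))

# The second output is far more uncertain: the prediction moves towards the first
mean_unequal = posterior_mean_at(0.0, x, y, np.array([0.01, 1.0]))
```

The second call shows how an example with a large output variance loses influence on the posterior without being discarded.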

3.3. Learning the covariance function hyperparameters

Theoretically, learning with uncertain data is as difficult as in the noise-free case when using the covariance function (17), although it might require more data. The posterior distribution over functions is found by using Eqs. (4) and (5) with the expected covariance matrix. The hyperparameters of the covariance function can be learned with the log marginal likelihood of the expected process, but this likelihood is often riddled with many local maxima. Using standard conjugate gradient methods will often lead to a local maximum that might not explain the data properly. In the case of a squared exponential covariance function, the matrix $W$ would tend to have large values on its diagonal, meaning that most dimensions are either very smooth or even irrelevant, and the value of $\sigma_\epsilon^2$ would be overestimated to transpose the input error in the output dimensions.

A solution to this difficulty is to find a maximum a posteriori (MAP) estimation of the hyperparameters instead of a maximum likelihood. Specifying a prior on the hyperparameters will thus act as a regularization term preventing improper local maxima. As stated before, it is easy for Gaussian processes to explain everything as noise. To avoid such inappropriate explanation, larger values of noise and characteristic length-scales can be penalized. However, a bad local maximum can be an indication that there is no apparent structure in the data or not enough data to find correlations.
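A MAP objective of this kind can be sketched as the log marginal likelihood of Eq. (9) plus a log prior on the hyperparameters. The text does not specify which prior to use, so the exponential prior below, its `rate` parameter, the function names, the toy data and the grid search are all assumptions chosen only to illustrate how large noise and length-scale values can be penalized.

```python
import numpy as np

def log_marginal_likelihood(K_noisy, y):
    """Eq. (9), given K_noisy = K + sigma_e^2 I."""
    L = np.linalg.cholesky(K_noisy)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2.0 * np.pi))

def log_prior(w, noise_var, rate=1.0):
    """Hypothetical exponential prior (up to a constant): penalizes large values."""
    return -rate * (w + noise_var)

def map_objective(w, noise_var, x, y, rate=1.0):
    """Log posterior of the hyperparameters, up to an additive constant."""
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / w) + noise_var * np.eye(len(x))
    return log_marginal_likelihood(K, y) + log_prior(w, noise_var, rate)

# Toy data (hypothetical) and a coarse grid standing in for a real optimizer
x = np.linspace(-3.0, 3.0, 15)
y = np.sin(x)
grid = [(w, s) for w in (0.1, 1.0, 10.0, 100.0) for s in (0.01, 0.1, 1.0)]
best_w, best_noise = max(grid, key=lambda p: map_objective(p[0], p[1], x, y))
```

Any prior that decreases with the noise variance and the length-scales would play the same regularizing role; the exponential form is used here only because it is simple.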

4. Experiments

The following experiments compare the performance of a Gaussian process accounting for data uncertainty through the computation of the expected covariance matrix (uncertain-GP) and a Gaussian process which only uses the mean of the inputs (classic-GP). Both Gaussian processes are based on the squared exponential covariance function, where the classic-GP has only access to (6) and the uncertain-GP has access to its expected version (17). The purpose of these experiments is to determine whether using the expected covariance matrix leads to any improvement over the naive method that only uses the inputs' mean.

Indeed, the standard Gaussian process does not usually deal with uncertainty over the inputs. The expected behaviour of this method would be to explain Gaussian input uncertainty as extra Gaussian noise over the outputs. While this extra noise is in general non-Gaussian, it can still be approximated as Gaussian noise.

First, we evaluate the behaviour of each method on a one-dimensional synthetic problem and then compare their performances on a harder problem which consists in learning the nonlinear dynamics of a cart–pole system.

4.1. Synthetic problem: sincsig

To visualize easily the behaviour of both learning methods, we have chosen a one-dimensional task where the problem consists in learning a function composed of the unnormalized sinc function and a biased sigmoid function such that

$$f(x) = \begin{cases} \sin(x)/x & \text{if } x \geq 0 \\ 0.5/(1 + \exp(-10x - 5)) + 0.5 & \text{otherwise} \end{cases} \qquad (18)$$

which we refer to as the sincsig function hereafter.

Fig. 1. Probability density function of the covariance between examples at $x_i \sim \mathcal{N}(0, 1)$ and $x_j \sim \mathcal{N}(1, 1)$ under a squared exponential with all parameters set to one. The expectation given by $C_{Ese}$ for these examples is 0.5. (Horizontal axis: covariance; vertical axis: likelihood.)

All evaluations have been done on randomly drawn training sets. Each data set is constructed by initially uniformly sampling $N$ points in $[-10, 10]$. These points are the noise-free inputs $\{x_i\}_{i=1}^{N}$. To obtain the associated noisy outputs, each of them is sampled according to $y_i \sim \mathcal{N}(\mathrm{sincsig}(x_i), \sigma_y^2)$. The set of uncertain inputs is constructed by first sampling the noise $\sigma_{x_i}^2$ to be applied on each input. Then, the sampled noise is used to corrupt the noise-free input $x_i$ according to $u_i \sim \mathcal{N}(x_i, \sigma_{x_i}^2)$. It is easy to see that $x_i \mid u_i, \sigma_{x_i}^2 \sim \mathcal{N}(u_i, \sigma_{x_i}^2)$. Finally, the complete training set is defined as $D = \{(u_i, \sigma_{x_i}^2), y_i\}_{i=1}^{N}$.

Fig. 2 shows a typical example of a training data set (crosses), with the true function to infer (solid line) and the resulting regression (thin line) for the uncertain-GP (top) and the classic-GP (bottom). Error bars around the thin lines indicate the confidence of the Gaussian process about its prediction and correspond to two standard deviations. Note here that whenever the solid line goes far from the error bars, the true underlying function becomes unlikely according to the prediction, which can be considered as inconsistent learning.
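The data construction described above can be sketched as follows. The function names `sincsig` and `make_dataset`, the random seed and the default arguments are assumptions made for this example; the sampling steps follow the text, with input standard deviations drawn uniformly as in the first experiment.

```python
import numpy as np

def sincsig(x):
    """Eq. (18): sin(x)/x for x >= 0, biased sigmoid otherwise."""
    x = np.asarray(x, dtype=float)
    sinc_part = np.sinc(x / np.pi)     # np.sinc(t) = sin(pi t)/(pi t), equal to 1 at t = 0
    sigmoid_part = 0.5 / (1.0 + np.exp(-10.0 * x - 5.0)) + 0.5
    return np.where(x >= 0, sinc_part, sigmoid_part)

def make_dataset(n, output_std, max_input_std, rng):
    """Sample one training set: noisy input means, input variances, noisy outputs."""
    x = rng.uniform(-10.0, 10.0, size=n)                   # noise-free inputs
    y = sincsig(x) + output_std * rng.standard_normal(n)   # y_i ~ N(sincsig(x_i), sigma_y^2)
    input_std = rng.uniform(0.0, max_input_std, size=n)    # one noise level per input
    u = x + input_std * rng.standard_normal(n)             # u_i ~ N(x_i, sigma_xi^2)
    return u, input_std ** 2, y

rng = np.random.default_rng(0)
u, input_vars, y = make_dataset(n=100, output_std=0.01, max_input_std=2.5, rng=rng)
```

Only `u`, `input_vars` and `y` would be shown to the learner; the noise-free inputs `x` stay hidden inside `make_dataset`.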

The first experiment was designed to evaluate the learning performance on uncertain inputs only. Therefore, we conducted this one with a small output noise of standard deviation $\sigma_y = 0.01$ and with training sets of different sizes. The input noise standard deviations $\sigma_{x_i}$ were sampled uniformly in $[0, 2.5]$, which causes the dataset to contain some very good examples on average. The standard deviations have been chosen high enough so that the noise over the inputs can be explained by adding an independent output noise during the optimization procedure and to measure the impact of highly uncertain examples on the posterior Gaussian process.

All performance evaluations of the classic-GP and the uncertain-GP have been done by training both with the same random data sets. For comparison purposes, we also trained a standard multilayer perceptron (MLP) having two hidden layers of five neurons. The weights of the network were found with the Levenberg–Marquardt algorithm [14]. Fig. 3(a) shows the averaged mean squared error over 25 randomly chosen training sets for various sizes. Results show that when very few data are available, both processes tend to explain the data with higher noise variance $\sigma_\epsilon^2$ over the outputs. As expected, when the size of the data set increases, the classic-GP gives greater importance to the explanation that observed outputs are mostly noise. On the other hand, the uncertain-GP discriminates the most uncertain data and privileges the ones having less uncertainty in order to infer the function from the right data.

Fig. 2. Typical learning of the sincsig function with uncertain-GP (top) and classic-GP (bottom). Error bars represent two standard deviations. The crosses are the means of uncertain inputs and outputs.

Fig. 3. Mean squared error results on the sincsig problem. Squared red line, uncertain-GP; circled blue line, classic-GP; triangled green line, MLP. (a) Output uncertainty is constant with $\sigma_y = 0.1$ and (b) output uncertainty is random with $\sigma_{y_i} \sim U(0.2, 0.5)$. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) Both panels plot MSE against the number of training examples; legend entries: MLP, GP with std cov. function, GP using uncertainty.

Fig. 4. The cart–pole balancing problem.


In the second experiment, we assumed that the Gaussian processes now know the noise's variance on the observations to evaluate the performance on (completely) uncertain datasets. Therefore, the noise hyperparameter $\sigma_\epsilon^2$ is set to zero since the processes know exactly the noise matrix to add when computing the covariance matrix. For each output, the standard deviation $\sigma_{y_i}$ is then uniformly sampled in $[0, 0.5]$ and the complete training set is defined as $D = \{(u_i, \sigma_{x_i}^2), (z_i, \sigma_{y_i}^2)\}_{i=1}^{N}$. Fig. 3(b) shows the performance of the MLP, the classic-GP and the uncertain-GP on such training sets. Removing the independent noise process (7) from the prior has two effects: first, it prevents the classic-GP from explaining uncertain inputs by increasing the noise (hyperparameter) over the outputs, and second, it forces the uncertain-GP to explain the data with the available information on the input variance.

The third experiment is set to validate the improvements brought by the expected covariance method. We thus let the optimization procedure itself choose the right kernel to use according to the uncertainty over the inputs. Consequently, we constructed two combinations of covariance functions. The first one (classic-GP) is based on the standard squared exponential covariance function (6) added to the known non-stationary noise process representing the outputs' uncertainty and an independent noise process given by

$$C(x_i, x_j) = C_{se}(x_i, x_j) + \delta_{ij} \sigma_{y_i}^2 + \delta_{ij} \sigma_\epsilon^2 \qquad (19)$$

where the first two terms are expected to explain the data and the last one to compensate for the error due to the inputs' noise.

The second one (uncertain-GP) is based on the expected squared exponential covariance function (17) added to the known non-stationary noise process representing the outputs' uncertainty and an independent noise process which is given by

$$C((u_i, \sigma_{x_i}^2), (u_j, \sigma_{x_j}^2)) = C_{Ese}((u_i, \sigma_{x_i}^2), (u_j, \sigma_{x_j}^2)) + C_{se}(u_i, u_j) + \delta_{ij} \sigma_{y_i}^2 + \delta_{ij} \sigma_\epsilon^2 \qquad (20)$$

where we expect the optimization process to balance the effect of each covariance function.

The experiments were conducted on various levels of noise and training set sizes. For these experiments, the uncertainty over the data is upper bounded by a standard deviation $v$. As in the previous experiments, the noise levels are sampled independently from uniform distributions and the noise is then applied to the data. In this case, $\sigma_{x_i} \sim U(0, v)$ and $\sigma_{y_i} \sim U(0, v)$. The performance of the classic-GP and the uncertain-GP has been measured with the mean squared error and the log-likelihood. The results are averaged over 10 trials for each value of $v$ and each data set size. The likelihood is approximated with 100 equidistant noise-free points from the true function. The averaged mean squared error and averaged log-likelihood results specific to this third experiment are shown in Figs. 5 and 6, respectively.³
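The protocol described above can be sketched as follows. This is only an illustration under stated assumptions: `np.sinc` stands in for the target function, the input domain $[-6, 6]$ is a guess, the helper names are hypothetical, and the log-likelihood is computed here as a sum of pointwise Gaussian log-densities, which may differ from the exact quantity used in the experiments.

```python
import numpy as np

def make_uncertain_dataset(f, n, v, lo=-6.0, hi=6.0, rng=None):
    """Sample n examples whose input/output noise std is drawn from U(0, v)."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(lo, hi, n)                 # hidden true inputs
    sx = rng.uniform(0.0, v, n)                # sigma_xi ~ U(0, v)
    sy = rng.uniform(0.0, v, n)                # sigma_yi ~ U(0, v)
    u = x + sx * rng.standard_normal(n)        # observed (noisy) inputs
    z = f(x) + sy * rng.standard_normal(n)     # observed (noisy) outputs
    return u, sx**2, z, sy**2

def mse_and_loglik(pred_mean, pred_var, f_true):
    """MSE and pointwise Gaussian log-likelihood on noise-free test points."""
    mse = np.mean((pred_mean - f_true) ** 2)
    ll = np.sum(-0.5 * np.log(2.0 * np.pi * pred_var)
                - 0.5 * (f_true - pred_mean) ** 2 / pred_var)
    return mse, ll

# 100 equidistant noise-free test points from the (stand-in) true function
grid = np.linspace(-6.0, 6.0, 100)
f_grid = np.sinc(grid)
u, sx2, z, sy2 = make_uncertain_dataset(np.sinc, n=50, v=0.4, rng=0)
```

Averaging `mse_and_loglik` over several independently sampled data sets for each pair of $v$ and $N$ would then give curves of the kind reported in the figures.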

As stated in Section 3.3, it is possible to learn the hyperparameters given a training set. Since conjugate gradient methods performed poorly for the optimization of the log-likelihood in the uncertain-GP cases, we preferred stochastic optimization methods for this task. For these experiments, we did not use any prior on the hyperparameters to avoid helping the learning procedure by introducing additional information on the function.

[Figure 5, panels (a) to (d): MSE versus number of training examples (0 to 250) for "GP with std cov. function" and "GP using uncertainty".]

Fig. 5. Mean squared error results on sincsig. The input and output uncertainties are random with $\sigma_{x_i} \sim U(0, v)$ and $\sigma_{y_i} \sim U(0, v)$. The variable $v$ is the upper bound on uncertainty. Squared red line, uncertain-GP; circled blue line, classic-GP. (a) With upper bound $v = 0.1$; (b) with upper bound $v = 0.4$; (c) with upper bound $v = 1$; (d) with upper bound $v = 3$. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

³ Due to limited computational precision, likelihoods reached infinity, making Fig. 6(a) incomplete.

4.2. The cart–pole problem

We now consider the harder problem of learning the cart–pole dynamics, which is also known as the inverted pendulum. Fig. 4 gives a picture of the system from which we try to learn the dynamics. The state is defined by the position ($\varphi$) of the cart, its velocity ($\dot{\varphi}$), the pole's angle ($\alpha$) and its angular velocity ($\dot{\alpha}$). There is also a control input which is used to apply lateral forces on the cart. Following the equations in Florian [15] to govern the dynamics, we used Euler's method to update the system's state:

$$\ddot{\alpha} = \frac{g\sin\alpha + \cos\alpha\left(\dfrac{-F - m_p l \dot{\alpha}^2 \sin\alpha}{m_c + m_p}\right)}{l\left(\dfrac{4}{3} - \dfrac{m_p \cos^2\alpha}{m_c + m_p}\right)}$$

$$\ddot{\varphi} = \frac{F + m_p l\left(\dot{\alpha}^2 \sin\alpha - \ddot{\alpha}\cos\alpha\right)}{m_c + m_p}$$

where $g$ is the gravitational acceleration, $F$ is the force associated to the action, $l$ is the half-length of the pole, $m_p$ is the mass of the pole and $m_c$ is the mass of the cart.
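A minimal sketch of one Euler update of these dynamics is given below. The time step and the physical constants are conventional benchmark values chosen here for illustration, not values reported in this paper, and the function name is hypothetical. The cart acceleration uses the standard frictionless form from Florian [15], with $\cos\alpha$ multiplying $\ddot{\alpha}$.

```python
import math

def cartpole_euler_step(state, force, dt=0.02, g=9.8,
                        m_c=1.0, m_p=0.1, l=0.5):
    """One explicit Euler step of the frictionless cart-pole dynamics.

    state = (position, velocity, angle, angular velocity);
    l is the half-length of the pole. All default constants are assumptions.
    """
    pos, vel, ang, ang_vel = state
    total = m_c + m_p
    sin_a, cos_a = math.sin(ang), math.cos(ang)

    # Angular acceleration of the pole
    ang_acc = (g * sin_a
               + cos_a * ((-force - m_p * l * ang_vel**2 * sin_a) / total)
               ) / (l * (4.0 / 3.0 - m_p * cos_a**2 / total))
    # Linear acceleration of the cart
    acc = (force + m_p * l * (ang_vel**2 * sin_a - ang_acc * cos_a)) / total

    # Explicit Euler: every state variable advances with the current derivatives
    return (pos + dt * vel,
            vel + dt * acc,
            ang + dt * ang_vel,
            ang_vel + dt * ang_acc)

next_state = cartpole_euler_step((0.0, 0.0, 0.05, 0.0), force=10.0)
```

Pairs of `state` plus `force` and the resulting `next_state` are the kind of input and output examples from which the transition function is learned, one Gaussian process per output dimension.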

For this problem, the training sets were sampled exactly as in

the sincsig case. State-action pairs were uniformly sampled on

their respective domains. The outputs were obtained with the

true dynamical system and then perturbed by noise with known

random variance. Since the output variances are also known, the

training set can be seen as Gaussian input distributions that map

to Gaussian output distributions. Therefore, one might use a

sequence of Gaussian belief states as its training set in order to

learn a partially observable dynamical system. Following this

idea, there is no reason for the output distributions to have a

significantly smaller variance than the input distribution.

In this experiment, the standard deviations of the input and output noise were sampled uniformly in [0, 2.5] for each dimension. Every output dimension was treated independently by placing a Gaussian process prior on each of them. Fig. 7 shows the mean squared error for each dimension, averaged over 25 randomly chosen training sets, for different values of N.

5. Discussion

Learning from uncertain data is known to be hard. Under the Gaussian process assumption, a data set having a prior distribution has uncertain covariances and consequently a distribution over its covariance matrix. Considering the whole distribution on covariance matrices to compute a posterior Gaussian process is not feasible in general, even for the squared exponential covariance function with Gaussian input uncertainty. However, this particular case allows computing the expectation of the covariance matrix distribution, which makes the inference procedure no harder than with noise-free data. The experiments were conducted to determine whether using the expected covariance leads to a performance improvement when dealing with input uncertainties.

[Figure 6, panels (a) to (d): log-likelihood versus number of training examples (0 to 250) for "GP with std cov. function" and "GP using uncertainty".]

Fig. 6. Log-likelihood results on sincsig. The input and output uncertainties are random with $\sigma_{x_i} \sim U(0, v)$ and $\sigma_{y_i} \sim U(0, v)$. The variable $v$ is the upper bound on uncertainty. Squared red line, uncertain-GP; circled blue line, classic-GP. (a) With upper bound $v = 0.1$; (b) with upper bound $v = 0.4$; (c) with upper bound $v = 1$; (d) with upper bound $v = 3$. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

The first results on the synthetic problem are presented in Fig. 3(a) and (b). They show that using the expected covariance (uncertain-GP) leads to a lower mean squared error than using the squared exponential with the input means (classic-GP). Fig. 2 shows the typical behaviour of each method. We observed that the classic-GP method often learns something closer to a convolution of the sincsig function than to the function itself, which is the expected behaviour. On the other hand, the uncertain-GP method generally leads to a better estimate of the function. Besides providing smaller mean squared errors, it also has the interesting property that the sincsig function lies completely inside its error bounds, meaning that most estimates are consistent with the true function, which increases the likelihood.

These experiments were done on data sets containing large input uncertainty, with standard deviations of up to 2.5. Since there are very few almost-certain training examples to guide the inference, the learning task is very hard. For such fuzzy training sets, using the expected covariance is a clear advantage over its standard counterpart.

A second set of experiments has been conducted where this time we measure the likelihood of the estimated function. To reinforce the relevance of using the expectation of the covariance matrix, the uncertain-GP method is now allowed to balance between the standard squared exponential covariance function and its expectation. The balancing procedure is part of the log marginal likelihood maximization. The results are shown in Figs. 5 and 6. Despite having a more expressive covariance function, the uncertain-GP method shows little improvement. We observed that the mean squared error is often caused by learning a biased version of the function and that the log-likelihood decreases as the bias increases. Fig. 8 shows the typical biased posterior learned by the uncertain-GP for the sinc function and the convolution learned by the classic-GP. Although the performance measures show similar curves for both methods, one can arguably prefer a biased function having the shape of the true function over an estimate which captures less structure in the data.

In the context of learning the model of a partially observable Markov decision process [16], one first has to identify the dynamics, or state transition function, to be able to infer the state of the system. A reinforcement learning algorithm uses the inferred belief state to learn a value function by mapping the uncertain state to its expected reward. Learning the optimal policy can still be achieved if the learned dynamics are biased, since the reinforcement learning algorithm learns over this bias. Moreover, a model of the dynamics that is far from the truth could still lead the algorithm to learn a good policy if the estimated model's shape is close enough to the dynamics' true behaviour. As a step towards this goal, experiments have been conducted on the identification of a dynamical system.

[Figure 7, panels (a) to (d): MSE versus number of training examples (50 to 250) for MLP, "GP with std cov. function" and "GP using uncertainty".]

Fig. 7. Mean squared error results on the cart–pole problem. Squared red line, uncertain-GP; circled blue line, classic-GP; triangled green line, MLP. (a) Position of the cart; (b) velocity of the cart; (c) pole angle; (d) angular velocity of the pole. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


The cart–pole problem is well known in the reinforcement learning community. The last experiments aim to learn its dynamics from uncertain state transition examples. The results are shown in Fig. 7. The performances of the uncertain-GP and classic-GP methods are compared under the mean squared error measure. Again, using the expected covariance provided better estimates of the true function. Much more data would have been required to correctly identify the dynamics; however, the complexity of computing a posterior Gaussian process is $O(n^3)$ and, with hyperparameter optimization, this becomes the complexity of each optimization step. Moreover, computing the covariance matrix with the covariance function (17) takes more time, since it now involves determinant computations.
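To make the cubic cost concrete, the sketch below computes a Gaussian process posterior and the log marginal likelihood from a precomputed covariance matrix using a Cholesky factorization, which is the $O(n^3)$ step. It follows the standard textbook procedure rather than any implementation detail given in this paper; the function name and argument layout are assumptions.

```python
import numpy as np

def gp_posterior(K, k_star, k_ss, z):
    """GP posterior from a precomputed training covariance.

    K      : (n, n) training covariance, noise terms already included
    k_star : (n, m) covariances between training and test inputs
    k_ss   : (m, m) prior covariance of the test inputs
    z      : (n,)   observed outputs
    """
    L = np.linalg.cholesky(K)                        # O(n^3) factorization
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, z))
    mean = k_star.T @ alpha                          # posterior mean
    v = np.linalg.solve(L, k_star)
    cov = k_ss - v.T @ v                             # posterior covariance
    log_ml = (-0.5 * z @ alpha
              - np.sum(np.log(np.diag(L)))           # equals -0.5 * log|K|
              - 0.5 * len(z) * np.log(2.0 * np.pi))
    return mean, cov, log_ml
```

Because `log_ml` is re-evaluated for every candidate set of hyperparameters, each optimization step pays for a new factorization of `K`.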

The optimization procedure is not only computationally cumbersome: the log marginal likelihood is also riddled with local maxima in which vanilla gradient methods quickly get trapped. The experiments were therefore carried out using stochastic optimization methods to move among the maxima and find the best one. An interesting avenue would be to look at natural gradient approaches, which are well suited for this task [17].
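The text does not specify which stochastic optimizer was used. Purely as an illustration of the idea, the following sketch implements a simple greedy random search that proposes Gaussian perturbations and keeps a candidate only when the objective improves; with a large enough `step` it can jump between basins instead of following a local gradient. The function name, the defaults and the choice of algorithm are all assumptions.

```python
import numpy as np

def stochastic_search(objective, theta0, n_iter=2000, step=0.5, rng=None):
    """Greedy random search that maximizes `objective` (illustrative only).

    In a GP setting, `objective` would be the log marginal likelihood as a
    function of the (log-)hyperparameters.
    """
    rng = np.random.default_rng(rng)
    best = np.asarray(theta0, dtype=float)
    best_val = objective(best)
    for _ in range(n_iter):
        # Propose a random perturbation of the current best point
        cand = best + step * rng.standard_normal(best.shape)
        val = objective(cand)
        if val > best_val:                # keep only improvements
            best, best_val = cand, val
    return best, best_val
```

Running the search from several starting points, or shrinking `step` over time, are common refinements that are not shown here.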

6. Future work

Many previously proposed approaches for learning from

uncertain data relied on the estimation of the hidden true inputs

and outputs. Albeit not reported in this work, we conducted some

experiments on computing a posterior Gaussian distribution over

the training set. Since computing such posterior is not analytically

tractable and might not be Gaussian, one has to look for

approximations. By using Jensen’s inequality, we obtained a lower

bound approximation of the log marginal likelihood of the data

and estimated the posterior distribution of the data with Gaussian

distributions. The resulting quantity to maximize is

$$\log p(y \mid X, \theta) - D_{KL}(q(X) \,\|\, p(X)) \qquad (21)$$

where $p(X)$ is the prior distribution over the training set and $q(X)$ the posterior.

Eq. (21) corresponds to the common log marginal likelihood (9) with a penalty given by the Kullback–Leibler divergence of the estimated posterior over inputs with respect to its prior. The covariance matrix is computed with the expected squared exponential covariance function (17) with respect to the learned posterior over inputs. We obtained no significant results with this method, mostly because the optimization task is very hard, each input mean and variance becoming a variable to optimize. Further work is currently ongoing to address this issue.
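For Gaussian $q(X)$ and $p(X)$ the penalty in Eq. (21) has a closed form. The sketch below evaluates it under the simplifying assumption that both distributions factorize into independent one-dimensional Gaussians, one per training input; the function names are hypothetical and the log marginal likelihood is taken as a precomputed number.

```python
import numpy as np

def kl_diag_gaussians(m_q, v_q, m_p, v_p):
    """KL(q || p) for independent 1-D Gaussians, summed over all inputs.

    m_q, v_q : means and variances of the learned posterior q(X)
    m_p, v_p : means and variances of the prior p(X)
    """
    return 0.5 * np.sum(np.log(v_p / v_q)
                        + (v_q + (m_q - m_p) ** 2) / v_p
                        - 1.0)

def penalized_objective(log_marginal_likelihood, m_q, v_q, m_p, v_p):
    """Quantity of Eq. (21): log marginal likelihood minus the KL penalty."""
    return log_marginal_likelihood - kl_diag_gaussians(m_q, v_q, m_p, v_p)
```

Every entry of `m_q` and `v_q` is a free variable in this objective, which illustrates why the number of variables grows with the size of the training set.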

7. Conclusion

To conclude, we investigated the use of the expected covariance matrix in Gaussian processes when learning from uncertain data. Results indicate that in many cases, depending on the amount of uncertainty, the proposed approach yields better inference. An interesting property is that even when the improvement on the mean squared error is negligible, the method will often improve the likelihood of the posterior over functions. We also noted that a bias may occur when using the expected covariance without denoising, but such biased models can still be useful for certain applications. Finally, the method performs best when used on significantly uncertain data sets. In other cases, the approach provides performance similar to standard Gaussian process regression.

Appendix A. Expected linear covariance function

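The expected linear covariance with respect to Gaussian input distributions $p(x_i) = \mathcal{N}_{x_i}(u_i, \Sigma_i)$ and $p(x_j) = \mathcal{N}_{x_j}(u_j, \Sigma_j)$ is given by:

$$\begin{aligned}
K_{ij} &= \iint C_{lin}(x_i, x_j)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, \mathcal{N}_{x_j}(u_j, \Sigma_j)\, dx_i\, dx_j = E_i[E_j[x_i^\top x_j + \sigma_b^2]] \\
&= E_i[x_i^\top]\, E_j[x_j] + \sigma_b^2 = u_i^\top u_j + \sigma_b^2
\end{aligned}$$

Computing the expected variance is done with:

$$\begin{aligned}
K_{ii} &= \int C_{lin}(x_i, x_i)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, dx_i = E_i[x_i^\top x_i + \sigma_b^2] \\
&= E_i[x_i^\top x_i] + \sigma_b^2 = \mathrm{Tr}(\Sigma_i) + u_i^\top u_i + \sigma_b^2
\end{aligned}$$

By observing that the only difference between variance and covariance is the presence of a trace in the variance case, we can express the expected linear covariance function as

$$C_{Elin}((u_i, \Sigma_i), (u_j, \Sigma_j)) = u_i^\top u_j + \sigma_b^2 + \delta_{ij}\,\mathrm{Tr}(\Sigma_i)$$

by the introduction of a Kronecker delta.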
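The closed form for the expected linear covariance can be checked numerically. The sketch below compares it with a Monte Carlo estimate of $E[x_i^\top x_j + \sigma_b^2]$; the function names, sample size and seed are arbitrary choices made for this illustration.

```python
import numpy as np

def expected_linear_cov(u_i, S_i, u_j, sb2, same_point=False):
    """u_i^T u_j + sigma_b^2 + delta_ij Tr(Sigma_i).

    When same_point is True, the caller passes u_j = u_i.
    """
    k = u_i @ u_j + sb2
    return k + np.trace(S_i) if same_point else k

def mc_linear_cov(u_i, S_i, u_j, S_j, sb2, same_point=False,
                  n=200_000, seed=0):
    """Monte Carlo estimate of E[x_i^T x_j + sigma_b^2]."""
    rng = np.random.default_rng(seed)
    x_i = rng.multivariate_normal(u_i, S_i, size=n)
    # For the variance case both arguments are the same random vector
    x_j = x_i if same_point else rng.multivariate_normal(u_j, S_j, size=n)
    return np.mean(np.sum(x_i * x_j, axis=1)) + sb2

u_i = np.array([1.0, -0.5]); S_i = np.array([[0.3, 0.1], [0.1, 0.2]])
u_j = np.array([0.2, 0.8]);  S_j = np.array([[0.5, 0.0], [0.0, 0.1]])
exact = expected_linear_cov(u_i, S_i, u_j, sb2=0.5)
approx = mc_linear_cov(u_i, S_i, u_j, S_j, sb2=0.5)
```

The two values should agree up to Monte Carlo error, and the same holds for the variance case when `same_point=True` and `u_j = u_i`.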

Appendix B. Expected quadratic covariance function

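The expected quadratic covariance with respect to Gaussian input distributions $p(x_i) = \mathcal{N}_{x_i}(u_i, \Sigma_i)$ and $p(x_j) = \mathcal{N}_{x_j}(u_j, \Sigma_j)$ is given by:

$$\begin{aligned}
K_{ij} &= \iint C_{qua}(x_i, x_j)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, \mathcal{N}_{x_j}(u_j, \Sigma_j)\, dx_i\, dx_j \\
&= E_i[E_j[(x_i^\top x_j + \sigma_b^2)^2]] = E_i[E_j[x_i^\top x_j x_i^\top x_j + 2\sigma_b^2 x_i^\top x_j + \sigma_b^4]] \\
&= E_i[E_j[x_i^\top x_j x_j^\top x_i]] + 2\sigma_b^2 E_i[E_j[x_i^\top x_j]] + \sigma_b^4 \\
&= E_i[x_i^\top E_j[x_j x_j^\top] x_i] + 2\sigma_b^2 u_i^\top u_j + \sigma_b^4 \\
&= E_i[x_i^\top (\Sigma_j + u_j u_j^\top) x_i] + 2\sigma_b^2 u_i^\top u_j + \sigma_b^4 \\
&= \mathrm{Tr}([\Sigma_j + u_j u_j^\top]\Sigma_i) + u_i^\top [\Sigma_j + u_j u_j^\top] u_i + 2\sigma_b^2 u_i^\top u_j + \sigma_b^4
\end{aligned}$$

where we use some properties of the trace to obtain a more intuitive form for the first two terms:

$$\begin{aligned}
&\mathrm{Tr}([\Sigma_j + u_j u_j^\top]\Sigma_i) + u_i^\top [\Sigma_j + u_j u_j^\top] u_i \\
&\quad= \mathrm{Tr}([\Sigma_j + u_j u_j^\top]\Sigma_i) + \mathrm{Tr}(u_i^\top [\Sigma_j + u_j u_j^\top] u_i) \\
&\quad= \mathrm{Tr}([\Sigma_j + u_j u_j^\top]\Sigma_i) + \mathrm{Tr}([\Sigma_j + u_j u_j^\top] u_i u_i^\top) \\
&\quad= \mathrm{Tr}([\Sigma_j + u_j u_j^\top]\Sigma_i + [\Sigma_j + u_j u_j^\top] u_i u_i^\top) \\
&\quad= \mathrm{Tr}([\Sigma_j + u_j u_j^\top][\Sigma_i + u_i u_i^\top])
\end{aligned}$$

By substituting back in the expected covariance equation, we obtain

$$K_{ij} = \mathrm{Tr}([\Sigma_j + u_j u_j^\top][\Sigma_i + u_i u_i^\top]) + 2\sigma_b^2 u_i^\top u_j + \sigma_b^4$$

Computing the expected variance is done with:

$$\begin{aligned}
K_{ii} &= \int C_{qua}(x_i, x_i)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, dx_i = E_i[x_i^\top x_i x_i^\top x_i + 2\sigma_b^2 x_i^\top x_i + \sigma_b^4] \\
&= E_i[x_i^\top x_i x_i^\top x_i] + 2\sigma_b^2 E_i[x_i^\top x_i] + \sigma_b^4 \\
&= E_i[x_i^\top x_i x_i^\top x_i] + 2\sigma_b^2\, \mathrm{Tr}(\Sigma_i + u_i u_i^\top) + \sigma_b^4 \\
&= 2\,\mathrm{Tr}(\Sigma_i^2) + 4 u_i^\top \Sigma_i u_i + \mathrm{Tr}([\Sigma_i + u_i u_i^\top])^2 + 2\sigma_b^2\, \mathrm{Tr}(\Sigma_i + u_i u_i^\top) + \sigma_b^4
\end{aligned}$$

By observing the differences between variance and covariance, we can express the expected quadratic covariance function as:

$$\begin{aligned}
C_{Equa}((u_i, \Sigma_i), (u_j, \Sigma_j)) ={}& \mathrm{Tr}([\Sigma_i + u_i u_i^\top][\Sigma_j + u_j u_j^\top])(1 + \delta_{ij}) + 2\sigma_b^2 u_i^\top u_j + \sigma_b^4 \\
&+ \delta_{ij}\left[\mathrm{Tr}(\Sigma_i + u_i u_i^\top)^2 + 2\sigma_b^2\, \mathrm{Tr}(\Sigma_i) - 2(u_i^\top u_i)^2\right]
\end{aligned}$$

by the introduction of a Kronecker delta.

[Figure 8, panels (a) and (b): posterior distributions over functions, plotted for inputs from −6 to 6.]

Fig. 8. Typical posterior distribution over functions when learning the sinc function with uncertain inputs and outputs. (a) Using the expected covariance matrix (uncertain-GP) and (b) input noise is seen as extra output noise (classic-GP).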
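The expected quadratic covariance of Appendix B, written with a Kronecker delta so that one expression covers both the covariance and the variance case, can be sketched and checked against sampling as follows. Names, sample size and seed are illustrative assumptions.

```python
import numpy as np

def expected_quadratic_cov(u_i, S_i, u_j, S_j, sb2, same_point=False):
    """Expected (x_i^T x_j + sigma_b^2)^2 under Gaussian inputs.

    When same_point is True, the caller passes u_j = u_i and S_j = S_i.
    """
    d = 1.0 if same_point else 0.0                 # Kronecker delta
    A_i = S_i + np.outer(u_i, u_i)                 # second moment of x_i
    A_j = S_j + np.outer(u_j, u_j)                 # second moment of x_j
    k = np.trace(A_i @ A_j) * (1.0 + d) + 2.0 * sb2 * (u_i @ u_j) + sb2**2
    k += d * (np.trace(A_i) ** 2 + 2.0 * sb2 * np.trace(S_i)
              - 2.0 * (u_i @ u_i) ** 2)
    return k

def mc_quadratic_cov(u_i, S_i, u_j, S_j, sb2, same_point=False,
                     n=400_000, seed=0):
    """Monte Carlo estimate of E[(x_i^T x_j + sigma_b^2)^2]."""
    rng = np.random.default_rng(seed)
    x_i = rng.multivariate_normal(u_i, S_i, size=n)
    x_j = x_i if same_point else rng.multivariate_normal(u_j, S_j, size=n)
    return np.mean((np.sum(x_i * x_j, axis=1) + sb2) ** 2)

u_i = np.array([1.0, -0.5]); S_i = np.array([[0.3, 0.1], [0.1, 0.2]])
u_j = np.array([0.2, 0.8]);  S_j = np.array([[0.5, 0.0], [0.0, 0.1]])
exact = expected_quadratic_cov(u_i, S_i, u_j, S_j, sb2=0.5)
approx = mc_quadratic_cov(u_i, S_i, u_j, S_j, sb2=0.5)
```

For `same_point=True` the delta terms reproduce the separately derived variance, $2\,\mathrm{Tr}(\Sigma_i^2) + 4 u_i^\top \Sigma_i u_i + \mathrm{Tr}(\Sigma_i + u_i u_i^\top)^2 + 2\sigma_b^2\,\mathrm{Tr}(\Sigma_i + u_i u_i^\top) + \sigma_b^4$.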

Appendix C. Expected squared exponential covariance

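The expected squared exponential covariance with respect to Gaussian input distributions $p(x_i) = \mathcal{N}_{x_i}(u_i, \Sigma_i)$ and $p(x_j) = \mathcal{N}_{x_j}(u_j, \Sigma_j)$ is given by:

$$K_{ij} = \iint C_{se}(x_i, x_j)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, \mathcal{N}_{x_j}(u_j, \Sigma_j)\, dx_i\, dx_j$$

Using the Gaussian form of the squared exponential covariance function, $C_{se}(x_i, x_j)$ can be rewritten as the normalized Gaussian distribution $c\,\mathcal{N}_{x_i}(x_j, W)$, where $c = \sigma_f^2 (2\pi)^{D/2} |W|^{1/2}$ is its normalization constant:

$$K_{ij} = c \iint \mathcal{N}_{x_i}(x_j, W)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, \mathcal{N}_{x_j}(u_j, \Sigma_j)\, dx_i\, dx_j$$

The product of Gaussians results in another Gaussian, which is no longer normalized:

$$K_{ij} = c \iint Z^{-1}\, \mathcal{N}_{x_i}(a, B)\, \mathcal{N}_{x_j}(u_j, \Sigma_j)\, dx_i\, dx_j$$

where the resulting Gaussian with mean vector $a$ and covariance matrix $B$ will be eliminated by integration. It turns out that the factor has a Gaussian form such that $Z^{-1} = \mathcal{N}_{x_j}(u_i, W + \Sigma_i)$. Therefore, by integrating over $x_i$, we obtain:

$$K_{ij} = c \int \mathcal{N}_{x_j}(u_i, W + \Sigma_i)\, \mathcal{N}_{x_j}(u_j, \Sigma_j)\, dx_j$$

For the second variable, we use the same reasoning. Thus, by applying the product of Gaussians again and this time integrating over $x_j$, we obtain:

$$\begin{aligned}
K_{ij} &= c\, \mathcal{N}_{u_i}(u_j, W + \Sigma_i + \Sigma_j) = \sigma_f^2 (2\pi)^{D/2} |W|^{1/2}\, \mathcal{N}_{u_i}(u_j, W + \Sigma_i + \Sigma_j) \\
&= \frac{\sigma_f^2 \exp\left(-\tfrac{1}{2}(u_i - u_j)^\top (W + \Sigma_i + \Sigma_j)^{-1} (u_i - u_j)\right)}{|I + W^{-1}(\Sigma_i + \Sigma_j)|^{1/2}}
\end{aligned}$$

The expected squared exponential variance is easier to compute. First, we have:

$$K_{ii} = \int C_{se}(x_i, x_i)\, \mathcal{N}_{x_i}(u_i, \Sigma_i)\, dx_i$$

By the definition of a stationary covariance function, $C(x, x)$ is constant, which leads directly to

$$K_{ii} = C_{se}(x_i, x_i)$$

By the definition of the squared exponential covariance function, $C_{se}(x_i, x_i) = \sigma_f^2$. Finally, combining the variance and covariance equations in a single function can be done by introducing an inverted Kronecker delta:

$$C_{Ese}((u_i, \Sigma_i), (u_j, \Sigma_j)) = \frac{\sigma_f^2 \exp\left(-\tfrac{1}{2}(u_i - u_j)^\top (W + \Sigma_i + \Sigma_j)^{-1} (u_i - u_j)\right)}{|I + W^{-1}(\Sigma_i + \Sigma_j)(1 - \delta_{ij})|^{1/2}}$$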
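As with the previous appendices, the expected squared exponential covariance can be verified by sampling. The sketch below implements the final expression for multivariate inputs and a Monte Carlo estimate of the double integral; the function names, sample size and seed are assumptions made for this illustration.

```python
import numpy as np

def expected_se_cov_nd(u_i, S_i, u_j, S_j, sf2, W, same_point=False):
    """Expected SE covariance between N(u_i, S_i) and N(u_j, S_j)."""
    if same_point:
        return sf2                      # stationary kernel: C(x, x) = sigma_f^2
    d = u_i - u_j
    quad = d @ np.linalg.solve(W + S_i + S_j, d)
    det = np.linalg.det(np.eye(len(d)) + np.linalg.solve(W, S_i + S_j))
    return sf2 * np.exp(-0.5 * quad) / np.sqrt(det)

def mc_se_cov(u_i, S_i, u_j, S_j, sf2, W, n=200_000, seed=0):
    """Monte Carlo estimate of E[C_se(x_i, x_j)] for independent inputs."""
    rng = np.random.default_rng(seed)
    x_i = rng.multivariate_normal(u_i, S_i, size=n)
    x_j = rng.multivariate_normal(u_j, S_j, size=n)
    d = x_i - x_j
    # Row-wise quadratic form d^T W^{-1} d
    quad = np.sum(d * np.linalg.solve(W, d.T).T, axis=1)
    return np.mean(sf2 * np.exp(-0.5 * quad))

u_i = np.array([1.0, -0.5]); S_i = np.array([[0.3, 0.1], [0.1, 0.2]])
u_j = np.array([0.2, 0.8]);  S_j = np.array([[0.5, 0.0], [0.0, 0.1]])
W = np.diag([1.5, 0.8])
exact = expected_se_cov_nd(u_i, S_i, u_j, S_j, sf2=2.0, W=W)
approx = mc_se_cov(u_i, S_i, u_j, S_j, sf2=2.0, W=W)
```

Setting both input covariances to zero recovers the standard squared exponential kernel evaluated at the means, which is a quick sanity check on the determinant term.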

References

[1] G.H. Golub, C.F. Van Loan, An analysis of the total least squares problem, SIAM Journal on Numerical Analysis 17 (1980) 883–893.

[2] R.J. Carroll, D. Ruppert, L.A. Stefanski, Measurement Error in Nonlinear Models, Chapman and Hall/CRC, 1995.

[3] I. Markovsky, S. Van Huffel, Overview of total least-squares methods, Signal Processing 87 (10) (2007) 2283–2302.

[4] V. Tresp, S. Ahmad, R. Neuneier, Training neural networks with deficient data, in: Advances in Neural Information Processing Systems (NIPS'94), Morgan Kaufmann, 1994, pp. 128–135.

[5] W.A. Wright, G. Ramage, D. Cornford, I.T. Nabney, Neural network modelling with input uncertainty: theory and application, Journal of VLSI Signal Processing Systems 26 (1/2) (2000) 169–188.

[6] J.-A. Ting, A. D'Souza, S. Schaal, Bayesian regression with input noise for high dimensional data, in: Proceedings of the International Conference on Machine Learning (ICML'06), ACM, 2006, pp. 937–944.

[7] J. Bi, T. Zhang, Support vector classification with input data uncertainty, in: Advances in Neural Information Processing Systems (NIPS'04), vol. 17, MIT Press, 2004, pp. 161–168.

[8] A. Girard, R. Murray-Smith, Learning a Gaussian process model with uncertain inputs, Technical Report TR-2003-144, University of Glasgow, Glasgow, UK, 2003.

[9] J. Quiñonero-Candela, S.T. Roweis, Data imputation and robust training with Gaussian processes, 2003.

[10] A. Girard, Approximate methods for propagation of uncertainty with Gaussian process models, Ph.D. Thesis, 2004.

[11] C.K.I. Williams, Computation with infinite neural networks, Neural Computation 10 (5) (1998) 1203–1216.

[12] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.

[13] D.J.C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.

[14] J. Moré, The Levenberg–Marquardt algorithm: implementation and theory, in: G.A. Watson (Ed.), Numerical Analysis, Lecture Notes in Mathematics, vol. 630, Springer, Berlin, Heidelberg, 1978, pp. 105–116 (Chapter 10).

[15] R. Florian, Correct equations for the dynamics of the cart–pole system, Center for Cognitive and Neural Studies (Coneural), Romania, 2007.

[16] P. Dallaire, C. Besse, S. Ross, B. Chaib-draa, Bayesian reinforcement learning in continuous POMDPs with Gaussian processes, in: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'09), IEEE Press, 2009, pp. 2604–2609.

[17] N. Le Roux, P.-A. Manzagol, Y. Bengio, Topmoumoute online natural gradient algorithms, in: Advances in Neural Information Processing Systems (NIPS'07), MIT Press, 2008, pp. 849–856.

Brahim Chaib-draa received a Diploma in Computer Engineering from the École Supérieure d'Électricité (SUPELEC) de Paris, Paris, France, in 1978 and a Ph.D. degree in Computer Science from the Université du Hainaut-Cambrésis, Valenciennes, France, in 1990. In 1990, he joined the Department of Computer Science and Software Engineering at Laval University, Quebec, QC, Canada, where he is a Professor and Group Leader of the Decision for Agents and Multi-Agent Systems (DAMAS) Group. His research interests include agent and multiagent technologies, natural language for interaction, formal systems for agents and multiagent systems, distributed practical reasoning, and real-time and distributed systems. He is the author of several technical publications in these areas. He is on the Editorial Boards of IEEE Transactions on SMC, Computational Intelligence and The International Journal of Grids and Multiagent Systems. Dr. Chaib-draa is a member of ACM and AAAI and senior member of the IEEE Computer Society.

Camille Besse was born in 1981. He received the B.Sc. degree in Intelligent Systems in 2002 and his M.Sc. degree in Artificial Intelligence in 2005 from Paul Sabatier University, Toulouse, France. He is currently a Ph.D. student under the supervision of Prof. Brahim Chaib-draa and member of the DAMAS laboratory research group. His current research interests include partially observable Markov decision processes for single and multiple agents, for planning and reinforcement learning.

Patrick Dallaire was born in 1982. He received the B.Sc. degree in Computer Science in 2008 from Laval University, Canada. He is currently an M.Sc. student under the supervision of Prof. Brahim Chaib-draa and member of the DAMAS laboratory research group. His current research interests include Gaussian processes, partially observable Markov decision processes, Bayesian learning and reinforcement learning.
