Efficient Multioutput Gaussian Processes through Variational Inducing Kernels
Mauricio A. Álvarez
School of Computer Science
University of Manchester
Manchester, UK, M13 9PL
alvarezm@cs.man.ac.uk

David Luengo
Depto. Teoría de Señal y Comunicaciones
Universidad Carlos III de Madrid
28911 Leganés, Spain
luengod@ieee.org

Michalis K. Titsias, Neil D. Lawrence
School of Computer Science
University of Manchester
Manchester, UK, M13 9PL
{mtitsias,neill}@cs.man.ac.uk
Abstract
Interest in multioutput kernel methods is increasing, whether under the guise of multitask learning, multisensor networks or structured output data. From the Gaussian process perspective a multioutput Mercer kernel is a covariance function over correlated output functions. One way of constructing such kernels is based on convolution processes (CP). A key problem for this approach is efficient inference. Álvarez and Lawrence recently presented a sparse approximation for CPs that enabled efficient inference. In this paper, we extend this work in two directions: we introduce the concept of variational inducing functions to handle potential non-smooth functions involved in the kernel CP construction, and we consider an alternative approach to approximate inference based on variational methods, extending the work by Titsias (2009) to the multiple output case. We demonstrate our approaches on prediction of school marks, compiler performance and financial time series.
1 Introduction

In this paper we are interested in developing priors over multiple functions in a Gaussian process (GP) framework. While such priors can be trivially specified by considering the functions to be independent, our focus is on priors which specify correlations between the functions. Most attempts to apply such priors (Teh et al., 2005; Rogers et al., 2008; Bonilla et al., 2008) have focused on what is known in the geostatistics community as the "linear model of coregionalization" (LMC) (Goovaerts, 1997). In these models the different outputs are assumed to be linear combinations of a set of one or more "latent functions". GP priors are placed, independently, over each of the latent functions, inducing a correlated covariance function over the D outputs {f_d(x)}_{d=1}^D.

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.
We wish to go beyond the LMC framework; in particular, our focus is on convolution processes (CPs). Using CPs for multioutput GPs was proposed by Higdon (2002) and introduced to the machine learning audience by Boyle and Frean (2005). Convolution processes allow the integration of prior information from physical models, such as ordinary differential equations, into the covariance function. Álvarez et al. (2009a), inspired by Lawrence et al. (2007), have demonstrated how first and second order differential equations, as well as partial differential equations, can be accommodated in a covariance function. They interpret the set of latent functions as a set of latent forces, and they term the resulting models "latent force models" (LFM). The covariance functions for these models are derived through convolution processes. In the CP framework, output functions are generated by convolving R independent latent processes {u_r}_{r=1}^R with smoothing kernel functions G_{d,r}(x), for each output d and latent force r,

    f_d(x) = \sum_{r=1}^{R} \int_{Z} G_{d,r}(x - z) u_r(z) dz.    (1)

The LMC can be seen as a particular case of the CP, in which the kernel functions G_{d,r}(x) correspond to a scaled Dirac delta function, G_{d,r}(x - z) = a_{d,r} \delta(x - z).

A practical problem associated with the CP framework is that in these models inference has computational complexity O(N^3 D^3) and storage requirements O(N^2 D^2). Recently, Álvarez and Lawrence (2009) introduced an efficient approximation for inference in this multioutput GP model. Their idea was to exploit a conditional independence assumption over the output functions {f_d(x)}_{d=1}^D: if the latent functions are fully observed then the output functions are conditionally independent of one another (as can be seen in (1)). Furthermore, if the latent processes are sufficiently smooth, the conditional independence assumption
will hold approximately even for a finite number of observations of the latent functions, {{u_r(z_k)}_{k=1}^K}_{r=1}^R, where the variables {z_k}_{k=1}^K are usually referred to as the inducing inputs. These assumptions led to approximations that were very similar in spirit to the PITC and FITC approximations of Snelson and Ghahramani (2006); Quiñonero-Candela and Rasmussen (2005).
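The CP construction in eq. (1) can be sketched numerically: discretizing the convolution integral on a dense grid shows how a single latent draw, smoothed by output-specific kernels, yields strongly correlated outputs. The sketch below is illustrative only; the Gaussian kernel shapes, grid and length-scale values are made-up choices, not ones taken from the paper.

```python
import numpy as np

# Hypothetical illustration of eq. (1): two outputs generated by convolving
# one shared latent process u(z) with output-specific Gaussian smoothing
# kernels G_d. All parameter values here are illustrative.
rng = np.random.default_rng(0)
z = np.linspace(0, 1, 500)          # dense grid approximating the integral
dz = z[1] - z[0]

# Draw one smooth latent function u from a GP with an RBF covariance.
ell = 0.05
Kuu = np.exp(-0.5 * (z[:, None] - z[None, :])**2 / ell**2)
u = rng.multivariate_normal(np.zeros_like(z), Kuu + 1e-8 * np.eye(z.size))

def smoothing_kernel(x, width):
    """Gaussian smoothing kernel G_d(x) with an output-specific width."""
    return np.exp(-0.5 * x**2 / width**2) / (np.sqrt(2 * np.pi) * width)

# f_d(x) = int G_d(x - z) u(z) dz, approximated by a Riemann sum on the grid.
x = np.linspace(0, 1, 100)
f1 = np.array([np.sum(smoothing_kernel(xi - z, 0.02) * u) * dz for xi in x])
f2 = np.array([np.sum(smoothing_kernel(xi - z, 0.05) * u) * dz for xi in x])

# Both outputs derive from the same latent draw, so they are correlated;
# the wider kernel in f2 produces a smoother version of the same signal.
print(np.corrcoef(f1, f2)[0, 1])
```

Because both outputs are deterministic functionals of the same latent draw, their sample correlation is high, which is exactly the coupling the CP covariance encodes analytically.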
In this paper we build on the work of Álvarez and Lawrence and extend it in two ways. First, we notice that if the locations of the inducing points are close relative to the length scale of the latent function, the PITC approximation will be accurate enough. However, if the length scale becomes small the approximation requires very many inducing points. In the worst case, the latent process could be white noise (as suggested by Higdon (2002) and implemented by Boyle and Frean (2005)). In this case the approximation will fail completely. To deal with such types of latent functions, we develop the concept of an inducing function, a generalization of the traditional concept of inducing variable commonly employed in several sparse GP methods. As we shall see, an inducing function is an artificial construction generated from a convolution operation between a smoothing kernel, or inducing kernel, and the latent functions u_r. The artificial nature of the inducing function stems from the fact that its construction is immersed in a variational-like inference procedure that does not modify the marginal likelihood of the true model. This leads us to the second extension of the paper: a problem with the FITC and PITC approximations can be their tendency to overfit when inducing inputs are optimized. A solution to this problem was given in recent work by Titsias (2009), who provided a sparse GP approximation that has an associated variational bound. In this paper we show how the ideas of Titsias can be extended to the multiple output case. Our variational approximation is developed through the inducing functions, and the quality of the approximation can be controlled through the inducing kernels and the number and location of the inducing inputs. Our approximation allows us to consider latent force models with a large number of states, D, and data points, N. The use of inducing kernels also allows us to extend the inducing variable approximation of the latent force model framework to systems of stochastic differential equations (SDEs). We apply the approximation to different real world datasets, including a multivariate financial time series example.
A similar idea to the inducing function one introduced in this paper was simultaneously proposed by Lázaro-Gredilla and Figueiras-Vidal (2010), who introduced the concept of an inducing feature to improve performance over the pseudo-inputs approach of Snelson and Ghahramani (2006) in sparse GP models. Our use of inducing functions and inducing kernels is motivated by the necessity to deal with non-smooth latent functions in the CP model of multiple outputs.
2 Multioutput GPs (MOGPs)
Let y_d ∈ R^N, where d = 1,...,D, be the observed data associated with the output function y_d(x). For simplicity, we assume that all the observations associated with different outputs are evaluated at the same inputs X (although this assumption is easily relaxed). We will often use the stacked vector y = (y_1,...,y_D) to collectively denote the data of all the outputs. Each observed vector y_d is assumed to be obtained by adding independent Gaussian noise to a vector of function values f_d, so that the likelihood is p(y_d|f_d) = N(y_d|f_d, σ_d^2 I), where f_d is defined via (1). More precisely, the assumption in (1) is that a function value f_d(x) (the noise-free version of y_d(x)) is generated from a common pool of R independent latent functions {u_r(x)}_{r=1}^R, each having a covariance function (Mercer kernel) given by k_r(x, x'). Notice that the outputs share the same latent functions, but they also have their own set of parameters ({α_{dr}}_{r=1}^R, σ_d^2), where α_{dr} are the parameters of the smoothing kernel G_{d,r}(·). Because convolution is a linear operation, the covariance between any pair of function values f_d(x) and f_{d'}(x') is given by

    k_{f_d,f_{d'}}(x, x') = Cov[f_d(x), f_{d'}(x')] = \sum_{r=1}^{R} \int_{Z} G_{d,r}(x - z) \int_{Z} G_{d',r}(x' - z') k_r(z, z') dz' dz.

This covariance function is used to define a fully coupled GP prior p(f_1,...,f_D) over all the function values associated with the different outputs. The joint probability distribution of the multioutput GP model can be written as p({y_d, f_d}_{d=1}^D) = \prod_{d=1}^{D} p(y_d|f_d) p(f_1,...,f_D). The GP prior p(f_1,...,f_D) has a zero mean vector and an (ND) × (ND) covariance matrix K_{f,f}, where f = (f_1,...,f_D), which consists of N × N blocks of the form K_{f_d,f_{d'}}. Elements of each block are given by k_{f_d,f_{d'}}(x, x') for all possible values of x. Each such block is a cross-covariance (or covariance) matrix of pairs of outputs.

Prediction using the above GP model, as well as the maximization of the marginal likelihood p(y) = N(y|0, K_{f,f} + Σ), where Σ = diag(σ_1^2 I, ..., σ_D^2 I), requires O(N^3 D^3) time and O(N^2 D^2) storage, which rapidly becomes infeasible even when only a few hundred outputs and data points are considered. Efficient approximations are needed in order to make the above multioutput GP model more practical.
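For Gaussian choices of G_{d,r} and k_r, the double integral above collapses, via the standard convolution-of-Gaussians identity, to another Gaussian, so K_{f,f} can be assembled in closed form. The following sketch (1-D inputs, R = 1, made-up scales and widths) builds the (ND) × (ND) matrix whose O(N^3 D^3) inversion cost motivates the approximations that follow.

```python
import numpy as np

# Hypothetical sketch of the full multioutput covariance K_{f,f} for
# 1-D inputs, one latent function (R = 1), a Gaussian latent covariance
# k(z, z') = N(z - z' | 0, w) and Gaussian smoothing kernels
# G_d(x) = s_d N(x | 0, v_d). For these choices the double convolution
# integral has the closed form
#   k_{f_d,f_d'}(x, x') = s_d s_d' N(x - x' | 0, v_d + v_d' + w).
# All parameter values below are illustrative, not from the paper.

def gauss(tau, var):
    return np.exp(-0.5 * tau**2 / var) / np.sqrt(2 * np.pi * var)

def block(xa, xb, sa, va, sb, vb, w):
    """Cross-covariance block K_{f_a,f_b} between outputs a and b."""
    tau = xa[:, None] - xb[None, :]
    return sa * sb * gauss(tau, va + vb + w)

N, D, w = 30, 3, 0.1
X = np.linspace(0, 1, N)
s = np.array([1.0, 0.7, 1.3])       # kernel scales s_d (illustrative)
v = np.array([0.05, 0.1, 0.2])      # kernel widths v_d (illustrative)

# Assemble the (ND) x (ND) matrix from N x N blocks; exact inference with
# it costs O(N^3 D^3) time, which motivates the sparse approximations.
Kff = np.block([[block(X, X, s[a], v[a], s[b], v[b], w) for b in range(D)]
                for a in range(D)])
print(Kff.shape, np.linalg.eigvalsh(Kff).min())  # PSD up to numerical jitter
```

Because every block comes from the same latent process, the assembled matrix is a valid (positive semi-definite) covariance, not merely a block arrangement of unrelated kernels.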
3 PITC-like approximation for MOGPs
Before we propose our variational sparse inference method for multioutput GP regression in Section 4, we review the sparse method proposed by Álvarez and Lawrence (2009). This method is based on a likelihood approximation. More precisely, each output function y_d(x) is independent from the other output functions given the full length of each latent function u_r(x). This means that the likelihood of the data factorizes according to p(y|u) = \prod_{d=1}^{D} p(y_d|u) = \prod_{d=1}^{D} p(y_d|f_d), with u = {u_r}_{r=1}^R the set of latent functions. The sparse method in Álvarez and Lawrence (2009) makes use of this factorization by assuming that it remains valid even when we are only allowed to exploit the information provided by a finite set of function values, u_r, instead of the full-length function u_r(x) (which involves uncountably many points). Let u_r, for r = 1,...,R, be a K-dimensional vector of values from the function u_r(x) which are evaluated at the inputs Z = {z_k}_{k=1}^K. The vector u = (u_1,...,u_R) denotes all these variables. The sparse method approximates the exact likelihood function p(y|u) with the likelihood p(y|u) = \prod_{d=1}^{D} N(y_d | µ_{f_d|u}, Σ_{f_d|u} + σ_d^2 I), where µ_{f_d|u} = K_{f_d,u} K_{u,u}^{-1} u and Σ_{f_d|u} = K_{f_d,f_d} − K_{f_d,u} K_{u,u}^{-1} K_{u,f_d} are the mean and covariance matrices of the conditional GP priors p(f_d|u). The matrix K_{u,u} is a block-diagonal covariance matrix where the r-th block K_{u_r,u_r} is obtained by evaluating k_r(z, z') at the inducing inputs Z. Further, the matrix K_{f_d,u} is defined by the cross-covariance function Cov[f_d(x), u_r(z)] = \int_{Z} G_{d,r}(x − z') k_r(z', z) dz'. The variables u follow the GP prior p(u) = N(u|0, K_{u,u}) and can be integrated out to give the following approximation to the exact marginal likelihood:

    p(y|θ) = N(y | 0, D + K_{f,u} K_{u,u}^{-1} K_{u,f} + Σ).    (2)

Here, D is a block-diagonal matrix, where each block is given by K_{f_d,f_d} − K_{f_d,u} K_{u,u}^{-1} K_{u,f_d} for all d. This approximate marginal likelihood represents exactly each diagonal (output-specific) block K_{f_d,f_d}, while each off-diagonal (cross-output) block K_{f_d,f_{d'}} is approximated by the Nyström matrix K_{f_d,u} K_{u,u}^{-1} K_{u,f_{d'}}.

The above sparse method has a similar structure to the PITC approximation introduced for single-output regression (Quiñonero-Candela and Rasmussen, 2005). Because of this similarity, Álvarez and Lawrence (2009) call their multioutput sparse approximation PITC as well. Two of the properties of this PITC approximation (which may sometimes be seen as limitations) are:
1. It assumes that all latent functions u are smooth.
2. It is based on a modification of the initial full GP model. This implies that the inducing inputs Z are extra kernel hyperparameters in the modified GP model.
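Assembling the approximate covariance of eq. (2) is mechanical once the closed-form Gaussian covariances are available. The sketch below (1-D inputs, one latent function, illustrative parameters, not the paper's) keeps the exact diagonal blocks and the Nyström off-diagonal blocks, matching the structure described above.

```python
import numpy as np

# A sketch of the PITC-style approximate covariance of eq. (2),
# D + K_{f,u} K_{u,u}^{-1} K_{u,f} + Sigma, for a 1-D Gaussian setup
# with one latent function. For Gaussian kernels the needed covariances
# have closed forms (assumed here; all parameter values are made up):
#   k(z, z')            = N(z - z' | 0, w)
#   Cov[f_d(x), u(z)]   = s_d N(x - z | 0, v_d + w)
#   k_{f_d,f_d}(x, x')  = s_d^2 N(x - x' | 0, 2 v_d + w)

def gauss(tau, var):
    return np.exp(-0.5 * tau**2 / var) / np.sqrt(2 * np.pi * var)

N, D_out, K, w = 30, 2, 10, 0.1
X = np.linspace(0, 1, N)
Z = np.linspace(0, 1, K)                     # inducing inputs
s, v = [1.0, 0.8], [0.05, 0.15]
noise = 0.01

Kuu = gauss(Z[:, None] - Z[None, :], w) + 1e-8 * np.eye(K)
Kfu = np.vstack([s[d] * gauss(X[:, None] - Z[None, :], v[d] + w)
                 for d in range(D_out)])
Kff_blocks = [s[d]**2 * gauss(X[:, None] - X[None, :], 2 * v[d] + w)
              for d in range(D_out)]

Q = Kfu @ np.linalg.solve(Kuu, Kfu.T)        # Nystrom term K_fu Kuu^-1 K_uf
# D keeps each diagonal block exact: K_{f_d,f_d} minus the Nystrom block.
Dmat = np.zeros((N * D_out, N * D_out))
for d in range(D_out):
    sl = slice(d * N, (d + 1) * N)
    Dmat[sl, sl] = Kff_blocks[d] - Q[sl, sl]

K_pitc = Dmat + Q + noise * np.eye(N * D_out)
# Diagonal blocks of K_pitc equal the exact K_{f_d,f_d} + noise by design.
print(np.allclose(K_pitc[:N, :N], Kff_blocks[0] + noise * np.eye(N)))
```

The cancellation in the diagonal blocks makes property 1 concrete: the approximation only discards information in the cross-output blocks, which is harmless when u is smooth enough for the Nyström term to capture them.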
Because of point 1, the method is not applicable when the latent functions are white noise processes. An important class of problems where we have to deal with white noise processes arises in linear SDEs, where the above sparse method is currently not applicable. Because of point 2, the maximization of the marginal likelihood in eq. (2) with respect to (Z, θ), where θ are model hyperparameters, may be prone to overfitting, especially when the number of variables in Z is large. Moreover, fitting a modified sparse GP model implies that the full GP model is not approximated in a systematic and rigorous way, since there is no distance or divergence between the two models that is minimized.

In the next section, we address point 1 above by introducing the concept of variational inducing kernels, which allow us to efficiently sparsify multioutput GP models having white noise latent functions. Further, these inducing kernels are incorporated into the variational inference method of Titsias (2009) (thus addressing point 2), which treats the inducing inputs Z, as well as other quantities associated with the inducing kernels, as variational parameters. The whole variational approach provides us with a flexible, robust-to-overfitting approximation framework that overcomes the limitations of the PITC approximation.
4 Sparse variational approximation
In this section, we introduce the concept of variational inducing kernels (VIKs). VIKs give us a way to define more general inducing variables that have larger approximation capacity than the u inducing variables used earlier and, importantly, allow us to deal with white noise latent functions. To motivate the idea, we first explain why the u variables can work when the latent functions are smooth and fail when these functions become white noise.

In PITC, we assume each latent function u_r(x) is smooth and we sparsify the GP model by introducing inducing variables u_r, which are direct observations of the latent function u_r(x) at particular input points. Because of the latent function's smoothness, the u_r variables also carry information about other points in the function through the imposed prior over the latent function. So, having observed u_r, we can reduce the uncertainty of the whole function. With the vector of inducing variables u, if chosen to be sufficiently large relative to the length scales of the latent functions, we can efficiently represent the functions {u_r(x)}_{r=1}^R and subsequently the variables f, which are just convolved versions of the latent functions.¹ When the reconstruction of f from u is perfect, the conditional prior p(f|u) becomes a delta function and the sparse PITC approximation becomes exact. Figure 1(a) shows a cartoon description of a summarization of u_r(x) by u_r.
In contrast, when some of the latent functions are white noise processes the sparse approximation will fail. If u_r(z) is white noise,² it has covariance function δ(z − z'). Such processes naturally arise in the application of stochastic differential equations (see Section 6) and are the ultimate non-smooth processes, where two values u_r(z) and u_r(z') are uncorrelated when z ≠ z'. When we apply the sparse approximation, a vector of "white-noise" inducing variables u_r does not carry information about u_r(z) at any input z that differs from all inducing inputs Z. In other words, there is no additional information in the conditional prior p(u_r(z)|u_r) over the unconditional prior p(u_r(z)). Figure 1(b) shows a pictorial representation. The lack of structure makes it impossible to exploit the correlations in standard sparse methods like PITC.³

Figure 1: With a smooth latent function as in (a), we can use some inducing variables u_r (red dots) from the complete latent process u_r(x) (in black) to generate smoothed versions (for example the one in blue), with uncertainty described by p(u_r(x)|u_r). However, with a white noise latent function as in (b), choosing inducing variables u_r (red dots) from the latent process (in black) does not give us a clue about other points (for example the blue dots).

¹ This idea is like a "soft version" of the Nyquist-Shannon sampling theorem. If the latent functions were band-limited, we could compute exact results given a high enough number of inducing points. In general they won't be band-limited, but for smooth functions low frequency components will dominate over high frequencies, which will quickly fade away.
² Such a process can be thought of as the "time derivative" of the Wiener process.
Our solution to this problem is the following. We will define a more powerful form of inducing variable, one based not on the latent function at a point, but given by the convolution of the latent function with a smoothing kernel. More precisely, let us replace each inducing vector u_r with variables λ_r which are evaluated at the inputs Z and are defined according to

    λ_r(z) = \int T_r(z − v) u_r(v) dv,    (3)

where T_r(x) is a smoothing kernel (e.g. Gaussian) which we call the inducing kernel (IK). This kernel is not necessarily related to the model's smoothing kernels. These newly defined inducing variables can carry information about u_r(z) not only at a single input location but from the entire input space. We can even allow a separate IK for each inducing point; that is, if the set of inducing points is Z = {z_k}_{k=1}^K, then λ_r(z_k) = \int T_{r,k}(z_k − v) u_r(v) dv, with the advantage of associating to each inducing point z_k its own set of adaptive parameters in T_{r,k}. For the PITC approximation, this adds more hyperparameters to the likelihood, perhaps leading to overfitting. However, in the variational approximation we define all these new parameters as variational parameters, and therefore they do not cause the model to overfit.
If u_r(z) has a white noise⁴ GP prior, the covariance function for λ_r(x) is

    Cov[λ_r(x), λ_r(x')] = \int T_r(x − z) T_r(x' − z) dz    (4)

and the cross-covariance between f_d(x) and λ_r(x') is

    Cov[f_d(x), λ_r(x')] = \int G_{d,r}(x − z) T_r(x' − z) dz.    (5)

Notice that this cross-covariance function, unlike the case of u inducing variables, maintains a weighted integration over the whole input space. This implies that a single inducing variable λ_r(x) can properly propagate information from the full-length process u_r(x) into f.

³ Returning to our sampling theorem analogy, the white noise process has infinite bandwidth. It is therefore impossible to represent it by observations at a few fixed inducing points.
⁴ It is straightforward to generalize the method for rough latent functions that are not white noise, or to combine smooth latent functions with white noise.
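For a Gaussian inducing kernel T(x) = N(x | 0, t) and a white-noise latent process, the integrals in eqs. (4) and (5) are available in closed form via the product-of-Gaussians identity. A minimal sketch, with illustrative parameter values:

```python
import numpy as np

# Sketch of eqs. (4)-(5) for a white-noise latent process and a Gaussian
# inducing kernel T(x) = N(x | 0, t). The product-of-Gaussians identity
# gives both integrals in closed form (parameter values are illustrative):
#   Cov[l(x), l(x')]   = int T(x - z) T(x' - z) dz   = N(x - x' | 0, 2t)
#   Cov[f_d(x), l(x')] = int G_d(x - z) T(x' - z) dz = s_d N(x - x' | 0, v_d + t)

def gauss(tau, var):
    return np.exp(-0.5 * tau**2 / var) / np.sqrt(2 * np.pi * var)

t, s_d, v_d = 0.05, 1.0, 0.1
X = np.linspace(0, 1, 50)
Z = np.linspace(0, 1, 8)                       # inducing inputs

K_ll = gauss(Z[:, None] - Z[None, :], 2 * t)           # eq. (4)
K_fl = s_d * gauss(X[:, None] - Z[None, :], v_d + t)   # eq. (5)

# Every entry of K_fl is non-zero: each inducing variable l(z_k) integrates
# the latent process over the whole input space, so it carries information
# about f everywhere, unlike point evaluations of white noise.
print(K_ll.shape, K_fl.shape, bool((K_fl > 0).all()))
```

Contrast this with point evaluations of white noise, whose cross-covariance with f would concentrate all mass at the inducing locations; here the Gaussian smoothing spreads it across the input space, which is exactly what makes λ informative.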
It is possible to combine the IKs defined above with the PITC approximation of Álvarez and Lawrence (2009), but in this paper our focus will be on applying them within the variational framework of Titsias (2009). We therefore refer to the kernels as variational inducing kernels (VIKs).

Variational inference

We now extend the variational inference method of Titsias (2009) to deal with multiple outputs and incorporate them into the VIK framework.
We compactly write the joint probability model p({y_d, f_d}_{d=1}^D) as p(y, f) = p(y|f) p(f). The first step of the variational method is to augment this model with inducing variables. For our purpose, suitable inducing variables are defined through VIKs. More precisely, let λ = (λ_1,...,λ_R) be the whole vector of inducing variables, where each λ_r is a K-dimensional vector of values obtained according to eq. (3). λ_r's role is to carry information about the latent function u_r(z). Each λ_r is evaluated at the inputs Z and has its own VIK, T_r(x), that depends on parameters θ_{T_r}. The λ variables augment the GP model according to p(y, f, λ) = p(y|f) p(f|λ) p(λ). Here, p(λ) = N(λ|0, K_{λ,λ}) and K_{λ,λ} is a block-diagonal
matrix where each block K_{λ_r,λ_r} is obtained by evaluating the covariance function in eq. (4) at the inputs Z. Additionally, p(f|λ) = N(f | K_{f,λ} K_{λ,λ}^{-1} λ, K_{f,f} − K_{f,λ} K_{λ,λ}^{-1} K_{λ,f}), where the cross-covariance K_{f,λ} is computed through eq. (5). Because of the consistency condition \int p(f|λ) p(λ) dλ = p(f), performing exact inference in the above augmented model is equivalent to performing exact inference in the initial GP model. Crucially, this holds for any values of the augmentation parameters (Z, {θ_{T_r}}_{r=1}^R). This is the key property that allows us to turn these augmentation parameters into variational parameters by applying approximate sparse inference.
Our method now proceeds along the lines of Titsias (2009). We introduce the variational distribution q(f, λ) = p(f|λ) φ(λ), where p(f|λ) is the conditional GP prior defined earlier and φ(λ) is an arbitrary variational distribution. By minimizing the KL divergence between q(f, λ) and the true posterior p(f, λ|y), we can compute the following Jensen's lower bound on the true log marginal likelihood (a detailed derivation of the bound is available in Álvarez et al. (2009b)):

    F_V = log N(y | 0, K_{f,λ} K_{λ,λ}^{-1} K_{λ,f} + Σ) − (1/2) tr(Σ^{-1} \tilde{K}),

where Σ is the covariance matrix associated with the additive noise process and \tilde{K} = K_{f,f} − K_{f,λ} K_{λ,λ}^{-1} K_{λ,f}. Note that this bound consists of two parts. The first part is the log of a GP prior, with the only difference that now the covariance matrix has a particular low-rank form. This form allows the inversion of the covariance matrix to take place in O(NDK^2) time rather than O(N^3 D^3). The second part can be seen as a penalization term that regularizes the estimation of the parameters. Notice also that only the diagonal of the exact covariance matrix K_{f,f} needs to be computed. Overall, the computation of the bound can be done efficiently in O(NDK^2) time.
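The two terms of the bound can be sketched directly in code. The function below is a simplified illustration assuming isotropic noise Σ = σ²I and a naive dense solve (a practical implementation would exploit the low-rank structure via the matrix inversion lemma to reach the O(NDK²) cost stated above); the matrices fed to it here are random placeholders rather than real kernel evaluations.

```python
import numpy as np

# A sketch of the DTCVAR lower bound
#   F_V = log N(y | 0, K_fl Kll^-1 K_lf + Sigma) - (1/2) tr(Sigma^-1 Ktilde),
# with Ktilde = K_ff - K_fl Kll^-1 K_lf. Only diag(K_ff) is needed for the
# trace term. Isotropic noise is a simplifying assumption made here.

def dtcvar_bound(y, Kfl, Kll, kff_diag, sigma2):
    """Variational lower bound for noise Sigma = sigma2 * I (assumed)."""
    n, k = Kfl.shape
    L = np.linalg.cholesky(Kll + 1e-8 * np.eye(k))
    A = np.linalg.solve(L, Kfl.T)              # A = L^-1 K_lf
    Q = A.T @ A                                # Nystrom matrix, rank <= k
    C = Q + sigma2 * np.eye(n)                 # naive O(n^3) handling of C
    sign, logdet = np.linalg.slogdet(C)
    alpha = np.linalg.solve(C, y)
    log_lik = -0.5 * (n * np.log(2 * np.pi) + logdet + y @ alpha)
    trace_term = 0.5 * np.sum(kff_diag - np.diag(Q)) / sigma2
    return log_lik - trace_term                # trace term penalizes poor fit

# Random placeholder inputs, chosen so that diag(Ktilde) is non-negative.
rng = np.random.default_rng(1)
n, k = 40, 5
Kfl = rng.normal(size=(n, k)) * 0.1
Kll = np.eye(k)
kff_diag = np.full(n, 1.0) + (Kfl @ Kfl.T).diagonal()
y = rng.normal(size=n)
print(dtcvar_bound(y, Kfl, Kll, kff_diag, sigma2=0.5))
```

Note how the trace term grows whenever the inducing variables explain less of the prior variance (larger diag(K̃)), which is the mechanism that penalizes moving inducing inputs away from the data.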
The bound can be maximized with respect to all parameters of the covariance function, both model parameters and variational parameters. The variational parameters are the inducing inputs Z and the parameters θ_{T_r} of each VIK, which are rigorously selected so that the KL divergence is minimized. In fact each VIK is also a variational quantity, and one could try different forms of VIKs in order to choose the one that gives the best lower bound.
The form of the bound is very similar to the projected process approximation, also known as DTC (Csató and Opper, 2001; Seeger et al., 2003; Rasmussen and Williams, 2006). However, the bound has an additional trace term that penalizes the movement of inducing inputs away from the data. This term converts the DTC approximation into a lower bound and prevents overfitting. In what follows, we refer to this approximation as DTCVAR, where the VAR suffix refers to the variational framework.
5 Experiments
We present results of applying the proposed method to two real-world datasets, described below. We compare the results obtained using PITC, the intrinsic coregionalization model (ICM)⁵ employed in Bonilla et al. (2008), and the method using variational inducing kernels. For PITC we estimate the parameters through the maximization of the approximate marginal likelihood of equation (2) using the scaled conjugate gradient method. We use one latent function, and both the covariance function of the latent process, k_r(x, x'), and the kernel smoothing function, G_{d,r}(x), follow a Gaussian form, that is, k(x, x') = N(x − x'|0, C), where C is a diagonal matrix. For the DTCVAR approximations, we maximize the variational bound F_V. Optimization is also performed using scaled conjugate gradient. We use one white noise latent function and a corresponding inducing function. The inducing kernels and the model kernels follow the same Gaussian form. Using this form for the covariance or kernel, all convolution integrals are solved analytically.
5.1 Exam score prediction
In this experiment the goal is to predict the exam score obtained by a particular student belonging to a particular school. The data come from the Inner London Education Authority (ILEA).⁶ They consist of examination records from 139 secondary schools in the years 1985, 1986 and 1987, a random 50% sample with 15362 students. The input space consists of features related to each student and features related to each school. From the multiple-output point of view, each school represents one output and the exam score of each student a particular instantiation of that output.

We follow the same preprocessing steps employed in Bonilla et al. (2008). The only features used are the student-dependent ones (year in which each student took the exam, gender, VR band and ethnic group), which are categorical variables. Each of them is transformed to a binary representation. For example, the possible values that the variable year of the exam can take are 1985, 1986 or 1987, represented as 100, 010 or 001. The transformation is also applied to the variables gender (two binary variables), VR band (four binary variables) and ethnic group (eleven binary variables), ending up with an input space of dimension 20. The categorical nature of the data restricts the input space to 202 unique input feature vectors. However, two students represented by the same input vector x and belonging to the same school d can obtain different exam scores. To reduce this noise in the data, we follow Bonilla et al. (2008) in taking the mean of the observations that, within a school, share the same input vector, and use a simple heteroskedastic noise model in which the variance for each of these means is divided by the number of observations used to compute it. The performance measure employed is the percentage of unexplained variance, defined as the sum-squared error on the test set as a percentage of the total data variance.⁷ The performance measure is computed for ten repetitions with 75% of the data in the training set and 25% of the data in the test set.

Figure 2 shows results using PITC, DTCVAR with one smoothing kernel and DTCVAR with as many inducing kernels as inducing points (DTCVARS in the figure). For 50 inducing points all three alternatives lead to approximately the same results. PITC keeps a relatively constant performance for all numbers of inducing points, while the DTCVAR approximations increase their performance as the number of inducing points increases. This is consistent with the expected behaviour of the DTCVAR methods, since the trace term penalizes the model for a reduced number of inducing points. Notice that all the approximations outperform independent GPs and the best result of the ICM presented in Bonilla et al. (2008).

⁵ The ICM is a particular case of the LMC with one latent function (Goovaerts, 1997).
⁶ Data is available at http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-support/datasets.shtml
Figure 2: Exam score prediction results for the school dataset. Results include the mean of the percentage of unexplained variance over ten repetitions of the experiment, together with one standard deviation. Along the horizontal axis, SM X stands for the sparse method with X inducing points, DTCVAR refers to the DTC variational approximation with one smoothing kernel, and DTCVARS to the same approximation using as many inducing kernels as inducing points. Results using the ICM model and independent GPs (appearing as IND in the figure) have also been included.
5.2 Compiler prediction performance
In this dataset the outputs correspond to the speedup of 11 C programs after some transformation sequence has been applied to them. The speedup is defined as the execution time of the original program divided by the execution time of the transformed program. The input space consists of 13-dimensional binary feature vectors, where the presence of a one in these vectors indicates that the program has received that particular transformation. The dataset contains 88214 observations for each output and the same number of input vectors. All the outputs share the same input space. Due to technical requirements, it is important that the prediction of the speedup for a particular program is made using few observations in the training set. We compare our results to the ones presented in Bonilla et al. (2008) and use N = 16, 32, 64 and 128 for the training set. The remaining 88214 − N observations are used for testing, employing the mean absolute error as the performance measure. The experiment is repeated ten times and standard deviations are also reported. We only include results for the average performance over the 11 outputs.

7 In Bonilla et al. (2008), results are reported in terms of explained variance.
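With the speedup and error measure defined as above, both reduce to one-liners; a minimal sketch (the function names are ours, not from the paper):

```python
import numpy as np

def speedup(t_original, t_transformed):
    # Speedup = execution time of the original program
    # divided by that of the transformed program.
    return t_original / t_transformed

def mean_absolute_error(y_true, y_pred):
    # Performance measure used for the compiler experiment.
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

s = speedup(10.0, 8.0)  # the transformed program runs 1.25x faster
mae = mean_absolute_error([1.25, 1.1], [1.2, 1.0])
```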
Figure 3 shows the results of applying independent GPs (IND in the figure), the intrinsic coregionalization model (ICM in the figure), PITC, DTCVAR with one inducing kernel (DTCVAR in the figure) and DTCVAR with as many inducing kernels as inducing points (DTCVARS in the figure). Since the training sets are small enough, we also include results of applying the GP generated using the full covariance matrix of the convolution construction (FULL GP in the figure). We repeated the experiment for different values of K, but show results only for K = N/2. Results for ICM and IND were obtained from Bonilla et al. (2008). In general, the DTCVAR variants outperform the ICM method, and the independent GPs for N = 16, 32 and 64. In this case, using as many inducing kernels as inducing points improves the performance on average. All methods, including the independent GPs, are better than PITC. The size of the test set encourages the application of the sparse methods: for N = 128, making the prediction for the whole dataset using the full GP takes on average 22 minutes, while the prediction with DTCVAR takes 0.65 minutes. Using more inducing kernels improves the performance, but also makes the evaluation of the test set more expensive. For DTCVARS, the evaluation takes on average 6.8 minutes. Time results are averages over the ten repetitions.

[Figure 3: mean absolute error (roughly 0.02–0.12) as a function of the number of training points N = 16, 32, 64, 128 for IND, ICM, PITC, DTCVAR, DTCVARS and FULL GP.]
Figure 3: Mean absolute error and standard deviation over ten repetitions of the compiler experiment as a function of the number of training points. IND stands for independent GPs, ICM for the intrinsic coregionalization model, DTCVAR refers to the DTCVAR approximation using one inducing kernel, DTCVARS to the DTCVAR approximation using as many inducing kernels as inducing points, and FULL GP stands for the GP for the multiple outputs without any approximation.
6 Stochastic Latent Force Models
The starting point of stochastic differential equations is a stochastic version of the equation of motion, which is called Langevin's equation:
$$\frac{\mathrm{d}f(t)}{\mathrm{d}t} = -C f(t) + S u(t), \qquad (6)$$
where $f(t)$ is the velocity of the particle, $-C f(t)$ is a systematic friction term, $u(t)$ is a random fluctuating external force, i.e. white noise, and $S$ indicates the sensitivity of the output to the random fluctuations. In the mathematical probability literature, the above is written more rigorously as $\mathrm{d}f(t) = -C f(t)\,\mathrm{d}t + S\,\mathrm{d}W(t)$, where $W(t)$ is the Wiener process (standard Brownian motion). Since $u(t)$ is a GP and the equation is linear, $f(t)$ must also be a GP, which turns out to be the Ornstein–Uhlenbeck (OU) process.
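A path of the OU process defined by equation (6) can be drawn with a simple Euler–Maruyama discretization of $\mathrm{d}f = -Cf\,\mathrm{d}t + S\,\mathrm{d}W$; a minimal sketch with illustrative parameter values:

```python
import numpy as np

def simulate_ou(C=2.0, S=1.0, f0=0.0, T=5.0, n=5000, seed=0):
    """Euler-Maruyama discretization of df = -C f dt + S dW."""
    rng = np.random.default_rng(seed)
    dt = T / n
    f = np.empty(n + 1)
    f[0] = f0
    for k in range(n):
        dW = rng.normal(0.0, np.sqrt(dt))  # Wiener increment over dt
        f[k + 1] = f[k] - C * f[k] * dt + S * dW
    return f

path = simulate_ou()
# For large t the path fluctuates around zero with
# stationary variance S**2 / (2 * C).
```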
Here, we are interested in extending the Langevin equation to model multivariate time series. The model in (6) is extended by adding more output signals and more external forces. The forces can be either smooth (systematic or drift-type) forces or white noise forces. Thus,
$$\frac{\mathrm{d}f_d(t)}{\mathrm{d}t} = -D_d f_d(t) + \sum_{r=1}^{R} S_{d,r}\, u_r(t), \qquad (7)$$
where $f_d(t)$ is the $d$th output signal. Each $u_r(t)$ can be either a smooth latent force that is assigned a GP prior with covariance function $k_r(t,t')$ or a white noise force that has a GP prior with covariance function $\delta(t - t')$. That is, we have a composition of $R$ latent forces, where $R_s$ of them correspond to smooth latent forces and $R_o$ correspond to white noise processes. The intuition behind this combination of input forces is that the smooth part can be used to represent medium- and long-term trends that cause a departure from the mean of the output processes, whereas the stochastic part is related to short-term fluctuations around the mean. A model with $R_s = 1$ and $R_o = 0$ was proposed by Lawrence et al. (2007) to describe protein transcription regulation in a single input motif (SIM) gene network.
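Equation (7) can be simulated in the same way as the single-output case; the sketch below uses one smooth force (a sinusoid standing in for a draw from the smooth GP prior) and one shared white-noise force, i.e. $R_s = 1$ and $R_o = 1$, with illustrative coefficients:

```python
import numpy as np

def simulate_slfm(D_coef, S, T=5.0, n=5000, seed=1):
    """Euler simulation of df_d = (-D_d f_d + sum_r S_{d,r} u_r) dt.

    Column 0 of S weights the smooth force, column 1 the white-noise
    force; both forces are shared across all outputs.
    """
    rng = np.random.default_rng(seed)
    S = np.asarray(S)
    dt = T / n
    num_outputs = len(D_coef)
    f = np.zeros((num_outputs, n + 1))
    for k in range(n):
        t = k * dt
        u_smooth = np.sin(2 * np.pi * t / T)  # drift-type latent force
        dW = rng.normal(0.0, np.sqrt(dt))     # shared white-noise force
        for d in range(num_outputs):
            drift = -D_coef[d] * f[d, k] + S[d, 0] * u_smooth
            f[d, k + 1] = f[d, k] + drift * dt + S[d, 1] * dW
    return f

paths = simulate_slfm(D_coef=[1.0, 3.0], S=[[1.0, 0.5], [0.8, 0.5]])
```

Because both outputs are driven by the same pair of latent forces, their sample paths are correlated, which is the effect the multioutput model exploits.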
Solving the differential equation (7), we obtain
$$f_d(t) = e^{-D_d t} f_{d0} + \sum_{r=1}^{R} S_{d,r} \int_0^t e^{-D_d (t - z)}\, u_r(z)\, \mathrm{d}z,$$
where $f_{d0}$ arises from the initial condition. This model is now a special case of the multioutput regression model discussed in sections 1 and 2, where each output signal $y_d(t) = f_d(t) + \epsilon$ has a mean function $e^{-D_d t} f_{d0}$ and each model kernel $G_{d,r}(t - z)$ is equal to $S_{d,r}\, e^{-D_d (t - z)}$. The above model can be viewed as a stochastic latent force model (SLFM), following the work of Álvarez et al. (2009a).
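The solution above can be sanity-checked numerically for a deterministic force: for a single constant force $u(z) = 1$ the convolution integral has the closed form $(1 - e^{-D_d t})/D_d$, which simple trapezoidal quadrature should reproduce.

```python
import numpy as np

def output_via_quadrature(D_d, S_dr, f_d0, t, u, n=20000):
    # f_d(t) = e^{-D_d t} f_{d0} + S_{d,r} int_0^t e^{-D_d (t - z)} u(z) dz,
    # with the integral evaluated by the trapezoidal rule.
    z = np.linspace(0.0, t, n)
    g = np.exp(-D_d * (t - z)) * u(z)
    dz = z[1] - z[0]
    integral = dz * (g.sum() - 0.5 * (g[0] + g[-1]))
    return np.exp(-D_d * t) * f_d0 + S_dr * integral

# Constant force u(z) = 1: integral equals (1 - e^{-D_d t}) / D_d.
D_d, S_dr, f_d0, t = 2.0, 1.5, 0.3, 1.0
numeric = output_via_quadrature(D_d, S_dr, f_d0, t, u=lambda z: np.ones_like(z))
closed = np.exp(-D_d * t) * f_d0 + S_dr * (1.0 - np.exp(-D_d * t)) / D_d
# numeric and closed agree to quadrature accuracy
```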
Latent market forces
The application considered is the inference of missing data in a multivariate financial dataset: the foreign exchange rate w.r.t. the dollar of 10 of the top international currencies (Canadian Dollar [CAD], Euro [EUR], Japanese Yen [JPY], Great British Pound [GBP], Swiss Franc [CHF], Australian Dollar [AUD], Hong Kong Dollar [HKD], New Zealand Dollar [NZD], South Korean Won [KRW] and Mexican Peso [MXN]) and 3 precious metals (gold [XAU], silver [XAG] and platinum [XPT]).8 We considered all the data available for the calendar year of 2007 (251 working days). In this data there are several missing values: XAU, XAG and XPT have 9, 8 and 42 days of missing values, respectively. On top of this, we also introduced artificially long sequences of missing data. Our objective is to model the data and test the effectiveness of the model by imputing these missing points. We removed a test set from the data by extracting contiguous sections from 3 currencies associated with very different geographic locations: we took days 50–100 from CAD, days 100–150 from JPY and days 150–200 from AUD. The remainder of the data comprised the training set, which consisted of 3051 points, with the test data containing 153 points. For preprocessing we removed the mean from each output and scaled them so that they all had unit variance.
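The preprocessing described above (removing the mean of each output and scaling to unit variance) can be sketched as follows; the NaN-aware statistics are our addition, to cope with the missing values mentioned earlier:

```python
import numpy as np

def standardize_outputs(Y):
    """Remove the mean of each output (row) and scale to unit variance.

    Y: array of shape (num_outputs, num_days). Entries may be NaN for
    missing values, which are ignored when computing the statistics.
    """
    mean = np.nanmean(Y, axis=1, keepdims=True)
    std = np.nanstd(Y, axis=1, keepdims=True)
    return (Y - mean) / std, mean, std

Y = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
Z, mu, sigma = standardize_outputs(Y)
# Each row of Z now has zero mean and unit variance; mu and sigma
# allow predictions to be mapped back to the original scale.
```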
It seems reasonable to suggest that the fluctuations of the 13 correlated financial time series are driven by a smaller number of latent market forces. We therefore modelled the data with up to six latent forces, which could be white noise processes or smooth GPs. The GP priors for the smooth latent forces are assumed to have a Gaussian covariance function,
$$k_{u_r u_r}(t,t') = \frac{1}{\sqrt{2\pi \ell_r^2}} \exp\!\left(-\frac{(t - t')^2}{2 \ell_r^2}\right),$$
where the hyperparameter $\ell_r$ is known as the lengthscale.

We present an example with $R = 4$. For this value of $R$, we consider all the possible combinations of $R_o$ and $R_s$. In all cases, training was performed by maximizing the variational bound using the scaled conjugate gradient algorithm until convergence was achieved. The best performance, in terms of achieving the highest value for $F_V$, was obtained for $R_s = 1$ and $R_o = 3$. We compared against the LMC model for different numbers of latent functions in that framework. While our best model resulted in a standardized mean square error of 0.2795, the best LMC (with $R = 2$) resulted in 0.3927. We plotted predictions from the latent market force model to characterize its performance when filling in missing data. In figure 4 we show the output signals obtained using the model with the highest bound ($R_s = 1$ and $R_o = 3$) for CAD, JPY and AUD. Note that the model performs better at capturing the deep drop in AUD than it does for the fluctuations in CAD and JPY.

8 Data is available at http://fx.sauder.ubc.ca/data.html.
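The standardized mean square error quoted above is commonly defined as the mean square error divided by the variance of the held-out data, so that trivially predicting the test mean scores 1; a minimal sketch under that assumption (the paper does not restate its exact normalization):

```python
import numpy as np

def smse(y_true, y_pred):
    # Mean square error normalized by the variance of the test targets,
    # so that predicting the mean of the test data gives SMSE = 1.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

# Sanity check: the mean predictor scores exactly 1 by construction.
y = np.array([0.0, 1.0, 2.0, 3.0])
baseline = np.full_like(y, y.mean())
```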
[Figure 4: three panels over days 50–250 showing (a) CAD: real data and prediction; (b) JPY: real data and prediction; (c) AUD: real data and prediction.]
Figure 4: Predictions from the model with Rs = 1 and Ro = 3 are shown as solid lines for the mean and grey bars for error bars at 2 standard deviations. For CAD, JPY and AUD the data was artificially held out. The true values are shown as a dotted line. Crosses on the x-axes of all plots show the locations of the inducing inputs.
7 Conclusions
We have presented a variational approach to sparse approximations in convolution processes. Our main focus was to provide efficient mechanisms for learning in multiple-output Gaussian processes when the latent function is fluctuating rapidly. In order to do so, we have introduced the concept of an inducing function, which generalizes the idea of an inducing point, traditionally employed in sparse GP methods. The approach extends the variational approximation of Titsias (2009) to the multiple-output case. Using our approach we can perform efficient inference on latent force models which are based around SDEs but also contain a smooth driving force. Our approximation is flexible and has been shown to be applicable to a wide range of datasets, including high-dimensional ones.
Acknowledgements
We would like to thank Edwin Bonilla for his valuable feedback with respect to the datasets in section 5, and the authors of Bonilla et al. (2008), who kindly made the compiler dataset available. DL has been partly financed by the Spanish government through CICYT projects TEC2006-13514-C02-01 and TEC2009-14504-C02-01, and the CONSOLIDER-INGENIO 2010 Program (Project CSD2008-00010). MA and NL have been financed by a Google Research Award, "Mechanistically Inspired Convolution Processes for Learning", and MA, NL and MT have been financed by EPSRC Grant No EP/F005687/1, "Gaussian Processes for Systems Identification with Applications in Systems Biology".
References

Mauricio A. Álvarez and Neil D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In NIPS 21, pages 57–64. MIT Press, 2009.

Mauricio A. Álvarez, David Luengo, and Neil D. Lawrence. Latent force models. In JMLR: W&CP 5, pages 9–16, 2009.

Mauricio A. Álvarez, David Luengo, Michalis K. Titsias, and Neil D. Lawrence. Variational inducing kernels for sparse convolved multiple output Gaussian processes. Technical report, School of Computer Science, University of Manchester, 2009. Available at http://arxiv.org/abs/0912.3268.

Edwin V. Bonilla, Kian Ming Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In NIPS 20, pages 153–160. MIT Press, 2008.

Phillip Boyle and Marcus Frean. Dependent Gaussian processes. In NIPS 17, pages 217–224. MIT Press, 2005.

Lehel Csató and Manfred Opper. Sparse representation for Gaussian process models. In NIPS 13, pages 444–450. MIT Press, 2001.

Pierre Goovaerts. Geostatistics for Natural Resources Evaluation. Oxford University Press, 1997.

David M. Higdon. Space and space-time modelling using process convolutions. In Quantitative Methods for Current Environmental Issues, pages 37–56. Springer-Verlag, 2002.

Neil D. Lawrence, Guido Sanguinetti, and Magnus Rattray. Modelling transcriptional regulation using Gaussian processes. In NIPS 19, pages 785–792. MIT Press, 2007.

Miguel Lázaro-Gredilla and Aníbal Figueiras-Vidal. Inter-domain Gaussian processes for sparse inference using inducing features. In NIPS 22, pages 1087–1095. MIT Press, 2010.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939–1959, 2005.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

Alex Rogers, M. A. Osborne, S. D. Ramchurn, S. J. Roberts, and N. R. Jennings. Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes. In Proc. Int. Conf. on Information Processing in Sensor Networks (IPSN 2008), 2008.

Matthias Seeger, Christopher K. I. Williams, and Neil D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Proc. 9th Int. Workshop on Artificial Intelligence and Statistics, Key West, FL, 3–6 January 2003.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS 18. MIT Press, 2006.

Yee Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In AISTATS 10, pages 333–340, 2005.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In JMLR: W&CP 5, pages 567–574, 2009.