# Efficient Multioutput Gaussian Processes through Variational Inducing Kernels

**ABSTRACT** Interest in multioutput kernel methods is increasing, whether under the guise of multitask learning, multisensor networks or structured output data. From the Gaussian process perspective a multioutput Mercer kernel is a covariance function over correlated output functions. One way of constructing such kernels is based on convolution processes (CP). A key problem for this approach is efficient inference. Álvarez and Lawrence recently presented a sparse approximation for CPs that enabled efficient inference. In this paper, we extend this work in two directions: we introduce the concept of variational inducing functions to handle potential non-smooth functions involved in the kernel CP construction, and we consider an alternative approach to approximate inference based on variational methods, extending the work by Titsias (2009) to the multiple output case. We demonstrate our approaches on prediction of school marks, compiler performance and financial time series.


Mauricio A. Álvarez
School of Computer Science, University of Manchester, Manchester, UK, M13 9PL
alvarezm@cs.man.ac.uk

David Luengo
Depto. Teoría de Señal y Comunicaciones, Universidad Carlos III de Madrid, 28911 Leganés, Spain
luengod@ieee.org

Michalis K. Titsias, Neil D. Lawrence
School of Computer Science, University of Manchester, Manchester, UK, M13 9PL
{mtitsias,neill}@cs.man.ac.uk


## 1 Introduction

In this paper we are interested in developing priors over multiple functions in a Gaussian processes (GP) framework. While such priors can be trivially specified by considering the functions to be independent, our focus is on priors which specify correlations between the functions. Most attempts to apply such priors (Teh et al., 2005; Rogers et al., 2008; Bonilla et al., 2008) have focused on what is known in the geostatistics community as the "linear model of coregionalization" (LMC) (Goovaerts, 1997). In these models the different outputs are assumed to be linear combinations of a set of one or more "latent functions". GP priors are placed, independently, over each of the latent functions, inducing a correlated covariance function over the $D$ outputs $\{f_d(x)\}_{d=1}^{D}$.

*Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.*

We wish to go beyond the LMC framework; in particular, our focus is on convolution processes (CPs). Using CPs for multi-output GPs was proposed by Higdon (2002) and introduced to the machine learning audience by Boyle and Frean (2005). Convolution processes allow the integration of prior information from physical models, such as ordinary differential equations, into the covariance function. Álvarez et al. (2009a), inspired by Lawrence et al. (2007), have demonstrated how first and second order differential equations, as well as partial differential equations, can be accommodated in a covariance function. They interpret the set of latent functions as a set of latent forces, and they term the resulting models "latent force models" (LFM). The covariance functions for these models are derived through convolution processes. In the CP framework, output functions are generated by convolving $R$ independent latent processes $\{u_r\}_{r=1}^{R}$ with smoothing kernel functions $G_{d,r}(x)$, for each output $d$ and latent force $r$,

$$f_d(x) = \sum_{r=1}^{R} \int_{\mathcal{Z}} G_{d,r}(x - z)\, u_r(z)\, dz. \qquad (1)$$

The LMC can be seen as a particular case of the CP, in which the kernel functions $G_{d,r}(x)$ correspond to a scaled Dirac $\delta$-function, $G_{d,r}(x - z) = a_{d,r}\, \delta(x - z)$.

A practical problem associated with the CP framework is that in these models inference has computational complexity $O(N^3 D^3)$ and storage requirements $O(N^2 D^2)$. Recently, Álvarez and Lawrence (2009) introduced an efficient approximation for inference in this multi-output GP model. Their idea was to exploit a conditional independence assumption over the output functions $\{f_d(x)\}_{d=1}^{D}$: if the latent functions are fully observed then the output functions are conditionally independent of one another (as can be seen in (1)). Furthermore, if the latent processes are sufficiently smooth, the conditional independence assumption will hold approximately even for a finite number of observations of the latent functions, $\{\{u_r(z_k)\}_{k=1}^{K}\}_{r=1}^{R}$, where the variables $\{z_k\}_{k=1}^{K}$ are usually referred to as the inducing inputs. These assumptions led to approximations that were very similar in spirit to the PITC and FITC approximations of Snelson and Ghahramani (2006); Quiñonero Candela and Rasmussen (2005).

In this paper we build on the work of Álvarez and Lawrence and extend it in two ways. First, we notice that if the locations of the inducing points are close relative to the length scale of the latent function, the PITC approximation will be accurate enough. However, if the length scale becomes small the approximation requires very many inducing points. In the worst case, the latent process could be white noise (as suggested by Higdon (2002) and implemented by Boyle and Frean (2005)). In this case the approximation will fail completely. To deal with such latent functions, we develop the concept of an inducing function, a generalization of the traditional concept of inducing variable commonly employed in several sparse GP methods. As we shall see, an inducing function is an artificial construction generated from a convolution operation between a smoothing kernel, or inducing kernel, and the latent functions $u_r$. The artificial nature of the inducing function is based on the fact that its construction is immersed in a variational-like inference procedure that does not modify the marginal likelihood of the true model. This leads us to the second extension of the paper: a problem with the FITC and PITC approximations can be their tendency to overfit when inducing inputs are optimized. A solution to this problem was given in recent work by Titsias (2009), who provided a sparse GP approximation that has an associated variational bound. In this paper we show how the ideas of Titsias can be extended to the multiple output case. Our variational approximation is developed through the inducing functions, and the quality of the approximation can be controlled through the inducing kernels and the number and location of the inducing inputs. Our approximation allows us to consider latent force models with a large number of states, $D$, and data points, $N$. The use of inducing kernels also allows us to extend the inducing variable approximation of the latent force model framework to systems of stochastic differential equations (SDEs). We apply the approximation to different real-world datasets, including a multivariate financial time series example.

A similar idea to the inducing function introduced in this paper was simultaneously proposed by Lázaro-Gredilla and Figueiras-Vidal (2010), who introduced the concept of inducing feature to improve performance over the pseudo-inputs approach of Snelson and Ghahramani (2006) in sparse GP models. Our use of inducing functions and inducing kernels is motivated by the necessity of dealing with non-smooth latent functions in the CP model of multiple outputs.

## 2 Multioutput GPs (MOGPs)

Let $y_d \in \mathbb{R}^N$, where $d = 1,\dots,D$, be the observed data associated with the output function $y_d(x)$. For simplicity, we assume that all the observations associated with different outputs are evaluated at the same inputs $X$ (although this assumption is easily relaxed). We will often use the stacked vector $y = (y_1,\dots,y_D)$ to collectively denote the data of all the outputs. Each observed vector $y_d$ is assumed to be obtained by adding independent Gaussian noise to a vector of function values $f_d$, so that the likelihood is $p(y_d|f_d) = N(y_d|f_d, \sigma_d^2 I)$, where $f_d$ is defined via (1). More precisely, the assumption in (1) is that a function value $f_d(x)$ (the noise-free version of $y_d(x)$) is generated from a common pool of $R$ independent latent functions $\{u_r(x)\}_{r=1}^{R}$, each having a covariance function (Mercer kernel) given by $k_r(x,x')$. Notice that the outputs share the same latent functions, but they also have their own set of parameters $(\{\alpha_{dr}\}_{r=1}^{R}, \sigma_d^2)$, where $\alpha_{dr}$ are the parameters of the smoothing kernel $G_{d,r}(\cdot)$. Because convolution is a linear operation, the covariance between any pair of function values $f_d(x)$ and $f_{d'}(x')$ is given by

$$k_{f_d,f_{d'}}(x,x') = \mathrm{Cov}[f_d(x), f_{d'}(x')] = \sum_{r=1}^{R} \int_{\mathcal{Z}} G_{d,r}(x - z) \int_{\mathcal{Z}} G_{d',r}(x' - z')\, k_r(z,z')\, dz\, dz'.$$

This covariance function is used to define a fully-coupled GP prior $p(f_1,\dots,f_D)$ over all the function values associated with the different outputs. The joint probability distribution of the multioutput GP model can be written as $p(\{y_d, f_d\}_{d=1}^{D}) = \prod_{d=1}^{D} p(y_d|f_d)\, p(f_1,\dots,f_D)$. The GP prior $p(f_1,\dots,f_D)$ has a zero mean vector and an $(ND) \times (ND)$ covariance matrix $K_{f,f}$, where $f = (f_1,\dots,f_D)$, which consists of $N \times N$ blocks of the form $K_{f_d,f_{d'}}$. Elements of each block are given by $k_{f_d,f_{d'}}(x,x')$ for all possible values of $x$. Each such block is a cross-covariance (or covariance) matrix of pairs of outputs.

Prediction using the above GP model, as well as the maximization of the marginal likelihood $p(y) = N(y|0, K_{f,f} + \Sigma)$, where $\Sigma = \mathrm{diag}(\sigma_1^2 I, \dots, \sigma_D^2 I)$, requires $O(N^3 D^3)$ time and $O(N^2 D^2)$ storage, which rapidly becomes infeasible even when only a few hundred outputs and data points are considered. Efficient approximations are needed in order to make the above multioutput GP model more practical.
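Since the convolution integrals above are analytic for Gaussian choices (the setting used in the experiments of Section 5), the block structure of $K_{f,f}$ can be sketched numerically. The following is a minimal illustration, not the paper's code: it assumes 1-D inputs, a single white-noise latent process ($k_r(z,z') = \delta(z-z')$) and Gaussian smoothing kernels $G_d(\tau) = S_d\, N(\tau \mid 0, v_d)$, for which the double integral collapses to $S_d S_{d'} N(x - x' \mid 0, v_d + v_{d'})$; the names `full_cov`, `S` and `v` are ours.

```python
import numpy as np

def gauss(tau, var):
    # Normalized 1-D Gaussian density N(tau | 0, var).
    return np.exp(-0.5 * tau**2 / var) / np.sqrt(2.0 * np.pi * var)

def cross_cov_block(X, X2, S_d, S_dp, v_d, v_dp):
    # Cov[f_d(x), f_d'(x')] for ONE white-noise latent process and Gaussian
    # smoothing kernels G_d(tau) = S_d * N(tau | 0, v_d): the double integral
    # collapses to S_d * S_d' * N(x - x' | 0, v_d + v_d').
    diff = X[:, None] - X2[None, :]
    return S_d * S_dp * gauss(diff, v_d + v_dp)

def full_cov(X, S, v):
    # Assemble the (N*D) x (N*D) matrix K_{f,f} from its N x N blocks.
    D = len(S)
    blocks = [[cross_cov_block(X, X, S[d], S[dp], v[d], v[dp])
               for dp in range(D)] for d in range(D)]
    return np.block(blocks)
```

Because each block is an $L^2$ inner product of the smoothing kernels, the assembled matrix is symmetric positive semi-definite by construction.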

## 3 PITC-like approximation for MOGPs

Before we propose our variational sparse inference method for multioutput GP regression in Section 4, we review the sparse method proposed by Álvarez and Lawrence (2009). This method is based on a likelihood approximation. More precisely, each output function $y_d(x)$ is independent of the other output functions given the full length of each latent function $u_r(x)$. This means that the likelihood of the data factorizes according to $p(y|u) = \prod_{d=1}^{D} p(y_d|u) = \prod_{d=1}^{D} p(y_d|f_d)$, with $u = \{u_r\}_{r=1}^{R}$ the set of latent functions. The sparse method in Álvarez and Lawrence (2009) makes use of this factorization by assuming that it remains valid even when we are only allowed to exploit the information provided by a finite set of function values, $u_r$, instead of the full-length function $u_r(x)$ (which involves uncountably many points). Let $u_r$, for $r = 1,\dots,R$, be a $K$-dimensional vector of values from the function $u_r(x)$ which are evaluated at the inputs $Z = \{z_k\}_{k=1}^{K}$. The vector $u = (u_1,\dots,u_R)$ denotes all these variables. The sparse method approximates the exact likelihood function $p(y|u)$ with the likelihood $p(y|u) = \prod_{d=1}^{D} N(y_d|\mu_{f_d|u}, \Sigma_{f_d|u} + \sigma_d^2 I)$, where $\mu_{f_d|u} = K_{f_d,u} K_{u,u}^{-1} u$ and $\Sigma_{f_d|u} = K_{f_d,f_d} - K_{f_d,u} K_{u,u}^{-1} K_{u,f_d}$ are the mean and covariance matrices of the conditional GP priors $p(f_d|u)$. The matrix $K_{u,u}$ is a block-diagonal covariance matrix where the $r$th block $K_{u_r,u_r}$ is obtained by evaluating $k_r(z,z')$ at the inducing inputs $Z$. Further, the matrix $K_{f_d,u}$ is defined by the cross-covariance function $\mathrm{Cov}[f_d(x), u_r(z)] = \int_{\mathcal{Z}} G_{d,r}(x - z')\, k_r(z',z)\, dz'$. The variables $u$ follow the GP prior $p(u) = N(u|0, K_{u,u})$ and can be integrated out to give the following approximation to the exact marginal likelihood:

$$p(y|\theta) = N\!\left(y \,\middle|\, 0,\; \mathbf{D} + K_{f,u} K_{u,u}^{-1} K_{u,f} + \Sigma\right). \qquad (2)$$

Here, $\mathbf{D}$ is a block-diagonal matrix, where each block is given by $K_{f_d,f_d} - K_{f_d,u} K_{u,u}^{-1} K_{u,f_d}$ for all $d$. This approximate marginal likelihood represents exactly each diagonal (output-specific) block $K_{f_d,f_d}$, while each off-diagonal (cross-output) block $K_{f_d,f_{d'}}$ is approximated by the Nyström matrix $K_{f_d,u} K_{u,u}^{-1} K_{u,f_{d'}}$.

The above sparse method has a similar structure to the PITC approximation introduced for single-output regression (Quiñonero Candela and Rasmussen, 2005). Because of this similarity, Álvarez and Lawrence (2009) call their multioutput sparse approximation PITC as well. Two of the properties of this PITC approximation (which may sometimes be seen as limitations) are:

1. It assumes that all latent functions $u$ are smooth.
2. It is based on a modification of the initial full GP model. This implies that the inducing inputs $Z$ are extra kernel hyperparameters in the modified GP model.

Because of point 1, the method is not applicable when the latent functions are white noise processes. An important class of problems where we have to deal with white noise processes arises in linear SDEs, where the above sparse method is currently not applicable. Because of point 2, the maximization of the marginal likelihood in eq. (2) with respect to $(Z, \theta)$, where $\theta$ are the model hyperparameters, may be prone to overfitting, especially when the number of variables in $Z$ is large. Moreover, fitting a modified sparse GP model implies that the full GP model is not approximated in a systematic and rigorous way, since there is no distance or divergence between the two models that is minimized.

In the next section, we address point 1 above by introducing the concept of variational inducing kernels, which allow us to efficiently sparsify multioutput GP models having white noise latent functions. Further, these inducing kernels are incorporated into the variational inference method of Titsias (2009) (thus addressing point 2), which treats the inducing inputs $Z$, as well as other quantities associated with the inducing kernels, as variational parameters. The whole variational approach provides us with a very flexible approximation framework, robust to overfitting, that overcomes the limitations of the PITC approximation.
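The structure of the approximate covariance in eq. (2), exact within-output blocks plus a Nyström approximation across outputs, can be assembled directly. Below is a minimal sketch, assuming the covariance blocks have already been computed for a single latent function; the function and variable names are ours, not the paper's.

```python
import numpy as np

def pitc_covariance(Kff_blocks, Kfu_blocks, Kuu, noise_vars):
    # Covariance of the approximate marginal likelihood in eq. (2):
    #   D + K_{f,u} K_{u,u}^{-1} K_{u,f} + Sigma,
    # where the block-diagonal D restores each within-output block exactly,
    # so each diagonal block ends up as K_{f_d,f_d} + sigma_d^2 I, while each
    # cross-output block is the Nystrom approximation.
    D_out = len(Kff_blocks)
    N = Kff_blocks[0].shape[0]
    Kfu = np.vstack(Kfu_blocks)                    # (N*D) x K
    Q = Kfu @ np.linalg.solve(Kuu, Kfu.T)          # Nystrom term
    C = Q.copy()
    for d in range(D_out):
        sl = slice(d * N, (d + 1) * N)
        C[sl, sl] = Kff_blocks[d] + noise_vars[d] * np.eye(N)
    return C
```

The diagonal-block replacement is exactly the role of the matrix $\mathbf{D}$ in eq. (2): only the cross-output blocks carry the low-rank approximation.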

## 4 Sparse variational approximation

In this section, we introduce the concept of variational inducing kernels (VIKs). VIKs give us a way to define more general inducing variables that have larger approximation capacity than the $u$ inducing variables used earlier and, importantly, allow us to deal with white noise latent functions. To motivate the idea, we first explain why the $u$ variables can work when the latent functions are smooth and fail when these functions become white noise.

In PITC, we assume each latent function $u_r(x)$ is smooth, and we sparsify the GP model by introducing inducing variables $u_r$, which are direct observations of the latent function $u_r(x)$ at particular input points. Because of the latent function's smoothness, the $u_r$ variables also carry information about other points in the function through the imposed prior over the latent function. So, having observed $u_r$, we can reduce the uncertainty of the whole function. With the vector of inducing variables $u$, if chosen to be sufficiently large relative to the length scales of the latent functions, we can efficiently represent the functions $\{u_r(x)\}_{r=1}^{R}$ and subsequently the variables $f$, which are just convolved versions of the latent functions.¹ When the reconstruction of $f$ from $u$ is perfect, the conditional prior $p(f|u)$ becomes a delta function and the sparse PITC approximation becomes exact. Figure 1(a) shows a cartoon description of a summarization of $u_r(x)$ by $u_r$.

In contrast, when some of the latent functions are white noise processes, the sparse approximation will fail. If $u_r(z)$ is white noise,² it has covariance function $\delta(z - z')$. Such processes naturally arise in the application of stochastic differential equations (see Section 6) and are the ultimate non-smooth processes, where two values $u_r(z)$ and $u_r(z')$ are uncorrelated when $z \neq z'$.

Figure 1: (a) Latent function is smooth. (b) Latent function is noise. With a smooth latent function as in (a), we can use some inducing variables $u_r$ (red dots) from the complete latent process $u_r(x)$ (in black) to generate smoothed versions (for example the one in blue), with uncertainty described by $p(u_r|u_r)$. However, with a white noise latent function as in (b), choosing inducing variables $u_r$ (red dots) from the latent process (in black) does not give us a clue about other points (for example the blue dots).

When we apply the sparse approximation, a vector of "white-noise" inducing variables $u_r$ does not carry information about $u_r(z)$ at any input $z$ that differs from all inducing inputs $Z$. In other words, there is no additional information in the conditional prior $p(u_r(z)|u_r)$ over the unconditional prior $p(u_r(z))$. Figure 1(b) shows a pictorial representation. The lack of structure makes it impossible to exploit the correlations in standard sparse methods like PITC.³

Our solution to this problem is the following. We will define a more powerful form of inducing variable: one based not around the latent function at a point, but one given by the convolution of the latent function with a smoothing kernel. More precisely, let us replace each inducing vector $u_r$ with variables $\lambda_r$ which are evaluated at the inputs $Z$ and are defined according to

$$\lambda_r(z) = \int T_r(z - v)\, u_r(v)\, dv, \qquad (3)$$

where $T_r(x)$ is a smoothing kernel (e.g. Gaussian) which we call the inducing kernel (IK). This kernel is not necessarily related to the model's smoothing kernels. These newly defined inducing variables can carry information about $u_r(z)$ not only at a single input location but from the entire input space. We can even allow a separate IK for each inducing point; that is, if the set of inducing points is $Z = \{z_k\}_{k=1}^{K}$, then $\lambda_r(z_k) = \int T_{r,k}(z_k - v)\, u_r(v)\, dv$, with the advantage of associating to each inducing point $z_k$ its own set of adaptive parameters in $T_{r,k}$. For the PITC approximation, this adds more hyperparameters to the likelihood, perhaps leading to overfitting. However, in the variational approximation we define all these new parameters as variational parameters and therefore they do not cause the model to overfit.

If $u_r(z)$ has a white noise⁴ GP prior, the covariance function for $\lambda_r(x)$ is

$$\mathrm{Cov}[\lambda_r(x), \lambda_r(x')] = \int T_r(x - z)\, T_r(x' - z)\, dz \qquad (4)$$

and the cross-covariance between $f_d(x)$ and $\lambda_r(x')$ is

$$\mathrm{Cov}[f_d(x), \lambda_r(x')] = \int G_{d,r}(x - z)\, T_r(x' - z)\, dz. \qquad (5)$$

Notice that this cross-covariance function, unlike the case of $u$ inducing variables, maintains a weighted integration over the whole input space. This implies that a single inducing variable $\lambda_r(x)$ can properly propagate information from the full-length process $u_r(x)$ into $f$.

It is possible to combine the IKs defined above with the PITC approximation of Álvarez and Lawrence (2009), but in this paper our focus will be on applying them within the variational framework of Titsias (2009). We therefore refer to the kernels as variational inducing kernels (VIKs).

¹This idea is like a "soft version" of the Nyquist-Shannon sampling theorem. If the latent functions were bandlimited, we could compute exact results given a high enough number of inducing points. In general they won't be bandlimited, but for smooth functions low frequency components will dominate over high frequencies, which will quickly fade away.

²Such a process can be thought of as the "time derivative" of the Wiener process.

³Returning to our sampling theorem analogy, the white noise process has infinite bandwidth. It is therefore impossible to represent it by observations at a few fixed inducing points.

⁴It is straightforward to generalize the method to rough latent functions that are not white noise, or to combine smooth latent functions with white noise.
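For Gaussian choices of inducing and smoothing kernels, the integrals in eqs. (4) and (5) are again analytic, which is how they are handled in the experiments of Section 5. The sketch below assumes 1-D inputs and the hypothetical parameterizations $T_r(\tau) = s\, N(\tau \mid 0, t)$ and $G_{d,r}(\tau) = S_d\, N(\tau \mid 0, v_d)$; the names `s`, `t`, `S_d`, `v_d` are ours, not the paper's.

```python
import numpy as np

def gauss(tau, var):
    # Normalized 1-D Gaussian density N(tau | 0, var).
    return np.exp(-0.5 * tau**2 / var) / np.sqrt(2.0 * np.pi * var)

def cov_lambda(Z, s, t):
    # Eq. (4) for a white-noise u_r and a Gaussian inducing kernel
    # T_r(tau) = s * N(tau | 0, t): the integral of T_r(x-z) T_r(x'-z)
    # over z evaluates to s^2 * N(x - x' | 0, 2t).
    diff = Z[:, None] - Z[None, :]
    return s**2 * gauss(diff, 2.0 * t)

def cross_cov_f_lambda(X, Z, S_d, v_d, s, t):
    # Eq. (5) with a Gaussian model kernel G_{d,r}(tau) = S_d * N(tau | 0, v_d):
    # the integral of G_{d,r}(x-z) T_r(x'-z) over z evaluates to
    # S_d * s * N(x - x' | 0, v_d + t).
    diff = X[:, None] - Z[None, :]
    return S_d * s * gauss(diff, v_d + t)
```

Both closed forms follow from the standard identity that the convolution of two zero-mean Gaussians adds their variances.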

### Variational inference

We now extend the variational inference method of Titsias (2009) to deal with multiple outputs and incorporate them into the VIK framework.

We compactly write the joint probability model $p(\{y_d, f_d\}_{d=1}^{D})$ as $p(y,f) = p(y|f)\, p(f)$. The first step of the variational method is to augment this model with inducing variables. For our purpose, suitable inducing variables are defined through VIKs. More precisely, let $\lambda = (\lambda_1,\dots,\lambda_R)$ be the whole vector of inducing variables, where each $\lambda_r$ is a $K$-dimensional vector of values obtained according to eq. (3). $\lambda_r$'s role is to carry information about the latent function $u_r(z)$. Each $\lambda_r$ is evaluated at the inputs $Z$ and has its own VIK, $T_r(x)$, that depends on parameters $\theta_{T_r}$. The $\lambda$ variables augment the GP model according to $p(y,f,\lambda) = p(y|f)\, p(f|\lambda)\, p(\lambda)$. Here, $p(\lambda) = N(\lambda|0, K_{\lambda,\lambda})$ and $K_{\lambda,\lambda}$ is a block-diagonal matrix where each block $K_{\lambda_r,\lambda_r}$ is obtained by evaluating the covariance function in eq. (4) at the inputs $Z$. Additionally, $p(f|\lambda) = N(f\,|\,K_{f,\lambda} K_{\lambda,\lambda}^{-1} \lambda,\; K_{f,f} - K_{f,\lambda} K_{\lambda,\lambda}^{-1} K_{\lambda,f})$, where the cross-covariance $K_{f,\lambda}$ is computed through eq. (5). Because of the consistency condition $\int p(f|\lambda)\, p(\lambda)\, d\lambda = p(f)$, performing exact inference in the above augmented model is equivalent to performing exact inference in the initial GP model. Crucially, this holds for any values of the augmentation parameters $(Z, \{\theta_{T_r}\}_{r=1}^{R})$. This is the key property that allows us to turn these augmentation parameters into variational parameters by applying approximate sparse inference.

Our method now proceeds along the lines of Titsias (2009). We introduce the variational distribution $q(f,\lambda) = p(f|\lambda)\, \phi(\lambda)$, where $p(f|\lambda)$ is the conditional GP prior defined earlier and $\phi(\lambda)$ is an arbitrary variational distribution. By minimizing the KL divergence between $q(f,\lambda)$ and the true posterior $p(f,\lambda|y)$, we can compute the following Jensen's lower bound on the true log marginal likelihood (a detailed derivation of the bound is available in Álvarez et al. (2009b)):

$$F_V = \log N\!\left(y \,\middle|\, 0,\; K_{f,\lambda} K_{\lambda,\lambda}^{-1} K_{\lambda,f} + \Sigma\right) - \frac{1}{2}\, \mathrm{tr}\!\left(\Sigma^{-1} \widetilde{K}\right),$$

where $\Sigma$ is the covariance matrix associated with the additive noise process and $\widetilde{K} = K_{f,f} - K_{f,\lambda} K_{\lambda,\lambda}^{-1} K_{\lambda,f}$. Note that this bound consists of two parts. The first part is the log of a GP prior with the only difference that now the covariance matrix has a particular low-rank form. This form allows the inversion of the covariance matrix to take place in $O(NDK^2)$ time rather than $O(N^3 D^3)$. The second part can be seen as a penalization term that regularizes the estimation of the parameters. Notice also that only the diagonal of the exact covariance matrix $K_{f,f}$ needs to be computed. Overall, the computation of the bound can be done efficiently in $O(NDK^2)$ time.

The bound can be maximized with respect to all parameters of the covariance function: both model parameters and variational parameters. The variational parameters are the inducing inputs $Z$ and the parameters $\theta_{T_r}$ of each VIK, which are rigorously selected so that the KL divergence is minimized. In fact, each VIK is also a variational quantity, and one could try different forms of VIKs in order to choose the one that gives the best lower bound.

The form of the bound is very similar to the projected process approximation, also known as DTC (Csató and Opper, 2001; Seeger et al., 2003; Rasmussen and Williams, 2006). However, the bound has an additional trace term that penalizes the movement of inducing inputs away from the data. This term converts the DTC approximation into a lower bound and prevents overfitting. In what follows, we refer to this approximation as DTCVAR, where the VAR suffix refers to the variational framework.
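The two parts of the bound can be written down directly. The sketch below is a naive $O((ND)^3)$ implementation for clarity, not the $O(NDK^2)$ one the paper describes (which would instead exploit the low-rank form of the covariance via the matrix inversion lemma); all names are ours.

```python
import numpy as np

def dtcvar_bound(y, Kfl, Kll, Kff_diag, noise_diag):
    # F_V = log N(y | 0, Kfl Kll^{-1} Klf + Sigma)
    #       - 0.5 * tr(Sigma^{-1} (Kff - Kfl Kll^{-1} Klf)),
    # with M = N*D stacked observations, Kfl the M x K cross-covariance from
    # eq. (5), Kll the K x K inducing covariance from eq. (4), and Sigma
    # diagonal. Only the diagonal of the exact K_{f,f} is needed.
    M = y.shape[0]
    Q = Kfl @ np.linalg.solve(Kll, Kfl.T)          # low-rank Nystrom term
    C = Q + np.diag(noise_diag)
    _, logdet = np.linalg.slogdet(C)
    log_lik = -0.5 * (M * np.log(2.0 * np.pi) + logdet
                      + y @ np.linalg.solve(C, y))
    trace_term = 0.5 * np.sum((Kff_diag - np.diag(Q)) / noise_diag)
    return log_lik - trace_term
```

A useful sanity check: when the inducing variables reconstruct $f$ exactly (so $\widetilde{K} = 0$), the trace penalty vanishes and the bound equals the exact log marginal likelihood.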

## 5 Experiments

We present results of applying the proposed method to two real-world datasets, described shortly. We compare the results obtained using PITC, the intrinsic coregionalization model (ICM)⁵ employed in Bonilla et al. (2008), and the method using variational inducing kernels. For PITC we estimate the parameters through the maximization of the approximate marginal likelihood of equation (2) using the scaled conjugate gradient method. We use one latent function, and both the covariance function of the latent process, $k_r(x,x')$, and the kernel smoothing function, $G_{d,r}(x)$, follow a Gaussian form; that is, $k(x,x') = N(x - x'|0, C)$, where $C$ is a diagonal matrix. For the DTCVAR approximations, we maximize the variational bound $F_V$. Optimization is also performed using scaled conjugate gradient. We use one white noise latent function and a corresponding inducing function. The inducing kernels and the model kernels follow the same Gaussian form. Using this form for the covariance or kernel, all convolution integrals are solved analytically.

### 5.1 Exam score prediction

In this experiment the goal is to predict the exam score obtained by a particular student belonging to a particular school. The data come from the Inner London Education Authority (ILEA).⁶ It consists of examination records from 139 secondary schools in years 1985, 1986 and 1987. It is a random 50% sample with 15362 students. The input space consists of features related to each student and features related to each school. From the multiple output point of view, each school represents one output and the exam score of each student a particular instantiation of that output.

We follow the same preprocessing steps employed in Bonilla et al. (2008). The only features used are the student-dependent ones (year in which each student took the exam, gender, VR band and ethnic group), which are categorical variables. Each of them is transformed to a binary representation. For example, the possible values that the variable year of the exam can take are 1985, 1986 or 1987, and these are represented as 100, 010 or 001. The transformation is also applied to the variables gender (two binary variables), VR band (four binary variables) and ethnic group (eleven binary variables), ending up with an input space of dimension 20. The categorical nature of the data restricts the input space to 202 unique input feature vectors. However, two students represented by the same input vector x and both belonging to the same school d can obtain different exam scores. To reduce this noise in the

⁵The ICM is a particular case of the LMC with one latent function (Goovaerts, 1997).

⁶Data is available at http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-support/datasets.shtml

data, we follow Bonilla et al. (2008) in taking the mean of the observations that, within a school, share the same input vector, and use a simple heteroskedastic noise model in which the variance for each of these means is divided by the number of observations used to compute it. The performance measure employed is the percentage of unexplained variance, defined as the sum-squared error on the test set as a percentage of the total data variance.⁷ The performance measure is computed for ten repetitions with 75% of the data in the training set and 25% of the data in the test set.

Figure 2 shows results using PITC, DTCVAR with one smoothing kernel and DTCVAR with as many inducing kernels as inducing points (DTCVARS in the figure). For 50 inducing points all three alternatives lead to approximately the same results. PITC keeps a relatively constant performance for all numbers of inducing points, while the DTCVAR approximations improve as the number of inducing points increases. This is consistent with the expected behaviour of the DTCVAR methods, since the trace term penalizes the model for a reduced number of inducing points. Notice that all the approximations outperform independent GPs and the best result of the ICM presented in Bonilla et al. (2008).

[Figure 2: bar plot of the percentage of unexplained variance for the methods PITC, DTCVAR and DTCVARS at SM 5, SM 20 and SM 50, together with the ICM and IND baselines; the vertical axis runs from 40 to 75.]

Figure 2: Exam score prediction results for the school dataset. Results include the mean of the percentage of unexplained variance of ten repetitions of the experiment, together with one standard deviation. In the bottom, SM X stands for sparse method with X inducing points, DTCVAR refers to the DTC variational approximation with one smoothing kernel and DTCVARS to the same approximation using as many inducing kernels as inducing points. Results using the ICM model and independent GPs (appearing as IND in the figure) have also been included.

5.2 Compiler prediction performance

In this dataset the outputs correspond to the speed-up of 11 C programs after some transformation sequence has been applied to them. The speed-up is defined as the execution time of the original program divided by the execution time of the transformed program. The input space consists of 13-dimensional binary feature vectors, where the presence of a one in these vectors indicates that the program has received that particular transformation. The dataset contains 88214 observations for each output and the same number of input vectors. All the outputs share the same input space. Due to technical requirements, it is important that the prediction of the speed-up for a particular program is made using few observations in the training set. We compare our results to the ones presented in Bonilla et al. (2008) and use N = 16, 32, 64 and 128 for the training set. The remaining 88214 − N observations are used for testing, employing the mean absolute error as the performance measure. The experiment is repeated ten times and standard deviations are also reported. We only include results for the average performance over the 11 outputs.

7 In Bonilla et al. (2008), results are reported in terms of explained variance.
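The two quantities driving this experiment, the speed-up targets and the mean absolute error, are simple to compute. A small sketch with made-up timings (the runtimes and the random split below are illustrative assumptions, not data from the paper):

```python
import numpy as np

# Speed-up: runtime of the original program divided by the runtime of the
# transformed program (hypothetical timings for three transformations).
t_original = np.array([10.0, 8.0, 12.0])
t_transformed = np.array([5.0, 8.0, 15.0])
speedup = t_original / t_transformed

def mean_absolute_error(y_true, y_pred):
    """Performance measure used for the compiler experiment."""
    return float(np.mean(np.abs(y_true - y_pred)))

# A train/test split with a deliberately small training set, mirroring the
# setting N = 16, 32, 64, 128 out of 88214 observations.
rng = np.random.default_rng(0)
n_total, n_train = 88214, 128
perm = rng.permutation(n_total)
train_idx, test_idx = perm[:n_train], perm[n_train:]
```

A speed-up below 1 (as for the third program here) means the transformation actually slowed the program down.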

Figure 3 shows the results of applying independent GPs (IND in the figure), the intrinsic coregionalization model (ICM in the figure), PITC, DTCVAR with one inducing kernel (DTCVAR in the figure) and DTCVAR with as many inducing kernels as inducing points (DTCVARS in the figure). Since the training sets are small enough, we also include results of applying the GP generated using the full covariance matrix of the convolution construction (FULL GP in the figure). We repeated the experiment for different values of K, but show results only for K = N/2. Results for ICM and IND were obtained from Bonilla et al. (2008). In general, the DTCVAR variants outperform the ICM method and the independent GPs for N = 16, 32 and 64. In this case, using as many inducing kernels as inducing points improves the performance on average. All methods, including the independent GPs, are better than PITC. The size of the test set encourages the application of the sparse methods: for N = 128, making the prediction over the whole dataset using the full GP takes on average 22 minutes, while the prediction with DTCVAR takes 0.65 minutes. Using more inducing kernels improves the performance, but also makes the evaluation of the test set more expensive. For DTCVARS, the evaluation takes on average 6.8 minutes. Time results are averages over the ten repetitions.

[Figure 3: mean absolute error versus number of training points (16, 32, 64, 128) for IND, ICM, PITC, DTCVAR, DTCVARS and FULL GP; the vertical axis runs from 0.02 to 0.12.]

Figure 3: Mean absolute error and standard deviation over ten repetitions of the compiler experiment as a function of the number of training points. IND stands for independent GPs, ICM stands for the intrinsic coregionalization model, DTCVAR refers to the DTCVAR approximation using one inducing kernel, DTCVARS refers to the DTCVAR approximation using as many inducing kernels as inducing points and FULL GP stands for the GP for the multiple outputs without any approximation.

6 Stochastic Latent Force Models

The starting point of stochastic differential equations is a stochastic version of the equation of motion, which is called Langevin's equation:

df(t)/dt = −C f(t) + S u(t),    (6)

where f(t) is the velocity of the particle, −C f(t) is a systematic friction term, u(t) is a randomly fluctuating external force, i.e. white noise, and S indicates the sensitivity of the output to the random fluctuations. In the mathematical probability literature, the above is written more rigorously as df(t) = −C f(t) dt + S dW(t), where W(t) is the Wiener process (standard Brownian motion). Since u(t) is a GP and the equation is linear, f(t) must also be a GP, which turns out to be the Ornstein-Uhlenbeck (OU) process.
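A sample path of the OU process defined by equation (6) can be generated with a simple Euler-Maruyama discretization. This is an illustrative sketch: C, S, the step size and the horizon below are arbitrary choices, not values from the paper.

```python
import numpy as np

def simulate_ou(C=2.0, S=1.0, f0=0.0, dt=1e-3, n_steps=100_000, seed=0):
    """Euler-Maruyama simulation of df(t) = -C f(t) dt + S dW(t)."""
    rng = np.random.default_rng(seed)
    f = np.empty(n_steps + 1)
    f[0] = f0
    for k in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt))  # Wiener increment over dt
        f[k + 1] = f[k] - C * f[k] * dt + S * dW
    return f

f = simulate_ou()
# For an OU process the stationary variance is S^2 / (2C), i.e. 0.25 here;
# the sample variance of the late part of the path should be close to it.
stationary_var = f[f.size // 2:].var()
```

The drift term −C f(t) dt pulls the path back toward zero while the noise term S dW(t) keeps it fluctuating, which is exactly the friction-plus-fluctuation picture described above.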

Here, we are interested in extending the Langevin equation to model multivariate time series. The model in (6) is extended by adding more output signals and more external forces. The forces can be either smooth (systematic or drift-type) forces or white noise forces. Thus,

df_d(t)/dt = −D_d f_d(t) + Σ_{r=1}^{R} S_{d,r} u_r(t),    (7)

where f_d(t) is the dth output signal. Each u_r(t) can be either a smooth latent force that is assigned a GP prior with covariance function k_r(t, t') or a white noise force that has a GP prior with covariance function δ(t − t'). That is, we have a composition of R latent forces, where R_s of them correspond to smooth latent forces and R_o correspond to white noise processes. The intuition behind this combination of input forces is that the smooth part can be used to represent medium/long term trends that cause a departure from the mean of the output processes, whereas the stochastic part is related to short term fluctuations around the mean. A model with R_s = 1 and R_o = 0 was proposed by Lawrence et al. (2007) to describe protein transcription regulation in a single input motif (SIM) gene network.

Solving the differential equation (7), we obtain

f_d(t) = e^{−D_d t} f_{d0} + Σ_{r=1}^{R} S_{d,r} ∫_0^t e^{−D_d (t − z)} u_r(z) dz,

where f_{d0} arises from the initial condition. This model is now a special case of the multioutput regression model discussed in sections 1 and 2, where each output signal y_d(t) = f_d(t) + ε has a mean function e^{−D_d t} f_{d0} and each model kernel G_{d,r}(x) is equal to S_{d,r} e^{−D_d (t − z)}. The above model can be viewed as a stochastic latent force model (SLFM) following the work of Álvarez et al. (2009a).
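The solution above expresses each output as an exponentially weighted convolution of the latent forces. A numerical sketch for one output and a single smooth force, where D_d, S_{d,1}, f_{d0} and the choice u_1(t) = sin(t) are illustrative assumptions that allow checking the quadrature against the closed-form integral:

```python
import numpy as np

# f_d(t) = e^{-D_d t} f_{d0} + S_{d,1} * int_0^t e^{-D_d (t - z)} u_1(z) dz
D_d, S_d1, f_d0 = 1.5, 0.8, 0.2
t_grid = np.linspace(0.0, 5.0, 2001)
u1 = np.sin(t_grid)  # a smooth "latent force" for illustration

def output_signal(t_idx):
    """Trapezoidal approximation of the convolution integral up to t_grid[t_idx]."""
    t = t_grid[t_idx]
    z = t_grid[: t_idx + 1]
    integrand = np.exp(-D_d * (t - z)) * u1[: t_idx + 1]
    dz = z[1] - z[0]  # uniform grid spacing
    integral = dz * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return np.exp(-D_d * t) * f_d0 + S_d1 * integral

f_end = output_signal(t_grid.size - 1)
```

For u_1(z) = sin(z) the integral has the closed form (D sin t − cos t + e^{−Dt}) / (D² + 1), so the quadrature can be verified directly; in the actual model the forces are random GP draws rather than a fixed sinusoid.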

Latent market forces

The application considered is the inference of missing data in a multivariate financial data set: the foreign exchange rate w.r.t. the dollar of 10 of the top international currencies (Canadian Dollar [CAD], Euro [EUR], Japanese Yen [JPY], Great British Pound [GBP], Swiss Franc [CHF], Australian Dollar [AUD], Hong Kong Dollar [HKD], New Zealand Dollar [NZD], South Korean Won [KRW] and Mexican Peso [MXN]) and 3 precious metals (gold [XAU], silver [XAG] and platinum [XPT]).8 We considered all the data available for the calendar year 2007 (251 working days). In this data there are several missing values: XAU, XAG and XPT have 9, 8 and 42 days of missing values respectively. On top of this, we also introduced artificially long sequences of missing data. Our objective is to model the data and test the effectiveness of the model by imputing these missing points. We removed a test set from the data by extracting contiguous sections from 3 currencies associated with very different geographic locations: we took days 50–100 from CAD, days 100–150 from JPY and days 150–200 from AUD. The remainder of the data comprised the training set, which consisted of 3051 points, with the test data containing 153 points. For preprocessing we removed the mean from each output and scaled them so that they all had unit variance.
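The preprocessing and the held-out windows can be sketched as follows, with synthetic series standing in for the exchange-rate data; treating each 51-day range (days 50–100 inclusive, etc.) as one window reproduces the 153 test points mentioned above.

```python
import numpy as np

# 13 outputs (10 currencies + 3 metals) over 251 working days; the Gaussian
# draws below are placeholders for the real exchange-rate series.
rng = np.random.default_rng(1)
Y = rng.normal(5.0, 2.0, size=(13, 251))

# Remove the mean of each output and scale to unit variance.
Y_std = (Y - Y.mean(axis=1, keepdims=True)) / Y.std(axis=1, keepdims=True)

# True marks training points; contiguous windows are held out as test data.
mask = np.ones(Y_std.shape, dtype=bool)
mask[0, 49:100] = False   # e.g. CAD, days 50-100 (51 days)
mask[1, 99:150] = False   # e.g. JPY, days 100-150 (51 days)
mask[2, 149:200] = False  # e.g. AUD, days 150-200 (51 days)

n_test = int((~mask).sum())  # 3 * 51 = 153 held-out points
```

The pre-existing missing values in XAU, XAG and XPT would additionally be dropped from the training mask, which is how the training set shrinks to 3051 points in the real data.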

It seems reasonable to suggest that the fluctuations of the 13 correlated financial time series are driven by a smaller number of latent market forces. We therefore modelled the data with up to six latent forces, which could be noise or smooth GPs. The GP priors for the smooth latent forces are assumed to have a Gaussian covariance function, k_{u_r u_r}(t, t') = (1/√(2π ℓ_r²)) exp(−(t − t')²/(2 ℓ_r²)), where the hyperparameter ℓ_r is known as the lengthscale.

We present an example with R = 4. For this value of R, we consider all the possible combinations of R_o and R_s. The training was performed in all cases by maximizing the variational bound using the scaled conjugate gradient algorithm until convergence was achieved. The best performance, in terms of achieving the highest value of the bound F_V, was obtained for R_s = 1 and R_o = 3. We compared against the LMC model for different numbers of latent functions in that framework. While our best model resulted in a standardized mean square error of 0.2795, the best LMC (with R = 2) resulted in 0.3927. We plotted predictions from the latent market force model to characterize the performance when filling in missing data. In figure 4 we show the output signals obtained using the model with the highest bound (R_s = 1 and R_o = 3) for CAD, JPY and AUD. Note that the model performs better at capturing the deep drop in AUD than it does for the fluctuations in CAD and JPY.

8 Data is available at http://fx.sauder.ubc.ca/data.html.
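The normalized Gaussian covariance above and the standardized mean square error used for the comparison are both one-liners; the lengthscale, grid and example targets below are illustrative assumptions.

```python
import numpy as np

def gaussian_cov(t, t_prime, lengthscale):
    """Normalized Gaussian covariance:
    k(t, t') = exp(-(t - t')^2 / (2 l^2)) / sqrt(2 pi l^2)."""
    sq = (t[:, None] - t_prime[None, :]) ** 2
    return np.exp(-sq / (2.0 * lengthscale**2)) / np.sqrt(
        2.0 * np.pi * lengthscale**2
    )

t = np.linspace(0.0, 1.0, 5)
K = gaussian_cov(t, t, lengthscale=0.3)

def smse(y_true, y_pred):
    """Standardized mean square error: MSE divided by the target variance,
    so predicting the mean of the targets scores exactly 1."""
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))
```

Note the 1/√(2πℓ²) normalization: unlike the usual squared exponential kernel, the diagonal of K is not 1 but shrinks as the lengthscale grows, consistent with this covariance arising from a normalized smoothing kernel.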


[Figure 4: three panels of real data and predictions over the 251 days: (a) CAD, (b) JPY and (c) AUD.]

Figure 4: Predictions from the model with R_s = 1 and R_o = 3 are shown as solid lines for the mean and grey bars for error bars at 2 standard deviations. For CAD, JPY and AUD the data was artificially held out. The true values are shown as a dotted line. Crosses on the x-axes of all plots show the locations of the inducing inputs.

7 Conclusions

We have presented a variational approach to sparse approximations in convolution processes. Our main focus was to provide efficient mechanisms for learning in multiple output Gaussian processes when the latent function is fluctuating rapidly. In order to do so, we have introduced the concept of an inducing function, which generalizes the idea of the inducing point traditionally employed in sparse GP methods. The approach extends the variational approximation of Titsias (2009) to the multiple output case. Using our approach we can perform efficient inference on latent force models which are based around SDEs but also contain a smooth driving force. Our approximation is flexible and has been shown to be applicable to a wide range of data sets, including high-dimensional ones.

Acknowledgements

We would like to thank Edwin Bonilla for his valuable feedback with respect to the datasets in section 5, and the authors of Bonilla et al. (2008), who kindly made the compiler dataset available. DL has been partly financed by the Spanish government through CICYT projects TEC2006-13514-C02-01 and TEC2009-14504-C02-01, and the CONSOLIDER-INGENIO 2010 Program (Project CSD2008-00010). MA and NL have been financed by a Google Research Award, "Mechanistically Inspired Convolution Processes for Learning", and MA, NL and MT have been financed by EPSRC Grant No EP/F005687/1, "Gaussian Processes for Systems Identification with Applications in Systems Biology".

References

Mauricio A. Álvarez and Neil D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In NIPS 21, pages 57–64. MIT Press, 2009.

Mauricio A. Álvarez, David Luengo, and Neil D. Lawrence. Latent force models. In JMLR: W&CP 5, pages 9–16, 2009.

Mauricio A. Álvarez, David Luengo, Michalis K. Titsias, and Neil D. Lawrence. Variational inducing kernels for sparse convolved multiple output Gaussian processes. Technical report, School of Computer Science, University of Manchester, 2009. Available at http://arxiv.org/abs/0912.3268.

Edwin V. Bonilla, Kian Ming Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In NIPS 20, pages 153–160. MIT Press, 2008.

Phillip Boyle and Marcus Frean. Dependent Gaussian processes. In NIPS 17, pages 217–224. MIT Press, 2005.

Lehel Csató and Manfred Opper. Sparse representation for Gaussian process models. In NIPS 13, pages 444–450. MIT Press, 2001.

Pierre Goovaerts. Geostatistics for Natural Resources Evaluation. Oxford University Press, 1997.

David M. Higdon. Space and space-time modelling using process convolutions. In Quantitative Methods for Current Environmental Issues, pages 37–56. Springer-Verlag, 2002.

Neil D. Lawrence, Guido Sanguinetti, and Magnus Rattray. Modelling transcriptional regulation using Gaussian processes. In NIPS 19, pages 785–792. MIT Press, 2007.

Miguel Lázaro-Gredilla and Aníbal Figueiras-Vidal. Inter-domain Gaussian processes for sparse inference using inducing features. In NIPS 22, pages 1087–1095. MIT Press, 2010.

Joaquin Quiñonero Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939–1959, 2005.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

Alex Rogers, M. A. Osborne, S. D. Ramchurn, S. J. Roberts, and N. R. Jennings. Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes. In Proc. Int. Conf. on Information Processing in Sensor Networks (IPSN 2008), 2008.

Matthias Seeger, Christopher K. I. Williams, and Neil D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Proc. 9th Int. Workshop on Artificial Intelligence and Statistics, Key West, FL, 3–6 January 2003.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS 18. MIT Press, 2006.

Yee Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In AISTATS 10, pages 333–340, 2005.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In JMLR: W&CP 5, pages 567–574, 2009.