ISSN 1064-2269, Journal of Communications Technology and Electronics, 2016, Vol. 61, No. 6, pp. 661–671. © Pleiades Publishing, Inc., 2016.
Original Russian Text © E.V. Burnaev, M.E. Panov, A.A. Zaytsev, 2015, published in Informatsionnye Protsessy, 2015, Vol. 15, No. 3, pp. 298–313.
MATHEMATICAL MODELS AND COMPUTATIONAL METHODS

Regression on the Basis of Nonstationary Gaussian Processes with Bayesian Regularization

E. V. Burnaev, M. E. Panov, and A. A. Zaytsev

Institute for Information Transmission Problems, Russian Academy of Sciences, Bol'shoi Karetnyi per. 19, str. 1, Moscow, 127994 Russia
e-mail: burnaev@iitp.ru

Received June 26, 2015

Abstract—We consider the regression problem, i.e., prediction of a real-valued function. A Gaussian process prior is imposed on the function and is combined with the training data to obtain predictions for new points. We introduce a Bayesian regularization on the parameters of the covariance function of the process, which increases the quality of approximation and the robustness of the estimation. An approach to modeling a nonstationary covariance function of a Gaussian process on the basis of a linear expansion in a parametric functional dictionary is also proposed. Introducing such a covariance function allows us to model functions with nonhomogeneous behaviour. Combining the above features with careful optimization of the covariance function parameters results in a unified approach that can be easily implemented and applied. The resulting algorithm is an out-of-the-box solution to regression problems, with no need to tune parameters manually. The effectiveness of the method is demonstrated on various datasets.

Keywords: Gaussian processes, regression, Bayesian regularization, a priori distribution, Bayesian regression

DOI: 10.1134/S1064226916060061
1. INTRODUCTION
One of the main problems solved when constructing surrogate models (metamodels, data-based models) is the problem of approximating an unknown dependence from data (constructing the regression function) [1], which can be described as follows. Let $y = f(x)$ be an unknown function with input $x \in \mathbb{X} \subset \mathbb{R}^n$ and output $y \in \mathbb{R}$. The values of this function are known at the points of a training sample $D_{\mathrm{train}} = (X, y) = \{(x_i, y_i = f(x_i)),\ i = 1, \ldots, N\}$. It is required to construct an approximation $\hat{y} = \hat{f}(x) \approx f(x)$ of the objective function $f(x)$ using the training sample $D_{\mathrm{train}}$. The quality of the approximation is estimated on an independent test sample $D_{\mathrm{test}} = (X^*, y^*) = \{(x_j^*, y_j^* = f(x_j^*)),\ j = 1, \ldots, N^*\}$, e.g., by the mean error

$$\varepsilon(\hat{f} \mid D_{\mathrm{test}}) = \frac{1}{N^*} \sum_{j=1}^{N^*} |y_j^* - \hat{f}(x_j^*)|, \qquad (1)$$

and by the 95th and 99th percentiles of the absolute approximation error.

A widely used approach to describing the class of approximation models in this formulation is based on introducing a prior distribution on a space of functions. It is standard practice to use a Gaussian prior distribution, as a result of which the problem is
reduced to nonparametric regression on the basis of Gaussian processes [2–8]. This method is widely used for solving applied problems, including conceptual design [9], structural optimization [10], multicriteria design optimization [11], design in the aerospace [12] and automobile [13] industries, etc.

The general approach to constructing Gaussian process regression is described in the literature [14], but little attention has been paid to its practical implementation. Widely used software packages, e.g., DACE [15], in some cases produce degenerate approximations that can hardly be interpreted from the viewpoint of the original problem. Available theoretical results [16] show that choosing the gamma distribution as a prior distribution on the parameters of the covariance function leads to a minimax-optimal rate of concentration of the posterior distribution; in this case, such a prior distribution plays the role of a regularization that potentially prevents degeneration of the approximation model. However, the practical implementation of the corresponding regularization algorithm requires an accurate choice of the next-level hyperprior distribution on the parameters of the gamma distribution and a reliable optimization procedure capable of finding a robust local maximum of the likelihood. In this work, it is suggested to introduce a hyperprior lognormal distribution with fixed parameters on the parameters of the prior gamma distribution and to optimize the joint likelihood
of the data and the model with respect to both the parameters of the covariance function and the parameters of the prior gamma distribution. The results of experiments show that this approach makes it possible to substantially increase the accuracy of approximation and to avoid its degeneration.
Another important feature of the general approach to regression on the basis of Gaussian processes is that models with a stationary covariance function are mainly used [14]. The existing approaches to nonstationary covariance function modeling are, as a rule, excessively parameterized [17], which leads to high computational costs when estimating the model parameters and results in instability of the corresponding optimization procedure. In the present work, we suggest a method for modeling a nonstationary covariance function on the basis of a linear expansion in a dictionary of parametric functions. With a given dictionary, the proposed procedure makes it possible to automatically estimate the contributions of the stationary and nonstationary parts of the covariance function to the total variability of the approximation model.

The work consists of the following parts. In Section 2, regression on the basis of Gaussian processes and our method for constructing a nonstationary covariance function are described. Section 3 describes the algorithm we propose for estimating the parameters of Gaussian process regression by means of the Bayesian approach. Section 4 presents the results of numerical experiments.
2. APPROXIMATOR ON THE BASIS
OF A GAUSSIAN PROCESS
Regression on the basis of Gaussian processes makes it possible to construct nonlinear approximation models and yields an analytical expression for estimating the uncertainty of the approximation at a given point $x$.

A Gaussian process $f(\cdot)$ is a stochastic process for which each spatial point $x$ is associated with a certain normal random variable; moreover, each finite set of values of a Gaussian process at different points of the space has a multidimensional normal distribution. Thus, a random process $f(\cdot)$ is defined by the mean $m(x) = \mathbb{E}[f(x)]$ and the covariance function $k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]$. For the zero mean $m(x) = \mathbb{E}[f(x)] = 0$ and a fixed covariance function, the posterior mean of the Gaussian process at the points of a test sample $X^*$ has the form [14]

$$\hat{f}(X^*) = K_* K^{-1} y, \qquad (2)$$

where $K_* = K(X^*, X) = [k(x_j^*, x_i),\ i = 1, \ldots, N;\ j = 1, \ldots, N^*]$ and $K = K(X, X) = [k(x_i, x_j),\ i, j = 1, \ldots, N]$.

Usually, in simulations, one has access only to noisy values of the function:

$$y(x) = f(x) + \varepsilon(x), \qquad (3)$$

where the noise $\varepsilon(x)$ is modelled by Gaussian white noise with zero mean and variance $\tilde{\sigma}^2$. In this case, the observations $y(x)$ form a Gaussian process with zero mean and the covariance function $\mathrm{cov}(y(x), y(x')) = k(x, x') + \tilde{\sigma}^2$. Then the posterior mean of the Gaussian process $f(x)$ at the points of the test sample $X^*$ has the form

$$\hat{f}(X^*) = K_* (K + \tilde{\sigma}^2 I_N)^{-1} y, \qquad (4)$$

where $I_N$ is the unit $N \times N$ matrix. The posterior covariance of the Gaussian process at the points of the test sample has the form

$$\mathbb{V}[\hat{f}(X^*)] = K(X^*, X^*) + \tilde{\sigma}^2 I_{N^*} - K_* (K + \tilde{\sigma}^2 I_N)^{-1} K_*^T, \qquad (5)$$

where $K(X^*, X^*) = [k(x_i^*, x_j^*),\ i, j = 1, \ldots, N^*]$.

The variance of the noise, $\tilde{\sigma}^2$, in formula (4) regularizes the covariance matrix $K$ and makes it possible to control the generalization ability of the approximator. The diagonal elements of the matrix $\mathbb{V}[\hat{f}(X^*)]$ from formula (5) can be used as estimates of the expected approximation error at the corresponding points $X^*$.
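The computation behind formulas (2)–(5) is compact enough to state directly. The following sketch computes the posterior mean (4) and posterior covariance (5); the squared-exponential covariance, the data, and all names are our illustrative assumptions, not part of any package:

```python
# A minimal sketch of formulas (4)-(5): posterior mean and covariance of a
# Gaussian process with an assumed squared-exponential covariance function.
import numpy as np

def k_se(x, x2, theta, sigma2=1.0):
    """Stationary squared-exponential covariance k(x - x')."""
    d = x[:, None, :] - x2[None, :, :]               # pairwise differences
    return sigma2 * np.exp(-np.sum((theta * d) ** 2, axis=2))

def gp_posterior(X, y, X_star, theta, noise_var):
    K = k_se(X, X, theta)                            # K = K(X, X)
    K_star = k_se(X_star, X, theta)                  # K_* = K(X_*, X)
    A = K + noise_var * np.eye(len(X))               # K + noise variance * I_N
    mean = K_star @ np.linalg.solve(A, y)            # posterior mean, formula (4)
    cov = (k_se(X_star, X_star, theta)
           + noise_var * np.eye(len(X_star))
           - K_star @ np.linalg.solve(A, K_star.T))  # posterior covariance, formula (5)
    return mean, cov

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2)); y = np.sin(X).sum(axis=1)
mean, cov = gp_posterior(X, y, rng.uniform(size=(5, 2)),
                         theta=np.ones(2), noise_var=1e-4)
print(mean, np.diag(cov))    # predictions and pointwise error estimates
```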
2.1. Estimating the Parameters
of the Gaussian Process
When working with real data, the covariance function of the Gaussian process that generated the data is not known. Assume that the covariance function of the Gaussian process belongs to a parametric family $k(x, x') = k(x, x' \mid a)$, where $a \in \mathbb{R}^K$ is the vector of parameters of the covariance function. The role of the family $k(x, x' \mid a)$ is often played by the class of stationary covariance functions, i.e., functions depending only on the difference of the arguments:
$k(x, x' \mid a) = k(x - x' \mid a)$. The parameter $a$ is estimated from the training sample $D_{\mathrm{train}}$ by the maximum-likelihood method. The logarithm of the likelihood of a Gaussian process at the points of the training sample [14] has the form

$$\log p(y \mid X, a, \tilde{\sigma}) = -\frac{1}{2}\left( y^T (K + \tilde{\sigma}^2 I_N)^{-1} y + \log |K + \tilde{\sigma}^2 I_N| + N \log 2\pi \right), \qquad (6)$$

where $|K + \tilde{\sigma}^2 I_N|$ is the determinant of the matrix $K + \tilde{\sigma}^2 I_N$. The logarithm of the likelihood depends on the parameters $a$ of the covariance function and on the variance of the noise, $\tilde{\sigma}^2$. Thus, the estimates of $a$ and $\tilde{\sigma}^2$ are obtained by maximizing the logarithm of the likelihood:

$$\log p(y \mid X, a, \tilde{\sigma}) \to \max_{a, \tilde{\sigma}}. \qquad (7)$$

The choice of a particular family of covariance functions $k(x, x' \mid a)$ is usually determined by prior knowledge of the properties of the function to be approximated. In this work, we use covariance functions from the family

$$k_{p,q}(x - \tilde{x} \mid a) = \sigma^2 \exp\left( -\left[ \sum_{i=1}^n \theta_i^2 |x_i - \tilde{x}_i|^p \right]^q \right), \quad a = \{\theta = (\theta_i)_{i=1}^n, \sigma\}, \quad p \in (0, 2], \quad q \in (0, 1].$$

In this case, it is traditionally recommended to use the following parameterization of $\theta$: $\theta_i = e^{\tau_i}$, $i = 1, \ldots, n$, and to optimize the likelihood with respect to the parameters $\tau = (\tau_i)_{i=1}^n$ (see [18, 19]).

It should be noted that, besides the above-described family of covariance functions, one may consider the family of covariance functions based on the Mahalanobis distance

$$k(x - \tilde{x} \mid \Lambda, \sigma) = \sigma^2 \exp\left( -(x - \tilde{x})^T \Lambda^2 (x - \tilde{x}) \right),$$

where $\Lambda^2$ is some positive-definite matrix. This family of covariance functions enables one to approximate a wide class of dependences, but the quadratic growth of the number of parameters with the dimension of the input space makes optimization of the likelihood a computationally complex problem; the construction of an efficient algorithm for solving it deserves special consideration.

The use of a stationary covariance function limits the accuracy of approximation if the modelled dependence is spatially nonuniform. Below we suggest a method for constructing a nonstationary covariance function for a Gaussian process.
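As an illustration, the maximization (7) can be sketched as follows. The code fixes the special case $p = 2$, $q = 1$ of the family above, uses the recommended log-parameterization $\theta_i = e^{\tau_i}$, and relies on a generic quasi-Newton optimizer; it is a minimal sketch, not the implementation used in our experiments:

```python
# A sketch of maximum-likelihood estimation (6)-(7) with p = 2, q = 1 and the
# log-parameterization of [18, 19]; all names and test data are illustrative.
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, X, y):
    n = X.shape[1]
    theta = np.exp(params[:n])                  # theta_i = exp(tau_i)
    sigma2, noise_var = np.exp(params[n]), np.exp(params[n + 1])
    d = X[:, None, :] - X[None, :, :]
    K = sigma2 * np.exp(-np.sum((theta * d) ** 2, axis=2))
    A = K + noise_var * np.eye(len(X))
    _, logdet = np.linalg.slogdet(A)
    # formula (6) with the sign flipped for minimization
    return 0.5 * (y @ np.linalg.solve(A, y) + logdet + len(X) * np.log(2 * np.pi))

rng = np.random.default_rng(1)
X = rng.uniform(size=(30, 2)); y = np.cos(3 * X[:, 0]) + X[:, 1]
res = minimize(neg_log_likelihood, np.zeros(X.shape[1] + 2),
               args=(X, y), method="L-BFGS-B")
print(np.exp(res.x))    # fitted theta, sigma^2, and noise variance
```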
2.2. The Model of a Nonstationary Gaussian Process
Consider a nonstationary Gaussian process of the
following form:
$$y = f(x) + \varepsilon(x), \qquad (8)$$

where $f(x) = \sum_{i=1}^Q \alpha_i \psi_i(x) + \tilde{f}(x)$; here, $\tilde{f}(x)$ is a Gaussian process with zero mean and a stationary covariance function $k(x, x') = k(x - x' \mid a)$; $\{\alpha_i,\ i = 1, \ldots, Q\}$ are independent identically distributed normal random variables with zero mean and variance $\sigma^2/Q$; and $\{\psi_i(x),\ i = 1, \ldots, Q\}$ is some set of functions. For such a process, the covariance function takes the form

$$\mathrm{cov}(y(x), y(x')) = k(x - x' \mid a) + \sigma^2 \psi(x)^T \psi(x') + \tilde{\sigma}^2, \qquad (9)$$

where $\psi(x) = \{\psi_i(x),\ i = 1, \ldots, Q\}$. Introduce the following notation: $\Psi(X) = \{\psi(x_i),\ i = 1, \ldots, N\}$, $\Psi(X^*) = \{\psi(x_j^*),\ j = 1, \ldots, N^*\}$, $\bar{K}_* = K_* + \sigma^2 \Psi(X^*)^T \Psi(X)$, and $\bar{K} = K + \sigma^2 \Psi(X)^T \Psi(X)$. Then the posterior mathematical expectation of process (8) at the test sample points has the form

$$\hat{f}(X^*) = \bar{K}_* (\bar{K} + \tilde{\sigma}^2 I_N)^{-1} y. \qquad (10)$$

Correspondingly, the posterior covariance of the predictions at the test sample points has the form

$$\mathbb{V}[\hat{f}(X^*)] = K(X^*, X^*) + \tilde{\sigma}^2 I_{N^*} + \sigma^2 \Psi(X^*)^T \Psi(X^*) - \bar{K}_* (\bar{K} + \tilde{\sigma}^2 I_N)^{-1} \bar{K}_*^T. \qquad (11)$$

The logarithm of the likelihood of Gaussian process (8) coincides with expression (6) in which the matrix $K$ is replaced by the matrix $\bar{K}$:

$$\log p(y \mid X, a, \tilde{\sigma}, \sigma) = -\frac{1}{2}\, y^T (\bar{K} + \tilde{\sigma}^2 I_N)^{-1} y - \frac{1}{2} \log |\bar{K} + \tilde{\sigma}^2 I_N| - \frac{N}{2} \log 2\pi. \qquad (12)$$

If the set of basis functions is specified, then, for the identification of the approximation model, it is necessary to maximize the logarithm of the likelihood (12) with respect to the parameters $a$ of the covariance
function $k(x - x' \mid a)$ and with respect to the parameters $\tilde{\sigma}$ and $\sigma$, i.e.,

$$\log p(y \mid X, a, \tilde{\sigma}, \sigma) \to \max_{a, \tilde{\sigma}, \sigma}. \qquad (13)$$

It should be noted that, during optimization with respect to the parameters $\tilde{\sigma}$ and $\sigma$, the contributions of the stationary and nonstationary parts of the covariance function to the total variability of the approximation model are determined automatically.

Constructing the set of functions $\{\psi_i(x),\ i = 1, \ldots, Q\}$ from a training sample is an independent problem. In the series of papers [20–23], it was suggested to approximate a multidimensional dependence by a linear expansion in a dictionary of parametric functions of the following types:

1. Sigmoid basis functions $\psi_j(x) = h\left( \sum_{i=1}^n \beta_{j,i} x_i \right)$, where $x = (x_1, \ldots, x_n)$, $h(\cdot)$ is a sigmoid-type transfer function built from $g(v) = (e^v - 1)/(e^v + 1)$, and $\gamma_j$ and $\beta_{j,i}$ are adjustable parameters;

2. Radial basis functions $\psi_j(x) = \exp\left( -\|x - d_j\|^2 / t_j^2 \right)$, where $d_j$ and $t_j$ are adjustable parameters;

3. Linear basis functions $\psi_j(x) = x_j$, $j = 1, \ldots, n$.

The parameters of the functions from the dictionary are adjusted by methods of gradient optimization, boosting, and bagging [20–23]. It is the dictionary obtained in this manner that is suggested to be used for modeling a nonstationary Gaussian process.
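A minimal sketch of prediction (10) with the covariance (9) is given below. The fixed radial dictionary is a stand-in assumption for the adaptively tuned dictionary of [20–23], and all settings are illustrative:

```python
# A sketch of formula (10): the stationary kernel is augmented by the term
# sigma^2 Psi^T Psi built from a dictionary of basis functions. The fixed
# radial dictionary here is an assumption, not the tuned dictionary of [20-23].
import numpy as np

def psi(X, centers, t=0.3):
    """Psi(X): rows are basis functions, columns are sample points."""
    d2 = ((centers[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / t ** 2)                      # radial basis functions

def k_stat(X, X2, theta):
    d = X[:, None, :] - X2[None, :, :]
    return np.exp(-np.sum((theta * d) ** 2, axis=2))

def nonstat_predict(X, y, X_star, theta, sigma2, noise_var, centers):
    P, P_star = psi(X, centers), psi(X_star, centers)
    K_bar = k_stat(X, X, theta) + sigma2 * P.T @ P                 # K-bar
    K_bar_star = k_stat(X_star, X, theta) + sigma2 * P_star.T @ P  # K-bar_*
    return K_bar_star @ np.linalg.solve(
        K_bar + noise_var * np.eye(len(X)), y)                     # formula (10)

rng = np.random.default_rng(2)
X = rng.uniform(size=(40, 1)); y = (X[:, 0] > 0.5).astype(float)   # step-like data
pred = nonstat_predict(X, y, np.linspace(0, 1, 9)[:, None],
                       theta=np.ones(1), sigma2=1.0, noise_var=1e-3,
                       centers=np.linspace(0, 1, 5)[:, None])
print(pred)
```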
2.3. Modeling of a Multidimensional Output
In the previous sections, we considered construction of an approximator based on a nonstationary Gaussian process regression model for the case of a one-dimensional output $y \in \mathbb{R}^1$. In this section, we extend these results to the case of a multidimensional output $y = (y^1, \ldots, y^m) \in \mathbb{R}^m$, $m > 1$. The simplest approach consists in constructing an independent approximator for each output component $y^l$, $l = 1, \ldots, m$. However, this approach does not take into account the cross-correlations between the output components. Modeling not only the correlation properties of each component but also the cross-correlations between components can substantially improve the accuracy of approximation and can be implemented, e.g., by the cokriging algorithm [24].
σ
˜
σ
pyXaσ
˜σ,, ,().
aσ
˜σ,,
max
→log
σ
˜
σ,
hβji,xi
i1=
n
∑
(),
xx1…xn
,,(),hz() gz
gγj
() 1+sign z()(),==
gv() ev1–
ev1+
,=
xd
j
–2/tj
2
–(),exp
y1...,ym
,()⺢m,∈
In this work, we suggest an intermediate approach, which presumes that the Gaussian processes simulating different output components are independent but have a common covariance function. Assume that, for the $l$th output component,

$$y^l(x) = \sum_{i=1}^Q \alpha_{i,l} \psi_i(x) + \tilde{f}_l(x) + \varepsilon_l(x),$$

and the covariance function of the corresponding Gaussian process $\tilde{f}_l(x)$ has the form

$$k_l(x - \tilde{x} \mid a_l) = \sigma_l^2 \exp\left( -\left[ \sum_{i=1}^n \theta_i^2 |x_i - \tilde{x}_i|^p \right]^q \right), \quad a_l = \{\theta = (\theta_i)_{i=1}^n, \sigma_l\}, \quad p \in (0, 2], \quad q \in (0, 1],$$

where different output components have different factors $\sigma_l^2$ and the parameters $\theta = (\theta_i)_{i=1}^n$ are the same for all output components. Assume that the coefficients $\{\alpha_{i,l},\ i = 1, \ldots, Q\}_{l=1}^m$ are independent identically distributed normal random variables with zero mean and variance $\mathbb{E}\alpha_{i,l}^2 = \sigma^2/Q$, and that the processes $\varepsilon_l(x)$, $l = 1, \ldots, m$, are mutually independent, independent of $\{\alpha_{i,l},\ i = 1, \ldots, Q\}_{l=1}^m$ and $\{\tilde{f}_l(x)\}_{l=1}^m$, and are modelled by white Gaussian noise with variance $\tilde{\sigma}^2$.

The likelihood of such a model equals the sum of the likelihoods (12) over all output channels. The optimal values of the parameters are obtained by optimizing the log-likelihood:

$$\log p(Y \mid X, \tilde{\sigma}, \{a_l, \sigma_l\}_{l=1}^m) = \sum_{l=1}^m \log p(y^l \mid X, \tilde{\sigma}, a_l, \sigma_l) \to \max_{\{a_l, \sigma_l\}_{l=1}^m,\, \tilde{\sigma}}. \qquad (14)$$

This approach offers a computationally efficient method for constructing a model of a dependence that takes into account the cross-correlations between different output components. A drawback of this approach is that, although the posterior mean values of the corresponding Gaussian processes are different for different output components,

$$\hat{f}_l(X^*) = \bar{K}_{l,*} (\bar{K}_l + \tilde{\sigma}^2 I_N)^{-1} y^l, \quad l = 1, \ldots, m,$$

all output components are modelled by independent Gaussian processes whose covariance matrices differ only in the scaling factors $\{\sigma_l\}_{l=1}^m$. As a result, the posterior covariances (11) of different components, e.g., in the interpolation regime, will differ only in the values of the scaling factors.
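The criterion (14) can be sketched as follows; a shared correlation matrix is computed once and reused for all outputs. The code assumes $p = 2$, $q = 1$ and illustrative data:

```python
# A sketch of the multi-output criterion (14): one set of length-scales theta
# is shared by all outputs, while each output l keeps its own factor sigma_l^2;
# the objective is the sum of single-output log-likelihoods (12).
import numpy as np

def neg_joint_log_likelihood(params, X, Y):
    n, m = X.shape[1], Y.shape[1]
    theta = np.exp(params[:n])                       # shared length-scales
    sigma2_l = np.exp(params[n:n + m])               # per-output factors sigma_l^2
    noise_var = np.exp(params[n + m])                # common noise variance
    d = X[:, None, :] - X[None, :, :]
    R = np.exp(-np.sum((theta * d) ** 2, axis=2))    # shared correlation matrix
    total = 0.0
    for l in range(m):                               # sum over output channels
        A = sigma2_l[l] * R + noise_var * np.eye(len(X))
        _, logdet = np.linalg.slogdet(A)
        total += 0.5 * (Y[:, l] @ np.linalg.solve(A, Y[:, l])
                        + logdet + len(X) * np.log(2 * np.pi))
    return total

rng = np.random.default_rng(3)
X = rng.uniform(size=(25, 2)); Y = np.stack([X.sum(1), X.prod(1)], axis=1)
print(neg_joint_log_likelihood(np.zeros(X.shape[1] + Y.shape[1] + 1), X, Y))
```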
2.4. Interpolation Properties in the Case
of a Nonstationary Covariance Function
The model of the dependence on the basis of Gaussian processes has an important property: in models (3) and (8) with noise, for $\tilde{\sigma}^2 = 0$, we obtain interpolating behavior of the posterior mean, i.e., at the points of the sample $X$, the approximated value coincides with the observed value of the dependence [14]. Such behavior is important for some engineering problems. For $\tilde{\sigma}^2 > 0$, the interpolation property, generally speaking, is not satisfied; but, for model (8), it can be shown that, for sufficiently small $\tilde{\sigma}^2 \approx 0$, there always exist values of the parameters $\{\tilde{\sigma}, \theta = (\theta_i)_{i=1}^n, \sigma\}$ such that the behavior of the approximator will be interpolating to any specified accuracy. This assertion is formalized in the following theorem [25].

Theorem 1. For a given $\epsilon > 0$, if

$$\|\Psi(X)^T \Psi(X)\| < C, \quad C \in \mathbb{R}_+, \qquad \tilde{\sigma}^2 < \frac{1}{8} \min\left( \min_{i=1,\ldots,N} \lambda_i,\ \epsilon \right), \qquad \sigma^2 > \frac{1}{2}, \qquad \sigma^2 < \frac{\min_{i=1,\ldots,N} \lambda_i}{8(C+1)},$$

where $\lambda_i$, $i = 1, \ldots, N$, are the eigenvalues of the matrix $K$ and the norm is the standard Euclidean norm, then

$$\|y - \hat{f}(X)\| < \epsilon \|y\|.$$

Thus, for our model of a nonstationary Gaussian process, one can also obtain interpolating realizations, which is sometimes necessary in applied problems.
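The following toy computation illustrates (but does not prove) the statement of Theorem 1: as $\tilde{\sigma}^2$ decreases, the relative deviation $\|y - \hat{f}(X)\|/\|y\|$ of model (8) on the training points tends to zero. All settings here are our illustrative assumptions:

```python
# A numerical illustration of Theorem 1: for a small noise variance the
# posterior mean of model (8) nearly interpolates the training data.
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(size=(15, 1)); y = np.sin(6 * X[:, 0])
d = X[:, None, :] - X[None, :, :]
K = np.exp(-np.sum((5.0 * d) ** 2, axis=2))          # stationary part
C = np.linspace(0, 1, 4)[:, None]                    # small illustrative dictionary
P = np.exp(-((C[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) / 0.1)
K_bar = K + 1.0 * P.T @ P                            # K-bar with sigma^2 = 1
for noise_var in (1e-2, 1e-6, 1e-10):                # decreasing noise variance
    f_hat = K_bar @ np.linalg.solve(K_bar + noise_var * np.eye(len(X)), y)
    print(noise_var, np.linalg.norm(y - f_hat) / np.linalg.norm(y))
```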
3. THE USE OF THE BAYESIAN APPROACH
FOR ESTIMATING THE PARAMETERS
OF THE REGRESSION MODEL ON THE BASIS
OF A GAUSSIAN PROCESS
3.1. Drawbacks of the Standard Approach
to Model Identification
In Section 2, we described a basic algorithm for constructing a model on the basis of Gaussian processes. On a wide class of problems, this algorithm provides a good quality of approximation of the simulated dependence, but in some cases it can lead to degeneration of the model. For example, consider the
use of a covariance function with a weighted Euclidean distance:

$$k(x, \tilde{x}) = \sigma^2 \exp\left( -\sum_{i=1}^n \theta_i^2 (x_i - \tilde{x}_i)^2 \right).$$

When optimizing the likelihood with respect to the parameters of the covariance function $a = \{\theta = (\theta_i)_{i=1}^n, \sigma\}$ and the variance of the noise $\tilde{\sigma}$, the following effects can be observed:

⎯ the optimum can be located in a region where $\|\theta\| \sim 0$; in this case, the matrix $K$ is ill-conditioned;

⎯ for $\|\theta\| \gg 1$, we have $K \to I_N$, which leads to a degenerate approximation;

⎯ the log-likelihood $\log p(y \mid X, a, \tilde{\sigma})$ can depend only weakly on the parameters $a$ and $\tilde{\sigma}$.

If the maximum of the likelihood is located in the region of small $\|\theta\| \sim 0$, the covariance matrix $K$ is ill-conditioned. Formula (4) implies that the approximation can be represented in the form $\hat{f}(x) = \sum_{i=1}^N \alpha_i k(x, x_i)$, where $(\alpha_i)_{i=1}^N = (K + \tilde{\sigma}^2 I_N)^{-1} y$. If the value of $\tilde{\sigma}$ is insufficiently large, then, owing to the ill-conditioning of the matrix $K$, the norm of the vector $(\alpha_i)_{i=1}^N$ takes large values, and numerical noise becomes essentially noticeable in the resulting approximation $\hat{f}(x)$. If the maximum of the likelihood is located in the region of large values $\|\theta\| \gg 1$, the approximation model also degenerates: for $\tilde{\sigma} \approx 0$ and $(x_i, y_i) \in D_{\mathrm{train}}$, we have $\hat{f}(x_i) \approx y_i$; in this case, $\hat{f}(x) \approx \frac{1}{N} \sum_{i=1}^N y_i$ for $x \notin D_{\mathrm{train}}$.
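These two regimes are easy to observe numerically. The sketch below (illustrative settings, not a diagnostic from any package) prints the condition number of $K$ and its distance from $I_N$ for small, moderate, and large $\|\theta\|$:

```python
# A sketch of the two degeneracy regimes of Section 3.1: for ||theta|| ~ 0 the
# matrix K is ill-conditioned, and for ||theta|| >> 1 it approaches I_N, so
# predictions collapse towards the mean of the training values.
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(size=(30, 2))
d = X[:, None, :] - X[None, :, :]
for scale in (1e-3, 1.0, 1e3):                   # ||theta|| small / moderate / large
    K = np.exp(-np.sum((scale * d) ** 2, axis=2))
    print(scale, np.linalg.cond(K),              # condition number of K
          np.abs(K - np.eye(len(X))).max())      # distance from the unit matrix
```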
3.2. Bayesian Inference
To overcome the above difficulties, it is suggested to use Bayesian regularization on the parameters of the covariance function. Below we consider two variants of such regularization: one on the basis of a prior normal distribution and one on the basis of a prior gamma distribution.

Assume that the parameters $\theta$ are distributed with some prior density $p(\theta \mid \gamma)$, where $\gamma$ are the hyperparameters of the prior distribution of the covariance function parameters. Then the logarithm of the joint distribution of the data and the model parameters has the form

$$\log P(y, \theta \mid X, \sigma, \tilde{\sigma}, \gamma) = \log p(y \mid X, \theta, \sigma, \tilde{\sigma}) + \log p(\theta \mid \gamma). \qquad (15)$$
The term $\log p(\theta \mid \gamma)$ plays the role of an additional regularization on the parameters of the covariance function $\theta$, and the choice of a particular distribution $p(\theta \mid \gamma)$ determines the character of this regularization. Expression (15) can be maximized with respect to the parameters $\sigma$, $\tilde{\sigma}$, $\gamma$, and $\theta$ in order to find their optimal values. This approach corresponds to finding a MAP (maximum a posteriori probability) estimate of the model parameters $\sigma$, $\tilde{\sigma}$, and $\theta$ and of the parameters of the prior distribution $\gamma$. In this work, the role of the prior distribution $p(\theta \mid \gamma)$ is played by the normal and gamma distributions.

In the case of the normal distribution, consider the following decomposition of the vector $\theta$: $\theta = \theta_{\parallel} + \theta_{\perp}$, where $\theta_{\parallel} = |\theta_{\parallel}| \cdot e$, $e = \left( \frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}} \right)$, and the vector $\theta_{\perp}$ is orthogonal to the vector $e$. Suppose that the quantity $|\theta_{\parallel}|$ is normally distributed with mean $\mu_{\parallel} > 0$ and variance $d_{\parallel}$, and that the vector $\theta_{\perp}$ is independent of $|\theta_{\parallel}|$ and is normally distributed with zero mean and a diagonal covariance matrix $d_{\perp} I_{\perp}$, where $I_{\perp}$ is the unit matrix of dimension $n - 1$ and $d_{\perp}$ is the variance of the components of the vector $\theta_{\perp}$. This assumption is based on our intuition that all components of the vector $\theta$ should be neither too large nor too small and should not differ from each other too significantly. Thus, in the given case, $\gamma = \{\mu_{\parallel}, d_{\parallel}, d_{\perp}\}$ and $\log p(\theta \mid \gamma) = \log p_{\mathcal{N}(\mu_{\parallel}, d_{\parallel})}(|\theta_{\parallel}|) + \log p_{\mathcal{N}(0, d_{\perp} I_{\perp})}(\theta_{\perp})$, where $p_{\mathcal{N}(\mu, \Sigma)}(\cdot)$ is the density of the normal distribution with mean $\mu$ and covariance matrix $\Sigma$.

Now let us consider the case of the gamma distribution. Assume that all components of the vector $\theta$ are independent and distributed according to the same gamma distribution with parameters $\alpha$ and $\beta$. In this case,

$$\log p(\theta \mid \gamma) = \log p(\theta \mid \alpha, \beta) = \sum_{i=1}^n \left[ \alpha \log \beta + (\alpha - 1) \log \theta_i - \beta \theta_i - \log \Gamma(\alpha) \right], \qquad (16)$$

where $\alpha > 1$, $\beta > 0$, and $\Gamma(\alpha)$ is the gamma function. Such a choice of the prior distribution of the parameters also penalizes too large and too small values of the components $\theta_i$. Section 4.1 presents a number of experimental results showing that both methods of Bayesian regularization substantially improve the robustness of the algorithm and the quality of the final model. It should be noted that, in a similar manner, one can regularize the parameter $\sigma$.
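A sketch of the resulting MAP criterion, combining the negative log-likelihood from (15) with the gamma log-prior (16) for fixed $\alpha$ and $\beta$ (the multilevel scheme of Section 3.3 tunes them as well), may look as follows; the code and its test data are illustrative:

```python
# A sketch of the MAP criterion (15) with the gamma prior (16): the penalized
# objective is the negative log-likelihood minus the log-prior of theta.
# Fixed alpha and beta are assumed here for simplicity.
import numpy as np
from scipy.special import gammaln

def neg_log_posterior(params, X, y, alpha=2.0, beta=1.0):
    n = X.shape[1]
    theta, noise_var = np.exp(params[:n]), np.exp(params[n])
    d = X[:, None, :] - X[None, :, :]
    K = np.exp(-np.sum((theta * d) ** 2, axis=2))    # sigma^2 = 1 assumed
    A = K + noise_var * np.eye(len(X))
    _, logdet = np.linalg.slogdet(A)
    nll = 0.5 * (y @ np.linalg.solve(A, y) + logdet + len(X) * np.log(2 * np.pi))
    # formula (16): log p(theta | alpha, beta) for independent gamma priors
    log_prior = np.sum(alpha * np.log(beta) + (alpha - 1) * np.log(theta)
                       - beta * theta - gammaln(alpha))
    return nll - log_prior

rng = np.random.default_rng(6)
X = rng.uniform(size=(20, 2)); y = np.sin(4 * X[:, 0]) * X[:, 1]
print(neg_log_posterior(np.zeros(X.shape[1] + 1), X, y))
```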
3.3. Multilevel Bayesian Inference
In order for procedures based on the Bayesian approach to be usable for solving applied problems, it is necessary to provide reasonable values of the parameters of the prior distribution [14]. A researcher typically has no possibility to choose the parameters of the prior distribution for each problem, e.g., when the number of problems is large or when the algorithm is supposed to be used as a black box, i.e., when a user will not have appropriate information for choosing the parameters of the prior distribution in the Bayesian approach.

In this section, we suggest a multilevel Bayesian inference [26] for the model of regression on the basis of Gaussian processes, which helps to avoid the difficulty described above. The idea of this approach is based on introducing an additional hyperprior distribution on the parameters of the original Gaussian model. In this case, the model identification consists in tuning the parameters of the model, the parameters of the prior distribution, and the parameters of the hyperprior distribution using the principle of maximum posterior probability. In the considered case, the parameters of the hyperprior distribution influence the likelihood to a lesser extent, and the regularization introduced by this hyperprior distribution on the parameters of the first-level prior distribution prevents the latter from degenerating when optimizing (15) and, as a result, keeps the parameters of the covariance function from degenerating as well.
Thus, it is suggested to introduce

⎯ a prior gamma distribution $\Gamma(\alpha, \beta)$ on the parameters of the covariance function $\theta$;

⎯ a prior gamma distribution $\Gamma(\alpha_\sigma, \beta_\sigma)$ on the parameter of the covariance function $\sigma$;

⎯ a hyperprior lognormal distribution with preliminarily fixed parameters $\mu$ and $\sigma^2_{\mathrm{prior}}$ on each of the parameters $\alpha$, $\beta$, $\alpha_\sigma$, and $\beta_\sigma$ of the Bayesian regularization.

In this case, the joint distribution density is proportional to

$$p(y, \theta, \sigma^2, \alpha, \beta, \alpha_\sigma, \beta_\sigma \mid X, \mu, \sigma^2_{\mathrm{prior}}) \propto p(y \mid X, \theta, \sigma^2)\, \Gamma(\theta \mid \alpha, \beta)\, \Gamma(\sigma^2 \mid \alpha_\sigma, \beta_\sigma) \times \log\mathcal{N}(\alpha \mid \mu, \sigma^2_{\mathrm{prior}})\, \log\mathcal{N}(\beta \mid \mu, \sigma^2_{\mathrm{prior}}) \times \log\mathcal{N}(\alpha_\sigma \mid \mu, \sigma^2_{\mathrm{prior}})\, \log\mathcal{N}(\beta_\sigma \mid \mu, \sigma^2_{\mathrm{prior}}),$$

where $\Gamma(\cdot)$ is the density of the gamma distribution and $\log\mathcal{N}(\cdot)$ is the density of the lognormal distribution. Substituting the explicit expressions for the distributions, we obtain an explicit expression for the function to be maximized:

$$\log p(y, \theta, \sigma^2, \alpha, \beta, \alpha_\sigma, \beta_\sigma \mid X, \mu, \sigma^2_{\mathrm{prior}}) \propto -\frac{1}{2}\left[ N \log 2\pi + \log |K_{\theta,\sigma}| + y^T K_{\theta,\sigma}^{-1} y \right]$$
$$+ \left[ -\log \Gamma(\alpha) + \alpha \log \beta + (\alpha - 1) \sum_{i=1}^n \log \theta_i - \beta \sum_{i=1}^n \theta_i \right] + \left[ -\log \Gamma(\alpha_\sigma) + \alpha_\sigma \log \beta_\sigma + (\alpha_\sigma - 1) \log \sigma^2 - \beta_\sigma \sigma^2 \right]$$
$$- \frac{1}{2}\left[ 2 \log \alpha + \frac{(\log \alpha - \mu)^2}{\sigma^2_{\mathrm{prior}}} + 2 \log \beta + \frac{(\log \beta - \mu)^2}{\sigma^2_{\mathrm{prior}}} \right] - \frac{1}{2}\left[ 2 \log \alpha_\sigma + \frac{(\log \alpha_\sigma - \mu)^2}{\sigma^2_{\mathrm{prior}}} + 2 \log \beta_\sigma + \frac{(\log \beta_\sigma - \mu)^2}{\sigma^2_{\mathrm{prior}}} \right].$$
Here, $K_{\theta,\sigma}$ is the covariance matrix of the sample, and $\Gamma(\alpha)$ is the value of the gamma function at the argument $\alpha$.
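In code, the only change relative to the single-level MAP criterion is the addition of the lognormal hyperprior terms. A minimal sketch is given below; the name nll stands for the Gaussian-process term of (15), constants are dropped, and the default $\mu = 1.5$ matches the value used in Section 4.3:

```python
# A sketch of the multilevel criterion of Section 3.3: a lognormal hyperprior
# with fixed mu and sigma2_prior penalizes the hyperparameters alpha and beta
# of the gamma prior themselves, preventing, e.g., beta ~ 0.
import numpy as np
from scipy.special import gammaln

def log_hyperprior(a, mu=1.5, sigma2_prior=0.5):
    """Lognormal log-density of a hyperparameter, up to an additive constant."""
    return -np.log(a) - (np.log(a) - mu) ** 2 / (2 * sigma2_prior)

def neg_log_joint(theta, alpha, beta, nll):
    """nll is the Gaussian-process negative log-likelihood for given theta."""
    log_prior = np.sum(alpha * np.log(beta) + (alpha - 1) * np.log(theta)
                       - beta * theta - gammaln(alpha))   # formula (16)
    return nll - log_prior - log_hyperprior(alpha) - log_hyperprior(beta)

print(neg_log_joint(np.ones(3), alpha=2.0, beta=1.0, nll=10.0))
```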
4. EXPERIMENTAL RESULTS
4.1. Bayesian Regularization
In this section, we present two examples of application of Bayesian regularization for estimating the parameters of the covariance function of regression on the basis of Gaussian processes: the approximation of the Rastrigin function, and a comparison of the Bayesian estimates of the parameters with maximum-likelihood estimation on a large set of test functions.

In some cases, the maximum of the likelihood corresponds to values of the parameters that have no good physical interpretation. For example, consider the approximation of the Rastrigin function [27] by means of regression on the basis of Gaussian processes using 20 random points, with the parameters of the covariance function estimated by the maximum-likelihood method (the function is shown in Fig. 1, left). The approximation obtained is presented in Fig. 1 (center): the regression model consists of narrow peaks around the points of the training sample and cannot be used in practice. If the approximation is constructed using Bayesian regularization, either with the prior gamma distribution or with the prior normal distribution, then one obtains identical approximations of the form shown in Fig. 1 (right). We see that the approximation with Bayesian regularization reproduces the form of the Rastrigin function better, although the difference from the objective function is still rather large due to the small size of the training sample.

Fig. 1. The Rastrigin function and its approximation: (left) objective function, (center) its approximation with the maximum-likelihood estimate of the parameters, and (right) its approximation with the Bayesian estimates of the parameters.
In the experiments, we compared the performance of three algorithms for constructing an approximation based on Gaussian processes, namely:

⎯ MLE GP, regression on the basis of Gaussian processes with maximum-likelihood estimation of the parameters;

⎯ Gamma GP, the Bayesian method of estimation of the parameters with a prior gamma distribution;

⎯ Normal GP, the Bayesian method of estimation of the parameters with a prior normal distribution.

The algorithms were compared with the help of Dolan–More curves [28], which make possible an intuitive representation of the results of testing several algorithms on a set of problems. Suppose that there is a set of algorithms $a_i$, $i = 1, \ldots, A$, and a set of problems $t_j$, $j = 1, \ldots, T$. Then, for each algorithm, we can measure its performance (e.g., the mean absolute error on
the test sample) on each problem and obtain a set of performance values $q_{ij}$. Suppose that the algorithm providing the best result on a given problem has performance $q_j$, $j = 1, \ldots, T$, so that $\forall i = 1, \ldots, A$: $q_{ij} \geq q_j$. Define $r_{ij} = q_{ij}/q_j \geq 1$. The closer $r_{ij}$ is to unity, the better algorithm $a_i$ has performed on problem $t_j$. The parameter $\tau$ is varied from 1 to $\infty$. For each $\tau$, define a point $c_i(\tau)$ on the Dolan–More curve for the algorithm $a_i$ as the fraction of problems on which $r_{ij}$ is smaller than $\tau$. With an increase in $\tau$, $c_i(\tau)$ tends to unity, and the higher the curve $c_i(\tau)$ lies, the better the performance of the algorithm as compared to the other algorithms.

For demonstration of the experimental results, we used a large set of test functions for testing optimization methods [29–31]. The testing was performed using 30 different functions, for each of which two random training samples of 10, 20, 40, 80, and 160 points were generated. To measure the performance of the approximation algorithms, we used the 95th percentile of the absolute error on control samples containing 10000 test points. The results are presented in the form of Dolan–More curves in Fig. 2. The performance of each algorithm was estimated by the area under its Dolan–More curve; the values are provided in the caption of Fig. 2. It should be noted that both algorithms with Bayesian regularization have shown better results than the algorithm without regularization.

Fig. 2. Dolan–More curves for test functions; the areas under the curves are MLE GP: 1.926, Gamma GP: 1.930, and Normal GP: 1.951.
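For reference, the construction of the curves $c_i(\tau)$ admits a very short implementation; the data below are illustrative:

```python
# A sketch of the Dolan-More performance profiles [28]: r_ij = q_ij / q_j,
# and c_i(tau) is the fraction of problems with r_ij below tau.
import numpy as np

def dolan_more(Q, taus):
    """Q[i, j] = error of algorithm i on problem j (smaller is better)."""
    R = Q / Q.min(axis=0, keepdims=True)             # r_ij = q_ij / q_j >= 1
    return np.array([(R <= tau).mean(axis=1) for tau in taus]).T   # c_i(tau)

Q = np.array([[1.0, 2.0, 1.5],                       # algorithm 1 on three problems
              [1.2, 1.0, 1.0]])                      # algorithm 2
curves = dolan_more(Q, taus=np.linspace(1, 3, 21))
print(curves[:, 0], curves.sum(axis=1))              # curve starts; area proxies
```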
4.2. Nonstationary Covariance Function
In this section, we present two examples of using a nonstationary covariance function: the approximation of a spatially nonuniform one-dimensional function, and a comparison of three different implementations of regression on the basis of Gaussian processes on a large set of different functions.

Let us consider a one-dimensional example in which the use of a nonstationary covariance function makes it possible to significantly improve the quality of approximation. We approximate the Heaviside function on the interval [0, 1] from 20 random points of this interval.

The graphs of the approximations with the use of a stationary and a nonstationary covariance function are presented in Fig. 3. The use of a nonstationary covariance function substantially increases the quality of approximation.

Fig. 3. Approximation of the Heaviside function: (left) the approximation obtained with the use of a stationary covariance function (panel "MacrosGP prediction") and (right) the approximation obtained with the use of a nonstationary covariance function (panel "MacrosGP_HDA prediction").
In the experiment, we compared the work of three algorithms based on Gaussian processes, namely:

⎯ DACE, the implementation of regression on the basis of Gaussian processes in the DACE package [15];
⎯ GP, the algorithm on the basis of Gaussian processes without the nonstationary covariance function, developed by the authors of this paper;

⎯ nonstationary GP, the algorithm on the basis of Gaussian processes with the nonstationary covariance function, developed by the authors of this paper.
To demonstrate the experimental results, we used a large set of test functions applied for testing optimization methods [29–31]. The testing was performed on 23 different functions, for each of which two random training samples of 160 points were generated. We used the mean error (1) on test samples with 10000 points to measure the approximation accuracy. The results are presented in the form of Dolan–More curves [28] in Fig. 4. The higher the curve in the graph, the higher the performance of the corresponding algorithm. The proposed implementation of the approximator on the basis of Gaussian processes is significantly better than DACE, and the addition of the nonstationary covariance function substantially improves the quality of the results.

Fig. 4. Dolan–More curves for test functions (DACE, GP, and nonstationary GP; abscissa: Err(RSM)/Err(Best); ordinate: fraction of functions, 46 in all).
4.3. Multilevel Bayesian Regularization
As follows from the results of testing (Section 4.1), the algorithm of Bayesian inference proposed in Section 3.2 increases, on average, the approximation accuracy and its robustness. However, when optimizing (15) with respect to the parameters $\gamma = (\alpha, \beta)$ of the prior distribution, situations can occur in which, e.g., $\beta \approx 0$. As a result, the regularization stops penalizing large values of the parameters $\theta = (\theta_i)_{i=1}^n$, and the resulting approximation can degenerate into a constant with peaks at the points of the training sample (see the example in Fig. 1). The multilevel Bayesian inference described in Section 3.3, for which results are presented in this section, makes it possible to avoid such situations.

The testing was performed on a large set of functions of different input dimensions. The parameter $\mu$ of the lognormal distribution was taken equal to 1.5. Two measures of the quality of approximation were used: the mean-square error, and a measure of the variability of an approximation that makes it possible to detect its degeneration. The variability was estimated from the values of $\hat{y} = \hat{f}(X^*)$ and $y^*$ by the formula

$$\text{variability} = \frac{\max(\hat{y}) - \min(\hat{y})}{\max(y^*) - \min(y^*)},$$

where $\max(y)$ is the maximum value of the vector $y$.
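The variability measure and the associated degeneration test admit a one-line implementation; the threshold $10^{-4}$ below is the one used in our testing, and the data are illustrative:

```python
# A sketch of the variability measure of Section 4.3, used to detect
# degeneration of an approximation into a constant.
import numpy as np

def variability(y_hat, y_test):
    return (y_hat.max() - y_hat.min()) / (y_test.max() - y_test.min())

y_test = np.linspace(0.0, 2.0, 100)
print(variability(np.full(100, 0.7), y_test) < 1e-4)   # constant model: degenerate
print(variability(y_test + 0.1, y_test) < 1e-4)        # non-degenerate model
```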
670
JOURNAL OF COMMUNICATIONS TECHNOLOGY AND ELECTRONICS Vol. 61
No. 6
2016
BURNAEV et al.
For each sample size, we counted the number of problems for which the variability was smaller than $10^{-4}$ (in such cases, the approximation is considered to have degenerated into a constant).
The parameter $\sigma^2_{\mathrm{prior}}$ in the experiments was taken equal to 1.5, 1, 0.5, 0.3, and 0.2. The Dolan–More curves constructed from the values of the mean-square error are presented in Fig. 5. We see that, with a decrease in $\sigma^2_{\mathrm{prior}}$, the accuracy of approximation decreases. Table 1 presents the number of cases when the constructed model had an approximation error close to the error of approximation by a constant. Already for $\sigma^2_{\mathrm{prior}} = 0.5$, there are no cases of degeneration of the regression models, while the mean-square error of the model remains approximately the same as in the ordinary Bayesian approach.

Table 1. Choice of the parameter $\sigma^2_{\mathrm{prior}}$: the number of problems on which the variability of the approximation was smaller than $10^{-4}$, i.e., degeneration of the resulting approximation occurred. A total of 190 test problems was considered for each size $N$ of the training sample, which varied from 10 to 320 points

Sample size | Bayesian regularization | Multilevel, $\sigma^2_{\mathrm{prior}}$ = 1.5 | 1 | 0.5 | 0.3 | 0.2
10  | 16 | 0  | 0 | 0 | 0 | 0
20  | 2  | 0  | 0 | 0 | 0 | 0
40  | 31 | 0  | 0 | 0 | 0 | 0
80  | 37 | 14 | 0 | 0 | 0 | 0
160 | 24 | 15 | 1 | 0 | 0 | 0
320 | 3  | 1  | 0 | 0 | 0 | 0

Fig. 5. Dolan–More curves for the Bayesian regularization and for the multilevel Bayesian regularization for different values of $\sigma^2_{\mathrm{prior}}$ (abscissa: $\log_{10}(\text{error})$; ordinate: fraction of problems).

Thus, the suggested approach makes it possible to avoid degenerate models of regression on the basis of Gaussian processes without deteriorating the quality of the model on average.
5. CONCLUSIONS
In this work, we have proposed two methods for improving the accuracy of regression models on the basis of Gaussian processes: the use of Bayesian regularization and the use of a new nonstationary covariance function constructed from the data.

The prior distributions of the parameters for the Bayesian regularization were the normal and gamma distributions. Both prior distributions make it possible to avoid degenerate approximations and to increase the generalization ability and robustness of the algorithm.

The nonstationary covariance function of the Gaussian process is constructed on the basis of a dictionary of parametric basis functions adjusted using the data. In contrast to other implementations of approximators on the basis of Gaussian processes, such as DACE [15], the suggested approach is fully automatic and does not require any knowledge of machine learning from the engineer who applies the algorithm in practice. The experimental results demonstrate the high performance of the proposed algorithm as compared to traditional methods, especially for spatially nonuniform functions.
ACKNOWLEDGMENTS
This work was performed at the Institute for Information Transmission Problems, Russian Academy of Sciences, and was supported exclusively by the Russian Science Foundation, project no. 14-50-00150.
REFERENCES
1. A. I. J. Forrester, A. Sóbester, and A. J. Keane, Engineering Design via Surrogate Modelling: A Practical Guide, Progress in Astronautics and Aeronautics (Wiley, New York, 2008).
2. A. A. Zaytsev, E. V. Burnaev, and V. G. Spokoiny, "Properties of the Bayesian parameter estimation of a regression based on Gaussian processes," J. Math. Sci. 203, 789–798 (2014).
3. E. V. Burnaev, A. A. Zaytsev, and V. G. Spokoiny, "The Bernstein–von Mises theorem for regression based on Gaussian processes," Russ. Math. Surv. 68, 954–956 (2013).
4. E. Burnaev and M. Panov, "Adaptive design of experiments based on Gaussian processes," in Lecture Notes in Artificial Intelligence (Proc. SLDS, April 20–23, London, 2015), Ed. by A. Gammerman et al. (Springer, UK, 2015), Vol. 9047, pp. 116–126.
5. M. Belyaev, E. Burnaev, and Y. Kapushev, "Gaussian process regression for structured data sets," in Lecture Notes in Artificial Intelligence (Proc. SLDS, April 20–23, London, 2015), Ed. by A. Gammerman et al. (Springer, UK, 2015), Vol. 9047, pp. 106–115.
6. A. Giunta and L. T. A. Watson, "Comparison of approximation modeling techniques: Polynomial versus interpolating models," in Proc. 7th AIAA/USAF/NASA/ISSMO Symp. on Multidisciplinary Analysis and Optimization, 1998 (AIAA, 1998), Vol. 1, pp. 392–404.
7. T. W. Simpson, A. J. Booker, S. Ghosh, A. Giunta, P. N. Koch, and R. J. Yang, "Approximation methods in multidisciplinary analysis and optimization: a panel discussion," Struct. Multidiscip. Optim. 27, 302–313 (2004).
8. S. M. Batill, J. E. Renaud, and X. Gu, "Modeling and simulation uncertainty in multidisciplinary design optimization," in Proc. 8th AIAA/USAF/NASA/ISSMO Symp. on Multidisciplinary Analysis and Optimization, 2000 (AIAA, 2000), pp. 1–11.
9. J. E. Pacheco, C. H. Amon, and S. Finger, "Bayesian surrogates applied to conceptual stages of the engineering design process," J. Mech. Design 125, 664–672 (2003).
10. A. J. Booker, J. E. Dennis, P. D. Frank, D. B. Serafini, V. Torczon, and M. Trosset, "A rigorous framework for optimization of expensive functions by surrogates," Struct. Optim. 17, 1–13 (1999).
11. P. N. Koch, B. A. Wujek, O. Golovidov, and T. W. Simpson, "Facilitating probabilistic multidisciplinary design optimization using kriging approximation models," in Proc. 9th AIAA/ISSMO Symp. on Multidisciplinary Analysis and Optimization, 2002 (AIAA, 2002).
12. T. W. Simpson, T. M. Maurey, J. J. Korte, and F. Mistree, "Kriging metamodels for global approximation in simulation-based multidisciplinary design optimization," AIAA J. 39, 2223–2241 (2001).
13. T. W. Simpson, T. M. Maurey, J. J. Korte, and F. Mistree, "Metamodeling development for vehicle frontal impact simulation," in Proc. American Society of Mechanical Engineers (ASME) Design Engineering Technical Conf.–Design Automation Conf., DETC2001/DAC-21012, Sept. 2001 (ASME, 2001).
14. C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (MIT Press, Cambridge, MA, 2006).
15. S. N. Lophaven, H. B. Nielsen, and J. Sondergaard, Aspects of the Matlab Toolbox DACE, Tech. Rep. IMM-REP-2002-13 (Tech. Univ. Denmark, Lyngby, 2002).
16. A. Bhattacharya, D. Pati, and D. Dunson, "Anisotropic function estimation using multi-bandwidth Gaussian processes," Ann. Statist. 42, 352–381 (2014).
17. Y. Xiong, W. Chen, D. Apley, and X. Ding, "A non-stationary covariance-based kriging method for metamodelling in engineering design," Int. J. Numer. Methods Eng. 71, 733–756 (2007).
18. B. Nagy, J. L. Loeppky, and W. J. Welch, Correlation Parameterization in Random Function Models to Improve Normal Approximation of the Likelihood or Posterior, Tech. Rep. 229 (Department of Statistics, Univ. British Columbia, 2007).
19. B. MacDonald, P. Ranjan, and H. Chipman, "GPfit: an R package for fitting a Gaussian process model to deterministic simulator outputs," J. Statist. Software 64 (12), 1–23 (2015).
20. E. V. Burnaev, M. G. Belyaev, and P. V. Prihodko, "About hybrid algorithm for tuning of parameters in approximation based on linear expansions in parametric functions," in Proc. 8th Int. Conf. on Intelligent Information Processing (IIP-2010), Paphos, Cyprus, Oct. 17–24, 2010 (MAKS, Moscow, 2011), pp. 200–203.
21. M. Belyaev and E. Burnaev, "Approximation of a multidimensional dependency based on a linear expansion in a dictionary of parametric functions," Inform. Appl. 7, 114–125 (2013).
22. E. Burnaev and P. Prikhodko, "On a method for constructing ensembles of regression models," Avtom. Telemekh. 74, 1630–1644 (2013).
23. S. Grihon, E. Burnaev, M. Belyaev, and P. Prikhodko, "Surrogate modeling of stability constraints for optimization of composite structures," in Springer Series in Computational Science & Engineering (Springer, New York, 2013), pp. 359–391.
24. N. A. C. Cressie, Statistics for Spatial Data (Wiley, New York, 1993).
25. E. Burnaev, E. Zaytsev, M. Panov, P. Prihodko, and Yu. Yanovich, "Modeling of nonstationary covariance function of Gaussian process using decomposition in dictionary of nonlinear functions," in Information Technologies and Systems 2011 (Proc. Conf., Moscow, Russia, Oct. 2–7, 2011) (Inst. Problem Peredachi Inf., Moscow, 2011), pp. 355–362.
26. C. M. Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006).
27. A. Torn and A. Zilinskas, Global Optimization, Lecture Notes in Computer Science (Springer-Verlag, Berlin, 1989), Vol. 350.
28. E. D. Dolan and J. J. More, "Benchmarking optimization software with performance profiles," Math. Program. 91, 201–213 (2002).
29. A. Saltelli and I. M. Sobol, "About the use of rank transformation in sensitivity analysis of model output," Reliab. Eng. Syst. Safety 50, 225–239 (1995).
30. J. Ronkkonen and J. Lampinen, "An extended mutation concept for the local selection based differential evolution algorithm," in Genetic and Evolutionary Computation Conf. (GECCO 2007), London, England, July 7–11, 2007 (ACM Press, New York, 2007).
31. T. Ishigami and T. Homma, "An importance quantification technique in uncertainty analysis for computer models," in Proc. 1st Int. Symp. on Uncertainty Modelling and Analysis (ISUMA '90), Univ. Maryland, USA, December 3–5, 1990 (IEEE, New York, 1990), pp. 398–403.
Translated by E. Chernokozhin