ISSN 10642269, Journal of Communications Technology and Electronics, 2016, Vol. 61, No. 6, pp. 661–671. © Pleiades Publishing, Inc., 2016.
Original Russian Text © E.V. Burnaev, M.E. Panov, A.A. Zaytsev, 2015, published in Informatsionnye Protsessy, 2015, Vol. 15, No. 3, pp. 298–313.
661
1. INTRODUCTION
One of the main problems solved when constructing surrogate models (metamodels, data-based models) is the problem of approximating an unknown dependence from data (constructing the regression function) [1], which can be described as follows. Let $y = f(x)$ be an unknown function with the input $x \in \mathbb{R}^n$ and the output $y \in \mathbb{R}$. The values of this function are known at the points of a training sample

$D_{\mathrm{train}} = (X, \mathbf{y}) = \{(x_i, y_i = f(x_i)),\ i = 1, \dots, N\}.$

It is required to construct an approximation $\hat f(x) \approx f(x)$ for the objective function $f(x)$ using the training sample $D_{\mathrm{train}}$. The quality of approximation is estimated using an independent test sample

$D_{\mathrm{test}} = (X_*, \mathbf{y}_*) = \{(x_j^*, y_j^* = f(x_j^*)),\ j = 1, \dots, N_*\},$

e.g., by the mean error

$\varepsilon(\hat f \mid D_{\mathrm{test}}) = \frac{1}{N_*} \sum_{j=1}^{N_*} \big| y_j^* - \hat f(x_j^*) \big|, \quad (1)$

and by the 95th and 99th percentiles of the absolute error of approximation.
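As a plain-Python illustration (not part of the paper itself), the mean error (1) and the error percentiles can be computed as follows; the function name and test values are ours.

```python
import numpy as np

def approximation_errors(y_true, y_pred):
    # Mean error (1): average absolute deviation on the test sample,
    # together with the 95th and 99th percentiles of the absolute error.
    abs_err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return abs_err.mean(), np.percentile(abs_err, 95), np.percentile(abs_err, 99)

mean_err, p95, p99 = approximation_errors([0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0])
```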
A widely used approach to describing the class of approximation models in this formulation is based on introducing a prior distribution on a space of functions. It is standard practice to use a prior Gaussian distribution, as a result of which the problem is
reduced to a nonparametric regression on the basis of
Gaussian processes [2–8]. This method is widely used
for solving applied problems, including conceptual
design [9], structural optimization [10], multicriterion
design optimization [11], design in aerospace [12] and
automobile [13] industries, etc.
The general approach to the construction of Gaussian process regression is described in the literature [14], but little attention has been paid to its practical implementation. Widely used software packages, e.g., DACE [15], in some cases produce degenerate approximations, which can hardly be interpreted from the viewpoint of the original problem. Available theoretical results [16] show that choosing the gamma distribution as a prior distribution on the parameters of the covariance function leads to a minimax-optimal rate of concentration of the posterior distribution; in this case, such a prior distribution plays the role of a regularization that potentially prevents the degeneration of the approximation model. However, the practical implementation of the corresponding regularization algorithm requires an accurate choice of the next-level hyperprior distribution on the parameters of the gamma distribution and a reliable optimization procedure that makes it possible to obtain a robust local maximum of the likelihood. In this work, it is suggested to introduce a hyperprior lognormal distribution with fixed parameters for the parameters of the prior gamma distribution and to optimize the joint likelihood of the data and the model with respect both to the parameters of the covariance function and the parameters of the prior gamma distribution. The results of experiments show that this approach makes it possible to substantially increase the accuracy of approximation and to avoid its degeneration.

Regression on the Basis of Nonstationary Gaussian Processes with Bayesian Regularization

E. V. Burnaev, M. E. Panov, and A. A. Zaytsev
Institute for Information Transmission Problems, Russian Academy of Sciences, Bol'shoi Karetnyi per. 19, str. 1, Moscow, 127994 Russia
e-mail: burnaev@iitp.ru
Received June 26, 2015

Abstract—We consider the regression problem, i.e., prediction of a real-valued function. A Gaussian process prior is imposed on the function and is combined with the training data to obtain predictions at new points. We introduce a Bayesian regularization on the parameters of the covariance function of the process, which increases the quality of approximation and the robustness of the estimation. We also propose an approach to modeling a nonstationary covariance function of a Gaussian process on the basis of a linear expansion in a parametric functional dictionary. Introducing such a covariance function makes it possible to model functions with non-homogeneous behaviour. Combining the above features with a careful optimization of the covariance function parameters results in a unified approach that can be easily implemented and applied. The resulting algorithm is an out-of-the-box solution to regression problems, with no need to tune parameters manually. The effectiveness of the method is demonstrated on various datasets.

Keywords: Gaussian processes, regression, Bayesian regularization, a priori distribution, Bayesian regression
DOI: 10.1134/S1064226916060061
MATHEMATICAL MODELS AND COMPUTATIONAL METHODS
Another important feature of the general approach to regression on the basis of Gaussian processes is that models with a stationary covariance function are mainly used [14]. The existing approaches to nonstationary covariance function modeling are, as a rule, excessively parameterized [17], which leads to high computational costs when estimating the model parameters and results in instability of the corresponding optimization procedure. In the present work, we suggest a method of nonstationary covariance function modeling on the basis of a linear decomposition in a dictionary of parametric functions. With a given dictionary, the proposed procedure makes it possible to automatically estimate the contributions of the stationary and the nonstationary parts of the covariance function to the total variability of the approximation model.
The work consists of the following parts. In Section 2, the regression on the basis of Gaussian processes and our method for constructing a nonstationary covariance function are described. Section 3 describes the algorithm proposed for estimating the parameters of the regression on the basis of Gaussian processes by means of the Bayesian approach. Section 4 presents the results of numerical experiments.
2. APPROXIMATOR ON THE BASIS OF A GAUSSIAN PROCESS

The regression on the basis of Gaussian processes makes it possible to construct nonlinear approximation models and yields an analytical expression for estimating the uncertainty of approximation at a given point $x$.
A Gaussian process $f(\cdot)$ is a stochastic process for which each spatial point $x$ is associated with a certain normal random variable; moreover, each finite set of values of a Gaussian process at different spatial points has a multidimensional normal distribution. Thus, a random process $f(\cdot)$ is defined by the mean $m(x) = \mathbb{E}[f(x)]$ and the covariance function $k(x, x') = \mathbb{E}[f(x) f(x')]$. For the zero mean $m(x) \equiv 0$ and a fixed covariance function, the posterior mean of the Gaussian process at the points of a test sample $X_*$ has the form [14]

$\hat f(X_*) = K_* K^{-1} \mathbf{y}, \quad (2)$

where $K_* = K(X_*, X) = [k(x_j^*, x_i),\ i = 1, \dots, N;\ j = 1, \dots, N_*]$ and $K = K(X, X) = [k(x_i, x_j),\ i, j = 1, \dots, N]$.

Usually, in simulations, one has access only to noisy values of the function:

$y(x) = f(x) + \varepsilon(x), \quad (3)$

where the noise $\varepsilon(x)$ is modelled by Gaussian white noise with zero mean and variance $\tilde\sigma^2$. In this case, the observations $y(x)$ form a Gaussian process with zero mean and covariance function $\mathrm{cov}(y(x), y(x')) = k(x, x') + \tilde\sigma^2$. Then the posterior mean of the Gaussian process $f(x)$ at the points of a test sample $X_*$ has the form

$\hat f(X_*) = K_* (K + \tilde\sigma^2 I_N)^{-1} \mathbf{y}, \quad (4)$

where $I_N$ is the $N \times N$ identity matrix. The posterior covariance of the Gaussian process at the points of the test sample has the form

$\mathbb{V}[\hat f(X_*)] = K(X_*, X_*) + \tilde\sigma^2 I_{N_*} - K_* (K + \tilde\sigma^2 I_N)^{-1} K_*^T, \quad (5)$

where $K(X_*, X_*) = [k(x_i^*, x_j^*),\ i, j = 1, \dots, N_*]$.

The noise variance $\tilde\sigma^2$ in formula (4) regularizes the covariance matrix $K$ and makes it possible to control the generalization ability of the approximator. The diagonal elements of the matrix $\mathbb{V}[\hat f(X_*)]$ from formula (5) can be used as estimates of the expected approximation error at the corresponding points $X_*$.
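Formulas (2)–(5) can be sketched in NumPy as follows. The squared-exponential kernel, the data, and all parameter values are illustrative choices of ours, not the paper's tuned settings; a Cholesky solve replaces the explicit inverse for numerical stability.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0, theta=1.0):
    # Squared-exponential covariance k(x, x') = sigma^2 * exp(-theta^2 * ||x - x'||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma ** 2 * np.exp(-theta ** 2 * d2)

def gp_predict(X, y, X_star, noise_var=1e-2):
    # Posterior mean (4) and covariance (5) of a zero-mean GP at test points X_star.
    K = rbf_kernel(X, X)                    # K = [k(x_i, x_j)]
    K_star = rbf_kernel(X_star, X)          # K_* = [k(x_j^*, x_i)]
    A = K + noise_var * np.eye(len(X))      # K + noise * I_N
    L = np.linalg.cholesky(A)               # stable solve instead of explicit inverse
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_star @ alpha
    V = np.linalg.solve(L, K_star.T)
    cov = rbf_kernel(X_star, X_star) + noise_var * np.eye(len(X_star)) - V.T @ V
    return mean, cov

X = np.linspace(0, 1, 8)[:, None]
y = X.ravel() ** 2
mean, cov = gp_predict(X, y, X)
```

For a small noise variance the posterior mean nearly interpolates the training data, and the diagonal of `cov` gives the pointwise uncertainty estimates mentioned above.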
2.1. Estimating the Parameters of the Gaussian Process

When working with real data, the covariance function of the Gaussian process that generated the data is not known. Assume that the covariance function of the Gaussian process belongs to a parametric family $k(x, x') = k(x, x' \mid a)$, where $a \in A \subseteq \mathbb{R}^K$ is the vector of parameters of the covariance function. The role of the family $k(x, x' \mid a)$ is often played by the class of stationary covariance functions, i.e., functions depending only on the difference of the arguments: $k(x, x' \mid a) =$
$k(x - x' \mid a)$. The parameter $a$ is estimated from a training sample $D_{\mathrm{train}}$ by the maximum-likelihood method. The logarithm of the likelihood of a Gaussian process at the points of a training sample [14] has the form

$\log p(\mathbf{y} \mid X, a, \tilde\sigma) = -\frac{1}{2} \left( \mathbf{y}^T (K + \tilde\sigma^2 I_N)^{-1} \mathbf{y} + \log |K + \tilde\sigma^2 I_N| + N \log 2\pi \right), \quad (6)$

where $|K + \tilde\sigma^2 I_N|$ is the determinant of the matrix $K + \tilde\sigma^2 I_N$. The logarithm of the likelihood depends on the parameters $a$ of the covariance function and on the noise variance $\tilde\sigma^2$. Thus, the estimates of $a$ and $\tilde\sigma$ are obtained by maximizing the logarithm of the likelihood:

$\max_{a, \tilde\sigma} \log p(\mathbf{y} \mid X, a, \tilde\sigma). \quad (7)$

The choice of a particular family of covariance functions $k(x, x' \mid a)$ is usually determined by prior knowledge of the properties of the function to be approximated. In this work, we use covariance functions from the family

$k_{p,q}(x, \tilde x \mid a) = \sigma^2 \exp \left( - \left( \sum_{i=1}^n \theta_i^2 |x_i - \tilde x_i|^p \right)^q \right), \quad a = \{\theta = (\theta_i)_{i=1}^n, \sigma\}, \quad p \in (0, 2], \quad q \in (0, 1].$

In this case, it is traditionally recommended to use the parameterization $\theta_i = e^{\tau_i}$, $i = 1, \dots, n$, and to optimize the likelihood with respect to the parameters $\tau = (\tau_i)_{i=1}^n$ (see [18, 19]).

It should be noted that, besides the above-described family of covariance functions, one may consider the family of covariance functions based on the Mahalanobis distance

$k(x, \tilde x \mid \Lambda, \sigma) = \sigma^2 \exp \left( -(x - \tilde x)^T \Lambda^2 (x - \tilde x) \right),$

where $\Lambda^2$ is some positive-definite matrix. This family of covariance functions enables one to approximate a wide class of dependences, but the quadratic dependence of the number of parameters on the dimension of the input space makes likelihood optimization a computationally complex problem; the construction of an efficient algorithm for solving this problem deserves special consideration.

The use of a stationary covariance function limits the accuracy of approximation if the modelled dependence is spatially nonuniform. Below we suggest a method for constructing a nonstationary covariance function for a Gaussian process.
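A minimal sketch of the likelihood maximization (7) for the family $k_{p,q}$ with the $\theta_i = e^{\tau_i}$ parameterization, using SciPy's general-purpose optimizer; the synthetic data, the starting point, the box bounds, and the small jitter term are our illustrative choices, not the paper's procedure.

```python
import numpy as np
from scipy.optimize import minimize

def kernel_pq(A, B, theta, sigma2, p=2.0, q=1.0):
    # k_{p,q}(x, x~ | a) = sigma^2 * exp(-(sum_i theta_i^2 |x_i - x~_i|^p)^q)
    d = np.abs(A[:, None, :] - B[None, :, :]) ** p
    return sigma2 * np.exp(-((d * theta ** 2).sum(axis=-1)) ** q)

def neg_log_likelihood(params, X, y):
    # params = (tau_1, ..., tau_n, log sigma^2, log noise); theta_i = e^{tau_i} > 0
    n = X.shape[1]
    theta = np.exp(params[:n])
    sigma2, noise = np.exp(params[n]), np.exp(params[n + 1])
    K = kernel_pq(X, X, theta, sigma2) + (noise + 1e-8) * np.eye(len(X))
    L = np.linalg.cholesky(K)  # K is symmetric positive definite after adding noise
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # Negative of the log-likelihood (6): 0.5 * (y^T K^{-1} y + log|K| + N log 2 pi)
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1]
x0 = np.zeros(X.shape[1] + 2)
res = minimize(neg_log_likelihood, x0, args=(X, y),
               method="L-BFGS-B", bounds=[(-10.0, 10.0)] * len(x0))
```

The bounds keep the exponentiated parameters in a numerically safe range during the search.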
2.2. The Model of a Nonstationary Gaussian Process

Consider a nonstationary Gaussian process of the following form:

$y = f(x) + \varepsilon(x), \quad (8)$

where $f(x) = \sum_{i=1}^{Q} \alpha_i \psi_i(x) + \tilde f(x)$; here $\tilde f(x)$ is a Gaussian process with zero mean and a stationary covariance function $k(x, x' \mid a)$, the coefficients $\{\alpha_i,\ i = 1, \dots, Q\}$ are independent identically distributed normal random variables with zero mean and variance $\sigma^2$, and $\{\psi_i(x),\ i = 1, \dots, Q\}$ is some set of functions. For such a process, the covariance function takes the form

$\mathrm{cov}(y(x), y(x')) = k(x, x' \mid a) + \sigma^2 \psi(x)^T \psi(x') + \tilde\sigma^2, \quad (9)$

where $\psi(x) = \{\psi_i(x),\ i = 1, \dots, Q\}$. Introduce the notation $\Psi(X) = \{\psi(x_i),\ i = 1, \dots, N\}$, $\Psi(X_*) = \{\psi(x_j^*),\ j = 1, \dots, N_*\}$, $\bar K_* = K_* + \sigma^2 \Psi(X_*)^T \Psi(X)$, and $\bar K = K + \sigma^2 \Psi(X)^T \Psi(X)$. Then the posterior mathematical expectation of process (8) at the test sample points has the form

$\hat f(X_*) = \bar K_* (\bar K + \tilde\sigma^2 I_N)^{-1} \mathbf{y}. \quad (10)$

Correspondingly, the posterior covariance of the predictions at the test sample points has the form

$\mathbb{V}[\hat f(X_*)] = K(X_*, X_*) + \tilde\sigma^2 I_{N_*} + \sigma^2 \Psi(X_*)^T \Psi(X_*) - \bar K_* (\bar K + \tilde\sigma^2 I_N)^{-1} \bar K_*^T. \quad (11)$

The logarithm of the likelihood of Gaussian process (8) coincides with expression (6) in which the matrix $K$ is replaced with the matrix $\bar K$:

$\log p(\mathbf{y} \mid X, a, \tilde\sigma, \sigma) = -\frac{1}{2} \mathbf{y}^T (\bar K + \tilde\sigma^2 I_N)^{-1} \mathbf{y} - \frac{1}{2} \log |\bar K + \tilde\sigma^2 I_N| - \frac{N}{2} \log 2\pi. \quad (12)$

If the set of basis functions is specified, then, for the identification of the approximation model, it is necessary to maximize the logarithm of the likelihood (12) with respect to the parameters $a$ of the covariance
function $k(x, x' \mid a)$ and with respect to the parameters $\tilde\sigma$ and $\sigma$, i.e.,

$\max_{a, \tilde\sigma, \sigma} \log p(\mathbf{y} \mid X, a, \tilde\sigma, \sigma). \quad (13)$

It should be noted that, in the optimization with respect to the parameters $\tilde\sigma$ and $\sigma$, the contributions of the stationary and nonstationary parts of the covariance function to the total variability of the approximation model will be determined automatically.

Constructing the set of functions $\{\psi_i(x),\ i = 1, \dots, Q\}$ from a training sample is an independent problem. In the series of papers [20–23], it was suggested to approximate a multidimensional dependence by a linear decomposition in a dictionary of parametric functions of different types:

1. Sigmoid basis functions $\psi_j(x) = h\left( \sum_{i=1}^n \beta_{j,i} x_i \right)$, where $x = (x_1, \dots, x_n)$, $h(z) = \frac{g(z)}{g(\gamma_j)} (1 + \mathrm{sign}(z))$, $g(v) = \frac{e^v - 1}{e^v + 1}$, and $\gamma_j$ and $\beta_{j,i}$ are adjustable parameters;

2. Radial basis functions $\psi_j(x) = \exp\left( -\|x - d_j\|^2 / t_j^2 \right)$, where $d_j$ and $t_j$ are adjustable parameters;

3. Linear basis functions $\psi_j(x) = x_j$, $j = 1, \dots, n$.

The parameters of the functions from the dictionary are adjusted by methods of gradient optimization, boosting, and bagging [20–23]. It is the dictionary obtained in this manner that is suggested for modeling a nonstationary Gaussian process.
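Covariance (9) can be sketched as a stationary kernel plus a dictionary term. The dictionary below mixes radial and linear basis functions, as in items 2 and 3 above; the fixed centers, widths, variances, and the squared-exponential stationary part are hypothetical choices of ours.

```python
import numpy as np

def stationary_k(A, B, theta=3.0, s2=1.0):
    # Stationary part k(x, x' | a): squared-exponential with fixed parameters
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return s2 * np.exp(-theta ** 2 * d2)

def dictionary(A, centers, widths):
    # psi(x): radial basis functions exp(-||x - d_j||^2 / t_j^2) plus linear terms x_j
    r2 = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.hstack([np.exp(-r2 / widths ** 2), A])

def nonstationary_k(A, B, centers, widths, sigma2=1.0):
    # Eq. (9) without the noise term: k(x, x') + sigma^2 * psi(x)^T psi(x')
    Pa, Pb = dictionary(A, centers, widths), dictionary(B, centers, widths)
    return stationary_k(A, B) + sigma2 * Pa @ Pb.T

centers = np.array([[0.3], [0.7]])
widths = np.array([0.1, 0.1])
X = np.linspace(0, 1, 5)[:, None]
K = nonstationary_k(X, X, centers, widths)
```

Since the result is a sum of positive semidefinite kernels, the resulting covariance matrix remains symmetric and positive semidefinite.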
2.3. Modeling of a Multidimensional Output

In the previous sections, we considered the construction of an approximator based on a nonstationary Gaussian process regression model for the case of a one-dimensional output $y \in \mathbb{R}^1$. In this section, we extend these results to the case of a multidimensional output $\mathbf{y} = (y_1, \dots, y_m) \in \mathbb{R}^m$, $m > 1$. The simplest approach consists in constructing an independent approximator for each output component $y_l$, $l = 1, \dots, m$. However, this approach does not take into account the cross-correlations between the output components. Modeling not only the correlational properties of each component but also the cross-correlations between components can substantially improve the accuracy of approximation; this can be implemented, e.g., by the cokriging algorithm [24].
In this work, we suggest an intermediate approach, which presumes that the Gaussian processes simulating different output components are independent but have a common covariance function. Assume that, for the $l$-th output component,

$y_l(x) = \sum_{i=1}^{Q} \alpha_{i,l} \psi_i(x) + \tilde f_l(x) + \varepsilon_l(x),$

where the covariance function of the corresponding Gaussian process $\tilde f_l(x)$ has the form

$k_l(x, \tilde x \mid a_l) = \sigma_l^2 \exp \left( - \left( \sum_{i=1}^n \theta_i^2 |x_i - \tilde x_i|^p \right)^q \right), \quad a_l = \{\theta = (\theta_i)_{i=1}^n, \sigma_l\}, \quad p \in (0, 2], \quad q \in (0, 1],$

with different scaling factors $\sigma_l^2$ for different output components and the parameters $\theta = (\theta_i)_{i=1}^n$ shared by all output components. Assume that the coefficients $\{\alpha_{i,l},\ i = 1, \dots, Q\}_{l=1}^m$ are independent identically distributed normal random variables with zero mean and variance $\sigma^2$, and that the processes $\varepsilon_l(x)$, $l = 1, \dots, m$, are mutually independent, independent from $\{\tilde f_l(x)\}_{l=1}^m$, and modelled by white Gaussian noise with variance $\tilde\sigma^2$.

The likelihood of such a model equals the sum of the likelihoods (12) over all output channels. The optimal values of the parameters are obtained by optimizing the log-likelihood:

$\max_{\{\sigma_l, a_l\}_{l=1}^m, \sigma, \tilde\sigma} \log p(Y \mid X, \tilde\sigma, \{a_l, \sigma_l\}_{l=1}^m) = \sum_{l=1}^m \log p(\mathbf{y}_l \mid X, \tilde\sigma, a_l, \sigma_l). \quad (14)$

This approach offers a computationally efficient method for constructing a model of a dependence taking into account cross-correlations between different output components. A drawback of this approach is that, although the posterior mean values of the corresponding Gaussian processes are different for different output components,

$\hat f_l(X_*) = \bar K_{*,l} (\bar K_l + \tilde\sigma^2 I_N)^{-1} \mathbf{y}_l, \quad l = 1, \dots, m,$

all output components are modelled by independent Gaussian processes whose covariance matrices differ only in the scaling factors $\{\sigma_l\}_{l=1}^m$. As a result, the posterior covariances (11) of different components, e.g., in the interpolation regime, will differ only in the values of the scaling factors.
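The shared-$\theta$ likelihood (14) can be sketched as a sum of per-output log-likelihoods. The simplified kernel (with $p = 2$, $q = 1$) and all numerical values below are illustrative assumptions of ours.

```python
import numpy as np

def shared_kernel(X, theta, sigma_l2):
    # k_l(x, x~): shared length-scale parameters theta, per-output scale sigma_l^2
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2 * theta ** 2).sum(-1)
    return sigma_l2 * np.exp(-d2)

def total_log_likelihood(X, Y, theta, sigma_l2s, noise):
    # Eq. (14): the joint log-likelihood is the sum over the m output components
    total = 0.0
    for l in range(Y.shape[1]):
        K = shared_kernel(X, theta, sigma_l2s[l]) + noise * np.eye(len(X))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y[:, l]))
        total += (-0.5 * Y[:, l] @ alpha - np.log(np.diag(L)).sum()
                  - 0.5 * len(X) * np.log(2 * np.pi))
    return total

rng = np.random.default_rng(1)
X = rng.uniform(size=(20, 2))
Y = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 1])])
ll = total_log_likelihood(X, Y, theta=np.ones(2), sigma_l2s=[1.0, 1.0], noise=0.01)
```

In a full implementation this objective would be maximized jointly over the shared $\theta$ and the per-output factors $\sigma_l$.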
2.4. Interpolation Properties in the Case of a Nonstationary Covariance Function

The model of the dependence on the basis of Gaussian processes has an important property: in models (3) and (8) with noise, for $\tilde\sigma^2 = 0$, we obtain an interpolating behavior of the posterior mean, i.e., at the points of the sample $X$, the approximated value coincides with the observed value of the dependence [14]. Such a behavior is important for some engineering problems. For $\tilde\sigma^2 > 0$, the interpolation property, generally speaking, is not satisfied; however, for model (8) it can be shown that, for sufficiently small $\tilde\sigma^2 > 0$, there always exist values of the parameters $\{\tilde\sigma, (\theta_i)_{i=1}^n, \sigma\}$ for which the behavior of the approximator is interpolating to any specified accuracy. This assertion is formulated in the following theorem [25].

Theorem 1. For a given $\varepsilon > 0$, suppose that $\|\Psi(X)^T \Psi(X)\| \le C$, where the norm is the standard Euclidean one, and that

$\tilde\sigma^2 < \frac{1}{8} \min\left( \min_{i = 1, \dots, N} \lambda_i,\ \varepsilon \right),$

where $\lambda_i$, $i = 1, \dots, N$, are the eigenvalues of the matrix $K$; then, under the matching bounds on $\sigma^2$ given in [25],

$\|\mathbf{y} - \hat f(X)\| < \varepsilon \|\mathbf{y}\|.$

Thus, for our model of a nonstationary Gaussian process, one can also obtain interpolating realizations, which is sometimes necessary in applied problems.

3. THE USE OF THE BAYESIAN APPROACH FOR ESTIMATING THE PARAMETERS OF THE REGRESSION MODEL ON THE BASIS OF A GAUSSIAN PROCESS

3.1. Drawbacks of the Standard Approach to Model Identification

In Section 2, we described a basic algorithm for constructing a model on the basis of Gaussian processes. On a wide class of problems, this algorithm provides a good quality of approximation of a simulated dependence, but in some cases it can lead to degeneration of the model. For example, consider the
use of a covariance function with a weighted Euclidean distance:

$k(x, \tilde x) = \sigma^2 \exp \left( - \sum_{i=1}^n \theta_i^2 (x_i - \tilde x_i)^2 \right).$

When optimizing the likelihood with respect to the parameters of the covariance function $a = \{\theta = (\theta_i)_{i=1}^n, \sigma\}$ and the noise variance $\tilde\sigma$, the following effects can be observed:

- the optimum can be located in a region where $\|\theta\| \approx 0$; in this case, the matrix $K$ is ill-conditioned;
- for $\|\theta\| \gg 1$, we have $K \approx I_N$, which leads to a degenerate approximation;
- the log-likelihood $\log p(\mathbf{y} \mid X, a, \tilde\sigma)$ can depend weakly on the parameters $a$ and $\tilde\sigma$.

If the maximum of the likelihood is located in the region of small $\|\theta\| \approx 0$, the covariance matrix $K$ is ill-conditioned. Formula (4) implies that the approximation can be represented in the form

$\hat f(x) = \sum_{i=1}^N \alpha_i k(x, x_i),$

where $(\alpha_i)_{i=1}^N = (K + \tilde\sigma^2 I_N)^{-1} \mathbf{y}$. If the value of $\tilde\sigma$ is insufficiently large, then, due to the ill-conditioning of the matrix $K$, the norm of the vector $(\alpha_i)_{i=1}^N$ takes large values, and numerical noise becomes essentially noticeable in the resulting approximation $\hat f(x)$. If the maximum of the likelihood is located in the region of large values $\|\theta\| \gg 1$, the approximation model also degenerates: for $\tilde\sigma \approx 0$ and $(x_i, y_i) \in D_{\mathrm{train}}$, we have $\hat f(x_i) \approx y_i$; in this case, $\hat f(x) \approx \frac{1}{N} \sum_{i=1}^N y_i$ for $x \notin D_{\mathrm{train}}$.
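The $\|\theta\| \to 0$ degeneracy can be observed directly through the condition number of $K$; the grid and the two $\theta$ values below are arbitrary illustrations of ours.

```python
import numpy as np

def cond_of_K(theta, X):
    # K = [exp(-theta^2 (x_i - x_j)^2)]: as theta -> 0 all rows tend to the
    # all-ones row, so K approaches a rank-one matrix and becomes ill-conditioned
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.linalg.cond(np.exp(-theta ** 2 * d2))

X = np.linspace(0.0, 1.0, 10)
small_theta, large_theta = cond_of_K(0.01, X), cond_of_K(20.0, X)
```

For $\theta \approx 0$ the condition number explodes, while a sufficiently large $\theta$ keeps $K$ close to the identity matrix, illustrating the two degenerate regimes described above.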
3.2. Bayesian Inference

To overcome the above difficulties, it is suggested to use Bayesian regularization on the parameters of the covariance function. Below we consider two variants of such regularization: one on the basis of a prior normal distribution and one on the basis of a prior gamma distribution.

Assume that the parameters $\theta$ are distributed with some prior density $p(\theta \mid \gamma)$, where $\gamma$ are the hyperparameters of the prior distribution of the covariance function parameters. Then the logarithm of the joint distribution of the data and the model parameters has the form

$\log P(\mathbf{y}, \theta \mid X, \sigma, \tilde\sigma, \gamma) = \log p(\mathbf{y} \mid X, \theta, \sigma, \tilde\sigma) + \log p(\theta \mid \gamma). \quad (15)$
The term $\log p(\theta \mid \gamma)$ plays the role of an additional regularization on the parameters of the covariance function $\theta$, and the choice of a particular distribution $p(\theta \mid \gamma)$ determines the character of this regularization. Expression (15) can be maximized with respect to the parameters $\sigma$, $\tilde\sigma$, $\gamma$, and $\theta$ in order to find their optimal values. This approach corresponds to finding a MAP (maximum a posteriori probability) estimate of the model parameters $\sigma$, $\tilde\sigma$, and $\theta$ and of the parameters of the prior distribution $\gamma$. In this work, the role of the prior distribution $p(\theta \mid \gamma)$ is played by the normal and gamma distributions.

In the case of the normal distribution, consider the following transformation of the vector $\theta$: $\theta = \theta_{\parallel} e + \theta_{\perp}$, where $\theta_{\parallel} = \langle \theta, e \rangle$, $e = \left( \frac{1}{\sqrt{n}}, \dots, \frac{1}{\sqrt{n}} \right)$, and the vector $\theta_{\perp}$ is orthogonal to the vector $e$. Suppose that the quantity $\theta_{\parallel}$ is normally distributed with mean $\mu_{\parallel} > 0$ and variance $d_{\parallel}$, and that the vector $\theta_{\perp}$ is independent of $\theta_{\parallel}$ and normally distributed with zero mean and diagonal covariance matrix $d_{\perp} I_{\perp}$, where $I_{\perp}$ is the identity matrix of dimension $n - 1$ and $d_{\perp}$ is the variance of the components of the vector $\theta_{\perp}$. This assumption is based on our intuition that the components of the vector $\theta$ should be neither too large nor too small and should not differ from each other too significantly. Thus, in the given case, $\gamma = \{\mu_{\parallel}, d_{\parallel}, d_{\perp}\}$ and

$\log p(\theta \mid \gamma) = \log p_{\mathcal{N}(\mu_{\parallel}, d_{\parallel})}(\theta_{\parallel}) + \log p_{\mathcal{N}(0, d_{\perp} I_{\perp})}(\theta_{\perp}),$

where $p_{\mathcal{N}(\mu, \Sigma)}(\cdot)$ is the density of the normal distribution with mean $\mu$ and covariance matrix $\Sigma$.

Now let us consider the case of the gamma distribution. Assume that all components of the vector $\theta$ are independent and distributed according to the same gamma distribution with parameters $\alpha$ and $\beta$. In this case,

$\log p(\theta \mid \gamma) = \log p(\theta \mid \alpha, \beta) = \sum_{i=1}^n \left[ \alpha \log \beta + (\alpha - 1) \log \theta_i - \beta \theta_i - \log \Gamma(\alpha) \right], \quad (16)$

where $\alpha > 1$, $\beta > 0$, and $\Gamma(\alpha)$ is the gamma function. Such a choice of the prior distribution also penalizes too large and too small values of the components $\theta_i$. Section 4.1 presents a number of experimental results showing that both methods of Bayesian regularization substantially improve the robustness of the algorithm and the quality of the final model. It should be noted that, in a similar manner, one can regularize the parameter $\sigma$.
.
pθγ()log
σ
˜,
σ
˜,
pθγ()
θθ
|| θ
+,=
θ|| e,
1
n
 1
n
,,
⎝⎠
⎛⎞
,
μ|| d|| d
,,{}
pθγ()log
pNμ|| d||
,()
θ()log
pN0dI
,()
θ(),log
pNμΣ,()
()
pθγ()log pθαβ,()log=
=
αβα1()θ
i
log βθiΓα()log+log[],
i1=
n
3.3. Multilevel Bayesian Inference

In order that procedures based on the Bayesian approach could be used for solving applied problems, it is necessary to provide reasonable values of the parameters of the prior distribution [14]. A researcher typically has no possibility to choose the parameters of the prior distribution for each problem, e.g., when the number of problems is large or if the algorithm is supposed to be used as a black box, i.e., when a user will not have appropriate information for choosing the parameters of the prior distribution in the Bayesian approach.

In this section, we suggest a multilevel Bayesian inference [26] for the model of regression on the basis of Gaussian processes, which helps to avoid the difficulty described above. The idea of this approach is based on introducing an additional hyperprior distribution on the parameters of the original Gaussian model. In this case, the model identification consists in tuning the parameters of the model, the parameters of the prior distribution, and the parameters of the hyperprior distribution using the principle of maximum posterior probability. In the considered case, the parameters of the hyperprior distribution influence the likelihood to a lesser extent, and the regularization introduced by this hyperprior distribution on the parameters of the first-level prior distribution prevents the latter from degeneracy when optimizing (15) and, as a result, keeps the parameters of the covariance function from degeneracy as well.
Thus, it is suggested to introduce

- a prior gamma distribution $\Gamma(\alpha, \beta)$ on the parameters of the covariance function $\theta$;
- a prior gamma distribution $\Gamma(\alpha_\sigma, \beta_\sigma)$ on the parameter of the covariance function $\sigma$;
- a hyperprior lognormal distribution with preliminarily fixed parameters $\mu$ and $\sigma^2_{\mathrm{prior}}$ on each of the parameters $\alpha$, $\beta$, $\alpha_\sigma$, and $\beta_\sigma$ of the Bayesian regularization.

In this case, the joint distribution density is proportional to

$p(\mathbf{y}, \theta, \sigma^2, \alpha, \beta, \alpha_\sigma, \beta_\sigma \mid X, \mu, \sigma^2_{\mathrm{prior}}) \propto p(\mathbf{y} \mid X, \theta, \sigma^2)\, \Gamma(\theta \mid \alpha, \beta)\, \Gamma(\sigma^2 \mid \alpha_\sigma, \beta_\sigma) \times \log\mathcal{N}(\alpha \mid \mu, \sigma^2_{\mathrm{prior}})\, \log\mathcal{N}(\beta \mid \mu, \sigma^2_{\mathrm{prior}})\, \log\mathcal{N}(\alpha_\sigma \mid \mu, \sigma^2_{\mathrm{prior}})\, \log\mathcal{N}(\beta_\sigma \mid \mu, \sigma^2_{\mathrm{prior}}),$

where $\Gamma(\cdot)$ denotes the density of the gamma distribution and $\log\mathcal{N}(\cdot)$ the density of the lognormal distribution. Substituting the explicit expressions for the distributions, we obtain the explicit expression to be maximized:

$\log p(\mathbf{y}, \theta, \sigma^2, \alpha, \beta, \alpha_\sigma, \beta_\sigma \mid X, \mu, \sigma^2_{\mathrm{prior}}) = -\frac{1}{2} \left[ N \log 2\pi + \log |K_{\theta,\sigma}| + \mathbf{y}^T K_{\theta,\sigma}^{-1} \mathbf{y} \right]$
$- \log \Gamma(\alpha) + \alpha \log \beta + (\alpha - 1) \sum_{i=1}^n \log \theta_i - \beta \sum_{i=1}^n \theta_i$
$- \log \Gamma(\alpha_\sigma) + \alpha_\sigma \log \beta_\sigma + (\alpha_\sigma - 1) \log \sigma^2 - \beta_\sigma \sigma^2$
$- \frac{1}{2} \left[ \log 2\pi\sigma^2_{\mathrm{prior}} + 2 \log \alpha + \frac{(\log \alpha - \mu)^2}{\sigma^2_{\mathrm{prior}}} + \log 2\pi\sigma^2_{\mathrm{prior}} + 2 \log \beta + \frac{(\log \beta - \mu)^2}{\sigma^2_{\mathrm{prior}}} \right]$
$- \frac{1}{2} \left[ \log 2\pi\sigma^2_{\mathrm{prior}} + 2 \log \alpha_\sigma + \frac{(\log \alpha_\sigma - \mu)^2}{\sigma^2_{\mathrm{prior}}} + \log 2\pi\sigma^2_{\mathrm{prior}} + 2 \log \beta_\sigma + \frac{(\log \beta_\sigma - \mu)^2}{\sigma^2_{\mathrm{prior}}} \right].$
Here, $K_{\theta,\sigma}$ is the covariance matrix of the sample, and $\Gamma(\alpha)$ is the value of the gamma function at the argument $\alpha$.
4. EXPERIMENTAL RESULTS
4.1. Bayesian Regularization
In this section, we present two examples of the application of the Bayesian regularization for estimating the parameters of the covariance function of the regression on the basis of Gaussian processes: the approximation of the Rastrigin function and a comparison of the Bayesian estimation of the parameters with estimation on the basis of the maximum-likelihood method on a large set of test functions.

In some cases, the maximum of the likelihood corresponds to values of the parameters that have no good physical interpretation. For example, consider the approximation of the Rastrigin function [27] by means of the regression on the basis of Gaussian processes using 20 random points, with the parameters of the covariance function estimated by the maximum-likelihood method (the function is shown in Fig. 1, left). The approximation obtained is presented in Fig. 1 (center): the regression model consists of narrow peaks around the points of the training sample and cannot be used in practice. If the approximation is constructed using the Bayesian regularization, either with the prior gamma distribution or with the prior normal distribution, then one obtains nearly the same approximations of the form shown in Fig. 1 (right). We see that the approximation using the Bayesian regularization reproduces the form of the Rastrigin function better, although the difference is still rather large due to the small size of the training sample.

In the experiments, we compared the performance of three algorithms for constructing the approximation based on Gaussian processes, namely:

- MLE GP, the regression on the basis of Gaussian processes with maximum-likelihood estimation of the parameters;
- Gamma GP, the Bayesian method of estimation of the parameters with a prior gamma distribution;
- Normal GP, the Bayesian method of estimation of the parameters with a prior normal distribution.

The algorithms were compared with the help of Dolan–More curves [28], which give an intuitive representation of the results of testing several algorithms on a set of problems. Suppose that there is a set of algorithms $a_i$, $i = 1, \dots, A$, and a set of problems $t_j$, $j = 1, \dots, T$. Then, for each algorithm, we can measure its performance (e.g., the mean absolute error on
Fig. 1. The Rastrigin function and its approximation: (left) objective function, (center) its approximation with the maximum-likelihood estimate of parameters, and (right) its approximation with the Bayesian estimates of parameters.
the test sample) on each problem and obtain a set of performance values $q_{ij}$. Suppose that the algorithm providing the best result for a given problem has the performance $q_j$, $j = 1, \dots, T$, so that $q_{ij} \ge q_j$ for all $i = 1, \dots, A$. Define $r_{ij} = q_{ij} / q_j \ge 1$. The closer $r_{ij}$ is to unity, the better algorithm $a_i$ has performed on problem $t_j$. The parameter $\tau$ is varied from 1 to $\infty$. For each $\tau$, define a point $c_i(\tau)$ on the Dolan–More curve for algorithm $a_i$ as the fraction of problems on which $r_{ij}$ is smaller than $\tau$. With an increase in $\tau$, $c_i(\tau)$ tends to unity, and the higher the curve $c_i(\tau)$ lies, the better the performance of the algorithm as compared to other algorithms.
For demonstration of the experimental results, we used a large set of test functions employed for testing optimization methods [29–31]. The testing was performed using 30 different functions, for each of which two random training samples of 10, 20, 40, 80, and 160 points were generated. To measure the performance of the approximation algorithms, we used the 95th percentile of the absolute error on control samples containing 10000 test points. The results are presented in the form of Dolan–More curves in Fig. 2. The performance of each algorithm was estimated by the area under its Dolan–More curve; the values are provided in the caption of Fig. 2. It should be noted that both algorithms with the Bayesian regularization have shown better results than the algorithm without regularization.
4.2. Nonstationary Covariance Function

In this section, we present two examples of using a nonstationary covariance function: the approximation of a spatially nonuniform one-dimensional function and a comparison of three different implementations of the regression on the basis of Gaussian processes on a large set of different functions.

Let us consider a one-dimensional example in which the use of a nonstationary covariance function makes it possible to significantly improve the quality of approximation. We approximate the Heaviside function on the interval [0, 1] from 20 random points of this interval. The graphs of the approximations with the use of a stationary and a nonstationary covariance function are presented in Fig. 3. The use of a nonstationary covariance function substantially increases the quality of approximation.

In the experiment, we compared the work of three algorithms based on Gaussian processes, namely:

- DACE, the implementation of the regression on the basis of Gaussian processes in the DACE package [15];
Fig. 2. Dolan–More curves for test functions. The areas under the curves are presented in the legend: MLE GP, 1.926; Gamma GP, 1.930; Normal GP, 1.951.
Fig. 3. Approximation of the Heaviside function: (left) the approximation obtained with the use of a stationary covariance function (MacrosGP prediction) and (right) the approximation with the use of a nonstationary covariance function (MacrosGP_HDA prediction).
- GP, the algorithm on the basis of Gaussian processes without the nonstationary covariance function, developed by the authors of this paper;
- nonstationary GP, the algorithm on the basis of Gaussian processes with the nonstationary covariance function, developed by the authors of this paper.
To demonstrate the experimental results, we used a large set of test functions applied for testing optimization methods [29–31]. The testing was performed on 23 different functions, for each of which two random training samples of 160 points were generated. We used the mean-square error (1) on test samples with 10000 points to measure approximation accuracy. The results are presented in the form of Dolan–More curves [28] in Fig. 4. The higher the curve in the graph, the higher the performance of the corresponding algorithm. The proposed implementation of the approximator on the basis of Gaussian processes is significantly better than DACE, and the addition of the nonstationary covariance function substantially improves the quality of the results.
4.3. Multilevel Bayesian Regularization
As follows from the results of testing (Section 4.1), the algorithm of Bayesian inference proposed in Section 3.2 increases, on average, the approximation accuracy and its robustness. However, when optimizing (15) with respect to the parameters γ = (α, β) of the prior distribution, situations can occur in which, e.g., β → 0. As a result, the regularization stops penalizing large values of the parameters θ = (θ_i), i = 1, ..., n, and the resulting approximation can degenerate into a constant with peaks at the points of the training sample (see the example in Fig. 1). The multilevel Bayesian inference described in Section 3.3, for which the results are presented in this section, makes it possible to avoid such situations.
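The mechanism can be illustrated with a toy sketch (all names hypothetical; a normal prior with precision β stands in for the priors of Section 3, and a simple quadratic stands in for the GP marginal log likelihood): as β → 0, the log-prior penalty on θ vanishes, so large parameter values are no longer discouraged.

```python
import numpy as np

def log_prior(theta, beta):
    """Log density (up to a constant) of a normal prior N(0, 1/beta) on
    each covariance parameter; beta -> 0 switches the penalty off."""
    return -0.5 * beta * np.sum(theta**2)

def map_objective(log_likelihood, theta, beta):
    """MAP objective: log marginal likelihood plus log prior."""
    return log_likelihood(theta) + log_prior(theta, beta)

def ll(theta):
    # Stand-in for the GP marginal log likelihood of Section 3
    # (flat in theta, so large theta is barely discouraged by the data).
    return -0.5 * np.sum(theta**2) / 100.0

theta = np.array([10.0, -7.0])
penalized = map_objective(ll, theta, beta=1.0)      # large theta is penalized
unpenalized = map_objective(ll, theta, beta=1e-12)  # beta -> 0: penalty gone
```

With a near-zero β, the objective is essentially the unregularized likelihood, which is the degenerate regime the multilevel scheme is designed to prevent.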
The testing was performed on a large set of functions of different input dimensions. The parameter μ of the lognormal distribution was taken equal to 1.5. Two measures of the quality of approximation were used: the mean-square error and a measure of the variability of an approximation, which makes it possible to detect its degeneration. The variability was estimated from the values of ŷ = f̂(X*) and y* by the formula

variability = (max(ŷ) - min(ŷ)) / (max(y*) - min(y*)),

where max(y) is the maximum value of the vector y. For each sample size, we counted the number of problems for which the variability was smaller than 10^-4 (in this case, the approximation is considered to have degenerated into a constant).

Fig. 4. Dolan–More curves for test functions (horizontal axis: Err(RSM)/Err(Best); vertical axis: part of functions, 46 in all; curves: DACE, GP, nonstationary GP).
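The variability measure is straightforward to compute; a minimal sketch (hypothetical names) that also checks a nearly constant prediction against the 10^-4 threshold used in the experiments:

```python
import numpy as np

def variability(y_pred, y_true):
    """Range of the predictions relative to the range of the true test
    values; values near zero flag an approximation that has degenerated
    into a constant."""
    return (y_pred.max() - y_pred.min()) / (y_true.max() - y_true.min())

# A degenerate (almost constant) approximation of a sine:
x = np.linspace(0.0, 1.0, 1000)
y_true = np.sin(2 * np.pi * x)
y_flat = np.full_like(y_true, y_true.mean()) + 1e-6 * np.cos(x)
is_degenerate = variability(y_flat, y_true) < 1e-4  # threshold from the text
```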
The parameter σ_prior^2 in the experiments was taken equal to 1.5, 1, 0.5, 0.3, and 0.2. The Dolan–More curve constructed from the values of the mean-square error is presented in Fig. 5. We see that, with a decrease in σ_prior^2, the accuracy of approximation decreases. Table 1 presents the number of cases when the constructed model had an approximation error close to the error of approximation by a constant. Already for σ_prior^2 = 0.5, there are no cases of degeneration of the regression models, while the mean-square error of the model remains approximately the same as in the ordinary Bayesian approach.
Thus, the suggested approach makes it possible to avoid degenerate models of regression on the basis of Gaussian processes without deteriorating the quality of the model on average.
5. CONCLUSIONS
In this work, we have proposed two methods for improving the accuracy of regression models on the basis of Gaussian processes: the use of Bayesian regularization and the use of a new nonstationary covariance function constructed from the data.
The prior distributions of parameters for the Bayesian regularization were the normal and gamma distributions. Both prior distributions make it possible to avoid degenerate approximations and increase the generalization ability and robustness of the algorithm.
The nonstationary covariance function of the Gaussian process is constructed on the basis of a dictionary of parametric basis functions adjusted using the data. In contrast to other implementations of approximators on the basis of Gaussian processes, such as DACE [15], the suggested approach is fully automatic and does not require any knowledge of machine learning from an engineer who applies the algorithm in practice. The experimental results demonstrate the high performance of the proposed algorithm as compared to traditional methods, especially for spatially nonuniform functions.
ACKNOWLEDGMENTS
This work was performed at the Institute for Information Transmission Problems, Russian Academy of Sciences, and was supported exclusively by the Russian Science Foundation, project no. 14-50-00150.
REFERENCES
1. A. I. J. Forrester, A. Sóbester, and A. J. Keane, "Engineering design via surrogate modelling: a practical guide," in Progress in Astronautics and Aeronautics (Wiley, New York, 2008).
2. A. A. Zaytsev, E. V. Burnaev, and V. G. Spokoiny, "Properties of the Bayesian parameter estimation of a regression based on Gaussian processes," J. Math. Sci. 203, 789–798 (2014).
Table 1. Choice of the parameter σ_prior^2: the number of problems on which the variability of the approximation is smaller than 10^-4, i.e., the degeneration of the resulting approximation occurred. A total of 190 test problems was considered for each size N of the training sample, which varied from 10 to 320 points

Sample size | Bayesian regularization | Multilevel Bayesian regularization for different σ_prior^2
            |                         | 1.5 | 1 | 0.5 | 0.3 | 0.2
10          | 16                      | 0   | 0 | 0   | 0   | 0
20          | 2                       | 0   | 0 | 0   | 0   | 0
40          | 31                      | 0   | 0 | 0   | 0   | 0
80          | 37                      | 14  | 0 | 0   | 0   | 0
160         | 24                      | 15  | 1 | 0   | 0   | 0
320         | 3                       | 1   | 0 | 0   | 0   | 0
Fig. 5. Dolan–More curves for the Bayesian regularization and for the multilevel Bayesian regularization for different values of σ_prior^2 (horizontal axis: log10(error); vertical axis: fraction of problems; curves: Bayesian regularization and σ_prior^2 = 1.5, 1.0, 0.5, 0.3, 0.2).
3. E. V. Burnaev, A. A. Zaytsev, and V. G. Spokoiny, "The Bernstein–von Mises theorem for regression based on Gaussian processes," Rus. Math. Surv. 68, 954–956 (2013).
4. E. Burnaev and M. Panov, "Adaptive design of experiments based on Gaussian processes," in Lecture Notes in Artificial Intelligence (Proc. SLDS, April 20–23, London, 2015), Ed. by A. Gammerman et al. (Springer, UK, 2015), Vol. 9047, pp. 116–126.
5. M. Belyaev, E. Burnaev, and Y. Kapushev, "Gaussian process regression for structured data sets," in Lecture Notes in Artificial Intelligence (Proc. SLDS, April 20–23, London, 2015), Ed. by A. Gammerman et al. (Springer, UK, 2015), Vol. 9047, pp. 106–115.
6. A. Giunta and L. T. A. Watson, "Comparison of approximation modeling techniques: Polynomial versus interpolating models," in Proc. 7th AIAA/USAF/NASA/ISSMO Symp. on Multidisciplinary Analysis and Optimization (AIAA, 1998), Vol. 1, pp. 392–404.
7. T. W. Simpson, A. J. Booker, S. Ghosh, A. Giunta, P. N. Koch, and R. J. Yang, "Approximation methods in multidisciplinary analysis and optimization: a panel discussion," Structur. Multidisciplin. Optimiz. 27, 302–313 (2004).
8. S. M. Batill, J. E. Renaud, and X. Gu, "Modeling and simulation uncertainty in multidisciplinary design optimization," in Proc. 8th AIAA/USAF/NASA/ISSMO Symp. on Multidisciplinary Analysis and Optimization (AIAA, 2000), pp. 1–11.
9. J. E. Pacheco, C. H. Amon, and S. Finger, "Bayesian surrogates applied to conceptual stages of the engineering design process," J. Mechan. Design 125, 664–672 (2003).
10. A. J. Booker, J. E. Dennis, P. D. Frank, D. B. Serafini, V. Torczon, and M. Trosset, "A rigorous framework for optimization of expensive functions by surrogates," Struct. Optimiz. 17, 1–13 (1999).
11. P. N. Koch, B. A. Wujek, O. Golovidov, and T. W. Simpson, "Facilitating probabilistic multidisciplinary design optimization using kriging approximation models," in Proc. 9th AIAA/ISSMO Symp. on Multidisciplinary Analysis and Optimization (AIAA, 2002).
12. T. W. Simpson, T. M. Maurey, J. J. Korte, and F. Mistree, "Kriging metamodels for global approximation in simulation-based multidisciplinary design optimization," AIAA J. 39, 2223–2241 (2001).
13. T. W. Simpson, T. M. Maurey, J. J. Korte, and F. Mistree, "Metamodeling development for vehicle frontal impact simulation," in Proc. ASME Design Engineering Technical Conf. and Design Automation Conf., DETC2001/DAC-21012, Sept. 2001 (ASME, 2001).
14. C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (MIT Press, Cambridge, MA, 2006).
15. S. N. Lophaven, H. B. Nielsen, and J. Sondergaard, Aspects of the Matlab Toolbox DACE, Tech. Rep. IMM-REP-2002-13 (Tech. Univ. Denmark, Lyngby, 2002).
16. A. Bhattacharya, D. Pati, and D. Dunson, "Anisotropic function estimation using multi-bandwidth Gaussian processes," Ann. Statist. 42, 352–381 (2014).
17. Y. Xiong, W. Chen, D. Apley, and X. Ding, "A nonstationary covariance-based kriging method for metamodelling in engineering design," Int. J. Numer. Methods Eng. 71, 733–756 (2007).
18. B. Nagy, J. L. Loeppky, and W. J. Welch, Correlation Parameterization in Random Function Models to Improve Normal Approximation of the Likelihood or Posterior, Tech. Rep. 229 (Department of Statistics, Univ. British Columbia, 2007).
19. B. MacDonald, P. Ranjan, and H. Chipman, "GPfit: an R package for fitting a Gaussian process model to deterministic simulator outputs," J. Statist. Software 64 (12), 1–23 (2015).
20. E. V. Burnaev, M. G. Belyaev, and P. V. Prihodko, "About hybrid algorithm for tuning of parameters in approximation based on linear expansions in parametric functions," in Proc. 8th Int. Conf. on Intelligent Information Processing (IIP-2010), Paphos, Cyprus, Oct. 17–24, 2010 (MAKS, Moscow, 2011), pp. 200–203.
21. M. Belyaev and E. Burnaev, "Approximation of a multidimensional dependency based on a linear expansion in a dictionary of parametric functions," Inform. Appl. 7, 114–125 (2013).
22. E. Burnaev and P. Prikhodko, "On a method for constructing ensembles of regression models," Avtom. Telemekh. 74, 1630–1644 (2013).
23. S. Grihon, E. Burnaev, M. Belyaev, and P. Prikhodko, "Surrogate modeling of stability constraints for optimization of composite structures," in Springer Series in Computational Science & Engineering (Springer, New York, 2013), pp. 359–391.
24. N. A. C. Cressie, Statistics for Spatial Data (Wiley, New York, 1993).
25. E. Burnaev, E. Zaytsev, M. Panov, P. Prihodko, and Yu. Yanovich, "Modeling of nonstationary covariance function of Gaussian process using decomposition in dictionary of nonlinear functions," in Information Technologies and Systems 2011 (Proc. Conf., Moscow, Russia, Oct. 2–7, 2011) (Inst. Problem Peredachi Inf., Moscow, 2011), pp. 355–362.
26. C. M. Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006).
27. A. Törn and A. Žilinskas, Global Optimization, in Lecture Notes in Computer Science (Springer-Verlag, Berlin, 1989), Vol. 350.
28. E. D. Dolan and J. J. Moré, "Benchmarking optimization software with performance profiles," Math. Program. 91, 201–213 (2002).
29. A. Saltelli and I. M. Sobol, "About the use of rank transformation in sensitivity analysis of model output," Reliab. Eng. Syst. Safety 50, 225–239 (1995).
30. J. Ronkkonen and J. Lampinen, "An extended mutation concept for the local selection based differential evolution algorithm," in Proc. Genetic and Evolutionary Computation Conf. (GECCO 2007), London, England, July 7–11, 2007 (ACM Press, New York, 2007).
31. T. Ishigami and T. Homma, "An importance quantification technique in uncertainty analysis for computer models," in Proc. 1st Int. Symp. on Uncertainty Modelling and Analysis (ISUMA '90), Univ. Maryland, USA, December 3–5, 1990 (IEEE, New York, 1990), pp. 398–403.
Translated by E. Chernokozhin
Article
In nonparametric regression problems involving multiple predictors, there is typically interest in estimating an anisotropic multivariate regression surface in the important predictors while discarding the unimportant ones. Our focus is on defining a Bayesian procedure that leads to the minimax optimal rate of posterior contraction (up to a log factor) adapting to the unknown dimension and anisotropic smoothness of the true surface. We propose such an approach based on a Gaussian process prior with dimension-specific scalings, which are assigned carefully-chosen hyperpriors. We additionally show that using a homogenous Gaussian process with a single bandwidth leads to a sub-optimal rate in anisotropic cases.