An Approach for Constructing Parsimonious Generalized
Gaussian Kernel Regression Models
X.X. Wang†, S. Chen‡ and D.J. Brown†
† Computer Intelligence and Applications Research Group, Department of Creative Technologies, University of Portsmouth, Buckingham Building, Lion Terrace, Portsmouth PO1 3HE, U.K. Emails: xunxian.wang@port.ac.uk, david.j.brown@port.ac.uk
‡ School of Electronics and Computer Science, University of Southampton, Highfield, Southampton SO17 1BJ, U.K. Email: sqc@ecs.soton.ac.uk
Abstract
The paper proposes a novel construction algorithm for generalized Gaussian kernel regression models. Each kernel regressor in the generalized Gaussian kernel regression model has an individual diagonal covariance matrix, which is determined by maximizing the correlation between the training data and the regressor using a repeated guided random search based on boosting optimization. The standard orthogonal least squares algorithm is then used to select a sparse generalized kernel regression model from the resulting full regression matrix. Experimental results involving two real data sets demonstrate the effectiveness of the proposed regression modeling approach.
Keywords — Neural networks, regression, orthogonal least squares, correlation, boosting
1 Introduction

A fundamental principle in practical nonlinear data modeling is the parsimonious principle of ensuring the smallest possible model that explains the training data. Forward selection using the orthogonal least squares (OLS) algorithm [1]–[5] is a simple and efficient construction method that is capable of producing parsimonious linear-in-the-weights nonlinear models with excellent generalization performance. Alternatively, the state-of-the-art sparse kernel modeling techniques, such as the support vector machine and relevance vector machine [6]–[9], have been gaining
popularity in data modeling applications. These existing sparse regression modeling techniques
typically place the kernel centers or mean vectors at the training input data and use a fixed
common kernel variance for all the regressor kernels. The value of this common kernel variance
obviously has a critical influence on the sparsity and generalization capability of the resulting
model, and it has to be determined via some sort of cross validation. For example, in [3]
a genetic algorithm is applied to determine the appropriate common kernel variance through
optimizing the model generalization performance.
In this paper, we consider a generalized Gaussian kernel model, in which each kernel regressor has an individually tuned diagonal covariance matrix. Such a generalized kernel regression
model has the potential of improving modeling capability and producing sparser final models,
compared with the standard approach of single fixed common variance. The difficult issue is
then how to determine these kernel covariance matrices. Since the correlation function between
a kernel regressor and the training data defines the “similarity” between the regressor and the
training data, it can be used to “shape” the regressor by adjusting the associated kernel covari
ance matrix in order to maximize the absolute value of this correlation function. A weighted
optimization algorithm, which has its roots in boosting [10]–[12], is considered to perform the
associated optimization task. This weighted optimization algorithm is a guided random search
method and the solution obtained may depend on the initial choice of population. To provide a
robust optimization and guarantee stable solutions regardless of the initial choice of population,
the algorithm is augmented into a repeated weighted optimization method.
The determination of kernel covariance matrices essentially provides the full bank of regressors or the full regression matrix, and this allows the application of the standard OLS algorithm
[1],[2] to select a parsimonious subset model. The outline of the paper is as follows. Section 2
gives the generalized Gaussian kernel regression model to be considered. Section 3 derives the
correlation criterion to be used for determining the kernel covariance matrices and presents a
repeated boosting search optimization algorithm for performing the corresponding optimization
tasks. Section 4 briefly summarizes the standard OLS algorithm used to select a sparse kernel
regression model, while Section 5 describes our modeling experiments. Finally, Section 6 offers
our conclusions.
2 Generalized Gaussian kernel regression model

Consider a general discrete stochastic nonlinear system represented by

$$y_t = f(\mathbf{x}_t; \boldsymbol{\theta}) + e_t \qquad (1)$$

where $u_t$ and $y_t$ are the system input and output variables, respectively, $n_u$ and $n_y$ are positive integers representing the known lags in $u_t$ and $y_t$, respectively, the observation noise $e_t$ is uncorrelated with zero mean, $\mathbf{x}_t = [y_{t-1} \cdots y_{t-n_y}\; u_{t-1} \cdots u_{t-n_u}]^T$ denotes the system input vector with a known dimension $n = n_y + n_u$, $f(\cdot)$ is the a priori unknown system mapping, and $\boldsymbol{\theta}$ is an unknown parameter vector associated with the appropriate, but yet to be determined, model structure. The system model (1) is to be identified from an $N$-sample system observational data set $D_N = \{\mathbf{x}_t, y_t\}_{t=1}^{N}$, using some suitable functions which can approximate $f(\cdot)$ with arbitrary accuracy.
We will model the unknown dynamical process (1) by using the following generalized Gaussian kernel regression model

$$y_t = \hat{y}_t + e_t = \sum_{i=1}^{N} \beta_i \phi_i(\mathbf{x}_t) + e_t \qquad (2)$$

where $\hat{y}_t$ denotes the model output given the input $\mathbf{x}_t$, $\beta_i$ are the model weight parameters, and $\phi_i(\cdot)$ are the kernel regressors. We allow the regressor function to be chosen as the general Gaussian function

$$\phi_i(\mathbf{x}) = K(\mathbf{x} - \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \exp\left(-\tfrac{1}{2}\,(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right) \qquad (3)$$

where the kernel covariance matrix takes the form $\boldsymbol{\Sigma}_i = \mathrm{diag}\{\sigma_{i,1}^2, \ldots, \sigma_{i,n}^2\}$. As in a standard kernel regression model, the kernel mean vectors $\boldsymbol{\mu}_i$ are placed at the training input data points. If all the diagonal covariance matrices are set to the identical form $\mathrm{diag}\{\sigma^2, \ldots, \sigma^2\}$, we arrive at the standard Gaussian kernel model.

The kernel model (2) over the training set $D_N$ can be written in the matrix form as

$$\mathbf{y} = \boldsymbol{\Phi}\boldsymbol{\beta} + \mathbf{e} \qquad (4)$$

by defining

$$\mathbf{y} = [y_1\; y_2 \cdots y_N]^T \qquad (5)$$

$$\boldsymbol{\beta} = [\beta_1\; \beta_2 \cdots \beta_N]^T \qquad (6)$$

$$\mathbf{e} = [e_1\; e_2 \cdots e_N]^T \qquad (7)$$

$$\boldsymbol{\Phi} = [\boldsymbol{\phi}_1\; \boldsymbol{\phi}_2 \cdots \boldsymbol{\phi}_N] \qquad (8)$$

$$\boldsymbol{\phi}_i = [\phi_i(\mathbf{x}_1)\; \phi_i(\mathbf{x}_2) \cdots \phi_i(\mathbf{x}_N)]^T, \quad 1 \le i \le N \qquad (9)$$

The objective of sparse modeling is to construct a subset model consisting of $N_s$ ($\ll N$) significant regressors only from the full set of regressors defined in (9).
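As a concrete illustration of the regressors (3) and the full regression matrix (8)–(9), the following NumPy sketch builds $\boldsymbol{\Phi}$ with one kernel per training point and an individual diagonal covariance per kernel (function and variable names are ours, not from the paper):

```python
import numpy as np

def gaussian_regressor(X, mu, sigma2):
    """Generalized Gaussian kernel (3): exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))
    with a diagonal covariance Sigma = diag(sigma2), evaluated at all rows of X."""
    d = (X - mu) ** 2 / sigma2          # element-wise Mahalanobis terms
    return np.exp(-0.5 * d.sum(axis=1))

def full_regression_matrix(X, sigma2s):
    """Build Phi of (8)-(9): one column per training point used as kernel mean,
    with the i-th kernel using its own diagonal variances sigma2s[i]."""
    N = X.shape[0]
    Phi = np.empty((N, N))
    for i in range(N):
        Phi[:, i] = gaussian_regressor(X, X[i], sigma2s[i])
    return Phi
```

Setting all rows of `sigma2s` to the same value recovers the standard single-variance Gaussian kernel model described above.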
3 Determination of the full regression matrix

To specify the pool of regressors or the full regression matrix $\boldsymbol{\Phi}$, one needs to determine all the associated diagonal covariance matrices $\boldsymbol{\Sigma}_i$, $1 \le i \le N$. Let us start the discussion by defining the least squares cost or mean square error (MSE) associated with an $s$-term model as

$$J_s = \frac{1}{N} \sum_{t=1}^{N} (y_t - \hat{y}_t)^2 \qquad (10)$$

where for notational simplicity the same notation $\hat{y}_t$ is also used for representing the $s$-term model output. Obviously $J_0 = \mathbf{y}^T\mathbf{y}/N = \|\mathbf{y}\|^2/N$.

3.1 Correlation criterion

The correlation between a regressor $\boldsymbol{\phi}_i$ and the training data is defined by

$$C(\boldsymbol{\Sigma}_i) = \frac{\mathbf{y}^T \boldsymbol{\phi}_i}{\sqrt{(\mathbf{y}^T\mathbf{y})\,(\boldsymbol{\phi}_i^T \boldsymbol{\phi}_i)}} \qquad (11)$$

This correlation is a function of the regressor's kernel covariance matrix. We propose to use this correlation function as the optimization criterion to determine the regressor's kernel covariance matrix. Specifically, we should choose $\boldsymbol{\Sigma}_i$ so that $|C(\boldsymbol{\Sigma}_i)|$ is maximized. Why this is a good strategy to specify the pool of regressors can easily be explained. Assuming that $\boldsymbol{\phi}_i$ is selected to form a one-term model, the associated reduction in the MSE value can be shown to be

$$J_0 - J_1 = \frac{1}{N}\,\frac{(\mathbf{y}^T \boldsymbol{\phi}_i)^2}{\boldsymbol{\phi}_i^T \boldsymbol{\phi}_i} \qquad (12)$$

which can be rewritten as

$$J_0 - J_1 = \frac{\mathbf{y}^T\mathbf{y}}{N}\cdot\frac{(\mathbf{y}^T \boldsymbol{\phi}_i)^2}{(\mathbf{y}^T\mathbf{y})\,(\boldsymbol{\phi}_i^T \boldsymbol{\phi}_i)} = \frac{\|\mathbf{y}\|^2}{N}\,\big(C(\boldsymbol{\Sigma}_i)\big)^2 \qquad (13)$$

Since $\|\mathbf{y}\|^2$ is a constant, maximizing $|C(\boldsymbol{\Sigma}_i)|$ leads to a maximum reduction in the MSE value.

Having chosen the optimization criterion, we now turn our attention to the optimization algorithm. We propose a repeated guided random search method to perform the associated optimization tasks. This algorithm adopts ideas from boosting [10]–[12].
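The identity (13) is easy to verify numerically. The sketch below (our own illustration with synthetic data, not from the paper) computes the correlation criterion (11) and checks that the one-term MSE reduction equals $(\|\mathbf{y}\|^2/N)\,C^2$:

```python
import numpy as np

def correlation(y, phi):
    """Correlation criterion (11): C = y^T phi / sqrt((y^T y)(phi^T phi))."""
    return (y @ phi) / np.sqrt((y @ y) * (phi @ phi))

# Numerical check of (12)-(13): for the one-term least-squares model
# y ~ beta*phi, the MSE drop J0 - J1 equals (||y||^2 / N) * C^2.
rng = np.random.default_rng(0)
N = 50
y, phi = rng.normal(size=N), rng.normal(size=N)
beta = (y @ phi) / (phi @ phi)          # least-squares weight of the one-term model
J0 = (y @ y) / N                        # MSE of the empty model
J1 = np.sum((y - beta * phi) ** 2) / N  # MSE of the one-term model
C = correlation(y, phi)
```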
3.2 Weighted optimization algorithm

The task of maximizing $|C(\boldsymbol{\Sigma}_i)|$ with respect to $\boldsymbol{\Sigma}_i$ can be carried out by various optimization methods. For example, the global optimization methods, such as the genetic algorithm [13],[14] and adaptive simulated annealing [15],[16], can be used. A global optimization method however is generally computationally very costly and may be overkill, since in this application we only seek to tune a kernel's diagonal covariance matrix. Let us consider the following simple search method to perform this optimization. Given $P$ points of $\boldsymbol{\Sigma}$, $\boldsymbol{\Sigma}^{(1)}, \ldots, \boldsymbol{\Sigma}^{(P)}$, let $\boldsymbol{\Sigma}_{\mathrm{best}} = \arg\max\{|C(\boldsymbol{\Sigma}^{(j)})|,\; 1 \le j \le P\}$ and $\boldsymbol{\Sigma}_{\mathrm{worst}} = \arg\min\{|C(\boldsymbol{\Sigma}^{(j)})|,\; 1 \le j \le P\}$. A $(P+1)$th point is generated by a weighted combination of $\boldsymbol{\Sigma}^{(j)}$, $1 \le j \le P$. Because this weighted combination is a convex combination, the point $\boldsymbol{\Sigma}^{(P+1)}$ is always within the convex hull defined by the $P$ values. A $(P+2)$th point is then generated as the mirror image of $\boldsymbol{\Sigma}^{(P+1)}$, with respect to $\boldsymbol{\Sigma}_{\mathrm{best}}$, along the direction defined by $\boldsymbol{\Sigma}_{\mathrm{best}} - \boldsymbol{\Sigma}^{(P+1)}$. The better of $\boldsymbol{\Sigma}^{(P+1)}$ and $\boldsymbol{\Sigma}^{(P+2)}$ then replaces $\boldsymbol{\Sigma}_{\mathrm{worst}}$. The process is repeated until it converges.

A simple illustration is depicted in Fig. 1 for a one-dimensional case, where there are $P = 3$ points, $\Sigma^{(1)}$, $\Sigma^{(2)}$ and $\Sigma^{(3)}$, with $\Sigma^{(2)}$ being $\Sigma_{\mathrm{best}}$ and $\Sigma^{(3)}$ being $\Sigma_{\mathrm{worst}}$. The 4th value $\Sigma^{(4)}$ is a weighted combination of $\Sigma^{(1)}$, $\Sigma^{(2)}$ and $\Sigma^{(3)}$, and $\Sigma^{(5)}$ is the mirror image of $\Sigma^{(4)}$ with respect to $\Sigma^{(2)}$. As $\Sigma^{(4)}$ is better than $\Sigma^{(5)}$ in this case, it replaces $\Sigma^{(3)}$. Clearly, how the weighted combination is performed is critical. The weightings for $\Sigma^{(j)}$, $1 \le j \le P$, should reflect the "goodness" of $\Sigma^{(j)}$, and the process should be capable of self-learning or adapting these weightings. This is exactly the basic idea of boosting [10]–[12]. Specifically, by combining the AdaBoost algorithm of [11] with the above-mentioned simple search strategy, we arrive at the weighted optimization algorithm. Given the training data $D_N$ and for fitting the $i$th regressor's covariance matrix, the algorithm is summarized as follows.
Initialization: Set the iteration index $k = 0$, give $P$ randomly chosen initial values for $\boldsymbol{\Sigma}$: $\boldsymbol{\Sigma}^{(1)}(k), \boldsymbol{\Sigma}^{(2)}(k), \ldots, \boldsymbol{\Sigma}^{(P)}(k)$, with the associated weightings $\delta_j(k) = 1/P$ for $1 \le j \le P$, and specify a small positive value $\varepsilon$ for terminating the search.

Step 1: Boosting
1. Calculate the loss of each point in the population, namely $\mathrm{cost}_j = 1 - |C(\boldsymbol{\Sigma}^{(j)}(k))|$, $1 \le j \le P$.
2. Find $\boldsymbol{\Sigma}_{\mathrm{best}}(k) = \arg\min\{\mathrm{cost}_j,\; 1 \le j \le P\}$ and $\boldsymbol{\Sigma}_{\mathrm{worst}}(k) = \arg\max\{\mathrm{cost}_j,\; 1 \le j \le P\}$.
3. Normalize the loss: $\mathrm{loss}_j = \mathrm{cost}_j \big/ \sum_{l=1}^{P} \mathrm{cost}_l$, $1 \le j \le P$.
4. Compute a weighting factor $\beta_k$ according to $\eta_k = \sum_{j=1}^{P} \delta_j(k)\,\mathrm{loss}_j$, $\beta_k = \eta_k/(1 - \eta_k)$.
5. Update the weighting vector

$$\delta_j(k+1) = \begin{cases} \delta_j(k)\,\beta_k^{\mathrm{loss}_j}, & \beta_k \le 1 \\ \delta_j(k)\,\beta_k^{1-\mathrm{loss}_j}, & \beta_k > 1 \end{cases} \qquad 1 \le j \le P$$

6. Normalize the weighting vector: $\delta_j(k+1) = \delta_j(k+1) \big/ \sum_{l=1}^{P} \delta_l(k+1)$, $1 \le j \le P$.

Step 2: Parameter updating
1. Construct the $(P+1)$th point using the formula $\boldsymbol{\Sigma}^{(P+1)}(k) = \sum_{j=1}^{P} \delta_j(k+1)\,\boldsymbol{\Sigma}^{(j)}(k)$.
2. Construct the $(P+2)$th point using the formula $\boldsymbol{\Sigma}^{(P+2)}(k) = \boldsymbol{\Sigma}_{\mathrm{best}}(k) + \big(\boldsymbol{\Sigma}_{\mathrm{best}}(k) - \boldsymbol{\Sigma}^{(P+1)}(k)\big)$.
3. Choose the better point (smaller loss value) from $\boldsymbol{\Sigma}^{(P+1)}(k)$ and $\boldsymbol{\Sigma}^{(P+2)}(k)$ to replace $\boldsymbol{\Sigma}_{\mathrm{worst}}(k)$; the new point inherits the weighting value $\delta$ associated with $\boldsymbol{\Sigma}_{\mathrm{worst}}(k)$.

Set $k = k + 1$ and repeat from Step 1 until $\|\boldsymbol{\Sigma}_{\mathrm{best}}(k) - \boldsymbol{\Sigma}_{\mathrm{best}}(k-1)\| < \varepsilon$. Then choose the $i$th regressor covariance matrix as $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}_{\mathrm{best}}(k)$.

The algorithmic parameter that needs to be set appropriately is the population size $P$. The population size depends on the dimension of $\boldsymbol{\Sigma}$ and the objective function to be optimized. Generally, an appropriate value of $P$ has to be found empirically. This is very similar to, for example, the choice of population size in the genetic algorithm.
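The steps above can be sketched compactly as follows (an illustration of the idea, not the authors' code; the function names, the toy loss, and the iteration cap are ours). The candidate covariances are represented by their diagonal-variance vectors, and `loss_fn` plays the role of $1 - |C(\boldsymbol{\Sigma})|$:

```python
import numpy as np

def weighted_search(loss_fn, pop, n_iter=200, eps=1e-8):
    """Boosting-based weighted search of Section 3.2 (illustrative sketch).
    pop: (P, d) array of candidate diagonal-variance vectors."""
    pop = pop.copy()
    P = pop.shape[0]
    delta = np.full(P, 1.0 / P)                 # weightings, start uniform
    prev_best = None
    for _ in range(n_iter):
        cost = np.array([loss_fn(s) for s in pop])
        best, worst = cost.argmin(), cost.argmax()
        if prev_best is not None and np.linalg.norm(pop[best] - prev_best) < eps:
            break                               # best point stopped moving
        prev_best = pop[best].copy()
        loss = cost / cost.sum()                # step 1.3: normalized loss
        eta = delta @ loss                      # step 1.4: weighting factor
        beta = eta / (1.0 - eta)
        expo = loss if beta <= 1.0 else 1.0 - loss
        delta = delta * beta ** expo            # step 1.5: update weightings
        delta = delta / delta.sum()             # step 1.6: normalize
        s_new = delta @ pop                     # step 2.1: convex combination
        s_mir = 2.0 * pop[best] - s_new         # step 2.2: mirror about best
        cand = s_new if loss_fn(s_new) <= loss_fn(s_mir) else s_mir
        pop[worst] = cand                       # step 2.3: replace worst point
    cost = np.array([loss_fn(s) for s in pop])
    return pop[cost.argmin()]
```

Because the worst point is the one replaced, the best candidate found so far is never lost, so the returned solution is at least as good as the best member of the initial population.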
3.3 Repeated weighted optimization algorithm

The above weighted optimization algorithm performs a guided random search and the solution obtained may depend on the initial choice of population. To derive a robust algorithm that guarantees a global optimal solution, one may incorporate the full idea of the scatter search [17]–[19] with this weighted optimization algorithm. However, to avoid an overly complicated algorithm, we simply augment the algorithm into the following repeated weighted optimization algorithm. The aim is not to guarantee a global optimal solution. Rather it is to make sure that the algorithm will arrive at similar solutions regardless of the initial choices of population.

Initialization: Give a positive integer $R$ for controlling the maximum number of repetitions, and choose a small positive number $\varepsilon_r$ for terminating the search.

First generation: Randomly choose the $P$ initial population points $\boldsymbol{\Sigma}^{(1)}, \ldots, \boldsymbol{\Sigma}^{(P)}$, and call the weighted optimization algorithm to obtain a solution $\boldsymbol{\Sigma}_{\mathrm{best}}$.

Repeat loop: For $r = 1, \ldots, R$:
  Set $\boldsymbol{\Sigma}^{(1)} = \boldsymbol{\Sigma}_{\mathrm{best}}$, and randomly generate the other $P-1$ points $\boldsymbol{\Sigma}^{(j)}$ for $2 \le j \le P$.
  Call the weighted optimization algorithm to obtain a solution $\boldsymbol{\Sigma}_{\mathrm{best}}$.
  If $\|\boldsymbol{\Sigma}^{(1)} - \boldsymbol{\Sigma}_{\mathrm{best}}\| < \varepsilon_r$: Exit loop; End if
End for
Choose the $i$th regressor's covariance matrix as $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}_{\mathrm{best}}$.

The important algorithmic parameter that needs to be chosen appropriately is the termination criterion $\varepsilon_r$. Basically, $\varepsilon_r$ determines whether the solutions obtained in different runs of the weighted optimization are close enough to be regarded as the same solution. If too small a $\varepsilon_r$ is chosen, the loop may keep going for a long time. To safeguard against this, we also specify the maximum number of repetitions $R$. Again, appropriate values for $R$ and $\varepsilon_r$ depend on the dimension of $\boldsymbol{\Sigma}$ and how hard the objective function is to optimize. Also the choice of $P$ has some influence on the choice of $R$ and $\varepsilon_r$. Generally, these algorithmic parameters have to be found empirically.
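The repeat loop can be sketched generically as follows (our illustration, not the authors' code; `search` stands for one run of a weighted optimization over a population, and `rand_point` draws one random initial candidate):

```python
import numpy as np

def repeated_search(search, rand_point, P=6, R=10, eps_r=1e-3):
    """Repeated weighted optimization of Section 3.3 (illustrative sketch).
    search(pop) runs one guided search and returns its best point."""
    pop = np.array([rand_point() for _ in range(P)])
    best = search(pop)                                   # first generation
    for _ in range(R):                                   # repeat loop
        # seed the next run with the best solution found so far
        pop = np.array([best] + [rand_point() for _ in range(P - 1)])
        new_best = search(pop)
        converged = np.linalg.norm(best - new_best) < eps_r
        best = new_best
        if converged:
            break                                        # runs agree: stable solution
    return best
```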
4 OLS algorithm for subset model selection

Once the full regression matrix $\boldsymbol{\Phi}$ has been designed, the standard OLS algorithm [1],[2] can be used to select a subset model. Let an orthogonal decomposition of the regression matrix be

$$\boldsymbol{\Phi} = \mathbf{W}\mathbf{A} \qquad (14)$$

where

$$\mathbf{A} = \begin{bmatrix} 1 & a_{1,2} & \cdots & a_{1,N} \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & a_{N-1,N} \\ 0 & \cdots & 0 & 1 \end{bmatrix} \qquad (15)$$

and

$$\mathbf{W} = [\mathbf{w}_1\; \mathbf{w}_2 \cdots \mathbf{w}_N] \qquad (16)$$

with orthogonal columns that satisfy $\mathbf{w}_i^T \mathbf{w}_j = 0$ if $i \ne j$. The regression model (4) can alternatively be expressed as

$$\mathbf{y} = \mathbf{W}\mathbf{g} + \mathbf{e} \qquad (17)$$

where the orthogonal weight vector $\mathbf{g} = [g_1\; g_2 \cdots g_N]^T$ satisfies the triangular system

$$\mathbf{A}\boldsymbol{\beta} = \mathbf{g} \qquad (18)$$

Knowing $\mathbf{A}$ and $\mathbf{g}$, $\boldsymbol{\beta}$ can readily be solved from (18).

For the orthogonal regression model (17), the MSE

$$J_N = \frac{1}{N}\,\mathbf{e}^T\mathbf{e} \qquad (19)$$

can be expressed as

$$J_N = \frac{1}{N}\,\mathbf{y}^T\mathbf{y} - \frac{1}{N}\sum_{i=1}^{N} \big(\mathbf{w}_i^T\mathbf{w}_i\big)\, g_i^2 \qquad (20)$$

Thus the MSE for the $i$-term subset model can be expressed recursively as

$$J_i = J_{i-1} - \frac{1}{N}\big(\mathbf{w}_i^T\mathbf{w}_i\big)\, g_i^2 \qquad (21)$$

At the $i$th stage of regression, the $i$th term is selected to maximize the error reduction criterion [1],[2]

$$\mathrm{ER}_i = \frac{1}{N}\big(\mathbf{w}_i^T\mathbf{w}_i\big)\, g_i^2 \qquad (22)$$

The forward selection procedure is terminated at the $N_s$th stage if

$$J_{N_s} < \xi \qquad (23)$$

is satisfied, where the small positive scalar $\xi$ is a chosen tolerance. This produces a parsimonious model containing $N_s$ significant regressors.
In this study, we shall assume that an appropriate tolerance value $\xi$ can be chosen. It is worth emphasizing that the termination of the model construction process can alternatively be decided using cross validation [20],[21]. A simple method is to have a separate validation data set. The model construction is based on the training data set, while the performance of the selected model, the MSE (20), is monitored over the validation data set. The construction process is terminated when the MSE over the validation data set stops improving. Instead of using the pure least squares cost (20), it is also worth pointing out that other criteria can alternatively be adopted for the orthogonal forward selection, and these include regularization, optimal experimental design, and the leave-one-out cross validation criterion [4],[5].
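The forward selection loop (21)–(23) can be sketched compactly with in-place Gram-Schmidt orthogonalization of the candidate columns (our own illustration of the standard OLS procedure, not optimized library code):

```python
import numpy as np

def ols_select(Phi, y, tol=1e-3):
    """Forward selection by the error reduction criterion (22),
    terminating when the MSE recursion (21) drops below tol (23)."""
    N = Phi.shape[1]
    n = len(y)
    W = Phi.astype(float).copy()      # candidates, orthogonalized in place
    selected, g = [], []
    J = (y @ y) / n                   # J_0: MSE of the empty model
    while J >= tol and len(selected) < N:
        # error reduction (22) of each remaining candidate column
        err = np.full(N, -np.inf)
        for i in range(N):
            w = W[:, i]
            if i not in selected and w @ w > 1e-12:
                err[i] = (y @ w) ** 2 / (w @ w) / n
        i = int(err.argmax())
        if not np.isfinite(err[i]) or err[i] <= 0.0:
            break
        w = W[:, i]
        selected.append(i)
        g.append((y @ w) / (w @ w))   # orthogonal weight g_i
        J -= err[i]                   # MSE recursion (21)
        for j in range(N):            # remove w from the remaining candidates
            if j not in selected:
                W[:, j] -= (w @ W[:, j]) / (w @ w) * w
    return selected, np.array(g), J
```

Because each chosen column is removed from the remaining candidates, the retained columns are mutually orthogonal, so the error reductions of the selected terms sum exactly as in (20).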
5 Modeling examples

Two real data sets were used to demonstrate the effectiveness of the proposed sparse model construction algorithm. The population size $P$, the maximum number of repetitions $R$ and the termination criterion $\varepsilon_r$ were chosen empirically to ensure that the OLS subset selection procedure could produce consistent final models with the same levels of modeling accuracy and model sparsity over repeated runs.
Example 1. This example constructed a model representing the relationship between the fuel rack position (input $u_t$) and the engine speed (output $y_t$) for a Leyland TL11 turbocharged, direct injection diesel engine operated at low engine speed. A detailed system description and experimental setup can be found in [22]. The data set, depicted in Fig. 2, contained 410 samples. The first 210 data points were used in training and the last 200 points in model validation. The previous study [22] has shown that this data set can be modeled adequately as

$$y_t = f(\mathbf{x}_t) + e_t \qquad (24)$$

with $f(\cdot)$ describing the unknown underlying system and the system input vector defined by

$$\mathbf{x}_t = [y_{t-1}\; u_{t-1}\; u_{t-2}]^T \qquad (25)$$
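Forming the lagged input vectors of (25) from the recorded input-output sequences can be sketched as follows (a small helper of our own, not from the paper):

```python
import numpy as np

def lagged_inputs(u, y):
    """Form x_t = [y_{t-1}, u_{t-1}, u_{t-2}]^T of (25) and the matching
    targets y_t; the first two samples are consumed by the lags."""
    u, y = np.asarray(u, float), np.asarray(y, float)
    X = np.array([[y[t - 1], u[t - 1], u[t - 2]] for t in range(2, len(y))])
    return X, y[2:]
```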
The previous results [4],[5] have shown that, when fitting a Gaussian kernel model with a single common variance, $\sigma^2 = 1.69$ is the optimal value for this kernel variance. Since every training input data point was considered as a candidate regressor's center, there were 210 regressors for the full Gaussian kernel model. With the tolerance level set to $\xi = \ldots$, the OLS algorithm selected a 19-term subset model from the full regression model, and the resulting subset model is listed in Table 1. The MSE values of the resulting model were … for the training set and … for the validation set, respectively. Fig. 3 shows the corresponding model prediction $\hat{y}_t$ and the model prediction error $e_t = y_t - \hat{y}_t$.

The proposed sparse model construction algorithm was then applied to construct a generalized Gaussian kernel model. The algorithmic parameters of the repeated weighted optimization for kernel covariance fitting were chosen to be $P = \ldots$, $R = \ldots$ and $\varepsilon_r = \ldots$. Using the same tolerance level $\xi$, the OLS algorithm selected an 11-term subset model from the full generalized Gaussian kernel model, and the obtained model is listed in Table 2. The MSE values of this model were … over the training set and … over the validation set, respectively. The model prediction and prediction error generated by this model are illustrated in Fig. 4.
Example 2. This example constructed a model for the gas furnace data set (Series J in [23]). The data set, illustrated in Fig. 5, contained 296 pairs of input-output points, where the input $u_t$ was the coded input gas feed rate and the output $y_t$ represented the CO$_2$ concentration from the gas furnace. All the 296 data points were used in training, with the model input vector defined by

$$\mathbf{x}_t = [y_{t-1}\; y_{t-2}\; y_{t-3}\; u_{t-1}\; u_{t-2}\; u_{t-3}]^T \qquad (26)$$

For this data set, previous experiments have found that it was difficult for the existing state-of-the-art kernel regression techniques to fit a Gaussian kernel regression model using a common kernel variance [5]. Various existing state-of-the-art kernel regression techniques were then used in [5] to fit a thin-plate-spline regression model for this data set, and the best result obtained required at least 30 model terms to achieve a modeling accuracy of $J = \ldots$.
The proposed sparse model algorithm was employed to construct a generalized Gaussian kernel model for this data set. The kernel covariance matrices were first determined using the repeated weighted optimization with the following algorithmic parameters: $P = \ldots$, $R = \ldots$ and $\varepsilon_r = \ldots$. With the modeling accuracy set to the same level, the OLS algorithm constructed a 20-term subset model from the full generalized Gaussian kernel model, as listed in Table 3. The model prediction and prediction error generated by this model are shown in Fig. 6.
6 Conclusions
A novel construction algorithm has been developed for the generalized Gaussian kernel model.
Each kernel regressor in the pool of candidate regressors has an individual diagonal covariance
matrix, which is determined by maximizing the absolute value of the correlation between the
regressor and the training data using a repeated weighted search optimization. The standard
orthogonal least squares algorithm is then applied to select a parsimonious model from the full
regression matrix. Compared with the existing kernel regression modeling approaches which
adopt a single common kernel variance for all the regressors, the proposed method has the
advantages of improving modeling capability and producing sparser models. These advantages
have been demonstrated by the experimental results involving two real data sets.
References
[1] S. Chen, S.A. Billings and W. Luo, "Orthogonal least squares methods and their application to nonlinear system identification," Int. J. Control, Vol.50, No.5, pp.1873–1896, 1989.
[2] S. Chen, C.F.N. Cowan and P.M. Grant, “Orthogonal least squares learning algorithm for
radial basis function networks,” IEEE Trans. Neural Networks, Vol.2, No.2, pp.302–309,
1991.
[3] S. Chen, Y. Wu and B.L. Luk, “Combined genetic algorithm optimisation and regularised
orthogonal least squares learning for radial basis function networks,” IEEE Trans. Neural
Networks, Vol.10, No.5, pp.1239–1243, 1999.
[4] S. Chen, X. Hong and C.J. Harris, "Sparse kernel regression modelling using combined locally regularized orthogonal least squares and D-optimality experimental design," IEEE Trans. Automatic Control, Vol.48, No.6, pp.1029–1036, 2003.
[5] S. Chen, X. Hong, C.J. Harris and P.M. Sharkey, “Sparse modelling using orthogonal
forward regression with PRESS statistic and regularization,” IEEE Trans. Systems, Man
and Cybernetics, Part B, Vol.34, No.2, pp.898–911, 2004.
[6] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[7] V. Vapnik, S. Golowich and A. Smola, "Support vector method for function approximation, regression estimation, and signal processing," in: M.C. Mozer, M.I. Jordan and T. Petsche, Eds., Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press, 1997, pp.281–287.
[8] M.E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” J. Machine
Learning Research, Vol.1, pp.211–244, 2001.
[9] B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2002.
[10] R.E. Schapire, “The strength of weak learnability,” Machine Learning, Vol.5, No.2,
pp.197–227, 1990.
[11] Y. Freund and R.E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Computer and System Sciences, Vol.55, No.1, pp.119–139, 1997.
[12] R. Meir and G. Rätsch, "An introduction to boosting and leveraging," in: S. Mendelson and A. Smola, Eds., Advanced Lectures on Machine Learning. Springer-Verlag, 2003, pp.119–184.
[13] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley, 1989.
[14] K.F. Man, K.S. Tang and S. Kwong, Genetic Algorithms: Concepts and Design. London: Springer-Verlag, 1998.
[15] L. Ingber, “Simulated annealing: practice versus theory,” Mathematical and Computer
Modeling, Vol.18, No.11, pp.29–57, 1993.
[16] S. Chen and B.L. Luk, "Adaptive simulated annealing for optimization in signal processing applications," Signal Processing, Vol.79, No.1, pp.117–128, 1999.
[17] F. Glover, “A template for scatter search and path relinking,” in: J.K. Hao, E. Lutton,
E. Ronald , M. Schoenauer and D. Snyers, Eds., Artificial Evolution, Lecture Notes in
Computer Science, 1363. Springer, 1998, pp.13–54.
[18] F. Glover, “Scatter search and path relinking,” in: D. Corne, M. Dorigo and F. Glover,
Eds., New Ideas in Optimization. McGraw Hill, 1999, pp.297–316.
[19] F. Glover, M. Laguna and R. Martí, "Scatter search and path relinking: foundations and advanced designs," submitted for publication, 2002.
[20] M. Stone, “Cross validation choice and assessment of statistical predictions,” J. Royal
Statistics Society Series B, Vol.36, pp.117–147, 1974.
[21] R.H. Myers, Classical and Modern Regression with Applications. 2nd Edition, Boston: PWS-KENT, 1990.
[22] S.A. Billings, S. Chen and R.J. Backhouse, "The identification of linear and nonlinear models of a turbocharged automotive diesel engine," Mechanical Systems and Signal Processing, Vol.3, No.2, pp.123–142, 1989.
[23] G.E.P. Box and G.M. Jenkins, Time Series Analysis, Forecasting and Control. Holden-Day, 1976.
Table 1: Subset model generated for the engine data set by the OLS algorithm with a Gaussian kernel model of a single common variance.

step i | mean vector μ_i^T          | diagonal covariance Σ_i | weight β_i | MSE J_i
-------|----------------------------|-------------------------|------------|--------
0      |                            |                         |            | 1558.9
1      | (4.2823, 5.0245, 5.0245)   | diag{1.69, 1.69, 1.69}  | 109.2247   | 73.9841
2      | (2.8236, 3.7439, 3.7439)   | diag{1.69, 1.69, 1.69}  | 2.4249     | 34.7312
3      | (4.5954, 5.8200, 5.8200)   | diag{1.69, 1.69, 1.69}  | 16.0325    | 8.3802
4      | (3.1978, 5.8200, 3.7439)   | diag{1.69, 1.69, 1.69}  | 5.0481     | 7.5403
5      | (3.9310, 3.7439, 4.5006)   | diag{1.69, 1.69, 1.69}  | 2.0419     | 4.6502
6      | (4.2976, 5.0439, 5.0439)   | diag{1.69, 1.69, 1.69}  | 106.5281   | 2.9565
7      | (4.6183, 4.5006, 5.0051)   | diag{1.69, 1.69, 1.69}  | 0.1787     | 2.4999
8      | (3.2131, 5.8006, 5.8006)   | diag{1.69, 1.69, 1.69}  | 58.8794    | 1.5953
9      | (4.5725, 5.8006, 5.8006)   | diag{1.69, 1.69, 1.69}  | 17.0584    | 0.7767
10     | (3.9844, 4.5200, 4.5200)   | diag{1.69, 1.69, 1.69}  | 4.3978     | 0.5986
11     | (2.8618, 3.7439, 4.5200)   | diag{1.69, 1.69, 1.69}  | 25.1798    | 0.4682
12     | (3.4498, 4.5200, 3.7439)   | diag{1.69, 1.69, 1.69}  | 0.8959     | 0.3327
13     | (3.2284, 5.8006, 5.8006)   | diag{1.69, 1.69, 1.69}  | 61.2593    | 0.2065
14     | (2.9381, 3.7439, 4.5006)   | diag{1.69, 1.69, 1.69}  | 110.8486   | 0.1589
15     | (3.1520, 5.8006, 3.7245)   | diag{1.69, 1.69, 1.69}  | 4.5398     | 0.1292
16     | (3.6866, 5.8200, 5.8200)   | diag{1.69, 1.69, 1.69}  | 2.1195     | 0.1032
17     | (2.9763, 3.7439, 4.5200)   | diag{1.69, 1.69, 1.69}  | 91.5013    | 0.0758
18     | (3.3735, 3.7245, 4.5394)   | diag{1.69, 1.69, 1.69}  | 22.2389    | 0.0579
19     | (3.5491, 3.7439, 4.5200)   | diag{1.69, 1.69, 1.69}  | 16.7227    | 0.0528