Support Vector Regression Machines

Harris Drucker*   Chris J.C. Burges**   Linda Kaufman**   Alex Smola**   Vladimir Vapnik+

*Bell Labs and Monmouth University
Department of Electronic Engineering
West Long Branch, NJ 07764
**Bell Labs    +AT&T Labs
Abstract
A new regression technique based on Vapnik's concept of support vectors is introduced. We compare support vector regression (SVR) with a committee regression technique (bagging) based on regression trees and ridge regression done in feature space. On the basis of these experiments, it is expected that SVR will have advantages in high dimensionality space because SVR optimization does not depend on the dimensionality of the input space.

This is a longer version of the paper to appear in Advances in Neural Information Processing Systems 9 (proceedings of the 1996 conference).
1. Introduction
In the following, lower case bold characters represent vectors and upper case bold characters represent matrices. Superscript "t" represents the transpose of a vector. y represents either a vector (in bold) or a single observation of the dependent variable in the presence of noise. y^(p) indicates a predicted value due to the input vector x^(p) not seen in the training set.
Suppose we have an unknown function G(x) (the "truth") which is a function of a vector x (termed input space). The vector x^t = [x_1, x_2, ..., x_d] has d components, where d is termed the dimensionality of the input space. F(x,w) is a family of functions parameterized by w. ŵ is that value of w that minimizes a measure of error between G(x) and F(x,ŵ). Our objective is to estimate w with ŵ by observing the N training instances v_j, j = 1, ..., N. We will develop two approximations for the truth G(x). The first one is F_1(x,w), which we term a feature space representation. One (of many) such feature vectors is:

\mathbf{z}^t = [\,1,\; x_1, \ldots, x_d,\; x_1^2,\; x_1 x_2, \ldots,\; x_{d-1}x_d,\; x_d^2\,]
which is a quadratic function of the input space components. Using the feature space representation, then F_1(x,ŵ) = ŵ^t z, that is, F_1(x,w) is linear in feature space although it is quadratic in input space. In general, for a p'th order polynomial and d'th dimensional input space, the feature dimensionality f of F_1 is

f = \sum_{i=d-1}^{p+d-1} \binom{i}{d-1}, \qquad \text{where } \binom{n}{k} = \frac{n!}{k!(n-k)!}

From now on, when we write F_1(x,ŵ), we mean the feature space representation and we must determine the f components of ŵ from the N training vectors.
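As an illustration (not part of the original text; Python with the standard library is assumed), the feature dimensionality f can be computed directly from the binomial sum above:

    # Sketch: feature-space dimensionality f for a p-th order polynomial in d inputs.
    from math import comb

    def feature_dim(d, p):
        # f = sum_{i=d-1}^{p+d-1} C(i, d-1)
        return sum(comb(i, d - 1) for i in range(d - 1, p + d))

    print(feature_dim(10, 2))   # 66, quoted later for Friedman #1 with p=2
    print(feature_dim(4, 2))    # 15, for the four-variable Friedman functions
    print(feature_dim(100, 3))  # 176851, i.e. "exceeds 176,000"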
The second representation is a support vector regression (SVR) representation that was developed by Vladimir Vapnik (1995):

F_2(\mathbf{x},\alpha,\alpha^*) = \sum_{i=1}^{N}(\alpha_i^* - \alpha_i)(\mathbf{v}_i^t\mathbf{x} + 1)^p + b

F_2 is an expansion explicitly using the training examples. The rationale for calling it a support vector representation will be clear later, as will the necessity for having both an α* and an α rather than just one multiplicative constant. In this case we must choose the 2N+1 values of α_i, α_i* and b. If we expand the term raised to the p'th power, we find f coefficients that multiply the various powers and cross product terms of the components of x. So, in this sense F_1 looks very similar to F_2 in that they have the same number of terms. However F_1 has f free coefficients while F_2 has 2N+1 coefficients that must be determined from the N training vectors.
For instance, suppose we have a 100 dimensional input vector and use a third order polynomial. The dimensionality of feature space for both F_1 and F_2 exceeds 176,000, but for the feature space representation we have to determine over 176,000 coefficients, while for the SVR representation we have to determine 2N+1 coefficients. Thus, it may take a lot more data to estimate the coefficients in ŵ in the feature space representation.
We let α represent the 2N values of α_i and α_i*. The optimum values for the components of ŵ or α depend on our definition of the loss function and the objective function. Here the primal objective function is:

U\sum_{j=1}^{N} L\big[y_j - F(\mathbf{x}_j,\mathbf{w})\big] + \|\mathbf{w}\|^2

where L is a general loss function (to be defined later) and F could be F_1 or F_2, y_j is the observation of G(x) in the presence of noise, and the last term is a regularizer. The regularization constant is U, which in typical developments multiplies the regularizer but is placed in front of the first term for reasons discussed later.
If the loss function is quadratic, i.e., L[·] = [·]², and we let F = F_1, i.e., the feature space representation, the objective function may be minimized by using linear algebra techniques since the feature space representation is linear in that space. This is termed ridge regression (Miller, 1990). In particular, let V be a matrix whose i'th row is the i'th training vector represented in feature space (including the constant term "1" which represents a bias). V is a matrix where the number of rows is the number of examples (N) and the number of columns is the dimensionality of feature space f. Let E be the f×f diagonal matrix whose elements are 1/U. y is the N×1 column vector of observations of the dependent variable. We then solve the following matrix formulation for ŵ using a linear technique (Strang, 1986) with a linear algebra package (e.g., MATLAB):

\mathbf{V}^t\mathbf{y} = [\mathbf{V}^t\mathbf{V} + \mathbf{E}]\,\hat{\mathbf{w}}
The rationale for the regularization term is to trade off mean square error (the first term) in the objective function against the size of the ŵ vector. If U is large, then essentially we are minimizing the mean square error on the training set, which may give poor generalization to a test set. We find a good value of U by varying U to find the best performance on a validation set and then applying that U to the test set. U is very useful if the dimensionality of the feature set is larger than the number of examples.
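For concreteness, a minimal sketch of this ridge regression solve (assuming NumPy and an illustrative quadratic feature map of our own; the helper names are not from the text) is:

    import numpy as np

    def quadratic_features(X):
        # Map each row x to [1, x_1..x_d, x_i*x_j for i <= j]; f = 66 when d = 10.
        N, d = X.shape
        cols = [np.ones(N)] + [X[:, i] for i in range(d)]
        cols += [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
        return np.column_stack(cols)

    def ridge_fit(X, y, U):
        V = quadratic_features(X)          # N x f design matrix in feature space
        E = np.eye(V.shape[1]) / U         # f x f diagonal matrix with entries 1/U
        return np.linalg.solve(V.T @ V + E, V.T @ y)   # solves V'y = [V'V + E] w

    def ridge_predict(X, w):
        return quadratic_features(X) @ w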
Let us now define a different type of loss function termed an ε-insensitive loss (Vapnik, 1995):

L = \begin{cases} 0 & \text{if } |y_i - F_2(\mathbf{x}_i,\hat{\mathbf{w}})| < \varepsilon \\ |y_i - F_2(\mathbf{x}_i,\hat{\mathbf{w}})| - \varepsilon & \text{otherwise} \end{cases}

This defines an ε tube (Figure 1) so that if the predicted value is within the tube the loss is zero, while if the predicted point is outside the tube, the loss is the magnitude of the difference between the predicted value and the radius ε of the tube.
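A one-line sketch of this loss (NumPy assumed; not part of the original text) is:

    import numpy as np

    def eps_insensitive_loss(y, y_pred, eps):
        # Zero inside the eps tube, |y - y_pred| - eps outside it.
        return np.maximum(np.abs(y - y_pred) - eps, 0.0)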
Specifically, we minimize:
U\left(\sum_{i=1}^{N}\xi_i^* + \sum_{i=1}^{N}\xi_i\right) + \frac{1}{2}\,\mathbf{w}^t\mathbf{w}
where ξ_i or ξ_i* is zero if the sample point is inside the tube. If the observed point is "above" the tube, ξ_i is the positive difference between the observed value and ε and α_i will be nonzero. Similarly, ξ_i* will be nonzero if the observed point is below the tube and in this case α_i* will be nonzero. Since an observed point can not be simultaneously on both sides of the tube, either α_i or α_i* will be nonzero, unless the point is within the tube, in which case both constants will be zero.
If U is large, more emphasis is placed on the error, while if U is small, more emphasis is placed on the norm of the weights, leading to (hopefully) a better generalization. The constraints are (for all i, i = 1, ..., N):

y_i - (\mathbf{w}^t\mathbf{v}_i) - b \le \varepsilon + \xi_i
(\mathbf{w}^t\mathbf{v}_i) + b - y_i \le \varepsilon + \xi_i^*
\xi_i,\ \xi_i^* \ge 0
The corresponding Lagrangian is:

L = \frac{1}{2}(\mathbf{w}^t\mathbf{w}) + U\left(\sum_{i=1}^{N}\xi_i^* + \sum_{i=1}^{N}\xi_i\right) - \sum_{i=1}^{N}\alpha_i\big[y_i - (\mathbf{w}^t\mathbf{v}_i) - b + \varepsilon + \xi_i\big] - \sum_{i=1}^{N}\alpha_i^*\big[(\mathbf{w}^t\mathbf{v}_i) + b - y_i + \varepsilon + \xi_i^*\big] - \sum_{i=1}^{N}\big(\gamma_i^*\xi_i^* + \gamma_i\xi_i\big)

where the γ_i, γ_i* and α_i, α_i* are Lagrange multipliers.
We find a saddle point of L (Vapnik, 1995) by differentiating with respect to w, b, and ξ, which results in the equivalent maximization of the (dual space) objective function:

W(\alpha,\alpha^*) = -\varepsilon\sum_{i=1}^{N}(\alpha_i^* + \alpha_i) + \sum_{i=1}^{N} y_i(\alpha_i^* - \alpha_i) - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)(\mathbf{v}_i^t\mathbf{v}_j + 1)^p

with the constraints:

0 \le \alpha_i,\ \alpha_i^* \le U, \quad i = 1, \ldots, N, \qquad \sum_{i=1}^{N}\alpha_i^* = \sum_{i=1}^{N}\alpha_i
We must find N Lagrange multiplier pairs (α_i, α_i*). We can also prove that the product of α_i and α_i* is zero, which means that at least one of these two terms is zero. A v_i corresponding to a non-zero α_i or α_i* is termed a support vector. There can be at most N support vectors. Suppose now we have a new vector x^(p); then the corresponding prediction of y^(p) is:
y^{(p)} = \sum_{i=1}^{N}(\alpha_i^* - \alpha_i)(\mathbf{v}_i^t\mathbf{x}^{(p)} + 1)^p + b
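As an illustrative sketch (NumPy assumed; the function name is ours), this prediction for a new input, given the training vectors and the multipliers, is:

    import numpy as np

    def svr_predict(x_new, v, a, a_star, b, p):
        # v: N x d training vectors; a, a_star: length-N multipliers; b: bias.
        k = (v @ x_new + 1.0) ** p              # polynomial kernel against each v_i
        return np.sum((a_star - a) * k) + b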
Maximizing W is a quadratic programming problem, but the above expression for W is not in standard form for use in quadratic programming packages (which usually do minimization). If we let

\boldsymbol{\beta}^t = [\alpha_1^*, \ldots, \alpha_N^*, \alpha_1, \ldots, \alpha_N]

then we minimize:

\frac{1}{2}\boldsymbol{\beta}^t\mathbf{Q}\boldsymbol{\beta} + \mathbf{c}^t\boldsymbol{\beta}

subject to the constraints

\sum_{i=1}^{N}\beta_i = \sum_{i=N+1}^{2N}\beta_i \quad \text{and} \quad 0 \le \beta_i \le U, \quad i = 1, \ldots, 2N

where

\mathbf{c}^t = [\varepsilon - y_1,\ \varepsilon - y_2,\ \ldots,\ \varepsilon - y_N,\ \varepsilon + y_1,\ \varepsilon + y_2,\ \ldots,\ \varepsilon + y_N]

\mathbf{Q} = \begin{bmatrix} \mathbf{D} & -\mathbf{D} \\ -\mathbf{D} & \mathbf{D} \end{bmatrix}, \qquad d_{ij} = (\mathbf{v}_i^t\mathbf{v}_j + 1)^p, \quad i,j = 1, \ldots, N
Note that no matter what the power p, this remains a quadratic optimization problem. The rationale for the notation of U as the regularization constant is that it now appears as an upper bound on the β and α vectors. These quadratic programming problems can be very CPU and memory intensive. Fortunately, we can devise programs that make use of the fact that for problems with few support vectors (in comparison to the sample size), storage space is proportional to the number of support vectors. We use an active set method (Bunch and Kaufman, 1980) to solve this quadratic programming problem.
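A sketch of assembling this standard-form quadratic program (NumPy assumed; the choice of QP solver is left open, and the helper name is ours) is:

    import numpy as np

    def build_qp(v, y, eps, p):
        # beta = [a_1*, ..., a_N*, a_1, ..., a_N]; minimize (1/2) beta'Q beta + c'beta
        # subject to 0 <= beta_i <= U and sum of first N betas = sum of last N betas.
        N = v.shape[0]
        D = (v @ v.T + 1.0) ** p                    # d_ij = (v_i'v_j + 1)^p
        Q = np.block([[D, -D], [-D, D]])            # 2N x 2N
        c = np.concatenate([eps - y, eps + y])      # [eps - y_1, ..., eps + y_N]
        A_eq = np.concatenate([np.ones(N), -np.ones(N)])[None, :]   # equality row
        return Q, c, A_eq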
Although we may find ŵ in the SVR representation, it is not necessary. That is, the SVR representation is not explicitly a function of ŵ. Similarly, if we use radial basis functions, the expansion is not an explicit function of ŵ. That is, we can express the predicted values as:

y^{(p)} = \sum_{i=1}^{N}(\alpha_i^* - \alpha_i)\exp\!\big[-\gamma\|\mathbf{v}_i - \mathbf{x}^{(p)}\|^2\big] + b

In this case, the elements of the Q matrix above are

d_{ij} = \exp\!\big[-\gamma\|\mathbf{v}_i - \mathbf{v}_j\|^2\big]

and ŵ cannot be explicitly obtained.
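The corresponding kernel matrix can be sketched as follows (NumPy assumed; gamma plays the role of the width parameter above):

    import numpy as np

    def rbf_kernel(V, gamma):
        # d_ij = exp(-gamma * ||v_i - v_j||^2)
        sq = np.sum(V ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * V @ V.T
        return np.exp(-gamma * d2)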
2. Nonlinear Experiments
We tried three artificial functions from (Friedman, 1991) and a problem (Boston Housing) from the UCI database. Because the first three problems are artificial, we know both the observed values and the truths.

Friedman #1 is a nonlinear prediction problem which has 10 independent variables that are uniform in [0,1]:

y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + n
where n is N(0,1). Therefore, only five predictor variables are really needed, but the predictor is faced with the problem of trying to distinguish the variables that have no prediction ability (x_6 to x_10) from those that have predictive ability (x_1 to x_5).
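A data-generation sketch for this benchmark (NumPy assumed; the function name is ours) is:

    import numpy as np

    def friedman1(n, seed=0):
        # Ten inputs uniform on [0,1]; only the first five enter the target; N(0,1) noise.
        rng = np.random.default_rng(seed)
        X = rng.uniform(0.0, 1.0, size=(n, 10))
        y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
             + 20.0 * (X[:, 2] - 0.5) ** 2
             + 10.0 * X[:, 3] + 5.0 * X[:, 4]
             + rng.normal(0.0, 1.0, size=n))
        return X, y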
Friedman #2 and #3 have four independent variables and are respectively:

\#2: \quad y = \left(x_1^2 + \left(x_2 x_3 - \frac{1}{x_2 x_4}\right)^2\right)^{1/2} + n

\#3: \quad y = \tan^{-1}\!\left[\frac{x_2 x_3 - 1/(x_2 x_4)}{x_1}\right] + n

where the noise is adjusted to give a 3:1 ratio of signal power to noise power and the variables are uniformly distributed in the following ranges: 0 ≤ x_1 ≤ 100, 40π ≤ x_2 ≤ 560π, 0 ≤ x_3 ≤ 1, and 1 ≤ x_4 ≤ 11.
Boston Housing: This has 506 cases with the dependent variable being the median price of housing in the Boston area. There are twelve continuous predictor variables. This data was obtained from the UCI database (anonymous ftp at ftp.ics.uci.edu in directory /pub/machine-learning-databases). In this case, we have no "truth", only the observations.
In addition to the input space representation and the SVR representation, we also tried
bagging. Bagging is a technique that combines regressors, in this case regression trees
(Breiman, 1994). We used this technique because we had a local version available. Our
implementation of regression trees is different than Breiman’s but we obtained slightly
better results. Breiman uses cross validation to pick the best tree and prune while we use
a separate validation set. In the case of regression trees, the validation set was used to
prune the trees.
Suppose we have test points with input vectors x_i^(p), i = 1, ..., M, and make a prediction y_i^(p) using any procedure discussed here. Suppose y_i is the actually observed value, which is the truth G(x_i) plus noise. We define the prediction error (PE) and the modeling error (ME):

PE = \frac{1}{M}\sum_{i=1}^{M}\big(y_i^{(p)} - y_i\big)^2, \qquad ME = \frac{1}{M}\sum_{i=1}^{M}\big(y_i^{(p)} - G(\mathbf{x}_i)\big)^2
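A sketch of the two measures (NumPy assumed) is:

    import numpy as np

    def prediction_error(y_pred, y_obs):
        # PE: mean squared difference from the noisy observations.
        return np.mean((y_pred - y_obs) ** 2)

    def modeling_error(y_pred, g_true):
        # ME: mean squared difference from the noise-free truth G(x).
        return np.mean((y_pred - g_true) ** 2)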
For the three Friedman functions we calculated both the prediction
error
and modeling
error. For Boston Housing, since the “truth” was not known, we calculated the prediction
error only. For the three Friedman functions, we generated (for each experiment) 200
training set examples and 40 validation set examples. The validation set examples were
used to find the optimum regularization constant in the feature space representation. The following procedure was followed. Train on the 200 members of the training set with a choice of regularization constant and obtain the prediction error on the validation set. Now repeat with a different regularization constant until a minimum of prediction error occurs on the validation set. Now, use that regularization constant that minimizes the validation set prediction error and test on a 1000 example test set. This experiment was repeated for 100 different training sets of size 200 and validation sets of size 40 but one test set of size 1000.
Different size polynomials were tried (maximum power 3). Second order polynomials fared best. For Friedman function #1, the dimensionality of feature space is 66, while for the last two problems the dimensionality of feature space was 15 (for p=2). Thus the size of the feature space is smaller than the number of examples and we would expect that a feature space representation should do well.
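A sketch of this validation-set selection of the regularization constant (NumPy assumed; fit and predict stand for whichever training and prediction routines are being tuned and are hypothetical names) is:

    import numpy as np

    def select_U(fit, predict, Xtr, ytr, Xval, yval, candidates):
        # Train with each candidate U, keep the one with the smallest validation MSE.
        best_U, best_err = None, np.inf
        for U in candidates:
            model = fit(Xtr, ytr, U)
            err = np.mean((predict(Xval, model) - yval) ** 2)
            if err < best_err:
                best_U, best_err = U, err
        return best_U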
A similar procedure was followed for the SVR representation except the regularizer constant U, ε, and power p were varied to find the minimum validation prediction error.
In the majority of cases
p=2
was the optimum choice of power.
For the Boston Housing data, we picked randomly from the 506 cases using a training set
of size 401, a validation set of size 80 and a test set of size 25. This was repeated 100
times. The optimum power as picked by the validation set varied between p=4 and p=5.
3. Results of Experiments
The first experiments we tried were bagging regression trees versus support vector regression (Table I).

Table I. Modeling error and prediction error on the three Friedman problems (100 trials).
Rather than report the standard error, we did a comparison for each training set. That is, for the first experiment we tried both SVR and bagging on the same training, validation, and test set. If SVR had a better modeling error on the test set, it counted as a win. Thus for Friedman #1, SVR was always better than bagging on the 100 trials. There is no clear winner for Friedman function #3. In Table II below we normalized the modeling error by the variance of the truth while the prediction error was normalized by the variance of the observed data on the test set. This shows that the worst modeling error is on Friedman #1 while the worst prediction error is on function #3.
Table II. Modeling error and prediction error normalized.
Subsequent to our comparison of bagging to SVR, we attempted working directly in feature space. That is, we used F_1 as our approximating function with square loss and a second degree polynomial. The results of this ridge regression (Table III) are better than SVR. In retrospect, this is not surprising since the dimensionality of feature space is small (f=66 for Friedman #1 and f=15 for the two remaining functions) in relation to the number of training examples (200). This was due to the fact that the best approximating polynomial is second order. The other advantages of the feature space representation in this particular case are that both PE and ME are mean squared error and the loss function is mean squared error also.
Table III. Modeling error for SVR and feature space polynomial approximation.

        SVR       feature space
#1      .67       .61
#2      4,944     3,051
#3      .0261     .0176
We now ask the question whether U and ε are important in SVR by comparing the results in Table I with the results obtained by setting ε to zero and U to 100,000, making the regularizer insignificant (Table IV). On Friedman #2 (and less so on Friedman #3), the proper choices of ε and U are important.

Table IV. Comparing the results above with those obtained by setting ε to zero and U to 100,000 (labeled suboptimum).
For the case of Boston Housing, the prediction error using bagging was 12.4 while for SVR we obtained 7.2, and SVR was better than bagging on 71 out of 100 trials. The optimum power seems to be about five. We never were able to get the feature space representation to work well because the number of coefficients to be determined (6885) was much larger than the number of training examples (401).
One important characterization of a learning method is bias (sometimes termed bias-squared) and variance. Let ȳ_i^(p) be the average prediction on the test set for test point i. y_ij^(p) is the prediction for experiment j, test point i. Recall that the test sets are identical for the 100 experiments, but 100 different training and validation sets were used. When

\bar{y}_i^{(p)} = \frac{1}{100}\sum_{j=1}^{100} y_{ij}^{(p)}

then the bias is:

\text{bias} = \frac{1}{M}\sum_{i=1}^{M}\big[\bar{y}_i^{(p)} - G(\mathbf{x}_i)\big]^2

The last term in the brackets is the truth, so essentially the bias is the mean square difference between the truth and the average prediction. The variance is:

\text{variance} = \frac{1}{100}\,\frac{1}{M}\sum_{j=1}^{100}\sum_{i=1}^{M}\big[y_{ij}^{(p)} - \bar{y}_i^{(p)}\big]^2

The variance is the mean square difference between the prediction for each experiment at each test point and the average prediction for that test point (Table V). Thus, in comparing bagging to SVR, the largest effect is the reduction in bias.
Table V: the bias and variance for bagging and SVR.

        bias        bias       variance    variance
        bagging     SVR        bagging     SVR
#1      1.74        .0548      .52         .61
#2      3111        90.28      7074        5327
#3      .1440       .0144      .008        .011
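These two quantities can be sketched as follows (NumPy assumed), given a matrix preds of shape (number of runs, M) holding the y_ij and the noise-free truth g of length M:

    import numpy as np

    def bias_variance(preds, g):
        avg = preds.mean(axis=0)                        # average prediction per test point
        bias = np.mean((avg - g) ** 2)                  # mean sq. diff. from the truth
        variance = np.mean((preds - avg[None, :]) ** 2)
        return bias, variance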
4. Linear Experiments
We investigated the behavior of support vector regression on two linear problems. The first linear problem was as follows:

y = 2x_1 + x_2 + x_3
In addition there are 27 other components of the vector x, but y is not a (direct) function of these other components. The x's are picked from a N(0,1) distribution and the x's are correlated with E[x_1 x_j] = .8^j for j = 2, ..., 30.
We used normal, uniform and Laplacian
noise scaled to give different signal to noise ratios. Thus if there were no noise, our
predictor would only have three variables. However, every variable has some predictive
power and may help in the presence of noise. We use 60 training examples and 5000 test
examples per
run
with 100 runs. This problem is similar in spirit to that of Breiman
(1994). We compared ordinary least squares (ols), SVR, and forward subset selection
(fss). In fss, we first find the best (of thirty variables) in doing an ols prediction of the observed data. We fix that one variable and then find the next variable that, used in combination with the first (fixed) variable, does the best ols prediction. This is continued until all thirty variables are used. Because the training error will continuously decrease as variables are added, we use a separate validation set of size 20% of the training set size. Every time we find an additional independent variable to add, we test on the validation set. A plot of validation set performance (mean square error) versus number of variables will show a minimum of the mean squared error and that is the optimum number of variables to be used. We then find the performance on a test set of size 5000 using the optimum set of variables predicted by the validation set.
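A sketch of this forward subset selection with a validation set (NumPy assumed; helper names are ours) is:

    import numpy as np

    def ols_mse(Xf, yf, Xe, ye, cols):
        # Fit ols on (Xf, yf) restricted to the chosen columns, return MSE on (Xe, ye).
        A = np.column_stack([np.ones(len(Xf))] + [Xf[:, c] for c in cols])
        w, *_ = np.linalg.lstsq(A, yf, rcond=None)
        Ae = np.column_stack([np.ones(len(Xe))] + [Xe[:, c] for c in cols])
        return np.mean((Ae @ w - ye) ** 2)

    def forward_selection(Xtr, ytr, Xval, yval):
        remaining, chosen, history = list(range(Xtr.shape[1])), [], []
        while remaining:
            # Greedily add the variable giving the best ols fit to the training data.
            best = min(remaining, key=lambda c: ols_mse(Xtr, ytr, Xtr, ytr, chosen + [c]))
            chosen.append(best)
            remaining.remove(best)
            history.append((list(chosen), ols_mse(Xtr, ytr, Xval, yval, chosen)))
        # Keep the subset size that minimizes the validation MSE.
        return min(history, key=lambda t: t[1])[0]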
We report the prediction error for the three types of noise and three prediction procedures (Table VI):

Table VI. Prediction error for normal, uniform, and Laplacian noise using ordinary least squares (ols), support vector regression (SVR), and forward subset selection (fss) for different signal-to-noise ratios (SNR).

             normal                  Laplacian               uniform
SNR     ols    SVR    fss      ols    SVR    fss      ols    SVR    fss
 .8     45.8   29.3   28.0     40.8   25.4   24.5     39.7   28.1   24.1
1.2     20.0   14.9   12.3     18.1   12.5   10.9     17.6   12.8   11.7
2.5     4.61   3.97   3.12     4.17   3.26   2.48     4.06   3.60   2.76
5.0     1.15   1.33   .768     1.04   .516   .599     1.02   1.08   .617
Although SVR is better than ols, fss and SVR give similar performance. At signal to noise ratios larger than 5, forward subset selection is better than ols but similar to SVR. The above experiment was for a training set of size 60. For training sets of size 200, forward subset selection gives the best results.
The above linear problem has a special structure, namely that there are really only three predictor variables and there are a total of thirty to choose from, so it might be expected that forward subset selection would perform well.

We therefore tried the linear prediction problem where the output is a function of all the variables:

y = \sum_{i=1}^{30} x_i + n
In this case, both ols and fss give similar results. SVR is never worse and sometimes slightly better at low SNRs. However, at high SNRs, SVR is worse than ols or forward subset selection.
5. Conclusions
Support vector regression was compared to bagging and a feature space representation on
four nonlinear problems. On three of these problems a feature space representation was
best, bagging was
worst,
and SVR came in second. On the fourth problem, Boston
Housing, SVR was best and we were unable to construct a feature space representation
because of the high dimensionality required of the feature space. On linear problems,
forward subset selection seems to be the method of choice for the two linear problems we
tried at varying signal to noise ratios.
In retrospect, the problems we decided to test on were too simple. SVR probably has
greatest use when the dimensionality of the input space and the order of the
approximation creates a dimensionality of a feature space representation much larger than
that of the number of examples. This was not the case for the problems we considered.
We thus need real life examples that fulfill these requirements.
6. Acknowledgements
This project was supported by ARPA contract number
NOG+ll4-94-C-1086.
7. References
Leo Breiman, "Bagging Predictors", Technical Report 421, September 1994, Department of Statistics, University of California, Berkeley, CA. Also at anonymous ftp site: ftp.stat.berkeley.edu/pub/tech-reports/421.ps.Z.

James R. Bunch and Linda C. Kaufman, "A Computational Method for the Indefinite Quadratic Programming Problem", Linear Algebra and Its Applications, Elsevier-North Holland, 1980.

Jerry Friedman, "Multivariate Adaptive Regression Splines", Annals of Statistics, Vol. 19, No. 1, pp. 1-141, 1991.

Alan J. Miller, Subset Selection in Regression, Chapman and Hall, 1990.

Gilbert Strang, Introduction to Applied Mathematics, Wellesley-Cambridge Press, 1986.

Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
Figure 1: The parameters for the support vector regression.