Support Vector Regression Machines
Harris Drucker*, Chris J.C. Burges**, Linda Kaufman**, Alex Smola**, Vladimir Vapnik+
*Bell Labs and Monmouth University
Department of Electronic Engineering
West Long Branch, NJ 07764
**Bell Labs +AT&T Labs
A new regression technique based on Vapnik’s concept of support
vectors is introduced. We compare support vector regression (SVR)
with a committee regression technique (bagging) based on regression
trees and ridge regression done in feature space. On the basis of these
experiments, it is expected that SVR will have advantages in high
dimensionality space because SVR optimization does not depend on the
dimensionality of the input space.
This is a longer version of the paper to appear in Advances in Neural
Information Processing Systems 9 (Proceedings of the 1996 Conference),
eds. M.C. Mozer, M.I. Jordan, and T. Petsche, pp. 155-161, MIT Press.
In the following, lower case bold characters represent vectors and upper case bold
characters represent matrices. Superscript T represents the transpose of a vector. y
represents either a vector (in bold) or a single observation of the dependent variable in the
presence of noise. y(x) indicates a predicted value due to an input vector x not seen in
the training set.
Suppose we have an unknown underlying function G(x) (the truth) which is a function of
a vector x (termed input space). The vector x has d components where d is
termed the dimensionality of the input space. F(x, w) is a family of functions
parameterized by w. ŵ is that value of w that minimizes a measure of error between
G(x) and F(x, ŵ). Our objective is to estimate w with
ŵ by observing the N training instances v_i, i = 1, ..., N.
We will develop two approximations for the truth G(x). The first one is F1(x, w), which
we term a feature space representation. One (of many) such feature vectors is:

z = [x_1^2, ..., x_d^2, x_1 x_2, ..., x_i x_j, ..., x_{d-1} x_d, x_1, ..., x_d, 1]

which is a quadratic function of the input space components. Using the feature space
representation, F1(x, w) = w^T z, which is linear in feature space although
it is quadratic in input space. In general, for a p’th order polynomial and d’th
dimensional input space, the feature dimensionality f of F1(x, w) is

f = sum over k = 1, ..., p of (d + k - 1)! / (k! (d - 1)!), plus 1

From now on, when we write F1, we mean the feature
space representation and we must determine the f components of w from the N training
vectors.
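The quadratic feature map described above can be sketched in a few lines of numpy. This is our own minimal illustration, not code from the paper; the function name is ours:

```python
import numpy as np

def quadratic_features(x):
    """Quadratic feature vector for input x: all squares, all cross
    products, all linear terms, and a trailing constant 1, as in the
    feature vector z described in the text."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    squares = [x[i] * x[i] for i in range(d)]
    crosses = [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.array(squares + crosses + list(x) + [1.0])

# For d = 10 and p = 2 this gives 10 + 45 + 10 + 1 = 66 features,
# matching the feature dimensionality quoted later for Friedman #1.
z = quadratic_features(np.ones(10))
```

The count grows quickly with d and p, which is the point made below about how many coefficients must be estimated.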
The second representation is a support vector regression (SVR) representation that was
developed by Vladimir Vapnik (1995):

F2(x) = sum over i = 1, ..., N of (α_i* - α_i)(v_i^T x + 1)^p, plus b
F2 is an expansion explicitly using the training examples. The rationale for calling it a
support vector representation will be clear later, as will the necessity for having both an
α* and an α rather than just one multiplicative constant. In this case we must choose the
2N + 1 values of α_i, α_i*, and b. If we expand the term raised to the p’th power, we find f
coefficients that multiply the various powers and cross products of the components
of x. So, in this sense F2 looks very similar to F1 in that they have the same number of
coefficients that must be determined from the N training vectors.
For instance, suppose we have a high-dimensional input vector and use a third order
polynomial. The dimensionality of feature space for both F1 and F2 is then large, but for
the feature space representation we have to determine all f coefficients, while for the
SVR representation we have to determine 2N + 1 coefficients. Thus, it may take a lot
more data to estimate the coefficients of the feature space representation.
Let α represent the 2N values of α_i and α_i*. The optimum values for the components
of w or α depend on our definition of the loss function and the objective function. Here
the primal objective function is:

U sum over i = 1, ..., N of L[y_i - F(v_i)], plus ||w||^2 / 2

where L is a general loss function (to be defined later) and F could be F1 or F2, y_i is
the observation of the truth in the presence of noise, and the last term is a
regularizer. The regularization constant is U,
which in typical developments multiplies the regularizer but
is placed in front of the first term for reasons discussed later.
If the loss function is quadratic, i.e., L[.] = [.]^2, and we let F = F1, i.e., the feature space
representation, the objective function may be minimized by using linear algebra
techniques since the feature space representation is linear in that space. This is termed
ridge regression (Miller, 1990). In particular, let V be a matrix whose i’th row is the i’th
training vector represented in feature space (including the constant term “1” which
represents a bias). V is a matrix where the number of rows is the number of examples
(N) and the number of columns is the dimensionality of feature space f. Let E be the
f x f diagonal matrix whose elements are 1/U. y is the N x 1 column vector of
the dependent variable. We then solve the following matrix formulation for w using a
linear technique (Strang, 1986) with a linear algebra package (e.g., MATLAB):

V^T y = [V^T V + E] w
The rationale for the regularization term is to trade off mean square error (the first term)
in the objective function against the size of the w vector. If U is large, then essentially we
are minimizing the mean square error on the training set, which may give poor
generalization to a test set. We find a good value of U by varying U to find the best
performance on a validation set and then applying that U to the test set. U is very useful
if the dimensionality of the feature set is larger than the number of examples.
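The matrix solve above is short enough to sketch directly. This is a minimal numpy illustration of the normal-equations form, with toy data of our own (the function name and example are not from the paper):

```python
import numpy as np

def ridge_fit(V, y, U):
    """Solve V^T y = (V^T V + E) w, where E is diagonal with entries
    1/U. V holds one feature-space training vector per row (with a
    final column of ones for the bias); larger U weights training
    error more heavily relative to the regularizer."""
    f = V.shape[1]
    E = np.eye(f) / U
    return np.linalg.solve(V.T @ V + E, V.T @ y)

# Tiny illustration: with a very large U the exact linear relation
# y = 2*x + 1 is recovered almost perfectly.
V = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])  # feature + bias column
y = np.array([1.0, 3.0, 5.0])
w = ridge_fit(V, y, U=1e8)
```

Sweeping U and scoring each fit on a validation set, as the text describes, is then a simple loop around `ridge_fit`.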
Let us now define a different type of loss function, termed an ε-insensitive loss (Vapnik,
1995):

L = 0 if |y - F2(x)| < ε
L = |y - F2(x)| - ε otherwise

This defines an ε tube (Figure 1) so that if the predicted value is within the tube the loss
is zero, while if the predicted point is outside the tube, the loss is the magnitude of the
difference between the predicted value and the radius ε of the tube.
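The ε-insensitive loss is simple to state as code; this one-function sketch (our own naming) mirrors the two-case definition above:

```python
def eps_insensitive_loss(y_obs, y_pred, eps):
    """Zero inside the eps-tube; otherwise the distance beyond the
    tube boundary. This is the two-case loss defined in the text."""
    gap = abs(y_obs - y_pred) - eps
    return max(gap, 0.0)

# A point 0.3 away with a tube of radius 0.5 costs nothing;
# a point 1.0 away costs the 0.5 by which it overshoots the tube.
```

Unlike squared loss, the cost grows only linearly outside the tube, which makes the estimator less sensitive to large outliers.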
Specifically, we minimize:

U (sum over i of ξ_i* + sum over i of ξ_i) + ||w||^2 / 2

where ξ_i* is zero if the sample point is inside the tube. If the observed point is
“above” the tube, ξ_i* is the positive difference between the observed value and the tube
boundary, and α_i* will be nonzero. Similarly, ξ_i
will be nonzero if the observed point is below the tube,
and in this case α_i will be nonzero. Since an observed point cannot be simultaneously
on both sides of the tube, either α_i* or α_i will be nonzero, unless the point is within the
tube, in which case both constants will be zero. If U is large, more emphasis is placed on the
error, while if U is small, more emphasis is
placed on the norm of the weights, leading to (hopefully) a better generalization. The
constraints are (for all i, i = 1, ..., N):

y_i - F2(x_i) ≤ ε + ξ_i*
F2(x_i) - y_i ≤ ε + ξ_i
ξ_i ≥ 0, ξ_i* ≥ 0
The corresponding Lagrangian is:

L = ||w||^2 / 2 + U sum over i of (ξ_i + ξ_i*)
    - sum over i of α_i* [ε + ξ_i* - y_i + F2(x_i)]
    - sum over i of α_i [ε + ξ_i + y_i - F2(x_i)]
    - sum over i of (γ_i ξ_i + γ_i* ξ_i*)

We find a saddle point of L (Vapnik, 1995) by differentiating with respect to w, b, ξ_i, and ξ_i*,
which results in the equivalent maximization of the (dual space) objective function:

W(α, α*) = -ε sum over i of (α_i* + α_i) + sum over i of y_i (α_i* - α_i)
           - (1/2) sum over i, j of (α_i* - α_i)(α_j* - α_j)(v_i^T v_j + 1)^p

with the constraints:

0 ≤ α_i, α_i* ≤ U, i = 1, ..., N
sum over i = 1, ..., N of (α_i* - α_i) = 0

We must find N Lagrange multiplier pairs (α_i, α_i*).
We can also prove that the product of α_i and α_i* is zero,
which means that at least one of these two terms is zero. A
v_i corresponding to a nonzero α_i or α_i*
is termed a support vector. There can be at most N
support vectors. Suppose now we have a new vector x; then the corresponding
prediction of y(x) is:

y(x) = sum over i = 1, ..., N of (α_i* - α_i)(v_i^T x + 1)^p, plus b
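Given multipliers and a bias, the prediction expansion above is a direct sum over training vectors. A minimal numpy sketch (our own function name; the multiplier values would come from solving the quadratic program):

```python
import numpy as np

def svr_predict(x, train_vs, alpha_star, alpha, b, p):
    """Prediction in the F2 form: sum of (alpha*_i - alpha_i) times
    the polynomial kernel (v_i . x + 1)^p, plus the bias b. Terms with
    alpha*_i == alpha_i == 0 (non-support vectors) contribute nothing."""
    total = b
    for v, a_s, a in zip(train_vs, alpha_star, alpha):
        total += (a_s - a) * (np.dot(v, x) + 1.0) ** p
    return total
```

In practice only the support vectors need be stored, since all other terms in the sum vanish.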
Maximizing W is a quadratic programming problem, but the above expression for W is
not in standard form for use in quadratic programming packages (which usually do
minimization). If we let β = (α_1*, ..., α_N*, α_1, ..., α_N), then we minimize:

c^T β + (1/2) β^T Q β

subject to the constraints

sum over i = 1, ..., N of (α_i* - α_i) = 0
0 ≤ β_i ≤ U, i = 1, ..., 2N

where c = (ε - y_1, ..., ε - y_N, ε + y_1, ..., ε + y_N) and Q is built from the kernel
values K_ij = (v_i^T v_j + 1)^p as the block matrix [K, -K; -K, K].
Note that no matter what the power p, this remains a quadratic optimization problem.
The rationale for the notation of U as the regularization constant is that it now appears as
an upper bound on the α* and α vectors. These quadratic programming problems can be
very cpu and memory intensive. Fortunately, we can devise programs that make use of
the fact that, for problems with few support vectors (in comparison to the sample size),
storage space is proportional to the number of support vectors. We use an active set
method (Bunch and Kaufman, 1980) to solve this quadratic programming problem.
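Assembling the standard-form pieces of this dual is mechanical; a numpy sketch (our own naming, not the active-set solver itself, which is beyond a short example):

```python
import numpy as np

def svr_qp_matrices(vs, y, eps, p):
    """Build Q and c for beta = [alpha*; alpha]: minimize
    c^T beta + (1/2) beta^T Q beta subject to the box constraints
    0 <= beta_i <= U and sum(alpha*) - sum(alpha) = 0. The sign
    pattern of Q comes from the (alpha* - alpha) products."""
    vs = np.asarray(vs, dtype=float)
    y = np.asarray(y, dtype=float)
    K = (vs @ vs.T + 1.0) ** p               # polynomial kernel matrix
    Q = np.block([[K, -K], [-K, K]])         # 2N x 2N quadratic form
    c = np.concatenate([eps - y, eps + y])   # linear term for [alpha*; alpha]
    return Q, c
```

Any off-the-shelf QP routine that accepts (Q, c), box bounds, and one equality constraint can then be applied; the sparsity in the solution is what an active-set method exploits.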
Although we may find w in the SVR representation, it is not necessary; the SVR
representation is not an explicit function of w. Similarly, if we use radial basis
functions, the expansion is not an explicit function of w. That is, we can express the
predicted values as:

y(x) = sum over i = 1, ..., N of (α_i* - α_i) exp(-||x - v_i||^2 / σ^2), plus b

In this case, the elements of the Q matrix above are these exponentials, and w
cannot be explicitly obtained.
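The radial basis expansion differs from the polynomial one only in the kernel term. A sketch (the σ parameterization of the width is our assumption, as is the function name):

```python
import numpy as np

def rbf_predict(x, train_vs, alpha_star, alpha, b, sigma):
    """Kernel prediction with a radial basis function in place of the
    polynomial kernel: exp(-||x - v_i||^2 / sigma^2). The weight
    vector w is never formed explicitly."""
    total = b
    for v, a_s, a in zip(train_vs, alpha_star, alpha):
        k = np.exp(-np.sum((x - v) ** 2) / sigma ** 2)
        total += (a_s - a) * k
    return total
```

Swapping kernels changes only how the Q entries and predictions are computed; the quadratic program itself keeps the same shape.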
2. Nonlinear Experiments
We tried three artificial functions from (Friedman, 1991) and a problem (Boston
Housing) from the UCI database. Because the first three problems are artificial, we
know both the observed values and the truths.
Friedman #1 is a nonlinear prediction problem which has ten independent variables that
are uniform in [0,1]:

y = 10 sin(π x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + noise

Therefore, only five predictor variables are really needed, but the
predictor is faced with the problem of trying to distinguish the variables that have no
predictive power from those that have predictive ability. Friedman #2 and #3
have four independent variables and are respectively:

y = (x_1^2 + (x_2 x_3 - 1/(x_2 x_4))^2)^(1/2) + noise
y = arctan[(x_2 x_3 - 1/(x_2 x_4)) / x_1] + noise

where the noise is adjusted to give a fixed ratio of signal power to noise power and the
variables are uniformly distributed in the following ranges:

0 ≤ x_1 ≤ 100, 40π ≤ x_2 ≤ 560π, 0 ≤ x_3 ≤ 1, 1 ≤ x_4 ≤ 11
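A generator for the first of these test functions can be sketched directly from Friedman's (1991) definition; this is our own helper (noise would be added separately, scaled to the desired signal-to-noise ratio):

```python
import numpy as np

def friedman1(n, rng):
    """Friedman #1 (Friedman, 1991): ten U[0,1] inputs, of which only
    the first five enter the noise-free target."""
    x = rng.uniform(0.0, 1.0, size=(n, 10))
    y = (10.0 * np.sin(np.pi * x[:, 0] * x[:, 1])
         + 20.0 * (x[:, 2] - 0.5) ** 2
         + 10.0 * x[:, 3] + 5.0 * x[:, 4])
    return x, y

x, y = friedman1(200, np.random.default_rng(0))
```

The five irrelevant columns make this a useful probe of whether a regressor can ignore inputs with no predictive power.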
Boston Housing: This has 506 cases with the dependent variable being the median price
of housing in the Boston area. There are twelve continuous predictor variables. This data
was obtained from the UCI database (anonymous ftp at ftp.ics.uci.edu in directory
/pub/machine-learning-databases). In this case, we have no “truth”, only the observations.
In addition to the input space representation and the SVR representation, we also tried
bagging. Bagging is a technique that combines regressors, in this case regression trees
(Breiman, 1994). We used this technique because we had a local version available. Our
implementation of regression trees is different from Breiman’s, but we obtained slightly
better results. Breiman uses cross validation to pick the best tree and prune, while we use
a separate validation set. In the case of regression trees, the validation set was used to
prune the trees.
Suppose we have M test points with input vectors x_i and make a prediction ŷ_i
using any procedure discussed here. Suppose y_i is the actually observed value, which is
the truth G(x_i) plus noise. We define the prediction error (PE) and the modeling error
(ME):

PE = (1/M) sum over i of (ŷ_i - y_i)^2
ME = (1/M) sum over i of (ŷ_i - G(x_i))^2

For the three Friedman functions we calculated both the modeling error and the prediction
error. For Boston Housing, since the “truth” was not known, we calculated the prediction
error only. For the three Friedman functions, we generated (for each experiment) 200
training set examples and 40 validation set examples. The validation set examples were
used to find the optimum regularization constant in the feature space representation. The
following procedure was followed. Train on the 200 members of the training set with a
choice of regularization constant and obtain the prediction error on the validation set.
Now repeat with a different regularization constant until a minimum of prediction error
occurs on the validation set. Now, use the regularization constant that minimizes the
validation set prediction error and test on a separate test set. This experiment was
repeated with 100 different training sets of size 200 and validation sets of size 40, but one
common test set.
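The validation sweep just described is a small loop; this sketch uses placeholder `fit`/`predict` callables of our own devising, standing in for any of the regression procedures discussed:

```python
import numpy as np

def pick_regularizer(fit, predict, train, valid, candidates):
    """Fit with each candidate regularization constant on the training
    set, score mean squared prediction error on the validation set,
    and keep the constant with the lowest validation error."""
    x_tr, y_tr = train
    x_va, y_va = valid
    best_U, best_err = None, np.inf
    for U in candidates:
        model = fit(x_tr, y_tr, U)
        err = float(np.mean((predict(model, x_va) - y_va) ** 2))
        if err < best_err:
            best_U, best_err = U, err
    return best_U
```

The chosen constant is then used unchanged on the held-out test set, as in the experiments above.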
Different size polynomials were tried (maximum power 3). Second
order polynomials fared best. For Friedman #1 the dimensionality of feature
space is 66, while for the last two problems the dimensionality of feature space was 15
(for p = 2). Thus the size of the feature space is smaller than the number of
examples and we would expect that a feature space representation should do well.
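The two error measures defined earlier (PE against noisy observations, ME against the truth) reduce to two short functions; the names are ours:

```python
import numpy as np

def prediction_error(y_pred, y_obs):
    """PE: mean squared difference against the noisy observations."""
    return float(np.mean((y_pred - y_obs) ** 2))

def modeling_error(y_pred, truth):
    """ME: mean squared difference against the noise-free truth,
    available only for the artificial problems."""
    return float(np.mean((y_pred - truth) ** 2))
```

For Boston Housing only `prediction_error` applies, since the underlying truth is unknown.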
A similar procedure was followed for the SVR representation except the regularizer U,
the insensitivity ε, and the power p were varied to find the minimum validation prediction error.
In the majority of cases p = 2 was the optimum choice of power.
For the Boston Housing data, we picked randomly from the 506 cases using a training set
of size 401, a validation set of size 80, and a test set of size 25. This was repeated 100
times. The optimum power as picked by the validation set varied from trial to trial.
Results of experiments
The first experiments we tried were bagging regression trees versus support vector
regression machines (Table I).
Table I. Modeling error and prediction error on the three Friedman problems.
Rather than report the standard error, we did a comparison for each training set. That is,
for the first experiment we tried both SVR and bagging on the same training, validation,
and test set. If SVR had a better modeling error on the test set, it counted as a win. On
this basis, SVR was always better than bagging on two of the three Friedman functions,
and there is no clear winner for the remaining function.
In Table II below we normalized the modeling error
by the variance of the truth, while the prediction error was normalized by the variance of
the observed data on the test set. The worst modeling error and the worst prediction
error occur on different functions.
Table II. Modeling error and prediction error normalized.
Subsequent to our comparison of bagging to SVR, we attempted working directly in
feature space. That is, we used F1 as the approximating function with square loss and a
second degree polynomial. The results of this ridge regression (Table III) are better than
SVR. In retrospect, this is not surprising since the dimensionality of feature space is
small (66 for Friedman #1 and 15 for the two remaining functions) in relation to the
number of training examples (200). This was due to the fact that the best approximating
polynomial is second order. The other advantages of the feature space representation in
this particular case are that both PE and ME measure
mean squared error and the loss function
is mean squared error also.
Table III. Modeling error for SVR and
feature space polynomial approximation.
We now ask the question whether ε and U are important in SVR by comparing the results
in Table I with the results obtained by setting ε
to zero and U to 100,000, making the
regularizer insignificant (Table IV). On some of the Friedman functions (less so on others),
the proper choice of ε and U is important.
Table IV. Comparing the results above with those obtained by setting ε to zero and U to
100,000.
For the case of Boston Housing, the prediction error using bagging was 12.4 while for
SVR we obtained 7.2 and SVR was better than bagging on 71 out of 100 trials. The
optimum power seems to be about five. We were never able to get the feature space
representation to work well because the number of coefficients to be determined
was much larger than the number of training examples (401).
Two components of the generalization error of a learning method are bias (sometimes termed
squared bias) and variance. Let ȳ_i
be the average prediction on the test set for test point i and ŷ_ij
the prediction for experiment j, test point i. Recall that the test sets are identical for all
experiments, but 100 different training and validation sets were used. The bias is:

bias = (1/M) sum over i of [ȳ_i - G(x_i)]^2

The last term in the brackets is the truth, so essentially the bias is the mean square difference
between the truth and the average prediction.
The variance is:

variance = (1/(100 M)) sum over i of sum over j of [ŷ_ij - ȳ_i]^2

The variance is the mean square difference between the prediction for each experiment at
each test point and the average prediction for that test point (Table V). Thus, in
comparing bagging to SVR, the largest effect is the reduction in bias.
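The bias and variance decomposition above can be computed from the matrix of per-experiment predictions; a minimal sketch (our own function name):

```python
import numpy as np

def bias_and_variance(preds, truth):
    """preds[j, i] = prediction of experiment j at test point i.
    Bias: mean squared gap between the truth and the average
    prediction; variance: mean squared gap between each experiment's
    prediction and that average."""
    avg = preds.mean(axis=0)
    bias = float(np.mean((avg - truth) ** 2))
    var = float(np.mean((preds - avg) ** 2))
    return bias, var
```

Averaging over experiments first, then over test points, matches the two definitions in the text.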
Table V: the bias and variance for bagging and SVR.
We investigated the behavior of support vector regression on two linear problems.
The first linear problem was a linear function of three components of the input vector
plus noise. In addition there are 27 other components of the vector x, but y is not a
(direct) function of these other components. The x’s are
picked from an N(0,1) distribution and
are correlated with E[x_i x_j] = .8^|i-j| for i ≠ j.
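Correlated normal predictors with this decaying structure can be drawn via a Cholesky factor; this sketch assumes the corr(x_i, x_j) = rho^|i-j| reading of the correlation pattern:

```python
import numpy as np

def correlated_normals(n, d, rho, rng):
    """Draw n vectors of d standard normals whose correlation decays
    as rho**|i - j|, by multiplying i.i.d. draws through the Cholesky
    factor of the target correlation matrix."""
    idx = np.arange(d)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(cov)
    return rng.standard_normal((n, d)) @ L.T
```

With many draws, the empirical correlations recover the target pattern closely.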
We used normal, uniform, and Laplacian
noise scaled to give different signal to noise ratios. Thus if there were no noise, our
predictor would only have three variables. However, every variable has some predictive
power and may help in the presence of noise. We use 60 training examples and 5000 test
examples with 100 runs. This problem is similar in spirit to that of Breiman
(1994). We compared ordinary least squares (ols), SVR, and forward subset selection
(fss). In fss, we first find the best single variable (of the thirty) for an ols prediction of the
observed data. We fix that one variable and then find the next variable that, used in
combination with the first (fixed) variable, does the best prediction. This is continued
until all thirty variables are used. Because the training error will continuously decrease
as variables are added, we use a separate validation set of size 20% of the training set
size. Every time we find an additional independent variable to add, we test on the
validation set. A plot of validation set performance (mean square error) versus number
of variables will show a minimum of the mean squared error, and that is the optimum
number of variables to be used. We then find the performance on a test set of size 5000
using the optimum set of variables predicted by the validation set.
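The forward-selection loop just described can be sketched as follows; this is our own greedy implementation (no intercept term, plain least squares at each step), not the authors' code:

```python
import numpy as np

def ols_mse(A, b):
    """Least-squares fit of b on the columns of A; returns the
    weights and the training mean squared error."""
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w, float(np.mean((A @ w - b) ** 2))

def forward_subset(X, y, Xv, yv):
    """Greedy forward selection: at each step add the variable giving
    the best OLS training fit, then return the prefix of chosen
    variables whose validation mean squared error is smallest."""
    chosen, remaining, scored = [], list(range(X.shape[1])), []
    while remaining:
        j = min(remaining, key=lambda j: ols_mse(X[:, chosen + [j]], y)[1])
        chosen.append(j)
        remaining.remove(j)
        w, _ = ols_mse(X[:, chosen], y)
        val_err = float(np.mean((Xv[:, chosen] @ w - yv) ** 2))
        scored.append((list(chosen), val_err))
    return min(scored, key=lambda s: s[1])[0]
```

The validation curve traced by `scored` is exactly the plot described in the text: training error always falls as variables are added, but validation error turns back up once irrelevant variables start entering.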
We report the prediction error for the three types of noise and three prediction procedures
in Table VI.
Table VI. Prediction error for normal, uniform, and Laplacian noise using ordinary least
squares (ols), support vector regression (SVR), and forward subset selection (fss) for
different signal-to-noise ratios.
ols SVR fss
45.8 29.3 28.0 40.8 25.4 24.5 39.7
3.97 3.12 4.17 3.26 2.48 4.06 3.60 2.76
Although SVR is better than ols at low signal to noise ratios, fss and SVR give similar
performance. At signal to noise ratios larger than five, forward subset selection is better
than ols but similar to SVR.
The above experiment was for a training set of size 60. For training sets of size 200,
forward subset selection gives the best results.
The above linear problem has a special structure, namely that there are only three
predictor variables out of a total of thirty to choose from, so it might be
expected that forward subset selection would perform well.
We therefore tried a linear prediction problem where the output is a function of all thirty
input variables. In this case, both ols and fss give similar results. SVR is never worse,
and is sometimes slightly better, at low signal to noise ratios. However, at high signal to
noise ratios, SVR is worse than ols and forward subset selection.
Support vector regression was compared to bagging and a feature space representation on
four nonlinear problems. On three of these problems a feature space representation was
best, bagging was third, and SVR came in second. On the fourth problem, Boston
Housing, SVR was best and we were unable to construct a feature space representation
because of the high dimensionality required of the feature space. On the linear problems
we tried at varying signal to noise ratios, forward subset selection seems to be the method
of choice.
In retrospect, the problems we decided to test on were too simple. SVR probably has its
greatest use when the dimensionality of the input space and the order of the
approximation create a feature space representation whose dimensionality is much larger
than the number of examples. This was not the case for the problems we considered. We
thus need real life examples that fulfill these requirements.
This project was supported by ARPA contract number
Leo Breiman, “Bagging Predictors”, Technical Report 421, Department of Statistics,
University of California, Berkeley, CA, September 1994.
James R. Bunch and Linda C. Kaufman, “A Computational Method for the Indefinite
Quadratic Programming Problem”, Linear Algebra and Its Applications, Elsevier-North
Holland, 1980.
Jerome Friedman, “Multivariate Adaptive Regression Splines”, Annals of Statistics,
vol. 19, no. 1, pp. 1-141, 1991.
Alan Miller, Subset Selection in Regression, Chapman and Hall, 1990.
Gilbert Strang, Introduction to Applied Mathematics, Wellesley-Cambridge Press, 1986.
Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
Figure 1: The parameters for the support vector regression.