Support Vector Regression Machines

Harris Drucker*, Chris J.C. Burges**, Linda Kaufman**, Alex Smola**, Vladimir Vapnik+

*Bell Labs and Monmouth University, Department of Electronic Engineering, West Long Branch, NJ 07764
**Bell Labs
+AT&T Labs
Abstract
A new regression technique based on Vapnik's concept of support vectors is introduced. We compare support vector regression (SVR) with a committee regression technique (bagging) based on regression trees and ridge regression done in feature space. On the basis of these experiments, it is expected that SVR will have advantages in high dimensionality space because SVR optimization does not depend on the dimensionality of the input space.

This is a longer version of the paper to appear in Advances in Neural Information Processing Systems 9 (proceedings of the 1996 conference).
1. Introduction
In the following, lower case bold characters represent vectors and upper case bold characters represent matrices. Superscript "t" represents the transpose of a vector. y represents either a vector (in bold) or a single observation of the dependent variable in the presence of noise. y^(p) indicates a predicted value due to the input vector x^(p) not seen in the training set.
Suppose we have an unknown function G(x) (the "truth") which is a function of a vector x (termed input space). The vector x^t = [x_1, x_2, ..., x_d] has d components, where d is termed the dimensionality of the input space. F(x,w) is a family of functions parameterized by w. ŵ is that value of w that minimizes a measure of error between G(x) and F(x,ŵ). Our objective is to estimate w with ŵ by observing the N training instances v_j, j = 1, ..., N.
We will develop two approximations for the truth G(x). The first one is F_1(x,w), which we term a feature space representation. One (of many) such feature vectors is:

z^t = [x_1^2, ..., x_d^2, x_1 x_2, ..., x_i x_j, ..., x_{d-1} x_d, x_1, ..., x_d, 1]
which is a quadratic function of the input space components. Using the feature space representation, then F_1(x,ŵ) = ŵ^t z, that is, F_1(x,ŵ) is linear in feature space although it is quadratic in input space. In general, for a p'th order polynomial and d'th dimensional input space, the feature dimensionality f of ŵ is

f = Σ_{n=1}^{p} C_n^{d+n-1}    where    C_k^n = n! / (k!(n-k)!)

From now on, when we write F_1(x,ŵ), we mean the feature space representation and we must determine the f components of ŵ from the N training vectors.
The second representation is a support vector regression (SVR) representation that was developed by Vladimir Vapnik (1995):

F_2(x,ŵ) = Σ_{i=1}^{N} (α_i - α_i^*) (v_i^t x + 1)^p + b

F_2 is an expansion explicitly using the training examples. The rationale for calling it a support vector representation will be clear later, as will the necessity for having both an α^* and an α rather than just one multiplicative constant. In this case we must choose the 2N+1 values of α_i, α_i^*, and b. If we expand the term raised to the p'th power, we find f coefficients that multiply the various powers and cross product terms of the components of x. So, in this sense F_1 looks very similar to F_2 in that they have the same number of terms. However, F_1 has f free coefficients while F_2 has 2N+1 coefficients that must be determined from the N training vectors. For instance, suppose we have a 100 dimensional input vector and use a third order polynomial. The dimensionality of feature space for both F_1 and F_2 exceeds 176,000, but for the feature space representation we have to determine over 176,000 coefficients, while for the SVR representation we have to determine 2N+1 coefficients. Thus, it may take a lot more data to estimate the coefficients in ŵ in the feature space representation.
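As a quick check of this count, the following snippet (ours, not from the paper; it assumes Python 3.8+ for math.comb) evaluates the feature dimensionality for the cases discussed in this paper:

```python
from math import comb

def feature_dim(d, p):
    """f = sum_{n=1}^{p} C(d+n-1, n): number of polynomial terms of degree 1..p
    in d input variables (add 1 if the constant bias term is counted)."""
    return sum(comb(d + n - 1, n) for n in range(1, p + 1))

print(feature_dim(10, 2))    # 65  -> 66 with the bias term (Friedman #1, second order)
print(feature_dim(4, 2))     # 14  -> 15 with the bias term (Friedman #2 and #3)
print(feature_dim(100, 3))   # 176850, i.e. well over 176,000 coefficients
```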
We let α represent the 2N values of α_i and α_i^*. The optimum values for the components of ŵ or α depend on our definition of the loss function and the objective function. Here the primal objective function is:

U Σ_{j=1}^{N} L[ y_j - F(x_j, ŵ) ] + ||w||^2
where L is a general loss function (to be defined later) and F could be F_1 or F_2, y_j is the observation of G(x) in the presence of noise, and the last term is a regularizer. The regularization constant is U, which in typical developments multiplies the regularizer but is placed in front of the first term for reasons discussed later.
If the loss function is quadratic, i.e., L[·] = [·]^2, and we let F = F_1, i.e., the feature space representation, the objective function may be minimized by using linear algebra techniques since the feature space representation is linear in that space. This is termed ridge regression (Miller, 1990). In particular, let V be a matrix whose i'th row is the i'th training vector represented in feature space (including the constant term "1" which represents a bias). V is a matrix where the number of rows is the number of examples (N) and the number of columns is the dimensionality of feature space f. Let E be the f×f diagonal matrix whose elements are 1/U. y is the N×1 column vector of observations of the dependent variable. We then solve the following matrix formulation for ŵ using a linear technique (Strang, 1986) with a linear algebra package (e.g., MATLAB):

V^t y = [V^t V + E] ŵ
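As a concrete illustration of this step, the following sketch (ours, not from the paper; it assumes numpy and uses an explicit second order feature map, so the function and variable names are illustrative only) builds V, forms E with diagonal entries 1/U, and solves V^t y = [V^t V + E] ŵ:

```python
import numpy as np

def quadratic_features(X):
    """Map each input row to [1, x_1..x_d, all squares and cross products x_i*x_j, i <= j]."""
    N, d = X.shape
    cols = [np.ones(N)]
    cols += [X[:, i] for i in range(d)]
    cols += [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.column_stack(cols)

def ridge_in_feature_space(X, y, U):
    """Solve V^t y = (V^t V + E) w_hat with E = (1/U) * I."""
    V = quadratic_features(X)
    E = np.eye(V.shape[1]) / U
    return np.linalg.solve(V.T @ V + E, V.T @ y)

# Predictions on new inputs are then quadratic_features(X_new) @ w_hat.
```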
The rationale for the regularization term is to trade off mean square error (the first term) in the objective function against the size of the ŵ vector. If U is large, then essentially we are minimizing the mean square error on the training set, which may give poor generalization to a test set. We find a good value of U by varying U to find the best performance on a validation set and then applying that U to the test set. U is very useful if the dimensionality of the feature set is larger than the number of examples.
Let us now define a different type of loss function termed an ε-insensitive loss (Vapnik, 1995):

L = 0                             if | y_i - F_2(x_i, ŵ) | < ε
L = | y_i - F_2(x_i, ŵ) | - ε     otherwise
This defines an ε tube (Figure 1) so that if the predicted value is within the tube the loss is zero, while if the predicted point is outside the tube, the loss is the magnitude of the difference between the predicted value and the radius ε of the tube.
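A minimal sketch of this loss (ours, assuming numpy) makes the tube explicit:

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps):
    """Zero inside the epsilon tube, |residual| - eps outside it."""
    residual = np.abs(y_true - y_pred)
    return np.where(residual < eps, 0.0, residual - eps)
```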
Specifically, we minimize:

U ( Σ_{i=1}^{N} ξ_i + Σ_{i=1}^{N} ξ_i^* ) + (1/2)(w^t w)
where ξ_i or ξ_i^* is zero if the sample point is inside the tube. If the observed point is "above" the tube, ξ_i is the positive difference between the observed value and ε and α_i will be nonzero. Similarly, ξ_i^* will be nonzero if the observed point is below the tube and in this case α_i^* will be nonzero. Since an observed point cannot be simultaneously on both sides of the tube, either α_i or α_i^* will be nonzero, unless the point is within the tube, in which case both constants will be zero.
If U is large, more emphasis is placed on the error, while if U is small, more emphasis is placed on the norm of the weights, leading to (hopefully) a better generalization. The constraints are (for all i, i = 1, ..., N):

y_i - (w^t v_i) - b ≤ ε + ξ_i
(w^t v_i) + b - y_i ≤ ε + ξ_i^*
ξ_i, ξ_i^* ≥ 0
The corresponding Lagrangian is:

L = (1/2)(w^t w) + U ( Σ_{i=1}^{N} ξ_i + Σ_{i=1}^{N} ξ_i^* )
    - Σ_{i=1}^{N} α_i [ ε + ξ_i - y_i + (w^t v_i) + b ]
    - Σ_{i=1}^{N} α_i^* [ ε + ξ_i^* + y_i - (w^t v_i) - b ]
    - Σ_{i=1}^{N} ( γ_i ξ_i + γ_i^* ξ_i^* )

where the γ_i, γ_i^*, α_i, and α_i^* are nonnegative Lagrange multipliers.
We find a saddle point of L (Vapnik, 1995) by differentiating with respect to w, b, and the ξ_i, which results in the equivalent maximization of the (dual space) objective function:

W(α, α^*) = -ε Σ_{i=1}^{N} (α_i + α_i^*) + Σ_{i=1}^{N} y_i (α_i - α_i^*) - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} (α_i - α_i^*)(α_j - α_j^*)(v_i^t v_j + 1)^p

with the constraints:

0 ≤ α_i, α_i^* ≤ U,   i = 1, ..., N
Σ_{i=1}^{N} α_i = Σ_{i=1}^{N} α_i^*
We must find N Lagrange multiplier pairs (α_i, α_i^*). We can also prove that the product of α_i and α_i^* is zero, which means that at least one of these two terms is zero. A v_i corresponding to a nonzero α_i or α_i^* is termed a support vector. There can be at most N support vectors. Suppose now we have a new vector x^(p); then the corresponding prediction of y^(p) is:
y^(p) = Σ_{i=1}^{N} (α_i - α_i^*) (v_i^t x^(p) + 1)^p + b
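Given the multipliers and b, the prediction is a direct transcription of this expansion; a sketch (ours, numpy assumed, with alpha, alpha_star, and b taken as already determined):

```python
import numpy as np

def svr_predict(x_new, V_train, alpha, alpha_star, b, p):
    """y_pred = sum_i (alpha_i - alpha*_i) * (v_i . x_new + 1)^p + b"""
    kernel = (V_train @ x_new + 1.0) ** p     # polynomial kernel against every training vector
    return np.sum((alpha - alpha_star) * kernel) + b
```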
Maximizing W is a quadratic programming problem, but the above expression for W is not in standard form for use in quadratic programming packages (which usually do minimization). If we let

β^t = [α_1, ..., α_N, α_1^*, ..., α_N^*]

then we minimize:

(1/2) β^t Q β + c^t β

subject to the constraints

Σ_{i=1}^{N} β_i = Σ_{i=N+1}^{2N} β_i    and    0 ≤ β_i ≤ U,   i = 1, ..., 2N

where

c^t = [ε - y_1, ε - y_2, ..., ε - y_N, ε + y_1, ε + y_2, ..., ε + y_N]

Q = [  D  -D ]
    [ -D   D ]

d_ij = (v_i^t v_j + 1)^p,   i, j = 1, ..., N
Note that no matter what the power p, this remains a quadratic optimization program. The rationale for the notation of U as the regularization constant is that it now appears as an upper bound on the β and α vectors. These quadratic programming problems can be very CPU and memory intensive. Fortunately, we can devise programs that make use of the fact that for problems with few support vectors (in comparison to the sample size), storage space is proportional to the number of support vectors. We use an active set method (Bunch and Kaufman, 1980) to solve this quadratic programming problem.
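For illustration only, the setup of this standard form can be sketched as below (ours, not the paper's active set solver; it assumes numpy and the cvxopt package as a stand-in generic QP solver, and adds a small diagonal jitter since Q is only positive semidefinite):

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed available; any QP package taking the same standard form would do

def svr_dual_qp(V, y, U, eps, p):
    """Minimize 0.5*beta^T Q beta + c^T beta with beta = [alpha_1..alpha_N, alpha*_1..alpha*_N],
    0 <= beta_i <= U and sum(alpha) = sum(alpha*)."""
    N = V.shape[0]
    D = (V @ V.T + 1.0) ** p                                   # d_ij = (v_i . v_j + 1)^p
    Q = np.block([[D, -D], [-D, D]]) + 1e-8 * np.eye(2 * N)    # jitter for numerical stability
    c = np.concatenate([eps - y, eps + y])
    G = np.vstack([-np.eye(2 * N), np.eye(2 * N)])             # box constraints 0 <= beta <= U as G beta <= h
    h = np.concatenate([np.zeros(2 * N), U * np.ones(2 * N)])
    A = np.concatenate([np.ones(N), -np.ones(N)])[None, :]     # equality: sum(alpha) - sum(alpha*) = 0
    sol = solvers.qp(matrix(Q), matrix(c), matrix(G), matrix(h), matrix(A), matrix(np.zeros(1)))
    beta = np.array(sol['x']).ravel()
    return beta[:N], beta[N:]                                  # alpha, alpha_star
```

The bias b is then typically recovered from the Karush-Kuhn-Tucker conditions at any training point whose multiplier lies strictly between 0 and U.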
Although we may find ŵ in the SVR representation, it is not necessary. That is, the SVR representation is not explicitly a function of ŵ. Similarly, if we use radial basis functions, the expansion is not an explicit function of ŵ. That is, we can express the predicted values as:

y^(p) = Σ_{i=1}^{N} (α_i - α_i^*) exp( -γ || v_i - x^(p) ||^2 ) + b

In this case, the elements of the Q matrix above are

d_ij = exp( -γ || v_i - v_j ||^2 )

and ŵ cannot be explicitly obtained.
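Only the Q-matrix elements change in the radial basis function case; a sketch of the Gram matrix (ours, numpy assumed):

```python
import numpy as np

def rbf_gram(V, gamma):
    """d_ij = exp(-gamma * ||v_i - v_j||^2); drop-in replacement for the polynomial d_ij."""
    sq_norms = np.sum(V ** 2, axis=1)
    sq_dists = sq_norms[:, None] - 2.0 * (V @ V.T) + sq_norms[None, :]
    return np.exp(-gamma * sq_dists)
```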
2. Nonlinear Experiments
We tried three artificial functions from (Friedman, 1991) and a problem (Boston Housing) from the UCI database. Because the first three problems are artificial, we know both the observed values and the truths.
Friedman #1 is a nonlinear prediction problem which has 10 independent variables that are uniform in [0,1]:

y = 10 sin(π x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + n

where n is N(0,1). Therefore, only five predictor variables are really needed, but the predictor is faced with the problem of trying to distinguish the variables that have no prediction ability (x_6 to x_10) from those that have predictive ability (x_1 to x_5).
Friedman #2 and #3 have four independent variables and are respectively:

#2:  y = ( x_1^2 + ( x_2 x_3 - 1/(x_2 x_4) )^2 )^{1/2} + n
#3:  y = tan^{-1}[ ( x_2 x_3 - 1/(x_2 x_4) ) / x_1 ] + n

where the noise is adjusted to give a 3:1 ratio of signal power to noise power and the variables are uniformly distributed in the following ranges: 0 ≤ x_1 ≤ 100, 40π ≤ x_2 ≤ 560π, 0 ≤ x_3 ≤ 1, 1 ≤ x_4 ≤ 11.
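For reference, the three benchmark functions can be generated as follows (a sketch of ours, assuming numpy; the variable ranges follow the standard Friedman (1991) definitions, and the scaling of the noise to a 3:1 signal-to-noise ratio is left out for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def friedman1(n):
    X = rng.uniform(0, 1, size=(n, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4] + rng.normal(0, 1, n))
    return X, y

def _friedman23_inputs(n):
    x1 = rng.uniform(0, 100, n)
    x2 = rng.uniform(40 * np.pi, 560 * np.pi, n)
    x3 = rng.uniform(0, 1, n)
    x4 = rng.uniform(1, 11, n)
    return x1, x2, x3, x4

def friedman2(n, noise_std=0.0):
    x1, x2, x3, x4 = _friedman23_inputs(n)
    y = np.sqrt(x1 ** 2 + (x2 * x3 - 1.0 / (x2 * x4)) ** 2) + rng.normal(0, noise_std, n)
    return np.column_stack([x1, x2, x3, x4]), y

def friedman3(n, noise_std=0.0):
    x1, x2, x3, x4 = _friedman23_inputs(n)
    y = np.arctan((x2 * x3 - 1.0 / (x2 * x4)) / x1) + rng.normal(0, noise_std, n)
    return np.column_stack([x1, x2, x3, x4]), y
```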
Boston Housing: This has 506 cases with the dependent variable being the median price of housing in the Boston area. There are twelve continuous predictor variables. This data was obtained from the UCI database (anonymous ftp at ftp.ics.uci.edu in directory /pub/machine-learning-databases). In this case, we have no "truth", only the observations.
In addition to the input space representation and the SVR representation, we also tried
bagging. Bagging is a technique that combines regressors, in this case regression trees
(Breiman, 1994). We used this technique because we had a local version available. Our
implementation of regression trees is different than Breiman’s but we obtained slightly
better results. Breiman uses cross validation to pick the best tree and prune while we use
a separate validation set. In the case of regression trees, the validation set was used to
prune the trees.
Suppose we have test points with input vectors x_i^(p), i = 1, ..., M, and make a prediction y_i^(p) using any procedure discussed here. Suppose y_i is the actually observed value, which is the truth G(x_i) plus noise. We define the prediction error (PE) and the modeling error (ME):

PE = (1/M) Σ_{i=1}^{M} ( y_i^(p) - y_i )^2
ME = (1/M) Σ_{i=1}^{M} ( y_i^(p) - G(x_i) )^2
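Both quantities are plain mean squared errors over the test set; a sketch (ours, numpy assumed):

```python
import numpy as np

def prediction_error(y_pred, y_observed):
    """PE: mean squared difference between the prediction and the noisy observation."""
    return np.mean((y_pred - y_observed) ** 2)

def modeling_error(y_pred, y_truth):
    """ME: mean squared difference between the prediction and the noise-free truth G(x)."""
    return np.mean((y_pred - y_truth) ** 2)
```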
For the three Friedman functions we calculated both the prediction error and modeling error. For Boston Housing, since the "truth" was not known, we calculated the prediction error only. For the three Friedman functions, we generated (for each experiment) 200 training set examples and 40 validation set examples. The validation set examples were used to find the optimum regularization constant in the feature space representation. The following procedure was followed. Train on the 200 members of the training set with a choice of regularization constant and obtain the prediction error on the validation set. Now repeat with a different regularization constant until a minimum of prediction error occurs on the validation set. Now, use the regularization constant that minimizes the validation set prediction error and test on a 1000 example test set. This experiment was repeated for 100 different training sets of size 200 and validation sets of size 40, but one test set of size 1000.
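The selection loop just described can be written compactly; a sketch (ours), reusing the ridge_in_feature_space and quadratic_features routines sketched earlier and an illustrative grid of U values:

```python
import numpy as np

def select_U(X_train, y_train, X_val, y_val, U_grid):
    """Pick the regularization constant that minimizes the validation prediction error."""
    best_U, best_pe = None, np.inf
    for U in U_grid:
        w_hat = ridge_in_feature_space(X_train, y_train, U)     # defined in the earlier sketch
        pe = np.mean((quadratic_features(X_val) @ w_hat - y_val) ** 2)
        if pe < best_pe:
            best_U, best_pe = U, pe
    return best_U

# e.g. U_grid = 10.0 ** np.arange(-3, 6) as an illustrative search range
```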
Different size polynomials were tried (maximum power 3). Second order polynomials fared best. For Friedman function #1, the dimensionality of feature space is 66, while for the last two problems the dimensionality of feature space was 15 (for p=2). Thus the size of the feature space is smaller than the number of examples and we would expect that a feature space representation should do well.
A similar procedure was followed for the SVR representation except that the regularization constant U, ε, and power p were varied to find the minimum validation prediction error. In the majority of cases p=2 was the optimum choice of power.
For the Boston Housing data, we picked randomly from the 506 cases using a training set of size 401, a validation set of size 80 and a test set of size 25. This was repeated 100 times. The optimum power as picked by the validation set varied between p=4 and p=5.
3. Results of Experiments
The first experiments we tried were bagging regression trees versus support vector regression (Table I).

Table I. Modeling error and prediction error on the three Friedman problems (100 trials).
Rather than report the standard error, we did a comparison for each training set. That is, for the first experiment we tried both SVR and bagging on the same training, validation, and test set. If SVR had a better modeling error on the test set, it counted as a win. Thus for Friedman #1, SVR was always better than bagging on the 100 trials. There is no clear winner for Friedman function #3. In Table II below we normalized the modeling error by the variance of the truth while the prediction error was normalized by the variance of the observed data on the test set. This shows that the worst modeling error is on Friedman #1 while the worst prediction error is on function #3.
Table II. Modeling error and prediction error normalized.
Subsequent to our comparison of bagging to SVR, we attempted working directly in feature space. That is, we used F_1 as our approximating function with square loss and a second degree polynomial. The results of this ridge regression (Table III) are better than SVR. In retrospect, this is not surprising since the dimensionality of feature space is small (f=66 for Friedman #1 and f=15 for the two remaining functions) in relation to the number of training examples (200). This was due to the fact that the best approximating polynomial is second order. The other advantages of the feature space representation in this particular case are that both PE and ME are mean squared error and the loss function is mean squared error also.
Table III. Modeling error for SVR and feature space polynomial approximation.

        SVR       feature space
#1      .67       .61
#2      4,944     3,051
#3      .0261     .0176
We now ask the question whether U and ε are important in SVR by comparing the results in Table I with the results obtained by setting ε to zero and U to 100,000, making the regularizer insignificant (Table IV). On Friedman #2 (and less so on Friedman #3), the proper choice of ε and U are important.
Table IV. Comparing the results above with those obtained by setting ε to zero and U to 100,000 (labeled suboptimum).
For the case of Boston Housing, the prediction error using bagging was 12.4 while for SVR we obtained 7.2, and SVR was better than bagging on 71 out of 100 trials. The optimum power seems to be about five. We never were able to get the feature representation to work well because the number of coefficients to be determined (6885) was much larger than the number of training examples (401).
One important characterization of a learning method is bias (sometimes termed bias-squared) and variance. Let ȳ_i^(p) be the average prediction on the test set for test point i, and y_ij^(p) the prediction for experiment j at test point i. Recall that the test sets are identical for the 100 experiments, but 100 different training and validation sets were used. When

ȳ_i^(p) = (1/100) Σ_{j=1}^{100} y_ij^(p)

then the bias is:

bias = (1/M) Σ_{i=1}^{M} [ ȳ_i^(p) - G(x_i) ]^2

The last term in the brackets is the truth, so essentially the bias is the mean square difference between the truth and the average prediction.
The variance is:

variance = (1/(100 M)) Σ_{i=1}^{M} Σ_{j=1}^{100} [ y_ij^(p) - ȳ_i^(p) ]^2

The variance is the mean square difference between the prediction for each experiment at each test point and the average prediction for that test point (Table V). Thus, in comparing bagging to SVR, the largest effect is the reduction in bias.
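These two quantities can be computed directly from the stored test-set predictions; a sketch (ours, numpy assumed), where preds[j, i] holds the prediction of experiment j at test point i and truth holds G(x_i) on the common test set:

```python
import numpy as np

def bias_and_variance(preds, truth):
    """Bias: mean squared gap between the average prediction and the truth.
    Variance: mean squared spread of each experiment's prediction around that average."""
    avg_pred = preds.mean(axis=0)                  # average prediction per test point
    bias = np.mean((avg_pred - truth) ** 2)
    variance = np.mean((preds - avg_pred) ** 2)
    return bias, variance
```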
Table V. The bias and variance for bagging and SVR.

        bias        bias      variance    variance
        bagging     SVR       bagging     SVR
#1      1.74        .0548     1.52        .61
#2      3111        90.28     7074        5327
#3      .1440       .0144     .008        .011
4. Linear Experiments
We investigated the behavior of support vector regression on two linear problems. The first linear problem was as follows:

y = 2 x_1 + x_2 + x_3

In addition there are 24 other components of the vector x, but y is not a (direct) function of these other 24 components. The x's are picked from a N(0,1) distribution and the x's are correlated with E[x_1 x_j] = .8^j for j = 2, ..., 30. We used normal, uniform and Laplacian noise scaled to give different signal to noise ratios. Thus if there were no noise, our predictor would only have three variables. However, every variable has some predictive power and may help in the presence of noise. We use 60 training examples and 5000 test examples per run with 100 runs. This problem is similar in spirit to that of Breiman (1994). We compared ordinary least squares (ols), SVR, and forward subset selection (fss). In fss, we first find the best (of the thirty variables) in doing an ols prediction of the observed data. We fix that one variable and then find the next variable that, used in combination with the first (fixed) variable, does the best ols prediction. This is continued until all thirty variables are used. Because the training error will continuously decrease as variables are added, we use a separate validation set of size 20% of the training set size. Every time we find an additional independent variable to add, we test on the validation set. A plot of validation set performance (mean square error) versus number of variables will show a minimum of the mean squared error, and that is the optimum number of variables to be used. We then find the performance on a test set of size 5000 using the optimum set of variables predicted by the validation set.
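The forward selection procedure just described can be sketched as follows (ours, assuming numpy; np.linalg.lstsq stands in for any ordinary least squares routine, and an intercept column, if wanted, would be handled by the caller):

```python
import numpy as np

def forward_subset_selection(X_tr, y_tr, X_val, y_val):
    """Greedily add the variable that gives the best OLS training fit,
    then keep the subset with the lowest validation mean squared error."""
    selected, remaining = [], list(range(X_tr.shape[1]))
    best_subset, best_val_mse = [], np.inf
    while remaining:
        # next variable that, combined with the fixed ones, gives the best ols training fit
        scores = []
        for j in remaining:
            cols = selected + [j]
            coef, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
            scores.append((np.mean((X_tr[:, cols] @ coef - y_tr) ** 2), j, coef))
        _, j_best, coef = min(scores, key=lambda s: s[0])
        selected.append(j_best)
        remaining.remove(j_best)
        # test the current subset on the validation set
        val_mse = np.mean((X_val[:, selected] @ coef - y_val) ** 2)
        if val_mse < best_val_mse:
            best_val_mse, best_subset = val_mse, list(selected)
    return best_subset
```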
We report the prediction error for the three types of noise and three prediction procedures
(Table VI):
Table VI. Prediction error for normal, uniform, and Laplacian noise using ordinary least squares (ols), support vector regression (SVR), and forward subset selection (fss) for different signal-to-noise ratios (SNR).

            normal                  Laplacian               uniform
SNR     ols    SVR    fss       ols    SVR    fss       ols    SVR    fss
 .8     45.8   29.3   28.0      40.8   25.4   24.5      39.7   28.1   24.1
1.2     20.0   14.9   12.3      18.1   12.5   10.9      17.6   12.8   11.7
2.5     4.61   3.97   3.12      4.17   3.26   2.48      4.06   3.60   2.76
5.0     1.15   1.33   .768      1.04   .516   .599      1.02   1.08   .617
Although SVR is better than ols, fss and SVR give similar performance. At signal to noise ratios larger than 5, forward subset selection is better than ols but similar to SVR. The above experiment was for a training set of size 60. For training sets of size 200, forward subset selection gives the best results.
The above linear problem has a special structure, namely that there are really only three predictor variables and there are a total of thirty to choose from, so it might be expected that forward subset selection would perform well.
We therefore tried the linear prediction problem where the output is a function of all the variables:

y = Σ_{i=1}^{30} x_i + n

In this case, both ols and fss give similar results. SVR is never worse and sometimes slightly better at low SNR's. However, at high SNR's, SVR is worse than ols or forward subset selection.
5. Conclusions
Support vector regression was compared to bagging and a feature space representation on four nonlinear problems. On three of these problems a feature space representation was best, bagging was worst, and SVR came in second. On the fourth problem, Boston Housing, SVR was best and we were unable to construct a feature space representation because of the high dimensionality required of the feature space. On linear problems, forward subset selection seems to be the method of choice for the two linear problems we tried at varying signal to noise ratios.
In retrospect, the problems we decided to test on were too simple. SVR probably has greatest use when the dimensionality of the input space and the order of the approximation create a feature space dimensionality much larger than the number of examples. This was not the case for the problems we considered. We thus need real life examples that fulfill these requirements.
6. Acknowledgements
This project was supported by ARPA contract number
NOG+ll4-94-C-1086.
7. References
Leo Breiman, "Bagging Predictors", Technical Report 421, September 1994, Department of Statistics, University of California, Berkeley, CA. Also at anonymous ftp site: ftp.stat.berkeley.edu/pub/tech-reports/421.ps.Z.
James R. Bunch and Linda C. Kaufman, "A Computational Method for the Indefinite Quadratic Programming Problem", Linear Algebra and Its Applications, Elsevier-North Holland, 1980.
Jerry Friedman, "Multivariate Adaptive Regression Splines", Annals of Statistics, Vol. 19, No. 1, pp. 1-141, 1991.
Alan J. Miller, Subset Selection in Regression, Chapman and Hall, 1990.
Gilbert Strang, Introduction to Applied Mathematics, Wellesley Cambridge Press, 1986.
Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
Figure 1: The parameters for the support vector
regression.