On Support Vector Machines under a multiple-cost
scenario
Sandra Benítez-Peña · Rafael Blanquero · Emilio Carrizosa · Pepa Ramírez-Cobo
Received: date / Accepted: date
Abstract Support Vector Machine (SVM) is a powerful tool in binary classification, known to attain excellent misclassification rates. On the other hand, many real-world classification problems, such as those found in medical diagnosis, churn or fraud prediction, involve misclassification costs which may differ between the classes. However, it may be hard for the user to provide precise values for such misclassification costs, whereas it may be much easier to identify acceptable misclassification rate values. In this paper we propose a novel SVM model in which misclassification costs are considered by incorporating performance constraints in the problem formulation. Specifically, our aim is to seek the hyperplane with maximal margin yielding misclassification rates below given threshold values. Such a maximal-margin hyperplane is obtained by solving a quadratic convex problem with linear constraints and integer variables. The reported numerical experience shows that our model gives the user control on the misclassification rates in one class (possibly at the expense of an increase in misclassification rates for the other class) and is feasible in terms of running times.

S. Benítez-Peña
IMUS, Instituto de Matemáticas de la Universidad de Sevilla.
Departamento de Estadística e Investigación Operativa, Universidad de Sevilla, Spain.
Tel.: +34-955420861
E-mail: sbenitez1@us.es
R. Blanquero
IMUS, Instituto de Matemáticas de la Universidad de Sevilla.
Departamento de Estadística e Investigación Operativa, Universidad de Sevilla, Spain.
E. Carrizosa
IMUS, Instituto de Matemáticas de la Universidad de Sevilla.
Departamento de Estadística e Investigación Operativa, Universidad de Sevilla, Spain.
P. Ramírez-Cobo
IMUS, Instituto de Matemáticas de la Universidad de Sevilla.
Departamento de Estadística e Investigación Operativa, Universidad de Cádiz, Spain.
Keywords Constrained Classification · Misclassification costs · Mixed Integer Quadratic Programming · Sensitivity/Specificity trade-off · Support Vector Machines

Mathematics Subject Classification (2000) 62P99 · 90C11 · 90C30
1 Introduction
In supervised classification we are given a set of individuals belonging to
two or more different classes, and the final aim is to classify new objects whose
class is unknown. Each object i can be represented by a pair (x_i, y_i), where x_i ∈ R^m is the so-called feature vector and y_i ∈ C is the class membership of object i.
A state-of-the-art method in supervised classification is the support vector
machine (SVM), see Vapnik (1995, 1998); Cristianini and Shawe-Taylor (2000);
Carrizosa and Romero Morales (2013). In its basic version, SVM addresses two-class problems, i.e., C has two elements, say, C = {−1, +1}. The SVM aims at separating both classes by means of a linear classifier, ω^T x + β = 0, where ω is the score vector. We will assume throughout this paper that C = {−1, +1} and refer the reader to e.g. Allwein et al. (2000) for the reduction of multiclass problems to this case.
The SVM classifier is obtained by solving the following convex quadratic programming (QP) formulation with linear constraints:

min_{ω,β,ξ}  ω^T ω + C Σ_{i∈I} ξ_i
s.t.  y_i(ω^T x_i + β) ≥ 1 − ξ_i,  i ∈ I
      ξ_i ≥ 0,  i ∈ I,

where I represents the set of training data, ξ_i ≥ 0 are artificial variables which allow data points to be misclassified, and C > 0 is a regularization parameter to be tuned that controls the trade-off between margin maximization and misclassification errors. Given an object i, it is classified in the positive or the negative class according to the sign of the so-called score function, sign(ω^T x_i + β), while in the case ω^T x_i + β = 0 the object is classified randomly.
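To make the classification rule concrete, the following minimal sketch (our own illustration, not code from the paper, which solves its optimization problems with Gurobi) fits a linear soft-margin SVM with scikit-learn on toy data and applies the sign of the score function to a new point; the dataset, parameter values and variable names are placeholders.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])  # toy two-class data
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)         # C is the regularization parameter
omega, beta = clf.coef_.ravel(), clf.intercept_[0]  # score function: omega^T x + beta

x_new = np.array([0.3, -0.2])
score = omega @ x_new + beta
label = np.sign(score) if score != 0 else rng.choice([-1, 1])  # random class only if the score is exactly 0
print(score, label)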
As mentioned, the goal in supervised classification is to classify objects in
the correct class. However, ignoring imbalancedness (either in the class sizes or in the misclassification cost structure) or other costs may have dramatic consequences in the classification task, see Carrizosa et al. (2008); He and Ma (2013); Prati et al. (2015); Maldonado et al. (2017). For instance, in clinical databases there are usually more observations of healthy individuals than of disease cases, so smaller classification errors are obtained for the former.
For example, for the well-known Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI repository (Lichman 2013), the number of sick cases (212) is smaller than the number of control cases (357). If a standard SVM is used for classifying the dataset, the obtained rates (average values according to a 10-fold cross-validation approach) are depicted in Table 1. Even though both rates are high, it might be of interest to increase the accuracy on the cancer samples. This problem will be addressed in this paper.

                                          Mean    Std
% benign instances well classified        99%     1.7
% malign instances well classified        94.8%   4.9
Table 1 Performance of SVM in wisconsin. Average values and standard deviations computed from 10 realizations.
In order to cope with imbalancedness, different methods have been sug-
gested, see Bradford et al. (1998); Freitas et al. (2007); Carrizosa et al. (2008);
Datta and Das (2015). Those methods are based, among others, on adding parameters or adapting the classifier construction. For example, in Carrizosa et al. (2008), a biobjective problem is formulated in which the misclassification rate (via margin maximization) and the measurement costs are simultaneously minimized.
In this paper a new formulation of the SVM is presented, in such a way
that the focus is not only on the minimization of the overall misclassification
rate but also on the performance of the classifier in the two classes. In order
to do that, novel constraints are added to the SVM formulation. The keystone
of the new model is its ability to achieve a deeper control over misclassifi-
cation in contrast to previously existing models. The proposed methodology
will be called Constrained Support Vector Machine (CSVM) and the resulting
classification technique will be referred to as the CSVM classifier.
The remainder of this paper is structured as follows. In Section 2, the
CSVM is formulated and details concerning its motivation, feasibility and so-
lutions are given. Section 3 aims to illustrate the performance of the new clas-
sifier. An in-depth description of the experiments' design and of the real datasets tested, as well as the obtained results, will be given. The paper ends with
some concluding remarks and possible extensions in Section 4.
2 Constrained Support Vector Machines
In this section the Constrained Support Vector Machine (CSVM) model is
formulated as a Mixed Integer Nonlinear Programming (MINLP) problem
(Bonami et al. 2008; Burer and Letchford 2012), specifically in terms of a
Mixed Integer Quadratic Programming (MIQP) problem.
This section is structured as follows. In Section 2.1 some theoretical foun-
dations that motivate the novel constraints are given. Then, in Section 2.2 the
formulation of the CSVM is presented. We will start from the linear kernel case and later extend it to the general kernel case via the kernel trick. Finally, in Section 2.3, some issues about the CSVM formulation, such as its feasibility, shall be discussed.
2.1 Theoretical Motivation
As commented before, the aim of this work is to build a classifier so that the
user may have control over the performance in the two classes. Specifically,
given a set of data {(x_i, y_i)} (a random sample of a vector (X, Y) with unknown distribution), the target is to obtain a classifier such that p ≥ p_0, where p is the value of a performance measure and p_0 is a threshold chosen by the user. The performance measures p to be considered in this paper are the sensitivity or true positive rate (TPR), the specificity or true negative rate (TNR) and the accuracy (ACC), given by:

TPR: p = P(ω^T X + β > 0 | Y = +1)
TNR: p = P(ω^T X + β < 0 | Y = −1)          (1)
ACC: p = P(Y(ω^T X + β) > 0).

See, for example, Bewick et al. (2004).
If the random variable Z, defined as

Z = 1, if an observation is well classified,
    0, otherwise,

is considered, then the values of p as in (1), corresponding to the probability of correct classification, can be rewritten as

TPR: p = E[Z | Y = +1]
TNR: p = E[Z | Y = −1]
ACC: p = E[Z]

and estimated from an independent and identically distributed (i.i.d.) sample {Z_i}_{i∈S} by

TPR: p̂ = Z̄_+ = (Σ_{i∈S+} Z_i) / |S+|
TNR: p̂ = Z̄_− = (Σ_{i∈S−} Z_i) / |S−|
ACC: p̂ = Z̄ = (Σ_{i∈S} Z_i) / |S|,

where S+ and S− denote, respectively, the subsets {i ∈ S : y_i = +1} and {i ∈ S : y_i = −1}.
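As an illustration (our own sketch, not taken from the paper), the three estimators above can be computed directly from a vector of class labels y ∈ {−1, +1} and the corresponding correct-classification indicators z:

import numpy as np

def performance_estimates(y, z):
    # Empirical TPR, TNR and ACC from labels y in {-1, +1} and indicators z (1 = well classified)
    y, z = np.asarray(y), np.asarray(z)
    tpr_hat = z[y == +1].mean()  # average of Z over S+
    tnr_hat = z[y == -1].mean()  # average of Z over S-
    acc_hat = z.mean()           # average of Z over S
    return tpr_hat, tnr_hat, acc_hat

# Example: performance_estimates([+1, +1, -1, -1], [1, 0, 1, 1]) returns (0.5, 1.0, 0.75).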
From a hypothesis testing viewpoint, our aim is to build a classifier such that, for a given sample, one can reject the null hypothesis in

H_0: p ≤ p_0
H_1: p > p_0.

Under the classic decision rule, H_0 is rejected if p̂ ≥ p*_0, assuming that α = P(type I error). From Hoeffding's inequality (Hoeffding 1963),

P(p̂ ≥ p + c) ≤ exp(−2nc²).     (2)

As α = P(type I error) = P(p̂ ≥ p*_0 | p = p_0), substituting p by p_0 in (2) yields

P(p̂ < p_0 + c) ≥ 1 − exp(−2nc²) = 1 − α,     (3)

where p_0 + c = p*_0. Therefore, we can take

p*_0 = p_0 + √(−log α / (2n)).     (4)

Note that n equals |S+|, |S−| or |S|, respectively, when considering the TPR, the TNR or the accuracy.
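A small helper implementing (4) might look as follows (our own sketch; the function name is ours, and p_0, n and α are supplied by the user):

import math

def adjusted_threshold(p0, n, alpha=0.05):
    # p0* = p0 + sqrt(-log(alpha) / (2n)), the sample threshold implied by (4)
    return p0 + math.sqrt(-math.log(alpha) / (2.0 * n))

# E.g. adjusted_threshold(0.90, n=200, alpha=0.05) is approximately 0.9866.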
2.2 CSVM formulation
In this section, the CSVM formulation is presented. As will be seen, the for-
mulation includes novel performance constraints, which make the optimization
problem a MIQP problem in terms of some integer variables.
We assume we are given a dataset with known labels. From such a set we
identify the training set I, used to build the classifier, and the anchor set J,
used to impose a lower bound on the classifier performance. These sets will be
considered disjoint.
With the purpose of building the CSVM, the performance constraints will be formulated in terms of binary variables {z_j}_{j∈J}, which are realizations of the variable Z in Section 2.1 and are defined as:

z_j = 1, if instance j is well classified,
      0, otherwise.
In order to formulate the CSVM, novel constraints are added to the standard soft-margin SVM formulation as follows:

min_{ω,β,ξ,z}  ω^T ω + C Σ_{i∈I} ξ_i
s.t.  y_i(ω^T x_i + β) ≥ 1 − ξ_i,  i ∈ I                      (5)
      ξ_i ≥ 0,  i ∈ I                                          (6)
      y_j(ω^T x_j + β) ≥ 1 − M(1 − z_j),  j ∈ J     (CSVM0)    (7)
      z_j ∈ {0, 1},  j ∈ J                                     (8)
      p̂_ℓ ≥ p*_{0ℓ},  ℓ ∈ L.                                   (9)
In the previous optimization problem, (5) and (6) are the usual constraints in the SVM formulation. Constraints (7) ensure that observations j ∈ J with z_j = 1 will be correctly classified, without imposing any restriction when z_j = 0, provided that M is big enough. A collection of requirements on the performance of the classifier over J can be specified by means of (9). Also, L is the set of indexes of the constraints that have the form of (9). These constraints can be modeled via the binary variables z_j, for instance:

TPR: Σ_{j∈J+} z_j ≥ p*_0 |J+|
TNR: Σ_{j∈J−} z_j ≥ p*_0 |J−|
ACC: Σ_{j∈J} z_j ≥ p*_0 |J|,

where J+ and J− denote, respectively, the subsets {i ∈ J : y_i = +1} and {i ∈ J : y_i = −1}. As usual in SVM methodology, a mapping into a high-dimensional feature space may be considered, which allows us to transform this linear classification technique into a non-linear one. This way we can address problems with a very large number of features, such as those encountered in personalized medicine (Sánchez et al. 2016). The various drawbacks that arise from considering this mapping can be avoided if the so-called kernel trick (Cristianini and Shawe-Taylor 2000), based on Mercer's theorem (Mercer 1909), is used. Therefore, by considering the (partial) dual problem of (CSVM0) and the kernel trick, the general formulation of the CSVM is obtained as follows (the intermediate steps can be found in Appendix A):
min_{λ,μ,β,ξ,z}  Σ_{s,s'∈I} λ_s y_s λ_{s'} y_{s'} K(x_s, x_{s'}) + Σ_{t,t'∈J} μ_t y_t μ_{t'} y_{t'} K(x_t, x_{t'})
                 + 2 Σ_{s∈I, t∈J} λ_s y_s μ_t y_t K(x_s, x_t) + C Σ_{i∈I} ξ_i
s.t.  z_j ∈ {0, 1},  j ∈ J
      p̂_ℓ ≥ p*_{0ℓ},  ℓ ∈ L
      y_i (Σ_{s∈I} λ_s y_s K(x_s, x_i) + Σ_{t∈J} μ_t y_t K(x_t, x_i) + β) ≥ 1 − ξ_i,  i ∈ I
      y_j (Σ_{s∈I} λ_s y_s K(x_s, x_j) + Σ_{t∈J} μ_t y_t K(x_t, x_j) + β) ≥ 1 − M(1 − z_j),  j ∈ J     (CSVM)
      ξ_i ≥ 0,  i ∈ I
      Σ_{i∈I} λ_i y_i + Σ_{j∈J} μ_j y_j = 0
      0 ≤ λ_i ≤ C/2,  i ∈ I
      0 ≤ μ_j ≤ M z_j,  j ∈ J.
Here K: R^m × R^m → R is a kernel function and (λ, μ) are the usual variables of the dual formulation of the SVM.
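For concreteness, the following sketch states the linear-kernel primal (CSVM0) in gurobipy with a single TNR constraint on the anchor set J (our own illustration under the paper's notation, not the authors' code; it omits the kernelized dual, the initial-solution strategy and the parameter tuning described later, and all function and variable names are ours):

import numpy as np
import gurobipy as gp
from gurobipy import GRB

def csvm0_linear(X_I, y_I, X_J, y_J, C=1.0, M=100.0, tnr_threshold=0.95, time_limit=300):
    # Linear-kernel CSVM0: maximal-margin hyperplane subject to a TNR constraint over J
    n_I, m = X_I.shape
    n_J = X_J.shape[0]
    J_neg = [j for j in range(n_J) if y_J[j] == -1]

    model = gp.Model("CSVM0")
    model.Params.TimeLimit = time_limit                   # heuristic time limit, as in the experiments
    w = model.addVars(m, lb=-GRB.INFINITY, name="w")
    b = model.addVar(lb=-GRB.INFINITY, name="b")
    xi = model.addVars(n_I, lb=0.0, name="xi")            # slack variables, constraints (6)
    z = model.addVars(n_J, vtype=GRB.BINARY, name="z")    # constraints (8)

    for i in range(n_I):                                  # constraints (5) on the training set I
        score_i = gp.quicksum(float(X_I[i, k]) * w[k] for k in range(m)) + b
        model.addConstr(float(y_I[i]) * score_i >= 1 - xi[i])
    for j in range(n_J):                                  # big-M constraints (7) on the anchor set J
        score_j = gp.quicksum(float(X_J[j, k]) * w[k] for k in range(m)) + b
        model.addConstr(float(y_J[j]) * score_j >= 1 - M * (1 - z[j]))
    # constraint (9): sum of z over J- must be at least tnr_threshold * |J-|
    model.addConstr(gp.quicksum(z[j] for j in J_neg) >= tnr_threshold * len(J_neg))

    model.setObjective(gp.quicksum(w[k] * w[k] for k in range(m)) + C * xi.sum(), GRB.MINIMIZE)
    model.optimize()
    return np.array([w[k].X for k in range(m)]), b.X

In practice tnr_threshold would be set to the adjusted value p*_0 of (4), and the kernelized problem (CSVM) would be stated analogously in terms of the dual variables (λ, μ).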
2.3 Solving the CSVM
In this section we give details about the complexity of our problem as for-
mulated in (CSVM). The problem belongs to the class of MIQP problems,
and thus it can be addressed by standard mixed integer quadratic optimiza-
tion solvers. In particular, the solver Gurobi (Gurobi Optimization 2016) and
its Python language interface (Van Rossum and Drake 2011) have been used
in our numerical experiments. In contrast to the standard SVM formulation,
which is a continuous quadratic problem, the CSVM is harder to solve due
to the presence of binary variables. Hence, the optimal solution may not be
found in a short period of time; however, as discussed in our numerical expe-
rience, good results are obtained when the problems are solved heuristically
by imposing a short time limit on the solver.
Performance constraints (9) may define an infeasible problem, since the values of the p*_{0ℓ} may be unattainable in practice. Hence, the study of the feasibility of Problem (CSVM) is an important issue. As an example, consider data composed of two different classes, represented respectively by black
and white dots in the top picture in Figure 1. If the optimization problem for
the linear kernel SVM is solved, the resulting classifier is a hyperplane that
aims at separating both classes and maximizes the margin. An approximate
representation of the data and the classifier is shown in the middle panel in
Figure 1. If the aim is to correctly classify all the data corresponding to a
given class, it is intuitively easy to see that this objective can be reached by
moving the SVM hyperplane. In fact, it can be seen in the bottom picture
in Figure 1 how hyperplanes 1 and 2 classify correctly all white points, and
hyperplane 3 classifies all the black dots in the correct class. Among all those
hyperplanes, the SVM selects the one which maximizes the margin. So, intu-
itively, it is evident that if just one performance constraint is imposed on only one of the classes, the problem is always feasible. However, using the data in Figure 1 again, as well as the linear kernel SVM, it is clear that it is impossible to classify all the instances correctly at the same time; thus, the problem is infeasible. Nevertheless, there exist results, such as Theorem 5 in Burges (1998), showing that the class of Mercer kernels for which K(x, x′) → 0 as ‖x − x′‖ → ∞, and for which K(x, x) is O(1), yields classifiers that achieve total correct classification of all classes in the training sample, no matter how arbitrarily the data have been chosen. Thus, if a kernel satisfies the previous conditions, then feasibility is guaranteed. In particular, the Radial Basis Function (RBF) kernel meets these conditions. Therefore, to be on the safe side, if the performance thresholds imposed are not too low, they should refer only to one class's misclassification rates (so that we can shift the variable β to make the problem feasible), or a kernel such as the RBF, known to have large VC dimension (Burges 1998; Cristianini and Shawe-Taylor 2000), roughly the maximum number of training instances that can be correctly classified, should be used.
[Fig. 1 Study of feasibility and infeasibility of the CSVM. Three panels over features V1 and V2: the two-class data, the linear SVM hyperplane, and three alternative hyperplanes (Hyperplane 1, Hyperplane 2, Hyperplane 3).]
3 Computational results
In this section we illustrate the performance of the CSVM compared to the
standard SVM, considered here as a benchmark. In order to do this comparison, some of the performance measures presented in Section 2.2 are considered.
In particular, true positive and true negative rates will be used here. In what
follows, a description of the data, experiments and results is given.
3.1 Description of the experiments
The objective of this paper, as has been stated before, is to build a clas-
sifier whose performance can be controlled by means of some constraints,
as in Problem (CSVM). As explained in Section 2.1, if we want a perfor-
mance measure p to be greater than a value p_0 with a specified confidence 100(1 − α)%, we should use an estimator p̂ of p and impose it to be greater than p*_0 = p_0 + √(−log α / (2n)), according to (4). From a practical point of view, this result will turn out to be crucial in our experiments.
Two experiments, both having the same structure, will be considered in this
paper. In each one, we will try to improve the performance of the classifier in
one of the classes, even though, as will be seen, some damage may be produced in the other class. Hence, we will focus on TPR and TNR. Suppose that the estimates of the TPR and TNR obtained by the standard SVM are, respectively, TPR_0 and TNR_0; if we want to enhance the performance, the aim will be TPR ≥ TPR_0 + δ_1 and TNR ≥ TNR_0 + δ_2, respectively. In the considered experiments we have set δ_1 = δ_2 = 0.025, although other values can also be tested. Then, the two experiments are:

Experiment 1: Impose TPR ≥ min{1, TPR_0 + 0.025} = p_0,
Experiment 2: Impose TNR ≥ min{1, TNR_0 + 0.025} = p_0.
That is to say, taking α = 0.05, the constraints in the optimization problem for these two different experiments turn out to be:

Experiment 1: Impose p̂_TPR ≥ min{1, TPR_0 + √(−log 0.05 / (2n)) + 0.025} = p*_0,
Experiment 2: Impose p̂_TNR ≥ min{1, TNR_0 + √(−log 0.05 / (2n)) + 0.025} = p*_0.
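For instance, with the helper sketched in Section 2.1 and purely hypothetical values TNR_0 = 0.83 and n = |J−| = 300, the threshold imposed in Experiment 2 would be computed as follows (illustrative numbers only, not results from the paper):

import math

tnr0, n, alpha, delta = 0.83, 300, 0.05, 0.025  # hypothetical values
p0_star = min(1.0, tnr0 + math.sqrt(-math.log(alpha) / (2 * n)) + delta)
print(p0_star)  # approximately 0.926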
Although the description of the experiments is presented below, the skele-
ton of the complete methodology can be found in Algorithm 1 for a better and
clearer understanding. Now, we shall discuss the experiments’ design. First, in
all the experiments the time limit and the value of M in Problem (CSVM) were set equal to 300 seconds and 100, respectively. The selection of these values
is due to the following facts. The time limit should not be too small, since
we should give the optimizer enough time to solve the problem. On the other
hand, the time limit should not be too large if one wants to have reasonable
running times. In the case of the parameter M, if a small value is considered,
there may be many discarded hyperplanes, maybe including the optimal one.
However, if M is too big, it might cause numerical troubles (Camm et al. 1990). A compromise solution is obtained by considering M = 100, which is shown to be a good value in our numerical experiments.

Algorithm 1: Pseudocode for CSVM
1   Split data (D) into "folds" subsets, D = {D_1, ..., D_folds}.
2   for kf = 1, ..., folds do
3       Set Validation = D_kf and I ∪ J = D − {D_kf}.
4       for each pair (C, γ) in the grid ({2^(−5:5)}, {2^(−5:5)}) do
5           Split D − {D_kf} = D* into "folds2" subsets, D* = {D*_1, ..., D*_folds2}.
6           for kf2 = 1, ..., folds2 do
7               Set Validation* = D*_kf2 and I ∪ J = D* − {D*_kf2}.
8               Run the standard SVM over I ∪ J.
9               Move β of the SVM until the desired instances are correctly classified.
10              Run problem CSVM over I, J with initial solutions from before.
11              Validate over Validation*, getting the accuracy (ACC[kf2]).
12          end
13          Calculate the average accuracy avgACC = (Σ_kf2 ACC[kf2]) / folds2.
14          if avgACC ≥ bestACC then
15              Set bestACC = avgACC, bestγ = γ and bestC = C.
16          end
17      end
18      Run the standard SVM over I ∪ J with the parameters bestγ and bestC.
19      Move β of the SVM until the desired instances are correctly classified.
20      Run problem CSVM over I, J with initial solutions from the previous step.
21      Validate over Validation, getting the correct classification probabilities (TPR[kf], TNR[kf]).
22  end
23  Calculate the average values of TPR and TNR.
Second, one of the most popular kernels K(x, x′) in the literature, and the one considered in this paper, is the well-known RBF kernel (Cristianini and Shawe-Taylor 2000; Hastie et al. 2001; Hsu et al. 2003; Smola and Schölkopf 2004; Horn et al. 2016), given by

K(x, x′) = exp(−γ ‖x − x′‖²),
where γ > 0 is a parameter to be tuned. However, the approach presented in
this paper is valid for arbitrary kernels.
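The RBF kernel matrix between two sets of points can be computed, for instance, as follows (our own sketch; any other kernel function could be plugged into the CSVM in the same way):

import numpy as np

def rbf_kernel_matrix(A, B, gamma):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2) for row vectors A_i and B_j
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

# Example: rbf_kernel_matrix(np.random.rand(5, 3), np.random.rand(4, 3), gamma=0.5) has shape (5, 4).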
The estimation of the performance for our classifier is based on a 10-fold
cross validation (CV) (Kohavi et al. 1995) as follows. Note that, apart from
tuning γ, the regularization parameter C introduced in Section 1 also needs to be tuned. In addition, for a given pair of parameters (C, γ), the process consists mainly of solving a standard SVM using all the instances (I ∪ J) and collecting the values of λ (from the dual formulation of the SVM) as well as the value of β. Once the SVM is solved, and with the purpose of providing an initial solution for the CSVM, the value of β is slightly changed (keeping the values of the λ's fixed) until the desired number of well-classified instances is reached. Then, the values of β and the λ's obtained are set as initial solutions for the CSVM. In addition, depending on whether each instance in J is well classified or not, we set its value of z to 1 or 0 as an initial value for the CSVM.
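The intercept-shifting step used to build this initial solution can be sketched as follows (our own illustration with hypothetical names; scores_without_beta[j] stands for the kernel expansion Σ_s λ_s y_s K(x_s, x_j) of each anchor instance, kept fixed while β moves):

import numpy as np

def shift_intercept(scores_without_beta, y, beta, target_class=-1, step=0.01, max_iter=10000):
    # Shift beta until every anchor instance of target_class is correctly classified
    direction = 1.0 if target_class == +1 else -1.0  # raise scores for +1, lower them for -1
    for _ in range(max_iter):
        scores = scores_without_beta + beta
        mask = (y == target_class)
        if np.all(y[mask] * scores[mask] > 0):       # all target-class instances well classified
            break
        beta += direction * step
    return beta

The resulting β, the fixed λ's, and the implied 0/1 values of z then serve as the warm-start solution passed to the solver.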
However, we should make the selection of the best pair (C, γ) in each of the previous folds. In order to do that, a 10-fold CV as before is made for each pair in a grid given by the 121 different combinations of C = 2^(−5:5) and γ = 2^(−5:5). The general criterion used to select the best pair of parameters is the accuracy. However, in cases where the datasets are severely unbalanced in the class sizes, other performance measures which take into account such imbalancedness, such as the G-mean (Tang et al. 2009) or the Youden index (Bewick et al. 2004), would be preferable. Finally, the average values of TPR and TNR obtained in the first CV, in addition to their standard deviations, are calculated.
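For reference, both alternative tuning criteria are simple functions of the TPR and the TNR (our own sketch):

import math

def g_mean(tpr, tnr):
    return math.sqrt(tpr * tnr)   # geometric mean of the two class rates

def youden_index(tpr, tnr):
    return tpr + tnr - 1.0        # sensitivity + specificity - 1

# E.g. g_mean(0.905, 0.405) is about 0.605 and youden_index(0.905, 0.405) equals 0.31.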
3.2 Data description
The performance, in terms of correct classification probabilities and accuracy,
is illustrated using 4 real-life datasets from the UCI repository (Lichman 2013).
In particular, the datasets used are wisconsin (Breast Cancer Wisconsin (Di-
agnostic) Data Set), australian (Statlog (Australian Credit Approval) Data
Set), votes (Congressional Voting Records Data Set) and german (Statlog
(German Credit Data) Data Set).
Details concerning the implementation of the CSVM for the real datasets are shown in Table 2. The first column reports the name of the dataset, the second the number of features composing the set, the third the size of the dataset, and the last column the number of positive instances (majority class) together with its percentage.

Name         #features   #instances   #positive (%)
wisconsin    30          569          357 (62.7%)
australian   14          690          383 (55.5%)
votes        16          435          267 (61.4%)
german       45          1000         700 (70%)
Table 2 Details concerning the implementation of the CSVM for the considered datasets.
Note that prior to running the different experiments data have been stan-
dardized, that is to say, each variable in all the 4 considered data sets has zero
mean and unit variance.
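Standardization amounts to the usual per-feature transformation, e.g. (sketch):

import numpy as np

def standardize(X):
    # Center each column to zero mean and scale it to unit variance
    return (X - X.mean(axis=0)) / X.std(axis=0)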
3.3 Results
In this section we compare the performance of the strategy proposed to build
the CSVM classifier against that of the SVM classifier in terms of overall
classification accuracy, true positive rate (TPR) and true negative rate (TNR)
of the classifier. Note that, even though, as discussed in Section 2.3, the problem is always feasible on the training sample, it may happen that the desired performance
is not achieved in the validation sample.
Tables 3 and 4 report the results for the benchmark procedure, SVM, and those obtained when imposing a higher classification rate in a selected class, in particular 0.025 additional points, according to the description of the experiments in Section 3.1. As an exception, in the case of german, a minimum value of 0.65 in the TNR will be imposed, in order to increase the low value obtained under the standard SVM (0.405).

                        SVM               CSVM
Name               Mean     Std      Mean     Std
wisconsin    TPR   0.99     0.017    0.945    0.045
             TNR   0.948    0.049    0.965    0.037
australian   TPR   0.863    0.079    0.772    0.081
             TNR   0.83     0.071    0.903    0.05
votes        TPR   0.963    0.04     0.846    0.097
             TNR   0.951    0.031    0.978    0.038
german       TPR   0.905    0.036    0.791    0.063
             TNR   0.405    0.114    0.547    0.141
Table 3 Results for the original SVM and for the CSVM strategy when the constraint is imposed on the TNR.

                        SVM               CSVM
Name               Mean     Std      Mean     Std
wisconsin    TPR   0.99     0.017    0.989    0.018
             TNR   0.948    0.049    0.856    0.153
australian   TPR   0.863    0.079    0.914    0.046
             TNR   0.83     0.071    0.692    0.086
votes        TPR   0.963    0.04     0.978    0.026
             TNR   0.951    0.031    0.922    0.04
Table 4 Results for the original SVM and for the CSVM strategy when the constraint is imposed on the TPR.
First, the results when constraints are imposed on the true negative rate
(TNR) are presented in Table 3. In the case of wisconsin, one can observe
that, although the TNR has been increased, an increment of 0.025 points was not achieved. An improvement is nevertheless observed. On the other hand, in the case of australian, we have been able to increase the value of the TNR by 0.073 points without significantly reducing the accuracy. A similar result is obtained for votes, for which the increase is also above 0.025 points. The results are not so good for german, due to its class-size imbalance. If, instead of the accuracy, the G-mean is used as the criterion for tuning the values of (C, γ), then the results improve notably, as depicted in Table 5. In fact, if TNR ≥ 0.7 is imposed instead of TNR ≥ 0.65, even better results are obtained, as can be seen in Table 5, although without reaching the imposed threshold.
                    SVM              CSVM (TNR ≥ 0.65)   CSVM (TNR ≥ 0.7)
Name           Mean     Std      Mean     Std         Mean     Std
german   TPR   0.905    0.036    0.668    0.111       0.683    0.073
         TNR   0.405    0.114    0.671    0.164       0.69     0.103
Table 5 Results for the original SVM and for the CSVM strategy in german when the constraint is imposed on the TNR, using the G-mean as tuning criterion.

Now, we shall discuss the results when constraints are imposed on the true positive rate (TPR), depicted in Table 4. Here, in the case of wisconsin,
the increase of 0.025 points is not obtained. In fact, instead of an increase we observe a minor decrease. However, this is not a surprising result, since the original TPR was very high (near perfect classification). On the other hand, if we look at australian, an increase of about 0.05 points has been reached, without losing too much overall performance, as before. In addition, an increase can also be observed for votes. However, in contrast to what happened with the TNR, such an increase is smaller than 0.025 points.
4 Conclusions
In this paper, we have proposed and evaluated a new supervised learning SVM-based method, the CSVM. Such a classifier is built via a MIQP problem, which has been solved using a standard and widely available solver. Also, some theoretical foundations are given in order to formulate the constraints that are added to the standard SVM (and hence build the CSVM), and to guarantee that the performance measures will satisfy the imposed thresholds with high probability. The applicability of this cost-sensitive SVM has been demonstrated by numerical experiments on benchmark data sets.
We conclude that it is possible to control the classification rates in one
class, possibly, but not necessarily, at the expense of the other class. This contrasts sharply with the naive approach in which, once the SVM is solved,
its intercept is moved to enhance the positive rates in one class, necessarily
deteriorating the performance in the other class.
Although, for simplicity, all numerical results are presented with just one performance constraint, our approach allows adding one constraint per class, as well as an overall accuracy constraint. Also for simplicity, we addressed here two-way data matrices and two-class problems; however, this approach could be extended to more complex data such as multi-class problems or multi-way arrays (Lyu et al. 2017), which are very common in biomedical
research. On the other hand, an alternative perspective for addressing the
SVM regularization is to consider different norms (Yao and Lee 2014).
Finally, another possible extension, which is under development, is to perform feature selection using the proposed constraints in order to control the misclassification costs. Such a process is an essential step in tasks such as
high-dimensional microarray classification problems (Guo 2010).
Acknowledgements
This research is financed by Fundación BBVA, projects FQM329 and P11-FQM-7603 (Junta de Andalucía, Andalucía) and MTM2015-65915-R (Ministerio de Economía y Competitividad, Spain). The last three are cofunded with EU ERD Funds. The authors are thankful for such support.
Appendix A: Derivation of the CSVM
In this section, the detailed steps to build the CSVM formulation are shown. For that, suppose that we are given the linear model
min_{ω,β,ξ,z}  ω^T ω + C Σ_{i∈I} ξ_i
s.t.  y_i(ω^T x_i + β) ≥ 1 − ξ_i,  i ∈ I
      ξ_i ≥ 0,  i ∈ I
      y_j(ω^T x_j + β) ≥ 1 − M(1 − z_j),  j ∈ J
      z_j ∈ {0, 1},  j ∈ J
      p̂_ℓ ≥ p*_{0ℓ},  ℓ ∈ L.
Hence, the problem above can be rewritten as

min_z  { min_{ω,β,ξ}  ω^T ω + C Σ_{i∈I} ξ_i
         s.t.  y_i(ω^T x_i + β) ≥ 1 − ξ_i,  i ∈ I
               y_j(ω^T x_j + β) ≥ 1 − M(1 − z_j),  j ∈ J
               ξ_i ≥ 0,  i ∈ I }
s.t.  z_j ∈ {0, 1},  j ∈ J
      p̂_ℓ ≥ p*_{0ℓ},  ℓ ∈ L.
The Karush–Kuhn–Tucker (KKT) conditions for the inner problem, assuming z fixed, are given by

ω = Σ_{s∈I} λ_s y_s x_s + Σ_{t∈J} μ_t y_t x_t
0 = Σ_{s∈I} λ_s y_s + Σ_{t∈J} μ_t y_t
0 ≤ λ_s ≤ C/2,  s ∈ I
0 ≤ μ_t ≤ M z_t,  t ∈ J.
Thus, substituting the previous expressions into the last optimization prob-
lem, the partial dual of such problem can be calculated, yielding
min_z  { min_{λ,μ,β,ξ}  (Σ_{s∈I} λ_s y_s x_s + Σ_{t∈J} μ_t y_t x_t)^T (Σ_{s∈I} λ_s y_s x_s + Σ_{t∈J} μ_t y_t x_t) + C Σ_{i∈I} ξ_i
         s.t.  y_i ((Σ_{s∈I} λ_s y_s x_s + Σ_{t∈J} μ_t y_t x_t)^T x_i + β) ≥ 1 − ξ_i,  i ∈ I
               y_j ((Σ_{s∈I} λ_s y_s x_s + Σ_{t∈J} μ_t y_t x_t)^T x_j + β) ≥ 1 − M(1 − z_j),  j ∈ J
               ξ_i ≥ 0,  i ∈ I
               Σ_{i∈I} λ_i y_i + Σ_{j∈J} μ_j y_j = 0
               0 ≤ λ_i ≤ C/2,  i ∈ I
               0 ≤ μ_j ≤ M z_j,  j ∈ J }
s.t.  z_j ∈ {0, 1},  j ∈ J
      p̂_ℓ ≥ p*_{0ℓ},  ℓ ∈ L.
Finally, from the kernel trick, Problem (CSVM) is obtained.
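As a quick numerical sanity check of this substitution (our own sketch, not part of the paper), one can verify that for the linear kernel K(x, x′) = x^T x′ the quadratic form in (CSVM) coincides with ω^T ω when ω = Σ_s λ_s y_s x_s + Σ_t μ_t y_t x_t:

import numpy as np

rng = np.random.default_rng(1)
XI, XJ = rng.normal(size=(4, 3)), rng.normal(size=(2, 3))
yI, yJ = rng.choice([-1, 1], 4), rng.choice([-1, 1], 2)
lam, mu = rng.random(4), rng.random(2)

omega = (lam * yI) @ XI + (mu * yJ) @ XJ
K = lambda A, B: A @ B.T  # linear kernel
quad = (lam * yI) @ K(XI, XI) @ (lam * yI) + (mu * yJ) @ K(XJ, XJ) @ (mu * yJ) \
       + 2 * (lam * yI) @ K(XI, XJ) @ (mu * yJ)
print(np.isclose(omega @ omega, quad))  # True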
References
Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multiclass to binary: A uni-
fying approach for margin classifiers. Journal of Machine Learning Research
1(Dec), 113–141 (2000)
Bewick, V., Cheek, L., Ball, J.: Statistics review 13: receiver operating char-
acteristic curves. Critical Care 8(6), 508–512 (2004)
Bonami, P., Biegler, L.T., Conn, A.R., Cornuéjols, G., Grossmann, I.E., Laird,
C.D., Lee, J., Lodi, A., Margot, F., Sawaya, N., Wächter, A.: An algorithmic
framework for convex mixed integer nonlinear programs. Discrete Optimiza-
tion 5(2), 186 – 204 (2008). In Memory of George B. Dantzig
Bradford, J.P., Kunz, C., Kohavi, R., Brunk, C., Brodley, C.E.: Pruning deci-
sion trees with misclassification costs. In: Proceedings of the 10th European
Conference on Machine Learning, ECML ’98, pp. 131–136. Springer (1998)
Burer, S., Letchford, A.N.: Non-convex mixed-integer nonlinear programming:
A survey. Surveys in Operations Research and Management Science 17(2),
97 – 106 (2012)
Burges, C.J.: A Tutorial on Support Vector Machines for Pattern Recognition.
Data Mining and Knowledge Discovery 2(2), 121–167 (1998)
Camm, J.D., Raturi, A.S., Tsubakitani, S.: Cutting Big M Down to Size.
Interfaces 20(5), 61–66 (1990)
Carrizosa, E., Martin-Barragan, B., Romero Morales, D.: Multi-group Sup-
port Vector Machines with Measurement Costs: A Biobjective Approach.
Discrete Applied Mathematics 156(6), 950–966 (2008)
Carrizosa, E., Romero Morales, D.: Supervised Classification and Mathemati-
cal Optimization. Computers & Operations Research 40(1), 150–165 (2013)
Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines
and other kernel-based learning methods. Cambridge University Press, New
York, NY, USA (2000)
Datta, S., Das, S.: Near-Bayesian Support Vector Machines for imbalanced
data classification with equal or unequal misclassification costs. Neural Net-
works 70, 39–52 (2015)
Freitas, A., Costa-Pereira, A., Brazdil, P.: Cost-Sensitive Decision Trees Ap-
plied to Medical Data. In: Data Warehousing and Knowledge Discovery:
9th International Conference, DaWaK 2007, Regensburg Germany, Septem-
ber 3-7, 2007. Proceedings, pp. 303–312. Springer Berlin Heidelberg, Berlin,
Heidelberg (2007)
Guo, J.: Simultaneous variable selection and class fusion for high-dimensional
linear discriminant analysis. Biostatistics 11(4), 599–608 (2010)
Gurobi Optimization, I.: Gurobi Optimizer Reference Manual (2016). URL
http://www.gurobi.com
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning.
Springer Series in Statistics. Springer New York Inc., New York, NY, USA
(2001)
He, H., Ma, Y.: Imbalanced learning: foundations, algorithms, and applica-
tions. John Wiley & Sons, Inc. (2013)
Hoeffding, W.: Probability Inequalities for Sums of Bounded Random Vari-
ables. Journal of the American Statistical Association 58(301), 13–30 (1963)
Horn, D., Demircioğlu, A., Bischl, B., Glasmachers, T., Weihs, C.: A compar-
ative study on large scale kernelized support vector machines. Advances in
Data Analysis and Classification (2016)
Hsu, C.W., Chang, C.C., Lin, C.J., et al.: A practical guide to support vector
classification. Tech. rep., Department of Computer Science, National Taiwan
University (2003)
Kohavi, R., et al.: A study of cross-validation and bootstrap for accuracy
estimation and model selection. In: IJCAI, vol. 14, pp. 1137–1145. Stanford,
CA (1995)
Lichman, M.: UCI Machine Learning Repository (2013)
Lyu, T., Lock, E.F., Eberly, L.E.: Discriminating sample groups with multi-
way data. Biostatistics (2017)
Maldonado, S., Pérez, J., Bravo, C.: Cost-based feature selection for support
vector machines: An application in credit scoring. European Journal of
Operational Research 261(2), 656 – 665 (2017)
Mercer, J.: Functions of positive and negative type, and their connection with
the theory of integral equations. Philosophical Transactions of the Royal
Society of London, Series A 209, 415–446 (1909)
Prati, R.C., Batista, G.E., Silva, D.F.: Class imbalance revisited: a new exper-
imental setup to assess the performance of treatment methods. Knowledge
and Information Systems 45(1), 247–270 (2015)
Sánchez, B.N., Wu, M., Song, P.X.K., Wang, W.: Study design in high-
dimensional classification analysis. Biostatistics 17(4), 722 (2016)
Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Statistics
and Computing 14(3), 199–222 (2004)
Tang, Y., Zhang, Y.Q., Chawla, N.V., Krasser, S.: SVMs Modeling for Highly
Imbalanced Classification. IEEE Transactions on Systems, Man, and Cy-
bernetics, Part B (Cybernetics) 39(1), 281–288 (2009)
Van Rossum, G., Drake, F.L.: An Introduction to Python. Network Theory
Ltd. (2011)
Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag New
York, Inc., New York, NY, USA (1995)
Vapnik, V.N.: Statistical learning theory, vol. 1. Wiley New York, 1 ed. (1998)
Yao, Y., Lee, Y.: Another look at linear programming for feature selection via
methods of regularization. Statistics and Computing 24(5), 885–905 (2014)