On Support Vector Machines under a multiple-cost
scenario
Sandra Benítez-Peña · Rafael Blanquero · Emilio Carrizosa · Pepa Ramírez-Cobo
Received: date / Accepted: date
Abstract Support Vector Machine (SVM) is a powerful tool in binary classi-
fication, known to attain excellent misclassification rates. On the other hand,
many real-world classification problems, such as those found in medical diag-
nosis, churn or fraud prediction, involve misclassification costs which may be
different in the different classes. However, it may be hard for the user to pro-
vide precise values for such misclassification costs, whereas it may be much
easier to identify acceptable misclassification rate values. In this paper we
propose a novel SVM model in which misclassification costs are considered by
incorporating performance constraints in the problem formulation. Specifically,
our aim is to seek the hyperplane with maximal margin yielding misclassifi-
cation rates below given threshold values. Such maximal margin hyperplane
is obtained by solving a quadratic convex problem with linear constraints and
integer variables. The reported numerical experience shows that our model
gives the user control on the misclassification rates in one class (possibly at
the expense of an increase of misclassification rates for the other class) and is
feasible in terms of running times.

S. Benítez-Peña
IMUS, Instituto de Matemáticas de la Universidad de Sevilla.
Departamento de Estadística e Investigación Operativa, Universidad de Sevilla. Spain.
Tel.: +34-955420861
E-mail: sbenitez1@us.es

R. Blanquero
IMUS, Instituto de Matemáticas de la Universidad de Sevilla.
Departamento de Estadística e Investigación Operativa, Universidad de Sevilla. Spain.

E. Carrizosa
IMUS, Instituto de Matemáticas de la Universidad de Sevilla.
Departamento de Estadística e Investigación Operativa, Universidad de Sevilla. Spain.

P. Ramírez-Cobo
IMUS, Instituto de Matemáticas de la Universidad de Sevilla.
Departamento de Estadística e Investigación Operativa, Universidad de Cádiz. Spain.
Keywords Constrained Classification · Misclassification costs · Mixed Integer Quadratic Programming · Sensitivity/Specificity trade-off · Support Vector Machines

Mathematics Subject Classification (2000) 62P99 · 90C11 · 90C30
1 Introduction
In supervised classification we are given a set Ω of individuals belonging to
two or more different classes, and the final aim is to classify new objects whose
class is unknown. Each object i ∈ Ω can be represented by a pair (x_i, y_i), where
x_i ∈ R^m is the so-called feature vector and y_i ∈ C is the class membership of
object i.
A state-of-the-art method in supervised classification is the support vector
machine (SVM), see Vapnik (1995, 1998); Cristianini and Shawe-Taylor (2000);
Carrizosa and Romero Morales (2013). In its basic version, SVM addresses two-
class problems, i.e., C has two elements, say, C = {−1, +1}. The SVM aims at
separating both classes by means of a linear classifier, ω⊤x + β = 0, where ω
is the score vector. We will assume throughout this paper that C = {−1, +1}
and refer the reader to e.g. Allwein et al. (2000) for the reduction of multiclass
problems to this case.
The SVM classifier is obtained by solving the following convex quadratic pro-
gramming (QP) formulation with linear constraints:
$$
\begin{array}{rll}
\min\limits_{\omega,\beta,\xi} & \omega^\top \omega + C \sum_{i \in I} \xi_i & \\
\text{s.t.} & y_i(\omega^\top x_i + \beta) \ge 1 - \xi_i, & i \in I \\
& \xi_i \ge 0, & i \in I,
\end{array}
$$
where I represents the set of training data, ξ_i ≥ 0 are slack variables
which allow data points to be misclassified, and C > 0 is a regularization
parameter to be tuned that controls the trade-off between margin maximization
and misclassification errors. Given an object i, it is classified in the positive
or the negative class according to the sign of the so-called score function,
sign(ω⊤x_i + β), while in the case ω⊤x_i + β = 0 the object is classified
randomly.
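As a minimal illustration, the decision rule above can be sketched as follows (the function and variable names are ours, not from the paper):

```python
import random

def svm_classify(omega, beta, x, rng=random.Random(0)):
    """Classify a feature vector x with a linear classifier (omega, beta).

    Returns +1 or -1 according to the sign of the score function
    omega^T x + beta; a score of exactly 0 is broken at random,
    as described in the text.
    """
    score = sum(w * xi for w, xi in zip(omega, x)) + beta
    if score > 0:
        return +1
    if score < 0:
        return -1
    return rng.choice([-1, +1])  # boundary case: random assignment

# Toy example with the hyperplane x1 + x2 - 1 = 0:
print(svm_classify([1.0, 1.0], -1.0, [2.0, 0.5]))  # score 1.5 -> +1
print(svm_classify([1.0, 1.0], -1.0, [0.2, 0.3]))  # score -0.5 -> -1
```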
As mentioned, the goal in supervised classification is to classify objects in
the correct class. However, ignoring imbalancedness (either in the class sizes
or in the misclassification cost structure) or other costs may have dramatic
consequences in the classification task, see Carrizosa et al. (2008); He and Ma
(2013); Prati et al. (2015); Maldonado et al. (2017). For instance, for clinical
databases, there are usually more observations from healthy populations than
from disease cases, so smaller classification errors are obtained for the former.
For example, for the well-known Breast Cancer Wisconsin (Diagnostic) Data
Set from the UCI repository (Lichman 2013), the number of sick cases (212)
is smaller than the number of control cases (357). If a standard SVM is used for
classifying the dataset, then the obtained rates (average values according to a
10-fold cross-validation approach) are depicted in Table 1. Even though both

                                   Mean    Std
  % benign instances well class.   99%     1.7
  % malign instances well class.   94.8%   4.9

Table 1 Performance of SVM in wisconsin. Average values and standard deviations computed from 10 realizations.
rates are high, it might be of interest to increase the accuracy of the cancer
samples. This problem will be addressed in this paper.
In order to cope with imbalancedness, different methods have been sug-
gested, see Bradford et al. (1998); Freitas et al. (2007); Carrizosa et al. (2008);
Datta and Das (2015). Those methods are based on adding parameters or
adapting the classifier construction, among others. For example, in Carrizosa
et al. (2008), a biobjective problem is formulated in which the misclassification
rate (via maximization of the margin) and the measurement costs are
simultaneously minimized.
In this paper a new formulation of the SVM is presented, in such a way
that the focus is not only on the minimization of the overall misclassification
rate but also on the performance of the classifier in the two classes. In order
to do that, novel constraints are added to the SVM formulation. The keystone
of the new model is its ability to achieve deeper control over misclassification
than previously existing models. The proposed methodology
will be called Constrained Support Vector Machine (CSVM) and the resulting
classification technique will be referred to as the CSVM classifier.
The remainder of this paper is structured as follows. In Section 2, the
CSVM is formulated and details concerning its motivation, feasibility and so-
lutions are given. Section 3 aims to illustrate the performance of the new clas-
sifier. An in-depth description of the experiments' design, the real datasets
to be tested, and the obtained results will be given. The paper ends with
some concluding remarks and possible extensions in Section 4.
2 Constrained Support Vector Machines
In this section the Constrained Support Vector Machine (CSVM) model is
formulated as a Mixed Integer Nonlinear Programming (MINLP) problem
(Bonami et al. 2008; Burer and Letchford 2012), specifically in terms of a
Mixed Integer Quadratic Programming (MIQP) problem.
This section is structured as follows. In Section 2.1 some theoretical foun-
dations that motivate the novel constraints are given. Then, in Section 2.2 the
formulation of the CSVM is presented. We will depart from the linear kernel
case to later extend to the general kernel case via the kernel trick. Finally, in
Section 2.3, some issues concerning the CSVM formulation, such as its feasibility,
shall be discussed.
2.1 Theoretical Motivation
As commented before, the aim of this work is to build a classifier so that the
user may have control over the performance in the two classes. Specifically,
given a set Ω = {(x_i, y_i)}_i of data (a random sample of a vector (X, Y) with
unknown distribution), the target is to obtain a classifier such that p ≥ p_0,
where p is the value of a performance measure and p_0 is a threshold chosen
by the user. The performance measures p to be considered in this paper are
the sensitivity or true positive rate (TPR), the specificity or true negative rate
(TNR) and the accuracy (ACC), given by:
$$
\begin{aligned}
\mathrm{TPR}&: \; p = P(\omega^\top X + \beta > 0 \mid Y = +1) \\
\mathrm{TNR}&: \; p = P(\omega^\top X + \beta < 0 \mid Y = -1) \qquad (1) \\
\mathrm{ACC}&: \; p = P(Y(\omega^\top X + \beta) > 0).
\end{aligned}
$$
See for example, Bewick et al. (2004).
If the random variable Z, defined as
Z=1, if an observation is well classified,
0, otherwise,
is considered, then the values of p as in (1), corresponding to the probability
of correct classification, can be rewritten as

$$
\mathrm{TPR}: \; p = E[Z \mid Y = +1], \qquad
\mathrm{TNR}: \; p = E[Z \mid Y = -1], \qquad
\mathrm{ACC}: \; p = E[Z],
$$
and estimated from an independent and identically distributed (i.i.d.) sample
{Z_i}_{i∈S} by

$$
\begin{aligned}
\mathrm{TPR}&: \; \hat{p} = \bar{Z}_+ = \frac{\sum_{i \in S_+} Z_i}{|S_+|} \\
\mathrm{TNR}&: \; \hat{p} = \bar{Z}_- = \frac{\sum_{i \in S_-} Z_i}{|S_-|} \\
\mathrm{ACC}&: \; \hat{p} = \bar{Z} = \frac{\sum_{i \in S} Z_i}{|S|},
\end{aligned}
$$

where S_+ and S_− denote, respectively, the subsets {i ∈ S : y_i = +1} and
{i ∈ S : y_i = −1}.
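The three empirical estimators above are simple sample means of the indicator Z over S_+, S_− and S; a minimal Python sketch (names are ours):

```python
def empirical_rates(y_true, y_pred):
    """Empirical TPR, TNR and ACC for labels in {-1, +1}.

    z_i = 1 when observation i is well classified, 0 otherwise;
    the rates are the means of z over S+, S- and S, respectively.
    """
    z = [1 if yt == yp else 0 for yt, yp in zip(y_true, y_pred)]
    pos = [zi for zi, yt in zip(z, y_true) if yt == +1]
    neg = [zi for zi, yt in zip(z, y_true) if yt == -1]
    tpr = sum(pos) / len(pos)
    tnr = sum(neg) / len(neg)
    acc = sum(z) / len(z)
    return tpr, tnr, acc

y_true = [+1, +1, +1, -1, -1]
y_pred = [+1, +1, -1, -1, +1]
print(empirical_rates(y_true, y_pred))  # (0.666..., 0.5, 0.6)
```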
From a hypothesis testing viewpoint, our aim is to build a classifier such
that, for a given sample, one can reject the null hypothesis in
$$
H_0: p \le p_0 \qquad \text{vs.} \qquad H_1: p > p_0.
$$

Under the classic decision rule, H_0 is rejected if \hat{p} \ge p^*_0, assuming that
α = P(type I error). From Hoeffding's inequality (Hoeffding 1963),

$$
P(\hat{p} \ge p + c) \le \exp(-2nc^2). \qquad (2)
$$

As α = P(type I error) = P(\hat{p} \ge p^*_0 \mid p = p_0), substituting p by p_0 in (2) yields

$$
P(\hat{p} < p_0 + c) \ge 1 - \exp(-2nc^2) = 1 - \alpha, \qquad (3)
$$

where p_0 + c = p^*_0. Therefore, we can take

$$
p^*_0 = p_0 + \sqrt{\frac{\log \alpha}{-2n}}. \qquad (4)
$$
Note that n equals |S_+|, |S_−| or |S|, respectively, when considering the TPR,
the TNR or the accuracy.
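The Hoeffding-corrected threshold (4) is a one-line computation; a sketch (function name ours, example values illustrative only):

```python
import math

def adjusted_threshold(p0, n, alpha=0.05):
    """Hoeffding-adjusted threshold p0* = p0 + sqrt(log(alpha) / (-2 n)).

    Imposing p-hat >= p0* on a sample of size n guarantees, with
    confidence 1 - alpha, that the true performance p exceeds p0.
    """
    return p0 + math.sqrt(math.log(alpha) / (-2.0 * n))

# e.g. to certify a TPR above 0.90 with 200 positive anchor points:
print(round(adjusted_threshold(0.90, 200), 4))  # -> 0.9865
```

Note that the correction shrinks as n grows, so small anchor sets require noticeably stricter empirical thresholds.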
2.2 CSVM formulation
In this section, the CSVM formulation is presented. As will be seen, the formulation
includes novel performance constraints, which turn the optimization
problem into a MIQP problem in terms of some integer variables.
We assume we are given a dataset with known labels. From this set we
identify the training set I, used to build the classifier, and the anchor set J,
used to impose a lower bound on the classifier performance. These sets will be
considered disjoint.
With the purpose of building the CSVM, the performance constraints will
be formulated in terms of binary variables {z_j}_{j∈J}, which are realizations of
the variable Z in Section 2.1 and defined as:

$$
z_j = \begin{cases} 1, & \text{if instance } j \text{ is well classified,} \\ 0, & \text{otherwise.} \end{cases}
$$
In order to formulate the CSVM, novel constraints are added to the stan-
dard soft-margin SVM formulation as follows:
$$
\begin{array}{rlll}
\min\limits_{\omega,\beta,\xi,z} & \omega^\top \omega + C \sum_{i \in I} \xi_i & & \\
\text{s.t.} & y_i(\omega^\top x_i + \beta) \ge 1 - \xi_i, & i \in I & (5) \\
& \xi_i \ge 0, & i \in I & (6) \\
& y_j(\omega^\top x_j + \beta) \ge 1 - M(1 - z_j), & j \in J & (7) \\
& z_j \in \{0,1\}, & j \in J & (8) \\
& \hat{p}_\ell \ge p^*_{0\ell}, & \ell \in L. & (9)
\end{array} \qquad (\mathrm{CSVM}_0)
$$
In the previous optimization problem, (5) and (6) are the usual constraints
in the SVM formulation. Constraints (7) ensure that observations j ∈ J with
z_j = 1 will be correctly classified, without imposing any restriction when
z_j = 0, provided that M is big enough. A collection of requirements on the
performance of the classifier over J can be specified by means of (9). Also, L is
the set of indexes of the constraints that have the form of (9). These constraints
can be modeled via the binary variables z_j, for instance:
$$
\begin{aligned}
\mathrm{TPR}&: \; \sum_{j \in J_+} z_j \ge p^*_0 |J_+| \\
\mathrm{TNR}&: \; \sum_{j \in J_-} z_j \ge p^*_0 |J_-| \\
\mathrm{ACC}&: \; \sum_{j \in J} z_j \ge p^*_0 |J|,
\end{aligned}
$$
where J_+ and J_− denote, respectively, the subsets {i ∈ J : y_i = +1} and
{i ∈ J : y_i = −1}. As usual in SVM methodology, a mapping into a high-
dimensional feature space may be considered, which allows us to transform this
linear classification technique into a non-linear one. This way we can address
problems with a very large number of features, such as those encountered
in personalized medicine (Sánchez et al. 2016). The various drawbacks that
arise in considering this mapping can be avoided if the so-called kernel trick
(Cristianini and Shawe-Taylor 2000), based on Mercer theorem (Mercer 1909),
is used. Therefore, by considering the (partial) dual problem of (CSVM0) and
the kernel trick, the general formulation of the CSVM is obtained as follows
(the intermediate steps can be found in Appendix A):
$$
\begin{array}{rl}
\min\limits_{\lambda,\mu,\beta,\xi,z} & \sum\limits_{s,s' \in I} \lambda_s y_s \lambda_{s'} y_{s'} K(x_s, x_{s'}) + \sum\limits_{t,t' \in J} \mu_t y_t \mu_{t'} y_{t'} K(x_t, x_{t'}) \\
& \quad + 2 \sum\limits_{s \in I, t \in J} \lambda_s y_s \mu_t y_t K(x_s, x_t) + C \sum\limits_{i \in I} \xi_i \\
\text{s.t.} & z_j \in \{0,1\}, \quad j \in J \\
& \hat{p}_\ell \ge p^*_{0\ell}, \quad \ell \in L \\
& y_i \Big( \sum\limits_{s \in I} \lambda_s y_s K(x_s, x_i) + \sum\limits_{t \in J} \mu_t y_t K(x_t, x_i) + \beta \Big) \ge 1 - \xi_i, \quad i \in I \\
& y_j \Big( \sum\limits_{s \in I} \lambda_s y_s K(x_s, x_j) + \sum\limits_{t \in J} \mu_t y_t K(x_t, x_j) + \beta \Big) \ge 1 - M(1 - z_j), \quad j \in J \\
& \xi_i \ge 0, \quad i \in I \\
& \sum\limits_{i \in I} \lambda_i y_i + \sum\limits_{j \in J} \mu_j y_j = 0 \\
& 0 \le \lambda_i \le C/2, \quad i \in I \\
& 0 \le \mu_j \le M z_j, \quad j \in J.
\end{array} \qquad (\mathrm{CSVM})
$$
Here K : R^m × R^m → R is a kernel function and (λ, μ) are the usual variables
of the dual formulation of the SVM.
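The score appearing in the dual constraints above is a kernel expansion over the training and anchor points; a minimal sketch of its evaluation (function names ours):

```python
def kernel_score(x, sv_I, sv_J, beta, K):
    """Score of a point x under the dual (CSVM) classifier.

    sv_I holds triples (lambda_s, y_s, x_s) for training points and
    sv_J holds (mu_t, y_t, x_t) for anchor points; the score is the
    weighted sum of kernel values plus the intercept beta.
    """
    s = sum(lam * y * K(xs, x) for lam, y, xs in sv_I)
    s += sum(mu * y * K(xt, x) for mu, y, xt in sv_J)
    return s + beta

# Toy check with a linear kernel and two symmetric support vectors:
linear = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
sv_I = [(0.5, +1, [1.0, 0.0]), (0.5, -1, [-1.0, 0.0])]
print(kernel_score([2.0, 0.0], sv_I, [], 0.0, linear))  # -> 2.0
```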
2.3 Solving the CSVM
In this section we give details about the complexity of our problem as for-
mulated in (CSVM). The problem belongs to the class of MIQP problems,
and thus it can be addressed by standard mixed integer quadratic optimiza-
tion solvers. In particular, the solver Gurobi (Gurobi Optimization 2016) and
its Python language interface (Van Rossum and Drake 2011) have been used
in our numerical experiments. In contrast to the standard SVM formulation,
which is a continuous quadratic problem, the CSVM is harder to solve due
to the presence of binary variables. Hence, the optimal solution may not be
found in a short period of time; however, as discussed in our numerical expe-
rience, good results are obtained when the problems are solved heuristically
by imposing a short time limit to the solver.
Performance constraints (9) may define an infeasible problem since the values
of the p^*_{0ℓ} may be unattainable in practice. Hence, the study of the feasibility
of Problem (CSVM) is an important issue. As an example, consider data
composed of two different classes, each one represented respectively by black
and white dots in the top picture in Figure 1. If the optimization problem for
the linear kernel SVM is solved, the resulting classifier is a hyperplane that
aims at separating both classes and maximizes the margin. An approximate
representation of the data and the classifier is shown in the middle panel in
Figure 1. If the aim is to correctly classify all the data corresponding to a
given class, it is intuitively easy to see that this objective can be reached by
moving the SVM hyperplane. In fact, it can be seen in the bottom picture
in Figure 1 how hyperplanes 1 and 2 classify correctly all white points, and
hyperplane 3 classifies all the black dots in the correct class. Among all those
hyperplanes, the SVM selects the one which maximizes the margin. So, intuitively,
it is evident that if just one performance constraint is imposed on
only one of the classes, the problem is always feasible. However, using the
data in Figure 1 again, as well as the linear kernel SVM, it is clear that it is
impossible to classify all the instances correctly at the same time; the
problem is then infeasible. However, there exist results, such as Theorem 5 in Burges
(1998), showing that the class of Mercer kernels for which K(x, x′) → 0 as
‖x − x′‖ → ∞, and for which K(x, x) is O(1), yields classifiers that achieve total
correct classification in both classes of the training sample, regardless of
how arbitrarily the data have been chosen. Thus, if a kernel satisfies the previous
conditions, then feasibility is guaranteed. In particular, the Radial Basis
Function (RBF) kernel meets these conditions. Therefore, to be on the safe side,
the imposed performance thresholds should either refer only
to the misclassification rates of one class (so that we can shift the variable β to make
the problem feasible), or a kernel should be used, such as the RBF, known to have a large
VC dimension (Burges 1998; Cristianini and Shawe-Taylor 2000), defined as
the number of training instances that can be classified correctly.
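The geometric argument above, that a single-class requirement can always be met by translating the hyperplane, can be sketched as a simple intercept shift (function name ours; this mirrors the "move β" step used later in Algorithm 1):

```python
def shift_beta_for_class(omega, beta, X, y, target=+1, margin=1.0):
    """Shift the SVM intercept beta so that every instance of the
    target class satisfies y_i * (omega^T x_i + beta) >= margin.

    A constraint on one class only is always attainable this way,
    possibly at the expense of the other class.
    """
    scores = [sum(w * xi for w, xi in zip(omega, x)) for x in X]
    deficits = [margin - yi * (s + beta)
                for s, yi in zip(scores, y) if yi == target]
    worst = max(deficits, default=0.0)
    if worst > 0:
        beta += target * worst  # translate the hyperplane just enough
    return beta

omega = [1.0]
X, y = [[0.0], [2.0], [-1.0]], [+1, +1, -1]
new_beta = shift_beta_for_class(omega, 0.0, X, y, target=+1)
print(new_beta)  # each positive instance now has margin >= 1
```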
[Figure 1 about here: three panels in coordinates (V1, V2) showing the two classes of data; the linear SVM hyperplane; and Hyperplanes 1, 2 and 3.]

Fig. 1 Study of feasibility and infeasibility of the CSVM.
3 Computational results
In this section we illustrate the performance of the CSVM compared to the
standard SVM, considered here as a benchmark. In order to do this comparison,
some of the performance measures presented in Section 2.2 are considered.
In particular, true positive and true negative rates will be used here. In what
follows, a description of the data, experiments and results is given.
3.1 Description of the experiments
The objective of this paper, as has been stated before, is to build a clas-
sifier whose performance can be controlled by means of some constraints,
as in Problem (CSVM). As explained in Section 2.1, if we want a performance
measure p to be greater than a value p_0 with a specified confidence
100(1 − α)%, we should use an estimator \hat{p} of p and impose it to be greater
than p^*_0 = p_0 + \sqrt{\log\alpha/(-2n)}, according to (4). From a practical
standpoint, this result will turn out to be crucial in our experiments.
Two experiments, both having the same structure, will be considered in this
paper. In each one, we will try to improve the performance of the classifier in
one of the classes, even though, as will be seen, the performance in the other
class may deteriorate. Hence, we will focus on the TPR and the TNR. Suppose that the
estimates of the TPR and the TNR obtained by the standard SVM are, respectively,
TPR_0 and TNR_0; if we want to enhance the performance, the aim will be
TPR ≥ TPR_0 + δ_1 and TNR ≥ TNR_0 + δ_2, respectively. In the considered
experiments we have set δ_1 = δ_2 = 0.025, although other values can also be
tested. Then, the two experiments are:
– Experiment 1: Impose TPR ≥ min{1, TPR_0 + 0.025} = p_0,
– Experiment 2: Impose TNR ≥ min{1, TNR_0 + 0.025} = p_0.
That is to say, taking α= 0.05, the constraints in the optimization problem
for these two different experiments turn out to be:
– Experiment 1: Impose $\widehat{\mathrm{TPR}} \ge \min\left\{1,\; \mathrm{TPR}_0 + \sqrt{\frac{\log 0.05}{-2n}} + 0.025\right\} = p^*_0$,
– Experiment 2: Impose $\widehat{\mathrm{TNR}} \ge \min\left\{1,\; \mathrm{TNR}_0 + \sqrt{\frac{\log 0.05}{-2n}} + 0.025\right\} = p^*_0$.
Although the description of the experiments is presented below, the skeleton
of the complete methodology can be found in Algorithm 1 for clarity. Now,
we shall discuss the experiments' design. First, in
all the experiments the time limit and the Mvalue in Problem (CSVM) were
set, respectively, equal to 300 seconds and 100. The selection of these values
is due to the following facts. The time limit should not be too small, since
we should give the optimizer enough time to solve the problem. On the other
hand, the time limit should not be too large if one wants to have reasonable
running times. In the case of the parameter M, if a small value is considered,
there may be many discarded hyperplanes, maybe including the optimal one.
However, if M is too big, it might cause numerical troubles (Camm et al.
Algorithm 1: Pseudocode for CSVM
1  Split data D into folds subsets, D = {D_1, ..., D_folds}.
2  for kf = 1, ..., folds do
3      Set Validation = D_kf and I = D − {D_kf}.
4      for each pair (C, γ) in the grid ({2^(−5:5)}, {2^(−5:5)}) do
5          Split D − {D_kf} = D* into folds2 subsets, D* = {D*_1, ..., D*_folds2}.
6          for kf2 = 1, ..., folds2 do
7              Set Validation* = D*_kf2 and I* ∪ J* = D* − {D*_kf2}.
8              Run the standard SVM over I* ∪ J*.
9              Move β of the SVM until the instances are correctly classified.
10             Run problem CSVM over I*, J* with initial solutions from before.
11             Validate over Validation*, getting the accuracy (ACC[kf2]).
12         end
13         Calculate the average accuracy (Σ_kf2 ACC[kf2]) / folds2 = \bar{ACC}.
14         if \bar{ACC} ≥ bestACC then
15             Set bestACC = \bar{ACC}, bestγ = γ and bestC = C.
16         end
17     end
18     Run the standard SVM over I ∪ J with the parameters bestγ and bestC.
19     Move β of the SVM until the instances are correctly classified.
20     Run problem CSVM over I, J with initial solutions from the previous step.
21     Validate over Validation, getting the correct classification probabilities (TPR[kf], TNR[kf]).
22 end
23 Calculate the average values of TPR and TNR.

1990). A compromise solution is obtained by considering M = 100, which is
shown to be a good value in our numerical experiments.
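The role of M in constraint (7) can be checked directly: with z_j = 1 the anchor point must meet the margin, while z_j = 0 deactivates the constraint for any sufficiently large M. A small sketch (function name ours):

```python
def anchor_constraint_ok(score_j, y_j, z_j, M=100.0):
    """Check constraint (7): y_j * score_j >= 1 - M * (1 - z_j).

    With z_j = 1 the anchor point must be correctly classified with
    margin; with z_j = 0 the constraint is inactive for large M.
    """
    return y_j * score_j >= 1.0 - M * (1.0 - z_j)

print(anchor_constraint_ok(1.2, +1, 1))   # margin satisfied -> True
print(anchor_constraint_ok(-0.5, +1, 1))  # violated when z_j = 1 -> False
print(anchor_constraint_ok(-0.5, +1, 0))  # z_j = 0 deactivates it -> True
```

Too small an M would keep the right-hand side binding even when z_j = 0, which is exactly how valid hyperplanes get wrongly discarded.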
Second, one of the most popular kernels K(x, x0) in literature, and the
one considered in this paper, is the well-known RBF kernel (Cristianini and
Shawe-Taylor 2000; Hastie et al. 2001; Hsu et al. 2003; Smola and Schölkopf
2004; Horn et al. 2016), given by

$$
K(x, x') = \exp(-\gamma \|x - x'\|^2),
$$

where γ > 0 is a parameter to be tuned. However, the approach presented in
this paper is valid for arbitrary kernels.
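The RBF kernel is a one-liner; note that K(x, x) = 1 and K(x, x′) → 0 as the points move apart, which are exactly the conditions invoked in Section 2.3 for feasibility. A sketch (function name ours):

```python
import math

def rbf_kernel(x, xp, gamma):
    """RBF (Gaussian) kernel K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([0.0, 0.0], [0.0, 0.0], 1.0))              # K(x, x) = 1.0
print(round(rbf_kernel([1.0, 0.0], [0.0, 0.0], 1.0), 4))    # exp(-1) -> 0.3679
```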
The estimation of the performance for our classifier is based on a 10-fold
cross validation (CV) (Kohavi et al. 1995) as follows. Note that, apart from
tuning γ, the regularization parameter Cintroduced in Section 1 also needs
to be tuned. In addition, for a given pair of parameters (C, γ), the process
consists mainly of solving a standard SVM using all the instances (I ∪ J),
and collecting the values of λ (from the dual formulation of the SVM) as well as
the value of β. Once the SVM is solved, and with the purpose of providing an
initial solution for the CSVM, the value of β is slightly changed (maintaining
the values of the λ's fixed) until the desired number of well-classified instances is
reached. Then, the values of β and the λ's obtained are set as initial solutions for
the CSVM. In addition, depending on whether each instance in J is well classified
or not, its z value is initialized to 1 or 0, respectively, for the CSVM.
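The warm-start initialization of the binary variables follows directly from the definition of z_j (a sketch; names are ours):

```python
def warm_start_z(scores, y_anchor):
    """Initial values for the binary variables z_j: 1 if anchor point j
    is well classified by the (shifted) SVM score, 0 otherwise.
    """
    return [1 if yj * sj > 0 else 0 for sj, yj in zip(scores, y_anchor)]

print(warm_start_z([1.5, -0.2, -0.8], [+1, +1, -1]))  # -> [1, 0, 1]
```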
However, we should make the selection of the best pair (C, γ) in each of
the previous folds. In order to do that, a 10-fold CV as before is made for
each pair in a grid given by the 121 different combinations of C= 2(−5:5) and
γ= 2(−5:5). The general criterion used to select the best pair of parameters
is the accuracy. However, in cases where the datasets are severely unbalanced
in the class sizes, other performance measures that take into account
such imbalancedness, such as the G-mean (Tang et al. 2009) or the Youden index
(Bewick et al. 2004), would be preferable. Finally, the average values of TPR
and TNR obtained in the first CV, in addition to their standard deviations,
are calculated.
3.2 Data description
The performance, in terms of correct classification probabilities and accuracy,
is illustrated using 4 real-life datasets from the UCI repository (Lichman 2013).
In particular, the datasets used are wisconsin (Breast Cancer Wisconsin (Di-
agnostic) Data Set), australian (Statlog (Australian Credit Approval) Data
Set), votes (Congressional Voting Records Data Set) and german (Statlog
(German Credit Data) Data Set).
Details concerning the implementation of the CSVM for the real datasets
are shown in Table 2. The first column represents the number of features
composing each dataset; |Ω| and |Ω_+| represent, respectively, the size of each
dataset and the number of positive instances (majority class) in Ω. Finally,
the percentage of positive instances is compiled in the last column.

Name         V    |Ω|    |Ω+| (%)
wisconsin    30   569    357 (62.7%)
australian   14   690    383 (55.5%)
votes        16   435    267 (61.4%)
german       45   1000   700 (70%)

Table 2 Details concerning the implementation of the CSVM for the considered datasets.
Note that, prior to running the different experiments, the data have been
standardized; that is to say, each variable in all the 4 considered datasets has zero
mean and unit variance.
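The standardization applied per variable is the usual z-score (a sketch; names are ours):

```python
from statistics import mean, pstdev

def standardize(column):
    """Z-score standardization: zero mean, unit variance per variable."""
    m, s = mean(column), pstdev(column)
    return [(v - m) / s for v in column]

col = [2.0, 4.0, 6.0]
print([round(v, 4) for v in standardize(col)])  # -> [-1.2247, 0.0, 1.2247]
```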
3.3 Results
In this section we compare the performance of the strategy proposed to build
the CSVM classifier against that of the SVM classifier in terms of overall
classification accuracy, true positive rate (TPR) and true negative rate (TNR)
of the classifier. Note that, even though from Section 2.3 the problem is always
feasible using the training sample, it may happen that the desired performance
is not achieved in the validation sample.
Tables 3 and 4 report the results for the benchmark procedure, SVM, and
Name              SVM             CSVM
                  Mean    Std     Mean    Std
wisconsin   TPR   0.99    0.017   0.945   0.045
            TNR   0.948   0.049   0.965   0.037
australian  TPR   0.863   0.079   0.772   0.081
            TNR   0.83    0.071   0.903   0.05
votes       TPR   0.963   0.04    0.846   0.097
            TNR   0.951   0.031   0.978   0.038
german      TPR   0.905   0.036   0.791   0.063
            TNR   0.405   0.114   0.547   0.141

Table 3 TNR for the original SVM and the CSVM strategy
Name              SVM             CSVM
                  Mean    Std     Mean    Std
wisconsin   TPR   0.99    0.017   0.989   0.018
            TNR   0.948   0.049   0.856   0.153
australian  TPR   0.863   0.079   0.914   0.046
            TNR   0.83    0.071   0.692   0.086
votes       TPR   0.963   0.04    0.978   0.026
            TNR   0.951   0.031   0.922   0.04

Table 4 TPR for the original SVM and the CSVM strategy
those obtained when imposing a higher classification rate in a selected class, in
particular 0.025 additional points, according to the description of experiments
in Section 3.1. As an exception, in the case of german, a minimum value of
0.65 in the TNR will be imposed, in order to increase the low value obtained
under the standard SVM (0.405).
First, the results when constraints are imposed on the true negative rate
(TNR) are presented in Table 3. In the case of wisconsin, one can observe
that, although the TNR has been increased, an increment of 0.025 points was
not achieved. Nevertheless, an improvement is observed anyway. On the other hand,
in the case of australian, we have been able to increase the value of the TNR
by 0.073 points without significantly reducing the accuracy. A similar result is
obtained for votes, for which the increase is also greater than 0.025 points.
The results are not as good for german, due to the imbalancedness of its
class sizes. If, instead of the accuracy, the G-mean is used as the criterion
for tuning the values of (C, γ), then the results improve notably, as depicted
by Table 5. In fact, if TNR ≥ 0.7 is set instead of TNR ≥ 0.65, even better
results are obtained, as can be seen in Table 5, although without reaching the
imposed threshold.
Now, we shall discuss the results when constraints are imposed on the true
positive rate (TPR), depicted by Table 4. Here, in the case of wisconsin,
Name            SVM             CSVM             CSVM
                                (TNR ≥ 0.65)     (TNR ≥ 0.7)
                Mean    Std     Mean    Std      Mean    Std
german   TPR    0.905   0.036   0.668   0.111    0.683   0.073
         TNR    0.405   0.114   0.671   0.164    0.69    0.103

Table 5 TNR for the original SVM and the CSVM strategy in german, using the G-mean
the increase of 0.025 points is not obtained. In fact, instead
of an increase we observe a minor decrease. However, this is not a surprising
result, since the original TPR was very high (near perfect classification). On
the other hand, if we look at australian, an increase of about 0.05 points
has been reached, without losing too much overall performance, as before. In
addition, an increase can be observed in votes, too. However, in contrast to
what happened with the TNR, this increase is smaller than 0.025 points.
4 Conclusions
In this paper, we have proposed a new supervised learning SVM-based method,
the CSVM, which has been developed and evaluated. The classifier is built via a
MIQP problem, which has been solved using a standard and widely available
solver. Also, some theoretical foundations are given in order to formulate the
constraints that are added to modify the standard SVM (and hence build the
CSVM), and to guarantee that the performance measures fulfill the imposed
thresholds with high probability. The applicability of this cost-sensitive SVM
has been demonstrated by numerical experiments on benchmark datasets.
We conclude that it is possible to control the classification rates in one
class, possibly, but not necessarily, at the expense of the other class. This
contrasts sharply with the naive approach in which, once the SVM is solved,
its intercept is moved to enhance the positive rates in one class, necessarily
deteriorating the performance in the other class.
Although, for simplicity, all numerical results are presented with just one
performance constraint added, one constraint per class, as well as an overall
accuracy constraint, may be added in our approach. Also for simplicity, we addressed
here two-way data matrices and two-class problems; however, this approach
could be extended to the case of more complex data, such as multi-class problems
or multi-way arrays (Lyu et al. 2017), which are very common in biomedical
research. On the other hand, an alternative perspective for addressing the
SVM regularization is to consider different norms (Yao and Lee 2014).
Finally, another possible extension, which is under development, is to per-
form a feature selection which uses the proposed constraints in order to control
the misclassification costs. Such process is an essential step in tasks such as
high-dimensional microarray classification problems (Guo 2010).
Acknowledgements
This research is financed by Fundación BBVA, projects FQM329 and P11-
FQM-7603 (Junta de Andalucía, Andalucía) and MTM2015-65915-R (Ministerio
de Economía y Competitividad, Spain). The last three are cofunded with
EU ERDF funds. The authors are thankful for such support.
Appendix A: Derivation of the CSVM
In this section, the detailed steps to build the CSVM formulation are shown.
For that, suppose that we are given the linear model
$$
\begin{array}{rll}
\min\limits_{\omega,\beta,\xi,z} & \omega^\top \omega + C \sum_{i \in I} \xi_i & \\
\text{s.t.} & y_i(\omega^\top x_i + \beta) \ge 1 - \xi_i, & i \in I \\
& \xi_i \ge 0, & i \in I \\
& y_j(\omega^\top x_j + \beta) \ge 1 - M(1 - z_j), & j \in J \\
& z_j \in \{0,1\}, & j \in J \\
& \hat{p}_\ell \ge p^*_{0\ell}, & \ell \in L.
\end{array}
$$
Hence, the problem above can be rewritten as
$$
\begin{array}{ll@{\quad}l}
\min\limits_{z} & \min\limits_{\omega,\beta,\xi} \; \omega^\top\omega + C \sum_{i\in I}\xi_i & \\
\text{s.t. } z_j \in \{0,1\},\; j \in J & \text{s.t. } y_i(\omega^\top x_i + \beta) \ge 1 - \xi_i, & i \in I \\
\phantom{\text{s.t. }} \hat{p}_\ell \ge p^*_{0\ell},\; \ell \in L & \phantom{\text{s.t. }} y_j(\omega^\top x_j + \beta) \ge 1 - M(1 - z_j), & j \in J \\
& \phantom{\text{s.t. }} \xi_i \ge 0, & i \in I.
\end{array}
$$
The Karush–Kuhn–Tucker (KKT) conditions for the inner problem, assuming
z fixed, are given by

$$
\begin{aligned}
\omega &= \sum_{s \in I} \lambda_s y_s x_s + \sum_{t \in J} \mu_t y_t x_t \\
0 &= \sum_{s \in I} \lambda_s y_s + \sum_{t \in J} \mu_t y_t \\
0 &\le \lambda_s \le C/2, \quad s \in I \\
0 &\le \mu_t \le M z_t, \quad t \in J.
\end{aligned}
$$
Thus, substituting the previous expressions into the last optimization problem,
the partial dual of such a problem can be calculated, yielding

$$
\begin{array}{rl}
\min\limits_{z}\; \min\limits_{\lambda,\mu,\beta,\xi} & \Big(\sum\limits_{s\in I}\lambda_s y_s x_s + \sum\limits_{t\in J}\mu_t y_t x_t\Big)^{\!\top} \Big(\sum\limits_{s\in I}\lambda_s y_s x_s + \sum\limits_{t\in J}\mu_t y_t x_t\Big) + C\sum\limits_{i\in I}\xi_i \\
\text{s.t.} & z_j \in \{0,1\}, \quad j \in J \\
& \hat{p}_\ell \ge p^*_{0\ell}, \quad \ell \in L \\
& y_i\Big(\big(\sum\limits_{s\in I}\lambda_s y_s x_s + \sum\limits_{t\in J}\mu_t y_t x_t\big)^{\!\top} x_i + \beta\Big) \ge 1 - \xi_i, \quad i \in I \\
& y_j\Big(\big(\sum\limits_{s\in I}\lambda_s y_s x_s + \sum\limits_{t\in J}\mu_t y_t x_t\big)^{\!\top} x_j + \beta\Big) \ge 1 - M(1 - z_j), \quad j \in J \\
& \xi_i \ge 0, \quad i \in I \\
& \sum\limits_{i\in I}\lambda_i y_i + \sum\limits_{j\in J}\mu_j y_j = 0 \\
& 0 \le \lambda_i \le C/2, \quad i \in I \\
& 0 \le \mu_j \le M z_j, \quad j \in J.
\end{array}
$$

Finally, from the kernel trick, Problem (CSVM) is obtained.
References
Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multiclass to binary: A uni-
fying approach for margin classifiers. Journal of Machine Learning Research
1(Dec), 113–141 (2000)
Bewick, V., Cheek, L., Ball, J.: Statistics review 13: receiver operating char-
acteristic curves. Critical Care 8(6), 508–512 (2004)
Bonami, P., Biegler, L.T., Conn, A.R., Cornuéjols, G., Grossmann, I.E., Laird,
C.D., Lee, J., Lodi, A., Margot, F., Sawaya, N., Wächter, A.: An algorithmic
framework for convex mixed integer nonlinear programs. Discrete Optimization
5(2), 186–204 (2008). In Memory of George B. Dantzig
Bradford, J.P., Kunz, C., Kohavi, R., Brunk, C., Brodley, C.E.: Pruning deci-
sion trees with misclassification costs. In: Proceedings of the 10th European
Conference on Machine Learning, ECML ’98, pp. 131–136. Springer (1998)
Burer, S., Letchford, A.N.: Non-convex mixed-integer nonlinear programming:
A survey. Surveys in Operations Research and Management Science 17(2),
97–106 (2012)
Burges, C.J.: A Tutorial on Support Vector Machines for Pattern Recognition.
Data Mining and Knowledge Discovery 2(2), 121–167 (1998)
Camm, J.D., Raturi, A.S., Tsubakitani, S.: Cutting Big M Down to Size.
Interfaces 20(5), 61–66 (1990)
Carrizosa, E., Martin-Barragan, B., Romero Morales, D.: Multi-group Sup-
port Vector Machines with Measurement Costs: A Biobjective Approach.
Discrete Applied Mathematics 156(6), 950–966 (2008)
Carrizosa, E., Romero Morales, D.: Supervised Classification and Mathemati-
cal Optimization. Computers & Operations Research 40(1), 150–165 (2013)
Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines
and other kernel-based learning methods. Cambridge University Press, New
York, NY, USA (2000)
Datta, S., Das, S.: Near-Bayesian Support Vector Machines for imbalanced
data classification with equal or unequal misclassification costs. Neural Net-
works 70, 39–52 (2015)
Freitas, A., Costa-Pereira, A., Brazdil, P.: Cost-Sensitive Decision Trees Ap-
plied to Medical Data. In: Data Warehousing and Knowledge Discovery:
9th International Conference, DaWaK 2007, Regensburg Germany, Septem-
ber 3-7, 2007. Proceedings, pp. 303–312. Springer Berlin Heidelberg, Berlin,
Heidelberg (2007)
Guo, J.: Simultaneous variable selection and class fusion for high-dimensional
linear discriminant analysis. Biostatistics 11(4), 599–608 (2010)
Gurobi Optimization, I.: Gurobi Optimizer Reference Manual (2016). URL
http://www.gurobi.com
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning.
Springer Series in Statistics. Springer New York Inc., New York, NY, USA
(2001)
He, H., Ma, Y.: Imbalanced learning: foundations, algorithms, and applica-
tions. John Wiley & Sons, Inc. (2013)
Hoeffding, W.: Probability Inequalities for Sums of Bounded Random Vari-
ables. Journal of the American Statistical Association 58(301), 13–30 (1963)
Horn, D., Demircioğlu, A., Bischl, B., Glasmachers, T., Weihs, C.: A compar-
ative study on large scale kernelized support vector machines. Advances in
Data Analysis and Classification (2016)
Hsu, C.W., Chang, C.C., Lin, C.J., et al.: A practical guide to support vector
classification. Tech. rep., Department of Computer Science, National Taiwan
University (2003)
Kohavi, R., et al.: A study of cross-validation and bootstrap for accuracy
estimation and model selection. In: IJCAI, vol. 14, pp. 1137–1145. Stanford,
CA (1995)
Lichman, M.: UCI Machine Learning Repository (2013)
Lyu, T., Lock, E.F., Eberly, L.E.: Discriminating sample groups with multi-
way data. Biostatistics (2017)
Maldonado, S., Pérez, J., Bravo, C.: Cost-based feature selection for support
vector machines: An application in credit scoring. European Journal of
Operational Research 261(2), 656–665 (2017)
Mercer, J.: Functions of positive and negative type, and their connection with
the theory of integral equations. Philosophical Transactions of the Royal
Society of London, Series A 209, 415–446 (1909)
Prati, R.C., Batista, G.E., Silva, D.F.: Class imbalance revisited: a new exper-
imental setup to assess the performance of treatment methods. Knowledge
and Information Systems 45(1), 247–270 (2015)
Sánchez, B.N., Wu, M., Song, P.X.K., Wang, W.: Study design in high-
dimensional classification analysis. Biostatistics 17(4), 722 (2016)
Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Statistics
and Computing 14(3), 199–222 (2004)
Tang, Y., Zhang, Y.Q., Chawla, N.V., Krasser, S.: SVMs Modeling for Highly
Imbalanced Classification. IEEE Transactions on Systems, Man, and Cybernetics,
Part B (Cybernetics) 39(1), 281–288 (2009)
Van Rossum, G., Drake, F.L.: An Introduction to Python. Network Theory
Ltd. (2011)
Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag New
York, Inc., New York, NY, USA (1995)
Vapnik, V.N.: Statistical learning theory, vol. 1. Wiley New York, 1 ed. (1998)
Yao, Y., Lee, Y.: Another look at linear programming for feature selection via
methods of regularization. Statistics and Computing 24(5), 885–905 (2014)