

On Support Vector Machines under a multiple-cost scenario

Sandra Benítez-Peña · Rafael Blanquero · Emilio Carrizosa · Pepa Ramírez-Cobo

Received: date / Accepted: date

Abstract Support Vector Machine (SVM) is a powerful tool in binary classification, known to attain excellent misclassification rates. On the other hand, many real-world classification problems, such as those found in medical diagnosis, churn or fraud prediction, involve misclassification costs which may differ between the classes. However, it may be hard for the user to provide precise values for such misclassification costs, whereas it may be much easier to identify acceptable misclassification rates. In this paper we propose a novel SVM model in which misclassification costs are considered by incorporating performance constraints in the problem formulation. Specifically, our aim is to seek the hyperplane with maximal margin yielding misclassification rates below given threshold values. Such a maximal-margin hyperplane is obtained by solving a quadratic convex problem with linear constraints and integer variables. The reported numerical experience shows that our model gives the user control over the misclassification rate in one class (possibly at the expense of an increase in the misclassification rate of the other class) and is feasible in terms of running times.

S. Benítez-Peña
IMUS, Instituto de Matemáticas de la Universidad de Sevilla.
Departamento de Estadística e Investigación Operativa, Universidad de Sevilla. Spain.
Tel.: +34-955420861
E-mail: sbenitez1@us.es

R. Blanquero
IMUS, Instituto de Matemáticas de la Universidad de Sevilla.
Departamento de Estadística e Investigación Operativa, Universidad de Sevilla. Spain.

E. Carrizosa
IMUS, Instituto de Matemáticas de la Universidad de Sevilla.
Departamento de Estadística e Investigación Operativa, Universidad de Sevilla. Spain.

P. Ramírez-Cobo
IMUS, Instituto de Matemáticas de la Universidad de Sevilla.
Departamento de Estadística e Investigación Operativa, Universidad de Cádiz. Spain.



Keywords Constrained Classification · Misclassification costs · Mixed Integer Quadratic Programming · Sensitivity/Specificity trade-off · Support Vector Machines

Mathematics Subject Classification (2000) 62P99 · 90C11 · 90C30

1 Introduction

In supervised classification we are given a set $\Omega$ of individuals belonging to two or more different classes, and the final aim is to classify new objects whose class is unknown. Each object $i \in \Omega$ can be represented by a pair $(x_i, y_i)$, where $x_i \in \mathbb{R}^m$ is the so-called feature vector and $y_i \in \mathcal{C}$ is the class membership of object $i$.

A state-of-the-art method in supervised classification is the support vector machine (SVM), see Vapnik (1995, 1998); Cristianini and Shawe-Taylor (2000); Carrizosa and Romero Morales (2013). In its basic version, SVM addresses two-class problems, i.e., $\mathcal{C}$ has two elements, say $\mathcal{C} = \{-1, +1\}$. The SVM aims at separating both classes by means of a linear classifier, $\omega^\top x + \beta = 0$, where $\omega$ is the score vector. We will assume throughout this paper that $\mathcal{C} = \{-1, +1\}$ and refer the reader to, e.g., Allwein et al. (2000) for the reduction of multiclass problems to this case.

The SVM classifier is obtained by solving the following convex quadratic programming (QP) formulation with linear constraints:
$$
\begin{array}{rll}
\displaystyle\min_{\omega,\beta,\xi} & \omega^\top\omega + C\displaystyle\sum_{i\in I}\xi_i & \\
\text{s.t.} & y_i(\omega^\top x_i+\beta)\geq 1-\xi_i, & i\in I\\
& \xi_i\geq 0, & i\in I,
\end{array}
$$
where $I$ represents the set of training data, $\xi_i\geq 0$ are artificial variables which allow data points to be misclassified, and $C>0$ is a regularization parameter to be tuned that controls the trade-off between margin maximization and misclassification errors. Given an object $i$, it is classified in the positive or the negative class according to the sign of the so-called score function, $\mathrm{sign}(\omega^\top x_i+\beta)$, while in the case $\omega^\top x_i+\beta=0$ the object is classified randomly.
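As an illustration, the QP above maps almost verbatim onto a quadratic programming solver. The following minimal sketch uses the Gurobi Python interface (the solver employed later in this paper); the function name, data layout and default value of $C$ are illustrative assumptions, not part of the formulation above.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def soft_margin_svm(X, y, C=1.0):
    """Linear soft-margin SVM solved as a convex QP (illustrative sketch).
    X: (n, m) feature matrix, y: labels in {-1, +1}."""
    n, m = X.shape
    y = [float(v) for v in y]                      # plain floats for the modeling layer
    mdl = gp.Model("svm")
    w = mdl.addVars(m, lb=-GRB.INFINITY, name="w")
    beta = mdl.addVar(lb=-GRB.INFINITY, name="beta")
    xi = mdl.addVars(n, lb=0.0, name="xi")
    # Objective: w'w + C * sum(xi)
    mdl.setObjective(gp.quicksum(w[k] * w[k] for k in range(m)) + C * xi.sum(),
                     GRB.MINIMIZE)
    # Margin constraints: y_i (w'x_i + beta) >= 1 - xi_i
    for i in range(n):
        mdl.addConstr(y[i] * (gp.quicksum(w[k] * X[i, k] for k in range(m)) + beta)
                      >= 1 - xi[i])
    mdl.optimize()
    return np.array([w[k].X for k in range(m)]), beta.X
```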

As mentioned, the goal in supervised classification is to classify objects in the correct class. However, ignoring imbalancedness (either in the class sizes or in the misclassification cost structure) or other costs may have dramatic consequences in the classification task, see Carrizosa et al. (2008); He and Ma (2013); Prati et al. (2015); Maldonado et al. (2017). For instance, in clinical databases there are usually more observations from healthy populations than from the disease cases, so smaller classification errors are obtained for the former. For example, for the well-known Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI repository (Lichman 2013), the number of sick cases (212) is smaller than the number of control cases (357). If a standard SVM is used for classifying the dataset, then the obtained rates (average values according to a 10-fold cross-validation approach) are depicted in Table 1. Even though both rates are high, it might be of interest to increase the accuracy on the cancer samples. This problem will be addressed in this paper.

                                   Mean    Std
% benign instances well class.     99%     1.7
% malign instances well class.     94.8%   4.9

Table 1  Performance of SVM in wisconsin. Average values and standard deviations computed from 10 realizations.

In order to cope with imbalancedness, different methods have been suggested, see Bradford et al. (1998); Freitas et al. (2007); Carrizosa et al. (2008); Datta and Das (2015). Those methods are based, among others, on adding parameters or adapting the classifier construction. For example, in Carrizosa et al. (2008) a biobjective problem is formulated in which the misclassification rate (via the maximization of the margin) and the measurement costs are simultaneously minimized.

In this paper a new formulation of the SVM is presented, in such a way that the focus is not only on the minimization of the overall misclassification rate but also on the performance of the classifier in the two classes. In order to do that, novel constraints are added to the SVM formulation. The keystone of the new model is its ability to achieve tighter control over misclassification than previously existing models. The proposed methodology will be called Constrained Support Vector Machine (CSVM) and the resulting classification technique will be referred to as the CSVM classifier.

The remainder of this paper is structured as follows. In Section 2, the CSVM is formulated and details concerning its motivation, feasibility and solution are given. Section 3 aims to illustrate the performance of the new classifier; an in-depth description of the experiments' design and the real datasets tested, as well as the obtained results, is given there. The paper ends with some concluding remarks and possible extensions in Section 4.

2 Constrained Support Vector Machines

In this section the Constrained Support Vector Machine (CSVM) model is

formulated as a Mixed Integer Nonlinear Programming (MINLP) problem

(Bonami et al. 2008; Burer and Letchford 2012), speciﬁcally in terms of a

Mixed Integer Quadratic Programming (MIQP) problem.


This section is structured as follows. In Section 2.1 some theoretical foundations that motivate the novel constraints are given. Then, in Section 2.2 the formulation of the CSVM is presented; we depart from the linear kernel case and later extend it to the general kernel case via the kernel trick. Finally, in Section 2.3, some issues about the CSVM formulation, such as its feasibility, are discussed.

2.1 Theoretical Motivation

As commented before, the aim of this work is to build a classifier so that the user may have control over the performance in the two classes. Specifically, given a set $\Omega = \{(x_i, y_i)\}_i$ of data (a random sample of a vector $(X, Y)$ with unknown distribution), the target is to obtain a classifier such that $p \geq p_0$, where $p$ is the value of a performance measurement and $p_0$ is a threshold chosen by the user. The performance measures $p$ to be considered in this paper are the sensitivity or true positive rate (TPR), the specificity or true negative rate (TNR) and the accuracy (ACC), given by:
$$
\begin{aligned}
\text{TPR}:\quad & p = P(\omega^\top X+\beta>0 \mid Y=+1)\\
\text{TNR}:\quad & p = P(\omega^\top X+\beta<0 \mid Y=-1) \qquad (1)\\
\text{ACC}:\quad & p = P(Y(\omega^\top X+\beta)>0).
\end{aligned}
$$
See, for example, Bewick et al. (2004).

If the random variable $Z$, defined as
$$
Z=\begin{cases}1, & \text{if an observation is well classified},\\ 0, & \text{otherwise},\end{cases}
$$
is considered, then the values of $p$ in (1), corresponding to the probability of correct classification, can be rewritten as
$$
\begin{aligned}
\text{TPR}:\quad & p = E[Z \mid Y=+1]\\
\text{TNR}:\quad & p = E[Z \mid Y=-1]\\
\text{ACC}:\quad & p = E[Z]
\end{aligned}
$$
and estimated from an independent and identically distributed (i.i.d.) sample $\{Z_i\}_{i\in S}$ by
$$
\begin{aligned}
\text{TPR}:\quad & \hat p = \bar Z_+ = \frac{\sum_{i\in S_+} Z_i}{|S_+|}\\
\text{TNR}:\quad & \hat p = \bar Z_- = \frac{\sum_{i\in S_-} Z_i}{|S_-|}\\
\text{ACC}:\quad & \hat p = \bar Z = \frac{\sum_{i\in S} Z_i}{|S|},
\end{aligned}
$$
where $S_+$ and $S_-$ denote, respectively, the subsets $\{i\in S: y_i=+1\}$ and $\{i\in S: y_i=-1\}$.
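As a side note, these estimators are simply class-conditional means of the 0/1 correctness indicator; a small sketch, with names of our own choosing, is:

```python
import numpy as np

def performance_estimates(y_true, y_pred):
    """Estimate TPR, TNR and ACC as means of the correctness indicator Z."""
    z = (y_true == y_pred).astype(float)        # realizations of Z
    pos, neg = (y_true == 1), (y_true == -1)
    return {"TPR": z[pos].mean(),               # mean of Z over S+
            "TNR": z[neg].mean(),               # mean of Z over S-
            "ACC": z.mean()}                    # mean of Z over S
```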


From a hypothesis testing viewpoint, our aim is to build a classifier such that, for a given sample, one can reject the null hypothesis in
$$
\begin{aligned}
H_0 &: p \leq p_0\\
H_1 &: p > p_0.
\end{aligned}
$$
Under the classic decision rule, $H_0$ is rejected if $\hat p \geq p_0^*$, assuming that $\alpha = P(\text{type I error})$. From the Hoeffding inequality (Hoeffding 1963),
$$
P(\hat p \geq p + c) \leq \exp(-2nc^2). \qquad (2)
$$
As $\alpha = P(\text{type I error}) = P(\hat p \geq p_0^* \mid p = p_0)$, substituting $p$ by $p_0$ in (2) yields
$$
P(\hat p < p_0 + c) \geq 1 - \exp(-2nc^2) = 1 - \alpha, \qquad (3)
$$
where $p_0 + c = p_0^*$. Therefore, we can take
$$
p_0^* = p_0 + \sqrt{\frac{\log\alpha}{-2n}}. \qquad (4)
$$
Note that $n$ equals $|S_+|$, $|S_-|$ or $|S|$, respectively, when considering the TPR, the TNR or the accuracy.
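In practice, (4) is a one-line computation. The following sketch (naming is ours) returns the adjusted threshold for a given $p_0$, sample size $n$ and significance level $\alpha$:

```python
import math

def adjusted_threshold(p0, n, alpha=0.05):
    """p0* = p0 + sqrt(log(alpha) / (-2 n)), as in (4)."""
    return p0 + math.sqrt(math.log(alpha) / (-2.0 * n))

# For instance, p0 = 0.95, n = 200 and alpha = 0.05 give roughly 1.037,
# a value that the experiments of Section 3 cap at 1 via the min{1, .} operator.
```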

2.2 CSVM formulation

In this section, the CSVM formulation is presented. As will be seen, the formulation includes novel performance constraints, which make the optimization problem an MIQP problem in terms of some integer variables.

We assume we are given a dataset with known labels. From such a set we identify the training set $I$, used to build the classifier, and the anchor set $J$, used to impose a lower bound on the classifier performance. These sets will be considered disjoint.

With the purpose of building the CSVM, the performance constraints will be formulated in terms of binary variables $\{z_j\}_{j\in J}$, which are realizations of the variable $Z$ in Section 2.1 and are defined as
$$
z_j=\begin{cases}1, & \text{if instance } j \text{ is well classified},\\ 0, & \text{otherwise.}\end{cases}
$$

In order to formulate the CSVM, novel constraints are added to the standard soft-margin SVM formulation as follows:
$$
\begin{array}{rlll}
\displaystyle\min_{\omega,\beta,\xi,z} & \omega^\top\omega + C\displaystyle\sum_{i\in I}\xi_i & & \\
\text{s.t.} & y_i(\omega^\top x_i+\beta)\geq 1-\xi_i, & i\in I & (5)\\
& \xi_i\geq 0, & i\in I & (6)\\
& y_j(\omega^\top x_j+\beta)\geq 1-M(1-z_j), & j\in J & (7)\\
& z_j\in\{0,1\}, & j\in J & (8)\\
& \hat p_\ell \geq p^*_{0\ell}, & \ell\in L. & (9)
\end{array}
\qquad (\mathrm{CSVM}_0)
$$


In the previous optimization problem, (5) and (6) are the usual constraints in the SVM formulation. Constraints (7) ensure that observations $j\in J$ with $z_j=1$ will be correctly classified, without imposing any restriction when $z_j=0$, provided that $M$ is big enough. A collection of requirements on the performance of the classifier over $J$ can be specified by means of (9); $L$ denotes the set of indices of the constraints of the form (9). These constraints can be modeled via the binary variables $z_j$, for instance:
$$
\begin{aligned}
\text{TPR}:\quad & \sum_{j\in J_+} z_j \geq p_0^*\,|J_+|\\
\text{TNR}:\quad & \sum_{j\in J_-} z_j \geq p_0^*\,|J_-|\\
\text{ACC}:\quad & \sum_{j\in J} z_j \geq p_0^*\,|J|,
\end{aligned}
$$
where $J_+$ and $J_-$ denote, respectively, the subsets $\{i\in J: y_i=+1\}$ and $\{i\in J: y_i=-1\}$. As usual in SVM methodology, a mapping into a higher-dimensional feature space may be considered, which allows us to transform this linear classification technique into a non-linear one. In this way we can address problems with a very large number of features, such as those encountered in personalized medicine (Sánchez et al. 2016). The various drawbacks that arise when considering this mapping can be avoided if the so-called kernel trick (Cristianini and Shawe-Taylor 2000), based on Mercer's theorem (Mercer 1909), is used. Therefore, by considering the (partial) dual problem of (CSVM$_0$) and the kernel trick, the general formulation of the CSVM is obtained as follows (the intermediate steps can be found in Appendix A):

$$
\begin{array}{rll}
\displaystyle\min_{\lambda,\mu,\beta,\xi,z} & \displaystyle\sum_{s,s'\in I}\lambda_s y_s \lambda_{s'} y_{s'} K(x_s,x_{s'}) + \sum_{t,t'\in J}\mu_t y_t \mu_{t'} y_{t'} K(x_t,x_{t'}) & \\
& \displaystyle{}+ 2\sum_{s\in I,\,t\in J}\lambda_s y_s \mu_t y_t K(x_s,x_t) + C\sum_{i\in I}\xi_i & \\
\text{s.t.} & z_j\in\{0,1\}, & j\in J\\
& \hat p_\ell \geq p^*_{0\ell}, & \ell\in L\\
& \displaystyle y_i\Big(\sum_{s\in I}\lambda_s y_s K(x_s,x_i) + \sum_{t\in J}\mu_t y_t K(x_t,x_i) + \beta\Big) \geq 1-\xi_i, & i\in I\\
& \displaystyle y_j\Big(\sum_{s\in I}\lambda_s y_s K(x_s,x_j) + \sum_{t\in J}\mu_t y_t K(x_t,x_j) + \beta\Big) \geq 1-M(1-z_j), & j\in J\\
& \xi_i \geq 0, & i\in I\\
& \displaystyle\sum_{i\in I}\lambda_i y_i + \sum_{j\in J}\mu_j y_j = 0 & \\
& 0 \leq \lambda_i \leq C/2, & i\in I\\
& 0 \leq \mu_j \leq M z_j, & j\in J.
\end{array}
\qquad (\mathrm{CSVM})
$$
Here $K:\mathbb{R}^m\times\mathbb{R}^m\to\mathbb{R}$ is a kernel function and $(\lambda,\mu)$ are the usual variables of the dual formulation of the SVM.
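To make the formulation concrete, the sketch below assembles Problem (CSVM) with the Gurobi Python interface mentioned in Section 2.3, for the RBF kernel and a single TPR constraint of type (9). The helper names, default parameter values (including $M=100$ and the 300-second time limit borrowed from Section 3.1) and the data layout are illustrative assumptions, not the authors' code.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def rbf_kernel(A, B, gamma):
    """RBF kernel matrix: K[a, b] = exp(-gamma * ||A[a] - B[b]||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def csvm(X_I, y_I, X_J, y_J, C=1.0, gamma=1.0, M=100.0, tpr_min=0.95, time_limit=300):
    """Kernelized CSVM as an MIQP with a TPR constraint on the anchor set J (sketch)."""
    nI, nJ = len(y_I), len(y_J)
    y_I = [float(v) for v in y_I]                          # labels in {-1, +1}
    y_J = [float(v) for v in y_J]
    K_II = rbf_kernel(X_I, X_I, gamma)
    K_JJ = rbf_kernel(X_J, X_J, gamma)
    K_IJ = rbf_kernel(X_I, X_J, gamma)

    mdl = gp.Model("CSVM")
    mdl.Params.TimeLimit = time_limit                      # heuristic cut-off
    lam = mdl.addVars(nI, lb=0.0, ub=C / 2.0, name="lambda")
    mu = mdl.addVars(nJ, lb=0.0, ub=M, name="mu")
    beta = mdl.addVar(lb=-GRB.INFINITY, name="beta")
    xi = mdl.addVars(nI, lb=0.0, name="xi")
    z = mdl.addVars(nJ, vtype=GRB.BINARY, name="z")

    # Kernelized quadratic objective plus C * sum of slacks
    obj = gp.quicksum(lam[s] * y_I[s] * lam[t] * y_I[t] * K_II[s, t]
                      for s in range(nI) for t in range(nI))
    obj += gp.quicksum(mu[s] * y_J[s] * mu[t] * y_J[t] * K_JJ[s, t]
                       for s in range(nJ) for t in range(nJ))
    obj += 2 * gp.quicksum(lam[s] * y_I[s] * mu[t] * y_J[t] * K_IJ[s, t]
                           for s in range(nI) for t in range(nJ))
    mdl.setObjective(obj + C * xi.sum(), GRB.MINIMIZE)

    def score(k_from_I, k_from_J):
        """Kernel expansion: sum_s lam_s y_s K(x_s, .) + sum_t mu_t y_t K(x_t, .) + beta."""
        return (gp.quicksum(lam[s] * y_I[s] * k_from_I[s] for s in range(nI))
                + gp.quicksum(mu[t] * y_J[t] * k_from_J[t] for t in range(nJ)) + beta)

    for i in range(nI):                                    # soft-margin constraints on I
        mdl.addConstr(y_I[i] * score(K_II[:, i], K_IJ[i, :]) >= 1 - xi[i])
    for j in range(nJ):                                    # big-M constraints on the anchor set J
        mdl.addConstr(y_J[j] * score(K_IJ[:, j], K_JJ[:, j]) >= 1 - M * (1 - z[j]))
        mdl.addConstr(mu[j] <= M * z[j])
    mdl.addConstr(gp.quicksum(lam[i] * y_I[i] for i in range(nI))
                  + gp.quicksum(mu[j] * y_J[j] for j in range(nJ)) == 0)

    # One performance constraint of type (9): TPR over J must reach tpr_min
    J_plus = [j for j in range(nJ) if y_J[j] == 1]
    mdl.addConstr(gp.quicksum(z[j] for j in J_plus) >= tpr_min * len(J_plus))

    mdl.optimize()
    return mdl
```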


2.3 Solving the CSVM

In this section we give details about the complexity of our problem as formulated in (CSVM). The problem belongs to the class of MIQP problems, and thus it can be addressed by standard mixed integer quadratic optimization solvers. In particular, the solver Gurobi (Gurobi Optimization 2016) and its Python interface (Van Rossum and Drake 2011) have been used in our numerical experiments. In contrast to the standard SVM formulation, which is a continuous quadratic problem, the CSVM is harder to solve due to the presence of binary variables. Hence, the optimal solution may not be found in a short period of time; however, as discussed in our numerical experience, good results are obtained when the problems are solved heuristically by imposing a short time limit on the solver.

Performance constraints (9) may define an infeasible problem, since the values of $p^*_{0\ell}$ may be unattainable in practice. Hence, the study of the feasibility of Problem (CSVM) is an important issue. As an example, consider data composed of two different classes, each one represented respectively by black and white dots in the top picture of Figure 1. If the optimization problem for the linear-kernel SVM is solved, the resulting classifier is a hyperplane that aims at separating both classes while maximizing the margin. An approximate representation of the data and the classifier is shown in the middle panel of Figure 1. If the aim is to correctly classify all the data corresponding to a given class, it is intuitively easy to see that this objective can be reached by moving the SVM hyperplane. In fact, it can be seen in the bottom picture of Figure 1 how hyperplanes 1 and 2 correctly classify all white points, and hyperplane 3 classifies all the black dots in the correct class. Among all those hyperplanes, the SVM selects the one which maximizes the margin. So, intuitively, it is evident that if just one performance constraint is imposed on only one of the classes, the problem is always feasible. However, using the data in Figure 1 again, as well as the linear-kernel SVM, it is clear that it is impossible to classify all the instances correctly at the same time; the problem is then infeasible. However, there exist results, such as Theorem 5 in Burges (1998), showing that Mercer kernels for which $K(x,x')\to 0$ as $\|x-x'\|\to\infty$, and for which $K(x,x)$ is $O(1)$, yield classifiers that correctly classify the whole training sample, in every class, regardless of how arbitrarily the data have been chosen. Thus, if a kernel satisfies the previous conditions, then feasibility is guaranteed. In particular, the Radial Basis Function (RBF) kernel meets these conditions. Therefore, to be on the safe side, if the imposed performance thresholds are not low, they should either refer only to one class's misclassification rate (so that we can shift the variable $\beta$ to make the problem feasible) or be combined with a kernel, such as the RBF, known to have large VC dimension (Burges 1998; Cristianini and Shawe-Taylor 2000), defined as the number of training instances that can be classified correctly.


[Figure 1: three panels with axes V1 and V2, showing the two classes of data points, the linear SVM hyperplane, and Hyperplanes 1, 2 and 3.]

Fig. 1  Study of feasibility and infeasibility of the CSVM.

3 Computational results

In this section we illustrate the performance of the CSVM compared with that of the standard SVM, considered here as a benchmark. In order to make this comparison, some of the performance measures presented in Section 2.2 are considered; in particular, true positive and true negative rates will be used here. In what follows, a description of the data, experiments and results is given.

3.1 Description of the experiments

The objective of this paper, as stated before, is to build a classifier whose performance can be controlled by means of some constraints, as in Problem (CSVM). As explained in Section 2.1, if we want a performance measurement $p$ to be greater than a value $p_0$ with a specified confidence $100(1-\alpha)\%$, we should use an estimator $\hat p$ of $p$ and impose it to be greater than $p_0^* = p_0 + \sqrt{\frac{\log\alpha}{-2n}}$, according to (4). From a practical standpoint, this result will turn out to be crucial in our experiments.

Two experiments, both having the same structure, will be considered in this paper. In each one, we will try to improve the performance of the classifier in one of the classes, even though, as will be seen, some damage may be produced in the other class. Hence, we will focus on the TPR and the TNR. Suppose that the estimates of the TPR and TNR obtained by the standard SVM are, respectively, $TPR_0$ and $TNR_0$; if we want to enhance the performance, the aim will be $TPR \geq TPR_0+\delta_1$ and $TNR \geq TNR_0+\delta_2$, respectively. In the considered experiments we have set $\delta_1=\delta_2=0.025$, although other values can also be tested. Then, the two experiments are:

– Experiment 1: Impose $TPR \geq \min\{1,\ TPR_0+0.025\} = p_0$,
– Experiment 2: Impose $TNR \geq \min\{1,\ TNR_0+0.025\} = p_0$.

That is to say, taking $\alpha=0.05$, the constraints in the optimization problem for these two different experiments turn out to be:

– Experiment 1: Impose $\widehat{TPR} \geq \min\Big\{1,\ TPR_0+\sqrt{\frac{\log 0.05}{-2n}}+0.025\Big\} = p_0^*$,
– Experiment 2: Impose $\widehat{TNR} \geq \min\Big\{1,\ TNR_0+\sqrt{\frac{\log 0.05}{-2n}}+0.025\Big\} = p_0^*$.

Although the description of the experiments is presented below, the skeleton of the complete methodology is summarized in Algorithm 1 for clarity. Now, we shall discuss the experiments' design. First, in all the experiments the time limit and the $M$ value in Problem (CSVM) were set equal to 300 seconds and 100, respectively. The selection of these values is due to the following facts: the time limit should not be too small, since the optimizer must be given enough time to solve the problem; on the other hand, it should not be too large if one wants to keep reasonable running times. In the case of the parameter $M$, if a small value is considered, many hyperplanes, possibly including the optimal one, may be discarded.

However, if $M$ is too big, it might cause numerical troubles (Camm et al. 1990). A compromise solution is obtained by considering $M=100$, which is shown to be a good value in our numerical experiments.

Algorithm 1: Pseudocode for CSVM

1.  Split the data D into folds subsets, D = {D_1, ..., D_folds}.
2.  for kf = 1, ..., folds do
3.      Set Validation = D_kf and I ∪ J = D \ {D_kf}.
4.      for each pair (C, γ) in the grid {2^(-5), ..., 2^5} x {2^(-5), ..., 2^5} do
5.          Split D* = D \ {D_kf} into folds2 subsets, D* = {D*_1, ..., D*_folds2}.
6.          for kf2 = 1, ..., folds2 do
7.              Set Validation* = D*_kf2 and I* ∪ J* = D* \ {D*_kf2}.
8.              Run the standard SVM over I* ∪ J*.
9.              Move β of the SVM until the desired number of instances is correctly classified.
10.             Run problem CSVM over I*, J* with the initial solutions from the previous step.
11.             Validate over Validation*, getting the accuracy ACC[kf2].
12.         end
13.         Calculate the average accuracy ACC_mean = (Σ_kf2 ACC[kf2]) / folds2.
14.         if ACC_mean ≥ bestACC then
15.             Set bestACC = ACC_mean, bestγ = γ and bestC = C.
16.         end
17.     end
18.     Run the standard SVM over I ∪ J with the parameters bestγ and bestC.
19.     Move β of the SVM until the desired number of instances is correctly classified.
20.     Run problem CSVM over I, J with the initial solutions from the previous step.
21.     Validate over Validation, getting the correct classification probabilities TPR[kf], TNR[kf].
22. end
23. Calculate the average values of TPR and TNR.


Second, one of the most popular kernels $K(x,x')$ in the literature, and the one considered in this paper, is the well-known RBF kernel (Cristianini and Shawe-Taylor 2000; Hastie et al. 2001; Hsu et al. 2003; Smola and Schölkopf 2004; Horn et al. 2016), given by
$$
K(x, x') = \exp\left(-\gamma\|x - x'\|^2\right), \qquad \gamma > 0,
$$
where $\gamma$ is a parameter to be tuned. However, the approach presented in this paper is valid for arbitrary kernels.

The estimation of the performance of our classifier is based on a 10-fold cross-validation (CV) (Kohavi et al. 1995), as follows. Note that, apart from tuning $\gamma$, the regularization parameter $C$ introduced in Section 1 also needs to be tuned. For a given pair of parameters $(C,\gamma)$, the process mainly consists of solving a standard SVM using all the instances ($I\cup J$) and collecting the values of $\lambda$ (from the dual formulation of the SVM) as well as the value of $\beta$. Once the SVM is solved, and with the purpose of providing an initial solution for the CSVM, the value of $\beta$ is slightly changed (keeping the values of the $\lambda$'s fixed) until the desired number of well-classified instances is reached. Then, the values of $\beta$ and the $\lambda$'s so obtained are set as initial solutions for the CSVM. In addition, depending on whether each instance in $J$ is well classified or not, we set its value of $z$ to 1 or 0, respectively, as an initial value for the CSVM.
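The $\beta$-shifting warm start just described can be sketched as follows; this is an illustration under our own naming, assuming the kernel scores of the anchor points (without the intercept) are available from the standard SVM solution.

```python
import numpy as np

def warm_start_beta(f_J, y_J, beta, target_class, required_correct, step=0.01):
    """Shift the SVM intercept (keeping the lambda's fixed) until at least
    `required_correct` anchor instances of `target_class` are well classified.
    f_J[j] holds the kernel score of anchor point j without the intercept."""
    mask = (y_J == target_class)
    assert required_correct <= mask.sum(), "threshold unattainable on this class"
    direction = 1.0 if target_class == 1 else -1.0   # raising beta favours the +1 class
    shift = 0.0
    while ((f_J[mask] + beta + shift) * y_J[mask] > 0).sum() < required_correct:
        shift += direction * step
    z0 = ((f_J + beta + shift) * y_J > 0).astype(int)  # initial z_j values for the CSVM
    return beta + shift, z0
```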


However, we still have to select the best pair $(C,\gamma)$ in each of the previous folds. In order to do that, a 10-fold CV as before is performed for each pair in a grid given by the 121 different combinations of $C=2^{-5},\ldots,2^{5}$ and $\gamma=2^{-5},\ldots,2^{5}$. The general criterion used to select the best pair of parameters is the accuracy. However, in cases where the datasets are severely unbalanced in the class sizes, other performance measurements which take such imbalancedness into account, such as the G-mean (Tang et al. 2009) or the Youden index (Bewick et al. 2004), would be preferable. Finally, the average values of TPR and TNR obtained in the first CV, in addition to their standard deviations, are calculated.
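For reference, the parameter grid and the G-mean criterion mentioned above can be written as follows; here the G-mean is taken as the geometric mean of TPR and TNR, and the names are ours.

```python
import math

# The 11 x 11 = 121 candidate pairs (C, gamma) = (2^k, 2^l), with k, l = -5, ..., 5
param_grid = [(2.0 ** k, 2.0 ** l) for k in range(-5, 6) for l in range(-5, 6)]

def g_mean(tpr, tnr):
    """G-mean criterion for imbalanced data: geometric mean of TPR and TNR."""
    return math.sqrt(tpr * tnr)
```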

3.2 Data description

The performance, in terms of correct classiﬁcation probabilities and accuracy,

is illustrated using 4 real-life datasets from the UCI repository (Lichman 2013).

In particular, the datasets used are wisconsin (Breast Cancer Wisconsin (Di-

agnostic) Data Set), australian (Statlog (Australian Credit Approval) Data

Set), votes (Congressional Voting Records Data Set) and german (Statlog

(German Credit Data) Data Set).

Details concerning the implementation of the CSVM for the real datasets are shown in Table 2. The column $V$ represents the number of features composing the set; $|\Omega|$ and $|\Omega_+|$ represent, respectively, the size of each dataset and the number of positive instances (majority class) in $\Omega$. Finally, the percentage of positive instances is given in the last column.

Name         V     |Ω|    |Ω+| (%)
wisconsin    30    569    357 (62.7%)
australian   14    690    383 (55.5%)
votes        16    435    267 (61.4%)
german       45    1000   700 (70%)

Table 2  Details concerning the implementation of the CSVM for the considered datasets.

Note that, prior to running the different experiments, the data have been standardized; that is to say, each variable in all 4 considered datasets has zero mean and unit variance.
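A minimal standardization step, assuming the feature matrix is stored as a NumPy array, could look like:

```python
import numpy as np

def standardize(X):
    """Center each feature to zero mean and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```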

3.3 Results

In this section we compare the performance of the strategy proposed to build the CSVM classifier against that of the SVM classifier, in terms of the overall classification accuracy, the true positive rate (TPR) and the true negative rate (TNR) of the classifier. Note that, even though by Section 2.3 the problem is always feasible on the training sample, it may happen that the desired performance


is not achieved in the validation sample.

Tables 3 and 4 report the results for the benchmark procedure, SVM, and those obtained when imposing a higher classification rate in a selected class, in particular 0.025 additional points, according to the description of the experiments in Section 3.1. As an exception, in the case of german, a minimum value of 0.65 for the TNR will be imposed, in order to increase the low value obtained under the standard SVM (0.405).

                   SVM             CSVM
Name              Mean    Std     Mean    Std
wisconsin   TPR   0.99    0.017   0.945   0.045
            TNR   0.948   0.049   0.965   0.037
australian  TPR   0.863   0.079   0.772   0.081
            TNR   0.83    0.071   0.903   0.05
votes       TPR   0.963   0.04    0.846   0.097
            TNR   0.951   0.031   0.978   0.038
german      TPR   0.905   0.036   0.791   0.063
            TNR   0.405   0.114   0.547   0.141

Table 3  TNR for the original SVM and the CSVM strategy

                   SVM             CSVM
Name              Mean    Std     Mean    Std
wisconsin   TPR   0.99    0.017   0.989   0.018
            TNR   0.948   0.049   0.856   0.153
australian  TPR   0.863   0.079   0.914   0.046
            TNR   0.83    0.071   0.692   0.086
votes       TPR   0.963   0.04    0.978   0.026
            TNR   0.951   0.031   0.922   0.04

Table 4  TPR for the original SVM and the CSVM strategy

First, the results when constraints are imposed on the true negative rate (TNR) are presented in Table 3. In the case of wisconsin, one can observe that, although the TNR has been increased, an increment of 0.025 points was not possible; the improvement is verified anyway. On the other hand, in the case of australian, we have been able to increase the value of the TNR by 0.073 points without significantly reducing the accuracy. A similar result is obtained for votes, for which the increase has also been larger than 0.025 points. The results are not as good for german, due to its imbalance in the class sizes. If, instead of the accuracy, the G-mean is used as the criterion for tuning the values of $(C,\gamma)$, then the results improve notably, as depicted in Table 5. In fact, if TNR $\geq 0.7$ is set instead of TNR $\geq 0.65$, even better results are obtained, as can be seen in Table 5, although without reaching the imposed threshold.

                   SVM             CSVM (TNR ≥ 0.65)    CSVM (TNR ≥ 0.7)
Name              Mean    Std     Mean    Std           Mean    Std
german      TPR   0.905   0.036   0.668   0.111         0.683   0.073
            TNR   0.405   0.114   0.671   0.164         0.69    0.103

Table 5  TNR for the original SVM and the CSVM strategy in german, using the G-mean

Now, we shall discuss the results when constraints are imposed on the true positive rate (TPR), reported in Table 4. Here, in the case of wisconsin,



the increase of 0.025 points is not obtained; in fact, instead of an increase we observe a minor decrease. However, this is not a surprising result, since the original TPR was already very high (near a perfect classification). On the other hand, if we look at australian, an increase of about 0.05 points has been reached, again without losing too much overall performance. In addition, an increase can also be observed for votes, although, in contrast to what happened with the TNR, this increase is not larger than 0.025 points.

4 Conclusions

In this paper we have proposed, developed and evaluated a new SVM-based supervised learning method, the CSVM. Such a classifier is built via an MIQP problem, which has been solved using a standard and widely available solver. In addition, some theoretical foundations are given in order to formulate the constraints that are added to the standard SVM (and hence build the CSVM) and to guarantee that the performance measurements will satisfy the imposed thresholds with high probability. The applicability of this cost-sensitive SVM has been demonstrated by numerical experiments on benchmark data sets.

We conclude that it is possible to control the classification rates in one class, possibly, but not necessarily, at the expense of the other class. This contrasts sharply with the naive approach in which, once the SVM is solved, its intercept is moved to enhance the positive rates in one class, necessarily deteriorating the performance in the other class.

Although, for simplicity, all numerical results are presented with just one performance constraint, one constraint per class, as well as an overall accuracy constraint, may be added in our approach. Also for simplicity, we addressed here two-way data matrices and two-class problems; however, this approach could be extended to more complex data such as multi-class problems or multi-way arrays (Lyu et al. 2017), which are very common in biomedical research. On the other hand, an alternative perspective for addressing the SVM regularization is to consider different norms (Yao and Lee 2014).

Finally, another possible extension, which is under development, is to perform a feature selection that uses the proposed constraints in order to control the misclassification costs. Such a process is an essential step in tasks such as high-dimensional microarray classification problems (Guo 2010).


Acknowledgements

This research is financed by Fundación BBVA, projects FQM329 and P11-FQM-7603 (Junta de Andalucía, Andalucía) and MTM2015-65915-R (Ministerio de Economía y Competitividad, Spain). The last three are cofunded with EU ERD Funds. The authors are thankful for such support.

Appendix A: Derivation of the CSVM

In this section, the detailed steps to build the CSVM formulation are shown. For that, suppose that we are given the linear model
$$
\begin{array}{rll}
\displaystyle\min_{\omega,\beta,\xi,z} & \omega^\top\omega + C\displaystyle\sum_{i\in I}\xi_i & \\
\text{s.t.} & y_i(\omega^\top x_i+\beta)\geq 1-\xi_i, & i\in I\\
& \xi_i\geq 0, & i\in I\\
& y_j(\omega^\top x_j+\beta)\geq 1-M(1-z_j), & j\in J\\
& z_j\in\{0,1\}, & j\in J\\
& \hat p_\ell\geq p^*_{0\ell}, & \ell\in L.
\end{array}
$$
Hence, the problem above can be rewritten as
$$
\begin{array}{ll}
\displaystyle\min_{z} & \displaystyle\min_{\omega,\beta,\xi}\ \omega^\top\omega + C\sum_{i\in I}\xi_i\\
\text{s.t. } z_j\in\{0,1\},\ j\in J & \text{s.t. } y_i(\omega^\top x_i+\beta)\geq 1-\xi_i,\ i\in I\\
\phantom{\text{s.t. }}\hat p_\ell\geq p^*_{0\ell},\ \ell\in L & \phantom{\text{s.t. }} y_j(\omega^\top x_j+\beta)\geq 1-M(1-z_j),\ j\in J\\
& \phantom{\text{s.t. }}\xi_i\geq 0,\ i\in I.
\end{array}
$$

The Karush-Kuhn-Tucker (KKT) conditions for the inner problem, assuming $z$ fixed, are given by
$$
\begin{aligned}
& \omega = \sum_{s\in I}\lambda_s y_s x_s + \sum_{t\in J}\mu_t y_t x_t\\
& 0 = \sum_{s\in I}\lambda_s y_s + \sum_{t\in J}\mu_t y_t\\
& 0 \leq \lambda_s \leq C/2, \quad s\in I\\
& 0 \leq \mu_t \leq M z_t, \quad t\in J.
\end{aligned}
$$

Thus, substituting the previous expressions into the last optimization problem, the partial dual of such a problem can be calculated, yielding
$$
\begin{array}{ll}
\displaystyle\min_{z}\ \min_{\lambda,\mu,\beta,\xi} & \displaystyle\Big(\sum_{s\in I}\lambda_s y_s x_s+\sum_{t\in J}\mu_t y_t x_t\Big)^{\top}\Big(\sum_{s\in I}\lambda_s y_s x_s+\sum_{t\in J}\mu_t y_t x_t\Big)+C\sum_{i\in I}\xi_i\\[1mm]
\text{s.t. } z_j\in\{0,1\},\ j\in J & \text{s.t. } \displaystyle y_i\Big(\Big(\sum_{s\in I}\lambda_s y_s x_s+\sum_{t\in J}\mu_t y_t x_t\Big)^{\top}x_i+\beta\Big)\geq 1-\xi_i,\ i\in I\\
\phantom{\text{s.t. }}\hat p_\ell\geq p^*_{0\ell},\ \ell\in L & \phantom{\text{s.t. }}\displaystyle y_j\Big(\Big(\sum_{s\in I}\lambda_s y_s x_s+\sum_{t\in J}\mu_t y_t x_t\Big)^{\top}x_j+\beta\Big)\geq 1-M(1-z_j),\ j\in J\\
& \phantom{\text{s.t. }}\xi_i\geq 0,\ i\in I\\
& \phantom{\text{s.t. }}\displaystyle\sum_{i\in I}\lambda_i y_i+\sum_{j\in J}\mu_j y_j=0\\
& \phantom{\text{s.t. }}0\leq\lambda_i\leq C/2,\ i\in I\\
& \phantom{\text{s.t. }}0\leq\mu_j\leq M z_j,\ j\in J.
\end{array}
$$
Finally, from the kernel trick, Problem (CSVM) is obtained.
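Concretely, the kernel trick amounts to replacing every inner product of feature vectors in the problem above by a kernel evaluation,
$$
x_s^\top x_{s'} \;\longmapsto\; K(x_s, x_{s'}), \qquad x_s^\top x_i \;\longmapsto\; K(x_s, x_i),
$$
which yields Problem (CSVM).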

References

Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research 1(Dec), 113-141 (2000)

Bewick, V., Cheek, L., Ball, J.: Statistics review 13: receiver operating characteristic curves. Critical Care 8(6), 508-512 (2004)

Bonami, P., Biegler, L.T., Conn, A.R., Cornuéjols, G., Grossmann, I.E., Laird, C.D., Lee, J., Lodi, A., Margot, F., Sawaya, N., Wächter, A.: An algorithmic framework for convex mixed integer nonlinear programs. Discrete Optimization 5(2), 186-204 (2008). In Memory of George B. Dantzig

Bradford, J.P., Kunz, C., Kohavi, R., Brunk, C., Brodley, C.E.: Pruning decision trees with misclassification costs. In: Proceedings of the 10th European Conference on Machine Learning, ECML '98, pp. 131-136. Springer (1998)

Burer, S., Letchford, A.N.: Non-convex mixed-integer nonlinear programming: A survey. Surveys in Operations Research and Management Science 17(2), 97-106 (2012)

Burges, C.J.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2(2), 121-167 (1998)

Camm, J.D., Raturi, A.S., Tsubakitani, S.: Cutting Big M Down to Size. Interfaces 20(5), 61-66 (1990)

Carrizosa, E., Martin-Barragan, B., Romero Morales, D.: Multi-group Support Vector Machines with Measurement Costs: A Biobjective Approach. Discrete Applied Mathematics 156(6), 950-966 (2008)

Carrizosa, E., Romero Morales, D.: Supervised Classification and Mathematical Optimization. Computers & Operations Research 40(1), 150-165 (2013)

Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, New York, NY, USA (2000)

Datta, S., Das, S.: Near-Bayesian Support Vector Machines for imbalanced data classification with equal or unequal misclassification costs. Neural Networks 70, 39-52 (2015)

Freitas, A., Costa-Pereira, A., Brazdil, P.: Cost-Sensitive Decision Trees Applied to Medical Data. In: Data Warehousing and Knowledge Discovery: 9th International Conference, DaWaK 2007, Regensburg, Germany, September 3-7, 2007. Proceedings, pp. 303-312. Springer Berlin Heidelberg, Berlin, Heidelberg (2007)

Guo, J.: Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis. Biostatistics 11(4), 599-608 (2010)

Gurobi Optimization, I.: Gurobi Optimizer Reference Manual (2016). URL http://www.gurobi.com

Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA (2001)

He, H., Ma, Y.: Imbalanced learning: foundations, algorithms, and applications. John Wiley & Sons, Inc. (2013)

Hoeffding, W.: Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association 58(301), 13-30 (1963)

Horn, D., Demircioğlu, A., Bischl, B., Glasmachers, T., Weihs, C.: A comparative study on large scale kernelized support vector machines. Advances in Data Analysis and Classification (2016)

Hsu, C.W., Chang, C.C., Lin, C.J., et al.: A practical guide to support vector classification. Tech. rep., Department of Computer Science, National Taiwan University (2003)

Kohavi, R., et al.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI, vol. 14, pp. 1137-1145. Stanford, CA (1995)

Lichman, M.: UCI Machine Learning Repository (2013)

Lyu, T., Lock, E.F., Eberly, L.E.: Discriminating sample groups with multi-way data. Biostatistics (2017)

Maldonado, S., Pérez, J., Bravo, C.: Cost-based feature selection for support vector machines: An application in credit scoring. European Journal of Operational Research 261(2), 656-665 (2017)

Mercer, J.: Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A 209, 415-446 (1909)

Prati, R.C., Batista, G.E., Silva, D.F.: Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowledge and Information Systems 45(1), 247-270 (2015)

Sánchez, B.N., Wu, M., Song, P.X.K., Wang, W.: Study design in high-dimensional classification analysis. Biostatistics 17(4), 722 (2016)

Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Statistics and Computing 14(3), 199-222 (2004)

Tang, Y., Zhang, Y.Q., Chawla, N.V., Krasser, S.: SVMs Modeling for Highly Imbalanced Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(1), 281-288 (2009)

Van Rossum, G., Drake, F.L.: An Introduction to Python. Network Theory Ltd. (2011)

Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA (1995)

Vapnik, V.N.: Statistical learning theory, vol. 1. Wiley, New York, 1 ed. (1998)

Yao, Y., Lee, Y.: Another look at linear programming for feature selection via methods of regularization. Statistics and Computing 24(5), 885-905 (2014)
