
Cost-sensitive probabilistic predictions for support vector machines

Sandra Benítez-Peña
Instituto de Matemáticas de la Universidad de Sevilla,
Department of Statistics and Operations Research, University of Sevilla

and

Rafael Blanquero
Instituto de Matemáticas de la Universidad de Sevilla,
Department of Statistics and Operations Research, University of Sevilla

and

Emilio Carrizosa
Instituto de Matemáticas de la Universidad de Sevilla,
Department of Statistics and Operations Research, University of Sevilla

and

Pepa Ramírez-Cobo
Department of Statistics and Operations Research, University of Cádiz,
Instituto de Matemáticas de la Universidad de Sevilla

Abstract

Support vector machines (SVMs) are among the most widely used and best studied machine learning models for two-class classification. Classification in SVM is based on a score procedure that yields a deterministic classification rule; this rule can be transformed into a probabilistic one (as implemented in off-the-shelf SVM libraries), but it is not probabilistic in nature. On the other hand, tuning the regularization parameters in SVM is known to require a high computational effort and generates pieces of information that are not fully exploited, since they are not used to build a probabilistic classification rule.

In this paper we propose a novel approach to generate probabilistic outputs for the SVM. The highlights of the paper are the following: first, the SVM method is designed to be cost-sensitive, so the different importance of sensitivity and specificity is readily accommodated in the model. Second, the SVM is embedded in an ensemble method that improves its performance by making use of the valuable information generated in the parameter tuning process. Finally, the probabilities are estimated via bootstrap estimates, avoiding the parametric models used by competing probability estimation methods for SVM. Numerical tests show the advantages of our approach, which yields results comparable to or better than those of benchmark procedures.

Keywords: Support Vector Machines, Probabilistic Classification, Cost-Sensitive Classification.

Preprint submitted to Statistical Analysis and Data Mining, December 9, 2020.

1. Introduction

Supervised classification is one of the most relevant tasks in Data Science. We are given a set Ω of individuals. Each element i ∈ Ω is represented by a pair (x_i, y_i), where x_i ∈ R^n is the attribute vector and y_i ∈ C is the class membership of object i. We only have class information in I ⊂ Ω, which is called the training sample. In its most basic version, the one considered in this paper, supervised classification addresses two-class problems, that is to say, C = {−1, 1}.

The support vector machine (SVM) is a powerful, state-of-the-art method in supervised classification that aims at separating both classes by means of a linear classifier, ω⊤x + β, where ω is the score vector. The SVM is addressed by solving the following convex quadratic programming (QP) formulation with linear constraints:

    min_{ω,β,ξ}  ω⊤ω + C Σ_{i∈I} ξ_i
    s.t.  y_i(ω⊤x_i + β) ≥ 1 − ξ_i,  i ∈ I,
          ξ_i ≥ 0,  i ∈ I,                        (1)

where the ξ_i ≥ 0 are the so-called slack variables, which allow data points to be misclassified, and C > 0 is a regularization parameter to be tuned, which controls the trade-off between margin maximization and misclassification errors; see e.g. Carrizosa and Morales (2013).

Given an object i with attribute vector x_i, the SVM algorithm produces a hard labeling: i is classified in the positive or the negative class according to the sign of f(x_i), where f(x) = ω⊤x + β is the score function. When an attribute vector x_0 is given, the value f(x_0) is called the score value of x_0. However, the SVM method does not produce probabilistic outputs such as posterior probabilities P(y = 1 | x), which are of interest whenever a measure of confidence in the predictions is sought; see Murphy (2012). This is of particular importance in several real-world applications, such as cancer screening or credit scoring, where the risks of a false-negative and a false-positive result are significantly different.
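To make the score-based hard labeling concrete, here is a minimal sketch; scikit-learn and the synthetic data are our own illustrative assumptions (the paper itself solves (1) with Gurobi):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data (made up for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
f = clf.decision_function(X)           # score values f(x) = w'x + b
hard_labels = np.where(f >= 0, 1, -1)  # hard labeling by the sign of f(x)
```

The hard labels coincide with `clf.predict(X)`; no measure of confidence accompanies them, which is precisely the limitation discussed above.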

Several attempts to obtain the posterior probabilities P(y = 1 | x) for SVM have been carried out from the score function f(x). One of them is based on assigning posterior class probabilities by means of a specific parametric family. For example, Wahba (1992) and Wahba et al. (1999) proposed a logistic link function,

    P(y = 1 | x) = 1 / (1 + exp(−f(x))).        (2)

Also, Vapnik and Vapnik (1998) suggested estimating P(y = 1 | x) in terms of a series of trigonometric functions, where the coefficients of the trigonometric expansion minimize a regularized functional. Another option is to fit Gaussians to the class-conditional densities P(f(x) | y = 1) and P(f(x) | y = −1), as proposed in Hastie and Tibshirani (1998). Following this argument, the posterior probability P(y = 1 | f(x)) is a sigmoid, whose slope is determined by the tied variance. One of the best-known heuristics to obtain probabilities is due to Platt (2000), which considers f(x) as the log-odds ratio log[P(y = 1 | x) / P(y = −1 | x)]. This implies that

    P(y = 1 | x) = 1 / (1 + exp(A f(x) + B)),        (3)

and A and B can be estimated by maximum likelihood on a validation set. This technique is implemented in well-known statistical packages, such as the ksvm() function in R (see Karatzoglou et al. (2006)) or predict_proba in scikit-learn in Python (Pedregosa et al. (2011)). However, this method has been criticized for failing to provide insight and for interpreting f(x) as a log-odds ratio, which may not be accurate for some datasets; see Murphy (2012); Tipping (2001); Franc et al. (2011). To illustrate this phenomenon, consider Figure 1, which shows the fit of the sigmoid function (3) to the empirical class probabilities of three different, well-referenced datasets: adult, wisconsin and diabetes, respectively (see Section 3.1). It can be seen that, while for the adult dataset the fit provided by the method given in (3) performs reasonably well, the performance is poor for the other two datasets.
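For reference, the Platt-style transformation (3) is available off the shelf; a minimal sketch with scikit-learn on made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# probability=True fits a sigmoid 1 / (1 + exp(A f(x) + B)) internally,
# estimating A and B on held-out folds (Platt scaling)
clf = SVC(kernel="linear", C=1.0, probability=True, random_state=0).fit(X, y)
scores = clf.decision_function(X)    # f(x)
probs = clf.predict_proba(X)[:, 1]   # Platt-style estimates of P(y = 1 | x)
```

Whether these probabilities are trustworthy depends on how well the sigmoid family fits the data, which is exactly the issue Figure 1 illustrates.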

Sollich (2002) considers a different probabilistic framework for SVM classification, based on Bayesian theory. In particular, the SVM kernel is related to the covariance function of a Gaussian process prior and, as a result, optimal values of the tuning parameter C and class probabilities are obtained in a natural way. However, all these previous works make assumptions that might not be satisfied by the data. Other approaches seeking probabilistic outputs are found in the literature, such as Seeger (2000), Kwok (1998, 1999) and Herbrich et al. (1999). Moreover, none of them considers the problem of unbalancedness, so they may not perform well on the minority class. Unlike previous works, Tao et al. (2005) consider the problem of estimating posterior class probabilities taking into account that the influence of different samples may be unbalanced, resulting in non-robust SVMs. Their approach, the posterior probability support vector machine (PPSVM), is distribution-free and weighs unbalanced training samples. However, none of the previously mentioned works addresses cost-sensitive problems, which are of crucial importance in many real-world applications. For example, in cancer detection there are usually few disease cases compared to healthy ones, yet smaller classification errors are desired for the former. As an illustration, consider the dataset cancer-colon (see Ridge (2002)), with two classes and where only 35.5% of the records correspond to sick people (positive class). Assume that the interest is to obtain a good estimate of the posterior probability of the positive class. Using the methods proposed by Platt (2000), Sollich (2002) and Tao et al. (2005), the mean squared error (MSE) values for the posterior class probabilities of interest were 0.241, 0.252 and 0.208, respectively. However, as will be shown, using the methodology proposed in this paper, these numerical values can decrease down to 0, maybe at the expense of damaging the prediction of

[Figure 1 about here: three panels titled "Adult dataset", "Wisconsin dataset" and "Diabetes dataset", each plotting P(y = 1 | f(x)) against the scores f(x).]

Figure 1: Fit (solid line) of the sigmoid function to the empirical class probabilities (dots) of the adult, wisconsin and diabetes datasets.

the posterior negative class probability. Another context where it is important to control the estimation error related to a specific class is credit scoring, where the individuals of interest (the defaulting ones) constitute the class with the smallest sample size. Such issues are addressed, from the classification context, in Bradford et al. (1998); Freitas et al. (2007); Carrizosa et al. (2008); Datta and Das (2015); Benítez-Peña et al. (2019), and recently, from a feature selection viewpoint, in Benítez-Peña et al. (2019).

A major contribution of this work is that the SVM is embedded in an ensemble method, leading, as shown later, to an improvement in performance. It is known that, in order to solve the SVM problem (1), a tuning process concerning the regularization parameter C needs to be performed. Traditionally, all the information resulting from this tuning procedure is discarded and only the best C value is used to build the classifier. Instead, in this work, the final posterior class probability estimate is a weighted mean of different posterior probabilities, each one related to a specific value of C. In addition, unlike Platt (2000) and Sollich (2002), here we propose a novel methodology that does not make use of parametric models based on the score function f(x) obtained after tuning the SVM parameters. Instead, we consider a bootstrap framework which, to the best of our knowledge, has not been addressed before for this type of problem. The use of bootstrap sampling allows us to obtain accurate values for the densities of the score values, which translates into a better prediction of the posterior class probability P(y = 1 | x).

The paper is structured as follows. First, in Section 2, our methodology is introduced: Section 2.1 describes how to integrate bootstrap sampling into the SVM to produce posterior class probability estimates, and Section 2.2 explains two different ways to obtain cost-sensitive probabilistic predictions. In Section 3 some experimental results are presented. In particular, several well-referenced datasets from biomedical contexts are analyzed; estimates of the posterior class probabilities under our methodology are compared to those obtained under benchmark approaches, and the posterior probabilities of the classes of interest are controlled via the two different approaches described in Section 2.2. Conclusions and further research can be found in Section 4.

2. Cost-sensitive predictive probabilities for SVM

In this section we present our methodology to obtain cost-sensitive posterior class probabilities for the SVM classifier. First, in Section 2.1 we explain how to integrate bootstrap sampling into the SVM to produce posterior class probability estimates P(y = 1 | x). Second, in Section 2.2 we describe two different approaches that allow us to control the posterior probability estimates for the class of interest.

2.1. SVM posterior class probabilities based on the bootstrap

Assume that we want to solve the SVM (1) to classify the observations in a dataset, e.g. the wisconsin dataset (see Section 3.1 for details concerning the database). In order to estimate the classification error, it is standard to consider a k-fold CV approach. Figure 2 shows the histogram (in absolute frequencies) of the score values under three different choices of k (k = 20, 100, 500) for a given, randomly chosen individual. It can be observed that, as k increases, the score values become less dispersed, a consequence of the fact that the different samples share more elements and thus yield more similar scores. In such a situation, it might not be possible to obtain accurate posterior class probabilities P(y = 1 | x), especially when the observations of the two groups strongly overlap. What is proposed in this paper is to replace the k-fold CV approach by a bootstrap sampling that allows us to avoid the degenerate behaviour observed in Figure 2. This is illustrated in Figure 3, which shows histograms analogous to those of Figure 2, but where a bootstrap sampling with B replications (B = 20, 100, 500) has been considered instead. The idea of using those values is just to illustrate the behavior of this method when increasing the sample size, in contrast with the one shown in Figure 2. Finally, the estimates for the posterior class probabilities P(y = 1 | x) can be obtained as the relative frequency of the positive (negative, in the case of P(y = −1 | x)) score values. This approach, given by Algorithm 1 and illustrated as a flowchart in Figure 4, is detailed next.

First, to carry out our procedure for obtaining probabilities, consider a complete dataset Ω composed of m instances, n variables, and 2 classes (1 or −1). The dataset is split into two samples: the training sample T (of size m_tr) and the validation sample V (of size m − m_tr); see Step 1 of Algorithm 1. As is usual in SVM implementations, a grid for the regularization parameter C needs to be set. Then, a matrix PX with as many rows as the number of instances in the validation sample (m − m_tr) and as many columns as the number of values of C to be used is built. This matrix contains the proportions of negative scores (illustrated in Table 1) and is generated through the algorithm as follows. Given a fixed value of C, and B bootstrap samples from the training sample T (denoted T*_b, b = 1, ..., B), the SVM is run over each bootstrap sample and validated both over the out-of-bag sample, i.e., the set of instances that are in T but not in the considered bootstrap sample (denoted V*_b, b = 1, ..., B), and over the validation sample V. In this way, we have validated twice: the first time (over the out-of-bag sample V*_b of the bootstrap procedure) to measure the performance yielded by the chosen value of C, and then over the validation set V, to estimate the posterior negative class probabilities conditioned to C, P(y_{V_i} = −1 | x, C), where V_i denotes the i-th instance in V. In this way, we obtain B score values for each instance in V. Such score values are recorded in a matrix PredictionV, with as many rows as the number of instances in the validation sample (m − m_tr) and B columns. Finally, we propose to estimate the posterior negative class probability P(y_{V_i} = −1 | x) as a weighted average of the estimates of P(y_{V_i} = −1 | x, C) when using different values of C (C_r, r = 1, ..., R), but taking into account only the values of C that lead to accuracies close to the best one. Let acc_{r,b} denote the standard accuracy in V*_b (instances correctly classified divided by the total number of instances), given the SVM built using the r-th value of C and the b-th bootstrap sample;

let acc_r denote the average value of the coefficients acc_{r,b}, namely

    acc_r = (1/B) Σ_{b=1}^{B} acc_{r,b}.

Only values of C yielding high estimates of acc_r are taken into consideration, the remaining ones being discarded. Those considered are stored in the set J, defined as J = {j : acc_j ≥ max_l acc_l − ε}, where ε > 0 is a fixed parameter. Finally, if weights w_j, j ∈ J, are defined according to

    w_j = acc_j² / Σ_{l∈J} acc_l²,

then the estimate of P(y_{V_i} = −1 | x) is:

    P(y_{V_i} = −1 | x) = Σ_{j∈J} w_j P(y_{V_i} = −1 | x, C_j).        (4)
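The construction of the set J and of the weights w_j in (4) can be sketched numerically; the accuracy values and the probability row below are made up for illustration:

```python
import numpy as np

# Hypothetical average accuracies acc_r for R = 4 values of C, and the
# corresponding row of PX for one validation instance (both made up)
acc = np.array([0.90, 0.92, 0.91, 0.80])
PX_row = np.array([0.30, 0.25, 0.28, 0.60])  # estimates of P(y_Vi = -1 | x, C_r)

eps = 0.025
J = acc >= acc.max() - eps                   # keep C values with near-best accuracy
w = acc[J] ** 2 / np.sum(acc[J] ** 2)        # weights w_j, summing to 1
p_neg = float(np.dot(w, PX_row[J]))          # estimate (4): P(y_Vi = -1 | x)
```

The low-accuracy value C_4 is discarded, so its outlying probability estimate 0.60 does not contaminate the final estimate.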

Split Ω into T and V: T ∪ V = Ω, T ∩ V = ∅.
Initialize PX as an empty matrix.
for each C in the grid C_r, r = 1, ..., R do
    Initialize PredictionV as an empty matrix.
    for b in 1, 2, ..., B do
        Create a bootstrap sample T*_b from T.
        Run the SVM over T*_b. Validate over V*_b = T \ T*_b, obtaining the accuracy acc_{r,b}.
        Obtain the score values for the instances in V, denoted scoresV.
        Insert scoresV as a new column in PredictionV.
    Estimate P(y = −1 | x, C) for each of the (m − m_tr) instances in V, as the proportion of negative scores in each of the (m − m_tr) rows of PredictionV. Insert these estimates as a new column in PX.
Using the values acc_{r,b}, compute the weighted average given by (4).

Algorithm 1: Pseudocode for the Bootstrap SVM
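A runnable sketch of Algorithm 1 under simplifying assumptions: scikit-learn's SVC stands in for formulation (1) (the paper solves it with Gurobi), and the data, grid and B are small and made up:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic two-class data standing in for a real dataset (illustrative only)
m, n = 200, 5
X = np.vstack([rng.normal(-1, 1, (m // 2, n)), rng.normal(1, 1, (m // 2, n))])
y = np.array([-1] * (m // 2) + [1] * (m // 2))

# Step 1: split Omega into training sample T and validation sample V
perm = rng.permutation(m)
tr, va = perm[:150], perm[150:]
XT, yT, XV = X[tr], y[tr], X[va]

C_grid = [2.0 ** k for k in range(-5, 6)]
B = 25                                   # bootstrap replications (small, for speed)
PX = np.empty((len(XV), len(C_grid)))    # estimates of P(y_Vi = -1 | x, C_r)
acc = np.empty((len(C_grid), B))         # acc_{r,b}

for r, C in enumerate(C_grid):
    predV = np.empty((len(XV), B))
    for b in range(B):
        idx = rng.integers(0, len(XT), len(XT))      # bootstrap sample T*_b
        oob = np.setdiff1d(np.arange(len(XT)), idx)  # out-of-bag sample V*_b
        clf = SVC(kernel="linear", C=C).fit(XT[idx], yT[idx])
        acc[r, b] = clf.score(XT[oob], yT[oob])      # accuracy on V*_b
        predV[:, b] = clf.decision_function(XV)      # score values for V
    PX[:, r] = (predV < 0).mean(axis=1)              # proportion of negative scores

# Weighted average (4) over the near-best values of C
acc_r = acc.mean(axis=1)
eps = 0.02
J = acc_r >= acc_r.max() - eps
w = acc_r[J] ** 2 / np.sum(acc_r[J] ** 2)
p_neg = PX[:, J] @ w                     # P(y_Vi = -1 | x) for each instance in V
```

Each column of PX is a bootstrap-based probability estimate conditioned to one value of C; the tuning information is thus reused rather than discarded.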

The methodology will be illustrated in Section 3.2, where comparisons with benchmark approaches will also be presented.

2.2. Control over the sensitivity measure

In the previous section, an approach for estimating the posterior class probabilities P(y = 1 | x) and P(y = −1 | x) has been described. In this section we deal with the issue of improving the sensitivity of the classifier (the proportion of positive instances that are correctly classified) which, as commented in Section 1, is a problem of interest in biomedical contexts, among others. To this end, we propose two different approaches, Ctrl1 and Ctrl2, which are discussed in what follows and empirically analyzed in Section 3.3.

[Figure 2 about here: three panels (a), (b), (c).]

Figure 2: Histogram of scores for a single instance when k-fold CV is used. Here, we set k = 20, 100 and 500, obtaining as many score values as the value of k.

[Figure 3 about here: three panels (a), (b), (c).]

Figure 3: Histogram of scores for a single instance when the bootstrap with B replications is used. As in the k-fold CV, B = 20, 100 and 500.

Figure 4: Flowchart of the bootstrap-based methodology

PX =
                C_1             C_2             ···     C_R
V_1             PX_{1,1}        PX_{1,2}        ···     PX_{1,R}
V_2             PX_{2,1}        PX_{2,2}        ···     PX_{2,R}
⋮               ⋮               ⋮               ⋱       ⋮
V_{m−m_tr}      PX_{m−m_tr,1}   PX_{m−m_tr,2}   ···     PX_{m−m_tr,R}

Table 1: Probabilities P(y_{V_i} = −1 | x, C)

Method Ctrl1 is based on the fact that the sensitivity measure can be controlled through the posterior class probabilities, as explained next. In Algorithm 1, the posterior negative class probabilities have been estimated as the proportion of negative scores. However, if instead of 0 we consider a different threshold (say a value −a, with a positive), then the estimates of P(y_{V_i} = −1 | x, C) decrease (that is, the posterior positive class probabilities P(y_{V_i} = 1 | x, C) increase). This is illustrated by two examples in Figures 5(a) and 5(b), which represent the histograms of the scores for two different individuals of the wisconsin dataset. Figure 5(a) shows the posterior positive class probability estimates using Algorithm 1, that is, where the value 0 is used as the threshold to classify in the positive or the negative class. In Figure 5(b), the threshold value has been moved to the left and, as a consequence, the resulting estimates have increased. Note from Figure 5(b) that, with this approach, the probability of an instance belonging to the positive class may change from below to above 0.5. In practice, in order to obtain a desired posterior positive class probability estimate, the threshold is moved until a certain proportion of the instances of the positive class are correctly classified.
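Ctrl1 amounts to replacing the zero threshold on the bootstrap scores of an instance by −a; a minimal sketch on made-up scores:

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.normal(loc=-0.2, scale=1.0, size=500)  # B = 500 bootstrap scores (made up)

p_neg = np.mean(scores < 0)         # Algorithm 1: proportion of negative scores
a = 0.5
p_neg_ctrl1 = np.mean(scores < -a)  # Ctrl1: threshold moved left to -a

# The negative-class estimate can only decrease, i.e. the positive-class
# estimate P(y = 1 | x) = 1 - P(y = -1 | x) increases
```

In practice a is increased until the desired proportion of positive instances is correctly classified.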

Method Ctrl2 also results from Algorithm 1 but, instead of changing the threshold for the scores, we consider a different classifier. Specifically, we propose to use a novel version of the SVM, the so-called constrained SVM (CSVM), which has been particularly designed to obtain cost-sensitive results; see Benítez-Peña et al. (2019). Without going into much detail, the CSVM formulation is obtained by solving a convex quadratic optimization problem with linear constraints and some integer variables:

    min_{ω,β,ξ,ζ}  ω⊤ω + C Σ_{i∈I} ξ_i
    s.t.  y_i(ω⊤x_i + β) ≥ 1 − ξ_i,  i ∈ I,
          0 ≤ ξ_i ≤ M(1 − ζ_i),  i ∈ I,
          μ(ζ)_ℓ ≥ λ_ℓ,  ℓ ∈ L,
          ζ_i ∈ {0, 1},  i ∈ I.                    (5)

Problem (5) is simply the formulation of the standard SVM with linear kernel, to which performance constraints have been added: μ(ζ)_ℓ ≥ λ_ℓ, where the μ(ζ)_ℓ are different performance measures, forced to take values above the thresholds λ_ℓ, the ζ_i are new binary variables that check whether record i is counted as correctly classified, and M is a large number. We refer the reader to the original reference (Benítez-Peña et al. (2019)) for a more detailed description of this cost-sensitive classifier. The important message to be kept here is that solving (5) with a standard software package for different values of the parameters λ_ℓ yields classifiers with different trade-offs between sensitivity and specificity.

[Figure 5 about here: two panels (a) and (b).]

Figure 5: Control over the probability estimation. Subfigure 5(a) shows the original estimated probabilities, whereas Subfigure 5(b) depicts the new cost-sensitive probabilities for 5(a), obtained by moving the threshold.

Both methods (Ctrl1 and Ctrl2) will be illustrated through numerical examples in Section 3.3.

3. Experimental results

In this section we illustrate the performance of our method for computing posterior class probability estimates as in Section 2.1. The results will be compared to those of the benchmark approaches by Platt (2000), Sollich (2002) and Tao et al. (2005). To do this, a variety of datasets with different properties concerning size (in the number of instances and/or variables) and unbalancedness shall be analyzed. Moreover, using real datasets, we test the methods described in Section 2.2 to control the posterior positive class probability. Specifically, this section is organized as follows. In Section 3.1 we present a brief description of the different datasets we have used and describe how the different experiments have been implemented. Section 3.2 shows the performance of the novel approach in comparison to that of the benchmark methodologies. Finally, in Section 3.3 we apply both Ctrl1 and Ctrl2 to improve the posterior probability of the class of interest.

3.1. Datasets and description of the experiments

The performance of the different methodologies presented in this paper is illustrated using eleven real-life datasets: wisconsin (Breast Cancer Wisconsin (Diagnostic)), cancer-colon (Colon Cancer), diabetes (Diabetes), leukemia (Leukemia), SRBCT (Small Round Blue Cell Tumor), heart (Heart Disease), adult (Adult), divorce (Divorce Predictors data set), german (German Credit Data), cervical-cancer (Cervical cancer (Risk Factors)) and banknote (banknote authentication). The SRBCT dataset can be obtained from the R package plsgenomics (Boulesteix et al. (2011)) and leukemia from Golub et al. (1999). On the other hand, cancer-colon is available at the Kent Ridge Biomedical Data Repository (Ridge (2002)). The other eight datasets are obtained from the UCI Repository (Dheeru and Karra Taniskidou (2017)). Table 2 contains the relevant information on these datasets. The second and third columns show the sample sizes of the validation (|Ω_V|) and the complete datasets (|Ω|), respectively. The fourth column contains the number of original variables or attributes (|A|) in the dataset. Finally, the last column collects the number (|Ω+|) and percentage (%) of positive instances in the complete dataset.

Prior to running the experiments, the datasets were standardized so that each variable has zero mean and unit variance (Graf et al. (2003)). Also in relation to the variables, the categorical ones were transformed into sets of dummy variables, and datasets with three or more classes were converted into two-class datasets by giving the negative label to the largest class and the positive label to the remaining records. Missing values were replaced by the median. Finally, when running the SVM and the constrained SVM in (1) or (5), the linear-kernel versions were considered. All the experiments have been carried out using the solver Gurobi (Gurobi Optimization, Inc. (2016)) and its Python language interface (Python Core Team (2015)). No time limit was imposed when solving Problem (1), whereas a limit of 300 seconds was set when solving (5). Also, for the latter problem, M was set equal to 1000 (see Benítez-Peña et al. (2019) for more details).

Name             |Ω_V|    |Ω|      |A|     |Ω+| (%)
wisconsin        57       569      30      357 (62.7%)
cancer-colon     10       62       2000    22 (35.5%)
diabetes         76       759      8       263 (34.7%)
leukemia         10       72       7128    47 (65.3%)
SRBCT            10       83       1022    40 (48.2%)
heart            14       140      13      40 (28.6%)
adult            3256     32560    14      7841 (24.08%)
divorce          17       170      54      84 (49.41%)
german           100      1000     20      300 (30%)
cervical-cancer  86       858      36      55 (6.41%)
banknote         137      1372     5       610 (44.46%)

Table 2: Datasets
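The preprocessing just described can be sketched as follows; the function name and the toy arrays are illustrative, not from the paper's code:

```python
import numpy as np

def preprocess(X, y):
    """Median-impute, standardize, and reduce to two classes as described above."""
    X = np.asarray(X, dtype=float)
    # replace missing values by the column-wise median
    med = np.nanmedian(X, axis=0)
    X = np.where(np.isnan(X), med, X)
    # standardize: zero mean, unit variance per variable
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # largest class -> -1, all remaining classes -> +1
    labels, counts = np.unique(y, return_counts=True)
    majority = labels[np.argmax(counts)]
    y = np.where(np.asarray(y) == majority, -1, 1)
    return X, y

# Toy example with a missing value and three original classes
X = np.array([[1.0, np.nan], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
y = np.array(["a", "a", "b", "c"])
Xs, ys = preprocess(X, y)
```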

In our experiments, the number of folds selected for the k-fold CV is 10 external folds (the performance measure is estimated as the average over the 10 folds) and 10 internal folds (in order to obtain the best parameter C). The number of bootstrap samples B has been set equal to 500, and each bootstrap training sample has the same size as the original training sample. Note that we cope with the unbalancedness, if present, as is, though one could have performed undersampling of the majority class or oversampling of the minority class in a preprocessing phase. The grid of C values selected in our experiments is {2⁻⁵, 2⁻⁴, ..., 2⁴, 2⁵}.

3.2. Performance of the bootstrap-based approach

In this section we estimate the posterior class probabilities according to the novel bootstrap-based method described in Section 2.1 and compare the results with those obtained by the benchmark approaches by Platt (2000), Sollich (2002) and Tao et al. (2005) commented on in Section 1. The results are summarized in Table 3, whose second, third and fourth columns contain the mean squared error (MSE) values obtained when the deterministic class membership is approximated by its probabilistic counterpart. Note that, for the method of Tao et al. (2005), a value for the parameter r needs to be selected; in this case, we tested the results for four different choices of r (0, √10, √20, √30). The best results have been highlighted in bold.
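As a hedged reading of the error measure (our assumption, not spelled out in this excerpt): the MSE is computed between the predicted positive-class probabilities and the 0/1 indicators of the true classes, i.e. a Brier score:

```python
import numpy as np

# Made-up true labels and predicted positive-class probabilities
y_true = np.array([1, -1, 1, 1, -1])
p_pos = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# MSE between probabilities and 0/1 class indicators (Brier score)
mse = np.mean((p_pos - (y_true == 1).astype(float)) ** 2)
```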

Dataset          Bootstrap-based  Sollich  Platt   Tao et al. (r = 0, √10, √20, √30)
wisconsin        0.003            0.064    0.021   0.028, 0.019, 0.034, 0.055
cancer-colon     0.201            0.241    0.252   0.208, 0.208, 0.208, 0.208
diabetes         0.192            0.190    0.157   0.229, 0.233, 0.234, 0.234
leukemia         0                0.239    0.01    0.029, 0.029, 0.029, 0.029
SRBCT            0.01             0.237    0.011   0, 0, 0, 0
heart            0.143            0.158    0.121   0.15, 0.118, 0.207, 0.218
adult            0.144            0.128    0.068   0.148, 0.078, 0.142, 0.167
divorce          0                0.189    0.021   0.024, 0.024, 0.024, 0.024
german           0.197            0.229    0.168   0.256, 0.256, 0.256, 0.247
cervical-cancer  0.013            0.187    0.028   0.039, 0.039, 0.032, 0.035
banknote         0.017            0.008    0.008   0.009, 0.145, 0.221, 0.238

Table 3: Mean squared errors (MSE) obtained when predicting the posterior class probabilities in a linear SVM.

It can be seen from Table 3 that our methodology performs best for wisconsin, cancer-colon, leukemia, divorce and cervical-cancer,

obtaining the lowest values of MSE. Additionally, the method proposed by Platt (2000) obtains the lowest MSE in diabetes, heart, adult, german and banknote. Finally, the method of Tao et al. (2005) achieves a zero MSE for SRBCT. On the other hand, the method proposed by Sollich (2002) performs poorly in all cases except banknote. In conclusion, we have built a method whose performance is comparable to that of the benchmark approaches, outperforming them on some datasets.

As described in Section 2.1, the final estimate of the posterior class probabilities is set in terms of the results obtained for a range of the regularization parameter C (see expression (4)). It is of interest to compare these results with those computed using only the value of C that provides the best accuracy measure. The results are shown in Table 4, from which it can be concluded that embedding the SVM in an ensemble method actually improves its performance.

Dataset          Best C   Bootstrap-based approach
wisconsin        0.02     0.003
cancer-colon     0.2      0.201
diabetes         0.225    0.192
leukemia         0        0
SRBCT            0        0.01
heart            0.214    0.143
adult            0.152    0.144
divorce          0        0
german           0.28     0.197
cervical         0.047    0.013
banknote         0        0.017

Table 4: Mean squared errors (MSE) using only the best C and under the bootstrap-based approach.

Figure 6: MSE for the positive class probability predictions of each dataset, under Ctrl1.

3.3. Results when the posterior class probabilities are controlled

In this section we apply the methodologies described in Section 2.2 in order to control P(y = 1 | x) or P(y = −1 | x). In particular, Figures 6 and 8 are obtained under Ctrl1, and Figures 7 and 9 show the results when the method based on the CSVM (Ctrl2) is implemented. For all the figures, the class of interest to be controlled is assumed to be the positive one. Figures 6 and 7 show the MSE when considering only the positive instances, while Figures 8 and 9 depict the MSE for the negative instances.

From Figures 6 and 7 we can see that, as the threshold for obtaining a given proportion of the instances in the correct class (x-axis) is moved to the right, the MSE becomes lower, as expected. In fact, for some datasets (banknote, divorce, leukemia, SRBCT and wisconsin) the obtained MSEs are very close to 0 in both Figures 6 and 7; as a result, the lines defining the MSE values for those datasets are indistinguishable. However, Figures 8 and 9 present different patterns. While Figure 9 behaves as expected (as the MSEs for the sensitivity become smaller, the MSEs for the specificity become constant or higher), the specificity depicted in Figure 8 remains unaltered. Here again, some datasets result in almost null MSEs (cervical-cancer, divorce, leukemia, SRBCT and wisconsin).

An important remark concerning the performance of Ctrl1 and Ctrl2 is the following: the first method seems able to improve the sensitivity without damaging the specificity too much, while the second damages the specificity more significantly but, at the same time, leads to better sensitivity values.

Figure 7: MSE for the positive class probability predictions of each dataset, under Ctrl2.

Figure 8: MSE for the negative class probability predictions of each dataset, under Ctrl1.

Figure 9: MSE for the negative class probability predictions of each dataset, under Ctrl2.

4. Conclusions

In this paper we have proposed a procedure to obtain probabilistic outputs for support vector machines. Contrary to existing proposals, our method is distribution-free and cost-sensitive. Also, it makes use not of a single classifier but of a weighted average of classifiers, obtaining more accurate results.

Our proposal has been compared to some benchmark methodologies, and the results show that our approach is comparable to or better than those approaches. Two cost-sensitive alternatives have been proposed here: the first is based on changing the way the probabilities are estimated, and the second modifies the original classifier into a cost-sensitive version. Results for real datasets have been shown, demonstrating the usefulness of our novel approach.

For simplicity, the baseline SVM classifiers are taken with a linear kernel; more powerful classifiers would be obtained if nonlinear kernels (such as the RBF) were used, though at the expense of a higher computational effort. On the other hand, the traditional SVM can be used as a basis for addressing multiclass problems. How to properly extend our approach to such multiclass problems is an interesting research avenue which is now under investigation.

Acknowledgements

This research is financed by projects EC H2020 MSCA RISE NeEDS (Grant agreement ID: 822214), FQM329 and P18-FR-2369 (Junta de Andalucía), PR2019-029 (Universidad de Cádiz) and PID2019-110886RB-I00 (Ministerio de Ciencia, Innovación y Universidades, Spain). The last three are cofunded with EU ERD Funds. The authors are thankful for such support.
