Cost-sensitive probabilistic predictions for support
vector machines
Sandra Benítez Peña
Instituto de Matemáticas de la Universidad de Sevilla,
Department of Statistics and Operations Research, University of Sevilla
Rafael Blanquero
Instituto de Matemáticas de la Universidad de Sevilla,
Department of Statistics and Operations Research, University of Sevilla
Emilio Carrizosa
Instituto de Matemáticas de la Universidad de Sevilla,
Department of Statistics and Operations Research, University of Sevilla
Pepa Ramírez-Cobo
Department of Statistics and Operations Research, University of Cádiz,
Instituto de Matemáticas de la Universidad de Sevilla
Support vector machines (SVMs) are widely used and constitute one of the
best examined and used machine learning models for two-class classification.
Classification in SVM is based on a score procedure, yielding a deterministic
classification rule, which can be transformed into a probabilistic rule (as imple-
mented in off-the-shelf SVM libraries), but is not probabilistic in nature. On
the other hand, the tuning of the regularization parameters in SVM is known to
imply a high computational effort and generates pieces of information that are
not fully exploited, not being used to build a probabilistic classification rule.
In this paper we propose a novel approach to generate probabilistic outputs
for the SVM. The highlights of the paper are the following. First, an SVM method is designed to be cost-sensitive, so that the different importance of sensitivity and specificity is readily accommodated in the model. Second, the SVM is embedded in an ensemble method to improve its performance, making use of the valuable information generated in the parameter-tuning process. Finally, the probabilities are estimated via bootstrap, avoiding the parametric models used by competing probability-estimation methods for SVM. Numerical tests show the advantages of our approach, yielding results comparable to or better than those of benchmark procedures.
Keywords: Support Vector Machines, Probabilistic Classification, Cost-Sensitive
Preprint submitted to Statistical Analysis and Data Mining December 9, 2020
1. Introduction
Supervised classification is one of the most relevant tasks in Data Science.
We are given a set Ω of individuals. Each element i ∈ Ω is represented by
a pair (x_i, y_i), where x_i ∈ R^n is the attribute vector and y_i ∈ C is the class
membership of object i. Class information is only available for the elements of I ⊆ Ω, which is
called the training sample. In its most basic version, the one considered in
this paper, supervised classification addresses two-class problems, that is to say, C = {−1, +1}.
Support Vector Machine (SVM) is a powerful, state-of-the-art method
in supervised classification that aims at separating both classes by means of
a linear classifier, ω^T x + β, where ω is the score vector. SVM is addressed
by solving the following convex quadratic programming (QP) formulation with
linear constraints:

min_{ω,β,ξ}  ω^T ω + C Σ_{i∈I} ξ_i
s.t.  y_i (ω^T x_i + β) ≥ 1 − ξ_i,  i ∈ I,    (1)

where ξ_i ≥ 0 are the so-called slack variables, which allow data points to be
misclassified, and C > 0 is a regularization parameter to be tuned, which controls
the trade-off between margin maximization and misclassification errors, see e.g.
Carrizosa and Morales (2013).
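As a hedged illustration (not the authors' implementation), the soft-margin problem (1) can be solved with an off-the-shelf library. The toy data below are hypothetical, purely to show the score function and hard labeling:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical two-class toy data (not from the paper's experiments).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(40, 2)),
               rng.normal(+1.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

# Linear soft-margin SVM: SVC solves (a dual reformulation of) the QP (1),
# with C controlling the margin/misclassification trade-off.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Score function f(x) = w^T x + beta; sign(f(x)) gives the hard label.
w, beta = clf.coef_.ravel(), clf.intercept_[0]
scores = X @ w + beta
hard_labels = np.sign(scores)
```

Note that the classifier returns only these hard labels; turning the scores into calibrated probabilities is precisely the problem the paper addresses.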
Given an object i with attribute vector x_i, the SVM algorithm produces a
hard labeling: i is classified in the positive or the negative class
according to the sign of f(x_i), where f(x) = ω^T x + β is the score function. When
an attribute vector x_0 is given, the value f(x_0) is called the score value of x_0.
However, the SVM method does not yield probabilistic outputs such as posterior
probabilities P(y = 1 | x), which are of interest if a measure of confidence in the
predictions is sought, see Murphy (2012). This is of particular importance in
several real-world applications such as cancer screening or credit scoring, where
the risks of a false-negative and a false-positive result are significantly different.
Several attempts to obtain the posterior probabilities P(y = 1 | x) for SVM
have been carried out from the score function f(x). One of them is based
on assigning posterior class probabilities by means of a specific parametric
family. For example, Wahba (1992) and Wahba et al. (1999) proposed a logistic
link function,

P(y = 1 | x) = 1 / (1 + exp(−f(x))).    (2)
Also, Vapnik and Vapnik (1998) suggested estimating P(y = 1 | x) in terms
of a series of trigonometric functions, where the coefficients of the trigonometric
expansion minimize a regularized functional. Another option is to fit
Gaussians to the class-conditional densities P(f(x) | y = 1) and P(f(x) | y = −1),
as proposed in Hastie and Tibshirani (1998). Following this argument, the posterior
probability P(y = 1 | f(x)) is a sigmoid whose slope is determined by the
tied variance. One of the best-known heuristics to obtain probabilities is due
to Platt (2000), which considers f(x) as the log-odds ratio log [P(y = 1 | x) / P(y = −1 | x)].
This implies that

P(y = 1 | x) = 1 / (1 + exp(A f(x) + B)),    (3)
and A and B can be estimated by maximum likelihood on a validation set. This
technique is implemented in well-known statistical packages, such as the ksvm()
function in R (see Karatzoglou et al. (2006)) or predict_proba in scikit-learn in
Python (Pedregosa et al. (2011)). However, such a method has been criticized
for failing to provide insight and for interpreting f(x) as a log-odds ratio, which
may not be accurate for some datasets, see Murphy (2012); Tipping (2001);
Franc et al. (2011). To illustrate such a phenomenon, consider Figure 1, which
shows the fit of the sigmoid function (3) to the empirical class probabilities
of three well-referenced datasets: adult, wisconsin and diabetes (see Section 3.1).
It can be seen that, while for the adult dataset the fit
provided by (3) performs reasonably well, the performance
is poor for the other two datasets.
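To make (3) concrete, here is a minimal sketch of Platt scaling: since P(y = 1 | x) = 1/(1 + exp(A f(x) + B)) is a logistic model in f(x), A and B can be fit by a one-dimensional logistic regression on held-out scores. The scores below are synthetic, purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic validation-set scores f(x) and labels (1 = positive class).
rng = np.random.default_rng(1)
f = np.concatenate([rng.normal(+1.5, 1.0, 200),   # scores of positives
                    rng.normal(-1.5, 1.0, 200)])  # scores of negatives
y = np.array([1] * 200 + [0] * 200)

# Maximum-likelihood fit of P(y=1|x) = sigmoid(c*f + d) on the scores;
# matching the sign convention of (3) gives A = -c, B = -d.
lr = LogisticRegression().fit(f.reshape(-1, 1), y)
A, B = -lr.coef_[0, 0], -lr.intercept_[0]
probs = 1.0 / (1.0 + np.exp(A * f + B))
```

Here A comes out negative, so larger scores map to probabilities closer to one; the criticism in the text is that this parametric sigmoid shape may simply not match the empirical class probabilities.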
Sollich (2002) considers a different probabilistic framework for SVM classification,
based on Bayesian theory. In particular, it relates the SVM kernel to the
covariance function of a Gaussian process prior and, as a result, optimal values
of the tuning parameter C and class probabilities are obtained in a natural way.
However, all these previous works make assumptions that might not be satisfied
by the data. Finally, other approaches seeking probabilistic outputs are found in
the literature, such as Seeger (2000), Kwok (1999, 1998) and Herbrich et al. (1999);
however, none of them considers the problem of unbalancedness, so they may not perform
well on the minority class. Unlike previous works, Tao et al. (2005) consider the
problem of estimating posterior class probabilities taking into account that the
influence of different samples may be unbalanced, resulting in non-robust SVMs.
Their approach, the posterior probability support vector machine (PPSVM), is
distribution-free and weighs unbalanced training samples. However, none of
the previously mentioned works addresses cost-sensitive problems, which are of
crucial importance in many real-world applications. For example, in cancer detection
there are usually few disease cases compared to healthy ones, yet smaller
classification errors are desired for the former.
As an illustration, consider the dataset cancer-colon (see Ridge (2002)), with
two classes and where only 35.5% of the records correspond to sick people
(positive class). Assume that the interest is to obtain a good estimate of the
posterior positive class probability. Using the methods proposed by Platt (2000), Sollich
(2002) and Tao et al. (2005), the mean squared error (MSE) values for the posterior
class probabilities of interest were 0.241, 0.252 and 0.208, respectively. However, as will
be shown, using the methodology proposed in this paper, these values
can decrease to 0, perhaps at the expense of damaging the prediction of
Figure 1: Fit (in solid line) of the sigmoid function to the empirical class probabilities (dots)
of the adult, wisconsin and diabetes datasets. Each panel plots the probabilities against the scores f(x).
the posterior negative class probability. Another context where it is important
to control the estimation error related to a specific class is credit scoring, where
the individuals of interest (defaulting ones) constitute the class of the smallest
sample size. Such issues are addressed, from the classification perspective,
in Bradford et al. (1998); Freitas et al. (2007); Carrizosa et al. (2008); Datta and
Das (2015); Benítez-Peña et al. (2019), and recently, from a feature selection
viewpoint, in Benítez-Peña et al. (2019).
A major contribution of this work is that the SVM is embedded in an en-
semble method, leading, as shown later, to an improvement in performance. It
is known that, in order to solve the SVM problem (1), a tuning process concerning
the regularization parameter C needs to be performed. Traditionally, all the
information resulting from this tuning procedure is discarded and only the best
C value is used to build the classifier. Instead, in this work, the final posterior
class probability estimate is a weighted mean of different posterior probabilities,
each one related to a specific value of C. In addition, unlike Platt (2000) and
Sollich (2002), here we propose a novel methodology that does not make use of
parametric models based on the score function f(x) obtained after tuning the
SVM parameters. Instead, we consider a bootstrap framework, which, to the
best of our knowledge, has not been addressed before for this type of problems.
The use of a bootstrap sampling allows us to obtain accurate values for the
densities of the score values, which translates into a better prediction of the
posterior class probability P(y= 1|x).
The paper is structured as follows. First, in Section 2, our methodology
is introduced. Section 2.1 describes how to integrate bootstrap sampling
into the SVM to produce posterior class probability estimates. Section 2.2
explains two different ways to obtain cost-sensitive probabilistic predictions.
In Section 3 some experimental results are presented. In particular, several
well-referenced datasets from biomedical contexts are analyzed. Estimates of
the posterior class probabilities under our methodology are compared to those
obtained under benchmark approaches. Finally, the posterior probabilities of
the classes of interest are controlled via the two different approaches described
in Section 2.2. Conclusions and further research can be found in Section 4.
2. Cost-sensitive predictive probabilities for SVM
In this section we present our methodology to obtain cost-sensitive posterior
class probabilities for the SVM classifier. First, in Section 2.1 we explain how
to integrate bootstrap sampling into the SVM to produce posterior class probability
estimates P(y = 1 | x). Second, in Section 2.2 we describe two different
approaches that allow us to control the posterior probability estimates for the
class of interest.
2.1. SVM posterior class probabilities based on Bootstrap
Assume that we want to solve the SVM (1) to classify the observations in a
dataset, such as e.g. the wisconsin dataset (see Section 3.1 for details concerning
the database). In order to estimate the classification error, it is standard to
consider a k-fold CV approach. Figure 2 shows the histogram (in absolute frequencies)
of the score values under three different choices of k (k = 20, 100, 500)
for a given individual (randomly chosen). It can be observed that, as k increases,
the score values are less dispersed, a consequence of the fact that the
different samples share more elements and thus yield more similar scores.
In such a situation, it might not be possible to obtain accurate posterior class
probabilities P(y = 1 | x), especially when the observations of the two groups
strongly overlap. What is proposed in this paper is to replace the k-fold CV
approach by a bootstrap sampling that allows us to avoid the degenerate behaviour
observed in Figure 2. This is illustrated in Figure 3, which shows histograms
analogous to those in Figure 2, but where a bootstrap sampling with B replications
(B = 20, 100, 500) has been considered instead. The idea of using those values is
just to illustrate the behavior of this method as the number of replications
increases, in contrast with Figure 2. Finally,
the estimates of the posterior class probabilities P(y = 1 | x) can be obtained
as the relative frequency of positive score values (of negative ones in the case of P(y = −1 | x)).
This approach, given by Algorithm 1 and illustrated as a flowchart
in Figure 4, is detailed next.
First, to carry out our procedure for obtaining probabilities, consider a complete
dataset Ω composed of m instances, n variables, and 2 classes (1 or −1).
The dataset is split in two samples: the training sample T (of sample size m_tr)
and the validation sample V (of size m − m_tr), see Step 1 of Algorithm 1. As
is usual in the SVM implementation, a grid for the regularization parameter C
needs to be set. Then, a matrix PX with as many rows as the number
of instances in the validation sample (m − m_tr) and as many columns as the
number of values of C to be used is built. This matrix contains the proportions of
negative scores (as illustrated in Table 1) and is generated through the algorithm
as follows. Given a fixed value of C, and B bootstrap samples from the
training sample T (denoted T*_b, b = 1, . . . , B), SVM is run over
each bootstrap sample and validated over the out-of-bag sample, i.e., the set
of instances that are in T but not in the considered bootstrap sample (denoted
V*_b, b = 1, . . . , B), and also over the validation sample V. In this
way, we have validated twice: the first time (over the out-of-bag samples
V*_b) to measure the performance yielded by the chosen value of C, and then
over the validation set V, to estimate the posterior negative class probabilities
conditioned on C, P(y_Vi = −1 | x, C), where V_i denotes the i-th instance in V.
In this way, we obtain B score values for each instance in V. Such score
values are recorded in a matrix PredictionV, with as many rows as the number of
instances in the validation sample (m − m_tr) and B columns. Finally, we
propose to estimate the posterior negative class probability P(y_Vi = −1 | x) as
a weighted average of the estimates for P(y_Vi = −1 | x, C) when using different
values of C (C_r, r = 1, . . . , R), but taking into account only the values of C that
lead to accuracies close to the best one. Let acc_{r,b} denote the standard accuracy
in V*_b (instances correctly classified divided by total number of instances), given
the SVM built from the r-th value of C and the b-th bootstrap sample, and
let acc_r denote the average value of the coefficients acc_{r,b}, namely

acc_r = (1/B) Σ_{b=1}^{B} acc_{r,b}.

Only values of C yielding high estimates of acc_r are taken into consideration, the
remaining ones being discarded. Those considered are stored in the set J,
defined as J = {j : acc_j ≥ max_l acc_l − ε}, where ε > 0 is a fixed parameter.
Finally, if the weights w_j, j ∈ J, are defined from the accuracies acc_j (normalized
so that Σ_{j∈J} w_j = 1), then the estimate of P(y_Vi = −1 | x) is:

P(y_Vi = −1 | x) = Σ_{j∈J} w_j P(y_Vi = −1 | x, C_j).    (4)
Step 1: Split Ω into T and V: T ∪ V = Ω, T ∩ V = ∅.
Initialize PX as an empty matrix.
for each C in the grid C_r, r = 1, . . . , R do
    Initialize PredictionV as an empty matrix.
    for b in 1, 2, . . . , B do
        Create a bootstrap sample T*_b from T.
        Run SVM over T*_b. Validate over V*_b and obtain the accuracy acc_{r,b}.
        Obtain (for each instance in V) the score values, denoted scoresV.
        Insert scoresV as a new column in PredictionV.
    end for
    Estimate P(y = −1 | x, C) for each of the (m − m_tr) instances in V,
    as the proportion of negative scores in each of the (m − m_tr) rows
    of PredictionV. Insert these estimates as a new column in PX.
end for
Using the values acc_{r,b}, calculate the weighted average given by (4).
Algorithm 1: Pseudocode for the Bootstrap SVM
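A minimal sketch of Algorithm 1 follows. It is an illustration under assumptions, not the authors' code: the data are synthetic, B = 50 (rather than the paper's 500) keeps it fast, and the weights w_j are taken proportional to the accuracies acc_j:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Hypothetical two-class data; labels in {-1, +1}.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (60, 3)), rng.normal(1, 1, (60, 3))])
y = np.array([-1] * 60 + [1] * 60)

# Step 1: split Omega into training sample T and validation sample V.
X_T, X_V, y_T, y_V = train_test_split(X, y, test_size=0.25, random_state=0)
m_tr = len(X_T)

C_grid = [2.0 ** k for k in range(-5, 6)]  # grid of C values
B = 50                                     # bootstrap replications (500 in the paper)

PX, acc = [], []  # one probability column / one mean OOB accuracy per C
for C in C_grid:
    scores_V, accs = [], []
    for b in range(B):
        idx = rng.integers(0, m_tr, m_tr)            # bootstrap sample T*_b
        oob = np.setdiff1d(np.arange(m_tr), idx)     # out-of-bag sample V*_b
        clf = SVC(kernel="linear", C=C).fit(X_T[idx], y_T[idx])
        accs.append(clf.score(X_T[oob], y_T[oob]))   # acc_{r,b}
        scores_V.append(clf.decision_function(X_V))  # scores over V
    # P(y = -1 | x, C): proportion of negative scores over the B replications.
    PX.append((np.vstack(scores_V) < 0).mean(axis=0))
    acc.append(np.mean(accs))
PX, acc = np.array(PX), np.array(acc)

# Keep the values of C within eps of the best mean accuracy (the set J)
# and average their columns with accuracy-proportional weights w_j.
eps = 0.02
J = acc >= acc.max() - eps
w = acc[J] / acc[J].sum()
p_neg = w @ PX[J]  # estimates of P(y_Vi = -1 | x), one per instance in V
```

The two levels of validation in the text appear here as the out-of-bag score (used only to weight and select values of C) and the fixed validation set V (used only to accumulate score signs).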
The methodology will be illustrated in Section 3.2 where, in addition, some
comparisons with respect to benchmark approaches will also be presented.
2.2. Control over the sensitivity measure
In the previous section, an approach for estimating the posterior class probabilities
P(y = 1 | x) and P(y = −1 | x) has been described. In this section, we deal
with the issue of improving the sensitivity of the classifier (the proportion of
positive instances that are correctly classified) which, as commented in Section 1,
Figure 2: Histogram of scores for a single instance when k-fold CV is used. Here, we set
k = 20, 100 and 500, obtaining as many score values as the value of k.

Figure 3: Histogram of scores for a single instance when the bootstrap with B replications is
used. As in the k-fold CV, B = 20, 100 and 500.
Figure 4: Flowchart of the Bootstrap-based methodology
PX =
             C_1            C_2            ···   C_R
V_1          PX_{1,1}       PX_{1,2}       ···   PX_{1,R}
V_2          PX_{2,1}       PX_{2,2}       ···   PX_{2,R}
···          ···            ···            ···   ···
V_{m−m_tr}   PX_{m−m_tr,1}  PX_{m−m_tr,2}  ···   PX_{m−m_tr,R}

Table 1: Probabilities P(y_Vi = −1 | x, C)
is a problem of interest, among others, in biomedical contexts. To do this, we
propose two different approaches, Ctrl1 and Ctrl2, which are discussed in what
follows and empirically analyzed in Section 3.3.
Method Ctrl1 is based on the fact that the sensitivity measure can be controlled
through the posterior class probabilities, as explained next. In Algorithm 1,
the posterior negative class probabilities have been estimated as
the proportion of negative scores. However, if instead of 0 we consider
a different threshold (say a value a, with a negative), then the estimates of
P(y_Vi = −1 | x, C) decrease (that is, the posterior positive class
probabilities P(y_Vi = 1 | x, C) increase). This is illustrated by the two examples in
Figures 5(a) and 5(b), which represent the histograms of the scores
for two different individuals of the wisconsin dataset. Figure 5(a) shows the
posterior positive class probability estimates using Algorithm 1, that is, where
the value 0 is used as the threshold to classify in the positive or the negative
class. In Figure 5(b), the threshold value has been moved to the left and, as
a consequence, the resulting estimates have increased. Note from Figure 5(b)
that, with this approach, the probability of an instance belonging to the positive
class may change from below to above 0.5. In practice, in order to obtain a desired
posterior positive class probability estimate, the threshold is moved until
a certain proportion of the instances of the positive class are correctly classified.
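Ctrl1 can be sketched in a few lines: given the B bootstrap score values of an instance, the positive-class estimate is the share of scores above the threshold, and moving the threshold to the left (a < 0) raises it. The scores below are synthetic, for illustration only:

```python
import numpy as np

# Synthetic bootstrap scores f(x) for one validation instance (B = 500).
rng = np.random.default_rng(3)
scores = rng.normal(-0.3, 1.0, 500)

def p_positive(scores, a=0.0):
    """Estimate P(y = +1 | x) as the share of scores above the threshold a."""
    return (scores > a).mean()

p_default = p_positive(scores)          # threshold 0, as in Algorithm 1
p_shifted = p_positive(scores, a=-1.0)  # threshold moved left: estimate grows
```

In practice a would be moved until the desired proportion of positive instances is correctly classified, as described above.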
Method Ctrl2 also results from Algorithm 1, but, instead of changing the
threshold for the scores, we consider a different classifier. Specifically, we propose
to use a novel version of the SVM, the so-called Constrained SVM (CSVM),
which has been particularly designed to obtain cost-sensitive results, see
Benítez-Peña et al. (2019). Without going into much detail, the CSVM formulation is
obtained by solving a convex quadratic optimization problem with linear constraints
and some integer variables:

min_{ω,β,ξ,ζ}  ω^T ω + C Σ_{i∈I} ξ_i
s.t.  y_i (ω^T x_i + β) ≥ 1 − ξ_i,  i ∈ I
      0 ≤ ξ_i ≤ M (1 − ζ_i),  i ∈ I    (5)
      ζ_i ∈ {0, 1},  i ∈ I.

Problem (5) is simply the formulation of the standard SVM with linear
kernel, to which performance constraints µ(ζ)_ℓ ≥ λ_ℓ have been added, where the
µ(ζ)_ℓ are different performance measures, forced to take values above thresholds
λ_ℓ, the ζ_i are new binary variables that check whether record i is counted as
correctly classified, and M is a large number. We refer the reader to the original
reference (Benítez-Peña et al. (2019)) for a more detailed description of this
cost-sensitive classifier. The important message to be kept here is that solving (5)
with a standard software package for different values of the parameters λ_ℓ yields
classifiers with different trade-offs between sensitivity and specificity.

Figure 5: Control over the probability estimates. Subfigure 5(a) shows the
original estimated probabilities, whereas Subfigure 5(b) depicts the new cost-sensitive
probabilities for 5(a), obtained by moving the threshold.
Both methods (Ctrl1 and Ctrl2) will be illustrated through numerical
examples in Section 3.3.
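The CSVM (5) requires a mixed-integer QP solver. As a lightweight stand-in that is not the authors' CSVM, per-class misclassification costs (the class_weight option in scikit-learn) illustrate the same sensitivity/specificity trade-off on hypothetical unbalanced data:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical unbalanced data: few positives (+1), many negatives (-1).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(+1, 1, (20, 2)), rng.normal(-1, 1, (100, 2))])
y = np.array([+1] * 20 + [-1] * 100)

def sensitivity(clf, X, y):
    """Proportion of positive instances correctly classified."""
    return (clf.predict(X)[y == 1] == 1).mean()

# Plain SVM vs. an SVM that penalizes errors on the positive class 10x more.
plain = SVC(kernel="linear", C=1.0).fit(X, y)
costly = SVC(kernel="linear", C=1.0, class_weight={1: 10.0, -1: 1.0}).fit(X, y)
```

Raising the positive-class weight plays a role loosely analogous to tightening a sensitivity threshold λ_ℓ in (5): sensitivity goes up, typically at some cost in specificity; the CSVM instead enforces the performance level as a hard constraint.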
3. Experimental results
In this section we illustrate the performance of our method for computing
posterior class probability estimates as in Section 2.1. The results will be com-
pared to those of benchmark approaches by Platt (2000), Sollich (2002) and
Tao et al. (2005). To do this, a variety of datasets with different properties con-
cerning size (in the number of instances and/or variables) and unbalancedness
shall be analyzed. Moreover, using real datasets, we test the methods described
in Section 2.2 to control the posterior positive class probability. Specifically,
this section is organized as follows. In Section 3.1 we present a brief description
of the different datasets we have used and describe how the different experi-
ments have been implemented. Section 3.2 shows the performance of the novel
approach in comparison to that of benchmark methodologies. Finally, in Sec-
tion 3.3 we apply both Ctrl1 and Ctrl2 to improve the posterior probability of
the class of interest.
3.1. Datasets and description of the experiments
The performance of the different methodologies presented in this paper is
illustrated using eleven real-life datasets: wisconsin (Breast Cancer Wisconsin
(Diagnostic)), cancer-colon (Colon Cancer), diabetes (Diabetes), leukemia
(Leukemia), SRBCT (Small Round Blue Cell Tumor), heart (Heart Disease),
adult (Adult), divorce (Divorce Predictors data set), german (German Credit
Data), cervical-cancer (Cervical cancer (Risk Factors)) and banknote (ban-
knote authentication). SRBCT dataset can be obtained from the R package
plsgenomics (Boulesteix et al. (2011)) and leukemia from Golub et al. (1999).
On the other hand, cancer-colon is available at the Kent Ridge Biomedical
Data Repository (Ridge (2002)). The other eight datasets are obtained from the
UCI Repository, (Dheeru and Karra Taniskidou (2017)). Table 2 contains rele-
vant information on the previous datasets. In the second and third columns, the
sample sizes of the validation sample (|V|) and the complete dataset (|Ω|) are shown,
respectively. The fourth column contains the number of original variables or
attributes (|A|) in the dataset. Finally, the last column collects the number
(|+|) and percentage (%) of positive instances in the complete dataset.
Prior to running the experiments, the datasets were standardized so that each
variable has zero mean and unit variance, see Graf et al. (2003). Also,
categorical variables were transformed into sets of dummy
variables. In addition, datasets with three or more classes were converted
Name  |V|  |Ω|  |A|  |+| (%)
wisconsin 57 569 30 357 (62.7%)
cancer-colon 10 62 2000 22 (35.5%)
diabetes 76 759 8 263 (34.7%)
leukemia 10 72 7128 47 (65.3%)
SRBCT 10 83 1022 40 (48.2%)
heart 14 140 13 40 (28.6%)
adult 3256 32560 14 7841 (24.08%)
divorce 17 170 54 84 (49.41%)
german 100 1000 20 300 (30%)
cervical-cancer 86 858 36 55 (6.41%)
banknote 137 1372 5 610 (44.46%)
Table 2: Datasets
into two-class datasets by giving negative label to the largest class and positive
labels to the remaining records. In the case of missing values, they were replaced
by the median. Finally, when running the SVM and the constrained SVM in
(1) or (5), the linear kernel versions were considered. All the experiments have
been carried out using the solver Gurobi (Gurobi Optimization, Inc. (2016)) and
its Python language interface (Python Core Team (2015)). No time limit was
imposed when solving Problem (1), whereas a limit of 300 seconds was set when solving
(5). Also, for the latter problem, M was set equal to 1000 (see Benítez-Peña et al.
(2019) for more details).
In our experiments, the number of folds selected for the k-fold CV is 10
external folds (and we estimate the performance measure by the average over
the 10 folds) and 10 internal folds (in order to obtain the best parameter C). The
number of bootstrap samples B has been set equal to 500, and each bootstrap
training sample has the same size as the original training sample. Note that
we cope with unbalancedness, if present, within the methodology itself, though
one could instead have performed under- or oversampling of the majority or minority
class, respectively, in a preprocessing phase. The grid of C values selected in
our experiments is {2^−5, 2^−4, . . . , 2^4, 2^5}.
3.2. Performance of the bootstrap-based approach
In this section we estimate the posterior class probabilities according to the
bootstrap-based novel method described in Section 2.1 and compare the re-
sults with those obtained by the benchmark approaches by Platt (2000), Sollich
(2002) and Tao et al. (2005) commented in Section 1. The obtained results are
summarized in Table 3, whose second, third and fourth columns contain the
mean squared error (MSE) values obtained when the deterministic class membership
is approximated by its probabilistic counterpart. Note that, according
to Tao et al. (2005), a value for the parameter r needs to be selected. In this
case, we tested the results for four different choices of r (0, 10, 20, 30).
The best results have been highlighted in bold style.
It can be seen from Table 3 that our methodology is the one performing
best for wisconsin, cancer-colon, leukemia, divorce and cervical-cancer,
Dataset          Bootstrap-based approach  Sollich  Platt  Tao et al. (r = 0, 10, 20, 30)
wisconsin        0.003                     0.064    0.021  0.028, 0.019, 0.034, 0.055
cancer-colon     0.201                     0.241    0.252  0.208, 0.208, 0.208, 0.208
diabetes         0.192                     0.190    0.157  0.229, 0.233, 0.234, 0.234
leukemia         0                         0.239    0.01   0.029, 0.029, 0.029, 0.029
SRBCT            0.01                      0.237    0.011  0, 0, 0, 0
heart            0.143                     0.158    0.121  0.15, 0.118, 0.207, 0.218
adult            0.144                     0.128    0.068  0.148, 0.078, 0.142, 0.167
divorce          0                         0.189    0.021  0.024, 0.024, 0.024, 0.024
german           0.197                     0.229    0.168  0.256, 0.256, 0.256, 0.247
cervical-cancer  0.013                     0.187    0.028  0.039, 0.039, 0.032, 0.035
banknote         0.017                     0.008    0.008  0.009, 0.145, 0.221, 0.238
Table 3: Mean squared errors (MSE) obtained when predicting the posterior class probabilities
in a linear SVM.
obtaining the lowest values of MSE. Additionally, the method proposed by
Platt (2000) obtains the lowest MSE in diabetes, heart, adult, german and
banknote. Finally, with the method of Tao et al. (2005), a zero MSE for SRBCT
is obtained. On the other hand, the method proposed by Sollich (2002) per-
forms poorly in all cases except in banknote. In conclusion, we have built a
method that is comparable in terms of performance to benchmark approaches,
outperforming them in some datasets.
As described in Section 2.1, the final estimate of the posterior class proba-
bilities is set in terms of the results obtained for a range of the regularization
parameter C (see expression (4)). It is of interest to compare these results with
those computed using only the value of C that provides the best accuracy
measure. The results are shown in Table 4, from which it can be concluded that
embedding the SVM in an ensemble method actually improves its performance.
Dataset Best C Bootstrap-based approach
wisconsin 0.02 0.003
cancer-colon 0.2 0.201
diabetes 0.225 0.192
leukemia 0 0
SRBCT 0 0.01
heart 0.214 0.143
adult 0.152 0.144
divorce 0 0
german 0.28 0.197
cervical 0.047 0.013
banknote 0 0.017
Table 4: Mean squared errors (MSE) using only the best C and under the bootstrap-based approach.
Figure 6: MSE for the positive class probability predictions of each dataset. Ctrl1
3.3. Results when the posterior class probabilities are controlled
In this section we apply the methodologies described in Section 2.2 in order
to control P(y= 1|x) or P(y=1|x). In particular, Figures 6 and 8 are
obtained under Ctrl1 and Figures 7 and 9 show the results when the method
based on the CSVM (Ctrl2 ) is implemented. For all the figures, the class of
interest to be controlled is assumed to be the positive one. Figures 6 and 7
show the MSE when considering only the positive instances, while Figures 8
and 9 depict the MSE for the negative instances.
From Figures 6 and 7 we can see how, as the threshold for obtaining a
given proportion of the instances in the correct class (x-axis) is moved to the
right, the MSE becomes lower, as expected. In fact, there are some datasets
(banknote, divorce, leukemia, SRBCT and wisconsin) for which the obtained
MSEs are very close to 0, in both Figures 6 and 7. As a result, the lines defining
the MSE values for those datasets are indistinguishable. However, Figures 8
and 9 present different patterns. While Figure 9 behaves as expected (as the
MSEs for the sensitivity become smaller, the MSEs for the specificity become
constant or higher), the specificity depicted in Figure 8 remains unaltered. Here
again, some datasets result in almost null MSEs (cervical-cancer, divorce,
leukemia, SRBCT and wisconsin).
An important remark concerning the performance of Ctrl1 and Ctrl2 is the
following. The first method seems able to improve the sensitivity without
damaging the specificity too much, while the second damages the specificity in
a more significant way, but at the same time leads to better sensitivity values.

Figure 7: MSE for the positive class probability predictions of each dataset. Ctrl2

Figure 8: MSE for the negative class probability predictions of each dataset. Ctrl1

Figure 9: MSE for the negative class probability predictions of each dataset. Ctrl2
4. Conclusions
In this paper we have proposed a procedure to obtain probabilistic outputs
for support vector machines. Contrary to existing proposals, we
present a method that is distribution-free and cost-sensitive. Also, it makes use
not of a single classifier but of a weighted average of several, obtaining more
accurate results.
Our proposal has been compared to some benchmark methodologies, and the results
show that it is comparable to or better than such approaches. Two
cost-sensitive alternatives have been proposed here: the first is based on changing
the way the probabilities are estimated, and the second modifies
the original classifier into a cost-sensitive version. Results for real datasets have
been shown, demonstrating the usefulness of our novel approach.
For simplicity, the baseline SVM classifiers are taken with a linear kernel;
more powerful classifiers would be obtained with nonlinear kernels (such as
the RBF kernel), though at the expense of a higher computational effort. On
the other hand, traditional SVM can be used as a basis for addressing
multiclass problems; how to properly extend our approach to the multiclass
setting is an interesting research avenue currently under investigation.
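The linear-versus-nonlinear trade-off mentioned above is easy to illustrate: on data that are not linearly separable, an RBF kernel pays off. The following is a standard scikit-learn sketch on synthetic data, unrelated to the paper's specific experiments:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# concentric circles: no linear separator exists for this geometry
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

acc_linear = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
acc_rbf = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
print(acc_rbf > acc_linear)  # True: the RBF kernel separates the circles
```

The price of the nonlinear kernel is a larger tuning burden (the RBF bandwidth must be tuned alongside the regularization parameter), which is the higher computational effort referred to above.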
This research is financed by projects EC H2020 MSCA RISE NeEDS (Grant
agreement ID: 822214), FQM329 and P18-FR-2369 (Junta de Andalucía),
PR2019-029 (Universidad de Cádiz) and PID2019-110886RB-I00 (Ministerio de
Ciencia, Innovación y Universidades, Spain). The last three are co-funded
with EU ERDF funds. The authors are thankful for such support.