To appear, Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI’01).
The Foundations of Cost-Sensitive Learning
Charles Elkan
Department of Computer Science and Engineering 0114
University of California, San Diego
La Jolla, California 92093-0114
elkan@cs.ucsd.edu
Abstract

This paper revisits the problem of optimal learning and decision-making when different misclassification errors incur different penalties. We characterize precisely but intuitively when a cost matrix is reasonable, and we show how to avoid the mistake of defining a cost matrix that is economically incoherent. For the two-class case, we prove a theorem that shows how to change the proportion of negative examples in a training set in order to make optimal cost-sensitive classification decisions using a classifier learned by a standard non-cost-sensitive learning method. However, we then argue that changing the balance of negative and positive training examples has little effect on the classifiers produced by standard Bayesian and decision tree learning methods. Accordingly, the recommended way of applying one of these methods in a domain with differing misclassification costs is to learn a classifier from the training set as given, and then to compute optimal decisions explicitly using the probability estimates given by the classifier.
1 Making decisions based on a cost matrix
Given a specification of costs for correct and incorrect pre-
dictions, an example should be predicted to have the class
that leads to the lowest expected cost, where the expectation
is computed using the conditional probability of each class
given the example. Mathematically, let the (i, j) entry in a cost matrix C be the cost of predicting class i when the true class is j. If i = j then the prediction is correct, while if i ≠ j the prediction is incorrect. The optimal prediction for an example x is the class i that minimizes

$$L(x, i) = \sum_j P(j|x)\, C(i, j). \qquad (1)$$
Costs are not necessarily monetary. A cost can also be a waste
of time, or the severity of an illness, for example.
For each i, L(x, i) is a sum over the alternative possibilities for the true class of x. In this framework, the role of a learning algorithm is to produce a classifier that for any example x can estimate the probability P(j|x) of each class j being the true class of x. For an example x, making the prediction i means acting as if i is the true class of x. The essence of cost-sensitive decision-making is that it can be optimal to act as if one class is true even when some other class is more probable. For example, it can be rational not to approve a large credit card transaction even if the transaction is most likely legitimate.
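As a minimal illustration of this decision rule, the following Python sketch implements Equation (1); the cost matrix and probabilities are invented for the credit card scenario, not taken from the paper.

```python
import numpy as np

# Sketch of Equation (1): predict the class with lowest expected cost.
# Convention as in the paper: C[i, j] = cost of predicting class i when
# the true class is j.
def optimal_prediction(prob, C):
    """prob[j] = estimated P(j|x); returns argmin_i sum_j P(j|x) C(i,j)."""
    expected_costs = C @ prob
    return int(np.argmin(expected_costs))

# Hypothetical costs: class 0 = fraudulent, class 1 = legitimate;
# predicting 1 means approving the transaction.
C = np.array([[0.0, 10.0],     # predict 0 (refuse):  c00, c01
              [1000.0, 0.0]])  # predict 1 (approve): c10, c11
# Even though P(legitimate|x) = 0.95, refusing is optimal here.
print(optimal_prediction(np.array([0.05, 0.95]), C))  # prints 0
```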
1.1 Cost matrix properties
A cost matrix C always has the following structure when there are only two classes:

                     actual negative     actual positive
  predict negative   C(0,0) = c00        C(0,1) = c01
  predict positive   C(1,0) = c10        C(1,1) = c11

Recent papers have followed the convention that cost matrix rows correspond to alternative predicted classes, while columns correspond to actual classes, i.e. row/column = i/j = predicted/actual.
In our notation, the cost of a false positive is c10 while the cost of a false negative is c01. Conceptually, the cost of labeling an example incorrectly should always be greater than the cost of labeling it correctly. Mathematically, it should always be the case that c10 > c00 and c01 > c11. We call these conditions the "reasonableness" conditions.
Suppose that the first reasonableness condition is violated, so c00 ≥ c10 but still c01 > c11. In this case the optimal policy is to label all examples positive. Similarly, if c10 > c00 but c11 ≥ c01 then it is optimal to label all examples negative. We leave the case where both reasonableness conditions are violated for the reader to analyze.
Margineantu [2000] has pointed out that for some cost matrices, some class labels are never predicted by the optimal policy as given by Equation (1). We can state a simple, intuitive criterion for when this happens. Say that row m dominates row n in a cost matrix C if for all j, C(m, j) ≥ C(n, j). In this case the cost of predicting n is no greater than the cost of predicting m, regardless of what the true class j is. So it is optimal never to predict m. As a special case, the optimal prediction is always n if row n is dominated by all other rows in a cost matrix. The two reasonableness conditions for a two-class cost matrix imply that neither row in the matrix dominates the other.
Given a cost matrix, the decisions that are optimal are unchanged if each entry in the matrix is multiplied by a positive constant. This scaling corresponds to changing the unit of account for costs. Similarly, the decisions that are optimal are unchanged if a constant is added to each entry in the matrix. This shifting corresponds to changing the baseline away from which costs are measured. By scaling and shifting entries, any two-class cost matrix that satisfies the reasonableness conditions can be transformed into a simpler matrix that always leads to the same decisions:

       0      c'01
       1      c'11

where c'01 = (c01 − c00)/(c10 − c00) and c'11 = (c11 − c00)/(c10 − c00). From a matrix perspective, a 2x2 cost matrix effectively has two degrees of freedom.
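For concreteness, here is a small Python sketch (not from the paper) of the scale-and-shift normalization just described: subtract c00 from every entry, then divide by c10 − c00, leaving the two remaining degrees of freedom c'01 and c'11.

```python
import numpy as np

# Normalize a reasonable 2x2 cost matrix so that c00 = 0 and c10 = 1;
# by the scaling and shifting arguments above, optimal decisions are
# unchanged.
def normalize_cost_matrix(C):
    shifted = C - C[0, 0]            # shift: set c00 to 0
    return shifted / shifted[1, 0]   # scale: set c10 to 1 (c10 > c00)

C = np.array([[2.0, 10.0],
              [6.0, 3.0]])
print(normalize_cost_matrix(C))
# [[0.   2.  ]   c'01 = (10 - 2) / (6 - 2) = 2
#  [1.   0.25]]  c'11 = (3 - 2)  / (6 - 2) = 0.25
```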
1.2 Costs versus benefits
Although most recent research in machine learning has used
the terminology of costs, doing accounting in terms of bene-
fits is generally preferable, because avoiding mistakes is eas-
ier, since there is a natural baseline from which to measure
all benefits, whether positive or negative. This baseline is the
state of the agent before it takes a decision regarding an ex-
ample. After the agent has made the decision, if it is better
off, its benefit is positive. Otherwise, its benefit is negative.
When thinking in terms of costs, it is easy to posit a cost
matrix that is logically contradictory because not all entries
in the matrix are measured from the same baseline. For ex-
ample, consider the so-called German credit dataset that was
published as part of the Statlog project [Michie et al., 1994].
The cost matrix given with this dataset is as follows:
actual bad actual good
predict bad 0 1
predict good 5 0
Here examples are people who apply for a loan from a bank.
"Actual good" means that a customer would repay a loan
while “actual bad” means that the customer would default.
The action associated with “predict bad” is to deny the loan.
Hence, the cashflow relative to any baseline associated with
this prediction is the same regardless of whether “actual
good” or “actual bad” is true. In every economically reason-
able cost matrix for this domain, both entries in the “predict
bad” row must be the same.
Costs or benefits can be measured against any baseline, but
the baseline must be fixed. An opportunity cost is a fore-
gone benefit, i.e. a missed opportunity rather than an actual
penalty. It is easy to make the mistake of measuring different
opportunity costs against different baselines. For example,
the erroneous cost matrix above can be justified informally as
follows: “The cost of approving a good customer is zero, and
the cost of rejecting a bad customer is zero, because in both
cases the correct decision has been made. If a good customer
is rejected, the cost is an opportunity cost, the foregone profit
of 1. If a bad customer is approved for a loan, the cost is the
lost loan principal of 5.”
To see concretely that the reasoning in quotes above is incorrect, suppose that the bank has one customer of each of the four types. Clearly the cost matrix above is intended to imply that the net change in the assets of the bank is then −4. Alternatively, suppose that we have four customers who receive loans and repay them. The net change in assets is then +4. Regardless of the baseline, any method of accounting should give a difference of 8 between these scenarios. But with the erroneous cost matrix above, the first scenario gives a total cost of 6, while the second scenario gives a total cost of 0.
In general the amount in some cells of a cost or benefit
matrix may not be constant, and may be different for different
examples. For example, consider the credit card transactions
domain. Here the benefit matrix might be

             fraudulent    legitimate
  refuse     $20           −$20
  approve    −x            0.02x

where x is the size of the transaction in dollars.
a fraudulent transaction costs the amount of the transaction
because the bank is liable for the expenses of fraud. Refusing
a legitimate transaction has a non-trivial cost because it an-
noys a customer. Refusing a fraudulent transaction has a non-
trivial benefit because it may prevent further fraud and lead to
the arrest of a criminal. Research on cost-sensitive learning
and decision-making when costs may be example-dependent
is only just beginning [Zadrozny and Elkan, 2001a].
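A sketch of decision-making with this example-dependent benefit matrix follows; the function and its sample inputs are illustrative assumptions, not material from the paper. The action with the larger expected benefit is chosen, so the optimal action flips as the transaction size x grows.

```python
# Expected-benefit decision for the credit card matrix above:
# refuse pays $20 on fraud and -$20 on legitimate transactions;
# approve pays -x on fraud and 0.02x on legitimate transactions.
def should_approve(p_fraud, x):
    benefit_refuse = p_fraud * 20.0 + (1 - p_fraud) * (-20.0)
    benefit_approve = p_fraud * (-x) + (1 - p_fraud) * (0.02 * x)
    return benefit_approve >= benefit_refuse

print(should_approve(0.05, 100))   # True: small transaction, approve
print(should_approve(0.05, 2000))  # False: large transaction, refuse
```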
1.3 Making optimal decisions
In the two-class case, the optimal prediction is class 1 if and only if the expected cost of this prediction is less than or equal to the expected cost of predicting class 0, i.e. if and only if

$$P(0|x)\, c_{10} + P(1|x)\, c_{11} \le P(0|x)\, c_{00} + P(1|x)\, c_{01}$$

which is equivalent to

$$(1-p)\, c_{10} + p\, c_{11} \le (1-p)\, c_{00} + p\, c_{01}$$

given p = P(1|x). If this inequality is in fact an equality, then predicting either class is optimal.

The threshold for making optimal decisions is p* such that

$$(1-p^*)\, c_{10} + p^*\, c_{11} = (1-p^*)\, c_{00} + p^*\, c_{01}.$$

Assuming the reasonableness conditions, the optimal prediction is class 1 if and only if p ≥ p*. Rearranging the equation for p* leads to the solution

$$p^* = \frac{c_{10} - c_{00}}{c_{10} - c_{00} + c_{01} - c_{11}} \qquad (2)$$

assuming the denominator is nonzero, which is implied by the reasonableness conditions. This formula for p* shows that any 2x2 cost matrix has essentially only one degree of freedom from a decision-making perspective, although it has two degrees of freedom from a matrix perspective. The cause of the apparent contradiction is that the optimal decision-making policy is a nonlinear function of the cost matrix.
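Equation (2) is straightforward to compute; a minimal Python sketch, with an illustrative cost matrix:

```python
# Decision threshold p* implied by a 2x2 cost matrix (Equation (2)).
# C[i][j] = cost of predicting class i when the true class is j.
def threshold(C):
    c00, c01 = C[0]
    c10, c11 = C[1]
    return (c10 - c00) / (c10 - c00 + c01 - c11)

# With c00 = c11 = 0, c10 = 1, and c01 = 5 (a false negative is five
# times worse than a false positive), predict positive when p >= 1/6.
print(threshold([[0.0, 5.0], [1.0, 0.0]]))  # 0.1666...
```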
2 Achieving cost-sensitivity by rebalancing
In this section we turn to the question of how to obtain a clas-
sifier that is useful for cost-sensitive decision-making.
Standard learning algorithms are designed to yield classifiers that maximize accuracy. In the two-class case, these classifiers implicitly make decisions based on the probability threshold 0.5. The conclusion of the previous section was that we need a classifier that, given an example x, says whether or not P(y=1|x) ≥ p* for some target threshold p* that in general is different from 0.5. How can a standard learning algorithm be made to produce a classifier that makes decisions based on a general p*?
The most common method of achieving this objective is to rebalance the training set given to the learning algorithm, i.e. to change the proportion of positive and negative training examples in the training set. Although rebalancing is a common idea, the general formula for how to do it correctly has not been published. The following theorem provides this formula.
Theorem 1: To make a target probability threshold p* correspond to a given probability threshold p0, the number of negative examples in the training set should be multiplied by

$$\frac{p^*}{1 - p^*} \cdot \frac{1 - p_0}{p_0}.$$
While the formula in Theorem 1 is simple, the proof of its
correctness is not. We defer the proof until the end of the
next section.
In the special case where the threshold used by the learning method is p0 = 0.5 and c00 = c11 = 0, the theorem says that the number of negative training examples should be multiplied by p*/(1 − p*) = c10/c01. This special case is used by Breiman et al. [1984].
The directionality of Theorem 1 is important to understand. Suppose we have a learning algorithm L that yields classifiers that make predictions based on a probability threshold p0. Given a training set S and a desired probability threshold p*, the theorem says how to create a training set S' by changing the number of negative training examples such that L applied to S' gives the desired classifier.
Theorem 1 does not say in what way the number of nega-
tive examples should be changed. If a learning algorithm can
use weights on training examples, then the weight of each
negative example can be set to the factor given by the theo-
rem. Otherwise, we must do oversampling or undersampling.
Oversampling means duplicating examples, and undersam-
pling means deleting examples.
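As a sketch of the weighting route (an illustration under the assumptions of Theorem 1, not code from the paper), the multiplier can be computed and attached to each negative example as follows:

```python
# Theorem 1 multiplier: make a target threshold p_star correspond to the
# threshold p0 that the learning method actually uses.
def negative_weight(p_star, p0=0.5):
    return (p_star / (1.0 - p_star)) * ((1.0 - p0) / p0)

def example_weights(labels, p_star, p0=0.5):
    """labels: 0/1 class labels, with 1 the positive class."""
    w = negative_weight(p_star, p0)
    return [1.0 if y == 1 else w for y in labels]

# With p0 = 0.5 and a target p* = 1/6, each negative example gets
# weight p*/(1 - p*) = 0.2.
print(negative_weight(1.0 / 6.0))  # 0.2
```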
Sampling can be done either randomly or deterministically.
While deterministic sampling can reduce variance, it risks in-
troducing bias, if the non-random choice of examples to du-
plicate or eliminate is correlated with some property of the
examples. Undersampling that is deterministic in the sense
that the fraction of examples with each value of a certain fea-
ture is held constant is often called stratified sampling.
It is possible to change the number of positive examples in-
stead of or as well as changing the number of negative exam-
ples. However in many domains one class is rare compared to
the other, and it is important to keep all available examples of
the rare class. In these cases, if we call the rare class the posi-
tive class, Theorem 1 says directly how to change the number
of common examples without discarding or duplicating any
of the rare examples.
3 New probabilities given a new base rate
In this section we state and prove a theorem of independent
interest that happens also to be the tool needed to prove The-
orem 1. The new theorem answers the question of how the
predicted class membership probability of an example should
change in response to a change in base rates. Suppose that
Rd59,A
is correct for an example
, if
is drawn
from a population with base rate
e
fd,:
positive ex-
amples. But suppose that in fact
is drawn from a population
with base rate
e
C
. What is
RCg)hCi,A
?
We make the assumption that the shift in base rate is
the only change in the population to which
belongs.
Formally, we assume that within the positive and nega-
tive subpopulations, example probabilities are unchanged:
hC
O,.j)
O,.
and
hCE
k)$&l)
G)$&
.
Given these assumptions, the following theorem shows how
to compute
RC
as a function of
R
,
e
, and
e
C
.
Theorem 2: In the context just described,

$$p' = \frac{b'(p - pb)}{b - pb + b'p - bb'}.$$
Proof: Using Bayes' rule, p = P(y=1|x) is

$$\frac{P(x|y=1)\, P(y=1)}{P(x)}.$$

Because y=1 and y=0 are mutually exclusive, P(x) is

$$P(x|y=1)\, P(y=1) + P(x|y=0)\, P(y=0).$$

Let c = P(x|y=1), let d = P(x|y=0), and let σ = d/c. Then

$$p = \frac{cb}{cb + d(1-b)} = \frac{b}{b + \sigma(1-b)}.$$

Similarly,

$$p' = \frac{b'}{b' + \sigma(1-b')}.$$

Now we can solve for σ as a function of p and b. We have pb + pσ(1−b) = b, so σ = (b − pb)/(p − pb). Then the denominator for p' is

$$b' + \sigma(1-b') = \frac{b - pb}{p - pb} + b'\left(1 - \frac{b - pb}{p - pb}\right) = \frac{b - pb + b'p - bb'}{p - pb}.$$

Finally we have

$$p' = \frac{b'(p - pb)}{b - pb + b'p - bb'}.$$
It is important to note that Theorem 2 is a statement about
true probabilities given different base rates. The proof does
not rely on how probabilities may be estimated based on some
learning process. In particular, the proof does not use any
assumptions of independence or conditional independence, as
made for example by a naive Bayesian classifier.
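Theorem 2 translates directly into code; a minimal sketch with an invented example:

```python
# Theorem 2: adjust a probability p, assumed correct under base rate b,
# to the probability correct under a new base rate b_new.
def adjust_base_rate(p, b, b_new):
    return b_new * (p - p * b) / (b - p * b + b_new * p - b * b_new)

# A classifier trained on a balanced set (b = 0.5) outputs p = 0.8 for
# some example; if the true base rate is 0.1, the corrected probability
# is much smaller.
print(adjust_base_rate(0.8, 0.5, 0.1))  # about 0.308
```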
[Figure 1: p' as a function of p and b' (surface plot with axes p, b', and p'), when b = 0.5.]

If a classifier yields estimated probabilities p̂ that we assume are correct given a base rate b, then Theorem 2 lets us compute estimated probabilities p̂' that are correct given a different base rate b'. From this point of view, the theorem has a remarkable aspect. It lets us use a classifier learned from a training set drawn from one probability distribution on a test set drawn from a different probability distribution. The theorem thus relaxes one of the most fundamental assumptions of almost all research on machine learning, that training and test sets are drawn from the same population.
The insight in the proof is the introduction of the variable σ that is the ratio of P(x|y=0) and P(x|y=1). If we try to compute the actual values of these probabilities, we find that we have more variables to solve for than we have simultaneous equations. Fortunately, all we need to know for any particular example x is the ratio σ.
The special case of Theorem 2 where p' = 0.5 was recently worked out independently by Weiss and Provost [2001]. The case where b = 0.5 is also interesting. Suppose that we do not know the base rate of positive examples at the time we learn a classifier. Then it is reasonable to use a training set with b = 0.5. Theorem 2 says how to compute probabilities p' later that are correct given that the population of test examples has base rate b'. Specifically,

$$p' = \frac{b' p / 2}{1/2 - p/2 + b'p - b'/2} = \frac{p}{p + (1-p)(1-b')/b'}.$$

This function of p and b' is plotted in Figure 1.
Using Theorem 2 as a lemma, we can now prove Theorem 1 with a slight change of notation.

Theorem 1: To make a target probability threshold p correspond to a given probability threshold p', the number of negative training examples should be multiplied by

$$\frac{p}{1-p} \cdot \frac{1-p'}{p'}.$$

Proof: We want to compute an adjusted base rate b' such that for a classifier trained using this base rate, an estimated probability p' corresponds to a probability p for a classifier trained using the base rate b.

We need to compute the adjusted b' as a function of b, p, and p'. From the proof of Theorem 2,

$$p'b - p'pb + p'b'p - p'bb' = b'p - b'pb.$$

Collecting all the b' terms on the left, we have

$$b'p - b'pb - p'b'p + p'b'b = p'b - p'pb,$$

which gives that the adjusted base rate should be

$$b' = \frac{p'b - p'pb}{p - pb - p'p + p'b} = \frac{p'b(1-p)}{p - pb - p'p + p'b}.$$

Suppose that b = 1/(1+n) and b' = 1/(1+n'), so the number of negative training examples should be multiplied by n'/n to get the adjusted base rate b'. We have that n' = (1−b')/b' is

$$n' = \frac{p - pb - p'p + p'b - p'b + p'pb}{p'b(1-p)} = \frac{p(1-b) - p'p(1-b)}{p'b(1-p)} = \frac{p(1-b)(1-p')}{p'b(1-p)}.$$

Therefore

$$\frac{n'}{n} = \frac{p(1-b)(1-p')}{p'b(1-p)} \cdot \frac{b}{1-b} = \frac{p(1-p')}{p'(1-p)}.$$
Note that the effective cardinality of the subset of negative
training examples must be changed in a way that does not
change the distribution of examples within this subset.
4 Effects of changing base rates
Changing the training set prevalence of positive and negative
examples is a common method of making a learning algo-
rithm cost-sensitive. A natural question is what effect such a
change has on the behavior of standard learning algorithms.
Separately, many researchers have proposed duplicating or
discarding examples when one class of examples is rare, on
the assumption that standard learning methods perform bet-
ter when the prevalence of different classes is approximately
equal [Kubat and Matwin, 1997; Japkowicz, 2000]. The purpose of this section is to investigate this assumption.
4.1 Changing base rates and Bayesian learning
Given an example x, a Bayesian classifier applies Bayes' rule to compute the probability of each class j as

$$P(j|x) = \frac{P(x|j)\, P(j)}{P(x)}.$$

Typically P(x|j) is computed by a function learned from a training set, P(j) is estimated as the training set frequency of class j, and P(x) is computed indirectly by solving the equation Σ_j P(j|x) = 1.
A Bayesian learning method essentially learns a model P(x|j) of each class j separately. If the frequency of a class is changed in the training set, the only change is to the estimated base rate P(j) of each class. Therefore there is little reason to expect the accuracy of decision-making with a Bayesian classifier to be higher with any particular base rates.
Naive Bayesian classifiers are the most important special
case of Bayesian classification. A naive Bayesian classi-
fier is based on the assumption that within each class, the
values of the attributes of examples are independent. It
is well-known that these classifiers tend to give inaccurate
probability estimates [Domingos and Pazzani, 1996]. Given
an example x, suppose that a naive Bayesian classifier computes n(x) as its estimate of P(y=1|x). Usually n(x) is too extreme: for most x, either n(x) is close to 0 and then n(x) < P(y=1|x), or n(x) is close to 1 and then n(x) > P(y=1|x). However, the ranking of examples by naive Bayesian classifiers tends to be correct: if n(x) < n(x') then P(y=1|x) ≤ P(y=1|x'). This fact suggests that given a cost-sensitive application where optimal decision-making uses the probability threshold p*, one should empirically determine a different threshold t such that n(x) ≥ t is equivalent to P(y=1|x) ≥ p*. This procedure is likely to improve the accuracy of decision-making, while changing the proportion of negative examples using Theorem 1 in order to use the threshold 0.5 is not.
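One way to operationalize this suggestion (an illustrative sketch; the paper does not prescribe a specific search procedure) is to sweep candidate thresholds over held-out scores n(x) and keep the one that minimizes empirical cost:

```python
# Choose a threshold t on naive Bayes scores n(x) by minimizing total
# misclassification cost on a validation set.
def tune_threshold(scores, labels, C):
    """scores: n(x); labels: true classes 0/1; C[i][j] as in Section 1."""
    def total_cost(t):
        return sum(C[1 if s >= t else 0][y] for s, y in zip(scores, labels))
    candidates = sorted(set(scores)) + [1.1]  # 1.1 means: never predict 1
    return min(candidates, key=total_cost)

scores = [0.05, 0.2, 0.4, 0.7, 0.9, 0.95]  # hypothetical n(x) values
labels = [0, 0, 1, 0, 1, 1]
C = [[0, 5], [1, 0]]  # false negatives cost 5, false positives cost 1
print(tune_threshold(scores, labels, C))  # 0.4
```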
4.2 Decision tree growing
We turn our attention now to standard decision tree learning
methods, which have two phases. In the first phase a tree is
grown top-down, while in the second phase nodes are pruned
from the tree. We discuss separately the effect on each phase
of changing the proportion of negative and positive training
examples.
A splitting criterion is a metric applied to an attribute that measures how homogeneous the induced subsets are, if a training set is partitioned based on the values of this attribute. Consider a discrete attribute A that has values a_1 through a_m for some m ≥ 2. In the two-class case, standard splitting criteria have the form

$$I(A) = \sum_{j=1}^{m} P(A = a_j)\, f(p_j(1 - p_j))$$

where p_j = P(y=1 | A = a_j) and all probabilities are frequencies in the training set to be split based on A. The function f(p(1−p)) measures the impurity or heterogeneity of each subset of training examples. All such functions are qualitatively similar, with a unique maximum at p = 0.5, and equal minima at p = 0 and p = 1.
Drummond and Holte [2000] have shown that for two-valued attributes the impurity function 2√(p(1−p)) suggested by Kearns and Mansour [1996] is invariant to changes in the proportion of different classes in the training data. We prove here a more general result that applies to all discrete-valued attributes and that shows that related impurity functions, including the Gini index [Breiman et al., 1984], are not invariant to base rate changes.

Theorem 3: Suppose f(p(1−p)) = c(p(1−p))^q where c > 0 and q > 0. For any collection of discrete-valued attributes, the attribute that minimizes I using f is the same regardless of changes in the base rate P(y=1) of the training set if q = 0.5, and not otherwise in general.
Proof: For any attribute A, by definition

$$I(A) = \sum_{j=1}^{m} P(a_j)\, f(p_j(1 - p_j))$$

where p_j = P(y=1|a_j), 1 − p_j = P(y=0|a_j), and a_1 through a_m are the possible values of A. So by Bayes' rule, p_j(1 − p_j) is

$$\frac{P(a_j|y=1)\, P(y=1)}{P(a_j)} \cdot \frac{P(a_j|y=0)\, P(y=0)}{P(a_j)}.$$

Grouping the P(a_j) factors for each j gives that I(A)/c is

$$\sum_{j=1}^{m} P(a_j)^{1-2q} \left[ P(a_j|y=1)\, P(y=1)\, P(a_j|y=0)\, P(y=0) \right]^q.$$

Now the base rate factors can be brought outside the sum, so I(A)/c is [P(y=1) P(y=0)]^q times the sum

$$\sum_{j=1}^{m} P(a_j)^{1-2q} \left[ P(a_j|y=1)\, P(a_j|y=0) \right]^q. \qquad (3)$$

Because [P(y=1) P(y=0)]^q is constant for all attributes, the attribute A for which I(A) is minimum is determined by the minimum of (3). If 2q = 1 then (3) depends only on P(a_j|y=1) and P(a_j|y=0), which do not depend on the base rates. Otherwise, (3) is different for different base rates because

$$P(a_j) = P(a_j|y=1)\, P(y=1) + P(a_j|y=0)\, P(y=0),$$

unless the attribute A is independent of the class y, that is P(a_j|y=1) = P(a_j|y=0) for 1 ≤ j ≤ m.

With q = 0.5, the sum (3) has its maximum value 1 if A is independent of y. As desired, the sum is smaller otherwise, if A and y are correlated and hence splitting on A is reasonable.
Theorem 3 implies that changing the proportion of positive or negative examples in the training set has no effect on the structure of the tree if the decision tree growing method uses the 2√(p(1−p)) impurity criterion. If the algorithm uses a different criterion, such as the C4.5 entropy measure, the effect is usually small, because all impurity criteria are similar. The experimental results of Drummond and Holte [2000] and Dietterich et al. [1996] show that the 2√(p(1−p)) criterion normally leads to somewhat smaller unpruned decision trees, sometimes leads to more accurate trees, and never leads to much less accurate trees. Therefore we can recommend its use, and we can conclude that regardless of the impurity criterion, applying Theorem 1 is not likely to have much influence on the growing phase of decision tree learning.
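The invariance claimed by Theorem 3 can be checked numerically; the following sketch (illustrative, not from the paper) scores one attribute with f(p(1−p)) = (p(1−p))^q before and after tripling the negative examples:

```python
import numpy as np

# q = 0.5 gives the Kearns-Mansour criterion sqrt(p(1-p)) up to a
# constant; q = 1.0 is proportional to the Gini index.
def split_score(values, labels, q):
    total = 0.0
    for v in np.unique(values):
        mask = values == v
        p = labels[mask].mean()          # frequency of class 1 in subset
        total += mask.mean() * (p * (1 - p)) ** q
    return total

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 5000)
A = np.where(rng.random(5000) < 0.75, y, 1 - y)  # attribute correlated with y

reps = np.where(y == 0, 3, 1)                    # triple every negative
y2, A2 = np.repeat(y, reps), np.repeat(A, reps)

for q in (0.5, 1.0):
    print(q, split_score(A, y, q), split_score(A2, y2, q))
# For q = 0.5 the score changes only by the constant factor
# sqrt(P(y=1)P(y=0)), the same for every attribute, so attribute
# rankings are preserved; for q = 1.0 the sum itself depends on the
# base rate.
```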
4.3 Decision tree pruning
Standard methods for pruning decision trees are highly sen-
sitive to the prevalence of different classes among training
examples. If all classes except one are rare, then C4.5 often
prunes the decision tree down to a single node that classifies
all examples as members of the common class. Such a clas-
sifier is useless for decision-making if failing to recognize an
example in a rare class is an expensive error.
Several papers have examined recently the issue of how to
obtain good probability estimates from decision trees [Brad-
ford et al., 1998; Provost and Domingos, 2000; Zadrozny and
Elkan, 2001b]. It is clear that it is necessary to use a smooth-
ing method to adjust the probability estimates at each leaf of
a decision tree. It is not so clear what pruning methods are
best.
The experiments of Bauer and Kohavi [1999] suggest that no pruning is best when using a decision tree with probability smoothing. The overall conclusion of Bradford et al. [1998] is that the best pruning is either no pruning or what they call "Laplace pruning." The idea of Laplace pruning is:
1. Do Laplace smoothing: If n training examples reach a node, of which k are positive, let the estimate at this node of P(y=1|x) be (k+1)/(n+2).
2. Compute the expected loss at each node using the
smoothed probability estimates, the cost matrix, and the
training set.
3. If the expected loss at a node is less than the sum of the
expected losses at its children, prune the children.
We can show intuitively that Laplace pruning is similar to
no pruning. In the absence of probability smoothing, the ex-
pected loss at a node is always greater than or equal to the sum
of the expected losses at its children. Equality holds only if
the optimal predicted class at each child is the same as the
optimal predicted class at the parent. Therefore, in the ab-
sence of smoothing, step (3) cannot change the meaning of a
decision tree, i.e. the classes predicted by the tree, so Laplace
pruning is equivalent to no pruning.
With probability smoothing, if the expected loss at a node
is less than the sum of the expected losses at its children, the
difference must be caused by smoothing, so without smooth-
ing there would presumably be equality. So pruning the chil-
dren is still only a simplification that leaves the meaning of
the tree unchanged. Note that the effect of Laplace smoothing is small at internal tree nodes, because at these nodes typically k >> 1 and n >> 2.
In summary, growing a decision tree can be done in a
cost-insensitive way. When using a decision tree to esti-
mate probabilities, it is preferable to do no pruning. If costs
are example-dependent, then decisions should be made using
smoothed probability estimates and Equation (1). If costs are
fixed, i.e. there is a single well-defined cost matrix, then each
node in the unpruned decision tree can be labeled with the
optimal predicted class for that leaf. If all the leaves under a
certain node are labeled with the same class, then the subtree
under that node can be eliminated. This simplification makes
the tree smaller but does not change its predictions.
5 Conclusions
This paper has reviewed the basic concepts behind optimal
learning and decision-making when different misclassifica-
tion errors cause different losses. For the two-class case, we
have shown rigorously how to increase or decrease the pro-
portion of negative examples in a training set in order to make
optimal cost-sensitive classification decisions using a classi-
fier learned by a standard non-cost-sensitive learning method.
However, we have investigated the behavior of Bayesian and
decision tree learning methods, and concluded that changing
the balance of negative and positive training examples has
little effect on learned classifiers. Accordingly, the recom-
mended way of using one of these methods in a domain with
differing misclassification costs is to learn a classifier from
the training set as given, and then to use Equation (1) or Equa-
tion (2) directly, after smoothing probability estimates and/or
adjusting the threshold of Equation (2) empirically if neces-
sary.
References

[Bauer and Kohavi, 1999] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36:105–139, 1999.

[Bradford et al., 1998] J. Bradford, C. Kunz, R. Kohavi, C. Brunk, and C. Brodley. Pruning decision trees with misclassification costs. In Proceedings of the European Conference on Machine Learning, pages 131–136, 1998.

[Breiman et al., 1984] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, California, 1984.

[Dietterich et al., 1996] T. G. Dietterich, M. Kearns, and Y. Mansour. Applying the weak learning framework to understand and improve C4.5. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 96–104. Morgan Kaufmann, 1996.

[Domingos and Pazzani, 1996] Pedro Domingos and Michael Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 105–112. Morgan Kaufmann, 1996.

[Drummond and Holte, 2000] Chris Drummond and Robert C. Holte. Exploiting the cost (in)sensitivity of decision tree splitting criteria. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 239–246, 2000.

[Japkowicz, 2000] N. Japkowicz. The class imbalance problem: Significance and strategies. In Proceedings of the International Conference on Artificial Intelligence, Las Vegas, June 2000.

[Kearns and Mansour, 1996] M. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Annual ACM Symposium on the Theory of Computing, pages 459–468. ACM Press, 1996.

[Kubat and Matwin, 1997] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided sampling. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 179–186. Morgan Kaufmann, 1997.

[Margineantu, 2000] Dragos Margineantu. On class probability estimates and cost-sensitive evaluation of classifiers. In Workshop Notes, Workshop on Cost-Sensitive Learning, International Conference on Machine Learning, June 2000.

[Michie et al., 1994] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.

[Provost and Domingos, 2000] Foster Provost and Pedro Domingos. Well-trained PETs: Improving probability estimation trees. Technical Report CDER #00-04-IS, Stern School of Business, New York University, 2000.

[Weiss and Provost, 2001] Gary M. Weiss and Foster Provost. The effect of class distribution on classifier learning. Technical Report ML-TR 43, Department of Computer Science, Rutgers University, 2001.

[Zadrozny and Elkan, 2001a] Bianca Zadrozny and Charles Elkan. Learning and making decisions when costs and probabilities are both unknown. Technical Report CS2001-0664, Department of Computer Science and Engineering, University of California, San Diego, January 2001.

[Zadrozny and Elkan, 2001b] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001. To appear.