
A New Model for Counterfactual Analysis for Functional Data

Emilio Carrizosa*1, Jasone Ramírez-Ayerbe†1, and Dolores Romero Morales‡2

1 Instituto de Matemáticas de la Universidad de Sevilla, Seville, Spain
2 Department of Economics, Copenhagen Business School, Frederiksberg, Denmark
Abstract

Counterfactual explanations have become a very popular interpretability tool to understand and explain how complex machine learning models make decisions for individual instances. Most of the research on counterfactual explainability focuses on tabular and image data and much less on models dealing with functional data. In this paper, a counterfactual analysis for functional data is addressed, in which the goal is to identify the samples of the dataset of which the counterfactual explanation is made, as well as how they are combined so that the individual instance and its counterfactual are as close as possible. Our methodology can be used with different distance measures for multivariate functional data and is applicable to any score-based classifier. We illustrate our methodology using two different real-world datasets, one univariate and another multivariate.
Keywords— Counterfactual Explanations; Mathematical Optimization; Functional
Data; Prototypes; Random Forests
1 Introduction
Machine Learning models are increasingly being used in high-stakes decision-making settings such as healthcare, law or finance. Many of these Machine Learning models are black boxes and therefore do not explain how they arrive at decisions in a way that humans can understand. Nowadays, there is an increasing number of laws and regulations [16] coming into place to enforce that the decisions of algorithms be interpretable (a.k.a. transparent) [11, 12, 14, 26, 34]. Interpretability is enhanced by selecting the features that have the greatest impact on the model as a whole [2, 4, 35], but also by identifying those that matter locally for the decision made for each individual [24, 25, 28].

*Emilio Carrizosa: ecarrizosa@us.es
†Jasone Ramírez-Ayerbe: mrayerbe@us.es
‡Dolores Romero Morales: drm.eco@cbs.dk
A specific type of post-hoc interpretability tool is the counterfactual explanation, i.e., a set of changes that can be made to an instance such that the given machine learning model would have classified it in a different class. For instance, in a credit scoring application one may be interested in knowing how the debt history of a person should have been for the prediction to change to "loan granted". See [32] for the seminal paper on counterfactual explanations in predictive modelling and [17, 22, 31] for recent surveys. Additionally, there are different criteria or constraints that a counterfactual should satisfy [27, 32], like actionability, plausibility and sparsity. Actionability ensures that a counterfactual does not change immovable features, plausibility is necessary to guarantee that counterfactual explanations are realistic and do not depart from what can be observed in the sample, whereas sparsity ensures, for instance, that only a few features are moved.
This paper studies score-based multiclass classification methods. For these, a quite versatile model-based framework can be built for counterfactual analysis. Let $f: \mathcal{X} \to \{1, \dots, K\}$ be a multiclass classifier based on score functions $(f_1, \dots, f_K)$, where $K$ is the number of classes and $\mathcal{X}$ is the feature space. Given an instance $x^0 \in \mathcal{X}$, let $f(x^0) \in \arg\max_k f_k(x^0)$ denote the class assigned by the classifier to $x^0$. For a fixed class $k^*$, the counterfactual instance of $x^0$ is defined in this paper as the feasible $x$ obtained with a minimal perturbation of $x^0$ and classified by the score-based classifier in class $k^*$. This yields the following optimization problem:

$$
\begin{array}{lll}
\min_{x} & C(x, x^0) & \\
\text{s.t.} & f_{k^*}(x) \ge f_k(x), & k = 1, \dots, K,\ k \neq k^* \\
& x \in \mathcal{X}^0.
\end{array}
\qquad (1)
$$
The objective function $C(x, x^0)$ is a cost function that measures the dissimilarity between the given instance $x^0$ and the counterfactual instance $x$. In the feasible region, we have two types of constraints. In the first one, we ensure that the counterfactual $x$ is classified in class $k^*$ by imposing that the score $f_{k^*}(x)$ is the maximum across all $k$. In the second type of constraint, we ensure that the counterfactual is in $\mathcal{X}^0$, the set containing the actionability and plausibility constraints.
Most of the work on counterfactual explanations focuses on tabular and image data [22], and much less on functional data. This type of data, which arises in important domains such as econometrics, energy, and marketing [21, 29, 30], is addressed in this paper. In principle, one could apply the methods developed for tabular data also to functional data, just by considering that each feature is the measurement of the function at a time instant. However, doing so, fundamental information such as the autocorrelation structure along consecutive time instants would be lost. For this reason, some works on counterfactual explanations exploiting the functional nature of data have been suggested, e.g., [1, 10], but, as far as the authors know, none of them uses the structure and properties of the machine learning model.
Our aim is to model and solve Problem (1) when data are functions. The counterfactual explanation $x \in \mathcal{X}^0$ is a hypothetical instance that is as similar as possible to the instance $x^0$, generated by combining various existing instances, hereafter prototypes, from the dataset. First, we can deal with multivariate functional data, i.e., our data $x$ are functions taking values in some $\mathbb{R}^J$. Second, we are able to identify the instances from the dataset that generate the counterfactual for each instance, controlling how sparse the counterfactual explanation is, in terms of both the number of prototypes used to create the counterfactual $x$ and the number of features changed to move from $x^0$ to $x$. Third, we can handle a collection of distance measures, while ensuring that the cost function $C$ is tractable. We will show that, under mild assumptions on the set defining the actionability and plausibility constraints, obtaining counterfactual explanations reduces to solving a Mixed Integer Convex Quadratic Model with linear constraints, which can be solved with standard optimization packages.
The remainder of the paper is organized as follows. In Section 2, we model the problem of finding counterfactual explanations through optimization problem (1) when data are functions. In Section 3, we focus on counterfactual analysis for additive tree models. In Section 4, a numerical illustration using real-world datasets is provided. Finally, conclusions and possible lines of future research are provided in Section 5.
2 Counterfactual analysis for functional data
In this section, we will detail the formulation of Problem (1) when dealing with functional data. To do this, we need to model the structure of the counterfactual instances, the constraints associated with them, as well as the cost function $C$. This will be done in what follows. We postpone to the next section the analysis of a state-of-the-art class of score-based classification models, namely, additive classification trees.
Counterfactual instances and constraints
An instance $x \in \mathcal{X} \subseteq \mathcal{F}^J$ is defined as a vector of $J$ functional features. Hence, $x = (x_1(t), \dots, x_J(t))$, where $x_j: [0, T] \to \mathbb{R}$, $j = 1, \dots, J$, are Riemann integrable functions defined on the interval $[0, T]$. Notice that $x_j(t)$ may be a static feature, e.g., age, defined then as a constant function. Note also that, for a given time instant $t \in [0, T]$, $x(t)$ is a vector in $\mathbb{R}^J$, whose components may represent $J$ measurements of independent attributes, or they may be related, e.g., one can include an attribute $x_j$ and some of its derivatives to provide information also on, e.g., the growth speed or the convexity of $x_j$.
Let us discuss the constraints on $x$ in Problem (1). First, we need to ensure that the counterfactual explanation $x$ is realistic. In the case of functional data, this yields an infinite-dimensional optimization problem. To enhance the tractability of this requirement, we propose the use of instances of the dataset, i.e., prototypes, to generate the counterfactual explanation. Let $x^b$, $b = 1, \dots, B$, be all the instances that have been classified by the model in class $k^*$. For an instance $x^0 = (x^0_1, \dots, x^0_J)$, feature $j$ of the counterfactual explanation, $x_j$, is defined as the convex combination of the original feature $x^0_j$ and feature $j$ of all $B$ prototypes, $x^b_j$. Thus, the counterfactual explanation $x$ is defined for each feature $j$ as

$$x_j = \alpha^0_j x^0_j + \sum_{b=1}^{B} \alpha^b_j x^b_j, \qquad \text{where } \sum_{b=0}^{B} \alpha^b_j = 1, \quad j = 1, \dots, J.$$
In order to gain interpretability of the so-obtained counterfactual explanation $x$, we want to use as few prototypes $x^b$ as possible in its construction. For this reason we impose a maximum of $B_{\max}$ prototypes to be used, where $B_{\max}$ is a user-defined parameter.

In $\mathcal{X}^0$ we impose the actionability constraints on unmovable features, other constraints like upper or lower limits on the static variables, or constraints avoiding, e.g., that prototypes too far from $x^0$ are considered (which is done by setting to zero all coefficients $\alpha^b_j$ if the distance between $x^0$ and $x^b$ is above a threshold value).
Cost function
Recall that $C(x, x^0)$ is the cost of changing $x^0$ to $x$, which can be measured by the proximity between the curves defining $x$ and $x^0$.

The proximity between curves can be measured in several ways. One can use, for instance, the squared Euclidean distance:

$$\|x - x^0\|_2^2 = \int_0^T \sum_{j=1}^{J} \bigl(x_j(t) - x^0_j(t)\bigr)^2 \, dt \qquad (2)$$
Needless to say, different weights can be assigned to each feature in the expression
above.
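On discretised curves, the integral in (2) can be approximated by a quadrature rule; a small sketch, assuming the curves are sampled on a common grid:

```python
import numpy as np

# Sketch: approximate the squared L2 distance (2) between two J-variate curves
# sampled on a common grid t of n time instants in [0, T].
# x, x0: arrays of shape (J, n); t: array of shape (n,).
def squared_l2_distance(x, x0, t):
    pointwise = np.sum((x - x0) ** 2, axis=0)   # sum over the J features at each t
    return np.trapz(pointwise, t)               # integrate over [0, T]
```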
Another popular distance used in the literature [13, 33] is the Dynamic Time Warping (DTW) distance, which measures the dissimilarity between two functions that may be inspected at different speeds, see Figure 1. More explicitly, suppose we have $x$ and $x'$, discretised in two sequences of length $n$, so that the $J$-variate functions $x$ and $x'$ are replaced by $(x(t_1), \dots, x(t_n))$ and $(x'(t_1), \dots, x'(t_n))$. Observe that each $x(t)$, $x'(t)$ is a vector in $\mathbb{R}^J$. A warping path $\pi$ is a chain of pairs of the form $\pi = (q_{11}, q_{21}) \to (q_{12}, q_{22}) \to \dots \to (q_{1Q}, q_{2Q})$ of length $Q$, $n \le Q \le 2n - 1$, satisfying the following two conditions:

1. $(q_{11}, q_{21}) = (t_1, t_1)$, and $(q_{1Q}, q_{2Q}) = (t_n, t_n)$
2. $q_{1r} \le q_{1(r+1)} \le q_{1r} + 1$, and $q_{2r} \le q_{2(r+1)} \le q_{2r} + 1$, $r = 1, 2, \dots, Q - 1$

Let $\mathcal{W}$ denote the set of all warping paths. Then, the DTW distance $\mathrm{DTW}(x, x')$ between $x$ and $x'$ is the minimal squared Euclidean distance between pairs of the form $(x(q_{11}), \dots, x(q_{1Q}))$ and $(x'(q_{21}), \dots, x'(q_{2Q}))$ when $(q_{11}, q_{21}) \to (q_{12}, q_{22}) \to \dots \to (q_{1Q}, q_{2Q})$ is a warping path, i.e.,

$$
\begin{array}{ll}
\mathrm{DTW}(x, x') = \min & \sum_{r=1}^{Q} \sum_{j=1}^{J} \bigl(x_j(q_{1r}) - x'_j(q_{2r})\bigr)^2 \\
\text{s.t.} & \pi = (q_{11}, q_{21}) \to (q_{12}, q_{22}) \to \dots \to (q_{1Q}, q_{2Q}) \in \mathcal{W}
\end{array}
\qquad (3)
$$

Observe that DTW can be efficiently evaluated using dynamic programming [3].
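A sketch of this dynamic-programming evaluation, using the standard recursion and the squared Euclidean point cost of (3) (array shapes are assumptions):

```python
import numpy as np

# Sketch: evaluate DTW(x, x') for two J-variate series of length n by dynamic
# programming, using the squared Euclidean point cost of (3).
# x, x_prime: arrays of shape (J, n).
def dtw(x, x_prime):
    n = x.shape[1]
    cost = np.full((n + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for k in range(1, n + 1):
            d = np.sum((x[:, i - 1] - x_prime[:, k - 1]) ** 2)
            # a step may advance the first series, the second one, or both
            cost[i, k] = d + min(cost[i - 1, k], cost[i, k - 1], cost[i - 1, k - 1])
    return cost[n, n]
```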
Figure 1: Comparison between different warping paths between the functions $x$ (in blue) and $x'$ (in orange). (a) Warping path $\pi = (0,0) \to (1,1) \to (2,2) \to (3,3) \to (4,4) \to (5,5) \to (6,6) \to (7,7)$. (b) Optimal warping path $\pi = (0,0) \to (1,1) \to (2,2) \to (3,3) \to (4,4) \to (4,5) \to (5,6) \to (6,6) \to (7,7)$.
Additionally, $C$ may contain, on top of the distance-based term described above, other terms measuring, e.g., the number of features altered when moving from $x^0$ to $x$. In particular, in Section 3 we will discuss in detail the cases

$$C(x, x^0) = \lambda_0 \|x - x^0\|_0 + \lambda_2 \int_0^T \sum_{j=1}^{J} \bigl(x_j(t) - x^0_j(t)\bigr)^2 \, dt, \qquad (4)$$

and

$$C(x, x^0) = \lambda_0 \|x - x^0\|_0 + \lambda_2\, \mathrm{DTW}(x, x^0), \qquad (5)$$

where $\|x - x^0\|_0$ indicates how many components of $x^0 = (x^0_1, \dots, x^0_J)$ and $x = (x_1, \dots, x_J)$ are not equal,

$$\|x - x^0\|_0 = \bigl|\bigl\{ j : x^0_j \neq x_j \bigr\}\bigr|, \qquad (6)$$

and $\lambda_0, \lambda_2 \ge 0$, not simultaneously $0$.
3 Additive Tree Models
Problem (1) under the modelling assumptions in Section 2 can be addressed for several score-based classifiers. In particular, it can be formulated for additive tree models such as Random Forest [6] or XGBoost models [8], as well as linear models such as logistic regression and linear support vector machines. Below, we focus on additive tree models (ATM), and extend to functional data the analysis for tabular data described in the previous work of the authors [7].
The ATM is composed of $T$ classification trees. Each tree $t$ has a series of branching nodes $s$, each having associated a feature $v(s)$, a time instant $t_s$, and a threshold value $c_s$, so that records $x$ go through the left or the right of the branching node depending on whether $x_{v(s)}(t_s) \le c_s$ or not. Moreover, tree $t$ has associated a weight $w_t \ge 0$, so that the class predicted for an instance $x$ is the most voted class according to the weights $w_t$. The ATM can be viewed as a score-based classifier by associating to class $k$ the score $f_k$ defined as:

$$f_k(x) = \sum_{t \in \{1, \dots, T\}:\, t \in \mathcal{T}_k(x)} w_t, \qquad (7)$$

where $\mathcal{T}_k(x)$ denotes the subset of trees that classify $x$ in class $k$.
To model Problem (1), the following notation and decision variables will be used:

Data

$x^0$: the instance for which a minimum cost counterfactual $x$ is sought
$\mathcal{L}^t_k$: subset of leaves in tree $t$ whose output is class $k$, $k = 1, \dots, K$, $t = 1, \dots, T$
$\mathcal{L}^t$: set of leaves in tree $t$; hence, $\mathcal{L}^t = \cup_k \mathcal{L}^t_k$, $t = 1, \dots, T$
$\mathrm{Left}(l, t)$: set of ancestor nodes of leaf $l$ in tree $t$ whose left branch takes part in the path that ends in $l$, $l \in \mathcal{L}^t$, $t = 1, \dots, T$
$\mathrm{Right}(l, t)$: set of ancestor nodes of leaf $l$ in tree $t$ whose right branch takes part in the path that ends in $l$, $l \in \mathcal{L}^t$, $t = 1, \dots, T$
$v(s)$: feature used in split $s$, $s \in \mathrm{Left}(l, t) \cup \mathrm{Right}(l, t)$
$t_s$: time instant used in split $s$, $s \in \mathrm{Left}(l, t) \cup \mathrm{Right}(l, t)$
$c_s$: threshold value used for split $s$, $s \in \mathrm{Left}(l, t) \cup \mathrm{Right}(l, t)$
$w_t$: weight of tree $t$, $t = 1, \dots, T$
$x^b$: instances of the dataset that have been classified in class $k^*$, $b = 1, \dots, B$
$B_{\max}$: maximum number of prototypes allowed to construct the counterfactual explanation $x$

Decision Variables

$x$: counterfactual, $x \in \mathcal{X}^0$
$z^t_l$: binary decision variable that indicates whether the counterfactual instance ends in leaf $l \in \mathcal{L}^t$ ($z^t_l = 1$) or not ($z^t_l = 0$), $t = 1, \dots, T$
$\alpha^b_j$: coefficient associated to prototype $x^b$ in the convex combination constructing feature $x_j$ of the counterfactual explanation, $b = 0, \dots, B$, $j = 1, \dots, J$
$u_b$: binary decision variable that indicates whether prototype $x^b$ is used in the construction of the counterfactual explanation $x$, $b = 1, \dots, B$
Recall that the ATM is already known, i.e., the whole structure, including the topology of the trees and the feature and threshold used in each split, is given. Thus, in order to compute the score of the counterfactual instance, the only requirement is to know in which leaf node it has ended up. When the counterfactual ends up in a specific leaf, the corresponding branching conditions are activated: for each split $s \in \mathrm{Left}(l, t)$, the condition $x_{v(s)}(t_s) \le c_s$ must hold, whereas for each $s \in \mathrm{Right}(l, t)$, $x_{v(s)}(t_s) > c_s$ must hold. To introduce these logical conditions, we use the following big-$M$ constraints:

$$x_{v(s)}(t_s) - M_1 (1 - z^t_l) + \epsilon \le c_s \qquad s \in \mathrm{Left}(l, t) \qquad (8)$$
$$x_{v(s)}(t_s) + M_2 (1 - z^t_l) - \epsilon \ge c_s \qquad s \in \mathrm{Right}(l, t). \qquad (9)$$

Since Mixed-Integer Optimization solvers do not admit strict inequalities, a small positive quantity $\epsilon$ is introduced in equations (8) and (9), as is done in [5]. With this, the counterfactual variable $x_{v(s)}$ at time instant $t_s$ is not allowed to take values around the threshold value $c_s$ of split $s$. Please note that the values of $M_1$ and $M_2$ can be tightened for each split.
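As an illustration, constraints (8)-(9) could be written in Pyomo roughly as follows (a sketch only; all variable and container names are assumptions, not the paper's implementation):

```python
import pyomo.environ as pyo

# Illustrative sketch of the big-M constraints (8)-(9) for one tree t.
# Assumed (not the paper's code): m is a ConcreteModel with
#   m.cons : ConstraintList, m.z[t, l] : binary leaf variables,
#   m.x[j, ts] : value of counterfactual feature j at time instant ts;
# leaves, left, right, v, t_s, c, M1, M2, eps describe the trained tree.
def add_leaf_constraints(m, t, leaves, left, right, v, t_s, c, M1, M2, eps):
    for l in leaves[t]:
        for s in left[t, l]:
            # if z[t, l] = 1, force x_{v(s)}(t_s) + eps <= c_s  (constraint (8))
            m.cons.add(m.x[v[s], t_s[s]] - M1 * (1 - m.z[t, l]) + eps <= c[s])
        for s in right[t, l]:
            # if z[t, l] = 1, force x_{v(s)}(t_s) - eps >= c_s  (constraint (9))
            m.cons.add(m.x[v[s], t_s[s]] + M2 * (1 - m.z[t, l]) - eps >= c[s])
```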
The score function in (7) can be rewritten as a linear expression as follows:

$$\sum_{t=1}^{T} \sum_{l \in \mathcal{L}^t_k} w_t z^t_l,$$

for $k = 1, \dots, K$.
Recall that one type of sparsity that we wanted was to use few prototypes to build our counterfactual explanation. To model this, we introduce binary decision variables $u_b$, which control, through parameter $B_{\max}$, the number of prototypes that can be used in the convex combination yielding $x$.

Given instance $x^0$ and a cost function $C$, the formulation associated with Problem (1), i.e., the problem of finding the minimal cost perturbation of $x^0$ that causes the classifier to classify it in class $k^*$, is as follows:
$$
\begin{array}{llll}
\min\limits_{x, z, \alpha, u} & C(x, x^0) & & (10)\\
\text{s.t.} & x_{v(s)}(t_s) - M_1 (1 - z^t_l) + \epsilon \le c_s & s \in \mathrm{Left}(l, t),\ l \in \mathcal{L}^t,\ t = 1, \dots, T & (11)\\
& x_{v(s)}(t_s) + M_2 (1 - z^t_l) - \epsilon \ge c_s & s \in \mathrm{Right}(l, t),\ l \in \mathcal{L}^t,\ t = 1, \dots, T & (12)\\
& \sum_{l \in \mathcal{L}^t} z^t_l = 1 & t = 1, \dots, T & (13)\\
& \sum_{t=1}^{T} \sum_{l \in \mathcal{L}^t_{k^*}} w_t z^t_l \ge \sum_{t=1}^{T} \sum_{l \in \mathcal{L}^t_k} w_t z^t_l & k = 1, \dots, K,\ k \neq k^* & (14)\\
& x_j = \alpha^0_j x^0_j + \sum_{b=1}^{B} \alpha^b_j x^b_j & j = 1, \dots, J & (15)\\
& \sum_{b=0}^{B} \alpha^b_j = 1 & j = 1, \dots, J & (16)\\
& \alpha^b_j \le u_b & b = 1, \dots, B,\ j = 1, \dots, J & (17)\\
& \sum_{b=1}^{B} u_b \le B_{\max} & & (18)\\
& u_b \in \{0, 1\} & b = 1, \dots, B & (19)\\
& z^t_l \in \{0, 1\} & l \in \mathcal{L}^t,\ t = 1, \dots, T & (20)\\
& \alpha^b_j \ge 0 & b = 1, \dots, B,\ j = 1, \dots, J & (21)\\
& x \in \mathcal{X}^0. & & (22)
\end{array}
$$
The cost function in (10) is discussed in more detail below, where we measure the movement from the original instance $x^0$ to its counterfactual explanation $x$ for functional data. Constraints (11) and (12) control to which leaf the counterfactual instance is assigned, and constraint (13) enforces that only one leaf is active for each tree. Constraint (14) ensures that the counterfactual instance is assigned to class $k^*$, i.e., the score of class $k^*$ is the highest one among all classes. Constraints (15) and (16) define for each feature $j$ the counterfactual instance as the convex combination of $x^0_j$ and the prototypes $x^b_j$. To ensure sparsity in the prototypes, constraints (17)-(18) restrict the number of prototypes used in the convex combination to $B_{\max}$. Constraints (19) and (20) ensure that all $u_b$ and $z^t_l$ are binary, constraint (21) that the coefficients $\alpha^b_j$ are non-negative, and constraint (22) that the counterfactual $x$ is in $\mathcal{X}^0$, the set containing the rest of the actionability and plausibility constraints.
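A sketch of how constraints (15)-(18) might look in Pyomo, on discretised curves (again, names and data structures are assumptions, not the paper's implementation):

```python
import pyomo.environ as pyo

# Illustrative sketch of constraints (15)-(18) on discretised curves.
# Assumed (not the paper's code): m.alpha[b, j] >= 0 for b = 0..B, m.u[b] binary
# for b = 1..B, m.x[j, t] continuous; x0[j][t] and protos[b][j][t] hold the data.
def add_prototype_constraints(m, B, J, times, x0, protos, B_max):
    m.proto_cons = pyo.ConstraintList()
    for j in range(J):
        for t in times:  # constraint (15), written pointwise on the time grid
            m.proto_cons.add(
                m.x[j, t] == m.alpha[0, j] * x0[j][t]
                + sum(m.alpha[b, j] * protos[b][j][t] for b in range(1, B + 1)))
        m.proto_cons.add(sum(m.alpha[b, j] for b in range(B + 1)) == 1)   # (16)
        for b in range(1, B + 1):
            m.proto_cons.add(m.alpha[b, j] <= m.u[b])                     # (17)
    m.proto_cons.add(sum(m.u[b] for b in range(1, B + 1)) <= B_max)       # (18)
```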
Let us now discuss the objective function in (10) for the particular choices of $C$ introduced in Section 2, namely, (4) and (5). In order to model the $\ell_0$ term defined in (6), binary decision variables $\xi_j$ are introduced. For every feature $j = 1, \dots, J$, $\xi_j = 0$ if and only if $\alpha^0_j = 1$, i.e., if $x_j = x^0_j$. This is expressed as

$$1 - \alpha^0_j \le \xi_j \qquad j = 1, \dots, J \qquad (23)$$
$$\xi_j \in \{0, 1\}, \qquad j = 1, \dots, J. \qquad (24)$$

Moreover, we have that

$$\|x - x^0\|_0 = \sum_{j=1}^{J} \xi_j.$$
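A possible Pyomo sketch of (23)-(24) (illustrative names only; the returned expression would be added to the objective):

```python
import pyomo.environ as pyo

# Sketch of (23)-(24) (names are illustrative): xi[j] is forced to 1 whenever the
# original curve does not get full weight (alpha[0, j] < 1); minimising
# lambda0 * sum(xi) then sets xi[j] = 0 whenever alpha[0, j] = 1.
def add_l0_term(m, J, lam0):
    m.xi = pyo.Var(range(J), domain=pyo.Binary)            # constraint (24)
    m.xi_link = pyo.ConstraintList()
    for j in range(J):
        m.xi_link.add(m.xi[j] >= 1 - m.alpha[0, j])        # constraint (23)
    return lam0 * sum(m.xi[j] for j in range(J))           # l0 term of the objective
```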
Thus, for the cost function $C$ in (4), we have the following reformulation of (10)-(22):

$$
\begin{array}{ll}
\min\limits_{x, z, \alpha, u, \xi} & \lambda_0 \sum_{j=1}^{J} \xi_j + \lambda_2 \int_0^T \sum_{j=1}^{J} \bigl(x_j(t) - x^0_j(t)\bigr)^2 \, dt\\
\text{s.t.} & (11)-(22),\ (23)-(24).
\end{array}
\qquad \text{(CEF)}
$$

For the particular case of (CEF) where only the $\ell_0$ distance is considered, i.e., $\lambda_2 = 0$, the objective function as well as the constraints are linear (assuming $\mathcal{X}^0$ is also defined through linear constraints), while we have both binary and continuous decision variables. Therefore, Problem (CEF) can be solved using a Mixed Integer Linear Programming (MILP) solver. For arbitrary $\lambda_2 \ge 0$, taking into account that, by (15),

$$\bigl(x_j(t) - x^0_j(t)\bigr)^2 = \left( \sum_{b=0}^{B} \alpha^b_j x^b_j(t) - x^0_j(t) \right)^2,$$

the second term in the objective can be expressed as a convex quadratic function in the decision variables $\alpha^b_j$, and thus (again, assuming $\mathcal{X}^0$ is also defined through linear constraints) Problem (CEF) is a Mixed Integer Convex Quadratic Model with linear constraints.
Let us address Problem (1) when the cost function $C$ has the form (5), and thus the DTW distance is involved. As in Section 2, the time interval $[0, T]$ is discretised into time instants $t_1, \dots, t_n$, and thus the DTW distance is the minimal squared Euclidean distance among the warping paths in $\mathcal{W}$, yielding

$$
\begin{array}{ll}
\min\limits_{x, z, \alpha, \xi, u} & \lambda_0 \sum_{j=1}^{J} \xi_j + \lambda_2 \sum_{r=1}^{Q} \sum_{j=1}^{J} \bigl(x_j(q_{1r}) - x^0_j(q_{2r})\bigr)^2\\
\text{s.t.} & (11)-(22),\ (23)-(24)\\
& (q_{11}, q_{21}) \to (q_{12}, q_{22}) \to \dots \to (q_{1Q}, q_{2Q}) \in \mathcal{W}
\end{array}
\qquad \text{(CEFDTW)}
$$
Notice that, for a fixed warping path in $\mathcal{W}$, all the constraints (11)-(24) are linear, while we have both binary and continuous variables, and the objective function is convex quadratic. Hence, if $\mathcal{X}^0$ is again defined by linear constraints, Problem (CEFDTW) for a fixed warping path is a Mixed Integer Convex Quadratic Model with linear constraints, which can be solved using standard optimization packages. Since the warping path is itself part of the optimization, we propose the following alternating heuristic to solve Problem (CEFDTW):
Algorithm 1: Algorithm to calculate counterfactual explanations with the DTW-based cost (5)

1. Initialisation: Let $\pi(0)$ be the warping path $(t_1, t_1) \to (t_2, t_2) \to \dots \to (t_n, t_n)$
2. $r = 0$
3. Solve Problem (CEFDTW) with $\pi(r)$ as warping path, and obtain the counterfactual instance $x$ and DTW distance $\delta(r)$
4. Find the optimal warping path $\pi^*$ and its corresponding $\delta^*$ by solving
$$
\begin{array}{ll}
\delta^* = \min & \sum_{r=1}^{Q} \sum_{j=1}^{J} \bigl(x_j(q_{1r}) - x^0_j(q_{2r})\bigr)^2\\
\text{s.t.} & (q_{11}, q_{21}) \to (q_{12}, q_{22}) \to \dots \to (q_{1Q}, q_{2Q}) \in \mathcal{W}
\end{array}
$$
5. If $\delta^* = \delta(r)$, then stop
6. Else, update $\pi(r+1) = \pi^*$, set $r = r + 1$, and go to step 3

Output: counterfactual instance $x$
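In code, the alternating scheme of Algorithm 1 could be organised as follows (a sketch; the two solver routines are hypothetical placeholders to be provided by the user):

```python
# Sketch of Algorithm 1. The two helpers are hypothetical placeholders:
#   solve_cefdtw_fixed_path(pi) solves (CEFDTW) for the fixed warping path pi and
#     returns the counterfactual x and the attained DTW cost;
#   best_warping_path(x, x0) returns the optimal warping path between x and x0 and
#     its cost, e.g., via the dynamic programme sketched in Section 2.
def alternating_heuristic(x0, n, solve_cefdtw_fixed_path, best_warping_path, tol=1e-8):
    pi = [(i, i) for i in range(n)]       # initial path (t_1,t_1) -> ... -> (t_n,t_n)
    while True:
        x, delta = solve_cefdtw_fixed_path(pi)           # step 3: MIQP for fixed path
        pi_star, delta_star = best_warping_path(x, x0)   # step 4: re-optimise the path
        if abs(delta_star - delta) <= tol:               # step 5: no improvement, stop
            return x
        pi = pi_star                                     # step 6: update and repeat
```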
4 Numerical illustration
We will illustrate our methodology on two real-world datasets, one univariate and another multivariate, from the UCR archive [9]. For a given instance, we are able to identify the individuals of the dataset from which the corresponding counterfactual is made up and what their contribution is. Furthermore, we show the two different types of sparsity that we can obtain with our model, namely, in the number of prototypes used for the counterfactual and in the number of functional features that change. The use of different distances, i.e., the Euclidean and the DTW distances, is also displayed.
All the mathematical optimization problems have been implemented using the Pyomo optimization modeling language [19, 20] in Python 3.8. As solver, we have used Gurobi 9.0 [18]. A value of $\epsilon = 10^{-6}$ has been imposed in (11) and (12). The values of the big-$M$ constants in (11) and (12) are node dependent, and they have been tightened following the process described in [7]. For all the computational experiments, the classification model considered has been a Random Forest with $T = 200$ trees and a maximum depth of 4. Our experiments have been conducted on a PC with an Intel Core i7-1065G7 CPU @ 1.30 GHz (1.50 GHz) and 16 GB of RAM. The operating system is 64-bit.
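For reference, the setup described above could be reproduced along these lines (a sketch: the Random Forest hyperparameters match those stated, while the data, the model-building routine and everything else are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import pyomo.environ as pyo

# Sketch of the experimental setup: the Random Forest hyperparameters match those
# stated above; the data and the model-building routine are placeholders.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 24))      # placeholder for flattened discretised curves
y_train = rng.integers(0, 2, size=100)    # placeholder for the -1/+1 labels (recoded 0/1)
rf = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0)
rf.fit(X_train, y_train)

solver = pyo.SolverFactory("gurobi")      # assumes a Gurobi installation and licence
# model = build_counterfactual_model(rf, x0)   # hypothetical model-building routine
# results = solver.solve(model)
```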
The first dataset, ItalyPowerDemand [23], has one functional feature. There are 1096 instances and each instance is a time series of length 24, representing the power demand in Italy. The binary classification task is to distinguish days from October to March (response value $-1$) from days from April to September (response value $+1$).
The second dataset, NATOPS [15], has 24 functional features, each a time series of length 51, representing the X, Y, and Z coordinates of the left and right hands, wrists, thumbs and elbows as captured by a Kinect 2 sensor. There are 260 instances and we chose two of the 6 classes in the dataset. The binary classification task is thus to distinguish the gesture "All Clear" (response value $-1$) from "Not Clear" (response value $+1$).
4.1 Experimental results
ItalyPowerDemand
We present the counterfactual for an instance $x^0$ of the ItalyPowerDemand dataset in Figure 2. In each case, we represent the original curve, the prototypes, and the final counterfactual.
The first cost model analysed is the squared Euclidean model (4) with $\lambda_0 = 0$ (since we have only one feature, $\lambda_0 > 0$ is meaningless). Different values of $B_{\max}$ have been used. The smaller the value of $B_{\max}$, the sparser the counterfactual in terms of prototypes; the larger the value of $B_{\max}$, the more freedom to use prototypes and therefore the smaller the distances obtained. In Figure 3 we plot the relation between the number of prototypes and the distance. It illustrates how using more than one prototype may be beneficial, but using more than 4 prototypes gives less sparsity without smaller distances.
To show the flexibility of our model, the same experiments have been carried out but with the DTW-based cost (5), again with $\lambda_0 = 0$. The counterfactual solutions have been calculated with the heuristic procedure described in Algorithm 1. The results are depicted in Figure 4. As before, one can see how the objective function decreases as the number of prototypes $B_{\max}$ increases. However, in this case, it is sufficient to use 2 prototypes, as 3 or more do not improve the objective function much, see Figure 5.
NATOPS
We now present the counterfactual for an instance $x^0$ of the multivariate dataset NATOPS. The cost function used is the squared Euclidean model (4) with $\lambda_0 = 1$, $\lambda_2 = 0.005$.

In Figure 6 the counterfactual instance $x$ for $x^0$ with $B_{\max} = 1$ is shown. As the cost function $C$ contains as its first term the $\ell_0$ norm, we obtain a solution that is sparse in the features we need to change to move from $x^0$ to $x$. Indeed, to change its class, only three functional features have to be modified. In Figure 7 the changed features are presented.
As in the univariate case, we can impose different values of $B_{\max}$. In Figure 8 we show the counterfactual explanation for $B_{\max} = 2$ and the same cost function. Note how giving the flexibility to use more than one prototype results in only two features having to be changed, see Figure 9.
(a) $B_{\max} = 1$ (b) $B_{\max} = 2$ (c) $B_{\max} = 3$ (d) $B_{\max} = 4$

Figure 2: Counterfactual explanations for $x^0$ of the ItalyPowerDemand data set, which has been predicted by the Random Forest in $k_0 = -1$ and whose counterfactual $x$ has to be predicted in class $k^* = +1$. Different values of $B_{\max}$, i.e., the number of prototypes used for the convex combination, have been imposed. The cost function is model (4) with $\lambda_0 = 0$, $\lambda_2 = 1$.
5 Conclusions
In this paper, we have proposed a novel approach to build counterfactual explanations
when dealing with multivariate functional data in classification problems by means of
mathematical optimization. With our method, we ensure plausible and sparse explanations, controlling not only the number of prototypes of the dataset used to create the
counterfactuals, but also the number of features that need to be changed. Our model
is also flexible enough to be used with different distance measures, e.g., the Euclidean
distance or the DTW distance. Moreover, our methodology is applicable to score-based classifiers, including additive tree models, such as random forest or XGBoost models, as well as linear models, such as logistic regression and linear support vector machines. We have illustrated our methodology on various real-world datasets.

Figure 3: Euclidean distance obtained vs. the number of prototypes used in a counterfactual explanation $x$ for $x^0$ of the ItalyPowerDemand data set, which has been predicted by the Random Forest in $k_0 = -1$ and for which $k^* = +1$ is imposed.
There are several interesting lines of future research. First, an extension to other non-score-based classifiers, like k-NN classifiers, deserves some study. Secondly, to define counterfactual explanations for functional data one could be interested in keeping fixed a part of the curves defining the features. With our method we build the counterfactuals from scratch using combinations of prototypes on the interval $[0, T]$, but suppose we have an instance defined on the interval $[0, t_0)$; one might want to find out what the rest of the curve on the interval $[t_0, T]$ would have to look like for the overall curve to be classified in class $k^*$. When constructing the rest of the curve, one would need to maintain the smoothness and other properties of the curve. Finally, an extension to other distances, like the optimal transportation distance, is also a topic of interest.
Acknowledgements
This research has been financed in part by research projects EC H2020 MSCA RISE NeEDS (Grant agreement ID: 822214), FQM-329, P18-FR-2369 and US-1381178 (Junta de Andalucía), and PID2019-110886RB-I00 (Ministerio de Ciencia, Innovación y Universidades, Spain). This support is gratefully acknowledged.
(a) $B_{\max} = 1$ (b) $B_{\max} = 2$ (c) $B_{\max} = 3$ (d) $B_{\max} = 4$

Figure 4: Counterfactual explanations for $x^0$ of the ItalyPowerDemand data set, which has been predicted with a Random Forest in $k_0 = -1$ and whose counterfactual $x$ has to be predicted in class $k^* = +1$. Different values of $B_{\max}$, i.e., the number of prototypes used for the convex combination, have been imposed. The cost function is model (5) with $\lambda_0 = 0$, $\lambda_2 = 1$.
References
[1] Emre Ates, Burak Aksar, Vitus J Leung, and Ayse K Coskun. Counterfactual
explanations for multivariate time series. In 2021 International Conference on
Applied Artificial Intelligence (ICAPAI), pages 1–8. IEEE, 2021.
[2] S. Benítez-Peña, E. Carrizosa, V. Guerrero, M.D. Jiménez-Gamero, B. Martín-Barragán, C. Molero-Río, P. Ramírez-Cobo, D. Romero Morales, and M.R. Sillero-Denamiel. On sparse ensemble methods: An application to short-term predictions of the evolution of COVID-19. European Journal of Operational Research, 295(2):648–663, 2021.
[3] Donald J Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD workshop, volume 10, pages 359–370. Seattle, WA, USA, 1994.

Figure 5: DTW distance obtained vs. the number of prototypes used in a counterfactual explanation $x$ for $x^0$ of the ItalyPowerDemand data set, which has been predicted by the Random Forest in $k_0 = -1$ and for which $k^* = +1$ is imposed.
[4] D. Bertsimas, A. King, and R. Mazumder. Best subset selection via a modern
optimization lens. The Annals of Statistics, 44(2):813–852, 2016.
[5] Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.
[6] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[7] Emilio Carrizosa, Jasone Ramírez-Ayerbe, and Dolores Romero Morales. Generating collective counterfactual explanations in score-based classification via mathematical optimization. Technical Report IMUS, Sevilla, Spain, 2022. https://www.researchgate.net/publication/353073138_Generating_Counterfactual_Explanations_in_Score-Based_Classification_via_Mathematical_Optimization.
[8] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA,
2016. ACM.
[9] Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn Keogh. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica, 6(6):1293–1305, 2019.

Figure 6: Counterfactual explanations for $x^0$ of the NATOPS data set, which has been predicted by the Random Forest in $k_0 = +1$ and whose counterfactual $x$ has to be predicted in class $k^* = -1$. $B_{\max} = 1$ prototype has been imposed. The cost function is model (4) with $\lambda_0 = 1$, $\lambda_2 = 0.005$.
[10] Eoin Delaney, Derek Greene, and Mark T Keane. Instance-based counterfactual explanations for time series classification. In International Conference on Case-Based Reasoning, pages 32–47. Springer, 2021.
[11] Mengnan Du, Ninghao Liu, and Xia Hu. Techniques for interpretable machine
learning. Communications of the ACM, 63(1):68–77, 2019.
[12] Carlos Eiras-Franco, Bertha Guijarro-Berdinas, Amparo Alonso-Betanzos, and Antonio Bahamonde. A scalable decision-tree-based method to explain interactions in dyadic data. Decision Support Systems, 127:113141, 2019.
[13] Philippe Esling and Carlos Agon. Time-series data mining. ACM Computing
Surveys (CSUR), 45(1):1–34, 2012.
[14] Runshan Fu, Manmohan Aseri, ParamVir Singh, and Kannan Srinivasan. “Un”
fair machine learning algorithms. Management Science, 68(6):4173–4195, 2022.
[15] Nehla Ghouaiel, Pierre-François Marteau, and Marc Dupont. Continuous pattern detection and recognition in stream - a benchmark for online gesture recognition. International Journal of Applied Pattern Recognition, 4(2):146–160, 2017.
[16] B. Goodman and S. Flaxman. European Union regulations on algorithmic
decision-making and a “right to explanation”. AI Magazine, 38(3):50–57, 2017.
[17] Riccardo Guidotti. Counterfactual explanations and how to find them: literature review and benchmarking. Forthcoming in Data Mining and Knowledge Discovery, 2022.
[18] LLC Gurobi Optimization. Gurobi optimizer reference manual, 2021.
[19] William E. Hart, Carl D. Laird, Jean-Paul Watson, David L. Woodruff, Gabriel A. Hackebeil, Bethany L. Nicholson, and John D. Siirola. Pyomo - optimization modeling in Python, volume 67. Springer Science & Business Media, second edition, 2017.
[20] William E Hart, Jean-Paul Watson, and David L Woodruff. Pyomo: modeling and solving mathematical programs in Python. Mathematical Programming Computation, 3(3):219–260, 2011.
[21] Wolfgang Jank and Galit Shmueli. Functional data analysis in electronic commerce research. Statistical Science, 21(2):155–166, 2006.
[22] Amir-Hossein Karimi, Gilles Barthe, Bernhard Schölkopf, and Isabel Valera. A survey of algorithmic recourse: definitions, formulations, solutions, and prospects. arXiv preprint arXiv:2010.04050, 2021.
[23] Eamonn Keogh, Li Wei, Xiaopeng Xi, Stefano Lonardi, Jin Shieh, and Scott
Sirowy. Intelligent icons: Integrating lite-weight data mining and visualization
into GUI operating systems. In Sixth International Conference on Data Mining
(ICDM’06), pages 912–916. IEEE, 2006.
[24] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4768–4777, 2017.
[25] S.M. Lundberg, G. Erion, H. Chen, A. DeGrave, J.M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S.-I. Lee. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):2522–5839, 2020.
[26] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, 2019.
[27] Kiarash Mohammadi, Amir-Hossein Karimi, Gilles Barthe, and Isabel Valera.
Scaling guarantees for nearest counterfactual explanations. In Proceedings of the
2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 177–187, 2021.
[28] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
[29] Ashish Sood, Gareth M James, and Gerard J Tellis. Functional regression: A new
model for predicting market penetration of new products. Marketing Science,
28(1):36–51, 2009.
[30] Nur Sunar and Jayashankar M. Swaminathan. Net-metered distributed renewable
energy: A peril for utilities? Management Science, 67(11):6716–6733, 2021.
[31] Sahil Verma, John Dickerson, and Keegan Hines. Counterfactual explanations for
machine learning: A review. arXiv preprint arXiv:2010.10596, 2020.
[32] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841, 2017.
[33] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey on sequence
classification. ACM SIGKDD Explorations Newsletter, 12(1):40–48, 2010.
[34] Dmitry Zhdanov, Sudip Bhattacharjee, and Mikhail A Bragin. Incorporating FAT and privacy aware AI modeling approaches into business decision making frameworks. Decision Support Systems, 155:113715, 2022.
[35] Zemin Zheng, Jinchi Lv, and Wei Lin. Nonsparse learning with latent variables.
Operations Research, 69(1):346–359, 2021.
(a) Feature 7 (b) Feature 11 (c) Feature 13

Figure 7: Changed features in the counterfactual explanation for $x^0$ of the NATOPS data set, which has been predicted by the Random Forest in $k_0 = -1$ and whose counterfactual $x$ has to be predicted in class $k^* = +1$ with $B_{\max} = 1$. The cost function is model (4) with $\lambda_0 = 1$, $\lambda_2 = 0.005$.
Figure 8: Counterfactual explanations for $x^0$ of the NATOPS data set, which has been predicted by the Random Forest in $k_0 = +1$ and whose counterfactual $x$ has to be predicted in class $k^* = -1$. $B_{\max} = 2$ prototypes have been imposed. The cost function is model (4) with $\lambda_0 = 1$, $\lambda_2 = 0.005$.
(a) Feature 11 (b) Feature 22

Figure 9: Changed features in the counterfactual explanation for $x^0$ of the NATOPS data set, which has been predicted by the Random Forest in $k_0 = +1$ and whose counterfactual $x$ has to be predicted in class $k^* = -1$ with $B_{\max} = 2$. The cost function is model (4) with $\lambda_0 = 1$, $\lambda_2 = 0.005$.
The UCR time series archive - introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets but since that time, it has gone through periodic expansions. The last expansion took place in the summer of 2015 when the archive grew from 45 to 85 data sets. This paper introduces and will focus on the new data expansion from 85 to 128 data sets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1nearest neighbor classification), a fraction might be mis-attributing the reasons for their improvement. Moreover, the improvements claimed by these papers might have been achievable with a much simpler modification, requiring just a few lines of code.