
A New Model for Counterfactual Analysis for Functional Data

Emilio Carrizosa*1, Jasone Ramírez-Ayerbe†1, and Dolores Romero Morales‡2

1Instituto de Matemáticas de la Universidad de Sevilla, Seville, Spain
2Department of Economics, Copenhagen Business School, Frederiksberg, Denmark

Abstract

Counterfactual explanations have become a very popular interpretability tool to understand and explain how complex machine learning models make decisions for individual instances. Most of the research on counterfactual explainability focuses on tabular and image data, and much less on models dealing with functional data. In this paper, a counterfactual analysis for functional data is addressed, in which the goal is to identify the samples of the dataset of which the counterfactual explanation is made up, as well as how they are combined so that the individual instance and its counterfactual are as close as possible. Our methodology can be used with different distance measures for multivariate functional data and is applicable to any score-based classifier. We illustrate our methodology using two different real-world datasets, one univariate and one multivariate.

Keywords: Counterfactual Explanations; Mathematical Optimization; Functional Data; Prototypes; Random Forests

1 Introduction

Machine Learning models are increasingly being used in high-stakes decision-making settings such as healthcare, law or finance. Many of these Machine Learning models are black boxes and therefore do not explain how they arrive at decisions in a way that humans can understand. Nowadays, there is an increasing number of laws and regulations [16] coming into place to enforce the decisions of algorithms to be interpretable (a.k.a. transparent) [11, 12, 14, 26, 34]. Interpretability is enhanced by selecting the features that have the greatest impact on the model as a whole [2, 4, 35], but also by knowing these locally for the decision made for each individual [24, 25, 28].

A specific type of post-hoc interpretability tool is the counterfactual explanation, i.e., a set of changes that can be made to an instance such that the given machine learning model would have classified it in a different class. For instance, in a credit score application one may be interested in knowing how the debt history of a person should have been to change the prediction to "loan should be granted". See [32] for the seminal paper on counterfactual explanations in predictive modelling and [17, 22, 31] for recent surveys. Additionally, there are different criteria or constraints that a counterfactual should satisfy [27, 32], such as actionability, plausibility and sparsity. Actionability ensures that a counterfactual does not change immovable features, plausibility guarantees that counterfactual explanations are realistic and do not depart from what can be observed in the sample, whereas sparsity ensures, for instance, that only a few features are moved.

*Emilio Carrizosa: ecarrizosa@us.es
†Jasone Ramírez-Ayerbe: mrayerbe@us.es
‡Dolores Romero Morales: drm.eco@cbs.dk

This paper studies score-based multiclass classification methods. For these, a quite versatile model-based framework can be built for counterfactual analysis. Let $f: \mathcal{X} \to \{1, \dots, K\}$ be a multiclass classifier based on score functions $(f_1, \dots, f_K)$, where $K$ is the number of classes and $\mathcal{X}$ is the feature space. Given an instance $x^0 \in \mathcal{X}$, let $f(x^0) \in \arg\max_k f_k(x^0)$ denote the class assigned by the classifier to $x^0$. For a fixed class $k^*$, the counterfactual instance of $x^0$ is defined in this paper as the feasible $x$ obtained with a minimal perturbation of $x^0$ and classified by the score-based classifier in class $k^*$. This yields the following optimization problem:

$$
\begin{aligned}
\min_{x} \quad & C(x, x^0) \\
\text{s.t.} \quad & f_{k^*}(x) \ge f_k(x) \quad \forall k = 1, \dots, K,\ k \neq k^* \\
& x \in \mathcal{X}^0.
\end{aligned}
\tag{1}
$$

The objective function $C(x, x^0)$ is a cost function that measures the dissimilarity between the given instance $x^0$ and the counterfactual instance $x$. The feasible region contains two types of constraints. The first ensures that the counterfactual $x$ is classified in class $k^*$, by imposing that the score $f_{k^*}(x)$ is the maximum across all $k$. The second ensures that the counterfactual is in $\mathcal{X}^0$, the set containing the actionability and plausibility constraints.
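The classification rule and the class constraint in (1) can be illustrated with a minimal sketch (all names and values here are hypothetical, not the authors' code):

```python
def assign_class(scores):
    """Class assigned by a score-based classifier: the index k (1..K)
    attaining the maximal score f_k(x)."""
    return max(range(1, len(scores) + 1), key=lambda k: scores[k - 1])

def is_classified_as(scores, k_star):
    """First constraint of Problem (1): f_{k*}(x) >= f_k(x) for all k."""
    return all(scores[k_star - 1] >= s for s in scores)

# Hypothetical scores (f_1(x), f_2(x), f_3(x)) for a 3-class problem.
scores = [0.2, 0.5, 0.3]
print(assign_class(scores))          # 2: class 2 attains the maximal score
print(is_classified_as(scores, 3))   # False: f_3 < f_2
```

A counterfactual for target class $k^*$ must perturb $x^0$ until `is_classified_as` holds for $k^*$, at minimal cost.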

Most of the work on counterfactual explanations focuses on tabular and image data [22], and much less on functional data. This type of data, which arises in important domains such as econometrics, energy and marketing [21, 29, 30], is addressed in this paper. In principle, one could apply the methods developed for tabular data also to functional data, just by considering each feature to be the measurement of the function at a time instant. However, doing so, fundamental information such as the autocorrelation structure along consecutive time instants would be lost. For this reason, some works on counterfactual explanations exploiting the functional nature of data have been suggested, e.g., [1, 10], but, as far as the authors know, none of them uses the structure and properties of the machine learning model.


Our aim is to model and solve Problem (1) when data are functions. The counterfactual explanation $x \in \mathcal{X}^0$ is a hypothetical instance that is as similar as possible to the instance $x^0$, generated by combining various existing instances, hereafter prototypes, from the dataset. First, we can deal with multivariate functional data, i.e., our data $x$ are functions taking values in some $\mathbb{R}^J$. Second, we are able to identify the instances of the dataset that generate the counterfactual for each instance, controlling how sparse the counterfactual explanation is, in terms of both the number of prototypes used to create the counterfactual $x$ and the number of features changed to move from $x^0$ to $x$. Third, we can handle a collection of distance measures, while ensuring that the cost function $C$ is tractable. We will show that, under mild assumptions on the set defining the actionability and plausibility constraints, obtaining counterfactual explanations reduces to solving a Mixed Integer Convex Quadratic Model with linear constraints, which can be solved with standard optimization packages.

The remainder of the paper is organized as follows. In Section 2, we model the problem of finding counterfactual explanations through the optimization problem (1) when data are functions. In Section 3, we focus on counterfactual analysis for additive tree models. In Section 4, a numerical illustration using real-world datasets is provided. Finally, conclusions and possible lines of future research are provided in Section 5.

2 Counterfactual analysis for functional data

In this section, we detail the formulation of Problem (1) when dealing with functional data. To do this, we need to model the structure of the counterfactual instances, the constraints associated with them, as well as the cost function $C$. This is done in what follows. We postpone to the next section the analysis of a state-of-the-art class of score-based classification models, namely, additive tree models.

Counterfactual instances and constraints

An instance $x \in \mathcal{X} \subset \mathcal{F}^J$ is defined as a vector of $J$ functional features. Hence, $x = (x_1(t), \dots, x_J(t))$, where $x_j: [0, T] \to \mathbb{R}$, $j = 1, \dots, J$, are Riemann integrable functions defined on the interval $[0, T]$. Notice that $x_j(t)$ may be a static feature, e.g., age, defined then as a constant function. Note also that, for a given time instant $t \in [0, T]$, $x(t)$ is a vector in $\mathbb{R}^J$, whose components may represent $J$ measurements of independent attributes, or they may be related; e.g., one can include an attribute $x_j$ and some of its derivatives, to provide information also on, e.g., the growth speed or the convexity of $x_j$.

Let us discuss the constraints on $x$ in Problem (1). First, we need to ensure that the counterfactual explanation $x$ is realistic. In the case of functional data, this yields an infinite-dimensional optimization problem. To enhance the tractability of this requirement, we propose the use of instances of the dataset, i.e., prototypes, to generate the counterfactual explanation. Let $x^b$, $b = 1, \dots, B$, be all the instances that have been classified by the model in class $k^*$. For an instance $x^0 = (x^0_1, \dots, x^0_J)$, feature $j$ of the counterfactual explanation, $x_j$, is defined as the convex combination of the original feature $x^0_j$ and feature $j$ of all $B$ prototypes $x^b_j$. Thus, the counterfactual explanation $x$ is defined for each feature $j$ as $x_j = \alpha^0_j x^0_j + \sum_{b=1}^{B} \alpha^b_j x^b_j$, where $\sum_{b=0}^{B} \alpha^b_j = 1$, $\forall j = 1, \dots, J$.
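Assuming each functional feature is discretised on a common time grid, the convex combination above can be sketched as follows (function and variable names are illustrative only):

```python
import numpy as np

def combine_feature(x0_j, prototypes_j, alphas):
    """Build feature j of the counterfactual as the convex combination
    x_j = alpha^0_j x^0_j + sum_b alpha^b_j x^b_j.
    alphas[0] multiplies the original feature x0_j; the weights must be
    nonnegative and sum to 1, as required by the model."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.all(alphas >= 0) and np.isclose(alphas.sum(), 1.0)
    curves = np.vstack([np.asarray(x0_j, dtype=float)]
                       + [np.asarray(p, dtype=float) for p in prototypes_j])
    return alphas @ curves

# Toy curves sampled at 3 time instants, B = 2 prototypes.
x0_j = [1.0, 2.0, 3.0]
protos = [[3.0, 2.0, 1.0], [0.0, 0.0, 0.0]]
x_j = combine_feature(x0_j, protos, [0.5, 0.5, 0.0])
print(x_j)  # [2. 2. 2.]
```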

To gain interpretability of the so-obtained counterfactual explanation $x$, we want to use as few prototypes $x^b$ as possible in its construction. For this reason, we impose a maximum of $B^{max}$ prototypes to be used, where $B^{max}$ is a parameter defined by the user.

In $\mathcal{X}^0$ we impose the unmovable constraints, other constraints like upper or lower limits on the static variables, or constraints avoiding, e.g., that prototypes too far from $x^0$ are considered (which is done by setting to zero all coefficients $\alpha^b_j$ if the distance between $x^0$ and $x^b$ is above a threshold value).

Cost function

Recall that $C(x, x^0)$ is the cost of changing $x^0$ to $x$, which can be measured by the proximity between the curves defining $x$ and $x^0$.

The proximity between curves can be measured in several ways. One can use, for instance, the squared Euclidean distance:

$$
\|x - x^0\|_2^2 = \int_0^T \sum_{j=1}^{J} (x_j(t) - x^0_j(t))^2 \, dt \tag{2}
$$

Needless to say, different weights can be assigned to each feature in the expression above.
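On a uniform grid, the integral in (2) can be approximated by a Riemann sum; a small sketch (shapes and names are our own convention, not the authors' code):

```python
import numpy as np

def squared_l2_distance(x, x0, dt):
    """Riemann-sum approximation of (2): the integral over [0, T] of
    sum_j (x_j(t) - x0_j(t))^2, for curves stored as arrays of shape
    (J, n) sampled with uniform time step dt."""
    diff = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    return float(np.sum(diff ** 2) * dt)

x  = np.array([[1.0, 1.0, 1.0]])   # J = 1 feature, n = 3 time samples
x0 = np.array([[0.0, 0.0, 0.0]])
print(squared_l2_distance(x, x0, dt=0.5))  # 3 * 1^2 * 0.5 = 1.5
```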

Another popular distance used in the literature [13, 33] is the Dynamic Time Warping (DTW) distance, which measures the dissimilarity between two functions that may be inspected at different speeds, see Figure 1. More explicitly, suppose we have $x$ and $x'$, discretised in two sequences of length $n$, so that the $J$-variate functions $x$ and $x'$ are replaced by $(x(t_1), \dots, x(t_n))$ and $(x'(t_1), \dots, x'(t_n))$. Observe that each $x(t)$, $x'(t)$ is a vector in $\mathbb{R}^J$. A warping path $\pi$ is a chain of pairs of the form $\pi = (q_{11}, q_{21}) \to (q_{12}, q_{22}) \to \dots \to (q_{1Q}, q_{2Q})$ of length $Q$, $n \le Q \le 2n - 1$, satisfying the following two conditions:

1. $(q_{11}, q_{21}) = (t_1, t_1)$, and $(q_{1Q}, q_{2Q}) = (t_n, t_n)$

2. $q_{1r} \le q_{1(r+1)} \le q_{1r} + 1$, and $q_{2r} \le q_{2(r+1)} \le q_{2r} + 1$, $r = 1, 2, \dots, Q - 1$

Let $\mathcal{W}$ denote the set of all warping paths. Then, the DTW distance $DTW(x, x')$ between $x$ and $x'$ is the minimal squared Euclidean distance between pairs of the form $(x(q_{11}), \dots, x(q_{1Q}))$ and $(x'(q_{21}), \dots, x'(q_{2Q}))$ when $(q_{11}, q_{21}) \to (q_{12}, q_{22}) \to \dots \to (q_{1Q}, q_{2Q})$ is a warping path, i.e.,

$$
\begin{aligned}
DTW(x, x') = \min \quad & \sum_{r=1}^{Q} \sum_{j=1}^{J} \left( x_j(q_{1r}) - x'_j(q_{2r}) \right)^2 \\
\text{s.t.} \quad & \pi = (q_{11}, q_{21}) \to (q_{12}, q_{22}) \to \dots \to (q_{1Q}, q_{2Q}) \in \mathcal{W}
\end{aligned}
\tag{3}
$$

Observe that DTW can be efficiently evaluated using dynamic programming [3].
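A sketch of that dynamic program (our own minimal implementation, not the paper's code; the local cost is the squared Euclidean distance, matching (3)):

```python
import numpy as np

def dtw_distance(x, xp):
    """DTW distance (3) between two J-variate sequences given as arrays
    of shape (n, J) and (m, J), via the standard O(nm) recursion
    D[i, j] = cost(i, j) + min(D[i-1, j], D[i, j-1], D[i-1, j-1])."""
    x, xp = np.asarray(x, dtype=float), np.asarray(xp, dtype=float)
    n, m = len(x), len(xp)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.sum((x[i - 1] - xp[j - 1]) ** 2)  # squared Euclidean
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# Two univariate (J = 1) sequences of length 3.
x  = [[0.0], [1.0], [2.0]]
xp = [[0.0], [2.0], [2.0]]
print(dtw_distance(x, xp))  # 1.0: only the misaligned middle pair contributes
```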

(a) Warping path $\pi = (0,0) \to (1,1) \to (2,2) \to (3,3) \to (4,4) \to (5,5) \to (6,6) \to (7,7)$

(b) Optimal warping path $\pi^* = (0,0) \to (1,1) \to (2,2) \to (3,3) \to (4,4) \to (4,5) \to (5,6) \to (6,6) \to (7,7)$

Figure 1: Comparison between different warping paths between the functions $x$ (in blue) and $x'$ (in orange)

Additionally, $C$ may contain, on top of the distance-based term described above, other terms measuring, e.g., the number of features altered when moving from $x^0$ to $x$. In particular, in Section 3 we will discuss in detail the cases

$$
C(x, x^0) = \lambda_0 \|x^0 - x\|_0 + \lambda_2 \int_0^T \sum_{j=1}^{J} (x_j(t) - x^0_j(t))^2 \, dt, \tag{4}
$$

and

$$
C(x, x^0) = \lambda_0 \|x^0 - x\|_0 + \lambda_2 DTW(x, x^0), \tag{5}
$$

where $\|x^0 - x\|_0$ indicates how many components of $x^0 = (x^0_1, \dots, x^0_J)$ and $x = (x_1, \dots, x_J)$ are not equal,

$$
\|x^0 - x\|_0 = |\{ j : x^0_j \neq x_j \}|, \tag{6}
$$

and $\lambda_0, \lambda_2 \ge 0$, not simultaneously $0$.

3 Additive Tree Models

Problem (1) under the modelling assumptions in Section 2 can be addressed for several score-based classifiers. In particular, it can be formulated for additive tree models such as Random Forests [6] or XGBoost models [8], as well as linear models such as logistic regression and linear support vector machines. Below, we focus on additive tree models (ATMs), and extend to functional data the analysis for tabular data described in the previous work of the authors [7].

The ATM is composed of $T$ classification trees. Each tree $t$ has a series of branching nodes $s$, each having associated a feature $v(s)$, a time instant $t_s$, and a threshold value $c_s$, so that records $x$ go through the left or the right branch of the node depending on whether $x_{v(s)}(t_s) \le c_s$ or not. Moreover, tree $t$ has associated a weight $w_t \ge 0$, so that the class predicted for an instance $x$ is the most voted class according to the weights $w_t$. The ATM can be viewed as a score-based classifier by associating to class $k$ the score $f_k$ defined as:

$$
f_k(x) = \sum_{t \in \{1, \dots, T\} : t \in \mathcal{T}_k(x)} w_t, \tag{7}
$$

where $\mathcal{T}_k(x)$ denotes the subset of trees that classify $x$ in class $k$.
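The score (7) is just a weighted vote over the trees; a toy sketch with hypothetical stump "trees" given as (weight, predict) pairs (our own illustration, not the paper's code):

```python
def atm_scores(x, trees, K):
    """Compute the scores f_k(x) of equation (7) for an additive tree model.
    Each tree is a pair (weight, predict), where predict(x) returns the
    class (1..K) of the leaf that x falls into."""
    scores = [0.0] * (K + 1)          # index 0 unused; classes are 1..K
    for weight, predict in trees:
        scores[predict(x)] += weight  # tree t votes for its class with weight w_t
    return scores[1:]

# Toy forest: three stumps on a scalar feature, equal weights w_t = 1.
trees = [
    (1.0, lambda x: 1 if x <= 0.5 else 2),
    (1.0, lambda x: 1 if x <= 0.3 else 2),
    (1.0, lambda x: 2),
]
print(atm_scores(0.4, trees, K=2))  # [1.0, 2.0]: class 2 wins the vote
```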

To model Problem (1), the following notation and decision variables will be used:

Data

$x^0$: the instance for which a minimum cost counterfactual $x$ is sought

$\mathcal{L}^t_k$: subset of leaves in tree $t$ whose output is class $k = 1, \dots, K$, $t = 1, \dots, T$

$\mathcal{L}^t$: set of leaves in tree $t$. Hence, $\mathcal{L}^t = \cup_k \mathcal{L}^t_k$, $t = 1, \dots, T$

$Left(l, t)$: the set of ancestor nodes of leaf $l$ in tree $t$ whose left branch takes part in the path that ends in $l$, $l \in \mathcal{L}^t$, $t = 1, \dots, T$

$Right(l, t)$: the set of ancestor nodes of leaf $l$ in tree $t$ whose right branch takes part in the path that ends in $l$, $l \in \mathcal{L}^t$, $t = 1, \dots, T$

$v(s)$: feature used in split $s$, $s \in Left(l, t) \cup Right(l, t)$

$t_s$: time point used in split $s$, $s \in Left(l, t) \cup Right(l, t)$

$c_s$: threshold value used for split $s$, $s \in Left(l, t) \cup Right(l, t)$

$w_t$: weight of tree $t$, $t = 1, \dots, T$

$x^b$: instances of the dataset that have been classified in class $k^*$, $b = 1, \dots, B$

$B^{max}$: maximum number of prototypes allowed to construct the counterfactual explanation $x$

Decision Variables

$x$: counterfactual, $x \in \mathcal{X}^0$

$z^t_l$: binary decision variable that indicates whether the counterfactual instance ends in leaf $l \in \mathcal{L}^t$ ($z^t_l = 1$) or not ($z^t_l = 0$), $t = 1, \dots, T$

$\alpha^b_j$: coefficient associated with prototype $x^b$ in the convex combination constructing feature $x_j$ of the counterfactual explanation, $b = 0, \dots, B$, $j = 1, \dots, J$

$u_b$: binary decision variable that indicates whether prototype $x^b$ is used in the construction of the counterfactual explanation $x$, $b = 1, \dots, B$

Recall that the ATM is already known, i.e., the whole structure, including the topology of the trees and the feature and threshold used in each split, is given. Thus, in order to compute the score of the counterfactual instance, the only requirement is to know in which leaf node it has ended up. When we end up in a specific leaf, the corresponding branching conditions are activated: for each split $s \in Left(l, t)$ we must have $x_{v(s)}(t_s) \le c_s$, whereas for each split $s \in Right(l, t)$ we must have $x_{v(s)}(t_s) > c_s$. To introduce these logical conditions, we use the following big-$M$ constraints:

$$
x_{v(s)}(t_s) - M_1 (1 - z^t_l) + \epsilon \le c_s \quad s \in Left(l, t) \tag{8}
$$

$$
x_{v(s)}(t_s) + M_2 (1 - z^t_l) - \epsilon \ge c_s \quad s \in Right(l, t) \tag{9}
$$

Since Mixed-Integer Optimization solvers cannot handle strict inequalities, a small positive quantity $\epsilon$ is introduced in equations (8) and (9), as is done in [5]. With this, our counterfactual variable $x_{v(s)}$ at point $t_s$ is not allowed to take values around the threshold value $c_s$ at split $s$. Please note that the values of $M_1$ and $M_2$ can be tightened for each split.
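A numeric check of how (8) switches on and off with $z^t_l$ (the values of $M_1$, $\epsilon$ and the threshold below are hypothetical):

```python
def left_branch_ok(x_val, c_s, z, M1=1e3, eps=1e-6):
    """Constraint (8): x_{v(s)}(t_s) - M1*(1 - z) + eps <= c_s.
    When z = 1 (the counterfactual ends in leaf l) it enforces the left
    branching condition; when z = 0 the big-M term deactivates it."""
    return x_val - M1 * (1 - z) + eps <= c_s

c_s = 0.7
assert left_branch_ok(0.5, c_s, z=1)       # active and satisfied
assert not left_branch_ok(0.9, c_s, z=1)   # active and violated
assert left_branch_ok(0.9, c_s, z=0)       # deactivated by the big-M term
```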

The score function in (7) can be rewritten as a linear expression as follows:

$$
\sum_{t=1}^{T} \sum_{l \in \mathcal{L}^t_k} w_t z^t_l,
$$

for $k = 1, \dots, K$.

Recall that one type of sparsity we want is to use few prototypes to build our counterfactual explanation. To model this, we introduce binary decision variables $u_b$, which control, through parameter $B^{max}$, the number of prototypes that can be used in the convex combination yielding $x$.

Given instance $x^0$ and a cost function $C$, the formulation associated with Problem (1), i.e., the problem of finding the minimal cost perturbation that causes the classifier to classify $x^0$ in class $k^*$, is as follows:

$$
\begin{aligned}
\min_{x, z, \alpha, u} \quad & C(x, x^0) && (10) \\
\text{s.t.} \quad & x_{v(s)}(t_s) - M_1 (1 - z^t_l) + \epsilon \le c_s \quad \forall s \in Left(l, t),\ \forall l \in \mathcal{L}^t,\ \forall t = 1, \dots, T && (11) \\
& x_{v(s)}(t_s) + M_2 (1 - z^t_l) - \epsilon \ge c_s \quad \forall s \in Right(l, t),\ \forall l \in \mathcal{L}^t,\ \forall t = 1, \dots, T && (12) \\
& \sum_{l \in \mathcal{L}^t} z^t_l = 1 \quad \forall t = 1, \dots, T && (13) \\
& \sum_{t=1}^{T} \sum_{l \in \mathcal{L}^t_{k^*}} w_t z^t_l \ge \sum_{t=1}^{T} \sum_{l \in \mathcal{L}^t_k} w_t z^t_l \quad \forall k = 1, \dots, K,\ k \neq k^* && (14) \\
& x_j = \alpha^0_j x^0_j + \sum_{b=1}^{B} \alpha^b_j x^b_j \quad \forall j = 1, \dots, J && (15) \\
& \sum_{b=0}^{B} \alpha^b_j = 1 \quad \forall j = 1, \dots, J && (16) \\
& \alpha^b_j \le u_b \quad \forall b = 1, \dots, B,\ \forall j = 1, \dots, J && (17) \\
& \sum_{b=1}^{B} u_b \le B^{max} && (18) \\
& u_b \in \{0, 1\} \quad \forall b = 1, \dots, B && (19) \\
& z^t_l \in \{0, 1\} \quad \forall l \in \mathcal{L}^t,\ \forall t = 1, \dots, T && (20) \\
& \alpha^b_j \ge 0 \quad \forall b = 1, \dots, B,\ \forall j = 1, \dots, J && (21) \\
& x \in \mathcal{X}^0. && (22)
\end{aligned}
$$

The cost function in (10) is discussed in more detail below, where we measure the movement from the original instance $x^0$ to its counterfactual explanation $x$ for functional data. Constraints (11) and (12) control to which leaf the counterfactual instance is assigned, and constraint (13) enforces that exactly one leaf is active for each tree. Constraint (14) ensures that the counterfactual instance is assigned to class $k^*$, i.e., the score of class $k^*$ is the highest one among all classes. Constraints (15) and (16) define, for each feature $j$, the counterfactual instance as the convex combination of $x^0_j$ and the prototypes $x^b_j$. To ensure sparsity in the prototypes, constraints (17)-(18) restrict the number of prototypes used in the convex combination to $B^{max}$. Constraints (19) and (20) ensure that all $u_b$ and $z^t_l$ are binary, constraint (21) that the coefficients $\alpha^b_j$ are non-negative, and constraint (22) that the counterfactual $x$ is in $\mathcal{X}^0$, the set containing the rest of the actionability and plausibility constraints.

Let us now discuss the objective function in (10) for the particular choices of $C$ introduced in Section 2, namely, (4) and (5). In order to model the $\ell_0$ term defined in (6), binary decision variables $\xi_j$ are introduced. For every feature $j = 1, \dots, J$, $\xi_j = 0$ if and only if $\alpha^0_j = 1$, i.e., if $x_j = x^0_j$. This is expressed as

$$
-\xi_j \le 1 - \alpha^0_j \le \xi_j \quad j = 1, \dots, J \tag{23}
$$

$$
\xi_j \in \{0, 1\}, \quad j = 1, \dots, J. \tag{24}
$$

Moreover, we have that

$$
\|x^0 - x\|_0 = \sum_{j=1}^{J} \xi_j.
$$

Thus, for the cost function $C$ in (4), we have the following reformulation of (10)-(22):

$$
\begin{aligned}
\min_{x, z, \alpha, u, \xi} \quad & \lambda_0 \sum_{j=1}^{J} \xi_j + \lambda_2 \int_0^T \sum_{j=1}^{J} (x_j(t) - x^0_j(t))^2 \, dt \\
\text{s.t.} \quad & (11)-(22), (23)-(24).
\end{aligned}
\tag{CEF}
$$

For the particular case of (CEF) where only the $\ell_0$ distance is considered, i.e., $\lambda_2 = 0$, the objective function as well as the constraints are linear (assuming $\mathcal{X}^0$ is also defined through linear constraints), while we have both binary and continuous decision variables. Therefore, Problem (CEF) can be solved using a Mixed Integer Linear Programming (MILP) solver. For arbitrary $\lambda_2 \ge 0$, taking into account that, by (15),

$$
(x_j(t) - x^0_j(t))^2 = \left( \sum_{b=0}^{B} \alpha^b_j x^b_j(t) - x^0_j(t) \right)^2,
$$

the second term in the objective can be expressed as a convex quadratic function in the decision variables $\alpha^b_j$, and thus (again, assuming $\mathcal{X}^0$ is also defined through linear constraints) Problem (CEF) is a Mixed Integer Convex Quadratic Model with linear constraints.

Let us address Problem (1) when the cost function $C$ has the form (5), and thus the DTW distance is involved. As in Section 2, the time interval $[0, T]$ is discretised in time instants $t_1, \dots, t_n$, and thus the DTW distance is the minimal squared Euclidean distance among the warping paths $\mathcal{W}$, yielding

$$
\begin{aligned}
\min_{x, z, \alpha, \xi, u} \quad & \lambda_0 \sum_{j=1}^{J} \xi_j + \lambda_2 \sum_{r=1}^{Q} \sum_{j=1}^{J} \left( x_j(q_{1r}) - x^0_j(q_{2r}) \right)^2 \\
\text{s.t.} \quad & (11)-(22), (23)-(24) \\
& (q_{11}, q_{21}) \to (q_{12}, q_{22}) \to \dots \to (q_{1Q}, q_{2Q}) \in \mathcal{W}
\end{aligned}
\tag{CEFDTW}
$$

Notice how, for a fixed warping path in $\mathcal{W}$, all the constraints (11)-(24) are linear, while we have both binary and continuous variables. Hence, if $\mathcal{X}^0$ is again defined by linear constraints, since the objective function is quadratic, Problem (CEFDTW) with a fixed warping path is a Mixed Integer Convex Quadratic Model with linear constraints, which can be solved using standard optimization packages. For this reason, we propose an alternating heuristic to solve Problem (CEFDTW), which alternates between solving (CEFDTW) for a fixed warping path and updating the warping path for the resulting counterfactual:

Algorithm 1: Algorithm to calculate counterfactual explanations with the DTW-based cost (5)

1. Initialisation: Let $\pi(0)$ be the warping path $(t_1, t_1) \to (t_2, t_2) \to \dots \to (t_n, t_n)$
2. $r = 0$
3. Solve Problem (CEFDTW) with $\pi(r)$ as warping path, and obtain the counterfactual instance $x$ and DTW distance $\delta(r)$
4. Find the optimal warping path $\pi^*$ and its corresponding $\delta^*$ by solving
$$
\begin{aligned}
\delta^* = \min \quad & \sum_{r=1}^{Q} \sum_{j=1}^{J} \left( x_j(q_{1r}) - x^0_j(q_{2r}) \right)^2 \\
\text{s.t.} \quad & (q_{11}, q_{21}) \to (q_{12}, q_{22}) \to \dots \to (q_{1Q}, q_{2Q}) \in \mathcal{W}
\end{aligned}
$$
5. If $\delta^* = \delta(r)$, then stop
6. Else, update $\pi(r + 1) = \pi^*$, $r = r + 1$ and go to step 3

Output: counterfactual instance $x$
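Step 4 of Algorithm 1 can be carried out with the same dynamic program used to evaluate DTW, recovering $\pi^*$ by backtracking through the table; a sketch for sequences on a common grid (our own illustrative code with 0-indexed path pairs, not the authors' implementation):

```python
import numpy as np

def optimal_warping_path(x, x0):
    """Step 4 of Algorithm 1 (sketch): build the DTW table for the current
    counterfactual x and the instance x0, then backtrack from (n, n) to
    (1, 1) to recover the optimal warping path pi* and its distance delta*."""
    n = len(x)
    D = np.full((n + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            cost = np.sum((np.asarray(x[i - 1]) - np.asarray(x0[j - 1])) ** 2)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack, always moving to the cheapest predecessor cell.
    path, i, j = [], n, n
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))        # record the 0-indexed sample pair
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: D[s])
    path.append((0, 0))
    return path[::-1], float(D[n, n])

path, delta = optimal_warping_path([0.0, 1.0, 2.0], [0.0, 2.0, 2.0])
print(path, delta)
```

In the full heuristic, this routine would replace the generic minimisation over $\mathcal{W}$ in step 4, with the MICQP solve of step 3 handled by an optimization solver.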

4 Numerical illustration

We illustrate our methodology on two real-world datasets, one univariate and one multivariate, from the UCR archive [9]. For a given instance, we are able to identify the individuals of the dataset from which the corresponding counterfactual is made up and what their contribution is. Furthermore, we show the two different sparsities that we can obtain with our model, namely, the number of prototypes used for the counterfactual and the number of functional features that change. The use of different distances, i.e., the Euclidean and the DTW distances, is also displayed.

All the mathematical optimization problems have been implemented using the Pyomo optimization modeling language [19, 20] in Python 3.8. As solver, we have used Gurobi 9.0 [18]. A value of $\epsilon = 10^{-6}$ has been imposed in (11) and (12). The values of the big-$M$ in (11) and (12) are node dependent, and they have been tightened following the process described in [7]. For all the computational experiments, the classification model considered has been a Random Forest with $T = 200$ trees and a maximum depth of 4. Our experiments have been conducted on a PC with an Intel Core i7-1065G7 CPU @ 1.30 GHz and 16 gigabytes of RAM, running a 64-bit operating system.

The first dataset, ItalyPowerDemand [23], has one functional feature. There are 1096 instances, and each instance is a time series of length 24, representing the power demand in Italy over six months. The binary classification task is to distinguish days from October to March (response value $-1$) from April to September (response value $+1$). The second dataset, NATOPS [15], has 24 functional time series of length 51, representing the X, Y, and Z coordinates of the left and right hands, wrists, thumbs and elbows, as captured by a Kinect 2 sensor. There are 260 instances, and we chose two of the six classes in the dataset. The binary classification task is thus to distinguish the gesture "All Clear" (response value $-1$) from "Not Clear" (response value $+1$).

4.1 Experimental results

ItalyPowerDemand

We present the counterfactual for an instance $x^0$ of the ItalyPowerDemand dataset in Figure 2. In each case, we represent the original curve, the prototypes, and the final counterfactual.

The first cost model analysed is the squared Euclidean model (4) with $\lambda_0 = 0$ (since we have only one feature, $\lambda_0 > 0$ is meaningless). Different values of $B^{max}$ have been used. The smaller the value of $B^{max}$, the sparser the counterfactual is in terms of prototypes, while the larger the value of $B^{max}$, the higher the freedom to use prototypes and therefore the smaller the distances obtained. In Figure 3 we plot the relation between the number of prototypes and the distance. It illustrates how using more than one prototype may be beneficial, but using more than 4 prototypes reduces sparsity without reducing the distance.

To show the flexibility of our model, the same experiments have been carried out with the cost based on the DTW distance (5), again with $\lambda_0 = 0$. The counterfactual solutions have been calculated with the heuristic procedure described in Algorithm 1. The results are depicted in Figure 4. As before, one can see how the objective function decreases as the number of prototypes $B^{max}$ increases. However, in this case, it is sufficient to use 2 prototypes, as 3 or more do not improve the objective function much, see Figure 5.

NATOPS

We now present the counterfactual for an instance $x^0$ of the multivariate dataset NATOPS. The cost function used is the squared Euclidean model (4) with $\lambda_0 = 1$, $\lambda_2 = 0.005$.

In Figure 6, the counterfactual instance $x$ for $x^0$ with $B^{max} = 1$ is shown. As the cost function $C$ contains as its first term the $\ell_0$ norm, we obtain a solution that is sparse in the sense of the features we need to change to move from $x^0$ to $x$. Indeed, to change its class, only three functional features have to be modified. In Figure 7, the changed features are presented.

As in the univariate case, we can impose different values of $B^{max}$. In Figure 8, we show the counterfactual explanation for $B^{max} = 2$ and the same cost function. Note how giving the flexibility to use more than one prototype results in only having to change two features, see Figure 9.

(a) $B^{max} = 1$ (b) $B^{max} = 2$
(c) $B^{max} = 3$ (d) $B^{max} = 4$

Figure 2: Counterfactual explanations for $x^0$ of the ItalyPowerDemand data set, which has been predicted by the Random Forest in $k^0 = -1$ and whose counterfactual $x$ has to be predicted in class $k^* = +1$. Different values of $B^{max}$, i.e., the number of prototypes used for the convex combination, have been imposed. The cost function is model (4) with $\lambda_0 = 0$, $\lambda_2 = 1$.

5 Conclusions

In this paper, we have proposed a novel approach to build counterfactual explanations when dealing with multivariate functional data in classification problems by means of mathematical optimization. With our method, we ensure plausible and sparse explanations, controlling not only the number of prototypes of the dataset used to create the counterfactuals, but also the number of features that need to be changed. Our model is also flexible enough to be used with different distance measures, e.g., the Euclidean distance or the DTW distance. Moreover, our methodology is applicable to score-based classifiers, including additive tree models, such as Random Forest or XGBoost models, as well as linear models, such as logistic regression and linear support vector machines. We have illustrated our methodology on two real-world datasets.

Figure 3: Euclidean distance obtained vs number of prototypes used in a counterfactual explanation $x$ for $x^0$ of the ItalyPowerDemand data set, which has been predicted by the Random Forest in $k^0 = -1$ and for which $k^* = +1$ is imposed.

There are several interesting lines of future research. First, an extension to other, non-score-based classifiers, like $k$-NN classifiers, deserves some study. Secondly, to define counterfactual explanations for functional data, one could be interested in keeping fixed a part of the curves defining the features. With our method we build the counterfactuals from scratch using combinations of prototypes on the interval $[0, T]$; but suppose we have an instance defined on the interval $[0, t_0)$, and one might want to find out what the rest of the curve on the interval $[t_0, T]$ would have to look like for the overall curve to be classified in class $k^*$. When constructing the rest of the curve, one would need to maintain the smoothness and other properties of the curve. Finally, an extension to other distances, like the optimal transportation distance, is also a topic of interest.

Acknowledgements

This research has been financed in part by research projects EC H2020 MSCA RISE NeEDS (Grant agreement ID: 822214), FQM-329, P18-FR-2369 and US-1381178 (Junta de Andalucía), and PID2019-110886RB-I00 (Ministerio de Ciencia, Innovación y Universidades, Spain). This support is gratefully acknowledged.

(a) $B^{max} = 1$ (b) $B^{max} = 2$
(c) $B^{max} = 3$ (d) $B^{max} = 4$

Figure 4: Counterfactual explanations for $x^0$ of the ItalyPowerDemand data set, which has been predicted with a Random Forest in $k^0 = -1$ and whose counterfactual $x$ has to be predicted in class $k^* = +1$. Different values of $B^{max}$, i.e., the number of prototypes used for the convex combination, have been imposed. The cost function is model (5) with $\lambda_0 = 0$, $\lambda_2 = 1$.

References

[1] Emre Ates, Burak Aksar, Vitus J Leung, and Ayse K Coskun. Counterfactual explanations for multivariate time series. In 2021 International Conference on Applied Artificial Intelligence (ICAPAI), pages 1–8. IEEE, 2021.

[2] S. Benítez-Peña, E. Carrizosa, V. Guerrero, M.D. Jiménez-Gamero, B. Martín-Barragán, C. Molero-Río, P. Ramírez-Cobo, D. Romero Morales, and M.R. Sillero-Denamiel. On sparse ensemble methods: An application to short-term predictions of the evolution of COVID-19. European Journal of Operational Research, 295(2):648–663, 2021.

Figure 5: DTW distance obtained vs number of prototypes used in a counterfactual explanation $x$ for $x^0$ of the ItalyPowerDemand data set, which has been predicted by the Random Forest in $k^0 = -1$ and for which $k^* = +1$ is imposed.

[3] Donald J Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD workshop, volume 10, pages 359–370. Seattle, WA, USA, 1994.

[4] D. Bertsimas, A. King, and R. Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.

[5] Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.

[6] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[7] Emilio Carrizosa, Jasone Ramírez-Ayerbe, and Dolores Romero Morales. Generating collective counterfactual explanations in score-based classification via mathematical optimization. Technical Report IMUS, Sevilla, Spain, 2022. https://www.researchgate.net/publication/353073138_Generating_Counterfactual_Explanations_in_Score-Based_Classification_via_Mathematical_Optimization.

[8] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM.

Figure 6: Counterfactual explanations for $x^0$ of the NATOPS data set, which has been predicted by the Random Forest in $k^0 = +1$ and whose counterfactual $x$ has to be predicted in class $k^* = -1$. $B^{max} = 1$ prototype has been imposed. The cost function is model (4) with $\lambda_0 = 1$, $\lambda_2 = 0.005$.

[9] Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn Keogh. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica, 6(6):1293–1305, 2019.

[10] Eoin Delaney, Derek Greene, and Mark T Keane. Instance-based counterfactual explanations for time series classification. In International Conference on Case-Based Reasoning, pages 32–47. Springer, 2021.

[11] Mengnan Du, Ninghao Liu, and Xia Hu. Techniques for interpretable machine learning. Communications of the ACM, 63(1):68–77, 2019.

[12] Carlos Eiras-Franco, Bertha Guijarro-Berdinas, Amparo Alonso-Betanzos, and Antonio Bahamonde. A scalable decision-tree-based method to explain interactions in dyadic data. Decision Support Systems, 127:113141, 2019.

[13] Philippe Esling and Carlos Agon. Time-series data mining. ACM Computing

Surveys (CSUR), 45(1):1–34, 2012.

[14] Runshan Fu, Manmohan Aseri, ParamVir Singh, and Kannan Srinivasan. “Un”

fair machine learning algorithms. Management Science, 68(6):4173–4195, 2022.

[15] Nehla Ghouaiel, Pierre-Franc¸ois Marteau, and Marc Dupont. Continuous pattern

detection and recognition in stream-a benchmark for online gesture recognition.

International Journal of Applied Pattern Recognition, 4(2):146–160, 2017.

[16] B. Goodman and S. Flaxman. European Union regulations on algorithmic

decision-making and a “right to explanation”. AI Magazine, 38(3):50–57, 2017.

[17] Riccardo Guidotti. Counterfactual explanations and how to find them: literature review and benchmarking. Forthcoming in Data Mining and Knowledge Discovery, 2022.

[18] Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2021.

[19] William E. Hart, Carl D. Laird, Jean-Paul Watson, David L. Woodruff, Gabriel A. Hackebeil, Bethany L. Nicholson, and John D. Siirola. Pyomo–optimization modeling in Python, volume 67. Springer Science & Business Media, second edition, 2017.

[20] William E Hart, Jean-Paul Watson, and David L Woodruff. Pyomo: modeling and solving mathematical programs in Python. Mathematical Programming Computation, 3(3):219–260, 2011.

[21] Wolfgang Jank and Galit Shmueli. Functional data analysis in electronic commerce research. Statistical Science, 21(2):155–166, 2006.

[22] Amir-Hossein Karimi, Gilles Barthe, Bernhard Schölkopf, and Isabel Valera. A survey of algorithmic recourse: definitions, formulations, solutions, and prospects. arXiv preprint arXiv:2010.04050, 2021.

[23] Eamonn Keogh, Li Wei, Xiaopeng Xi, Stefano Lonardi, Jin Shieh, and Scott

Sirowy. Intelligent icons: Integrating lite-weight data mining and visualization

into GUI operating systems. In Sixth International Conference on Data Mining

(ICDM’06), pages 912–916. IEEE, 2006.

[24] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4768–4777, 2017.

[25] S.M. Lundberg, G. Erion, H. Chen, A. DeGrave, J.M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S.-I. Lee. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):2522–5839, 2020.


[26] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, 2019.

[27] Kiarash Mohammadi, Amir-Hossein Karimi, Gilles Barthe, and Isabel Valera.

Scaling guarantees for nearest counterfactual explanations. In Proceedings of the

2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 177–187, 2021.

[28] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.

[29] Ashish Sood, Gareth M James, and Gerard J Tellis. Functional regression: A new

model for predicting market penetration of new products. Marketing Science,

28(1):36–51, 2009.

[30] Nur Sunar and Jayashankar M. Swaminathan. Net-metered distributed renewable

energy: A peril for utilities? Management Science, 67(11):6716–6733, 2021.

[31] Sahil Verma, John Dickerson, and Keegan Hines. Counterfactual explanations for

machine learning: A review. arXiv preprint arXiv:2010.10596, 2020.

[32] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31:841, 2017.

[33] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey on sequence

classiﬁcation. ACM SIGKDD Explorations Newsletter, 12(1):40–48, 2010.

[34] Dmitry Zhdanov, Sudip Bhattacharjee, and Mikhail A Bragin. Incorporating FAT and privacy aware AI modeling approaches into business decision making frameworks. Decision Support Systems, 155:113715, 2022.

[35] Zemin Zheng, Jinchi Lv, and Wei Lin. Nonsparse learning with latent variables.

Operations Research, 69(1):346–359, 2021.


Figure 7: Changed features in the counterfactual explanation for x0 of the NATOPS data set, which has been predicted by the Random Forest in class k0 = −1 and whose counterfactual x has to be predicted in class k∗ = +1, with Bmax = 1. Panels: (a) Feature 7, (b) Feature 11, (c) Feature 13. The cost function is model (4) with λ0 = 1, λ2 = 0.005.

Figure 9: Changed features in the counterfactual explanation for x0 of the NATOPS data set, which has been predicted by the Random Forest in class k0 = −1 and whose counterfactual x has to be predicted in class k∗ = +1, with Bmax = 2. Panels: (a) Feature 11, (b) Feature 22. The cost function is model (4) with λ0 = 1, λ2 = 0.005.