Contact Personalization
using a Score Understanding Method
Vincent Lemaire, Raphaël Féraud, Nicolas Voisine
Orange Labs, 2 avenue Pierre Marzin, 22307 Lannion Cedex - France
E-mail: vincent.lemaire@orange-ftgroup.com
Abstract—This paper presents a method to interpret the output of a classification (or regression) model. The interpretation is based on two concepts: the variable importance and the value importance of the variable. Unlike most state-of-the-art interpretation methods, our approach allows the interpretation of the model output for every instance. Understanding the score given by a model for one instance can, for example, lead to an immediate decision in a Customer Relationship Management (CRM) system. Moreover, the proposed method does not depend on a particular model and is therefore usable with any model or software used to produce the scores.
I. INTRODUCTION
The most elaborate way, in a CRM system, to build knowledge about customers is to produce scores. Tools which produce scores make it possible to project quantifiable information onto a given population. The score is an evaluation, for every instance, of a target variable to be explained. The score (the output of a model) is computed using input variables which describe the instances. Scores are then "injected" into the information system (IS), for example, to personalize the customer relationship.
Nevertheless, the scores are sometimes not directly usable. For example, if a scoring model identifies a customer interested in churning, the score says nothing about the action needed to avoid the cancellation. To prevent this intention to churn, the fragility of the customer and its causes have to be identified.
We propose to solve this problem by interpreting the classification produced by the model for every instance. To make an industrial implementation of this solution possible, we propose a completely automatic method. The interpretation of the score is delivered for every instance to feed the information system. This knowledge can then be exploited to personalize the information provided in customer relationship management.
The proposed method is independent of the model used to build the scores. The most powerful model can be used without changing the difficulty of its interpretation. This interpretation method can thus remove one of the principal difficulties in the use of models such as Support Vector Machines (SVM), Random Forests (RF) or artificial neural networks (ANN) in marketing services.
II. POSITIONING AND PREVIOUS WORKS
A. Variable importance
The field of machine learning abounds in techniques able to effectively solve regression and/or classification problems. These techniques build a model from a training database made up of a finite number of examples. The built model is used to associate an input vector with an output vector or a class label.
The large number of models (linear regression, ANN, naive Bayes, Random Forest (RF), Parzen window...) existing in the literature has led to a number of interpretation methods, generally specific to each model. The interpretation of the model is often based on: the parameters and the structure of the model [1], statistical tests on the model coefficients [2], geometrical interpretations [3], rules [4] or fuzzy rules [5]. The resulting interpretations are often complex, based on averages (over several individuals), restricted to a given model (ANN, decision tree), or to a given task (regression or classification).
Another approach consists in analysing the model as a black box with a sensitivity analysis method. In these "What if?" analyses, the structure and the parameters of the model are only needed to compute the output of the model. This independence yields interpretation methods that are valid whatever the model.
To analyse the state-of-the-art approaches in detail, the notations used in the remainder of this paper are introduced in Table I.
V_j : an input variable j;
X : a vector of dimension J;
K : the number of training examples;
X_n : an example n;
X_nj : the component j of the vector X_n;
F : the predictive model;
p : the component p of the output vector;
F^p(X) : the output value of the component p of the output vector of the model;
F_j^p(a; b) = F^p(a_1, ..., a_{j-1}, b, a_{j+1}, ..., a_J).

TABLE I. NOTATIONS
In this table, F_j^p(a; b) denotes the output p of the model when the component j of the vector a is replaced by the value b. The proposed method analyses the outputs of the model one by one. Therefore the simplified notation F_j will be used (instead of F_j^p). All calculations presented in this paper are identical whatever the output p of the model.
Främling [6] introduces a variable importance measure, I, based on sensitivity analysis:

I(V_j | F, X_n, p) = [F_j(X_n; max(V_j)) − F_j(X_n; min(V_j))] / [max_n F(X_n) − min_n F(X_n)];

where max(V_j) and min(V_j) denote respectively the maximum and the minimum value of V_j.
Fig. 1. "What if" simulation: output values of the model vs. values of V_j.
This measurement is interesting but can be misleading when F is not monotonic (see Figure 1). In this illustrative example, the variable V_j is important for the model F: according to the values of this variable, an example can be classified in class +1 or −1. However F_j(X_n; max(V_j)) and F_j(X_n; min(V_j)) are close, which leads to underestimating the importance of the variable V_j. Moreover, this method is based on the extremum values of the variable and is thus very sensitive to noise.
Another approach is based on the variation of the model output for a variation h of the variable V_j and an example X_n (see Fig. 1). When h tends towards zero, this measurement corresponds to the partial derivative of the model with respect to the variable V_j. In this case, the measurement is local and can give an erroneous importance value: the partial derivative at the point F(X_n) is null for this example whereas the variable V_j is important. When h is larger, as in the previous case, this measurement can be misleading when F is not monotonic. The problem is the same when these measurements are averaged over all examples.
Féraud et al. [7] propose a method based on the integral of the variations of the model outputs. This measurement is well adapted to non-monotonic functions. On the illustrative example (see Fig. 1), this measurement is related to the surface under the curve; as this surface is large, the variable V_j is important. The principal drawback of this method is that it does not take into account the distribution of the examples to define the interval of integration.
We propose a variable importance measurement based on the integral of the output variations of the model using the probability distributions of the examples. This measurement was tested successfully for classification problems in [8] and for regression problems in [9]. This method will be used in this paper as the "variable importance" definition.
B. Variable influence
For a given problem, a subset of relevant variables can be chosen using the variable importance measurement. This variable selection increases the model robustness and facilitates the model interpretation. However, the notion of variable importance, for an instance X_n, is not sufficient to interpret its classification.
One way to complete the interpretation is to analyse the importance of the value of the considered variable V_j on the output value of the model. In Figure 1 the example X_n belongs to the class −1. What does the value of the variable V_j indicate for this example? Is it possible to change its class by modifying the V_j value? We propose to answer such questions using a measurement of the value of a given variable V_j for an example X_n. The importance of the value of a variable will be called its "influence".
To produce an interpretation of the model, Féraud et al. [7] propose to segment the examples and then to characterize each cluster using the variable importances and influences inside every cluster. In this paper the objective is to propose a method which produces, automatically (without human assistance), an interpretation of the score for each example (instead of for each cluster).
Therefore an "influence measurement" relative to every example will be proposed in the next section. Among existing methods, the method proposed by Främling in [6] is the closest. But Främling uses extremums and an assumption of monotonic variations of the model output with respect to the variations of the input variable. The proposed "influence" measure is based on the distribution of the examples and is therefore more robust to outliers.
III. METHOD DESCRIPTION
A. Importance of an input variable for an example
Considering¹ the model F, the example X_n, the input variable V_j and the variable to be explained p, the sensitivity of the model S(V_j | F, X_n, p) is defined as the sum of the variations observed on the output p when perturbing the example X_n using the probability distribution of the input variable V_j.
The perturbed output of the model F, for an example X_n, is the model output for this example but with the value of the variable V_j replaced by its value for an example k. The measured variation, for the example X_n, is then the difference between the "true output" F_j(X_n) and the "perturbed output" F_j(X_n; X_k) of the model.
The sensitivity of the model is then the mean value of ||F_j(X_n) − F_j(X_n; X_k)||² under the probability distribution of the variable V_j. Approximating the variable probability distribution by the empirical distribution of the examples:

S(V_j | F, X_n, p) = Σ_{k=1}^{K} || F_j(X_n) − F_j(X_n; X_k) ||²    (1)
A sensitivity distribution is obtained by carrying out this sensitivity measurement on the output p whatever the input variable² V_j. The importance of the variable V_j for the example X_n, I(V_j | F, X_n, p), is then defined as the rank, o, of the model sensitivity S(V_j | F, X_n, p) in the sensitivity distribution S(V_j | F, X_i, p) ∀i, j:

I(V_j | F, X_n, p) = P[ (S(V_j | F, X_i, p) ∀i, j) ≤ S(V_j | F, X_n, p) ] = o    (2)

¹ The definitions of I and I_v are presented here for one variable V_j of the input vector of the model and one output p of the output vector. These definitions are the same whatever the considered variables j and p.
² The importance is not intrinsic to one input variable but relative to all variables. The distribution is established over all the input variables and using all the examples.

This measurement provides the importance of an input variable for an example relative to all other examples and all other input variables. This relative measurement gives relevant information for every instance.
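As an illustration, the following minimal Python sketch (an assumption of this edit, not code from the paper) computes the sensitivity of equation (1) and the rank-based importance of equation (2); it assumes a black-box scoring function model_fn that maps a NumPy matrix of examples to a vector of output values, and the helper names sensitivity and importance are hypothetical.

import numpy as np

def sensitivity(model_fn, X, n, j):
    """Eq. (1): sum of the squared output variations when the value of
    variable j of example n is replaced, in turn, by the values observed
    for the K examples (empirical distribution of V_j)."""
    K = X.shape[0]
    perturbed = np.tile(X[n], (K, 1))     # K copies of example X_n
    perturbed[:, j] = X[:, j]             # replace component j by X_kj, k = 1..K
    f_true = model_fn(X[n:n + 1])[0]      # "true" output F_j(X_n)
    f_pert = model_fn(perturbed)          # "perturbed" outputs F_j(X_n; X_k)
    return float(np.sum((f_true - f_pert) ** 2))

def importance(model_fn, X, n, j):
    """Eq. (2): rank of S(V_j | F, X_n, p) within the pooled distribution of
    sensitivities computed over all examples i and all variables j'
    (O(K^2 J) model calls; in practice the distribution would be cached)."""
    K, J = X.shape
    all_s = np.array([[sensitivity(model_fn, X, i, jj) for jj in range(J)]
                      for i in range(K)])
    return float(np.mean(all_s <= all_s[n, j]))   # empirical P[S(.) <= S(V_j|F,X_n,p)]

Caching the pooled sensitivity distribution once per dataset, as suggested in Section III-C, avoids recomputing it for every instance.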
B. Influence on an example of an input variable value
An input variable can "pull up" (high value) or "pull down" (low value) the model output. For the example X_n, the "natural" value of the output p of the model is by definition F(X_n) (which can also be denoted F_j(X_n; X_n)). The perturbed value considering the input variable V_j is F_j(X_n; X_k). The distribution of F_j(X_n; X_k) represents the "potential" values for the example X_n if its variable V_j were different. The position of the natural value of X_n (F(X_n)) within this distribution gives information on the value of V_j (X_nj). The influence of the variable V_j on an example X_n, I_v(V_j | F, X_n, p), is then defined as the rank, r, of the "natural" model output within the "potential" values:

I_v(V_j | F, X_n, p) = P[ (F_j(X_n; X_k) ∀k) ≤ F(X_n) ] = r.    (3)
For example, for a two-class classification problem (output −1 or +1), a high value of the rank I_v shows a positive influence on the class +1 and a negative one on the class −1. Reciprocally, a low value of the rank I_v shows a positive influence on the class −1 and a negative one on the class +1.
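In the same spirit, a minimal sketch of the influence of equation (3), under the same assumptions about model_fn and with the hypothetical helper name influence:

import numpy as np

def influence(model_fn, X, n, j):
    """Eq. (3): rank of the "natural" output F(X_n) within the distribution
    of "potential" outputs F_j(X_n; X_k), k = 1..K, obtained by replacing
    the value of variable j of example n by the values of the other examples."""
    K = X.shape[0]
    perturbed = np.tile(X[n], (K, 1))
    perturbed[:, j] = X[:, j]              # "potential" values of V_j for example n
    f_true = model_fn(X[n:n + 1])[0]       # natural output F(X_n)
    f_pot = model_fn(perturbed)            # potential outputs F_j(X_n; X_k)
    return float(np.mean(f_pot <= f_true)) # empirical P[F_j(X_n; X_k) <= F(X_n)]

A rank close to 1 means that the current value of V_j pulls the output towards its highest reachable values for this example, and a rank close to 0 towards its lowest.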
C. Automation of the interpretation: discussion
In business applications of CRM, scores identify the customers most likely to react positively to a marketing campaign. For example, rather than sending mail to all its customers to offer a product, a company will prefer to target the subset of its customers having the highest "appetency" for the product. The marketing campaign will be less expensive, and the customers who are not interested in the product will have a lower probability of receiving the advertisement in their post-box (or mailbox).
The score interpretation brings additional information to improve the effectiveness of marketing campaigns. The score understanding provides means to support and personalize commercial actions. For example, if a customer is identified as fragile because he wishes to renew his mobile phone, the telecommunication company will be able to react by proposing a subscription with a reduction on the purchase price of a mobile phone. If the fragility of another customer corresponds to an under-use of his "pay monthly" plan, the company will be able to propose a better adapted plan.
In our system (see Figure 2), scores and score interpretations are computed in the deployment phase. The identifiers of the customers having the highest scores, together with the corresponding interpretations, are sent to the CRM system. This system uses the score understanding to personalize customer relationships.
The method proposed in this paper analyses the sensitivity of the model output p considering each input variable independently.
Fig. 2. Application architecture (customers database, target choice, modelisation, model, deployment, scores and interpretations, contact customization, target population).
The different steps needed to obtain the score understanding can require a long computation time. To speed up this computation, two solutions are possible. The first solution extracts "an abstract" of each input variable, using for example the method presented in [10] or centile information for continuous variables, and the method presented in [11] for categorical variables. The second one consists in memorising the S(.) distribution.
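The following minimal Python sketch (an assumption of this edit, not the Khiops/MODL methods of [10], [11]) illustrates the first solution: each variable is abstracted by a small set of representative values and their empirical probabilities, which can then be used as the perturbation values.

import numpy as np

def centile_summary(values, n_centiles=100):
    """Abstract a continuous variable by at most n_centiles representative
    values (its centiles) with their empirical probabilities."""
    qs = np.linspace(0.0, 1.0, n_centiles + 1)[1:]      # 1%, 2%, ..., 100%
    reps = np.quantile(values, qs)
    reps, counts = np.unique(reps, return_counts=True)  # merge duplicated centiles
    return reps, counts / counts.sum()

def categorical_summary(values):
    """Abstract a categorical variable by its distinct modalities and their
    empirical frequencies (a crude stand-in for the grouping of [11])."""
    modalities, counts = np.unique(values, return_counts=True)
    return modalities, counts / counts.sum()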
IV. ILLUSTRATION ON A TOY EXAMPLE
A. Toy example
A toy example has been constructed to test and observe the model interpretation method proposed in this paper. This toy example is presented in Figure 3. In this figure the class −1 is in black and the class +1 is in gray. Figure 4 illustrates the "a priori" influence zones of the two dimensions: (1) areas of points A and C: examples where both V_1 and V_2 influence the class; (2) area of point B: examples where only V_1 influences the class; (3) areas of points D and F: examples where only V_2 influences the class; and (4) area of point E: examples where neither dimension influences the class.
Data: 1000 examples for the training set and 1000 for the test set were randomly drawn (V_1 ∈ [0, 2], V_2 ∈ [0, 2]).
Fig. 3. Toy example: two classes.

Fig. 4. Influence zones (points A to F).

Models - Two types of model were tested on this toy example: (1) a Neural Network [12] (NN) using one hidden layer, a sigmoid activation function, the standard back-propagation algorithm (stochastic version) and the squared error as cost function; using a cross-validation procedure the number of hidden units was fixed to 4; (2) a Parzen Window [13] (PW) using a Gaussian kernel and the L2 norm:
P(y_i | X_n) = ( Σ_{k : y_k = y_i} K(X_n, X_k) ) / ( Σ_k K(X_n, X_k) ), where K(X_n, X_k) = exp( −||X_n − X_k||² / (2σ²) ).

The parameter σ was fixed to 0.1 using a cross-validation procedure. Whatever the model, the data were standardized before training.
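As a concrete reading of this estimator, here is a minimal Python sketch (an assumption of this edit, with the hypothetical name parzen_score); the toy-example outputs reported later lie in [−1, +1], which would correspond to 2·P(y = +1 | X_n) − 1.

import numpy as np

def parzen_score(X_train, y_train, x, sigma=0.1, target_class=+1):
    """Parzen-window estimate of P(y = target_class | x) with the Gaussian
    kernel K(x, x_k) = exp(-||x - x_k||^2 / (2 * sigma^2))."""
    sq_dist = np.sum((X_train - x) ** 2, axis=1)      # ||x - x_k||^2 for all k
    k = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return float(k[y_train == target_class].sum() / k.sum())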
B. Construction of the elements of the interpretation
Among the 1000 test examples, 6 examples representative of the influence zones of variables V_1 and V_2 were selected to illustrate the method. Their locations are indicated in Figure 4 and they are named A to F: A(0.25, 1.50), B(1.00, 1.50), C(1.75, 1.50), D(0.25, 0.25), E(1.00, 0.25), F(1.75, 0.25).
The interpretation of these 6 examples requires the following steps (for n ∈ {A, B, C, D, E, F}; a sketch of this procedure is given after the list):
for I(V_j | F, X_n, p):
(1.1) calculation of S(V_j | F, X_i, p) ∀j, i;
(1.2) sorting of S(.);
(1.3) determination of the rank of S(V_j | F, X_n, p);
for I_v(V_j | F, X_n, p):
(2.1) calculation of F(X_n; X_k) ∀k;
(2.2) sorting of F(.);
(2.3) determination of the rank of F(X_n).
C. Results and discussion
Figure 5 shows the sensitivity distribution (S(.), equation 1) obtained for V_1 using the NN and the PW on the training set. The x-coordinate represents a sensitivity value and the y-coordinate its corresponding rank in the distribution. The sensitivity rank progresses by stages (the result, not presented here, is the same for V_2). Sensitivity distributions are made up of a few important modalities related to the considered classification problem and the models used. These distributions concatenate the effects of individual sensitivities and influence zones: zones where the input variables have no interest, zones where they have high interest, and transitory zones.
Figure 6 presents the distributions of "potential" outputs for the test point F and both input variables V_1, V_2 using the NN. The distribution obtained using the input variable V_1 has only one modality: F(X_n; X_k) = −1.0 ∀k. This result is consistent since this variable has no influence for this example F. The distribution obtained using the input variable V_2 has 3 modes: F(X_n; X_k) = −1, −1 < F(X_n; X_k) < +1, and F(X_n; X_k) = +1.
Fig. 5. Ordered sensitivity distribution for V_1 (NN and PW).

Fig. 6. Ordered "potential" outputs for the test point F and V_1, V_2 using the NN.
Figures 5 and 6 show that it could be interesting to use a rank range instead of a single rank. Quintiles Q_1, Q_2, Q_3, Q_4 and Q_5 will now be used, with the respective labels "Very weak", "Weak", "Average", "Strong" and "Very strong". Each rank belongs to one of these quintiles (value of Q in Table II) and therefore receives the corresponding label. The joint observation of Table II, Figure 5 and Figure 6 shows a total coherence in the obtained results.
The influence of an input variable (I_v) has to be evaluated in conjunction with the variable importance (I). If I = 0, the corresponding I_v is meaningless. Variables with a small I should not be used in the interpretation; in this case the interpretation has to be based only on the important variables (for these cases the value of I_v is not reported in Table II).
Interpretation using the MLP

V_j, X_n    S      I            F(X_n)   I_v
V_1, X_A    1.24   Q_4 (o=63)   +1.00    Q_5 (r=99)
V_2, X_A    0.96   Q_3 (o=49)   +1.00    Q_5 (r=99)
V_1, X_B    2.70   Q_5 (o=89)   -1.00    Q_1 (r=14)
V_2, X_B    0.00   -            -        -
V_1, X_C    1.24   Q_4 (o=63)   +1.00    Q_5 (r=99)
V_2, X_C    0.93   Q_2 (o=31)   +1.00    Q_5 (r=99)
V_1, X_D    0.00   -            -        -
V_2, X_D    3.03   Q_5 (o=95)   -1.00    Q_2 (r=22)
V_1, X_E    0.00   -            -        -
V_2, X_E    0.00   -            -        -
V_1, X_F    0.00   -            -        -
V_2, X_F    3.05   Q_5 (o=98)   -1.00    Q_2 (r=21)

Interpretation using the Parzen window

V_j, X_n    S      I            F(X_n)   I_v
V_1, X_A    1.16   Q_4 (o=63)   +0.99    Q_4 (r=74)
V_2, X_A    0.97   Q_3 (o=53)   +0.99    Q_4 (r=74)
V_1, X_B    2.28   Q_5 (o=89)   -0.99    Q_2 (r=25)
V_2, X_B    0.00   -            -        -
V_1, X_C    1.16   Q_4 (o=63)   +0.99    Q_4 (r=75)
V_2, X_C    0.90   Q_2 (o=35)   +0.99    Q_4 (r=67)
V_1, X_D    0.00   -            -        -
V_2, X_D    2.96   Q_5 (o=96)   -0.99    Q_1 (r=12)
V_1, X_E    0.00   -            -        -
V_2, X_E    0.00   -            -        -
V_1, X_F    0.00   -            -        -
V_2, X_F    3.02   Q_5 (o=90)   -0.99    Q_1 (r=12)

TABLE II. INTERPRETATION OF THE 6 TEST POINTS
D. Two examples of obtained interpretations
Two interpretations using Table II are presented here. The first interpretation is for the test point A using the Parzen window. The interpretation contains 3 elements: (1) the point belongs to the class +1 with a probability (the CRM score) of 0.99 (the value of F(X_A)) because:
* (2): V_1, which is very important, indicates that it belongs strongly to the class +1;
* (3): V_2, which is moderately important, indicates that it belongs strongly to the class +1.
The second interpretation is for the test point D³ using the MLP. The interpretation contains 2 elements: (1) the point belongs to the class −1 with a probability (the CRM score) of 1.00 (the value of F(X_D)) because:
* (2): V_2, which is very important, indicates that it belongs strongly to the class −1.
The inspection of the obtained interpretations (Table II) on all points of Figure 3 shows that the interpretations are consistent whatever the tested model; this is an important advantage of the proposed method. The interpretation method is also usable for other applications: the importance (I) and the influence (I_v) of an input variable being known, the class of an example (a customer in our application of this method) could be changed or reinforced.
V. TRANSPOSITION TO A REAL APPLICATION
A. Introduction to the "Why" and "How" notions
The aim of the transposition detailed in this section is a proof of concept, intended for an Orange™ Business Unit, of the interpretation method presented in this paper. The purpose is to show that the interpretation method can be used in the context of CRM.
The way to improve the customer relationship is described in the following example. A campaign is designed to reduce customer churn. The interpretation of the score (the probability that a customer X_n churns) has to explain (i) "Why" the trained model gives the customer this score and (ii) "How" it is possible to decrease this score.
The "Why" and "How" information is not useful for all customers. Marketers need this information only for the customers on which the campaign will be applied. These customers are selected using their churn probability (high scores) and are named "the target".
Using the "Why" and "How" information, marketers can write a more personalized script to retain customers. The commercial script can be personalized for each customer relationship. In the discussion between the teleoperator and the customer it is rarely possible to influence more than one aspect of this customer (one input variable of the classification model which produces the scores). Therefore only one variable will be kept in the "Why" and "How" interpretations, as described in the next section.
³ For the point D, which belongs to the class −1 (and reciprocally for the point A of the class +1), a low rank of I_v indicates a positive influence on the class −1 and a negative one on the class +1; see Section III-B.
B. Implementation
The "Why" notion uses the definition of I presented in Section III-A. This definition is used here only for the most important variable. This variable describes a "profile" of the customer X_n and we define the "Why" notion by:

Why(X_n | F, p) = argmax_{V_j} [ I(V_j | F, X_n, p) ]    (4)
The computation time of Why(X_n) is in O(KJ). This computation can be simplified if only the V_dj, the numbers of different values of the variables V_j, are considered; in this case the computation time of Why(X_n) is in O(Σ_{j=1}^{J} V_dj). The computation time can exceed a day (since more than one million customers are concerned) and become useless in the CRM-Analytics loop (see Figure 2). To reduce this computation time, variables which have more than 100 different values are discretized using centiles. Therefore a variable now has a maximum of T modalities (T ≤ 100, ∀j). The "Why" notion then uses for S(.) the computation:
S(V_j | F, X_n, p) = Σ_{t=1}^{T} || F_j(X_n) − F_j(X_n; V_tj) ||² P(V_tj)    (5)

where P(V_tj) is the probability of the modality V_tj.
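A minimal Python sketch of equations (4) and (5), under the same assumptions as the earlier sketches (a model_fn returning the score to explain, hypothetical helper names, and modality/probability pairs such as those produced by centile_summary). Note that, for a fixed customer, the variable with the highest rank I is also the one with the highest sensitivity S, so the argmax can be taken directly on S:

import numpy as np

def sensitivity_discretized(model_fn, x, j, modalities, probs):
    """Eq. (5): sensitivity of the output to variable j, computed over the
    T modalities V_tj of the discretized variable, weighted by P(V_tj)."""
    perturbed = np.tile(x, (len(modalities), 1))
    perturbed[:, j] = modalities            # replace component j by each modality V_tj
    f_true = model_fn(x[None, :])[0]        # F_j(X_n)
    f_pert = model_fn(perturbed)            # F_j(X_n; V_tj)
    return float(np.sum(((f_true - f_pert) ** 2) * probs))

def why(model_fn, x, summaries):
    """Eq. (4): index of the most important variable ("Why") for customer x.
    `summaries` is a list of (modalities, probabilities) pairs, one per variable."""
    scores = [sensitivity_discretized(model_fn, x, j, m, p)
              for j, (m, p) in enumerate(summaries)]
    return int(np.argmax(scores))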
The "How" interpretation looks for the values of variables that positively change the score of a customer (a "pull down" value for churn or, vice versa, a "pull up" value for "appetency"). This interpretation is tied to I_v (see equation 3). Here, for the Orange Business Unit application, the "How" is limited to the most favourable variable, such that (F_j(., .) ∈ [0, 1]):

How(X_n | F, p) = argmin_{V_j} [ min_t F_j(X_n; V_tj) ]    (6)

Here the problem is to prevent churn and to find the "worst" variable. Furthermore, variables that cannot be changed, such as sex, birthday or address, are not tested.
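A corresponding minimal sketch of equation (6), under the same assumptions (variables numerically encoded; `actionable` is a hypothetical list of the indices of the variables that are allowed to change):

import numpy as np

def how(model_fn, x, summaries, actionable):
    """Eq. (6): among the actionable variables, find the variable and the
    modality whose substitution yields the lowest attainable (churn) score."""
    best = (None, None, np.inf)
    for j in actionable:                      # sex, birthday, address, ... are excluded
        modalities, _ = summaries[j]
        perturbed = np.tile(x, (len(modalities), 1))
        perturbed[:, j] = modalities          # try every modality V_tj of variable j
        scores = model_fn(perturbed)
        t = int(np.argmin(scores))
        if scores[t] < best[2]:
            best = (j, modalities[t], float(scores[t]))
    return best                               # (variable index, best value, attainable score)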
C. Experiments on Orange scores
Orange scores are calculated with the SAS™, Kxen™ or Khiops™ software (depending on the Business Unit and the country). The results presented here have been obtained with the Kxen software, using a model close to a ridge regression. However, the structure of the model is not used, as explained above in this paper.
For confidentiality reasons, the results of the "Why" and "How" approaches on recent Orange scores are not presented. Only the "Why" information is illustrated, on an older churn model. This model is computed on a table of 100,000 customers. The target is composed of 10% of the customers. The input variables are defined as follows: indicators of telephone usage; flags on the possession of a service or product; indicators on the customer (sex, senior (yes/no), ...); indicators of the customer's environment; indicators of the customer's purchasing behaviour; ...
Why        % of the Target   Usage Indication   Product 1   Product 2   Service 1   Customer   Customer Environment   Customer Behavior   ...
Usage      58%               0.19               0.00        1.04        0.68        0.99       0.99                   0.07                ...
Product 1  17%               2.10               6.77        1.20        1.03        1.23       0.95                   3.05                ...
Product 2  15%               1.85               0.00        0.49        1.15        0.79       1.01                   1.06                ...
Product 3  6%                1.97               0.08        1.16        3.74        0.66       0.99                   1.40                ...
...        ...               ...                ...         ...         ...         ...        ...                    ...                 ...

TABLE III. "WHY" RESULTS
The first column of Table III shows the name of the most important variable in the sense of the definition of equation 4. The second column indicates the percentage of customers for which this variable is the most important. From the third column to the last, the columns give ratios. For example, the cell at the intersection of the "Usage Indication" column and the "Product 1" row gives the ratio between the mean value of the input variable "Usage" for the customers for which the "Product 1" input variable is the most important (in the "Why" sense) and the mean value over the whole population. This cell indicates customers whose mean usage is greater than that of the mean population.
Table III shows a main profile, identified by the "Usage" variable, that contains 58% of the target population. The analysis of the first row of this table indicates (1) for the first column: customers with a weak usage of some services (5 times smaller than the mean population); (2) for the second column: customers with no services or products of type "Product 1"; and so on. Therefore a possible marketing campaign can be built to push service usage or to suggest services better suited to their consumption. The other rows and cells of Table III can be analysed using the same process.
Fifteen models have been tested (for this churn problem) with different numbers of input variables. All tests demonstrate that the approach is useful. The "Why" approach makes it possible to detect profiles among the high scores and to provide a relevant interpretation. The "How" approach seeks the best value that will reinforce (or change) a score.
D. Discussions
The Orange case shows the usefulness of the approach to detect high-score profiles. The interpretation of the profiles is easy since each contains only the most important variable, which characterizes the profile itself.
However, a profile built using only the most important variable is not always the best choice. If all high scores have the same most important variable, the second most sensitive variable has to be considered, and so on. When the model has many input variables the profile can be difficult to analyse. This is another obstacle to the marketing use of the interpretation method.
VI. CONCLUSION
A method to interpret the results of a predictive model has been presented. Experiments were performed on a toy problem using two different models, and on a real application using another model (from a commercial software). The results show that the method behaves very well. This method is currently being industrialized in Orange CRM applications.
Even though the method was elaborated for black-box models, there are still ways to speed up the computation of the sensitivity. The sensitivity analysis of a specific model (e.g. logistic regression) could be accelerated by finding an analytic sensitivity function for the model. For example, the method is exact for the naive Bayes model, which is used in the Khiops software⁴. The proposed method will be added to the Khiops software next year. Future work concerns the extension of the method to obtain an instance selection method.
ACKNOWLEDGMENTS
The authors would like to thank Claude Riwan and the Score Team of Orange France for their contribution to the experimentation of the method presented in this paper.
REFERENCES
[1] Y. Arcadius, J. Akossou, and R. Palm, "Consequences of variable selection on the interpretation of the results in multiple linear regression," Biotechnol. Agron. Soc. Environ., vol. 9, pp. 11-18, 2005.
[2] J. Nakache and J. Confais, Statistique explicative appliquée. TECHNIP, 2003.
[3] J. J. Brennan and L. M. Seiford, "Linear programming and L1 regression: A geometric interpretation," Computational Statistics & Data Analysis, 1987.
[4] S. Thrun, "Extracting rules from artificial neural networks with distributed representations," in Advances in Neural Information Processing Systems, vol. 7, G. Tesauro, D. Touretzky, and T. Leen, Eds. Cambridge, MA: MIT Press, 1995.
[5] J. M. Benitez, J. L. Castro, and I. Requena, "Are artificial neural networks black boxes?" IEEE Transactions on Neural Networks, vol. 8, no. 5, pp. 1156-1164, September 1997.
[6] K. Främling, "Explaining results of neural networks by contextual importance and utility," in AISB, 1996.
[7] R. Féraud and F. Clérot, "A methodology to explain neural network classification," Neural Networks, vol. 15, no. 2, pp. 237-246, 2002.
[8] V. Lemaire and F. Clérot, "An input variable importance definition based on empirical data probability and its use in variable selection," in International Joint Conference on Neural Networks (IJCNN), 2004.
[9] V. Lemaire and R. Féraud, "Driven forward features selection: a comparative study on neural networks," in International Conference on Neural Information Processing, Hong Kong, October 2006.
[10] M. Boullé, "Khiops: a statistical discretization method of continuous attributes," Machine Learning, vol. 55, no. 1, pp. 53-69, 2004.
[11] M. Boullé, "A Bayes optimal approach for partitioning the values of categorical attributes," Journal of Machine Learning Research, 2005.
[12] J. A. Anderson, An Introduction to Neural Networks. MIT Press, 1995.
[13] E. Parzen, "On estimation of a probability density function and mode," Annals of Mathematical Statistics, pp. 1065-1076, 1962.
⁴ http://www.francetelecom.com/en/group/rd/offer/software/applications/providers/khiops.html