Contact Personalization
using a Score Understanding Method
Vincent Lemaire, Raphaël Féraud, Nicolas Voisine
Orange Labs, 2 avenue Pierre Marzin, 22307 Lannion Cedex - France
E-mail: vincent.lemaire@orange-ftgroup.com
Abstract—This paper presents a method to interpret the output of a classification (or regression) model. The interpretation is based on two concepts: the importance of a variable and the importance of the value of that variable. Unlike most state-of-the-art interpretation methods, our approach allows the interpretation of the model output for every instance. Understanding the score given by a model for one instance can, for example, lead to an immediate decision in a Customer Relationship Management (CRM) system. Moreover, the proposed method does not depend on a particular model and is therefore usable for any model or software used to produce the scores.
I. INTRODUCTION
The most elaborate way, in a CRM system, to build knowledge about customers is to produce scores. Scoring tools project quantifiable information onto a given population. The score is an evaluation, for every instance, of a target variable to be explained. The score (the output of a model) is computed using input variables which describe the instances. Scores are then "injected" into the information system (IS), for example to personalize the customer relationship.
Nevertheless, the scores are sometimes not directly usable. For example, if a scoring model identifies a customer interested in churning, the score says nothing about the action needed to avoid the cancellation. To prevent this intention to churn, the fragility of the customer and its causes have to be identified.
We propose to solve this problem by interpreting the classification produced by the model for every instance. To make the industrial implementation of this solution possible, we propose a completely automatic method. The interpretation of the score is delivered for every instance to feed the information system. This knowledge can then be exploited to personalize the information provided in the customer relationship management.
The proposed method is independent of the model used to build the scores. The most powerful model can be used without increasing the difficulty of its interpretation. This interpretation method could thus remove one of the principal difficulties of using models such as Support Vector Machines (SVM), Random Forests (RF) or artificial neural networks (ANN) in marketing services.
II. POSITIONING AND PREVIOUS WORKS
A. Variable importance
The field of machine learning abounds in techniques able to effectively solve regression and/or classification problems. These techniques build a model from a training database made up of a finite number of examples. The built model is used to associate an input vector with an output vector or a class label.
The large number of models (linear regression, ANN, naive Bayes, Random Forest (RF), Parzen window, ...) existing in the literature leads to a large number of interpretation methods, generally specific to each model. The interpretation of the model is often based on: the parameters and the structure of the model [1], statistical tests on the model coefficients [2], geometrical interpretations [3], rules [4] or fuzzy rules [5]. The resulting interpretations are often complex, based on averages (over several individuals), for a given model (ANN, decision tree), or for a given task (regression OR classification).
Another approach consists in analysing the model as a black box with a sensitivity analysis method. In these "What if?" analyses, the structure and the parameters of the model are needed only to compute the output of the model. This independence yields interpretation methods that are valid whatever the model.
To analyze the state-of-the-art approaches in detail, the notations used in the rest of this paper are introduced in Table I.
V_j : an input variable j;
X : a vector of dimension J;
K : the number of training examples;
X_n : an example n;
X_nj : the component j of the vector X_n;
F : the predictive model;
p : the component p of the output vector;
F^p(X) : the output value of the component p of the output vector of the model;
and : F_j^p(a; b) = F^p(a_1, ..., a_{j-1}, b, a_{j+1}, ..., a_J).
TABLE I
NOTATIONS
In this table F_j^p(a; b) denotes the output p of the model when the component j, with value a, is replaced by the value b. The proposed method analyses the outputs of the model one by one. Therefore the simplified notation F_j will be used (instead of F_j^p). All calculations presented in this paper are identical whatever the output p of the model.
Främling [6] introduces a variable importance measure, I, based on sensitivity analysis:

I(V_j|F, X_n, p) = [F_j(X_n, max(V_j)) − F_j(X_n, min(V_j))] / [max(F(X_n), ∀n) − min(F(X_n), ∀n)]

where max(V_j) and min(V_j) denote respectively the maximum and the minimum value of V_j.
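As an illustration (a minimal sketch of our own, not code from [6]; framling_importance, model and X are hypothetical names), this measure can be computed for one example and one variable as follows, assuming model is any scoring function returning F(X) as a scalar:

```python
import numpy as np

def framling_importance(model, X, n, j):
    """Sketch of Framling's importance I(Vj|F, Xn, p) for one example Xn and
    one input variable Vj; model is any scoring function returning F(X)."""
    outputs = np.array([model(x) for x in X])      # F(Xi) for all examples
    v_min, v_max = X[:, j].min(), X[:, j].max()    # extrema of the variable Vj
    x_hi = X[n].copy(); x_hi[j] = v_max            # Xn with Vj set to max(Vj)
    x_lo = X[n].copy(); x_lo[j] = v_min            # Xn with Vj set to min(Vj)
    # [F_j(Xn, max(Vj)) - F_j(Xn, min(Vj))] / [max_n F(Xn) - min_n F(Xn)]
    return (model(x_hi) - model(x_lo)) / (outputs.max() - outputs.min())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 2.0, size=(1000, 2))
    model = lambda x: np.sin(np.pi * x[0])             # non-monotonic: V1 matters
    print(framling_importance(model, X, n=0, j=0))     # yet the measure is close to 0
```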
Fig. 1. "What if" simulation: output values of the model vs. values of V_j (the curve F(X_n) between min(V_j) and max(V_j), with F_j(X_n, min(V_j)), F_j(X_n, max(V_j)), a variation step h, and the global extrema min(F(X_n)) and max(F(X_n)) ∀n marked).
This measure is interesting but can be misleading when F is not monotonic (see Figure 1). In this illustrative example, the variable V_j is important for the model F: according to the value of this variable, an example can be classified in class +1 or −1. However F_j(X_n, max(V_j)) and F_j(X_n, min(V_j)) are close, which leads to underestimating the importance of the variable V_j. Moreover, this method is based on the extrema of the variable and is thus very sensitive to noise.
Another approach is based on the variation of the model output for a variation h of the variable V_j and an example X_n (see Fig. 1). When h tends towards zero, this measure corresponds to the partial derivative of the model with respect to the variable V_j. In this case the measure is local and can give an erroneous importance value: the partial derivative at the point F(X_n) is null for this example whereas the variable V_j is important. When h is larger, as in the previous case, this measure can be misleading when F is not monotonic. The problem is the same when these measures are averaged over all examples.
Féraud et al. [7] propose a method based on the integral of the variations of the model outputs. This measure is well adapted to non-monotonic functions. On the illustrative example (see Fig. 1), this measure is related to the surface under the curve. As this surface is large, the variable V_j is important. The principal drawback of this method is that it does not take into account the distribution of the examples to define the interval of integration.
We propose a variable importance measure based on the integral of the output variations of the model using the probability distributions of the examples. This measure was tested successfully for classification problems in [8] and for regression problems in [9]. This method will be used in this paper as the "variable importance" definition.
B. Variable influence
For a given problem, a subset of relevant variables can be chosen using the variable importance measure. This variable selection increases the robustness of the model and facilitates its interpretation. However, the notion of variable importance, for an instance X_n, is not sufficient to interpret its classification.
One way to complete the interpretation is to analyse the importance of the value of the considered variable V_j on the output value of the model. In Figure 1 the example X_n belongs to the class −1. What does the value of the variable V_j indicate for this example? Is it possible to change its class by modifying the value of V_j? We propose to answer such questions using a measure of the value of a given variable V_j for an example X_n. The importance of the value of a variable will be called its "influence".
To produce an interpretation of the model, Féraud et al. [7] propose to segment the examples and then to characterize each cluster using the variable importances and influences inside every cluster. In this paper the objective is to propose a method which produces, automatically (without human assistance), an interpretation of the score for each example (instead of for each cluster).
Therefore an "influence measure" relative to every example will be proposed in the next section. Among existing methods, the one proposed in [6] by Främling is the closest. But Främling uses extrema and an assumption of monotonic variation of the model output with respect to the variations of the input variable. The proposed "influence" measure is based on the distribution of the examples and is therefore more robust to outliers.
III. METHOD DESCRIPTION
A. Importance of an input variable for an example
Considering¹ the model F, the example X_n, the input variable V_j and the variable to be explained p, the sensitivity of the model S(V_j|F, X_n, p) is defined as the sum of the variations observed on the output p when perturbing the example X_n using the probability distribution of the input variable V_j.
The perturbed output of the model F, for an example X_n, is the model output for this example when the value of the variable V_j has been replaced with its value in an example k. The measured variation, for the example X_n, is then the difference between the "true output" F_j(X_n) and the "perturbed output" F_j(X_n, X_k) of the model.
The sensitivity of the model is then the mean value of ||F_j(X_n) − F_j(X_n; X_k)||² over the probability distribution of the variable V_j. Approximating the variable probability distribution by the empirical distribution of the examples:

S(V_j|F, X_n, p) = Σ_{k=1}^{K} ||F_j(X_n) − F_j(X_n; X_k)||²    (1)
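A minimal sketch of equation (1) follows (our own illustration; sensitivity, model and X are hypothetical names, and model is assumed to be any scoring function returning F(X) as a scalar):

```python
import numpy as np

def sensitivity(model, X, n, j):
    """Sketch of S(Vj|F, Xn, p) of equation (1): sum, over the empirical
    distribution of Vj, of the squared variations of the model output when
    the component j of Xn is replaced by its value in example Xk."""
    x_n = X[n]
    f_true = model(x_n)                       # "true output" F_j(Xn)
    s = 0.0
    for x_k in X:                             # empirical distribution of Vj
        x_pert = x_n.copy()
        x_pert[j] = x_k[j]                    # perturbed example (Xn; Xk)
        s += (f_true - model(x_pert)) ** 2    # squared output variation
    return s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 2.0, size=(1000, 2))
    model = lambda x: np.tanh(3.0 * (x[1] - 1.0))   # arbitrary scorer in (-1, +1)
    print(sensitivity(model, X, n=0, j=0))          # 0.0: V1 does not affect this scorer
    print(sensitivity(model, X, n=0, j=1))          # > 0: V2 does
```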
A sensitivity distribution is obtained by carrying out this sensitivity measurement on the output p for every input variable² V_j. The importance of the variable V_j for the example X_n, I(V_j|F, X_n, p), is then defined as the rank o of the model sensitivity, S(V_j|F, X_n, p), in the sensitivity distribution S(V_j|F, X_i, p) ∀i, j:

I(V_j|F, X_n, p) = P[(S(V_j|F, X_i, p) ∀i, ∀j) ≤ S(V_j|F, X_n, p)] ≥ o    (2)

This measure provides the importance of an input variable for an example relative to all other examples and all other input variables. This relative measure gives relevant information for every instance.

¹ The definitions of I and I_v are presented here for one variable V_j of the input vector of the model and one output p of the output vector. These definitions are the same whatever the considered variables j and p.
² The importance is not intrinsic to one input variable but relative to all variables. The distribution is established over all the input variables and using all the examples.
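Concretely, equation (2) can be read as a percentile rank in the pooled sensitivity distribution. The sketch below is one possible implementation of our own (same hypothetical names as above; the sensitivity of equation (1) is re-declared compactly so the snippet stands alone); in practice the pooled S(.) distribution would be precomputed and memorized, as discussed in section III-C:

```python
import numpy as np

def sensitivity(model, X, n, j):
    """S(Vj|F, Xn, p) of equation (1), written compactly (see the previous sketch)."""
    x_n, f_true = X[n], model(X[n])
    return sum((f_true - model(np.concatenate([x_n[:j], [v], x_n[j + 1:]]))) ** 2
               for v in X[:, j])

def importance(model, X, n, j):
    """Sketch of I(Vj|F, Xn, p) of equation (2): rank (here a percentile in
    [0, 100]) of S(Vj|F, Xn, p) in the pooled distribution S(Vj|F, Xi, p)
    computed for all examples i and all input variables j."""
    K, J = X.shape
    # Expensive step: in practice the pooled S(.) distribution is precomputed
    # and memorized (see section III-C).
    pooled = np.array([sensitivity(model, X, i, jj)
                       for i in range(K) for jj in range(J)])
    return 100.0 * np.mean(pooled <= sensitivity(model, X, n, j))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 2.0, size=(200, 2))        # small sample to keep the sketch fast
    model = lambda x: np.tanh(3.0 * (x[1] - 1.0))
    print(importance(model, X, n=0, j=0), importance(model, X, n=0, j=1))
```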
B. Influence on an example of an input variable value
An input variable can "pull up" (high value) or "pull down" (low value) the model output. For the example X_n, the "natural" value of the output p is by definition F(X_n) (which can also be denoted F_j(X_n, X_n)). The perturbed value considering the input variable V_j is F_j(X_n, X_k). The distribution of F_j(X_n, X_k) represents the "potential" values for the example X_n if its variable V_j were different. The position of the natural value of X_n, F(X_n), within this distribution gives information on the value of V_j (X_nj). The influence of the variable V_j on an example X_n, I_v(V_j|F, X_n, p), is then defined as the rank r of the "natural" model output within the "potential values":

I_v(V_j|F, X_n, p) = P[(F_j(X_n, X_k) ∀k) ≤ F(X_n)] ≥ r.    (3)
For example, for a two-class classification problem (output −1 or +1), a high value of the rank of I_v shows a positive influence on the class +1 and a negative one on the class −1. Reciprocally, a low value of the rank of I_v shows a positive influence on the class −1 and a negative one on the class +1.
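A minimal sketch of equation (3) (our own illustration; influence, model and X are hypothetical names) is:

```python
import numpy as np

def influence(model, X, n, j):
    """Sketch of Iv(Vj|F, Xn, p) of equation (3): rank (percentile in [0, 100])
    of the "natural" output F(Xn) within the "potential" outputs F_j(Xn, Xk)
    obtained by replacing the value of Vj in Xn by its value in every Xk."""
    x_n = X[n]
    f_natural = model(x_n)                    # F(Xn), i.e. F_j(Xn, Xn)
    potential = []
    for x_k in X:
        x = x_n.copy()
        x[j] = x_k[j]
        potential.append(model(x))            # F_j(Xn, Xk)
    return 100.0 * np.mean(np.array(potential) <= f_natural)

# For a two-class problem with outputs in [-1, +1], a high rank suggests a
# positive influence on the class +1 and a low rank on the class -1.
```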
C. Automation of the interpretation: discussion
In business applications of CRM, scores identify the customers most likely to react positively to a marketing campaign. For example, rather than sending a mail to all its customers to offer a product, a company will prefer to target the subset of its customers having the most "appetency" for the product. The marketing campaign will be less expensive, and the customers who are not interested in the product will have a lower probability of receiving the product's advertising in their post-box (or mailbox).
The score interpretation brings additional information to improve the effectiveness of marketing campaigns. The score understanding provides means to support and personalize the commercial action. For example, if a customer is identified as fragile because he wishes to renew his mobile phone, the telecommunication company will be able to react by proposing a subscription with a reduction on the purchase price of a mobile phone. If the fragility of another customer corresponds to an under-use of his "pay monthly plan", the company will be able to propose a better adapted plan.
In our system (see Figure 2), scores and score interpretations are evaluated in the deployment phase. The identifiers of the customers having the highest scores and the corresponding interpretations are sent to the CRM system. This system uses the score understanding to personalize customer relationships.
The method proposed in this paper analyses the sensitivity of the model output p considering each input variable independently.
Fig. 2. Application architecture: a model is built from the customers database and a target choice (modelisation); in the deployment phase, scores and interpretations are produced and feed the contact customization for the target population.
The different steps needed to obtain the score understanding can require a long computation time. To speed up this computation, two solutions are possible. The first solution extracts "an abstract" of each input variable, using for example the method presented in [10] or centile information for continuous variables, and the method presented in [11] for categorical variables. The second one consists in memorising the S(.) distribution.
IV. ILLUSTRATION ON A TOY EXAMPLE
A. Toy example
A toy example has been constructed to test and observe the model interpretation method proposed in this paper. This toy example is presented in Figure 3. In this figure the class −1 is in black and the class +1 is in gray. Figure 4 illustrates the "a priori" influence zones of the two dimensions: (1) areas of points A and C: examples where both V_1 and V_2 influence the class; (2) area of point B: examples where only V_1 influences the class; (3) areas of points D and F: examples where only V_2 influences the class; and (4) area of point E: examples where neither dimension influences the class.
Data: 1000 examples for the training set and 1000 for the test set were randomly drawn (V_1 ∈ [0:2], V_2 ∈ [0:2]).
Models: two types of model were tested on this toy example: (1) a neural network [12] (NN) using one hidden layer, a sigmoid activation function, the standard back-propagation algorithm (stochastic version) and the squared error as cost function; using a cross-validation procedure, the number of hidden units was fixed to 4; (2) a Parzen window [13] (PW) using a Gaussian kernel and the L2 norm:

P(y_i|X_n) = Σ_{k: y_k = y_i} K(X_n, X_k) / Σ_k K(X_n, X_k), where K(X_n, X_k) = exp(−||X_n − X_k||² / (2σ²)).

The parameter σ was fixed to 0.1 using a cross-validation procedure. Whatever the model, the data were standardized before training.

Fig. 3. Toy example: two classes (class −1 in black, class +1 in gray, V_1 and V_2 in [0:2]).
Fig. 4. Influence zones (points A to F in the (V_1, V_2) plane).
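As a hedged illustration of this toy setting (the labels below are arbitrary and do not reproduce the exact classes of Figure 3; parzen_score is a hypothetical name), the Parzen window scorer can be sketched as:

```python
import numpy as np

def parzen_score(X_train, y_train, x, sigma=0.1, target_class=1):
    """Sketch of the Parzen window scorer: P(y = target_class | x) estimated as
    sum_{k: y_k = target_class} K(x, Xk) / sum_k K(x, Xk) with a Gaussian
    kernel K(x, Xk) = exp(-||x - Xk||^2 / (2 sigma^2))."""
    d2 = np.sum((X_train - x) ** 2, axis=1)       # squared L2 distances
    k = np.exp(-d2 / (2.0 * sigma ** 2))          # Gaussian kernel values
    return np.sum(k[y_train == target_class]) / np.sum(k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 2.0, size=(1000, 2))          # V1, V2 in [0, 2] as in the toy example
    y = np.where(X[:, 0] + X[:, 1] > 2.0, 1, -1)       # arbitrary labels, NOT the classes of Fig. 3
    f = lambda x: 2.0 * parzen_score(X, y, x) - 1.0    # mapped to [-1, +1] to play the role of F(Xn)
    print(f(np.array([1.75, 0.25])))                   # coordinates of point F (value differs from Table II)
```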
B. Construction of the elements of the interpretation
Among the 1000 test examples, 6 examples representative of the influence zones of variables V_1 and V_2 were selected to illustrate the method. Their locations are indicated in Figure 4 and they are named from A to F: A(0.25, 1.50), B(1.00, 1.50), C(1.75, 1.50), D(0.25, 0.25), E(1.00, 0.25), F(1.75, 0.25).
The interpretation of these 6 examples requires the following steps (for n ∈ {A, B, C, D, E, F}; a code sketch of these steps is given after the list):
• for I(V_j|F, X_n, p):
(1.1) calculation of S(V_j|F, X_i, p) ∀j, ∀i;
(1.2) sorting of S(.);
(1.3) determination of the rank of S(V_j|F, X_n, p);
• for I_v(V_j|F, X_n, p):
(2.1) calculation of F(X_n, X_k) ∀k;
(2.2) sorting of F(.);
(2.3) determination of the rank of F(X_n).
C. Results and discussion
Figure 5 shows the sensitivity distribution (S(.), equation 1) obtained for V_1 using the NN and the PW on the training set. The x-coordinate represents a sensitivity value and the y-coordinate its corresponding rank in the distribution. The sensitivity rank progresses by stages (the result, not presented here, is the same for V_2). The sensitivity distributions are made of a few important modalities related to the considered classification problem and to the models used. These distributions concatenate the effect of individual sensitivities and influence zones: zones where the input variables have no interest, zones where they have a high interest, and transitory zones.
Figure 6 presents the distributions of "potential" outputs for the test point F and both input variables V_1, V_2, using the NN. The distribution obtained using the input variable V_1 has only one mode: F(X_n, X_k) = −1.0 ∀k. This result is consistent since this variable has no influence for the example F. The distribution obtained using the input variable V_2 has 3 modes: F(X_n, X_k) = −1, −1 ≤ F(X_n, X_k) ≤ +1, and F(X_n, X_k) = +1.
Fig. 5. Ordered sensitivity distribution for V_1 (MLP and Parzen window): sensitivity value (x-axis, 0 to 3) vs. rank (y-axis, 0 to 100).
Fig. 6. Ordered "potential" outputs for the test point F and V_1, V_2 using the MLP: output value (x-axis, −1 to +1) vs. rank (y-axis, 0 to 100).
Figures 5 and 6 show that it could be interesting to use a rank range instead of a single rank. Quintiles, Q_1, Q_2, Q_3, Q_4 and Q_5, will now be used with the respective labels "Very weak", "Weak", "Average", "Strong", "Very strong". Each rank belongs to one of these quintiles (value of Q in Table II) and therefore has the corresponding label. The joint observation of Table II, Figure 5 and Figure 6 shows a total coherence in the obtained results.
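A small helper of our own (hypothetical name quintile_label) showing the assumed mapping from a percentile rank to the quintile labels of Table II:

```python
def quintile_label(rank_percent):
    """Map a rank expressed as a percentile in [0, 100] to the quintile labels
    used in Table II (assumed mapping: 0-19 -> Q1, ..., 80-100 -> Q5)."""
    labels = ["Very weak", "Weak", "Average", "Strong", "Very strong"]
    q = min(int(rank_percent // 20), 4)
    return "Q%d (%s)" % (q + 1, labels[q])

# Example: the MLP row (V1, XA) of Table II has o = 63, which falls in Q4.
print(quintile_label(63))   # -> "Q4 (Strong)"
```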
The influence of an input variable (I_v) also has to be evaluated in conjunction with the variable importance (I). If I = 0, the corresponding I_v is unimportant. Variables with a small I should not be used in the interpretation. In this case the interpretation has to be based only on the important variables (in these cases the value of I_v is not presented in Table II).
Interpretation using the MLP
V_j, X_n   S     I           F(X_n)   I_v
V_1, X_A   1.24  Q_4 (o=63)  +1.00    Q_5 (r=99)
V_2, X_A   0.96  Q_3 (o=49)  +1.00    Q_5 (r=99)
V_1, X_B   2.70  Q_5 (o=89)  -1.00    Q_1 (r=14)
V_2, X_B   0.00  -           -        -
V_1, X_C   1.24  Q_4 (o=63)  +1.00    Q_5 (r=99)
V_2, X_C   0.93  Q_2 (o=31)  +1.00    Q_5 (r=99)
V_1, X_D   0.00  -           -        -
V_2, X_D   3.03  Q_5 (o=95)  -1.00    Q_2 (r=22)
V_1, X_E   0.00  -           -        -
V_2, X_E   0.00  -           -        -
V_1, X_F   0.00  -           -        -
V_2, X_F   3.05  Q_5 (o=98)  -1.00    Q_2 (r=21)

Interpretation using the Parzen window
V_j, X_n   S     I           F(X_n)   I_v
V_1, X_A   1.16  Q_4 (o=63)  +0.99    Q_4 (r=74)
V_2, X_A   0.97  Q_3 (o=53)  +0.99    Q_4 (r=74)
V_1, X_B   2.28  Q_5 (o=89)  -0.99    Q_2 (r=25)
V_2, X_B   0.00  -           -        -
V_1, X_C   1.16  Q_4 (o=63)  +0.99    Q_4 (r=75)
V_2, X_C   0.90  Q_2 (o=35)  +0.99    Q_4 (r=67)
V_1, X_D   0.00  -           -        -
V_2, X_D   2.96  Q_5 (o=96)  -0.99    Q_1 (r=12)
V_1, X_E   0.00  -           -        -
V_2, X_E   0.00  -           -        -
V_1, X_F   0.00  -           -        -
V_2, X_F   3.02  Q_5 (o=90)  -0.99    Q_1 (r=12)

TABLE II
INTERPRETATION OF THE 6 TEST POINTS
D. Two examples of obtained interpretations
Two interpretations using Table II are presented here. The first interpretation is for the test point A using the Parzen window. The interpretation contains 3 elements: (1) the point belongs to the class +1 with a probability (the CRM score) of 0.99 (the value of F(X_A)) because:
* (2): V_1, which is very important, indicates that it belongs strongly to the class +1;
* (3): V_2, which is moderately important, indicates that it belongs strongly to the class +1.
The second interpretation is for the test point D³ using the MLP. The interpretation contains 2 elements: (1) the point belongs to the class −1 with a probability (the CRM score) of 1.00 (the value of F(X_D)) because:
* (2): V_2, which is very important, indicates that it belongs strongly to the class −1.
The inspection of the interpretations obtained, Table II, on all the points of Figure 3 shows that the interpretations are consistent whatever the tested model; this is an important advantage of the proposed method. The interpretation method is also usable for other applications: the importance (I) and the influence (I_v) of an input variable being known, the class of an example (a customer in our application of this method) could be changed or reinforced.
V. TRANSPOSITION TO A REAL APPLICATION
A. Introduction to the “Why” and “How” notions
The aim of the transposition detailed in this section is a proof of concept, intended for an Orange™ Business Unit, of the interpretation method presented in this paper. The purpose is to show that the interpretation method can be used in the context of CRM.
The way to improve the customer relationship is described in the following example. A campaign is designed to reduce customers' churn. The interpretation of the score (the probability that a customer X_n churns) has to explain (i) "Why" the trained model gives this score to the customer and (ii) "How" it is possible to decrease this score.
The "Why" and "How" information is not useful for all customers. Marketers need this information only for the customers to which the campaign will be applied. These customers are selected using their churn probability (high scores) and are named "the target".
Using the "Why" and "How" information, marketers will write a more personalized script to retain customers. The commercial script can be personalized for each customer relationship. In the discussion between the teleoperator and the customer, it is rarely possible to influence more than one aspect of the customer (one input variable of the classification model which produces the scores). Therefore only one variable will be kept in the "Why" and "How" interpretations, as described in the next section.
³ For the point D, which belongs to the class −1 (and reciprocally for the point A of the class +1), a low rank of I_v indicates a positive influence on the class −1 and a negative one on the class +1; see section III-B.
B. Implementation
The Why notion uses the definition of I presented in section III-A. This definition is used, here, only for the most important variable. This variable describes a "profile" of the customer X_n and we define the Why notion by:

Why(X_n|F, p) = argmax_{V_j} [I(V_j|F, X_n, p)]    (4)
The computation time of Why(X_n) is in O(KJ). This computation can be simplified if, for each variable V_j, only its V_dj different values are considered. In this case the computation time of Why(X_n) is in O(Σ_{j=1}^{J} V_dj). The computation time can exceed a day (since more than one million customers are concerned) and becomes useless in the CRM-Analytics loop (see Figure 2). To reduce this computation time, variables which have more than 100 different values are discretized using centiles. Therefore a variable now has at most T modalities (T ≤ 100, ∀j). The Why notion then uses for S(.) the computation:

S(V_j|F, X_n, p) = Σ_{t=1}^{T} ||F_j(X_n) − F_j(X_n; V_tj)||² P(V_tj)    (5)

where P(V_tj) is the probability of V_tj.
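A minimal sketch of equations (4) and (5) follows (our own illustration; why, model, modalities and probas are hypothetical names, and the modalities are assumed to come from a centile discretization as described above):

```python
import numpy as np

def why(model, x_n, modalities, probas):
    """Sketch of Why(Xn|F, p) of equations (4) and (5): returns the index of the
    most important variable for the customer x_n (a 1-D numpy array).

    modalities[j] : representative values V_tj of variable j (e.g. centiles),
    probas[j]     : their probabilities P(V_tj)."""
    f_true = model(x_n)
    best_j, best_s = None, -1.0
    for j in range(len(x_n)):
        s = 0.0
        for v, p in zip(modalities[j], probas[j]):   # at most T modalities
            x = x_n.copy()
            x[j] = v
            s += (f_true - model(x)) ** 2 * p        # equation (5)
        if s > best_s:                               # argmax of equation (4)
            best_j, best_s = j, s
    return best_j, best_s

# The modalities can be built from the training data, e.g. for a continuous
# variable j: centiles = np.percentile(X_train[:, j], range(1, 101)), with a
# uniform probability over the retained (distinct) centile values.
```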
The "How" interpretation looks for the values of the variables that positively change the score of a customer ("pull down" the value for churn or, vice versa, "pull up" the value for appetency). This interpretation is tied to I_v (see equation 3). Here, for the Orange Business Unit application, the "How" is limited to the most positive variable, such that (F_j(., .) ∈ [0:1]):

How(X_n|F, p) = argmin_{V_j} [ argmin_t [F_j(X_n, V_tj)] ]    (6)

Here the problem is to prevent churn and to find the "worst" variable. Furthermore, variables that cannot be changed, such as sex, birthday or address, are not tested.
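A minimal sketch of equation (6) follows (our own illustration; how, model, modalities and frozen are hypothetical names, with frozen holding the indices of the variables that cannot be acted upon):

```python
def how(model, x_n, modalities, frozen=()):
    """Sketch of How(Xn|F, p) of equation (6): the variable and the modality
    that pull the score down the most, assuming F_j(., .) in [0, 1].

    modalities[j] : candidate values V_tj of variable j (e.g. centiles),
    frozen        : indices of variables that cannot be acted upon
                    (sex, birthday, address, ...) and are therefore not tested."""
    best = None                                  # (score, variable index, value)
    for j in range(len(x_n)):
        if j in frozen:
            continue
        for v in modalities[j]:
            x = x_n.copy()
            x[j] = v
            score = model(x)                     # F_j(Xn, V_tj)
            if best is None or score < best[0]:  # inner and outer argmin of equation (6)
                best = (score, j, v)
    return best
```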
C. Experiments on Orange scores
Orange scores are calculated with the SAS™, Kxen™ or Khiops™ software (depending on the Business Unit and the country). The results presented here have been obtained using the Kxen software, which uses a model close to a ridge regression. However, the structure of the model is not used, as explained above in this paper.
For confidentiality reasons, results of the "Why" and "How" approaches on recent Orange scores are not presented. Only the "Why" information is illustrated, on an older churn model. This model is computed on a table of 100000 customers. The target is composed of 10% of the customers. The input variables are defined as follows: indicators of telephone usage; flags on the possession of a service or product; indicators on the customer (sex, senior (yes/no), ...); indicators of the customer environment; indicators of customer purchasing behaviour; ...
Why        % of the Target   Usage   Product 1   Product 2   Service 1   Customer Indication   Customer Environment   Customer Behavior   ...
Usage      58%               0.19    0.00        1.04        0.68        0.99                  0.99                   0.07                ...
Product 1  17%               2.10    6.77        1.20        1.03        1.23                  0.95                   3.05                ...
Product 2  15%               1.85    0.00        0.49        1.15        0.79                  1.01                   1.06                ...
Product 3   6%               1.97    0.08        1.16        3.74        0.66                  0.99                   1.40                ...
...        ...               ...     ...         ...         ...         ...                   ...                    ...                 ...
TABLE III
"WHY" RESULTS
The first column of Table III gives the name of the most important variable in the sense of equation 4. The second column indicates the percentage of target customers for which this variable is the most important. From the third column to the last, the columns give ratios: for example, the cell at the intersection of the "Usage" column and the "Product 1" line gives the ratio between the mean value of the input variable "Usage" over the customers for which the "Product 1" input variable is the most important (in the "Why" sense) and its mean value over the whole population. A ratio greater than 1 therefore indicates customers whose mean is greater than the population mean.
Table III shows a main profile, identified by the "Usage" variable, that contains 58% of the target population. The analysis of the first line of this table indicates (1) for the first column: customers with a weak usage of some services (5 times smaller than the population mean); (2) for the second column: customers with no service or product of type "Product 1"; and so on. A possible marketing campaign can therefore be built to push service usage or to suggest services adequate for their consumption. The other lines and cells of Table III can be analysed using the same process.
Fifteen models have been tested (for this churn problem) with different numbers of input variables. All tests demonstrate that the approach is useful. The "Why" approach allows the detection of profiles among the high scores and provides a relevant interpretation. The "How" approach seeks the best value that would allow a score to be reinforced (or changed).
D. Discussions
The Orange case shows the usefulness of the approach to detect high-score profiles. The profile interpretation is easy since it contains only the most important variable, which characterizes the profile itself.
However, a profile built using only the most important variable is not always the best choice. If all high scores have the same most important variable, the second most sensitive variable has to be considered, and so on. When the model has a lot of input variables, the profile can be difficult to analyse. This is another obstacle to the marketing use of the interpretation method.
VI. CONCLUSION
A method to interpret the results of a predictive model has been presented. Experiments were performed on a toy problem using two different models, and on a real application using another model (from a commercial software). The results show the good and consistent behavior of the method. At the moment this method is being industrialized in Orange CRM applications.
Even if the method was elaborated for black-box models, there are still ways to speed up the computation of the sensitivity. The sensitivity analysis of a specific model (e.g. logistic regression) could be accelerated by finding an analytic sensitivity function for the model. For example, the method is exact for the naive Bayes model which is used in the Khiops software⁴. The proposed method will be added to the Khiops software next year. Future work concerns the extension of the method to obtain an instance selection method.
ACKNOWLEDGMENTS
The authors would like to thank Claude Riwan and the Score Team of Orange France for their contribution to the experimentation of the method presented in this paper.
REFERENCES
[1] Y. Arcadius, J. Akossou, and R. Palm, "Consequences of variable selection on the interpretation of the results in multiple linear regression," Biotechnol. Agron. Soc. Environ., vol. 9, pp. 11-18, 2005.
[2] J. Nakache and J. Confais, Statistique explicative appliquée. TECHNIP, 2003.
[3] J. J. Brennan and L. M. Seiford, "Linear programming and L1 regression: a geometric interpretation," Computational Statistics & Data Analysis, 1987.
[4] S. Thrun, "Extracting rules from artificial neural networks with distributed representations," in Advances in Neural Information Processing Systems 7, G. Tesauro, D. Touretzky, and T. Leen, Eds. Cambridge, MA: MIT Press, 1995.
[5] J. M. Benitez, J. L. Castro, and I. Requena, "Are artificial neural networks black boxes?" IEEE Transactions on Neural Networks, vol. 8, no. 5, pp. 1156-1164, September 1997.
[6] K. Främling, "Explaining results of neural networks by contextual importance and utility," in AISB, 1996.
[7] R. Féraud and F. Clérot, "A methodology to explain neural network classification," Neural Networks, vol. 15, no. 2, pp. 237-246, 2002.
[8] V. Lemaire and C. Clérot, "An input variable importance definition based on empirical data probability and its use in variable selection," in International Joint Conference on Neural Networks (IJCNN), 2004.
[9] V. Lemaire and R. Féraud, "Driven forward features selection: a comparative study on neural networks," in International Conference on Neural Information Processing, Hong Kong, October 2006.
[10] M. Boullé, "Khiops: a statistical discretization method of continuous attributes," Machine Learning, vol. 55, no. 1, pp. 53-69, 2004.
[11] M. Boullé, "A Bayes optimal approach for partitioning the values of categorical attributes," Journal of Machine Learning Research, 2005.
[12] J. A. Anderson, An Introduction to Neural Networks. MIT Press, 1995.
[13] E. Parzen, "On estimation of a probability density function and mode," Annals of Mathematical Statistics, pp. 1065-1076, 1962.
⁴ http://www.francetelecom.com/en/group/rd/offer/software/applications/providers/khiops.html