
Contact Personalization using a Score Understanding Method

Vincent Lemaire, Raphaël Féraud, Nicolas Voisine

Orange Labs, 2 avenue Pierre Marzin, 22307 Lannion Cedex - France

E-mail: vincent.lemaire@orange-ftgroup.com

Abstract—This paper presents a method to interpret the output of a classification (or regression) model. The interpretation is based on two concepts: the importance of a variable and the importance of the value of that variable. Unlike most state-of-the-art interpretation methods, our approach allows the model output to be interpreted for every instance. Understanding the score given by a model for one instance can, for example, lead to an immediate decision in a Customer Relationship Management (CRM) system. Moreover, the proposed method does not depend on a particular model and is therefore usable with any model or software used to produce the scores.

I. INTRODUCTION

In a CRM system, the most elaborate way to build knowledge about customers is to produce scores. Scoring tools project quantifiable information onto a given population. A score is an evaluation, for every instance, of a target variable to be explained. The score (the output of a model) is computed from the input variables that describe the instances. Scores are then “injected” into the information system (IS), for example to personalize the customer relationship.

Nevertheless, scores are sometimes not directly usable. For example, if a scoring model identifies a customer who is likely to churn, the score says nothing about the action needed to prevent the cancellation. To counter this intention to churn, the fragility of the customer and its causes have to be identified.

We propose to solve this problem by interpreting the classification produced by the model for every instance.

To make an industrial implementation of this solution possible, we propose a completely automatic method. The interpretation of the score is delivered for every instance to feed the information system. This knowledge can then be exploited to provide personalized information in customer relationship management.

The proposed method is independent of the model used to build the scores. The most powerful model can be used without making its interpretation more difficult. This interpretation method could thus remove one of the principal difficulties in using models such as Support Vector Machines (SVM), Random Forests (RF) or artificial neural networks (ANN) in marketing services.

II. POSITIONING AND PREVIOUS WORKS

A. Variable importance

The field of machine learning abounds in techniques that effectively solve regression and/or classification problems. These techniques build a model from a training database made up of a finite number of examples. The resulting model is used to associate an input vector with an output vector or a class label.

The large number of models available in the literature (linear regression, ANN, naive Bayes, Random Forest (RF), Parzen window, ...) has led to a number of interpretation methods, generally specific to each model. The interpretation of a model is often based on: the parameters and the structure of the model [1], statistical tests on the coefficients of the model [2], geometrical interpretations [3], rules [4] or fuzzy rules [5]. The resulting interpretations are often complex, based on averages (over several individuals), and tied to a given model (ANN, decision tree) or to a given task (regression or classification).

Another approach consists in analysing the model as a black box with a sensitivity analysis method. In these “What if?” analyses, the structure and the parameters of the model are needed only to compute the output of the model. This independence yields interpretation methods that are valid whatever the model.

To analyze the state-of-the-art approaches in detail, the notations used in the rest of this paper are introduced in Table I.

$V_j$ : an input variable $j$;
$X$ : a vector of dimension $J$;
$K$ : the number of training examples;
$X_n$ : an example $n$;
$X_{nj}$ : the component $j$ of the vector $X_n$;
$F$ : the predictive model;
$p$ : the component $p$ of the output vector;
$F^p(X)$ : the output value of the component $p$ of the output vector of the model;
and: $F_j^p(a; b) = F^p(a_1, \ldots, a_{j-1}, b, a_{j+1}, \ldots, a_J)$;

TABLE I
NOTATIONS

In this table, $F_j^p(a; b)$ denotes the output $p$ of the model when the component $j$ of the vector $a$ is replaced by the value $b$. The proposed method analyses the outputs of the model one by one. Therefore the simplified notation $F_j$ will be used (instead of $F_j^p$). All calculations presented in this paper are identical whatever the output $p$ of the model.

Främling [6] introduces a variable importance measure, $I$, based on sensitivity analysis:

$$I(V_j|F, X_n, p) = \frac{F_j(X_n, \max(V_j)) - F_j(X_n, \min(V_j))}{\max_n F(X_n) - \min_n F(X_n)}$$

where $\max(V_j)$ and $\min(V_j)$ denote respectively the maximum and the minimum value of $V_j$.

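[Figure 1: curve of the model output, between $-1$ and $+1$, as a function of $V_j$ ranging from $\min(V_j)$ to $\max(V_j)$. The plot is annotated with $F_j(X_n, \min(V_j))$, $F_j(X_n, \max(V_j))$, $F(X_n)$, $\max_n F(X_n)$, $\min_n F(X_n)$, a variation $h$ and a region labelled $S$.]

Fig. 1. What if simulation: Output values of the model vs. values of $V_j$.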

This measurement is interesting but can be misleading when $F$ is not monotonic (see Figure 1). In this illustrative example, the variable $V_j$ is important for the model $F$: depending on the value of this variable, an example can be classified in class $+1$ or $-1$. However, $F_j(X_n, \max(V_j))$ and $F_j(X_n, \min(V_j))$ are close, which leads to underestimating the importance of the variable $V_j$. Moreover, this method is based on the extreme values of the variable and is thus very sensitive to noise.
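The underestimation described above can be reproduced numerically. The following minimal Python sketch applies the measure of [6] to a hypothetical non-monotonic model; `bump_model`, the grid of examples and all function names are illustrative assumptions and are not taken from the original experiments.

```python
def framling_importance(model, x, j, values_j, outputs_all):
    """Importance of variable j for example x, following the ratio of [6].

    model       : callable taking a list of floats and returning a float
    x           : the example (list of floats)
    values_j    : observed values of variable j (gives min(V_j), max(V_j))
    outputs_all : model outputs on all examples (gives the output range)
    """
    def perturbed(b):
        # F_j(x; b): output when component j of x is replaced by b
        z = list(x)
        z[j] = b
        return model(z)

    numerator = perturbed(max(values_j)) - perturbed(min(values_j))
    denominator = max(outputs_all) - min(outputs_all)
    return numerator / denominator


# Hypothetical non-monotonic model: the class depends only on variable 0,
# it is +1 in the middle of [0, 2] and -1 near both ends.
def bump_model(x):
    return 1.0 if 0.5 <= x[0] <= 1.5 else -1.0


examples = [[0.1 * i, 1.0] for i in range(21)]  # variable 0 spans [0, 2]
outputs = [bump_model(e) for e in examples]
values_0 = [e[0] for e in examples]

importance = framling_importance(bump_model, [1.0, 1.0], 0, values_0, outputs)
```

In this sketch the outputs at both extreme values of variable 0 are equal to $-1$, so `importance` is 0 although the class depends entirely on that variable.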

Another approach is based on the variation of the model output for a variation $h$ of the variable $V_j$ and an example $X_n$ (see Fig. 1). When $h$ tends towards zero, this measurement corresponds to the partial derivative of the model with respect to the variable $V_j$. In this case, the measurement is local and can give an erroneous importance: in this example the partial derivative at the point $X_n$ is zero whereas the variable $V_j$ is important. When $h$ is larger, as in the previous case, this measurement can be misleading when $F$ is not monotonic. The problem is the same when these measurements are averaged over all examples.

Féraud et al. [7] propose a method based on the integral of the variations of the model outputs. This measurement is well adapted to non-monotonic functions. In the illustrative example (see Fig. 1), this measurement is related to the area under the curve. As this area is large, the variable $V_j$ is important. The principal drawback of this method is that it does not take the distribution of the examples into account when defining the interval of integration.

We propose a variable importance measurement based on the integral of the output variations of the model, using the probability distributions of the examples. This measurement was tested successfully on classification problems in [8] and on regression problems in [9]. It is used in this paper as the definition of “variable importance”.

B. Variable influence

For a given problem, a subset of relevant variables can be chosen using the variable importance measurement. This variable selection increases the robustness of the model and facilitates its interpretation. However, the notion of variable importance, for an instance $X_n$, is not sufficient to interpret its classification.

One way to complete the interpretation is to analyse the importance of the value of the considered variable $V_j$ for the output value of the model. In Figure 1 the example $X_n$ belongs to the class $-1$. What does the value of the variable $V_j$ indicate for this example? Is it possible to change its class by modifying the value of $V_j$? We propose to answer such questions using a measurement of the importance of the value of a given variable $V_j$ for an example $X_n$. The importance of the value of a variable will be called its “influence”.

To produce an interpretation of the model, Féraud et al. [7] propose to segment the examples and then to characterize each cluster using the importances and influences of the variables inside that cluster. In this paper the objective is to propose a method which produces automatically (without human assistance) an interpretation of the score for each example (instead of for each cluster).

Therefore an “influence measurement” relative to every example will be proposed in the next section. Among existing methods, the one proposed by Främling in [6] is the closest. But Främling uses extreme values and assumes that the model output varies monotonically with the input variable. The proposed “influence” measure is based on the distribution of the examples and is therefore more robust to outliers.

III. METHOD DESCRIPTION

A. Importance of an input variable for an example

Considering¹ the model $F$, the example $X_n$, the input variable $V_j$ and the variable to be explained $p$, the sensitivity of the model $S(V_j|F, X_n, p)$ is defined as the sum of the variations observed on the output $p$ when perturbing the example $X_n$ using the probability distribution of the input variable $V_j$.

The perturbed output of the model $F$ for an example $X_n$ is the model output for this example after replacing the value of the variable $V_j$ with its value for an example $k$. The measured variation, for the example $X_n$, is then the difference between the “true output” $F_j(X_n)$ and the “perturbed output” $F_j(X_n, X_k)$ of the model.

The sensitivity of the model is then the mean value of $\|F_j(X_n) - F_j(X_n, X_k)\|^2$ under the probability distribution of the variable $V_j$. Approximating this probability distribution by the empirical distribution of the examples gives:

$$S(V_j|F, X_n, p) = \sum_{k=1}^{K} \|F_j(X_n) - F_j(X_n; X_k)\|^2 \qquad (1)$$

The sum in (1) differs from the empirical mean only by the constant factor $1/K$, which does not affect the ranks used below.

A sensitivity distribution is obtained by carrying out this sensitivity measurement on the output $p$ for every example and every input variable² $V_j$. The importance of the variable $V_j$ for the example $X_n$, $I(V_j|F, X_n, p)$, is then defined as the rank $o$ of the model sensitivity $S(V_j|F, X_n, p)$ in the sensitivity distribution $S(V_j|F, X_i, p)\ \forall i, j$:

$$I(V_j|F, X_n, p) = P\left[\left(S(V_j|F, X_i, p)\ \forall i, \forall j\right) \le S(V_j|F, X_n, p)\right] \ge o \qquad (2)$$

¹ Definitions $I$ and $I_v$ are presented here for one variable $V_j$ of the input vector of the model and one output $p$ of the output vector. These definitions are the same whatever the considered variables $j$ and $p$.

² The importance is not intrinsic to one input variable but to all variables. The distribution is established for all the input variables and using all the examples.

This measurement provides the importance of an input variable for an example relative to all other examples and all other input variables. This relative measurement gives relevant information for every instance.
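The two definitions of this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the sensitivity is computed as a mean (equation (1) divided by $K$, which leaves ranks unchanged), the rank is expressed as a percentage of the pooled distribution, and the names `perturb`, `sensitivity`, `importance_ranks` and the toy model are hypothetical.

```python
from bisect import bisect_right


def perturb(x, j, b):
    """Return a copy of x whose component j is replaced by b."""
    z = list(x)
    z[j] = b
    return z


def sensitivity(model, x, j, examples):
    """Mean squared variation of the output when x[j] takes the value
    observed on each example (equation (1) divided by K)."""
    base = model(x)
    total = sum((base - model(perturb(x, j, e[j]))) ** 2 for e in examples)
    return total / len(examples)


def importance_ranks(model, examples):
    """Compute S for every (example, variable) pair and return it together
    with a function giving the percent rank of S[n][j] in the pooled
    distribution (an empirical reading of equation (2))."""
    n_vars = len(examples[0])
    S = [[sensitivity(model, x, j, examples) for j in range(n_vars)]
         for x in examples]
    pooled = sorted(s for row in S for s in row)

    def importance(n, j):
        # share of pooled sensitivities that are <= S[n][j], in percent
        return 100.0 * bisect_right(pooled, S[n][j]) / len(pooled)

    return S, importance


# Hypothetical model: the class depends on variable 0 only.
def toy_model(x):
    return 1.0 if x[0] > 1.0 else -1.0


toy_examples = [[0.5, 0.2], [1.5, 0.8], [0.2, 1.9], [1.8, 1.1]]
S, importance = importance_ranks(toy_model, toy_examples)
```

With this toy data, variable 1 never changes the output, so its sensitivity is 0 for every example, while variable 0 flips the class for half of the substituted values. Sorting the pooled sensitivities once and reusing them for every lookup corresponds to steps of the kind "sort, then determine the rank".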

B. Influence on an example of an input variable value

An input variable can “pull up” (high value) or “pull down” (low value) the model output. For the example $X_n$, the “natural” value of the model output $p$ is by definition $F(X_n)$ (which can also be denoted by $F_j(X_n, X_n)$). The perturbed value considering the input variable $V_j$ is $F_j(X_n, X_k)$. The distribution of $F_j(X_n, X_k)$ represents the “potential” values for the example $X_n$ if its variable $V_j$ were different. The position of the natural value of $X_n$, $F(X_n)$, within this distribution gives information on the value of $V_j$, $X_{nj}$.

The influence of the variable $V_j$ on an example $X_n$, $I_v(V_j|F, X_n, p)$, is then defined as the rank $r$ of the “natural” model output within the “potential” values:

$$I_v(V_j|F, X_n, p) = P\left[\left(F_j(X_n, X_k)\ \forall k\right) \le F(X_n)\right] \ge r \qquad (3)$$

For example, for a two-class classification problem (output $-1$ or $+1$), a high rank $I_v$ shows a positive influence on the class $+1$ and a negative one on the class $-1$. Conversely, a low rank $I_v$ shows a positive influence on the class $-1$ and a negative one on the class $+1$.
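As an illustration of equation (3), the sketch below computes the influence as the percentage of “potential” outputs that do not exceed the “natural” output. The function name `influence`, the smooth toy model and the example values are assumptions made for this illustration only.

```python
import math


def influence(model, x, j, examples):
    """Percent rank of the natural output F(x) among the potential outputs
    obtained by giving x[j] the value observed on each example."""
    natural = model(x)
    potential = []
    for e in examples:
        z = list(x)
        z[j] = e[j]          # perturbed example F_j(x; X_k)
        potential.append(model(z))
    not_above = sum(1 for value in potential if value <= natural)
    return 100.0 * not_above / len(potential)


# Hypothetical smooth model with outputs in (-1, +1), driven by variable 0.
def smooth_model(x):
    return math.tanh(3.0 * (x[0] - 1.0))


grid = [[0.2, 1.0], [0.6, 1.0], [1.0, 1.0], [1.4, 1.0], [1.8, 1.0]]

high = influence(smooth_model, grid[4], 0, grid)  # largest value of variable 0
low = influence(smooth_model, grid[0], 0, grid)   # smallest value of variable 0
```

In this sketch the example with the largest value of variable 0 obtains the highest rank (its value “pulls up” the output), and the example with the smallest value obtains the lowest one.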

C. Automation of the interpretation: discussion

In business applications of CRM, scores identify the customers most likely to react positively to a marketing campaign. For example, rather than sending a mail to all its customers to offer a product, a company will prefer to target the subset of its customers having the most “appetency” for the product. The marketing campaign will be less expensive, and the customers who are not interested in the product will be less likely to receive its advertising in their post-box (or mailbox).

The score interpretation brings additional information to improve the effectiveness of marketing campaigns. Understanding the score provides means to support and personalize the commercial action. For example, if a customer is identified as fragile because he wishes to renew his mobile phone, the telecommunication company will be able to react by proposing a subscription with a reduction on the purchase price of a mobile phone. If the fragility of another customer corresponds to an under-use of his “pay monthly plan”, the company will be able to propose a better adapted plan.

In our system (see Figure 2), scores and score interpretations are evaluated in the deployment phase. The identifiers of the customers with the highest scores and the corresponding interpretations are sent to the CRM system. This system uses the score understanding to personalize customer relationships.

The method proposed in this paper analyses the sensitivity of the model output $p$ considering each input variable independently.

[Figure 2: block diagram with the elements Customers Database, Target choice, Modelisation, Model, Deployment, Target Population, Scores, Interpretations and Contact Customization.]

Fig. 2. Application architecture

The different steps needed to obtain the score understanding can require a long computation time. Two solutions can speed up this computation. The first extracts “an abstract” of each input variable, using for example the method presented in [10] or centile information for continuous variables, and the method presented in [11] for categorical variables. The second consists in memorising the $S(.)$ distribution.

IV. ILLUSTRATION ON A TOY EXAMPLE

A. Toy example

A toy example has been constructed to test and observe the model interpretation method proposed in this paper. This toy example is presented in Figure 3. In this figure the class $-1$ is in black and the class $+1$ is in gray. Figure 4 illustrates the “a priori” influence zones of the two dimensions: (1) areas of points A and C: examples where both $V_1$ and $V_2$ influence the class, (2) area of point B: examples where only $V_1$ influences the class, (3) areas of points D and F: examples where only $V_2$ influences the class and (4) area of point E: examples where neither dimension influences the class.

Data: 1000 examples for the training set and 1000 for the test set were randomly drawn ($V_1 \in [0, 2]$, $V_2 \in [0, 2]$).

[Figure 3: scatter plot of the two classes in the plane $(V_1, V_2)$, both axes ranging from 0 to 2.]

Fig. 3. Toy example: two classes

[Figure 4: the plane $(V_1, V_2)$, both axes ranging from 0 to 2, with the locations of the points A, B, C, D, E and F.]

Fig. 4. Influence zones

Models: Two types of model were tested on this toy example: (1) a Neural Network [12] (NN) using one hidden layer, a sigmoid activation function, the standard back-propagation algorithm (stochastic version) and the squared error as cost function. Using a cross-validation procedure, the number of hidden units was fixed to 4; (2) a Parzen Window [13] (PW) using a Gaussian kernel and the L2 norm:

$$P(y_i|X_n) = \frac{\sum_{k:\, y_k = y_i} K(X_n, X_k)}{\sum_{k} K(X_n, X_k)}, \qquad K(X_n, X_k) = \exp\left(-\frac{\|X_n - X_k\|^2}{2\sigma^2}\right).$$

The parameter $\sigma$ was fixed to 0.1 using a cross-validation procedure. Whatever the model, the data were standardized before training.
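The Parzen window estimate described above can be sketched as follows. This is a minimal illustration assuming the usual Gaussian kernel $\exp(-\|X_n - X_k\|^2 / (2\sigma^2))$; the function names, the tiny training set and the handling of a zero denominator are assumptions and are not taken from the original experiments.

```python
import math


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel based on the squared L2 distance between a and b."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def parzen_posterior(x, train_x, train_y, label, sigma=0.1):
    """Estimate P(label | x) as the kernel-weighted share of training
    examples carrying that label."""
    weights = [gaussian_kernel(x, xk, sigma) for xk in train_x]
    total = sum(weights)
    if total == 0.0:
        # Assumption of this sketch: if every kernel value underflows to 0,
        # no estimate is available and 0.0 is returned.
        return 0.0
    same_label = sum(w for w, yk in zip(weights, train_y) if yk == label)
    return same_label / total


# Hypothetical training set with two well separated classes.
train_x = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]
train_y = [-1, -1, 1, 1]

p_plus = parzen_posterior([0.05, 0.0], train_x, train_y, 1)
p_minus = parzen_posterior([0.05, 0.0], train_x, train_y, -1)
```

A point close to the examples of class $-1$ receives almost all of its kernel weight from that class, so `p_minus` is close to 1 and the two posteriors sum to 1.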

B. Construction of the elements of the interpretation

Among the 1000 test examples, 6 examples representative of the influence zones of the variables $V_1$ and $V_2$ were selected to illustrate the method. Their locations are indicated in Figure 4 and they are named from A to F: A(0.25, 1.50), B(1.00, 1.50), C(1.75, 1.50), D(0.25, 0.25), E(1.00, 0.25), F(1.75, 0.25).

The interpretation of these 6 examples requires the following steps (for $n \in \{A, B, C, D, E, F\}$):
• for $I(V_j|F, X_n, p)$:
(1.1) calculation of $S(V_j|F, X_i, p)\ \forall j, \forall i$;
(1.2) sorting of $S(.)$;
(1.3) determination of the rank of $S(V_j|F, X_n, p)$;
• for $I_v(V_j|F, X_n, p)$:
(2.1) calculation of $F_j(X_n, X_k)\ \forall k$;
(2.2) sorting of $F(.)$;
(2.3) determination of the rank of $F(X_n)$.

C. Results and discussion

Figure 5 shows the sensitivity distribution ($S(.)$, equation 1) obtained for $V_1$ using the NN and the PW on the training set. The x-coordinate represents a sensitivity value and the y-coordinate its corresponding rank in the distribution. The sensitivity ranks progress by stages (the result, not presented here, is the same for $V_2$). The sensitivity distributions are made up of a few important modes that depend on the considered classification problem and on the models used. These distributions combine the effect of individual sensitivities and of influence zones: zones where the input variables have no interest, zones where they have high interest and transition zones.

Figure 6 presents the distributions of “potential” outputs for the test point F and both input variables $V_1$, $V_2$ using the NN. The distribution obtained using the input variable $V_1$ has only one mode: $F(X_n, X_k) = -1.0\ \forall k$. This result is consistent since this variable has no influence for the example F. The distribution obtained using the input variable $V_2$ has 3 modes: $F(X_n, X_k) = -1$, $-1 \le F(X_n, X_k) \le +1$ and $F(X_n, X_k) = +1$.

[Figure 5: y-axis from 0 to 100, x-axis from 0 to 3, with one curve for the MLP and one for the Parzen window.]

Fig. 5. Ordered sensitivity distribution for $V_1$.

[Figure 6: y-axis from 0 to 100, x-axis from −1 to 1, with one curve for $V_1$ and one for $V_2$.]

Fig. 6. Ordered “potential” output for the test point ‘F’ and $V_1$, $V_2$ using the MLP.

Figures 5 and 6 show that it could be interesting to use a rank range instead of a single rank. Quintiles $Q_1$, $Q_2$, $Q_3$, $Q_4$ and $Q_5$ will now be used with the respective labels: “Very weak”, “Weak”, “Average”, “Strong”, “Very Strong”. Each rank belongs to one of these quintiles (value of $Q$ in Table II) and therefore has the corresponding label. The joint observation of Table II, Figure II and Figure 5 shows a total coherence in the obtained results.
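The mapping from a percent rank to a quintile label can be written in a few lines. The sketch below assumes quintile boundaries at 20, 40, 60 and 80, with a boundary value assigned to the lower quintile; this boundary convention and the function name `quintile` are assumptions, chosen so that the ranks reported in Table II fall in the quintiles given there.

```python
import math

LABELS = ["Very weak", "Weak", "Average", "Strong", "Very Strong"]


def quintile(rank):
    """Map a percent rank in [0, 100] to (quintile number, label)."""
    if not 0 <= rank <= 100:
        raise ValueError("rank must lie in [0, 100]")
    # ceil(rank / 20) gives 1 for (0, 20], 2 for (20, 40], and so on;
    # rank 0 is clamped into the first quintile.
    q = min(5, max(1, math.ceil(rank / 20)))
    return q, LABELS[q - 1]


# Ranks taken from Table II (MLP, variable V_1, points A and B).
label_importance_A = quintile(63)
label_influence_B = quintile(14)
```

For instance, the importance rank $o = 63$ falls in $Q_4$ (“Strong”) and the influence rank $r = 14$ falls in $Q_1$ (“Very weak”), as in Table II.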

The influence of an input variable ($I_v$) also has to be evaluated in conjunction with the variable importance ($I$). If $I = 0$ the corresponding $I_v$ is irrelevant. Variables with a small $I$ should not be used in the interpretation, which then has to be based only on the important variables (in these cases the value of $I_v$ is not presented in Table II).

Interpretation using the MLP

V_j, X_n    S      I            F(X_n)   I_v
V_1, X_A    1.24   Q4 (o=63)    +1.00    Q5 (r=99)
V_2, X_A    0.96   Q3 (o=49)    +1.00    Q5 (r=99)
V_1, X_B    2.70   Q5 (o=89)    -1.00    Q1 (r=14)
V_2, X_B    0.00   -            -        -
V_1, X_C    1.24   Q4 (o=63)    +1.00    Q5 (r=99)
V_2, X_C    0.93   Q2 (o=31)    +1.00    Q5 (r=99)
V_1, X_D    0.00   -            -        -
V_2, X_D    3.03   Q5 (o=95)    -1.00    Q2 (r=22)
V_1, X_E    0.00   -            -        -
V_2, X_E    0.00   -            -        -
V_1, X_F    0.00   -            -        -
V_2, X_F    3.05   Q5 (o=98)    -1.00    Q2 (r=21)

Interpretation using the Parzen window

V_j, X_n    S      I            F(X_n)   I_v
V_1, X_A    1.16   Q4 (o=63)    +0.99    Q4 (r=74)
V_2, X_A    0.97   Q3 (o=53)    +0.99    Q4 (r=74)
V_1, X_B    2.28   Q5 (o=89)    -0.99    Q2 (r=25)
V_2, X_B    0.00   -            -        -
V_1, X_C    1.16   Q4 (o=63)    +0.99    Q4 (r=75)
V_2, X_C    0.90   Q2 (o=35)    +0.99    Q4 (r=67)
V_1, X_D    0.00   -            -        -
V_2, X_D    2.96   Q5 (o=96)    -0.99    Q1 (r=12)
V_1, X_E    0.00   -            -        -
V_2, X_E    0.00   -            -        -
V_1, X_F    0.00   -            -        -
V_2, X_F    3.02   Q5 (o=90)    -0.99    Q1 (r=12)

TABLE II
INTERPRETATION OF THE 6 TEST POINTS

D. Two examples of obtained interpretations

Two interpretations using Table II are presented here. The first interpretation is for the test point A using the Parzen window. The interpretation contains 3 elements: (1) the point belongs to the class $+1$ with a probability (the CRM score) of 0.99 (the value of $F(X_A)$) because:
* (2): $V_1$, which is very important, indicates that it belongs strongly to the class $+1$;
* (3): $V_2$, which is moderately important, indicates that it belongs strongly to the class $+1$.

The second interpretation is for the test point D³ using the MLP. The interpretation contains 2 elements: (1) the point belongs to the class $-1$ with a probability (the CRM score) of 1.00 (the value of $F(X_D)$) because:
* (2): $V_2$, which is very important, indicates that it belongs strongly to the class $-1$.

The inspection of the interpretations obtained (Table II) on all points of Figure 3 shows that the interpretations are consistent whatever the tested model; this is an important advantage of the proposed method. The interpretation method is also usable for other applications: once the importance ($I$) and the influence ($I_v$) of an input variable are known, the class of an example (a customer in our application of this method) could be changed or reinforced.

V. TRANSPOSITION TO A REAL APPLICATION

A. Introduction to the “Why” and “How” notions

The aim of the transposition detailed in this section is a proof of concept, intended for an Orange™ Business Unit, of the interpretation method presented in this paper. The purpose is to show that the interpretation method can be used in the context of CRM.

The way to improve the customer relationship is described in the following example. A campaign is designed to reduce customer churn. The interpretation of the score (the probability that a customer $X_n$ churns) has to explain (i) “Why” the trained model gives the customer this score and (ii) “How” it is possible to decrease this score.

The “Why” and “How” information is not useful for all customers. Marketers need this information only for the customers to whom the campaign will be applied. These customers are selected using their churn probability (high scores) and are named “the target”.

Using the “Why” and “How” information, marketers can write a more personalized script to retain customers. The commercial script can be personalized for each customer relationship. In the discussion between the teleoperator and the customer, it is rarely possible to influence more than one aspect of this customer (one input variable of the classification model which produces the scores). Therefore only one variable will be kept in the Why and How interpretations, as described in the next section.

³ For the point D, which belongs to the class $-1$ (and conversely for the point A of the class $+1$), a low rank $I_v$ indicates a positive influence on the class $-1$ and a negative one on the class $+1$; see section III-B.

B. Implementation

The Why notion uses the definition of $I$ presented in section III-A. This definition is used here only for the most important variable. This variable describes a “profile” of the customer $X_n$, and we define the Why notion by:

$$Why(X_n|F, p) = \operatorname{argmax}_{V_j} \left[ I(V_j|F, X_n, p) \right] \qquad (4)$$

The computation time of $Why(X_n)$ is in $O(KJ)$. This computation can be simplified if only the $V_{dj}$, the number $d$ of different values of the variable $V_j$, are considered. In this case the computation time of $Why(X_n)$ is in $O\left(\left(\sum_{j=1}^{J} V_{dj}\right) J\right)$. The computation time can exceed a day (since more than one million customers are concerned), which makes the result useless in the CRM-Analytics loop (see Figure 2). To reduce this computation time, variables which have more than 100 different values are discretized using centiles. A variable therefore has a maximum of $T$ modalities ($T \le 100$, $\forall j$). The Why notion then uses the following computation for $S(.)$:

$$S(V_j|F, X_n, p) = \sum_{t=1}^{T} \|F_j(X_n) - F_j(X_n; V_{tj})\|^2 \, P(V_{tj}) \qquad (5)$$

where $P(V_{tj})$ is the probability of $V_{tj}$.
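Equation (5) and the selection of equation (4) can be sketched as follows. Several choices here are assumptions of the sketch and not the authors' implementation: variables with more than 100 distinct values are summarized by 100 equal-frequency bins represented by their mean, and the Why variable is selected as the argmax of $S$ for the customer, which coincides with the argmax of the rank $I$ up to ties because the rank is non-decreasing in $S$. All function names and the toy churn model are hypothetical.

```python
from collections import Counter


def modalities(values, max_modalities=100):
    """Summarize a variable as a list of (value, probability) pairs."""
    counts = Counter(values)
    n = len(values)
    if len(counts) <= max_modalities:
        return [(v, c / n) for v, c in sorted(counts.items())]
    # Assumption: equal-frequency bins, each represented by its mean value.
    ordered = sorted(values)
    result = []
    for t in range(max_modalities):
        chunk = ordered[t * n // max_modalities:(t + 1) * n // max_modalities]
        if chunk:
            result.append((sum(chunk) / len(chunk), len(chunk) / n))
    return result


def discretized_sensitivity(model, x, j, mods):
    """Equation (5): squared output variations weighted by P(V_tj)."""
    base = model(x)
    total = 0.0
    for value, probability in mods:
        z = list(x)
        z[j] = value
        total += (base - model(z)) ** 2 * probability
    return total


def why(model, x, mods_per_variable):
    """Index of the most sensitive variable for x, with all sensitivities."""
    scores = [discretized_sensitivity(model, x, j, mods)
              for j, mods in enumerate(mods_per_variable)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical churn model: low usage (variable 0) means a high churn score.
def churn_model(x):
    return 0.9 if x[0] < 1.0 else 0.1


column_0 = [0.5, 0.5, 1.5, 1.5]
column_1 = [0.0, 1.0, 0.0, 1.0]
mods = [modalities(column_0), modalities(column_1)]
best_variable, scores = why(churn_model, [0.5, 0.0], mods)
```

Each sensitivity now costs at most $T$ model evaluations per variable instead of one per training example.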

The “How” interpretation looks for values of variables that positively change the score of a customer (a “pull down” value for churn or, conversely, a “pull up” value for “appetency”). This interpretation is tied to $I_v$ (see equation 3). Here, for the Orange Business Unit application, the “How” is limited to the most positive variable, such that ($F_j(., .) \in [0, 1]$):

$$How(X_n|F, p) = \operatorname{argmin}_{V_j} \left[ \operatorname{argmin}_{t} \left[ F_j(X_n, V_{tj}) \right] \right] \qquad (6)$$

Here the problem is to prevent churn and to find the “worst” variable. Furthermore, variables that cannot be changed, such as sex, birthday or address, are not tested.
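A possible reading of equation (6) is sketched below: for every variable that can be changed, each candidate value is tried and the pair (variable, value) giving the lowest churn score is kept. The function name `how`, the `immutable` argument and the linear toy model are assumptions made for this illustration.

```python
def how(model, x, values_per_variable, immutable=()):
    """Return (variable index, value, score) minimizing the model output
    when a single changeable variable of x is modified, or None if no
    variable can be changed."""
    best = None
    for j, candidates in enumerate(values_per_variable):
        if j in immutable:
            continue  # e.g. sex, birthday or address are not tested
        for value in candidates:
            z = list(x)
            z[j] = value
            score = model(z)
            if best is None or score < best[2]:
                best = (j, value, score)
    return best


# Hypothetical churn score in [0, 1], decreasing in each of three variables.
def churn_score(x):
    return 0.9 - 0.3 * x[0] - 0.1 * x[1] - 0.4 * x[2]


customer = [0.0, 0.0, 0.0]
candidate_values = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

# Variable 2 has the largest effect but is declared unchangeable here.
action = how(churn_score, customer, candidate_values, immutable={2})
```

In this sketch the selected action is to move variable 0 to the value 1.0, since variable 2, although more effective, is excluded from the search.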

C. Experiments on Orange scores

Orange scores are calculated with the SAS™, Kxen™ or Khiops™ software (depending on the Business Unit and the country). The results presented here were obtained with the Kxen software using a model close to a ridge regression. However, as detailed above in this paper, the structure of the model is not used.

For confidentiality reasons, the results of the “Why” and “How” approaches on recent Orange scores are not presented. Only the “Why” information is illustrated, on an older churn model. This model is computed on a table of 100000 customers. The target is composed of 10 % of the customers.

Input variables are defined as follows: indicators of telephone use; flags on the possession of a service or product; indicators on the customer (sex, senior (yes/no), ...); indicators of the customer environment; indicators of the customer purchasing behaviour; ...

Why         % of the   Usage   Product 1   Product 2   Service 1   Customer     Customer      Customer   ...
            Target                                                 Indication   Environment   Behavior
Usage       58%        0.19    0.00        1.04        0.68        0.99         0.99          0.07       ...
Product 1   17%        2.10    6.77        1.20        1.03        1.23         0.95          3.05       ...
Product 2   15%        1.85    0.00        0.49        1.15        0.79         1.01          1.06       ...
Product 3   6%         1.97    0.08        1.16        3.74        0.66         0.99          1.40       ...
...         ...        ...     ...         ...         ...         ...          ...           ...        ...

TABLE III
“WHY” RESULTS

Table III gives in its first column the name of the most important variable according to the definition of equation 4. The second column indicates the percentage of customers for which this variable is the most important. The columns from the third to the last give ratios. For example, the cell at the intersection of the “Usage” column and the “Product 1” line gives the ratio between the mean value of the input variable “Usage” over the customers for which the “Product 1” input variable is the most important (in the “Why” sense) and its mean value over the population. This cell (2.10) indicates customers whose mean usage is greater than the population mean.
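The construction of such a table can be sketched as follows. Which population serves as the reference for the means is an assumption of this sketch (here, all rows passed to the function); `profile_ratios` and the small data set are hypothetical.

```python
def profile_ratios(rows, why_index):
    """Group customers by their Why variable and compare profile means
    with population means.

    rows      : list of input vectors (one per customer)
    why_index : for each customer, index of its most important variable
    Returns {profile: (share of customers, [ratio for each variable])}.
    A ratio is None when the population mean of the variable is 0.
    """
    n = len(rows)
    n_vars = len(rows[0])
    population_mean = [sum(r[c] for r in rows) / n for c in range(n_vars)]
    table = {}
    for profile in sorted(set(why_index)):
        members = [r for r, w in zip(rows, why_index) if w == profile]
        ratios = []
        for c in range(n_vars):
            profile_mean = sum(r[c] for r in members) / len(members)
            if population_mean[c] == 0:
                ratios.append(None)
            else:
                ratios.append(profile_mean / population_mean[c])
        table[profile] = (len(members) / n, ratios)
    return table


# Hypothetical data: four customers, two variables, two profiles.
rows = [[1.0, 0.0], [1.0, 2.0], [3.0, 2.0], [3.0, 4.0]]
why_index = [0, 0, 1, 1]
table = profile_ratios(rows, why_index)
```

A ratio below 1 in this sketch means that the customers of the profile have, on average, a lower value of the variable than the reference population, and a ratio above 1 means a higher value.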

Table III shows a main profile, identified by the “Usage” variable, that contains 58 % of the “target population”. The analysis of the first line of this table indicates (1) for the first ratio column: customers with a weak usage of some services (5 times smaller than the population mean); (2) for the second ratio column: customers with no service or product of type “Product 1”; and so on. A possible marketing campaign can therefore be built to push service usage or to suggest services adapted to their consumption. The other lines and cells of Table III can be analysed using the same process.

15 models have been tested (for this churn problem) with different numbers of input variables. In all tests the approach proved useful. The “Why” approach detects profiles among high scores and provides a relevant interpretation. The “How” approach seeks the best value to reinforce (or change) a score.

D. Discussions

The Orange case shows the usefulness of the approach for detecting profiles among high scores. The interpretation of a profile is easy since it contains only the most important variable, which characterizes the profile itself.

However, a profile built using only the most important variable is not always the best choice. If all high scores have the same most important variable, the second most sensitive variable has to be considered, and so on. When the model has a lot of input variables the profile can be difficult to analyse. This is another obstacle to the marketing use of the interpretation method.

VI. CONCLUSION

A method to interpret the results of a predictive model has been presented. Experiments were performed on a toy problem using two different models, and on a real application using another model (from a commercial software package). The results show that the method behaves well. This method is currently being industrialized in Orange CRM applications.

Even if the method was designed for black-box models, there are still ways to speed up the computation of the sensitivity. The sensitivity analysis of a specific model (e.g. logistic regression) could be accelerated by finding an analytic sensitivity function for the model. For example, the method is exact for the naive Bayes model which is used in the Khiops software⁴. The proposed method will be added to the Khiops software next year. Future work concerns the extension of the method to obtain an instance selection method.

ACKNOWLEDGMENTS

The authors would like to thank Claude Riwan and the Score Team of Orange France for their contribution to the experimentation of the method presented in this paper.

REFERENCES

[1] Y. Arcadius, J. Akossou, and R. Palm, “Consequences of variable selection on the interpretation of the results in multiple linear regression,” in Biotechnol. Agron. Soc. Environ., vol. 9, 2005, pp. 11–18.
[2] J. Nakache and J. Confais, Statistique explicative appliquée. TECHNIP, 2003.
[3] J. J. Brennan and L. M. Seiford, “Linear programming and l1 regression: A geometric interpretation,” Computational Statistics & Data Analysis, 1987.
[4] S. Thrun, “Extracting rules from artificial neural networks with distributed representations,” in Advances in Neural Information Processing Systems, G. Tesauro, D. Touretzky, and T. Leen, Eds., vol. 7. Cambridge, MA: MIT Press, 1995.
[5] J. M. Benitez, J. L. Castro, and I. Requena, “Are artificial neural networks black boxes?” IEEE Transactions on Neural Networks, vol. 8, no. 5, pp. 1156–1164, September 1997.
[6] K. Främling, “Explaining results of neural networks by contextual importance and utility,” in AISB, 1996.
[7] R. Féraud and F. Clérot, “A methodology to explain neural network classification,” Neural Networks, vol. 15, no. 2, pp. 237–246, 2002.
[8] V. Lemaire and C. Clérot, “An input variable importance definition based on empirical data probability and its use in variable selection,” in International Joint Conference on Neural Networks IJCNN, 2004.
[9] V. Lemaire and R. Féraud, “Driven forward features selection: a comparative study on neural networks,” in International Conference on Neural Information Processing, Hong-Kong, October 2006.
[10] M. Boullé, “Khiops: a statistical discretization method of continuous attributes,” Machine Learning (ML), vol. 55, no. 1, pp. 53–69, 2004.
[11] M. Boullé, “A bayes optimal approach for partitioning the values of categorical attributes,” Journal of Machine Learning Research, 2005.
[12] J. A. Anderson, An Introduction to Neural Networks. MIT Press, 1995.
[13] E. Parzen, “On estimation of a probability density function and mode,” Ann. Math. Stat., pp. 1065–1076, 1962.

⁴ http://www.francetelecom.com/en/group/rd/offer/software/applications/providers/khiops.html