
Predictive K-means with local models

Vincent Lemaire1, Oumaima Alaoui Ismaili1,

Antoine Cornuéjols2, Dominique Gay3

1Orange Labs, Lannion, France

2AgroParisTech, Université Paris-Saclay, Paris, France

3LIM-EA2525, Université de La Réunion

Abstract. Supervised classification can be effective for prediction but sometimes weak on interpretability or explainability (XAI). Clustering, on the other hand, tends to isolate categories or profiles that can be meaningful, but there is no guarantee that they are useful for label prediction. Predictive clustering seeks to obtain the best of both worlds. Starting from labeled data, it looks for clusters that are as pure as possible with regard to the class labels. One technique consists in tweaking a clustering algorithm so that data points sharing the same label tend to aggregate together. With distance-based algorithms, such as k-means, a solution is to modify the distance used by the algorithm so that it incorporates information about the labels of the data points. In this paper, we propose another method which relies on a change of representation guided by class densities and then carries out clustering in this new representation space. We present two new algorithms using this technique and show on a variety of data sets that they are competitive for prediction performance with pure supervised classifiers while offering interpretability of the clusters discovered.

1 Introduction

While the power of predictive classifiers can sometimes be impressive on given

learning tasks, their actual usability might be severely limited by the lack of

interpretability of the hypothesis learned. The opacity of many powerful super-

vised learning algorithms has indeed become a major issue in recent years. This

is why, in addition to good predictive performance as a standard goal, many learn-

ing methods have been devised to provide readable decision rules [3], degrees of

belief, or other easy-to-interpret visualizations. This paper presents a predictive technique which promotes both interpretability and explainability in its core design.

The idea is to combine the predictive power brought by supervised learn-

ing with the interpretability that can come from the descriptions of categories
and profiles discovered using unsupervised clustering. The resulting family of

techniques is variously called supervised clustering or predictive clustering. In

the literature, there are two categories of predictive clustering. The ﬁrst family

of algorithms aims at optimizing the trade-oﬀ between description and predic-

tion, i.e., at detecting sub-groups in each target class. By contrast, the

algorithms in the second category favor the prediction performance over the dis-

covery of all underlying clusters, still using clusters as the basis of the decision

function. The hope is that the predictive performance of predictive clustering

methods can approximate the performances of supervised classiﬁers while their

descriptive capability remains close to the one of pure clustering algorithms.

Several predictive clustering algorithms have been presented over the years,

for instance [1,4,10,11, 23]. However, the majority of these algorithms require (i)

a considerable execution time, and (ii) that numerous user parameters be set.

In addition, some algorithms are very sensitive to the presence of noisy data and

consequently their outputs are not easily interpretable (see [5] for a survey). This

paper presents a new predictive clustering algorithm. The underlying idea is to

use any existing distance-based clustering algorithm, e.g., k-means, but on a

redescription space where the target class is integrated. The resulting algorithm

has several desirable properties: there are few parameters to set, its computa-

tional complexity is almost linear in m, the number of instances, it is robust to

noise, its predictive performance is comparable to the one obtained with classical

supervised classiﬁcation techniques and it tends to produce groups of data that

are easy to interpret for the experts.

The remainder of this paper is organized as follows: Section 2 introduces the

basis of the new algorithm, the computation of the clusters, the initialization step

and the classiﬁcation that is realized within each cluster. The main computation

steps of the resulting predictive clustering algorithms are described in Algorithm

1. We then report experiments that deal with the predictive performance in

Section 3. We focus on the supervised classification performance to assess whether

predictive clustering could reach the performances of algorithms dedicated to

supervised classification. Our algorithm is compared with powerful supervised classification algorithms on a variety of data sets in order to assess its value as

a predictive technique, and an analysis of the results is carried out. Conclusion

and perspectives are discussed in Section 4.

2 Turning the K-means algorithm predictive

The k-means algorithm is one of the simplest yet most commonly used clus-

tering algorithms. It seeks to partition $m$ instances $(X_1, \ldots, X_m)$ into $K$ groups
$(B_1, \ldots, B_K)$ so that instances which are close are assigned to the same cluster

while clusters are as dissimilar as possible. The objective function can be deﬁned

as:

$$ G = \operatorname*{Argmin}_{B_i} \sum_{i=1}^{K} \sum_{X_j \in B_i} \| X_j - \mu_i \|^2 \qquad (1) $$

where the $\mu_i$ are the centers of the clusters $B_i$ and we consider the Euclidean distance.

Predictive clustering adds the constraint of maximizing cluster purity (i.e.,

instances in a cluster should share the same label). In addition, the goal is to

provide results that are easy to interpret by the end users. The objective function

of Equation (1) needs to be modiﬁed accordingly.

One approach is to modify the distance used in conventional clustering al-

gorithms in order to incorporate information about the class of the instances.

This modiﬁed distance should make points diﬀerently labelled appear as more

distant than in the original input space. Rather than modifying the distance,

one can instead alter the input space. This is the approach taken in this paper,

where the input space is partitioned according to class probabilities prior to the

clustering step, thus favoring clusters of high purity. Besides the introduction of

a technique for computing a new feature space, we propose as well an adapted

initialization method for the modiﬁed k-means algorithm. We also show the ad-

vantage of using a speciﬁc classiﬁcation method within each discovered cluster

in order to improve the classiﬁcation performance. The main steps of the result-

ing algorithm are described in Algorithm 1. In the remainder of this section,

we show how each step of the usual K-means is modiﬁed to yield a predictive

clustering algorithm.

Algorithm 1 Predictive K-means algorithm

Input:
- $D$: a data set containing $m$ instances. Each instance $(X_i)$, $i \in \{1, \ldots, m\}$, is described by $d$ descriptive features and a label $C_i \in \{1, \ldots, J\}$.
- $K$: the number of clusters.

Start:
1) Supervised preprocessing of the data to represent each $X_i$ as $\hat{X}_i$ in a new feature space $\Phi(\mathcal{X})$.
2) Supervised initialization of the centers.
repeat
3) Assignment: generate a new partition by assigning each instance $\hat{X}_i$ to the nearest cluster.
4) Representation: compute the centers of the new partition.
until convergence of the algorithm
5) Assignment of classes to the obtained clusters:
- method 1: majority vote.
- method 2: local models.
6) Prediction of the class of new instances in the deployment phase:
→ the class of the closest cluster (if method 1 is used).
→ local models (if method 2 is used).
End
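To make the flow of Algorithm 1 concrete, the following sketch outlines a minimal Python training loop. It assumes the instances have already been mapped to the redescription space $\Phi(\mathcal{X})$ (array X_hat), and that a supervised initializer kppr_init and a local-model constructor fit_local_model are supplied; these names are illustrative helpers, not code from the paper.

    import numpy as np

    def predictive_kmeans(X_hat, y, K, kppr_init, fit_local_model,
                          n_iter=100, tol=1e-6):
        """Minimal sketch of Algorithm 1 on an already redescribed data set.

        X_hat : (m, d*J) array of instances in the feature space Phi(X)
        y     : (m,) array of class labels in {0, ..., J-1}
        """
        centers = kppr_init(X_hat, y, K)          # step 2: supervised initialization
        for _ in range(n_iter):
            # step 3: assign each instance to the nearest center (Euclidean distance)
            d2 = ((X_hat[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            # step 4: recompute the centers of the new partition
            new_centers = np.array([
                X_hat[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
                for k in range(K)])
            if np.linalg.norm(new_centers - centers) < tol:   # convergence test
                centers = new_centers
                break
            centers = new_centers
        # step 5: label each cluster by majority vote and, when possible, a local model
        majority, local_models = {}, {}
        for k in range(K):
            members = np.where(assign == k)[0]
            majority[k] = np.bincount(y[members]).argmax() if len(members) else 0
            local_models[k] = fit_local_model(X_hat[members], y[members]) if len(members) else None
        return centers, majority, local_models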

A modified input space for predictive clustering - The principle of the proposed approach is to partition the input space according to the class probabilities $P(C_j|X)$. More precisely, let the input space $\mathcal{X}$ be of dimension $d$, with numerical descriptors as well as categorical ones. An example $X_i \in \mathcal{X}$ ($X_i = [X^{(1)}_i, \ldots, X^{(d)}_i]^\top$) will be described in the new feature space $\Phi(\mathcal{X})$ by $d \times J$ components, $J$ being the number of classes. Each component $X^{(n)}_i$ of $X_i \in \mathcal{X}$ gives $J$ components $\hat{X}^{(n,j)}_i$, for $j \in \{1, \ldots, J\}$, of the new description $\hat{X}_i$ in $\Phi(\mathcal{X})$, where $\hat{X}^{(n,j)}_i = \log P(X^{(n)} = X^{(n)}_i \mid C_j)$, i.e., the log-likelihood values. Therefore, an example $X$ is redescribed according to the (log-)probabilities of observing the values of the original input variables given each of the $J$ possible classes (see Figure 1). Below, we describe a method for computing these values. But first, we analyze one property of this redescription in $\Phi(\mathcal{X})$ and the distance it provides.

[Figure 1 shows the redescription scheme: the original table with variables $X^{(1)}, \ldots, X^{(d)}$ and class $Y$ is mapped by $\Phi$ to a table with $d \times J$ variables $X^{(1,1)}, \ldots, X^{(d,J)}$.]

Fig. 1. $\Phi$ redescription scheme from $d$ variables to $d \times J$ variables, with log-likelihood values $\log P(X^{(n)}|C_j)$.

Property of the modified distance - Let us denote by $dist^p_B$ the new distance defined over $\Phi(\mathcal{X})$. For two recoded instances $\hat{X}_1$ and $\hat{X}_2 \in \mathbb{R}^{d \times J}$, the formula of $dist^p_B$ is (in the following we omit $\hat{X} = \hat{X}_i$ in the probability terms to simplify the notation):

$$ dist^p_B(\hat{X}_1, \hat{X}_2) = \sum_{j=1}^{J} \left\| \log P(\hat{X}_1|C_j) - \log P(\hat{X}_2|C_j) \right\|^p \qquad (2) $$

where $\|\cdot\|^p$ is a Minkowski distance. Let us now denote by $\Delta^p(\hat{X}_1, \hat{X}_2)$ the distance between the (log-)posterior probabilities of the two instances $\hat{X}_1$ and $\hat{X}_2$; it is defined as follows:

$$ \Delta^p(\hat{X}_1, \hat{X}_2) = \sum_{j=1}^{J} \left\| \log P(C_j|\hat{X}_1) - \log P(C_j|\hat{X}_2) \right\|^p \qquad (3) $$

where $\forall i \in \{1, \ldots, m\}$, $P(C_j|\hat{X}_i) = \frac{P(C_j) \prod_{n=1}^{d} P(X^{(n)}_i|C_j)}{P(\hat{X}_i)}$ (using the hypothesis of feature independence conditionally on the target class). From the distance given in Equation (3), we obtain the following inequality:

$$ \Delta^p(\hat{X}_1, \hat{X}_2) \le dist^p_B(\hat{X}_1, \hat{X}_2) + J \left\| \log P(\hat{X}_2) - \log P(\hat{X}_1) \right\|^p \qquad (4) $$

Proof.

$$\begin{aligned}
\Delta^p &= \sum_{j=1}^{J} \left\| \log P(C_j|\hat{X}_1) - \log P(C_j|\hat{X}_2) \right\|^p \\
&= \sum_{j=1}^{J} \left\| \log \frac{P(\hat{X}_1|C_j)\,P(C_j)}{P(\hat{X}_1)} - \log \frac{P(\hat{X}_2|C_j)\,P(C_j)}{P(\hat{X}_2)} \right\|^p \\
&= \sum_{j=1}^{J} \left\| \log P(\hat{X}_1|C_j) - \log P(\hat{X}_1) - \log P(\hat{X}_2|C_j) + \log P(\hat{X}_2) \right\|^p \\
&\le \sum_{j=1}^{J} \left[ A + B \right]
\end{aligned}$$

with $\Delta^p = \Delta^p(\hat{X}_1, \hat{X}_2)$, $A = \| \log P(\hat{X}_1|C_j) - \log P(\hat{X}_2|C_j) \|^p$ and $B = \| \log P(\hat{X}_2) - \log P(\hat{X}_1) \|^p$;
then $\Delta^p \le dist^p_B(\hat{X}_1, \hat{X}_2) + J \, \| \log P(\hat{X}_2) - \log P(\hat{X}_1) \|^p$.

The above inequality expresses that two instances that are close in terms of the distance $dist^p_B$ will also be close in terms of their probabilities of belonging to the same class. Note that the distance presented above can be integrated into any distance-based clustering algorithm.
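As an illustration, the snippet below computes $dist^p_B$ between two redescribed instances. It is one possible reading of Equation (2), in which $\log P(\hat{X}|C_j)$ is the full class-conditional log-likelihood obtained by summing the $d$ per-feature components under the naive Bayes assumption; the instances are assumed to be stored as $(d, J)$ arrays of $\log P(X^{(n)} = x^{(n)} \mid C_j)$ values.

    import numpy as np

    def dist_b(x1_loglik, x2_loglik, p=2):
        """Distance of Equation (2) between two redescribed instances.

        x1_loglik, x2_loglik : (d, J) arrays of log P(X^(n) = x^(n) | C_j).
        The per-class log-likelihood is the sum over the d features; the
        distance aggregates, over the J classes, the p-th power of the
        absolute difference of these log-likelihoods.
        """
        ll1 = x1_loglik.sum(axis=0)          # log P(x1 | C_j), j = 1..J
        ll2 = x2_loglik.sum(axis=0)          # log P(x2 | C_j), j = 1..J
        return float(np.sum(np.abs(ll1 - ll2) ** p))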

Building the log-likelihood redescription $\Phi(\mathcal{X})$ - Many methods can estimate the new descriptors $\hat{X}^{(n,j)}_i = \log P(X^{(n)} = X^{(n)}_i \mid C_j)$ from a set of examples. In our work, we use a supervised discretization method for numerical attributes and a supervised value-grouping method for categorical attributes, to obtain respectively the intervals and the groups of values in which $P(X^{(n)} = X^{(n)}_i \mid C_j)$ can be estimated. The supervised discretization method is described in [8] and the grouping method in [7]; both have been compared in extensive experiments to the corresponding state-of-the-art algorithms. These methods compute univariate partitions of the input space using the supervised information: the partition is chosen so as to optimize the prediction of the labels of the examples given the intervals (or groups) in which they fall, and the best partition (number of intervals and thresholds) is found using a Bayes estimate. An additional bonus of the method is that outliers are automatically eliminated and missing values can be imputed.
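The following sketch builds such a redescription. For simplicity it uses equal-frequency binning with Laplace-smoothed per-class frequencies instead of the MODL-style supervised discretization of [8] that the paper actually relies on; it only illustrates the shape of the transformation from $d$ numerical features to $d \times J$ log-likelihood features. The helper names fit_redescription and redescribe are ours.

    import numpy as np

    def fit_redescription(X, y, n_bins=10, J=None):
        """Learn, for each numerical feature, bin edges and per-class log-frequencies."""
        J = J or int(y.max()) + 1
        models = []
        for n in range(X.shape[1]):
            edges = np.unique(np.quantile(X[:, n], np.linspace(0, 1, n_bins + 1)[1:-1]))
            bins = np.digitize(X[:, n], edges)               # bin index of each instance
            counts = np.ones((len(edges) + 1, J))            # Laplace smoothing
            for b, c in zip(bins, y):
                counts[b, c] += 1
            models.append((edges, np.log(counts / counts.sum(axis=0, keepdims=True))))
        return models

    def redescribe(X, models):
        """Map X (m, d) to Phi(X) (m, d*J) with log P(bin(X^(n)) | C_j) estimates."""
        cols = []
        for n, (edges, log_p) in enumerate(models):
            cols.append(log_p[np.digitize(X[:, n], edges)])  # (m, J) block for feature n
        return np.hstack(cols)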

Initialisation of centers - Because clustering is an NP-hard problem, heuristics are needed to solve it, and the search procedure is often iterative, starting from an initialized set of prototypes. One foremost example of many such distance-based methods is the k-means algorithm. It is known that the initialization step can have a significant impact both on the number of iterations and, more importantly, on the results, which correspond to local minima of the optimization criterion (such as Equation (1)) [20]. However, by contrast with classical clustering methods, predictive clustering can use supervised information for the choice of the initial prototypes. In this study, we chose to use the K++R method. Described in [17], it follows an "exploit and explore" strategy where the class labels are first exploited before the input distribution is used for exploration, in order to get the apparently best initial centers. The main idea of this method is to dedicate one center per class (comparable to a "Rocchio" [19] solution). Each center is defined as the average vector of the instances which have the same class label. If the predefined number of clusters ($K$) exceeds the number of classes ($J$), the initialization continues using the K-means++ algorithm [2] for the $K - J$ remaining centers, in such a way as to add diversity. This method can only be used when $K \ge J$, but this is fine since in the context of supervised clustering we do not look for clusterings with $K < J$ (it does not make sense to cluster the instances into fewer clusters than classes). The complexity of this scheme is $O(m + (K - J)m) < O(mK)$, where $m$ is the number of examples. When $K = J$, this method is deterministic.
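A minimal sketch of this K++R initialization could look as follows; the k-means++ completion step is simplified (centers drawn with probability proportional to the squared distance to the nearest existing center), and the function is assumed to operate in the redescription space. The name kppr_init matches the helper assumed in the earlier pipeline sketch.

    import numpy as np

    def kppr_init(X_hat, y, K, rng=None):
        """K++R initialization: one center per class, then k-means++ for the rest."""
        rng = np.random.default_rng(rng)
        classes = np.unique(y)
        # exploit: one center per class = mean of the instances of that class (Rocchio-like)
        centers = [X_hat[y == c].mean(axis=0) for c in classes]
        # explore: complete with k-means++ style seeding for the K - J remaining centers
        while len(centers) < K:
            d2 = np.min(((X_hat[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1),
                        axis=1)                    # squared distance to nearest center
            probs = d2 / d2.sum()
            centers.append(X_hat[rng.choice(len(X_hat), p=probs)])
        return np.array(centers)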

Instance assignment and centers update - Considering the Euclidean distance and the original K-means procedure for updating centers, at each iteration each instance is assigned to the nearest cluster using the $\ell_2$ metric ($p = 2$) in the redescription space $\Phi(\mathcal{X})$. The $K$ centers are then updated according to the K-means procedure. This choice of the (Euclidean) distance in the adopted k-means strategy could have an influence on the (predictive) relevance of the clusters, but this has not been studied in this paper.

Label prediction in predictive clustering - Unlike classical clustering which

aims only at providing a description of the available data, predictive clustering

can also be used in order to make predictions about new incoming examples that

are unlabeled.

The commonest method used for prediction in predictive K-means is the majority vote. A new example is first assigned to the cluster of its nearest prototype, and the predicted label is the one shared by the majority of the examples of this cluster. This method is not optimal. Let us call $P_M$ the frequency of the majority class in a given cluster. The true probability $\mu$ of this class obeys the Hoeffding inequality: $P(|P_M - \mu| \ge \varepsilon) \le 2 \exp(-2 m_k \varepsilon^2)$, with $m_k$ the number of instances assigned to the cluster $k$. If there are only two classes, the error rate is $1 - \mu$ if $P_M$ and $\mu$ are both $> 0.5$. But the error rate can even exceed $0.5$ if $P_M > 0.5$ while actually $\mu < 0.5$. The analysis is more complex in the case of more than two classes, and it is not the object of this paper to investigate this further. But it is apparent that the majority rule can often be improved upon, as is the case in classical supervised learning.
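To illustrate how loose the majority-vote estimate can be for small clusters, the snippet below evaluates the Hoeffding bound for a few cluster sizes (the numbers are only an illustration, not results from the paper).

    import math

    # P(|P_M - mu| >= eps) <= 2 * exp(-2 * m_k * eps^2)
    def hoeffding_bound(m_k, eps):
        return 2.0 * math.exp(-2.0 * m_k * eps ** 2)

    for m_k in (20, 100, 1000):
        print(m_k, round(hoeffding_bound(m_k, eps=0.1), 3))
    # 20   -> 1.341 (vacuous: the bound exceeds 1)
    # 100  -> 0.271
    # 1000 -> 0.0   (about 4e-9)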

Further evidence of the limits of the majority rule is provided by the exam-

ination of the ROC curve [12]. Using the majority vote to assign classes for the

discovered clusters generates a ROC curve where instances are ranked depending

on the clusters. Consequently, the ROC curve presents a sequence of steps. The

area under the ROC curve is therefore suboptimal compared to a ROC curve

that is obtained from a more reﬁned ranking of the examples, e.g., when class

probabilities are dependent upon each example, rather than groups of examples.

One way to overcome these limits is to use local prediction models in each cluster, hoping to get better prediction rules than the majority one. However, these local models should: 1) be trainable with few instances; 2) not overfit; 3) ideally, not require any user parameters, to avoid the need for local cross-validation; 4) have a linear algorithmic complexity O(m) in the learning phase, where m is the number of examples; 5) not be used when the information is insufficient and the majority rule is the best model we can hope for; 6) keep (or even improve) the initial interpretation qualities of the global model. Regarding item (1), a large study has been conducted in [22] to test the prediction performance of the most common classifiers as a function of the number of training instances. One prominent finding was that the Naive Bayes (NB) classifier often reaches good prediction performance using only few examples (Bouchard & Triggs's study [6] confirms this result). This remains valid even when features receive weights (e.g., Averaging Naive Bayes (ANB) and Selective Naive Bayes (SNB) [16]). We defer the discussion of the other items to Section 3 on experimental results.

In our experiments, we used the following procedure to label each incoming data point $X$: i) $X$ is redescribed in the space $\Phi(\mathcal{X})$ using the method described earlier in this section; ii) $X$ is assigned to the cluster $k$ corresponding to the nearest center; iii) if a local model $l$ exists in this cluster, it is used to predict the class of $X$ (and the probability memberships), i.e., the predicted class is $\operatorname*{argmax}_{1 \le j \le J} P_{SNB_l}(C_j|X)$; otherwise the majority vote is used ($P_{SNB_l}(C_j|X)$ is described in the next section).
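A sketch of this deployment-time procedure, reusing the hypothetical redescribe helper sketched earlier and the outputs of predictive_kmeans, could be as follows; the local models are assumed to expose a scikit-learn-like predict interface.

    import numpy as np

    def predict(X_new, models, centers, majority, local_models):
        """Predict classes of new instances following steps i)-iii)."""
        X_hat = redescribe(X_new, models)                    # i) map to Phi(X)
        d2 = ((X_hat[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)                          # ii) nearest center
        preds = []
        for x_hat, k in zip(X_hat, nearest):
            local = local_models.get(k)
            if local is not None:                            # iii) local model if it exists
                preds.append(local.predict(x_hat[None, :])[0])
            else:                                            # fall back to majority vote
                preds.append(majority[k])
        return np.array(preds)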

3 Comparison with supervised algorithms

3.1 The chosen set of classiﬁers

To test the ability of our algorithm to exhibit high predictive performance while

at the same time being able to uncover interesting clusters in the diﬀerent data

sets, we have compared it with three powerful classiﬁers (in the spirit of, or

close to our algorithm) from the state of the art: Logistic Model Tree (LMT) [15],

Naive Bayes Tree (NBT) [14] and Selective Naive Bayes (SNB) [9]. This section
briefly describes these classifiers.

•Logistic Model Tree (LMT)[15] combines logistic regression and decision

trees. It seeks to improve the performance of decision trees. Instead of associating

each leaf of the tree to a single label and a single probability vector (piecewise

constant model), a logistic regression model is trained on the instances assigned

to each leaf to estimate an appropriate vector of probabilities for each test in-

stance (piecewise linear regression model). The LogitBoost algorithm is used to fit a logistic regression model at each node, and the node is then split using the information gain as impurity criterion.

• Naive Bayes Tree (NBT) [14] is a hybrid algorithm, which deploys a

naive Bayes classiﬁer on each leaf of the built decision tree. NBT is a classiﬁer

which has often exhibited good performance compared to the standard decision

trees and naive Bayes classiﬁer.

• Selective Naive Bayes (SNB) is a variant of NB. One way to average a large number of selective naive Bayes classifiers obtained with different subsets of features is to use one model only, but with feature weighting [9]. Under the hypothesis of feature independence conditionally on the classes, the Bayes formula becomes:

$$ P(j|X) = \frac{P(j) \prod_{f} P(X_f|j)^{W_f}}{\sum_{j'=1}^{J} \left[ P(j') \prod_{f} P(X_f|j')^{W_f} \right]} $$

where $W_f$ represents the weight of the feature $f$, $X_f$ is component $f$ of $X$, and $j$ is a class label. The predicted class is the one that maximizes the conditional probability $P(j|X)$. The probabilities $P(X_f|j)$ can be estimated by interval, using a discretization, for continuous features. For categorical features, this estimation can be done directly if the feature has few distinct modalities; otherwise, a grouping of the modalities is used. The resulting algorithm proves to be quite efficient on many real data sets [13].
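The weighted naive Bayes rule above can be written compactly in log space; the sketch below assumes the per-feature conditional log-probabilities and the weights $W_f$ have already been estimated (their estimation, via [9], is not shown, and the helper name is ours).

    import numpy as np

    def snb_predict_proba(log_cond, log_prior, weights):
        """Weighted naive Bayes (SNB-style) posterior for one instance.

        log_cond  : (d, J) array of log P(X_f | j) for the instance's feature values
        log_prior : (J,) array of log P(j)
        weights   : (d,) array of feature weights W_f in [0, 1]
        """
        log_post = log_prior + (weights[:, None] * log_cond).sum(axis=0)
        log_post -= log_post.max()               # for numerical stability
        post = np.exp(log_post)
        return post / post.sum()                 # normalized P(j | X)

    # predicted class = argmax_j of the returned vector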

• Predictive K-Means (PKMMV, PKMSNB): (i) PKMMV corresponds to the Pre-

dictive K-Means described in Algorithm 1 where prediction is done according to

the Majority Vote; (ii) PKMSNB corresponds to the Predictive K-Means described

in Algorithm 1 where prediction is done according to a local classiﬁcation model.

•Unsupervised K-Means (KMMV )is the usual unsupervised K-Means with

prediction done using the Majority Vote in each cluster. This classiﬁer is given

for comparison as a baseline method. The pre-processing is not supervised and

the initialization used is k-means++ [2] (in this case since the initialization is

not deterministic we run k-means 25 times and we keep the best initialization

according to the Mean Squared Error). Among the existing unsupervised pre-

processing approaches [21], depending on the nature of the features, continuous

or categorical, we used:

– For numerical attributes: Rank Normalization (RN). The purpose of rank normalization is to rank the continuous feature values and then scale the feature into $[0, 1]$. The steps of this approach are: (i) rank the values of feature $u$ from lowest to highest and divide the resulting vector into $H$ intervals, where $H$ is the number of intervals; (ii) assign to each interval a label $r \in \{1, \ldots, H\}$ in increasing order; (iii) if $X_{iu}$ belongs to the interval $r$, then $X'_{iu} = \frac{r}{H}$. In our experiments, we use $H = 100$.
– For categorical attributes: we chose to use a Basic Grouping Approach (BGB). It aims at transforming the feature values into a vector of Boolean values. The steps of this approach are: (i) group the feature values into $g$ groups with frequencies as equal as possible, where $g$ is a parameter given by the user; (ii) assign to each group a label $r \in \{1, \ldots, g\}$; (iii) use a full disjunctive coding. In our experiments, we use $g = 10$. A sketch of these two preprocessings is given below.
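A minimal NumPy-only sketch of these two unsupervised preprocessings might look like this; it follows the steps listed above and is only meant to make them concrete (the grouping step is a crude approximation of the equal-frequency split).

    import numpy as np

    def rank_normalize(x, H=100):
        """Rank Normalization: map a continuous feature to {1/H, ..., H/H}."""
        ranks = np.argsort(np.argsort(x))                 # 0 .. m-1, ties broken arbitrarily
        r = np.floor(ranks * H / len(x)).astype(int) + 1  # interval label in {1, ..., H}
        return r / H

    def basic_grouping(x, g=10):
        """Basic grouping: g roughly equal-frequency groups, full disjunctive coding."""
        values, counts = np.unique(x, return_counts=True)
        order = np.argsort(-counts)                       # most frequent values first
        group_of_value = {}
        for i, idx in enumerate(order):
            group_of_value[values[idx]] = i % g           # crude equal-frequency split
        groups = np.array([group_of_value[v] for v in x])
        return np.eye(g)[groups]                          # (m, g) 0/1 indicator matrix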

Fig. 2. Differences between the three types of "classification" (figure not reproduced here).

In Figure 2 we suggest a two-axis picture to situate the algorithms described above: a vertical axis for their ability to describe (explain) the data (from low to high) and a horizontal axis for their ability to predict the labels (from low to high). The selected classifiers exemplify various trade-offs between prediction performance and explanatory power: (i) KMMV, more dedicated to description, would appear in the top left corner; (ii) LMT, NBT and SNB, dedicated to prediction, would go in the bottom right corner; and (iii) PKMMV and PKMSNB would lie in between. Ideally, our algorithm PKMSNB should place itself in the top right quadrant of such a figure, with both good prediction and good description performance.

Note that in the reported experiments, $K = J$ (i.e., the number of clusters equals the number of classes). This choice, which biases the algorithm towards finding one cluster per class, is detrimental for predictive clustering and thus sets a lower bound on the performance that can be expected of such an approach.

3.2 Experimental protocol

The comparison of the algorithms has been performed on 8 different datasets from
the UCI repository [18]. These datasets were chosen for their diversity in terms of
number of classes, features (categorical and numerical) and number of instances (see Table 1).

Datasets        Instances   #Vn   #Vc   #Classes
Glass                 214    10     0      6
Pima                  768     8     0      2
Vehicle               846    18     0      4
Segmentation         2310    19     0      7
Waveform             5000    40     0      3
Mushroom             8416     0    22      2
Pendigits           10992    16     0     10
Adult               48842     7     8      2

Table 1. The datasets used. Vn: numerical features, Vc: categorical features.

Evaluation of the performance: In order to compare the performance of the algorithms presented above, the same train/test folds have been used for all of them. The results presented in Section 3.3 are those obtained in the test phase using a stratified 10 × 10 folds cross-validation. The predictive performance of the algorithms is evaluated using the AUC (area under the ROC curve), computed as follows: $AUC = \sum_{i} P(C_i)\, AUC(C_i)$, where $AUC(C_i)$ denotes the AUC value of class $i$ against all the other classes and $P(C_i)$ denotes the prior of class $i$ (the frequency of the elements of class $i$). $AUC(C_i)$ is calculated using the probability vector $P(C_i|X)$, $\forall i$.
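For reference, this class-prior-weighted one-vs-rest AUC can be computed with scikit-learn as sketched below (a helper written for illustration, not code from the paper).

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def weighted_ovr_auc(y_true, proba):
        """AUC = sum_i P(C_i) * AUC(C_i), one class versus all the others.

        y_true : (m,) array of class labels in {0, ..., J-1}
        proba  : (m, J) array of predicted class probabilities P(C_i | X)
        """
        total = 0.0
        for i in range(proba.shape[1]):
            prior = np.mean(y_true == i)            # P(C_i), class frequency
            if 0 < prior < 1:                       # AUC undefined for absent classes
                total += prior * roc_auc_score((y_true == i).astype(int), proba[:, i])
        return total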

average results (in the test phase) using ACC

Data KMMV PKMMV PKMSNB LMT NBT SNB

Glass 70.34 ±8.00 89.32 ±6.09 95.38 ±4.66 97.48 ±2.68 94.63 ±4.39 97.75 ±3.33

Pima 65.11 ±4.17 66.90 ±4.87 73.72 ±4.37 76.85 ±4.70 75.38 ±4.71 75.41 ±4.75

Vehicle 37.60 ±4.10 47.35 ±5.62 72.21 ±4.13 82.52 ±3.64 70.46 ±5.17 64.26 ±4.39

Segment 67.50 ±2.35 80.94 ±1.93 96.18 ±1.26 96.30 ±1.15 95.17 ±1.29 94.44 ±1.48

Waveform 50.05 ±1.05 49.72 ±3.39 84.04 ±1.63 86.94 ±1.69 79.87 ±2.32 83.14 ±1.49

Mushroom 89.26 ±0.97 98.57 ±3.60 99.94 ±0.09 98.06 ±4.13 95.69 ±6.73 99.38 ±0.27

PenDigits 73.65 ±2.09 76.82 ±1.33 97.35 ±1.36 98.50 ±0.35 95.29 ±0.76 89.92 ±1.33

Adult 76.07 ±0.14 77.96 ±0.41 86.81 ±0.39 83.22 ±1.80 79.41 ±7.34 86.63 ±0.40

Average 66.19 73.44 88.20 89.98 85.73 86.36

average results (in the test phase) using 100 x AUC

Data KMMV PKMMV PKMSNB LMT NBT SNB

Glass 85.72 ±5.69 96.93 ±2.84 98.27 ±2.50 97.94 ±0.19 98.67 ±2.05 99.77 ±0.54

Pima 65.36 ±5.21 65.81 ±6.37 78.44 ±5.35 83.05 ±4.61 80.33 ±5.21 80.59 ±4.78

Vehicle 65.80 ±3.36 74.77 ±3.14 91.15 ±1.75 95.77 ±1.44 88.07 ±3.04 87.19 ±1.97

Segment 91.96 ±0.75 95.24 ±0.75 99.51 ±0.32 99.65 ±0.23 98.86 ±0.51 99.52 ±0.19

Waveform 75.58 ±0.58 69.21 ±3.17 96.16 ±0.58 97.10 ±0.53 93.47 ±1.41 95.81 ±0.57

Mushroom 88.63 ±1.03 98.47 ±0.38 99.99 ±0.00 99.89 ±0.69 99.08 ±2.29 99.97 ±0.02

Pendigits 95.34 ±0.45 95.84 ±0.29 99.66 ±0.11 99.81 ±0.10 99.22 ±1.78 99.19 ±1.14

Adult 73.33 ±0.65 59.42 ±3.70 92.37 ±0.34 77.32 ±10.93 84.25 ±5.66 92.32 ±0.34

Average 80.21 81.96 94.44 93.81 92.74 94.29

Table 2. Mean performance and standard deviation for the TEST set using a 10x10

folds cross-validation process

3.3 Results

Performance evaluation: Table 2 presents the predictive performance of LMT, NBT, SNB, our algorithms PKMMV and PKMSNB, and the baseline KMMV using the ACC (accuracy) and the AUC criteria (presented as a %). These results show the very good prediction performance of the PKMSNB algorithm. Its performance is indeed comparable to those of LMT and SNB, which are the strongest ones. In addition, the use of local classifiers (algorithm PKMSNB) provides a clear advantage over the use of the majority vote in each cluster as done in PKMMV. Surprisingly, PKMSNB exhibits slightly better results than SNB, while both use naive Bayes classifiers locally and PKMSNB is hampered by the fact that K = J, the number of classes. Better performance is expected when K > J. Finally, PKMSNB appears to be slightly superior to SNB particularly for the datasets which contain highly correlated features, for instance the PenDigits database.

Discussion about local models, complexity and other factors: In Section 2, in the paragraph about label prediction in predictive clustering, we proposed a list of desirable properties for the local prediction models used in each cluster. We come back to these items, denoted (i) to (vi), in discussing Tables 2 and 3:

i) The performance in prediction is good even for the dataset Glass which

contains only 214 instances (90% for training in the 10x10 cross validation

(therefore 193 instances)).

ii) The robustness (ratio between the performance in test and training) is given

in Table 3 for the Accuracy (ACC) and the AUC. This ratio indicates that

there is no signiﬁcant overﬁtting. Moreover, by contrast to methods described

in [14, 15] (about LMT and NBT) our algorithm does not require any cross

validation for setting parameters.

iii) The only user parameter is the number of clusters (in this paper K = J).

This point is crucial to help a non-expert to use the proposed method.

iv) The preprocessing complexity (step 1 of Algorithm 1) is $O(d\, m \log m)$, the k-means has the usual complexity $O(d\, m\, J\, t)$, and the complexity for the creation of the local models is $O(d\, m^* \log m^*) + O(K(d\, m^* \log d m^*))$, where $d$ is the number of variables, $m$ the number of instances in the training dataset, and $m^*$ the average number of instances belonging to a cluster. Therefore a fast training time is possible, as indicated in Table 3 with times given in seconds (for a PC running Windows 7 Enterprise with an Intel Core i7-6820HQ 2.70 GHz CPU).

v) Only the clusters where the information is sufficient to beat the majority vote
contain a local model. Table 3 gives the percentage of pure clusters obtained
at the end of the convergence of the K-means and the percentage of clusters

with a local model (if not pure) when performing the 10x10 cross validation

(so over 100 results).

vi) Finally, the interpretation of the PKMSNB model is based on a two-level analysis. The first level consists in analyzing the profile of each cluster using histograms: a visualisation of the average profile of the overall population (each bar representing the percentage of instances having a value in the corresponding interval) and of the average profile of a given cluster makes it possible to understand why a given instance belongs to a cluster. Then, locally to a cluster, the variable importance in the local classifier (the weights $W_f$ in the SNB classifier) gives a local interpretation.

Datasets        Robustness        Training    Local models
                ACC      AUC      Time (s)    (1)      (2)      (3)
Glass           0.98     0.99     0.07        40.83    23.17    36.00
Pima            0.95     0.94     0.05         0.00    88.50    11.50
Vehicle         0.95     0.97     0.14         7.50    92.25     0.25
Segmentation    0.98     1.00     0.85        28.28    66.28     5.44
Waveform        0.97     0.99     0.73         0.00    98.00     2.00
Mushroom        1.00     1.00     0.53        50.00    50.00     0.00
Pendigits       0.99     1.00     1.81         0.00   100.00     0.00
Adult           1.00     1.00     3.57         0.00   100.00     0.00

Table 3. Elements for discussion about local models. (1) Percentage of pure clusters; (2) Percentage of non-pure clusters with a local model; (3) Percentage of non-pure clusters without a local model.

The results of our experiments and the elements (i) to (vi) show that the algorithm PKMSNB is interesting in several respects: (1) its predictive performance is comparable to that of the best competing supervised classification methods, (2) it does not require cross-validation, (3) it deals with missing values, (4) it performs a feature selection both in the clustering step and during the building of the local models. Finally, (5) it groups the categorical features into modalities, thus avoiding a complete disjunctive coding, which would involve the creation of large vectors and could complicate the interpretation of the obtained model.

The reader may find supplementary material here: https://bit.ly/2T4VhQw or here: https://bit.ly/3a7xmFF. It gives a detailed example of the interpretation of the results and some comparisons to other predictive clustering algorithms such as COBRA or MPCKmeans.

4 Conclusion and perspectives

We have shown how to turn a distance-based clustering technique, such as k-means, into a predictive clustering algorithm. Moreover, the learned representation could be used by other clustering algorithms. The resulting algorithm PKMSNB exhibits predictive performance that is most of the time as strong as the state of the art, but with the benefit of not having any parameters to adjust and therefore no cross-validation to compute. The suggested algorithm is also a good support for the interpretation of the data. Better performance can still be expected when the number of clusters is higher than the number of classes. One goal of work in progress is to find a method that would automatically discover the optimal number of clusters. In addition, we are developing a tool to help visualize the results, allowing navigation between the clusters in order to easily view the average profiles and the importance of the variables locally for each cluster.

References

1. Al-Harbi, S.H., Rayward-Smith, V.J.: Adapting k-means for supervised clustering.

Applied Intelligence 24(3), 219–226 (2006)

2. Arthur, D., Vassilvitskii, S.: K-means++: The advantages of careful seeding. In:

Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algo-

rithms. pp. 1027–1035 (2007)

3. Kim, B., Varshney, K.R., Weller, A.: Workshop on human interpretability in machine learning (WHI 2018). In: Proceedings of the 2018 ICML Workshop (2018)

4. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning

in semi-supervised clustering. In: Proceedings of the Twenty-ﬁrst International

Conference on Machine Learning (ICML) (2004)

5. Blockeel, H., Dzeroski, S., Struyf, J., Zenko, B.: Predictive Clustering. Springer-

Verlag New York (2019)

6. Bouchard, G., Triggs, B.: The tradeoﬀ between generative and discriminative clas-

siﬁers. In: IASC International Symposium on Computational Statistics (COMP-

STAT). pp. 721–728 (2004)

7. Boullé, M.: A Bayes optimal approach for partitioning the values of categorical

attributes. Journal of Machine Learning Research 6, 1431–1452 (2005)

8. Boullé, M.: MODL: a Bayes optimal discretization method for continuous at-

tributes. Machine Learning 65(1), 131–165 (2006)

9. Boullé, M.: Compression-based averaging of selective naive Bayes classifiers. Jour-

nal of Machine Learning Research 8, 1659–1685 (2007)

10. Cevikalp, H., Larlus, D., Jurie, F.: A supervised clustering algorithm for the initial-

ization of rbf neural network classiﬁers. In: Signal Processing and Communication

Applications Conference (June 2007), http://lear.inrialpes.fr/pubs/2007/CLJ07

11. Eick, C.F., Zeidat, N., Zhao, Z.: Supervised clustering - algorithms and beneﬁts. In:

International Conference on Tools with Artiﬁcial Intelligence. pp. 774–776 (2004)

12. Flach, P.: Machine learning: the art and science of algorithms that make sense of

data. Cambridge University Press (2012)

13. Hand, D.J., Yu, K.: Idiot's Bayes - not so stupid after all? International Statistical

Review 69(3), 385–398 (2001)

14. Kohavi, R.: Scaling up the accuracy of naive-bayes classiﬁers: a decision-tree hy-

brid. In: International Conference on Data Mining. pp. 202–207. AAAI Press (1996)

15. Landwehr, N., Hall, M., Frank, E.: Logistic model trees. Mach. Learn. 59(1-2)

(2005)

16. Langley, P., Sage, S.: Induction of selective bayesian classiﬁers. In: Proceedings of

the Tenth International Conference on Uncertainty in Artiﬁcial Intelligence. pp.

399–406. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1994)

17. Lemaire, V., Alaoui Ismaili, O., Cornuéjols, A.: An initialization scheme for supervised k-means. In: International Joint Conference on Neural Networks (2015)

18. Lichman, M.: UCI machine learning repository (2013)

19. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.

Cambridge University Press, New York (2008)

20. Meilă, M., Heckerman, D.: An experimental comparison of several clustering and

initialization methods. In: Conference on Uncertainty in Artiﬁcial Intelligence. pp.

386–395. Morgan Kaufmann Publishers Inc. (1998)

21. Milligan, G.W., Cooper, M.C.: A study of standardization of variables in cluster

analysis. Journal of Classiﬁcation 5(2), 181–204 (1988)

22. Salperwyck, C., Lemaire, V.: Learning with few examples: An empirical study on

leading classiﬁers. In: International Joint Conference on Neural Networks (2011)

23. Van Craenendonck, T., Dumancic, S., Van Wolputte, E., Blockeel, H.: COBRAS:

fast, iterative, active clustering with pairwise constraints. In: Proceedings of Intel-

ligent Data Analysis (2018)