Predictive K-means with local models
Vincent Lemaire1, Oumaima Alaoui Ismaili1, Antoine Cornuéjols2, Dominique Gay3
1 Orange Labs, Lannion, France
2 AgroParisTech, Université Paris-Saclay, Paris, France
3 LIM-EA2525, Université de La Réunion
Abstract. Supervised classification can be effective for prediction but sometimes weak on interpretability or explainability (XAI). Clustering, on the other hand, tends to isolate categories or profiles that can be meaningful, but there is no guarantee that they are useful for label prediction. Predictive clustering seeks to obtain the best of both worlds. Starting from labeled data, it looks for clusters that are as pure as possible with regard to the class labels. One technique consists in tweaking a clustering algorithm so that data points sharing the same label tend to aggregate together. With distance-based algorithms, such as k-means, a solution is to modify the distance used by the algorithm so that it incorporates information about the labels of the data points. In this paper, we propose another method which relies on a change of representation guided by class densities and then carries out clustering in this new representation space. We present two new algorithms using this technique and show on a variety of data sets that they are competitive for prediction performance with purely supervised classifiers while offering interpretability of the clusters discovered.
1 Introduction
While the power of predictive classifiers can sometimes be awesome on given learning tasks, their actual usability might be severely limited by the lack of interpretability of the learned hypothesis. The opacity of many powerful supervised learning algorithms has indeed become a major issue in recent years. This is why, in addition to good predictive performance as a standard goal, many learning methods have been devised to provide readable decision rules [3], degrees of belief, or other easy-to-interpret visualizations. This paper presents a predictive technique which promotes interpretability and explainability in its core design.
The idea is to combine the predictive power brought by supervised learning with the interpretability that can come from the descriptions of categories or profiles discovered using unsupervised clustering. The resulting family of techniques is variously called supervised clustering or predictive clustering. In the literature, there are two categories of predictive clustering. The first family of algorithms aims at optimizing the trade-off between description and prediction, i.e., at detecting sub-groups in each target class. By contrast, the algorithms in the second category favor prediction performance over the discovery of all underlying clusters, while still using clusters as the basis of the decision function. The hope is that the predictive performance of predictive clustering methods can approach that of supervised classifiers while their descriptive capability remains close to that of pure clustering algorithms.
Several predictive clustering algorithms have been presented over the years, for instance [1,4,10,11,23]. However, the majority of these algorithms require (i) a considerable execution time, and (ii) that numerous user parameters be set. In addition, some algorithms are very sensitive to the presence of noisy data and consequently their outputs are not easily interpretable (see [5] for a survey). This paper presents a new predictive clustering algorithm. The underlying idea is to use any existing distance-based clustering algorithm, e.g. k-means, but on a redescription space into which the target class is integrated. The resulting algorithm has several desirable properties: there are few parameters to set, its computational complexity is almost linear in m, the number of instances, it is robust to noise, its predictive performance is comparable to that obtained with classical supervised classification techniques, and it tends to produce groups of data that are easy for experts to interpret.
The remainder of this paper is organized as follows: Section 2 introduces the basis of the new algorithm, the computation of the clusters, the initialization step and the classification that is realized within each cluster. The main computation steps of the resulting predictive clustering algorithms are described in Algorithm 1. We then report experiments that deal with predictive performance in Section 3. We focus on supervised classification performance to assess whether predictive clustering can reach the performance of algorithms dedicated to supervised classification. Our algorithm is compared, on a variety of data sets, with powerful supervised classification algorithms in order to assess its value as a predictive technique, and an analysis of the results is carried out. Conclusion and perspectives are discussed in Section 4.
2 Turning the K-means algorithm predictive
The k-means algorithm is one of the simplest yet most commonly used clustering algorithms. It seeks to partition m instances (X_1, ..., X_m) into K groups (B_1, ..., B_K) so that instances which are close are assigned to the same cluster while clusters are as dissimilar as possible. The objective function can be defined as:

G = argmin_{B_i} Σ_{i=1}^{K} Σ_{X_j ∈ B_i} ||X_j − μ_i||²   (1)

where the μ_i are the centers of the clusters B_i and we consider the Euclidean distance.
Predictive clustering adds the constraint of maximizing cluster purity (i.e. instances in a cluster should share the same label). In addition, the goal is to provide results that are easy for end users to interpret. The objective function of Equation (1) needs to be modified accordingly.
One approach is to modify the distance used in a conventional clustering algorithm in order to incorporate information about the class of the instances. This modified distance should make differently labelled points appear more distant than in the original input space. Rather than modifying the distance, one can instead alter the input space. This is the approach taken in this paper, where the input space is partitioned according to class probabilities prior to the clustering step, thus favoring clusters of high purity. Besides the introduction of a technique for computing a new feature space, we propose as well an adapted initialization method for the modified k-means algorithm. We also show the advantage of using a specific classification method within each discovered cluster in order to improve the classification performance. The main steps of the resulting algorithm are described in Algorithm 1. In the remainder of this section, we show how each step of the usual K-means is modified to yield a predictive clustering algorithm.
Algorithm 1 Predictive K-means algorithm
Input:
- D: a data set which contains m instances. Each instance (X_i), i ∈ {1,...,m}, is described by d descriptive features and a label C_i ∈ {1,...,J}.
- K: number of clusters.
Start:
1) Supervised preprocessing of the data to represent each X_i as X̂_i in a new feature space Φ(X).
2) Supervised initialization of the centers.
repeat
  3) Assignment: generate a new partition by assigning each instance X̂_i to the nearest cluster.
  4) Representation: calculate the centers of the new partition.
until the convergence of the algorithm
5) Assign classes to the obtained clusters:
- method 1: majority vote.
- method 2: local models.
6) Predict the class of new instances in the deployment phase:
- the class of the closest cluster (if method 1 is used).
- local models (if method 2 is used).
End
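The main loop of Algorithm 1 can be sketched as follows for the K = J case. This is an illustrative sketch (function and variable names are ours, not the paper's), assuming the redescribed matrix has already been computed; it uses per-class means as the supervised initialization and the majority vote (method 1) for cluster labeling:

```python
import numpy as np

def predictive_kmeans(X_hat, y, K, n_iter=100):
    """Sketch of Algorithm 1 on an already-redescribed matrix X_hat (m x dJ).
    Restricted to the K = J case: initialization is one center per class
    (the Rocchio-style part of K++R); labeling uses majority vote."""
    classes = np.unique(y)
    assert K == len(classes)  # K = J case, as in the reported experiments
    centers = np.stack([X_hat[y == c].mean(axis=0) for c in classes])
    for _ in range(n_iter):
        # step 3: assign each instance to the nearest center (squared L2)
        d = ((X_hat[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        # step 4: recompute centers (keep old center if a cluster empties)
        new_centers = np.stack([X_hat[assign == k].mean(axis=0)
                                if np.any(assign == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # step 5, method 1: majority vote per cluster
    cluster_label = np.array([np.bincount(y[assign == k]).argmax()
                              if np.any(assign == k) else 0
                              for k in range(K)])
    return centers, cluster_label
```

At deployment, a new instance is redescribed, assigned to its nearest center, and given that cluster's label (method 1) or passed to the cluster's local model (method 2).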
A modified input space for predictive clustering - The principle of the proposed approach is to partition the input space according to the class probabilities P(C_j|X). More precisely, let the input space X be of dimension d, with numerical descriptors as well as categorical ones. An example X_i ∈ X (X_i = [X_i^(1), ..., X_i^(d)]^T) will be described in the new feature space Φ(X) by d × J components, with J being the number of classes. Each component X_i^(n) of X_i ∈ X will give J components X̂_i^(n,j), for j ∈ {1,...,J}, of the new description X̂_i in Φ(X), where X̂_i^(n,j) = log P(X^(n) = X_i^(n) | C_j), i.e., the log-likelihood values. Therefore, an example X is redescribed according to the (log-)probabilities of observing the values of the original input variables given each of the J possible classes (see Figure 1). Below, we describe a method for computing these values. But first, we analyze one property of this redescription in Φ(X) and the distance it can provide.
Fig. 1. Φ redescription scheme from d variables to d × J variables, with log-likelihood values log P(X^(n)|C_j): the original m × d table (X^(1), ..., X^(d), Y) is mapped to an m × (d × J) table (X^(1,1), ..., X^(1,J), ..., X^(d,1), ..., X^(d,J), Y).
Property of the modified distance - Let us denote by dist_B^p the new distance defined over Φ(X). For two recoded instances X̂_1 and X̂_2 ∈ R^(d×J), the formula of dist_B^p is (in the following we omit X̂ = X̂_i in the probability terms for notational simplicity):

dist_B^p(X̂_1, X̂_2) = Σ_{j=1}^{J} || log P(X̂_1|C_j) − log P(X̂_2|C_j) ||^p   (2)

where ||.||_p is a Minkowski distance. Let us now denote by Δ^p(X̂_1, X̂_2) the distance between the log-posterior probabilities of two instances X̂_1 and X̂_2. The formula of this distance is as follows:

Δ^p(X̂_1, X̂_2) = Σ_{j=1}^{J} || log P(C_j|X̂_1) − log P(C_j|X̂_2) ||^p   (3)

where, for i ∈ {1,...,m}, P(C_j|X̂_i) = P(C_j) Π_{n=1}^{d} P(X_i^(n)|C_j) / P(X̂_i) (using the hypothesis of feature independence conditionally on the target class). From the distance given in Equation (3), we obtain the following inequality:

Δ^p(X̂_1, X̂_2) ≤ dist_B^p(X̂_1, X̂_2) + J || log P(X̂_2) − log P(X̂_1) ||^p   (4)

Proof.

Δ^p = Σ_{j=1}^{J} || log P(C_j|X̂_1) − log P(C_j|X̂_2) ||^p
    = Σ_{j=1}^{J} || log( P(X̂_1|C_j) P(C_j) / P(X̂_1) ) − log( P(X̂_2|C_j) P(C_j) / P(X̂_2) ) ||^p
    = Σ_{j=1}^{J} || log P(X̂_1|C_j) − log P(X̂_1) − log P(X̂_2|C_j) + log P(X̂_2) ||^p
    ≤ Σ_{j=1}^{J} [A_j + B]

with Δ^p = Δ^p(X̂_1, X̂_2), A_j = || log P(X̂_1|C_j) − log P(X̂_2|C_j) ||^p and B = || log P(X̂_2) − log P(X̂_1) ||^p;

then Δ^p ≤ dist_B^p(X̂_1, X̂_2) + J || log P(X̂_2) − log P(X̂_1) ||^p.
This inequality expresses that two instances that are close in terms of the distance dist_B^p will also be close in terms of their probabilities of belonging to the same class. Note that the distance presented above can be integrated into any distance-based clustering algorithm.
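Inequality (4) can be checked numerically for p = 1, where it follows from the triangle inequality. The sketch below (our own illustration, with synthetic log-likelihood values and priors) computes both sides from per-class log-likelihood vectors:

```python
import numpy as np

def dist_B(L1, L2, p=1):
    """Eq. (2): sum over classes of |log P(X1|Cj) - log P(X2|Cj)|^p,
    for vectors L1, L2 of per-class log-likelihoods."""
    return np.sum(np.abs(L1 - L2) ** p)

def delta(L1, L2, prior, p=1):
    """Eq. (3): same sum on the log-posteriors, derived from the
    log-likelihoods and the class priors via Bayes' rule."""
    def log_post(L):
        joint = L + np.log(prior)               # log P(X|Cj) + log P(Cj)
        return joint - np.logaddexp.reduce(joint)  # minus log P(X)
    return np.sum(np.abs(log_post(L1) - log_post(L2)) ** p)

# numeric check of inequality (4) for p = 1, J = 2 classes
rng = np.random.default_rng(0)
prior = np.array([0.3, 0.7])
L1 = rng.normal(-2, 1, 2)                       # synthetic log-likelihoods
L2 = rng.normal(-2, 1, 2)
logPX1 = np.logaddexp.reduce(L1 + np.log(prior))  # log P(X1)
logPX2 = np.logaddexp.reduce(L2 + np.log(prior))  # log P(X2)
J = 2
assert delta(L1, L2, prior) <= dist_B(L1, L2) + J * abs(logPX2 - logPX1) + 1e-12
```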
Building the log-likelihood redescription Φ(X) - Many methods can estimate the new descriptors X̂_i^(n,j) = log P(X^(n) = X_i^(n) | C_j) from a set of examples. In our work, we use a supervised discretization method for numerical attributes and a supervised value-grouping method for categorical attributes, to obtain respectively intervals and groups of values in which P(X^(n) = X_i^(n) | C_j) can be measured. The supervised discretization method used is described in [8] and the grouping method in [7]. The two methods have been compared in extensive experiments with the corresponding state-of-the-art algorithms. These methods compute univariate partitions of the input space using supervised information: they determine the partition of the input space that optimizes the prediction of the labels of the examples given the intervals in which they fall. The method finds the best partition (number of intervals and thresholds) using a Bayes estimate. An additional bonus of the method is that outliers are automatically eliminated and missing values can be imputed.
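The redescription step can be sketched as follows. Note that the paper relies on MODL supervised discretization [8]; the sketch below (our own names) substitutes simple equal-frequency bins with Laplace smoothing as a stand-in estimator of log P(X^(n) = x | C_j):

```python
import numpy as np

def loglik_redescription(X, y, n_bins=10, alpha=1.0):
    """Sketch of the Phi redescription: each of the d features yields J
    columns of log P(X^(n)=x | C_j).  Equal-frequency binning + Laplace
    smoothing stands in for the paper's MODL supervised discretization."""
    m, d = X.shape
    classes = np.unique(y)
    J = len(classes)
    X_hat = np.empty((m, d * J))
    for n in range(d):
        # interior bin edges from quantiles (stand-in partition)
        edges = np.quantile(X[:, n], np.linspace(0, 1, n_bins + 1)[1:-1])
        b = np.searchsorted(edges, X[:, n])      # bin index per instance
        for j, c in enumerate(classes):
            counts = np.bincount(b[y == c], minlength=n_bins) + alpha
            probs = counts / counts.sum()        # P(bin | C_j)
            # column order matches Fig. 1: X^(n,1) ... X^(n,J)
            X_hat[:, n * J + j] = np.log(probs[b])
    return X_hat
```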
Initialisation of centers - Because clustering is an NP-hard problem, heuristics are needed to solve it, and the search procedure is often iterative, starting from an initialized set of prototypes. One foremost example of many such distance-based methods is the k-means algorithm. It is known that the initialization step can have a significant impact both on the number of iterations and, more importantly, on the results, which correspond to local minima of the optimization criterion (such as Equation 1 in [20]). However, by contrast to classical clustering methods, predictive clustering can use supervised information for the choice of the initial prototypes. In this study, we chose to use the K++R method. Described in [17], it follows an "exploit and explore" strategy where the class labels are first exploited before the input distribution is used for exploration, in order to get the apparently best initial centers. The main idea of this method is to dedicate one center per class (comparable to a "Rocchio" [19] solution). Each center is defined as the average vector of the instances which have the same class label. If the predefined number of clusters (K) exceeds the number of classes (J), the initialization continues using the K-means++ algorithm [2] for the K − J remaining centers, in such a way as to add diversity. This method can only be used when K ≥ J, but this is fine since in the context of supervised clustering⁴ we do not look for clusterings where K < J. The complexity of this scheme is O(m + (K − J)m) < O(mK), where m is the number of examples. When K = J, this method is deterministic.
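The K++R initialization can be sketched as follows (an illustrative sketch with our own function names): one Rocchio-style center per class, then k-means++-style sampling for the K − J remaining centers:

```python
import numpy as np

def kppr_init(X_hat, y, K, rng=None):
    """Sketch of K++R: exploit the labels first (one mean vector per
    class), then explore with k-means++ seeding for K - J extra centers.
    Deterministic when K = J."""
    rng = np.random.default_rng(rng)
    classes = np.unique(y)
    centers = [X_hat[y == c].mean(axis=0) for c in classes]
    while len(centers) < K:
        # k-means++: sample a point with probability proportional to its
        # squared distance to the nearest existing center
        d2 = np.min([((X_hat - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X_hat[rng.choice(len(X_hat), p=probs)])
    return np.stack(centers)
```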
Instance assignment and centers update - Considering the Euclidean distance and the original K-means procedure for updating centers, at each iteration each instance is assigned to the nearest cluster using the ℓ2 metric (p = 2) in the redescription space Φ(X). The K centers are then updated according to the K-means procedure. This choice of distance (Euclidean) in the adopted k-means strategy could have an influence on the (predictive) relevance of the clusters but has not been studied in this paper.

⁴ In the context of supervised clustering, it does not make sense to cluster instances into K clusters where K < J.
Label prediction in predictive clustering - Unlike classical clustering, which aims only at providing a description of the available data, predictive clustering can also be used to make predictions about new incoming examples that are unlabeled.
The commonest method used for prediction in predictive K-means is the majority vote. A new example is first assigned to the cluster of its nearest prototype, and the predicted label is the one shared by the majority of the examples of this cluster. This method is not optimal. Let us call P_M the frequency of the majority class in a given cluster. The true probability μ of this class obeys the Hoeffding inequality: P(|P_M − μ| ≥ ε) ≤ 2 exp(−2 m_k ε²), with m_k the number of instances assigned to the cluster k. If there are only 2 classes, the error rate is 1 − μ if P_M and μ are both > 0.5. But the error rate can even exceed 0.5 if P_M > 0.5 while actually μ < 0.5. The analysis is more complex in the case of more than two classes. It is not the object of this paper to investigate this further. But it is apparent that the majority rule can often be improved upon, as is the case in classical supervised learning.
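The Hoeffding bound above gives a concrete sense of how unreliable a cluster's observed majority frequency can be. A one-line helper (our own, for illustration):

```python
import math

def hoeffding_bound(m_k, eps):
    """P(|P_M - mu| >= eps) <= 2 exp(-2 m_k eps^2): upper bound on the
    probability that the observed majority-class frequency P_M in a
    cluster of m_k instances strays by eps or more from the true mu."""
    return 2 * math.exp(-2 * m_k * eps ** 2)
```

For example, a cluster of m_k = 100 instances gives `hoeffding_bound(100, 0.1)` = 2·e⁻² ≈ 0.27: a roughly one-in-four chance that the majority frequency is off by 0.1 or more, which is why the majority rule alone can be misleading on small clusters.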
Further evidence of the limits of the majority rule is provided by the examination of the ROC curve [12]. Using the majority vote to assign classes to the discovered clusters generates a ROC curve where instances are ranked depending only on their cluster. Consequently, the ROC curve presents a sequence of steps. The area under the ROC curve is therefore suboptimal compared to a ROC curve obtained from a more refined ranking of the examples, e.g., when class probabilities depend on each example rather than on groups of examples.
One way to overcome these limits is to use local prediction models in each cluster, hoping to get better prediction rules than the majority one. However, it is necessary that these local models: 1) can be trained with few instances; 2) do not overfit; 3) ideally, do not imply any user parameters, to avoid the need for local cross-validation; 4) have a linear algorithmic complexity O(m) in the learning phase, where m is the number of examples; 5) are not used in the case where the information is insufficient and the majority rule is the best model we can hope for; 6) keep (or even improve) the initial interpretation qualities of the global model. Regarding item (1), a large study has been conducted in [22] to test the prediction performance of the most common classifiers as a function of the number of training instances. One prominent finding was that the Naive Bayes (NB) classifier often reaches good prediction performance using only few examples (Bouchard & Triggs's study [6] confirms this result). This fact remains valid even when features receive weights (e.g., Averaging Naive Bayes (ANB) and Selective Naive Bayes (SNB) [16]). We defer discussion of the other items to Section 3 on experimental results.
In our experiments, we used the following procedure to label each incoming data point X: i) X is redescribed in the space Φ(X) using the method described above; ii) X is assigned to the cluster k corresponding to the nearest center; iii) if a local model l exists in the corresponding cluster, it is used to predict the class of X (and the probability memberships): the predicted class is argmax_{1≤j≤J} P_SNBl(C_j|X); otherwise the majority vote is used (P_SNBl(C_j|X) is described in the next section).
3 Comparison with supervised algorithms
3.1 The chosen set of classifiers
To test the ability of our algorithm to exhibit high predictive performance while at the same time being able to uncover interesting clusters in the different data sets, we have compared it with three powerful classifiers (in the spirit of, or close to, our algorithm) from the state of the art: Logistic Model Tree (LMT) [15], Naive Bayes Tree (NBT) [14] and Selective Naive Bayes (SNB) [9]. This section briefly describes these classifiers.
Logistic Model Tree (LMT) [15] combines logistic regression and decision trees. It seeks to improve the performance of decision trees. Instead of associating each leaf of the tree with a single label and a single probability vector (piecewise constant model), a logistic regression model is trained on the instances assigned to each leaf to estimate an appropriate vector of probabilities for each test instance (piecewise linear regression model). The LogitBoost algorithm is used to fit a logistic regression model at each node, and the node is then partitioned using information gain as the impurity function.
Naive Bayes Tree (NBT) [14] is a hybrid algorithm, which deploys a naive Bayes classifier on each leaf of the built decision tree. NBT is a classifier which has often exhibited good performance compared to standard decision trees and the naive Bayes classifier.
Selective Naive Bayes (SNB) is a variant of NB. One way to average a large number of selective naive Bayes classifiers obtained with different subsets of features is to use one model only, but with feature weighting [9]. The Bayes formula under the hypothesis of feature independence conditionally on the classes becomes:

P(j|X) = P(j) Π_f P(X_f|j)^(W_f) / Σ_{j'=1}^{J} [ P(j') Π_f P(X_f|j')^(W_f) ]

where W_f represents the weight of the feature f, X_f is component f of X, and j ranges over the class labels. The predicted class j is the one that maximizes the conditional probability P(j|X). The probabilities P(X_f|j) can be estimated by interval using a discretization for continuous features. For categorical features, this estimation can be done directly if the feature has few distinct modalities; otherwise, a grouping of modalities is used. The resulting algorithm proves to be quite efficient on many real data sets [13].
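The SNB decision rule above can be sketched in a few lines. This is our own illustrative sketch: it works in log space, where the feature weights become multipliers, and drops the normalizing denominator since it does not change the argmax:

```python
import numpy as np

def snb_predict(log_cond, log_prior, W):
    """Sketch of the SNB decision rule for one instance.
    log_cond : (J x d) matrix of log P(X_f | j) for this instance.
    log_prior: (J,) vector of log P(j).
    W        : (d,) feature weights W_f (exponents in the SNB formula).
    Returns the argmax class index; the normalizing denominator is
    omitted as it is the same for every class."""
    scores = log_prior + log_cond @ W   # log P(j) + sum_f W_f log P(X_f|j)
    return int(np.argmax(scores))
```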
Predictive K-Means (PKMMV, PKMSNB): (i) PKMMV corresponds to the Predictive K-Means described in Algorithm 1 where prediction is done according to the Majority Vote; (ii) PKMSNB corresponds to the Predictive K-Means described in Algorithm 1 where prediction is done according to a local classification model.
Unsupervised K-Means (KMMV) is the usual unsupervised K-Means with prediction done using the Majority Vote in each cluster. This classifier is given for comparison as a baseline method. The preprocessing is not supervised and the initialization used is k-means++ [2] (in this case, since the initialization is not deterministic, we run k-means 25 times and keep the best initialization according to the Mean Squared Error). Among the existing unsupervised preprocessing approaches [21], depending on the nature of the features, continuous or categorical, we used:
- for numerical attributes: Rank Normalization (RN). The purpose of rank normalization is to rank continuous feature values and then scale the feature into [0, 1]. The steps of this approach are: (i) rank the values of feature u from lowest to highest and divide the resulting vector into H intervals, where H is the number of intervals; (ii) assign to each interval a label r ∈ {1,...,H} in increasing order; (iii) if X_iu belongs to the interval r, then X'_iu = r/H. In our experiments, we use H = 100.
- for categorical attributes: we chose to use a Basic Grouping Approach (BGB). It aims at transforming feature values into a vector of Boolean values. The steps of this approach are: (i) group the feature values into g groups with frequencies as equal as possible, where g is a parameter given by the user; (ii) assign to each group a label r ∈ {1,...,g}; (iii) use a full disjunctive coding. In our experiments, we use g = 10.
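The rank normalization steps for numerical attributes can be sketched as follows (our own sketch; ties are broken by position rather than midranked):

```python
import numpy as np

def rank_normalize(u, H=100):
    """Sketch of Rank Normalization (RN): rank the values of feature u,
    cut the ranks into H equal-frequency intervals labelled 1..H, and
    map a value in interval r to r/H, scaling the feature into (0, 1]."""
    ranks = np.argsort(np.argsort(u))                  # ranks 0..m-1
    r = np.floor(ranks * H / len(u)).astype(int) + 1   # interval label 1..H
    return r / H
```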
Fig. 2. Differences between the three types of “classification”
In Figure 2 we suggest a two-axis figure to situate the algorithms described above: a vertical axis for their ability to describe (explain) the data (from low to high) and a horizontal axis for their ability to predict the labels (from low to high). The selected classifiers exemplify various trade-offs between prediction performance and explanatory power: (i) KMMV, more dedicated to description, would appear in the top left corner; (ii) LMT, NBT and SNB, dedicated to prediction, would go in the bottom right corner; and (iii) PKMMV and PKMSNB would lie in between. Ideally, our algorithm PKMSNB should place itself in the top right quadrant of this kind of figure, with both good prediction and description performance.
Note that in the reported experiments, K = J (i.e., number of clusters = number of classes). This choice, which biases the algorithm to find one cluster per class, is detrimental for predictive clustering, thus setting a lower bound on the performance that can be expected of such an approach.
3.2 Experimental protocol
The comparison of the algorithms has been performed on 8 different datasets from the UCI repository [18]. These datasets were chosen for their diversity in terms of classes, features (categorical and numerical) and number of instances (see Table 1).

Dataset      | Instances | #Vn | #Vc | #Classes
Glass        |       214 |  10 |   0 |        6
Pima         |       768 |   8 |   0 |        2
Vehicle      |       846 |  18 |   0 |        4
Segmentation |      2310 |  19 |   0 |        7
Waveform     |      5000 |  40 |   0 |        3
Mushroom     |      8416 |   0 |  22 |        2
Pendigits    |     10992 |  16 |   0 |       10
Adult        |     48842 |   7 |   8 |        2

Table 1. The datasets used; Vn: numerical features, Vc: categorical features.
Evaluation of the performance: In order to compare the performance of the algorithms presented above, the same train/test folds have been used. The results presented in Section 3.3 are those obtained in the test phase using a stratified 10 × 10-fold cross-validation. The predictive performance of the algorithms is evaluated using the AUC (area under the ROC curve). It is computed as follows: AUC = Σ_i P(C_i) · AUC(C_i), where AUC(C_i) denotes the AUC value for class i against all the other classes and P(C_i) denotes the prior of class i (the frequency of elements in class i). AUC(C_i) is calculated using the probability vector P(C_i|X).
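The prior-weighted one-vs-rest AUC above can be sketched as follows (our own sketch, using the Mann-Whitney rank form of the AUC; ties are not midranked, so it is exact only for untied scores):

```python
import numpy as np

def weighted_ovr_auc(y_true, proba):
    """Prior-weighted one-vs-rest AUC: sum_i P(C_i) * AUC(class i vs rest).
    y_true: (m,) integer class labels 0..J-1.
    proba : (m x J) matrix of class membership scores P(C_i|X)."""
    total = 0.0
    for i in range(proba.shape[1]):
        pos = (y_true == i)
        ranks = proba[:, i].argsort().argsort() + 1   # 1-based score ranks
        n_pos, n_neg = pos.sum(), (~pos).sum()
        # Mann-Whitney statistic: fraction of (pos, neg) pairs ranked correctly
        auc_i = (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
        total += pos.mean() * auc_i                   # weight = prior P(C_i)
    return total
```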
average results (in the test phase) using ACC
Data KMMV PKMMV PKMSNB LMT NBT SNB
Glass 70.34 ±8.00 89.32 ±6.09 95.38 ±4.66 97.48 ±2.68 94.63 ±4.39 97.75 ±3.33
Pima 65.11 ±4.17 66.90 ±4.87 73.72 ±4.37 76.85 ±4.70 75.38 ±4.71 75.41 ±4.75
Vehicle 37.60 ±4.10 47.35 ±5.62 72.21 ±4.13 82.52 ±3.64 70.46 ±5.17 64.26 ±4.39
Segment 67.50 ±2.35 80.94 ±1.93 96.18 ±1.26 96.30 ±1.15 95.17 ±1.29 94.44 ±1.48
Waveform 50.05 ±1.05 49.72 ±3.39 84.04 ±1.63 86.94 ±1.69 79.87 ±2.32 83.14 ±1.49
Mushroom 89.26 ±0.97 98.57 ±3.60 99.94 ±0.09 98.06 ±4.13 95.69 ±6.73 99.38 ±0.27
PenDigits 73.65 ±2.09 76.82 ±1.33 97.35 ±1.36 98.50 ±0.35 95.29 ±0.76 89.92 ±1.33
Adult 76.07 ±0.14 77.96 ±0.41 86.81 ±0.39 83.22 ±1.80 79.41 ±7.34 86.63 ±0.40
Average 66.19 73.44 88.20 89.98 85.73 86.36
average results (in the test phase) using 100 x AUC
Data KMMV PKMMV PKMSNB LMT NBT SNB
Glass 85.72 ±5.69 96.93 ±2.84 98.27 ±2.50 97.94 ±0.19 98.67 ±2.05 99.77 ±0.54
Pima 65.36 ±5.21 65.81 ±6.37 78.44 ±5.35 83.05 ±4.61 80.33 ±5.21 80.59 ±4.78
Vehicle 65.80 ±3.36 74.77 ±3.14 91.15 ±1.75 95.77 ±1.44 88.07 ±3.04 87.19 ±1.97
Segment 91.96 ±0.75 95.24 ±0.75 99.51 ±0.32 99.65 ±0.23 98.86 ±0.51 99.52 ±0.19
Waveform 75.58 ±0.58 69.21 ±3.17 96.16 ±0.58 97.10 ±0.53 93.47 ±1.41 95.81 ±0.57
Mushroom 88.63 ±1.03 98.47 ±0.38 99.99 ±0.00 99.89 ±0.69 99.08 ±2.29 99.97 ±0.02
Pendigits 95.34 ±0.45 95.84 ±0.29 99.66 ±0.11 99.81 ±0.10 99.22 ±1.78 99.19 ±1.14
Adult 73.33 ±0.65 59.42 ±3.70 92.37 ±0.34 77.32 ±10.93 84.25 ±5.66 92.32 ±0.34
Average 80.21 81.96 94.44 93.81 92.74 94.29
Table 2. Mean performance and standard deviation for the TEST set using a 10x10
folds cross-validation process
3.3 Results
Performance evaluation: Table 2 presents the predictive performance of LMT, NBT, SNB, our algorithms PKMMV and PKMSNB, and the baseline KMMV using the ACC (accuracy) and AUC criteria (presented as percentages). These results show the very good prediction performance of the PKMSNB algorithm. Its performance is indeed comparable to those of LMT and SNB, which are the strongest ones. In addition, the use of local classifiers (algorithm PKMSNB) provides a clear advantage over the use of the majority vote in each cluster as done in PKMMV. Surprisingly, PKMSNB exhibits slightly better results than SNB, even though both use naive Bayes classifiers locally and PKMSNB is hampered by the fact that K = J, the number of classes. Better performance is expected when K ≥ J. PKMSNB appears to be superior to SNB particularly for the datasets which contain highly correlated features, for instance the Pendigits database.
Discussion about local models, complexity and other factors: In Section 2, in the paragraph about label prediction in predictive clustering, we proposed a list of desirable properties for the local prediction models used in each cluster. We come back to these items, denoted (i) to (vi), in discussing Tables 2 and 3:
i) The prediction performance is good even for the dataset Glass, which contains only 214 instances (90% for training in the 10x10 cross-validation, therefore 193 instances).
ii) The robustness (ratio between the performance in test and in training) is given in Table 3 for the accuracy (ACC) and the AUC. This ratio indicates that there is no significant overfitting. Moreover, by contrast to the methods described in [14,15] (LMT and NBT), our algorithm does not require any cross-validation for setting parameters.
iii) The only user parameter is the number of clusters (in this paper K = J). This point is crucial to help a non-expert use the proposed method.
iv) The preprocessing complexity (step 1 of Algorithm 1) is O(d m log m), the k-means has the usual complexity O(d m J t), and the complexity for the creation of the local models is O(d m log m) + O(K(d m̄ log(d m̄))), where d is the number of variables, m the number of instances in the training dataset, and m̄ the average number of instances belonging to a cluster. Therefore a fast training time is possible, as indicated in Table 3 with times given in seconds (for a PC with Windows 7 Enterprise and an Intel Core i7-6820HQ CPU at 2.70 GHz).
v) Only clusters where the information is sufficient to beat the majority vote contain a local model. Table 3 gives the percentage of pure clusters obtained at the convergence of the K-means and the percentage of clusters with a local model (if not pure) when performing the 10x10 cross-validation (so over 100 results).
vi) Finally, the interpretation of the PKMSNB model is based on a two-level analysis. The first level consists in analyzing the profile of each cluster using histograms. A visualisation of the average profile of the overall population (each bar representing the percentage of instances having a value in the corresponding interval) together with the average profile of a given cluster allows one to understand why a given instance belongs to a cluster. Then, locally to a cluster, the variable importance of the local classifier (the weights W_f in the SNB classifier) gives a local interpretation.
Dataset   | Robustness ACC | Robustness AUC | Training time (s) | (1)   | (2)    | (3)
Glass     | 0.98           | 0.99           | 0.07              | 40.83 |  23.17 | 36.00
Pima      | 0.95           | 0.94           | 0.05              |  0.00 |  88.50 | 11.50
Vehicle   | 0.95           | 0.97           | 0.14              |  7.50 |  92.25 |  0.25
Segment.  | 0.98           | 1.00           | 0.85              | 28.28 |  66.28 |  5.44
Waveform  | 0.97           | 0.99           | 0.73              |  0.00 |  98.00 |  2.00
Mushroom  | 1.00           | 1.00           | 0.53              | 50.00 |  50.00 |  0.00
Pendigits | 0.99           | 1.00           | 1.81              |  0.00 | 100.00 |  0.00
Adult     | 1.00           | 1.00           | 3.57              |  0.00 | 100.00 |  0.00
Table 3. Elements for the discussion about local models. (1) Percentage of pure clusters; (2) percentage of non-pure clusters with a local model; (3) percentage of non-pure clusters without a local model.
The results of our experiments and the elements (i) to (vi) show that the algorithm PKMSNB is interesting in several respects. (1) Its predictive performance is comparable to that of the best competing supervised classification methods; (2) it does not require cross-validation; (3) it deals with missing values; (4) it operates a feature selection both in the clustering step and during the building of the local models. Finally, (5) it groups the categorical features into modalities, thus allowing one to avoid a complete disjunctive coding, which involves the creation of large vectors; such a disjunctive coding could otherwise complicate the interpretation of the obtained model.
The reader may find supplementary material here: https://bit.ly/2T4VhQw or here: https://bit.ly/3a7xmFF. It gives a detailed example of the interpretation of the results and some comparisons to other predictive clustering algorithms such as COBRA or MPCKmeans.
4 Conclusion and perspectives
We have shown how to turn a distance-based clustering technique, such as
k-means, into a predictive clustering algorithm. Moreover, the learned
representation could be used by other clustering algorithms. The resulting
algorithm PKMSNB exhibits predictive performance that is, most of the time, on
par with the state of the art, with the benefit of having no parameters to
adjust and therefore no cross-validation to compute. The suggested algorithm is
also a good support for interpreting the data. Better performance can still be
expected when the number of clusters is higher than the number of classes; one
goal of work in progress is to find a method that automatically discovers the
optimal number of clusters. In addition, we are developing a visualization tool
that allows navigating between clusters in order to easily view the average
profiles and the locally important variables of each cluster.
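The change of representation guided by class densities, which is the core of the approach, can be sketched as follows. This is an illustrative simplification: equal-frequency binning stands in for the MODL supervised discretization of [8], each feature value is recoded as the empirical vector of P(class | bin), and the function name and parameters are ours.

```python
import numpy as np

def class_density_representation(X, y, n_bins=5):
    """Recode each numeric feature as empirical class-conditional
    probabilities P(class | bin), then concatenate the per-feature blocks.
    A plain k-means run on the returned matrix tends to produce clusters
    that are pure with respect to the class labels."""
    classes = np.unique(y)
    blocks = []
    for j in range(X.shape[1]):
        # interior quantiles -> n_bins equal-frequency bins for feature j
        edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        bins = np.digitize(X[:, j], edges)
        probs = np.zeros((n_bins, len(classes)))
        for b in range(n_bins):
            in_b = bins == b
            if in_b.any():
                probs[b] = [(y[in_b] == c).mean() for c in classes]
        blocks.append(probs[bins])   # one P(class | bin) row per data point
    return np.hstack(blocks)         # shape: (n_samples, n_features * n_classes)
```

Each per-feature block of the output is a probability distribution over the classes, so points that behave alike with respect to the labels end up close in the new space, which is what makes the subsequent k-means predictive.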
References
1. Al-Harbi, S.H., Rayward-Smith, V.J.: Adapting k-means for supervised clustering.
Applied Intelligence 24(3), 219–226 (2006)
2. Arthur, D., Vassilvitskii, S.: K-means++: The advantages of careful seeding. In:
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algo-
rithms. pp. 1027–1035 (2007)
3. Kim, B., Varshney, K.R., Weller, A.: Workshop on human interpretability in
machine learning (WHI 2018). In: Proceedings of the 2018 ICML Workshop (2018)
4. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning
in semi-supervised clustering. In: Proceedings of the Twenty-first International
Conference on Machine Learning (ICML) (2004)
5. Blockeel, H., Dzeroski, S., Struyf, J., Zenko, B.: Predictive Clustering. Springer-
Verlag New York (2019)
6. Bouchard, G., Triggs, B.: The tradeoff between generative and discriminative clas-
sifiers. In: IASC International Symposium on Computational Statistics (COMP-
STAT). pp. 721–728 (2004)
7. Boullé, M.: A Bayes optimal approach for partitioning the values of categorical
attributes. Journal of Machine Learning Research 6, 1431–1452 (2005)
8. Boullé, M.: MODL: a Bayes optimal discretization method for continuous at-
tributes. Machine Learning 65(1), 131–165 (2006)
9. Boullé, M.: Compression-based averaging of selective naive Bayes classifiers. Jour-
nal of Machine Learning Research 8, 1659–1685 (2007)
10. Cevikalp, H., Larlus, D., Jurie, F.: A supervised clustering algorithm for the initial-
ization of rbf neural network classifiers. In: Signal Processing and Communication
Applications Conference (June 2007), http://lear.inrialpes.fr/pubs/2007/CLJ07
11. Eick, C.F., Zeidat, N., Zhao, Z.: Supervised clustering - algorithms and benefits. In:
International Conference on Tools with Artificial Intelligence. pp. 774–776 (2004)
12. Flach, P.: Machine learning: the art and science of algorithms that make sense of
data. Cambridge University Press (2012)
13. Hand, D.J., Yu, K.: Idiot's Bayes: not so stupid after all? International
Statistical Review 69(3), 385–398 (2001)
14. Kohavi, R.: Scaling up the accuracy of naive-bayes classifiers: a decision-tree hy-
brid. In: International Conference on Data Mining. pp. 202–207. AAAI Press (1996)
15. Landwehr, N., Hall, M., Frank, E.: Logistic model trees. Machine Learning
59(1-2) (2005)
16. Langley, P., Sage, S.: Induction of selective Bayesian classifiers. In: Proceedings
of the Tenth International Conference on Uncertainty in Artificial Intelligence. pp.
399–406. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1994)
17. Lemaire, V., Alaoui Ismaili, O., Cornuéjols, A.: An initialization scheme for
supervized k-means. In: International Joint Conference on Neural Networks (2015)
18. Lichman, M.: UCI machine learning repository (2013)
19. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
Cambridge University Press, New York (2008)
20. Meilă, M., Heckerman, D.: An experimental comparison of several clustering and
initialization methods. In: Conference on Uncertainty in Artificial Intelligence. pp.
386–395. Morgan Kaufmann Publishers Inc. (1998)
21. Milligan, G.W., Cooper, M.C.: A study of standardization of variables in cluster
analysis. Journal of Classification 5(2), 181–204 (1988)
22. Salperwyck, C., Lemaire, V.: Learning with few examples: An empirical study on
leading classifiers. In: International Joint Conference on Neural Networks (2011)
23. Van Craenendonck, T., Dumancic, S., Van Wolputte, E., Blockeel, H.: COBRAS:
fast, iterative, active clustering with pairwise constraints. In: Proceedings of Intel-
ligent Data Analysis (2018)