
Localized Simultaneous Clustering and Classification

Abhimanu Kumar #1, Yubin Park ∗2, Aayush Sharma ∗3, Joydeep Ghosh ∗4

# Department of Computer Science
∗ Department of Electrical and Computer Engineering
University of Texas at Austin

1 abhimanu@cs.utexas.edu   2 yubin@ece.utexas.edu   3 asharma@ece.utexas.edu   4 ghosh@ece.utexas.edu

Abstract— We propose a model for simultaneous clustering and classification that predicts class labels in a localised way. The model is based on the assumption that larger datasets contain smaller, distinct, local groups that behave as single coherent units and share similar characteristics. A model built on this assumption offers better interpretability as well as improved results. For example, in consumer survey data, consumers with similar tastes behave similarly, so the data contains smaller subgroups with distinct behavior patterns. A model that exploits this phenomenon by clustering "similar" consumers together to predict class labels gains both interpretability and performance. Our primary focus here is classification, but we also put forward a scheme to extend the model to a regression setting. The proposed classifier not only provides better results but also good interpretability. The model also provides, for each discovered group of datapoints, the set of prominent features that most affect class membership in that subgroup. Evaluated on UCI datasets, the model performs quite well and beats the existing best results in a couple of cases. We also evaluate the model on the CDC Diabetes and Infant Mortality health care datasets and report the results.

I. INTRODUCTION

Most datasets have smaller groups within themselves that tend to exhibit similar characteristics. A model that exploits this property will produce better results and, more importantly, improved data insights. It will be capable of understanding the data subgroups within the larger dataset and able to answer questions such as "what makes one subgroup different from another" or "which features are prominent in which subgroup". The answers to these questions are especially important in practice. For example, if the data is a consumer survey, then discovering smaller cliques of consumers within the larger consumer base makes important business sense: the surveyor or company no longer has to deal with each consumer individually and can adopt a different business strategy for each consumer clique based on the clique's properties. Moreover, instead of having a single strategy for all their consumers, they can serve consumers better by taking these smaller consumer cliques into account.

Our aim in this paper is to exploit this localised information to provide better predictors. We propose and evaluate a classification-based model called LSC2 (Localised Simultaneous Clustering & Classification). Moreover, the approach is generic and can be easily extended to a regression setting. The LSC2 model improves upon the results obtained by the "cluster first, then classify" approach. There is an optimal number of clusters or subgroups K for which the model gives the best result; this K can be understood as the optimal number of subgroups in the data. The model finds this optimal K as well as the set of features that most affect the class membership of the datapoints in each subgroup.

II. RELATED WORK

Real-world data with complex distribution and structure cannot be captured by a single model. To overcome this, the data is divided into homogeneous groups that are described more accurately through local models [1]. The mixture of experts [2] is one such approach: it applies this divide-and-conquer strategy by simultaneously partitioning the input space and learning a model for each partition. The hierarchical mixture of experts [3] develops this further by adopting a tree structure over the models and shows improved performance. Simultaneous Clustering & Classification (SCC) [4], [5] approaches the problem using a clustering technique instead of gating networks. The SCC algorithm provides better accuracies on UCI datasets than several classification algorithms, as multiple local models can be fit utilizing the structural information of the data. Though these models jointly optimize both partitioning and predictive modeling, they lack intuitive interpretations of their results, as generative processes of the data are not considered.

Clustering is another technique that can be used to model such complex data. In a clustering algorithm, data points with similar characteristics are grouped together, revealing the structure of the complex dataset as a whole. Grouping similar customers or users is extremely important in marketing research, since a single optimal policy can then be applied to all the customers in a group, instead of one universal policy for everyone or a multitude of policies tailored to each individual consumer's needs. Thus, to obtain this set of optimal policies, one approach is to preserve the similarity between users by applying clustering and then building a local model for each of the clusters [6].

Name               Features × Instances   Feature Characteristics   Class Ratio
Car Evaluation     6 × 1726               Categorical               3:7
Kr-Vs-Kp           36 × 3196              Categorical               1:1
Indians Diabetes   8 × 768                Integer/Real              1:2
Tic-Tac-Toe        9 × 958                Categorical               1:2
Contraceptive      9 × 1473               Categorical/Integer       3:4
Gamma Telescope    10 × 19020             Real                      9:5

TABLE I: Properties of the datasets used

Although this approach extracts the overall structure of the data and gives easy interpretations, the performance of the localized models does not improve much, since the clustering and the classification are not jointly optimized.

Another approach to the above problem can be found in dyadic data prediction. The Simultaneous Co-clustering and Learning (SCOAL) [7] framework jointly optimizes the co-clustering and a local model per cluster, and has been shown to fit various complex dyadic data. The Bayesian framework of SCOAL (SABAE) [8] can be used to solve cold-start problems inherent to SCOAL, but real-data experiments for these models have not been done yet.

The model presented in this paper can be viewed as a variation of SABAE. It does not have the dyadic component of SABAE and predicts using single-table data. The model preserves the clustering properties while increasing prediction performance.

III. DATASETS

We evaluate the LSC2 model on 12 UCI machine learning datasets. For classification we focus only on bi-class problems for convenience, but the model is not restricted to the bi-class case and can be used for multi-class problems as well. The datasets have continuous as well as categorical features; their properties are shown in Table I. Multi-class classification datasets are converted to bi-class problems before evaluating the model on them. We also provide interpretations of the results obtained for a couple of the datasets.

IV. OUR APPROACH

The LSC2 model is a generative framework that estimates the class labels using the posterior probability of the individual classes in localised clusters. The generative process for this scheme is described below. $X = \{x_n\}_{n=1}^{N}$ is the set of datapoints and $Y = \{y_n\}_{n=1}^{N}$ denotes their class labels.

• Sample a cluster assignment for each datapoint $x_n$ from a discrete multinomial distribution: $z_n \sim \mathrm{Disc}(\pi)$. Each datapoint $x_n$ is assigned to one of $K$ clusters, i.e. $z_n \in \{1, \ldots, K\}$.
• Sample $x_n$ from one of $K$ possible exponential-family distributions $p(x_n \mid \theta_{z_n})$, where $\theta_{z_n}$ is selected according to the cluster assignment $z_n$.
• Sample the class label $y_n$ from a logistic sigmoid function [9], i.e. $y_n \sim \sigma(\beta_{z_n}, x_n, \lambda)$. Note that we use logistic regression with a LASSO penalty (provided by the $\lambda$ parameter) to infer the class labels, but the model is not limited to this; any other valid distribution can replace the logistic scheme here. A minimal sampling sketch of this process is given after the list.
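For concreteness, the following numpy sketch samples data from this generative story. It assumes unit-variance Gaussian cluster distributions (one particular exponential-family choice) and illustrative sizes N, D, K; none of these constants come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper).
N, D, K = 1000, 5, 3

pi = rng.dirichlet(np.ones(K))          # mixture weights
mu = rng.normal(0, 3, size=(K, D))      # per-cluster means (theta)
beta = rng.normal(0, 1, size=(K, D))    # per-cluster logistic weights

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# z_n ~ Disc(pi); x_n ~ N(mu_{z_n}, I); y_n ~ Bernoulli(sigmoid(beta_{z_n}^T x_n))
z = rng.choice(K, size=N, p=pi)
X = mu[z] + rng.normal(size=(N, D))
y = rng.binomial(1, sigmoid(np.einsum('nd,nd->n', X, beta[z])))
```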

Fig. 1: Graphical Model for LSC2

The graphical model for the above generative process is shown in Figure 1. The overall joint probability of this generative scheme is:

$$p(Y, X, z \mid \pi, \theta, \beta) = \prod_{n} p(y_n \mid x_n, z_n, \beta_{z_n})\, p(x_n \mid z_n, \theta_{z_n})\, p(z_n \mid \pi) \qquad (1)$$

The probability of the observed variables $(X, Y)$ after marginalizing out the latent variables $z$ is:

$$p(Y, X \mid \pi, \theta, \beta) = \prod_{n} \sum_{z_n} p(y_n \mid x_n, z_n, \beta_{z_n})\, p(x_n \mid z_n, \theta_{z_n})\, p(z_n \mid \pi) \qquad (2)$$

LSC2 optimizes this marginal probability of the observed variables via an Expectation-Maximization (EM) algorithm [10]. Algorithm 1 describes the E and M steps of the optimization process. The model uses logistic regression with a LASSO penalty $\lambda$ for the class assignments. The $\beta$ parameter of the model is computed using the Iteratively Reweighted Least Squares (IRLS) [9] algorithm. The cluster assignment in the EM iterations is hard clustering. The LASSO penalty $\lambda$ is learned using 5-fold cross-validation, and the misclassification errors for the best $\lambda$ value are reported. A sketch of the per-cluster $\beta$ update appears below.
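One way to realize the per-cluster β update is sketched next. It substitutes scikit-learn's L1-penalized liblinear solver for the IRLS routine named above, with C = 1/λ, so it approximates rather than reproduces the authors' procedure; the function name and the handling of degenerate clusters are our own choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def update_betas(X, y, z, K, lam):
    """M-step beta update under hard clustering: one L1-penalized
    logistic fit per cluster (a stand-in for IRLS with a LASSO penalty)."""
    betas = np.zeros((K, X.shape[1]))
    for k in range(K):
        mask = (z == k)
        # Skip empty or single-class clusters; their beta stays at zero.
        if mask.sum() == 0 or len(np.unique(y[mask])) < 2:
            continue
        clf = LogisticRegression(penalty='l1', solver='liblinear', C=1.0 / lam)
        clf.fit(X[mask], y[mask])
        betas[k] = clf.coef_.ravel()
    return betas
```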

After learning all the parameters of the model, the prediction task is performed using the maximum a posteriori (MAP) estimate $(\hat{z}_{MAP}, \hat{Y}_{MAP})$. As $z$ and $Y$ are discrete variables, the probabilities for all possible pairs $(z, Y)$ can be computed directly using Eq. (1). We choose the pair $(z, Y)$ that gives the maximum probability, and the corresponding $Y$ becomes the prediction of the model. The exact algorithm for a bi-class problem is described in Algorithm 1; a small sketch of the decision rule follows.
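A possible rendering of this decision rule, assuming Gaussian cluster densities (any exponential-family member would do) and hypothetical parameter arrays:

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_map(x, pi, mu, cov, betas):
    """Score every (z, y) pair via Eq. (1) and return the y of the
    best-scoring pair, given fitted pi, mu, cov, and beta arrays."""
    best_score, best_y = -np.inf, 0
    for k in range(len(pi)):
        p_x = multivariate_normal.pdf(x, mean=mu[k], cov=cov[k])
        p_y1 = 1.0 / (1.0 + np.exp(-x @ betas[k]))  # sigmoid(beta_k^T x)
        for y_val, p_y in ((1, p_y1), (0, 1.0 - p_y1)):
            score = p_y * p_x * pi[k]
            if score > best_score:
                best_score, best_y = score, y_val
    return best_y
```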

V. EVALUATION METRIC

The evaluation metric for LSC2 is the standard misclassification error. The cluster assignment here is hard clustering, and the target label is computed using the weight parameter $\beta_z$ of the corresponding cluster. The evaluation metric is given by:

$$\mathrm{MError} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\{y_n \neq \hat{y}_n\} \qquad (3)$$

where $\hat{y}_n$ is the prediction obtained from Algorithm 1.
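In code, the metric is just the fraction of mismatched labels; a minimal check:

```python
import numpy as np

def misclassification_error(y_true, y_pred):
    # Eq. (3): fraction of points whose prediction differs from the truth.
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
```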

Algorithm 1 The E and M steps for the classification model

The likelihood of the model:
$$p(Y, X \mid \pi, \theta, \beta) = \prod_{n} \sum_{z_n} p(y_n \mid x_n, z_n)\, p(x_n \mid z_n)\, p(z_n)$$

while not converged do
    E-step:
        $\gamma_{n,z} = p(z_n = z \mid x_n, y_n, \pi, \theta, \beta) \propto p(y_n \mid \beta_z, x_n)\, p(x_n \mid \theta_z)\, p(z_n = z \mid \pi)$
        normalize: $\gamma_{n,z} \leftarrow \gamma_{n,z} \big/ \sum_{z'} \gamma_{n,z'}$
    M-step:
        $\Gamma \leftarrow \mathrm{hard\_cluster}(\Gamma)$
        $N_z = \sum_{n=1}^{N} \gamma_{n,z}, \qquad \mu_z = \frac{1}{N_z} \sum_{n=1}^{N} \gamma_{n,z}\, x_n, \qquad \pi_z = \frac{N_z}{N}$
        $\mathrm{Cov}_z = \frac{1}{N_z} \sum_{n=1}^{N} \gamma_{n,z}\, (x_n - \mu_z)^{T} (x_n - \mu_z)$
        $\beta_z = \mathrm{IRLS}(X, \Gamma_z, Y, \lambda)$
end while

Prediction:
if $\max_z\, p(y = 1 \mid \sigma(\beta_z^{T} x))\, p(x \mid \theta_z)\, p(z \mid \pi) > \max_z\, p(y = 0 \mid \sigma(\beta_z^{T} x))\, p(x \mid \theta_z)\, p(z \mid \pi)$ then
    $\hat{y} \leftarrow 1$
else
    $\hat{y} \leftarrow 0$
end if


The predicted clusters are also probed for their interpretability. We examine whether the logistic or LASSO coefficients obtained in the respective clusters provide any insights into the dataset. This interpretability is discussed in further detail in Section VI-A.

VI. RESULTS

We present here the evaluation results obtained for the LSC2 model. The results are based on 5-fold cross-validation test runs. The misclassification errors are reported for different values of K, both for the LSC2 model and for the "cluster first, then classify" (Non-Sim, non-simultaneous) approach. Table II shows the results for a group of UCI datasets. As we can see, LSC2 almost always beats the Non-Sim approach. The misclassification error decreases until an optimal value of K is reached and starts increasing after that. This optimal K is the most probable number of localised clusters in the respective datasets.

Feature                          Cluster-1   Cluster-2
Number of times pregnant         1.66        19.79
Plasma glucose concentration     0.23        2.85
Diastolic blood pressure         -0.54       -7.28
Triceps skin fold thickness      -2.05       1.82
Serum insulin                    0           -0.06
Body mass index                  0.75        0
Diabetes pedigree function       0           0
Age                              -0.65       1.12

TABLE III: Normalized beta values for the two most prominent clusters in the Pima Indian Diabetes dataset. The cluster ratio is 3:5.

For example, the "Indians Diabetes" dataset has two internal sub-clusters, whereas the "Car Evaluation" dataset has three and "Tic-Tac-Toe" has four. The misclassification error for the Non-Sim model also generally decreases up to some optimal value of K, but the decrease is smaller than for the LSC2 model, since the Non-Sim approach does not take into account the class labels of the datapoints being clustered. The class-label information is crucial for the LSC2 model's better classification.

A. Interpretation

From Table II it is clear that the "Indians Diabetes" dataset most probably has two internal clusters. Table III shows the normalized β values for the two most prominent clusters (which give the least misclassification error in Table II) of the Pima Indian dataset. We can see that different features affect the two clusters differently. While the feature "serum insulin" has no effect in cluster-1, the feature "body mass index" has no effect in cluster-2, as their beta values are 0. This fact could be used to treat the disease differently in the two clusters; the short helper below shows how such per-cluster coefficients can be read off.
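A small helper of this sort makes that reading mechanical; it assumes a fitted (K × D) coefficient matrix from the M-step, and the function and variable names are hypothetical:

```python
import numpy as np

def prominent_features(betas, feature_names, top=3):
    """Per cluster, list the features with the largest |beta| and the
    features pruned to exactly zero by the LASSO penalty."""
    for k, b in enumerate(np.asarray(betas)):
        order = np.argsort(-np.abs(b))
        strongest = [feature_names[i] for i in order[:top]]
        inactive = [feature_names[i] for i in np.flatnonzero(b == 0)]
        print(f"cluster-{k + 1}: strongest={strongest}, inactive={inactive}")
```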

Figure 2a shows the normalized β values for the different features of "Gamma Telescope". This dataset deals with image patterns created by Cherenkov photons due to radiation. The aim is to determine whether a photon shower was created by primary gamma rays (primary) or by cosmic rays in the upper atmosphere (background). It is a bi-class problem with two classes: 1) primary, which is class 0, and 2) background, which is class 1. We observe from Table II that there are three most prominent clusters in this dataset. Figure 2a shows the normalized β values for these 3 clusters. It can be seen that the features "fConc" and "fConc1" have no effect on the class labels of the datapoints, as their βs are zero. The feature "fSize" contributes mostly to class 1 for cluster C-3, whereas it contributes mostly to class 0 for clusters C-1 and C-2. Moreover, "fWidth" is prominent in class 0 for clusters C-1 and C-3, whereas it is prominent in class 1 for cluster C-2.

Figure 2b shows the feature means of the two most discriminating features among the 3 clusters: "fWidth" and "fSize". The bubble sizes are proportional to the sizes of the clusters. It is clear that clusters C-1 and C-2 are of similar sizes, while cluster C-3 is slightly bigger. Moreover, the feature means of the 3 clusters are close together for the given features. These insights can be used to better study the image patterns created and the reasons behind their creation.

Dataset            Model     K=1      K=2      K=3      K=4      K=5
Car Evaluation     LSC2      9.82%    8.06%    7.11%    9.94%    10.12%
                   Non-Sim   9.82%    8.89%    9.24%    11.62%   13.48%
Kr-Vs-Kp           LSC2      7.36%    6.77%    7.04%    6.95%    7.32%
                   Non-Sim   7.36%    6.96%    7.59%    8.48%    8.91%
Indians Diabetes   LSC2      24.60%   22.01%   22.35%   22.86%   23.70%
                   Non-Sim   24.60%   24.21%   26.32%   26.05%   28.98%
Tic-Tac-Toe        LSC2      29.68%   25.26%   22.42%   22.00%   22.83%
                   Non-Sim   29.68%   25.83%   24.52%   26.21%   27.94%
Contraceptive      LSC2      33.19%   30.27%   29.15%   32.63%   33.95%
                   Non-Sim   33.19%   30.11%   30.86%   33.14%   34.20%
Gamma Telescope    LSC2      21.22%   17.27%   15.26%   15.87%   17.24%
                   Non-Sim   21.22%   17.41%   16.94%   17.62%   21.26%

TABLE II: Misclassification error for various UCI datasets.

Fig. 2: (a) Normalized β values for the features; (b) feature means of the two most prominent features, fSize and fWidth, for the Gamma Telescope dataset.


VII. DISCUSSION AND CONCLUSION

Exploiting localised cliques of data points inside bigger, complex datasets can not only provide better results but also new insights into the data itself. The proposed model, LSC2, does this job effectively, providing users with improved insights into the data. The localised-models approach is able to find clusters in a diverse set of data, as seen in Section VI-A. The approach can also be extended beyond classification tasks.

A. Extending LSC2 to a regression setting

The LSC2 model is based on exploiting localised subclusters in a bigger, complex dataset, and this approach extends easily to a regression setting. In the generative model discussed in Section IV, $y_n$ becomes a real-valued variable instead of a class label and is sampled accordingly from a suitable distribution. The graphical model for the regression setting is similar to Figure 1. A minimal sketch of this change is given below.
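The sketch reuses the classification sampler from Section IV and assumes Gaussian observation noise; the paper requires only some suitable distribution, and the sizes and noise scale here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, K = 1000, 5, 3                      # illustrative sizes
pi = rng.dirichlet(np.ones(K))
mu = rng.normal(0, 3, size=(K, D))
beta = rng.normal(0, 1, size=(K, D))

z = rng.choice(K, size=N, p=pi)           # z_n ~ Disc(pi), unchanged
X = mu[z] + rng.normal(size=(N, D))       # x_n ~ p(x | theta_{z_n}), unchanged
# Only the last step changes: y_n is now real-valued,
# y_n ~ N(beta_{z_n}^T x_n, sigma^2) instead of a Bernoulli draw.
y = np.einsum('nd,nd->n', X, beta[z]) + rng.normal(0, 0.1, size=N)
```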

A promising area of exploration in this regard is whether this approach can be extended to a transfer-learning paradigm for classification or regression tasks. For a large dataset, once the LSC2 model discovers localised clusters and classifies accordingly, these discovered clusters can be used for transfer learning. Whether these cluster features can be transferred to a similar task setting in another related domain for better results is an interesting area to explore.

REFERENCES

[1] D. J. Hand and V. Vinciotti, "Local versus global models for classification problems: Fitting models where it matters," The American Statistician, vol. 57, no. 2, 2003.
[2] R. A. Jacobs and M. I. Jordan, "Adaptive mixtures of local experts," Neural Computation, 1991.
[3] C. M. Bishop and M. Svensen, "Bayesian hierarchical mixtures of experts," in Uncertainty in Artificial Intelligence, 2003.
[4] W. Cai, S. Chen, and D. Zhang, "A simultaneous learning framework for clustering and classification," Pattern Recognition, vol. 42, pp. 1248–1259, 2009.
[5] W. Cai, S. Chen, and D. Zhang, "A multiobjective simultaneous learning framework for clustering and classification," IEEE Transactions on Neural Networks, vol. 21, no. 2, February 2010.
[6] M. Wedel and J.-B. E. M. Steenkamp, "A clusterwise regression method for simultaneous fuzzy market structuring and benefit segmentation," Journal of Marketing Research, 1991.
[7] M. Deodhar and J. Ghosh, "SCOAL: A framework for simultaneous co-clustering and learning from complex data," ACM Transactions on Knowledge Discovery from Data, 2009.
[8] A. Sharma and J. Ghosh, "Side information aware Bayesian affinity estimation," IDEAL, The University of Texas at Austin, Tech. Rep., 2011.
[9] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2007, pp. 205–209.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.