
Localized Simultaneous Clustering and Classification

Abhimanu Kumar #1, Yubin Park ∗2, Aayush Sharma ∗3, Joydeep Ghosh ∗4

# Department of Computer Science
∗ Department of Electrical and Computer Engineering
University of Texas at Austin

1 abhimanu@cs.utexas.edu   2 yubin@ece.utexas.edu   3 asharma@ece.utexas.edu   4 ghosh@ece.utexas.edu

Abstract— We propose a model for simultaneous clustering and classification that predicts class labels in a localised way. The model is based on the assumption that larger datasets contain smaller, distinct, local groups that behave as single coherent units and share similar characteristics. A model built on this assumption offers better interpretability as well as improved results. For example, in consumer survey data, consumers with similar tastes behave similarly, so the data contains smaller subgroups with distinct behavior patterns. A model that exploits this phenomenon by clustering "similar" consumers together to predict class labels gains both interpretability and performance. Our primary focus here is classification, but we also put forward a scheme to extend the model to a regression setting. The proposed classifier not only provides better results but also good interpretability. The model also provides, for each discovered group of datapoints, the set of prominent features that most affect class membership in that subgroup. Evaluated on UCI datasets, the model performs quite well and beats the existing best results in a couple of cases. We also evaluate the model on the CDC Diabetes and Infant Mortality health care datasets and report the results.

I. INTRODUCTION

Most datasets have smaller groups within themselves that tend to exhibit similar characteristics. A model that exploits this property will produce better results and, more importantly, improved data insights. It will be capable of understanding the data subgroups within the larger dataset and able to answer questions such as "what makes one subgroup different from another" or "which features are prominent in which subgroup". The answers to these questions are especially important in practice. For example, if the data is a consumer survey, then discovering smaller cliques of consumers within the larger consumer base makes important business sense: the surveyor or company no longer has to deal with each consumer individually and can adopt a different business strategy for each consumer clique based on the clique's properties. Moreover, instead of having a single strategy for all their consumers, they can serve consumers better by taking these smaller consumer cliques into account.

Our aim in this paper is to exploit this localised information to provide better predictors. We propose and evaluate a classification-based model called LSC2 (Localised Simultaneous Clustering & Classification). Moreover, the approach is generic and can be easily extended to a regression setting. The LSC2 model improves upon the results obtained by the "cluster first, then classify" approach. There is an optimal number of clusters or subgroups K for which the model gives the best result; this K can be understood as the optimal number of subgroups in the data. The model finds this optimal K as well as the set of features that most affect the class membership of the datapoints in each subgroup.

II. RELATED WORK

Real-world data with complex distribution and structure cannot be captured by a single model. To overcome this, the data is divided into homogeneous groups that are described more accurately through local models [1]. The mixture of experts [2] is one such approach: it applies this divide-and-conquer strategy by simultaneously partitioning the input space and learning a model for each partition. The hierarchical mixture of experts [3] develops this further by adopting a tree structure over the models and shows improved performance. Simultaneous Clustering & Classification (SCC) [4], [5] approaches the problem using a clustering technique instead of gating networks. The SCC algorithm provides better accuracies on UCI datasets than several classification algorithms, as multiple local models can be fit utilizing the structural information of the data. Though these models jointly optimize both partitioning and predictive modeling, they lack intuitive interpretations of their results, as generative processes of the data are not considered.

Clustering is another technique that can be used to model such complex data. In a clustering algorithm, data points with similar characteristics are grouped together, revealing the structure of the complex dataset as a whole. Grouping similar customers or users is extremely important in marketing research, since a single optimal policy can then be applied to all the customers in a group, instead of one universal policy for everyone or a multitude of policies tailored to each individual consumer's needs. Thus, to obtain this set of optimal policies, one approach is to preserve the similarity between users by applying clustering and then building a local model for each of the clusters [6].

Name               Features × Instances   Feature Characteristics   Class Ratio
Car Evaluation     6 × 1726               Categorical               3:7
Kr-Vs-Kp           36 × 3196              Categorical               1:1
Indians Diabetes   8 × 768                Integer/Real              1:2
Tic-Tac-Toe        9 × 958                Categorical               1:2
Contraceptive      9 × 1473               Categorical/Integer       3:4
Gamma Telescope    10 × 19020             Real                      9:5

TABLE I: Properties of the datasets used

Although this approach extracts the overall structure of the data and gives easy interpretations, the performance of the localized models does not improve much, since the clustering and the classification are not jointly optimized.

Another approach to the above problem can be found in dyadic data prediction. The Simultaneous Co-clustering and Learning (SCOAL) [7] framework jointly optimizes the co-clustering and a local model per cluster, and has been shown to fit various complex dyadic data. The Bayesian framework of SCOAL (SABAE) [8] can be used to solve cold-start problems inherent to SCOAL, but real-data experiments for these models have not been done yet.

The model presented in this paper can be viewed as a variation of SABAE. It does not have the dyadic component of SABAE and predicts using single-table data. The model preserves the clustering properties while increasing prediction performance.

III. DATASETS

We evaluate the LSC2 model on 12 UCI machine learning datasets. For classification we focus only on bi-class problems for convenience, but the model is not restricted to the bi-class case and can be used for multi-class problems as well. The datasets have continuous as well as categorical features; their properties are shown in Table I. Multi-class classification datasets are converted to bi-class problems before evaluating the model on them. We also provide interpretations of the results obtained for a couple of the datasets.

IV. OUR APPROACH

The LSC2 model is a generative framework that estimates the class labels using the posterior probability of the individual classes in localised clusters. The generative process for this scheme is described below. $X = \{x_n\}_{n=1}^{N}$ is the set of datapoints and $Y = \{y_n\}_{n=1}^{N}$ denotes their class labels.

• Sample a cluster assignment for each datapoint $x_n$ from a discrete multinomial distribution: $z_n \sim \mathrm{Disc}(\pi)$. Each datapoint $x_n$ is assigned to one of $K$ clusters, i.e. $z_n \in \{1, \ldots, K\}$.
• Sample $x_n$ from one of $K$ possible exponential-family distributions $p(x_n \mid \theta_{z_n})$, where $\theta_{z_n}$ is selected according to the cluster assignment $z_n$.
• Sample the class label $y_n$ from a logistic sigmoid function [9], i.e. $y_n \sim \sigma(\beta_{z_n}, x_n, \lambda)$. Note that we use logistic regression with a LASSO penalty (provided by the $\lambda$ parameter) to infer the class labels, but the model is not limited to this; any other valid distribution can replace the logistic scheme here. A minimal sampling sketch of this process is given after the list.
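For concreteness, the following numpy sketch samples data from this generative story. It assumes unit-variance Gaussian cluster distributions (one particular exponential-family choice) and illustrative sizes N, D, K; none of these constants come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper).
N, D, K = 1000, 5, 3

pi = rng.dirichlet(np.ones(K))          # mixture weights
mu = rng.normal(0, 3, size=(K, D))      # per-cluster means (theta)
beta = rng.normal(0, 1, size=(K, D))    # per-cluster logistic weights

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# z_n ~ Disc(pi); x_n ~ N(mu_{z_n}, I); y_n ~ Bernoulli(sigmoid(beta_{z_n}^T x_n))
z = rng.choice(K, size=N, p=pi)
X = mu[z] + rng.normal(size=(N, D))
y = rng.binomial(1, sigmoid(np.einsum('nd,nd->n', X, beta[z])))
```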

Fig. 1: Graphical Model for LSC2

The graphical model for the above generative process is shown in Figure 1. The overall joint probability of this generative scheme is:

$$p(Y, X, z \mid \pi, \theta, \beta) = \prod_{n} p(y_n \mid x_n, z_n, \beta_{z_n})\, p(x_n \mid z_n, \theta_{z_n})\, p(z_n \mid \pi) \qquad (1)$$

The probability of the observed variables $(X, Y)$ after marginalizing out the latent variables $z$ is:

$$p(Y, X \mid \pi, \theta, \beta) = \prod_{n} \sum_{z_n} p(y_n \mid x_n, z_n, \beta_{z_n})\, p(x_n \mid z_n, \theta_{z_n})\, p(z_n \mid \pi) \qquad (2)$$

LSC2 optimizes this marginal probability of the observed variables via an Expectation-Maximization (EM) algorithm [10]. Algorithm 1 describes the E and M steps of the optimization process. The model uses logistic regression with a LASSO penalty $\lambda$ for the class assignments. The $\beta$ parameter of the model is computed using the Iteratively Reweighted Least Squares (IRLS) [9] algorithm. The cluster assignment in the EM iterations is hard clustering. The LASSO penalty $\lambda$ is learned using 5-fold cross-validation, and the misclassification errors for the best $\lambda$ value are reported. A sketch of the per-cluster $\beta$ update appears below.
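One way to realize the per-cluster β update is sketched next. It substitutes scikit-learn's L1-penalized liblinear solver for the IRLS routine named above, with C = 1/λ, so it approximates rather than reproduces the authors' procedure; the function name and the handling of degenerate clusters are our own choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def update_betas(X, y, z, K, lam):
    """M-step beta update under hard clustering: one L1-penalized
    logistic fit per cluster (a stand-in for IRLS with a LASSO penalty)."""
    betas = np.zeros((K, X.shape[1]))
    for k in range(K):
        mask = (z == k)
        # Skip empty or single-class clusters; their beta stays at zero.
        if mask.sum() == 0 or len(np.unique(y[mask])) < 2:
            continue
        clf = LogisticRegression(penalty='l1', solver='liblinear', C=1.0 / lam)
        clf.fit(X[mask], y[mask])
        betas[k] = clf.coef_.ravel()
    return betas
```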

After learning all the parameters of the model, the prediction task is performed using the maximum a posteriori (MAP) estimate $(\hat{z}_{MAP}, \hat{Y}_{MAP})$. As $z$ and $Y$ are discrete variables, the probabilities for all possible pairs $(z, Y)$ can be computed directly using Eq. (1). We choose the pair $(z, Y)$ that gives the maximum probability, and the corresponding $Y$ becomes the prediction of the model. The exact algorithm for a bi-class problem is described in Algorithm 1; a small sketch of the decision rule follows.
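A possible rendering of this decision rule, assuming Gaussian cluster densities (any exponential-family member would do) and hypothetical parameter arrays:

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_map(x, pi, mu, cov, betas):
    """Score every (z, y) pair via Eq. (1) and return the y of the
    best-scoring pair, given fitted pi, mu, cov, and beta arrays."""
    best_score, best_y = -np.inf, 0
    for k in range(len(pi)):
        p_x = multivariate_normal.pdf(x, mean=mu[k], cov=cov[k])
        p_y1 = 1.0 / (1.0 + np.exp(-x @ betas[k]))  # sigmoid(beta_k^T x)
        for y_val, p_y in ((1, p_y1), (0, 1.0 - p_y1)):
            score = p_y * p_x * pi[k]
            if score > best_score:
                best_score, best_y = score, y_val
    return best_y
```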

V. EVALUATION METRIC

The evaluation metric for LSC2 is the standard misclassification error. The cluster assignment here is hard clustering, and the target label is computed using the weight parameter $\beta_z$ of the corresponding cluster. The evaluation metric is given by:

$$\mathrm{MError} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\{y_n \neq \hat{y}_n\} \qquad (3)$$

where $\hat{y}_n$ is the prediction obtained from Algorithm 1.
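In code, the metric is just the fraction of mismatched labels; a minimal check:

```python
import numpy as np

def misclassification_error(y_true, y_pred):
    # Eq. (3): fraction of points whose prediction differs from the truth.
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
```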

Algorithm 1 The E and M steps for the classification model

The likelihood of the model:
$$p(Y, X \mid \pi, \theta, \beta) = \prod_{n} \sum_{z_n} p(y_n \mid x_n, z_n)\, p(x_n \mid z_n)\, p(z_n)$$

while not converged do
    E-step:
        $\gamma_{n,z} = p(z_n = z \mid x_n, y_n, \pi, \theta, \beta) \propto p(y_n \mid \beta_z, x_n)\, p(x_n \mid \theta_z)\, p(z_n = z \mid \pi)$
        normalize: $\gamma_{n,z} \leftarrow \gamma_{n,z} \big/ \sum_{z'} \gamma_{n,z'}$
    M-step:
        $\Gamma \leftarrow \mathrm{hard\_cluster}(\Gamma)$
        $N_z = \sum_{n=1}^{N} \gamma_{n,z}, \qquad \mu_z = \frac{1}{N_z} \sum_{n=1}^{N} \gamma_{n,z}\, x_n, \qquad \pi_z = \frac{N_z}{N}$
        $\mathrm{Cov}_z = \frac{1}{N_z} \sum_{n=1}^{N} \gamma_{n,z}\, (x_n - \mu_z)^{T} (x_n - \mu_z)$
        $\beta_z = \mathrm{IRLS}(X, \Gamma_z, Y, \lambda)$
end while

Prediction:
if $\max_z\, p(y = 1 \mid \sigma(\beta_z^{T} x))\, p(x \mid \theta_z)\, p(z \mid \pi) > \max_z\, p(y = 0 \mid \sigma(\beta_z^{T} x))\, p(x \mid \theta_z)\, p(z \mid \pi)$ then
    $\hat{y} \leftarrow 1$
else
    $\hat{y} \leftarrow 0$
end if


The predicted clusters are also probed for their interpretability. We examine whether the logistic or LASSO coefficients obtained in the respective clusters provide any insights into the dataset. This interpretability is discussed in further detail in Section VI-A.

VI. RESULTS

We present here the evaluation results obtained for the LSC2 model. The results are based on 5-fold cross-validation test runs. The misclassification errors are reported for different values of K, both for the LSC2 model and for the "cluster first, then classify" (Non-Sim, non-simultaneous) approach. Table II shows the results for a group of UCI datasets. As we can see, LSC2 almost always beats the Non-Sim approach. The misclassification error decreases until an optimal value of K is reached and starts increasing after that. This optimal K is the most probable number of localised clusters in the respective datasets.

Feature                          Cluster-1   Cluster-2
Number of times pregnant         1.66        19.79
Plasma glucose concentration     0.23        2.85
Diastolic blood pressure         -0.54       -7.28
Triceps skin fold thickness      -2.05       1.82
Serum insulin                    0           -0.06
Body mass index                  0.75        0
Diabetes pedigree function       0           0
Age                              -0.65       1.12

TABLE III: Normalized beta values for the two most prominent clusters in the Pima Indian Diabetes dataset. The cluster ratio is 3:5.

For example, the "Indians Diabetes" dataset has two internal sub-clusters, whereas the "Car Evaluation" dataset has three and "Tic-Tac-Toe" has four. The misclassification error for the Non-Sim model also generally decreases up to some optimal value of K, but the decrease is smaller than for the LSC2 model, since the Non-Sim approach does not take into account the class labels of the datapoints being clustered. The class-label information is crucial for the LSC2 model's better classification.

A. Interpretation

From Table II it is clear that the "Indians Diabetes" dataset most probably has two internal clusters. Table III shows the normalized β values for the two most prominent clusters (which give the least misclassification error in Table II) of the Pima Indian dataset. We can see that different features affect the two clusters differently. While the feature "serum insulin" has no effect in cluster-1, the feature "body mass index" has no effect in cluster-2, as their beta values are 0. This fact could be used to treat the disease differently in the two clusters; the short helper below shows how such per-cluster coefficients can be read off.
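A small helper of this sort makes that reading mechanical; it assumes a fitted (K × D) coefficient matrix from the M-step, and the function and variable names are hypothetical:

```python
import numpy as np

def prominent_features(betas, feature_names, top=3):
    """Per cluster, list the features with the largest |beta| and the
    features pruned to exactly zero by the LASSO penalty."""
    for k, b in enumerate(np.asarray(betas)):
        order = np.argsort(-np.abs(b))
        strongest = [feature_names[i] for i in order[:top]]
        inactive = [feature_names[i] for i in np.flatnonzero(b == 0)]
        print(f"cluster-{k + 1}: strongest={strongest}, inactive={inactive}")
```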

Figure 2a shows the normalized β values for the different features of "Gamma Telescope". This dataset deals with image patterns created by Cherenkov photons due to radiation. The aim is to determine whether a photon shower was created by primary gamma rays (primary) or by cosmic rays in the upper atmosphere (background). It is a bi-class problem with two classes: 1) primary, which is class 0, and 2) background, which is class 1. We observe from Table II that there are three most prominent clusters in this dataset. Figure 2a shows the normalized β values for these 3 clusters. It can be seen that the features "fConc" and "fConc1" have no effect on the class labels of the datapoints, as their βs are zero. The feature "fSize" contributes mostly to class 1 for cluster C-3, whereas it contributes mostly to class 0 for clusters C-1 and C-2. Moreover, "fWidth" is prominent in class 0 for clusters C-1 and C-3, whereas it is prominent in class 1 for cluster C-2.

Figure 2b shows the feature means of the two most discriminating features among the 3 clusters: "fWidth" and "fSize". The bubble sizes are proportional to the sizes of the clusters. It is clear that clusters C-1 and C-2 are of similar sizes, while cluster C-3 is slightly bigger. Moreover, the feature means of the 3 clusters are close together for the given features. These insights can be used to better study the image patterns created and the reasons behind their creation.

Dataset            Model     K=1      K=2      K=3      K=4      K=5
Car Evaluation     LSC2      9.82%    8.06%    7.11%    9.94%    10.12%
                   Non-Sim   9.82%    8.89%    9.24%    11.62%   13.48%
Kr-Vs-Kp           LSC2      7.36%    6.77%    7.04%    6.95%    7.32%
                   Non-Sim   7.36%    6.96%    7.59%    8.48%    8.91%
Indians Diabetes   LSC2      24.60%   22.01%   22.35%   22.86%   23.70%
                   Non-Sim   24.60%   24.21%   26.32%   26.05%   28.98%
Tic-Tac-Toe        LSC2      29.68%   25.26%   22.42%   22.00%   22.83%
                   Non-Sim   29.68%   25.83%   24.52%   26.21%   27.94%
Contraceptive      LSC2      33.19%   30.27%   29.15%   32.63%   33.95%
                   Non-Sim   33.19%   30.11%   30.86%   33.14%   34.20%
Gamma Telescope    LSC2      21.22%   17.27%   15.26%   15.87%   17.24%
                   Non-Sim   21.22%   17.41%   16.94%   17.62%   21.26%

TABLE II: Misclassification error for various UCI datasets.

Fig. 2: (a) Normalized β values for the features; (b) feature means of the two most prominent features, fSize and fWidth, for the Gamma Telescope dataset.


VII. DISCUSSION AND CONCLUSION

Exploiting localised cliques of data points inside bigger, complex datasets can not only provide better results but also new insights into the data itself. The proposed model, LSC2, does this job effectively, providing users with improved insights into the data. The localised-models approach is able to find clusters in a diverse set of data, as seen in Section VI-A. The approach can also be extended beyond classification tasks.

A. Extending LSC2 to a regression setting

The LSC2 model is based on exploiting localised subclusters in a bigger, complex dataset, and this approach extends easily to a regression setting. In the generative model discussed in Section IV, $y_n$ becomes a real-valued variable instead of a class label and is sampled accordingly from a suitable distribution. The graphical model for the regression setting is similar to Figure 1. A minimal sketch of this change is given below.
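The sketch reuses the classification sampler from Section IV and assumes Gaussian observation noise; the paper requires only some suitable distribution, and the sizes and noise scale here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, K = 1000, 5, 3                      # illustrative sizes
pi = rng.dirichlet(np.ones(K))
mu = rng.normal(0, 3, size=(K, D))
beta = rng.normal(0, 1, size=(K, D))

z = rng.choice(K, size=N, p=pi)           # z_n ~ Disc(pi), unchanged
X = mu[z] + rng.normal(size=(N, D))       # x_n ~ p(x | theta_{z_n}), unchanged
# Only the last step changes: y_n is now real-valued,
# y_n ~ N(beta_{z_n}^T x_n, sigma^2) instead of a Bernoulli draw.
y = np.einsum('nd,nd->n', X, beta[z]) + rng.normal(0, 0.1, size=N)
```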

A promising area of exploration in this regard is whether this approach can be extended to a transfer-learning paradigm for classification or regression tasks. For a large dataset, once the LSC2 model discovers localised clusters and classifies accordingly, these discovered clusters can be used for transfer learning. Whether these cluster features can be transferred to a similar task setting in another related domain for better results is an interesting area to explore.

REFERENCES

[1] D. J. Hand and V. Vinciotti, "Local versus global models for classification problems: Fitting models where it matters," The American Statistician, vol. 57, no. 2, 2003.
[2] R. A. Jacobs and M. I. Jordan, "Adaptive mixtures of local experts," Neural Computation, 1991.
[3] C. M. Bishop and M. Svensen, "Bayesian hierarchical mixtures of experts," in Uncertainty in Artificial Intelligence, 2003.
[4] W. Cai, S. Chen, and D. Zhang, "A simultaneous learning framework for clustering and classification," Pattern Recognition, vol. 42, pp. 1248–1259, 2009.
[5] W. Cai, S. Chen, and D. Zhang, "A multiobjective simultaneous learning framework for clustering and classification," IEEE Transactions on Neural Networks, vol. 21, no. 2, February 2010.
[6] M. Wedel and J.-B. E. M. Steenkamp, "A clusterwise regression method for simultaneous fuzzy market structuring and benefit segmentation," Journal of Marketing Research, 1991.
[7] M. Deodhar and J. Ghosh, "SCOAL: A framework for simultaneous co-clustering and learning from complex data," ACM Transactions on Knowledge Discovery from Data, 2009.
[8] A. Sharma and J. Ghosh, "Side information aware Bayesian affinity estimation," IDEAL, The University of Texas at Austin, Tech. Rep., 2011.
[9] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2007, pp. 205–209.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.