Localized Simultaneous Clustering and Classification

Abhimanu Kumar¹, Yubin Park², Aayush Sharma³, Joydeep Ghosh⁴
Department of Computer Science and Department of Electrical and Computer Engineering
University of Texas at Austin
¹abhimanu@cs.utexas.edu, ²yubin@ece.utexas.edu, ³asharma@ece.utexas.edu, ⁴ghosh@ece.utexas.edu
Abstract: We propose a model for simultaneous clustering and classification to predict class labels in a localised way. The model is based on the assumption that there are smaller, distinct, local groups within bigger datasets that behave as single coherent units and have similar characteristics. A model based on this assumption will have better interpretability as well as improved results. For example, in consumer survey data, consumers with similar tastes behave similarly, and so the data has smaller subgroups within itself with distinct behavior patterns. A model that exploits this phenomenon by clustering "similar" consumers together to predict class labels will have better interpretability as well as performance. Our primary focus here is classification, but we also put forward a scheme to extend it to a regression setting. The proposed classifier not only provides better results but also good interpretability. The model also provides the set of prominent features for each discovered group of datapoints that affect class membership in that subgroup the most. The model, when evaluated over UCI datasets, performs quite well and beats existing best results in a couple of cases. We also perform model evaluation on the CDC Diabetes and Infant Mortality health care datasets and report the results.
I. INTRODUCTION
Most datasets have smaller groups within themselves that tend to exhibit similar characteristics. A model that exploits this property will have better results and, more importantly, improved data insights. It will be capable of understanding the data subgroups within the larger dataset and be able to answer questions such as "what makes one subgroup different from another" or "what features are prominent in which subgroup". The answers to these questions are very important. For example, if the data is a consumer survey, then discovering smaller cliques of consumers within the larger consumer base makes important business sense: the surveyor or company no longer has to deal with each consumer individually and can adopt different business strategies for each consumer clique based on that clique's properties. Moreover, instead of having a single strategy for all their consumers, they can serve consumers better by taking these smaller consumer cliques into account.
Our aim in this paper is to exploit this localised information to provide better predictors. We propose and evaluate a classification-based model called LSC² (Localised Simultaneous Clustering & Classification). Moreover, this approach is generic and can be easily extended to a regression setting. The LSC² model improves upon the results obtained by the "clustering first and then classifying" approach. There is an optimal number of clusters or subgroups K for which the model gives the best result. This K can be understood as the optimal number of subgroups in the data. The model finds this optimal K as well as the set of features that most affect the class membership of the datapoints in each subgroup.
II. RELATED WORK
Real-world data with complex distributions and structure cannot be captured by a single model. To overcome this, the data are divided into homogeneous groups that are described more accurately through local models [1]. The mixture of experts [2] is one such approach that utilizes this divide-and-conquer strategy by simultaneously partitioning the input space and learning models for each partition. The hierarchical mixture of experts [3] further develops this by adopting a tree structure on each model and shows improved performance. Simultaneous Clustering & Classification (SCC) [4], [5] approaches the problem using a clustering technique instead of gating networks. The SCC algorithm provides better accuracies on UCI datasets than several classification algorithms, as multiple local models can be fit using the structural information of the data. Though these models jointly optimize both partitioning and predictive modeling, they lack intuitive interpretation of their results, as the generative process of the data is not considered.

Clustering is another technique that can be used to model such complex data. In a clustering algorithm, data points with similar characteristics are grouped together, revealing the structure of the complex dataset as a whole. Grouping similar customers or users is extremely important in marketing research, since a single optimal policy can then be applied to all the customers in a group, rather than one universal policy for everyone or a multitude of policies tailored to each individual consumer's needs. Thus, to obtain this set of optimal policies, one approach is to maintain the similarity between users by applying clustering and then building multiple local models for each of these clusters [6]. Although this approach extracts the overall structure of the data and gives easy interpretations, the performance of the localized models does not improve much because clustering and classification are not jointly optimized.

TABLE I: Properties of the datasets used

Name              | Features × Dataset Size | Feature Characteristics | Class Ratio
Car Evaluation    | 6 × 1726                | Categorical             | 3:7
Kr-Vs-Kp          | 36 × 3196               | Categorical             | 1:1
Indians Diabetes  | 8 × 768                 | Integer/Real            | 1:2
Tic-Tac-Toe       | 9 × 958                 | Categorical             | 1:2
Contraceptive     | 9 × 1473                | Categorical/Integer     | 3:4
Gamma Telescope   | 10 × 19020              | Real                    | 9:5
Another approach to this problem can be found in dyadic data prediction. The Simultaneous co-clustering and learning (SCOAL) [7] framework jointly optimizes co-clustering and local models per cluster, and has been shown to fit various complex dyadic data. The Bayesian framework for SCOAL, SABAE [8], can be used to solve cold-start problems inherent to SCOAL, but real-data experiments for these models have not been reported yet.

The model presented in this paper can be viewed as a variation of SABAE. It does not have the dyadic component of SABAE and predicts using single-table data. The model here preserves clustering properties while improving prediction performance.
III. DATASETS

We evaluate the LSC² model on 12 UCI machine learning datasets. For classification we focus only on bi-class problems for convenience, but the model is not restricted to bi-class and can be used for multi-class problems as well. The datasets have continuous as well as categorical features. The properties of the datasets used are shown in Table I. The multi-class classification datasets are converted to bi-class before evaluating the model on them. We also provide interpretations of the results obtained for a couple of datasets.
IV. OUR APPROACH

The LSC² model is a generative framework that estimates the class labels using the posterior probability of the individual classes in localised clusters. The generative process for this scheme is described below. X = {x_n}_{n=1}^N is the set of datapoints and Y = {y_n}_{n=1}^N denotes their class labels.

• Sample a cluster assignment for each datapoint x_n from a discrete multinomial distribution: z_n ~ Disc(π). Each datapoint x_n is assigned to one of K clusters, i.e., z_n ∈ {1, ..., K}.
• Sample x_n from one of K possible exponential-family distributions p(x_n | θ_{z_n}), where θ_{z_n} is selected according to the cluster assignment z_n.
• Sample the class label y_n from a logistic sigmoid model [9], i.e., y_n ~ σ(β_{z_n}, x_n, λ). Note that we use logistic regression with a LASSO penalty (governed by the λ parameter) to infer the class labels, but the model is not limited to this choice and any other valid distribution can replace the logistic scheme here.
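To make the generative process concrete, here is a minimal sampling sketch (not from the paper) that draws synthetic data under this scheme, assuming spherical Gaussian cluster-conditional densities for p(x_n | θ_{z_n}) and a plain Bernoulli-logistic label model; all parameter values and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, N = 3, 5, 1000                          # clusters, feature dimension, datapoints
pi = rng.dirichlet(np.ones(K))                # mixing weights pi
mu = rng.normal(0.0, 3.0, size=(K, D))        # per-cluster means theta_z (Gaussian case)
beta = rng.normal(0.0, 1.0, size=(K, D))      # per-cluster logistic weights beta_z

z = rng.choice(K, size=N, p=pi)               # z_n ~ Disc(pi)
X = mu[z] + rng.normal(0.0, 1.0, size=(N, D)) # x_n ~ N(mu_{z_n}, I)
logits = np.einsum("nd,nd->n", X, beta[z])    # beta_{z_n}^T x_n
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))  # y_n ~ Bernoulli(sigma(beta_{z_n}^T x_n))
```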
Fig. 1: Graphical model for LSC²
The graphical model for the above generative process is shown in Figure 1. The overall joint probability of this generative scheme is:

    p(Y, X, z | π, θ, β) = ∏_n p(y_n | x_n, z_n, β_{z_n}) p(x_n | z_n, θ_{z_n}) p(z_n | π)        (1)

The probability of the observed variables (X, Y) after marginalizing out the latent variables z is:

    p(Y, X | π, θ, β) = ∏_n Σ_{z_n} p(y_n | x_n, z_n, β_{z_n}) p(x_n | z_n, θ_{z_n}) p(z_n | π)        (2)
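As a hedged illustration of Eq. (2), the following sketch evaluates the log marginal likelihood by summing the per-point joint over the K cluster assignments in log space; the Gaussian choice for p(x | θ_z) and the helper name `log_marginal` are our assumptions, not the paper's.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_marginal(X, Y, pi, mu, cov, beta):
    """log p(Y, X | pi, theta, beta) = sum_n log sum_z p(y|x,z) p(x|z) p(z), with Y in {0, 1}."""
    N, K = X.shape[0], len(pi)
    log_joint = np.empty((N, K))
    for z in range(K):
        s = 1.0 / (1.0 + np.exp(-X @ beta[z]))                # sigma(beta_z^T x_n)
        log_py = Y * np.log(s + 1e-12) + (1 - Y) * np.log(1 - s + 1e-12)
        log_px = multivariate_normal(mu[z], cov[z], allow_singular=True).logpdf(X)
        log_joint[:, z] = log_py + log_px + np.log(pi[z] + 1e-12)
    return logsumexp(log_joint, axis=1).sum()
```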
LSC² optimizes this marginal probability of the observed variables via an Expectation-Maximization algorithm [10]. Algorithm 1 describes the E and M steps in the optimization process. The model uses logistic regression with a LASSO penalty λ for the class assignments. The β parameters of the model are calculated using the Iteratively Reweighted Least Squares (IRLS) [9] algorithm. The cluster assignment in the EM iterations is hard clustering. The LASSO penalty λ is learned using 5-fold cross-validation, and the misclassification errors for the best λ value are reported.
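The IRLS step with a LASSO penalty can be approximated with an off-the-shelf L1-penalized logistic solver while λ is chosen by 5-fold cross-validation, as the paper describes; the sketch below uses scikit-learn (where C = 1/λ) and is a stand-in for, not a reproduction of, the authors' implementation. The helper name `fit_cluster_logistic` is hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def fit_cluster_logistic(X_z, Y_z, lambdas=(0.01, 0.1, 1.0, 10.0)):
    """L1-penalized logistic fit for one cluster; lambda chosen by 5-fold cross-validation."""
    grid = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear"),
        {"C": [1.0 / lam for lam in lambdas]},   # scikit-learn parameterizes the penalty as C = 1/lambda
        cv=5,
        scoring="accuracy",
    )
    grid.fit(X_z, Y_z)
    return grid.best_estimator_.coef_.ravel(), grid.best_estimator_
```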
After learning all the parameters of the model, the prediction task can be performed based on the maximum a posteriori (MAP) estimate (ẑ_MAP, Ŷ_MAP). As z and Y are discrete variables, the probabilities for all possible (z, Y) pairs can be computed directly using Eq. (1). We choose the pair (z, Y) that gives the maximum probability, and the corresponding Y becomes the prediction of the model. The exact algorithm for a bi-class problem is described in Algorithm 1.
V. EVALUATION METRIC

The evaluation metric for LSC² is the standard misclassification error. The cluster assignment done here is hard clustering, and the target label is calculated using the weight parameter β_z of the corresponding cluster. The evaluation metric is given by:

    MError = (1/N) Σ_{n=1}^{N} 1{ŷ_n ≠ y_n}        (3)

where Ŷ is the prediction obtained from Algorithm 1.
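Eq. (3) is simply the fraction of mislabeled points; a small NumPy helper (the name is ours) for reference:

```python
import numpy as np

def misclassification_error(y_true, y_pred):
    """MError = (1/N) * sum_n 1{y_hat_n != y_n}, as in Eq. (3)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))
```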
Algorithm 1 The E and M steps for the classification model

The likelihood of the model:

    p(Y, X | π, θ, β) = ∏_n Σ_z p(y_n | x_n, z_n) p(x_n | z_n) p(z_n)

while not converged do
    E-step:
        γ_{n,z} = p(z_n = z | x_n, y_n, π, θ, β)
                = p(y_n | β_z, x_n) p(x_n | θ_z) p(z_n = z | π) / Σ_{z'} p(y_n | β_{z'}, x_n) p(x_n | θ_{z'}) p(z_n = z' | π)
    M-step:
        Γ = hard_cluster(Γ)
        N_z = Σ_{n=1}^{N} γ_{n,z},    π_z = N_z / N
        μ_z = ( Σ_{n=1}^{N} γ_{n,z} x_n ) / N_z
        Cov_z = ( Σ_{n=1}^{N} γ_{n,z} (x_n − μ_z)ᵀ (x_n − μ_z) ) / N_z
        β_z = IRLS(X, Γ_z, Y, λ)
end while

Prediction:
if max_z p(y = 1 | σ(β_zᵀ x)) p(x | θ_z) p(z | π) > max_z p(y = 0 | σ(β_zᵀ x)) p(x | θ_z) p(z | π) then
    ŷ ← 1
else
    ŷ ← 0
end if
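For readers who want an executable reference, below is a minimal NumPy sketch of the E and M steps above. It assumes Gaussian cluster-conditional densities and substitutes scikit-learn's L1-penalized logistic regression for the IRLS step; the function names (`fit_lsc2`, `predict_lsc2`) and the initialization details are our own and not from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.linear_model import LogisticRegression

def fit_lsc2(X, Y, K, lam=1.0, n_iter=50, seed=0):
    """Hard-assignment EM for an LSC2-style model: Gaussian p(x|z), L1-logistic p(y|x,z), Y in {0,1}."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y)
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]             # init means at K random datapoints
    cov = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)     # shared covariance to start
    clf = [LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear").fit(X, Y)
           for _ in range(K)]                                 # warm-start each local model on all data
    z = np.full(N, -1)
    for _ in range(n_iter):
        # E-step: unnormalized log responsibilities, log p(y|x,z) + log p(x|z) + log p(z)
        log_g = np.empty((N, K))
        for k in range(K):
            p1 = clf[k].predict_proba(X)[:, 1]
            log_py = Y * np.log(p1 + 1e-12) + (1 - Y) * np.log(1 - p1 + 1e-12)
            log_px = multivariate_normal(mu[k], cov[k], allow_singular=True).logpdf(X)
            log_g[:, k] = log_py + log_px + np.log(pi[k] + 1e-12)
        z_new = log_g.argmax(axis=1)                          # hard clustering of Gamma
        if np.array_equal(z_new, z):
            break
        z = z_new
        # M-step: update pi_z, mu_z, Cov_z and refit beta_z on each cluster's points
        for k in range(K):
            idx = z == k
            if idx.sum() < D + 1 or len(np.unique(Y[idx])) < 2:
                continue                                      # keep old parameters for degenerate clusters
            pi[k] = idx.mean()
            mu[k] = X[idx].mean(axis=0)
            cov[k] = np.cov(X[idx].T) + 1e-6 * np.eye(D)
            clf[k].fit(X[idx], Y[idx])
        pi = pi / pi.sum()                                    # renormalize in case some clusters were skipped
    return {"pi": pi, "mu": mu, "cov": cov, "clf": clf, "z": z}

def predict_lsc2(model, X):
    """MAP prediction: choose the (z, y) pair with the largest joint probability."""
    X = np.asarray(X, dtype=float)
    N, K = X.shape[0], len(model["pi"])
    best = np.full((N, 2), -np.inf)                           # columns correspond to y = 0 and y = 1
    for k in range(K):
        log_px = multivariate_normal(model["mu"][k], model["cov"][k],
                                     allow_singular=True).logpdf(X)
        p1 = model["clf"][k].predict_proba(X)[:, 1]
        prior = np.log(model["pi"][k] + 1e-12)
        best[:, 0] = np.maximum(best[:, 0], np.log(1 - p1 + 1e-12) + log_px + prior)
        best[:, 1] = np.maximum(best[:, 1], np.log(p1 + 1e-12) + log_px + prior)
    return best.argmax(axis=1)
```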
The predicted clusters are also probed for their interpretability. We examine whether the logistic and LASSO coefficients obtained in the respective clusters provide any insights into the dataset. This interpretability is discussed in further detail in Section VI-A.
VI. RESULTS

We present here the evaluation results obtained for the LSC² model. The results are based on 5-fold cross-validation test runs. The misclassification errors reported are for different K values for the LSC² model as well as for the "clustering first and then classifying" (Non-Sim, non-simultaneous) approach.

Table II shows the results for a group of UCI datasets. As we can see, LSC² beats the Non-Sim approach almost always. The misclassification error decreases until an optimal value of K is reached and then starts increasing. This optimal K is the most probable number of localised clusters in the respective dataset (a sketch of this selection procedure is given after Table III below). For example, the "Indians Diabetes" dataset has two internal sub-clusters, whereas "Car Evaluation" has three and "Tic-Tac-Toe" has four. The misclassification error for the Non-Sim model also, in general, decreases until some optimal value, but the decrease is not as large as for the LSC² model, since the Non-Sim approach does not take into account the class labels of the datapoints being clustered. The class-label information provides crucial information to the LSC² model for better classification.

TABLE III: Normalized β values for the two most prominent clusters in the Pima Indians Diabetes dataset. The cluster ratio is 3:5.

Feature                       | Cluster-1 | Cluster-2
Number of times pregnant      |  1.66     |  19.79
Plasma glucose concentration  |  0.23     |   2.85
Diastolic blood pressure      | -0.54     |  -7.28
Triceps skin fold thickness   | -2.05     |   1.82
Serum insulin                 |  0        |  -0.06
Body mass index               |  0.75     |   0
Diabetes pedigree function    |  0        |   0
Age                           | -0.65     |   1.12
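As referenced above, the sweep over K in Table II can be reproduced in spirit by fitting the model for several candidate K values and keeping the one with the lowest held-out misclassification error. This sketch reuses the hypothetical `fit_lsc2`, `predict_lsc2`, and `misclassification_error` helpers from the earlier sketches and is not the authors' experimental code.

```python
from sklearn.model_selection import train_test_split

def sweep_K(X, Y, K_values=(1, 2, 3, 4, 5), seed=0):
    """Fit the model for each candidate K and report held-out misclassification error."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=seed, stratify=Y)
    errors = {}
    for K in K_values:
        model = fit_lsc2(X_tr, Y_tr, K)                       # hypothetical helper from the EM sketch
        errors[K] = misclassification_error(Y_te, predict_lsc2(model, X_te))
    best_K = min(errors, key=errors.get)                      # K with the lowest held-out error
    return best_K, errors
```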
A. Interpretation
From Table II it is clear that the "Indians Diabetes" dataset has two most probable internal clusters. Table III shows the normalized β values for the two most prominent clusters (which give the least misclassification error in Table II) for the Pima Indians dataset. We can see that different features affect the two clusters differently. While the feature "serum insulin" has no effect in cluster-1, the feature "Body mass index" has no effect in cluster-2, as their β values are 0. This fact can be used to treat the disease differently in the two clusters.

Figure 2a shows the normalized β values for the different features of "Gamma Telescope". This dataset deals with image patterns created by Cherenkov photons due to radiation. The aim is to determine whether a photon shower was created by primary gamma rays (primary) or by cosmic rays in the upper atmosphere (background). It is a bi-class problem with the two classes being: 1) primary, which is class 0, and 2) background, which is class 1. We observe from Table II that there are three most prominent clusters in this dataset. Figure 2a shows the normalized β values for these 3 clusters. It can be seen that the features "fConc" and "fConc1" have no effect on the class labels of the datapoints, as their βs are zero. The feature "fSize" contributes mostly to class 1 for cluster C-3, whereas it contributes mostly to class 0 for clusters C-1 and C-2. Moreover, "fWidth" is prominent in class 0 for clusters C-1 and C-3, whereas it is prominent in class 1 for cluster C-2.

Figure 2b shows the feature means of the two most discriminating features among the 3 clusters: "fWidth" and "fSize". The bubble sizes are proportional to the sizes of the clusters. It is clear that clusters C-1 and C-2 are of similar sizes while cluster C-3 is slightly bigger. Moreover, the feature means of the 3 clusters are close together for the given features. The above insights can be used to better study the image patterns created and the reasons behind their creation.

TABLE II: Misclassification error for various UCI datasets.

Dataset           | Model   | K = 1   | K = 2   | K = 3   | K = 4   | K = 5
Car Evaluation    | LSC²    | 9.82%   | 8.06%   | 7.11%   | 9.94%   | 10.12%
                  | Non-Sim | 9.82%   | 8.89%   | 9.24%   | 11.62%  | 13.48%
Kr-Vs-Kp          | LSC²    | 7.36%   | 6.77%   | 7.04%   | 6.95%   | 7.32%
                  | Non-Sim | 7.36%   | 6.96%   | 7.59%   | 8.48%   | 8.91%
Indians Diabetes  | LSC²    | 24.60%  | 22.01%  | 22.35%  | 22.86%  | 23.70%
                  | Non-Sim | 24.60%  | 24.21%  | 26.32%  | 26.05%  | 28.98%
Tic-Tac-Toe       | LSC²    | 29.68%  | 25.26%  | 22.42%  | 22.00%  | 22.83%
                  | Non-Sim | 29.68%  | 25.83%  | 24.52%  | 26.21%  | 27.94%
Contraceptive     | LSC²    | 33.19%  | 30.27%  | 29.15%  | 32.63%  | 33.95%
                  | Non-Sim | 33.19%  | 30.11%  | 30.86%  | 33.14%  | 34.20%
Gamma Telescope   | LSC²    | 21.22%  | 17.27%  | 15.26%  | 15.87%  | 17.24%
                  | Non-Sim | 21.22%  | 17.41%  | 16.94%  | 17.62%  | 21.26%

Fig. 2: Means and normalized β values of the feature set for the Gamma Telescope dataset. (a) Normalized β values for the features. (b) Feature means of two of the most prominent features: fSize and fWidth.
VII. DISCUSSION AND CONCLUSION
Exploiting localised cliques of data points inside bigger, complex datasets can not only provide better results but also new insights into the data itself. The proposed model, LSC², does this job effectively, providing users with improved insights into the data. The localised-models-based approach is able to find clusters in a diverse set of data, as seen in Section VI-A. This approach can also be extended beyond classification tasks.
A. Extending LSC² to a regression setting

The LSC² model is based on exploiting localised subclusters in a bigger, complex dataset. This approach can be easily extended to a regression setting. In the generative model discussed in Section IV, y_n becomes a real-valued variable instead of a class label and is sampled accordingly from a suitable distribution. The graphical model for the regression setting is similar to Figure 1.
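A minimal sketch of how the E-step changes in the regression setting, assuming a Gaussian noise model for y_n around β_{z_n}ᵀ x_n and a LASSO linear regression in place of the logistic fit; the helper name and the fixed noise variance are illustrative assumptions, not part of the paper.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def regression_hard_assignments(X, y, pi, mu, cov, beta, sigma2=1.0):
    """E-step for the regression variant: p(y|x,z) is Gaussian around beta_z^T x."""
    N, K = X.shape[0], len(pi)
    log_g = np.empty((N, K))
    for k in range(K):
        log_py = norm(loc=X @ beta[k], scale=np.sqrt(sigma2)).logpdf(y)
        log_px = multivariate_normal(mu[k], cov[k], allow_singular=True).logpdf(X)
        log_g[:, k] = log_py + log_px + np.log(pi[k] + 1e-12)
    return log_g.argmax(axis=1)   # hard assignments, exactly as in the classification case

# In the corresponding M-step, the per-cluster logistic fit would be replaced by, e.g.,
# beta[k] = sklearn.linear_model.Lasso(alpha=lam).fit(X[z == k], y[z == k]).coef_
```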
A promising area of exploration in this regard would be to examine whether this approach can be extended to a transfer learning paradigm for a classification or regression task. For a large dataset, once the LSC² model discovers localised clusters and classifies accordingly, these discovered clusters can be used for transfer learning. Whether these cluster features can be transferred to a similar task setting in another related domain for better results is an interesting area to explore.
REFERENCES
[1] D. J. Hand and V. Vinciotti, "Local versus global models for classification problems: Fitting models where it matters," The American Statistician, vol. 57, no. 2, 2003.
[2] R. A. Jacobs and M. I. Jordan, "Adaptive mixtures of local experts," Neural Computation, 1991.
[3] C. M. Bishop and M. Svensen, "Bayesian hierarchical mixtures of experts," in Uncertainty in Artificial Intelligence, 2003.
[4] W. Cai, S. Chen, and D. Zhang, "A simultaneous learning framework for clustering and classification," Pattern Recognition, vol. 42, pp. 1248–1259, 2009.
[5] ——, "A multiobjective simultaneous learning framework for clustering and classification," IEEE Transactions on Neural Networks, vol. 21, no. 2, February 2010.
[6] M. Wedel and J.-B. E. M. Steenkamp, "A clusterwise regression method for simultaneous fuzzy market structuring and benefit segmentation," Journal of Marketing Research, 1991.
[7] M. Deodhar and J. Ghosh, "SCOAL: A framework for simultaneous co-clustering and learning from complex data," ACM Transactions on Knowledge Discovery from Data, 2009.
[8] A. Sharma and J. Ghosh, "Side information aware Bayesian affinity estimation," IDEAL, The University of Texas at Austin, Tech. Rep., 2011.
[9] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2007, pp. 205–209.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.