Journal of Advances in Mathematics and Computer Science
27(5): 1-17, 2018; Article no.JAMCS.40440
ISSN: 2456-9968
(Past name: British Journal of Mathematics & Computer Science, Past ISSN: 2231-0851)
Heart Disease Diagnosis via Nonparametric Mixture
Models
Chipo Mufudza1 and Hamza Erol2
1Department of Applied Mathematics, National University of Science and Technology,
Corner Cecil Avenue and Gwanda Road, Bulawayo, Zimbabwe.
2Department of Computer Engineering, Mersin University, Çiftlikköy Campus, TR-33343,
Mersin, Turkey.
Authors’ contributions
This work was carried out in collaboration between both authors. Author CM designed the study,
performed the statistical analysis, wrote the protocol and wrote the first draft of the manuscript.
Author HE managed literature searches and the analyses of the study. Both authors read and
approved the final manuscript.
Article Information
DOI: 10.9734/JAMCS/2018/40440
Editor(s):
(1) Morteza Seddighin, Professor, Indiana University East Richmond, USA.
Reviewers:
(1) Ramesh M. Mirajkar, Dr. Babasaheb Ambedkar College, India.
(2) Michael Chen, California State University, USA.
Complete Peer review History: http://www.sciencedomain.org/review-history/25018
Received: 28th February 2018
Accepted: 17th May 2018
Published: 6th June 2018
Original Research Article
Abstract
Aims/Objectives: Effective and efficient heart disease prediction via nonparametric mixture
regression models.
Data Source: Data used in this paper is from the UCI database of the Cleveland Clinic
Foundation for heart disease. The original data source contains 76 raw attributes with 303
observations each. For the purpose of this paper only 14 attributes were used as explained in
section 4.
Methodology: Cluster analysis was applied via mixture models in the form of nonparametric density-based models. The clusters were identified using a graph theory based technique: Voronoi diagrams were used, and their distributions were estimated nonparametrically through a mixture model with Gaussian kernels. The optimal number of clusters and components of the identified clusters were determined, analysed and diagnosed using a density-based silhouette information criterion. All the data analysis and model diagnosis were performed in R using the pdfCluster package.
*Corresponding author: E-mail: chipo.mufudza@nust.ac.zw
Results: Different numbers of components resulted in different numbers of clusters when nonparametric mixtures are used on heart disease data. However, the optimal clustering of heart disease risks was found to be two clusters with two components, according to the density-based silhouette information criterion. These clusters were both well separated and well classified, as indicated by the absence of spurious clusters and by high positive density-based silhouette values (see Figs. 2 and 4). Their properties are given in Table 2. This result holds regardless of the flexible, assumption-free conditions on the shape of the distribution, the number of components and the number of clusters.
Conclusion: When nonparametric mixture models are used, individuals at risk of heart disease can be diagnosed as either high or low risk, depending on the dominant characteristics of a given individual. Those at high risk have attributes that make them progress to heart disease faster than those at low risk. Therefore, by classifying individuals into these categories, medical personnel can quickly diagnose heart disease and efficiently identify the characteristics associated with each category.
Keywords: Density based silhouette information; heart disease; kernel density estimator; nonparametric mixture models.
2010 Mathematics Subject Classification: 53C25; 83C05; 57N16.
1 Introduction
Data clustering aims to partition data points given in a certain space. In general there exists no universal rule to define clusters, and thus many methods, both parametric and nonparametric, have been proposed. In particular, density-based clustering methods, also referred to as mixture models, have been of interest due to their ability to capture diverse heterogeneous properties of the data. Parametric mixture models, although very informative and easy to interpret, impose many restrictions and assumptions on the shape and distribution of clusters, which can be misleading and hence lead to wrong interpretations. Overcoming these restrictions can involve algorithms that are assumption-free or that minimise assumptions, a direction which makes use of nonparametric density estimation. It is the aim of this paper to concentrate on these nonparametric mixture models due to their flexibility. Nonparametric methods derive their strength from the sample data itself, thus reducing the assumption deficit of parametric methods by limiting the number of assumptions. The nonparametric densities used in nonparametric mixture models can be represented by histograms or by a range of density estimation methods, although we will concentrate on kernel density estimators (KDE). Clusters formed from nonparametric mixtures are derived from their own cluster density functions, which are unknown but estimated, rather than from an assumed shape and distribution of a cluster. It is with this view of improved flexibility that we focus on nonparametric clustering methods through mixture models. Consider the general parametric mixture model shown in equation (1.1):
f(x) = \sum_{j=1}^{n} \pi_j f_j(x)    (1.1)
It is evident that clusters are associated with the components f_j, and hence their shape and properties are assumed to follow suit, whilst no assumptions are made with nonparametric clusters, which are usually associated with regions of high density. Although the two approaches may sometimes lead to the same results, they are very different, and it is our ultimate purpose in this paper
to dwell on the nonparametric approach without imposing any assumptions on shape, density or the cluster-component correspondence. This is a more robust way of identifying and inferring subpopulations, as it can use inference on the number of modes, as alluded to by [1], over the parametric and hierarchical approaches, which estimate the number of clusters as output. Nonparametric density-based clustering can also be done via smooth polynomials which model the different cluster density estimates, as shown by [2], who implemented Legendre polynomials and Gamma and Beta mixtures to approximate nonparametric mixture models. In this paper we focus on nonparametric mixture models for multivariate mixed count data on heart disease using Gaussian kernel density estimators. Although most nonparametric mixture models have been built under the assumption of independent, identically distributed components for identifiability, latent variable models with Gaussian kernel mixture distributions can be built from observed data of mixed type (ordinal, binary, continuous) under component independence, as explained in [3, 4].
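To make the contrast above concrete, the following Python sketch (illustrative only; the paper itself works in R, and the sample values here are hypothetical) evaluates the parametric mixture density of equation (1.1) with Gaussian components alongside an assumption-free Gaussian kernel density estimate built from the sample alone:

```python
import math

def gaussian(x, mu, sigma):
    """Normal density N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def parametric_mixture(x, weights, mus, sigmas):
    """Parametric mixture f(x) = sum_j pi_j f_j(x), eq. (1.1): the component
    shapes (here Gaussian) are assumed in advance."""
    return sum(p * gaussian(x, m, s) for p, m, s in zip(weights, mus, sigmas))

def kde(x, sample, h):
    """Nonparametric Gaussian KDE: no component shapes are assumed; every
    observation contributes one kernel of bandwidth h."""
    return sum(gaussian(x, xi, h) for xi in sample) / len(sample)

sample = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]  # two apparent groups (hypothetical data)
fp = parametric_mixture(1.0, [0.5, 0.5], [1.0, 5.0], [0.5, 0.5])
fn = kde(1.0, sample, h=0.4)
```

Both functions return a density value at a point; the difference is that the KDE lets the number and shape of the high-density regions emerge from the data.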
2 Materials and Methods
2.1 Nonparametric density based clustering
Density-based clustering can be done both nonparametrically and parametrically, as explained by [5, 6]. In particular, it was observed that heart disease can be predicted via two risk groups, namely high and low risk groups, using a Poisson mixture regression model, as explained in [7]. Nonparametric mixture models, however, identify clusters as regions of high density separated by regions of low density, which can be found by identifying the local maxima (modes) of the estimated density of the data. Whilst nonparametric methods associate clusters with the regions around the modes of the probability distribution of the data, clusters in the model-based parametric approach correspond to the components of a mixture of distributions. The difference emerges clearly since the number of modes in a mixture of distributions does not necessarily match the number of components, as explained by [8].
There has been great interest in nonparametric density-based clustering methods due to their flexibility in cluster determination and inference for any given data type. Different approaches have been applied, making use of nonparametric density estimates and sometimes distances. Nonparametric density-based clustering techniques also allow the data to model relationships among variables, making them robust to functional form specification and hence able to detect structure which sometimes remains undetected by traditional parametric estimation techniques [9]. The whole idea of the parametric mixture model (1.1) is to assume that each component is identified by a parametric density function f_j, so that the shape of each cluster is approximated by the same distribution. The clustering problem then becomes the estimation of the mixing parameters \pi_j and the parameters of the density functions f_j, under conditions which make the model (1.1) identifiable. The most commonly used densities are Gaussian, although many variations and shapes have been considered, including skewed t distributions, to capture the shapes of clusters more flexibly. Nonparametric density-based clustering comes as a relief here: it frees individual clusters from a given density shape and hence explores assumption-free cluster densities. In many cases parametric mixture modelling comes with a serious disparity between a component and a cluster, caused by compliance with geometric heuristics, as alluded to by [10]. If the cluster shapes do not match the shapes of the densities f_j, the parametric mixture approach may face difficulties, a motivation for considering completely assumption-free cluster-shape densities via a nonparametric formulation. A second challenge arises when the mixing proportions \pi_j are variable instead of constant.
Nonparametric clustering involves different approaches, although the general idea uses kernel density estimators (KDE), which are a representation of mixture functions. Although a KDE can use either
fixed or adaptive smoothing/bandwidth, the latter is usually used to cater for the adaptive and changing mixing proportions of the different clusters. Therefore a nonparametric representation which replaces model (1.1) can be achieved via clusters with kernel density functions. Researchers including [10] used these kernel density functions as mixture densities for clusters, identifying a point of local maximum of density called a hilltop using the Modal EM (MEM) algorithm. Different clusters were linked via ridgelines connecting two hilltops, creating a more accurate way of nonparametric clustering through geometric heuristics. Practical procedures for nonparametric cluster quality diagnosis have been developed through the use of mean integrated errors as well as nonparametric density-based silhouette information by [8]. In [11], nonparametric estimates for finite mixtures from data on repeated measurements were used with the aim of advancing statistical inference on such models.
Density-based nonparametric clustering methods normally use clusters driven by density estimates, which determine the associated connected regions. Unlike parametric methods, nonparametric density-based clustering methods do not choose a particular density function but normally use a kernel density estimate with kernels of one's choice; the Gaussian kernel has so far proved to be the most commonly used [12, 13, 14, 15]. Researchers generally use the multidimensional kernel density estimator with adaptive bandwidth suggested by [16], represented as equation (2.1):
\hat{f}(x) = \sum_{i=1}^{n} \frac{1}{n\, h_{i,1} \cdots h_{i,d}} \prod_{j=1}^{d} K\!\left(\frac{x_j - x_{i,j}}{h_{i,j}}\right)    (2.1)

where h_{i,j} is the bandwidth of observation i in dimension j and K is the kernel function. Thus the general use of nonparametric density-based clustering represents a mixture model.
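The product-kernel estimator of equation (2.1) can be sketched directly in Python. This is a minimal illustrative implementation, not the pdfCluster one; the data and bandwidth matrix H are hypothetical, with H[i][j] being the bandwidth of observation i in dimension j:

```python
import math

def adaptive_kde(x, data, H):
    """Multidimensional product-kernel density estimate, eq. (2.1): each
    observation i contributes a kernel with its own per-dimension
    bandwidths H[i] = (h_{i,1}, ..., h_{i,d})."""
    n, d = len(data), len(x)
    K = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)  # Gaussian kernel
    total = 0.0
    for xi, hi in zip(data, H):
        contrib = 1.0
        for j in range(d):
            contrib *= K((x[j] - xi[j]) / hi[j]) / hi[j]
        total += contrib
    return total / n

# hypothetical 2-d sample; a constant bandwidth is used here purely for illustration
data = [[0.0, 0.0], [1.0, 1.0], [0.2, -0.1]]
H = [[0.5, 0.5] for _ in data]
density = adaptive_kde([0.1, 0.0], data, H)
```

Allowing H to vary by observation is what makes the estimator adaptive, catering for the changing mixing proportions mentioned above.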
Several authors have represented nonparametric mixture models in different ways, with the general assumption that each cluster is generated by its own unknown density function [15]. Nonparametric mixture models can therefore also mean that no assumptions are made about the form of the densities f_j in model (1.1), even though the weights \pi_j may be scalar parameters, as alluded to by [17]. It should, however, be noted that the weights are not restricted to scalars, since they can be variables. [18] defines nonparametric mixture modelling in a different sense, such that the family F from which the component densities come is fully specified up to a parameter \Theta, but the mixing distribution from which they are drawn is assumed to be completely unspecified and unknown, rather than having finite support of known cardinality. Thus nonparametric mixture models can be used to describe the case in which no assumptions are made about the distributional form of the mixture model, even though the mixing parameter can be Euclidean. Semiparametric models normally refer to cases where the distributional form is partly specified by a finite-valued parameter, such as the case in which f_j(x) = f(x - \mu_j) for a symmetric but otherwise completely unspecified density f(\cdot), as proposed by [19], an idea used in [17, 20] for the multivariate case. The assumption that component distributions come from a family of densities indexed by a finite-dimensional parameter vector is normally dropped when dealing with nonparametric mixtures. It is still, however, necessary to restrict the family of multivariate density functions from which the component densities are drawn in order to avoid model non-identifiability [20].
A number of nonparametric mixture models have assumed that the observed variables are jointly conditionally independent given the latent class, and use kernel methods to identify finite mixtures of nonparametric distributions [21, 22]. Suppose \theta is a vector of parameters, including the mixing proportions \lambda_1, \ldots, \lambda_m and the univariate densities f_{jk}, where j indexes the component and k indexes the coordinate, so 1 \le k \le r and 1 \le j \le m. Then, under the assumption of conditional independence, the nonparametric mixture density evaluated at x_i = (x_{i1}, \ldots, x_{ir})^t is given by:

f_\theta(x_i) = \sum_{j=1}^{m} \lambda_j \prod_{k=1}^{r} f_{jk}(x_{ik})    (2.2)
4
Mufudza and Erol; JAMCS, 27(5): 1-17, 2018; Article no.JAMCS.40440
Let z be a latent random variable with a probability mass function (PMF) on \chi = \{z_1, z_2, \ldots, z_K\} for an integer K, and let x = (x_1, x_2, \ldots, x_r) be a vector of repeated measurements on an observable outcome variable whose marginal probability density function (PDF) takes the form of equation (2.2), where f_{jk} denotes the PDF conditional on z = z_k. It should be noted that the coordinates are sometimes not only conditionally independent but identically distributed as well, which can be represented in terms of blocks, as alluded to in the work of [20], where each block represents coordinates that follow the same univariate distribution.
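A short Python sketch of equation (2.2) under conditional independence follows; it is illustrative only, with Gaussian coordinate densities chosen for concreteness and all parameter values hypothetical:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def cond_indep_mixture(x, lam, comps):
    """Eq. (2.2): f(x_i) = sum_j lambda_j * prod_k f_jk(x_ik).
    comps[j][k] = (mu, sigma) of the k-th coordinate density in component j;
    using a Gaussian for each f_jk is an illustrative choice."""
    total = 0.0
    for lam_j, comp in zip(lam, comps):
        prod = 1.0
        for x_k, (mu, sigma) in zip(x, comp):
            prod *= normal_pdf(x_k, mu, sigma)  # conditional independence: product over coordinates
        total += lam_j * prod
    return total

# two components (m = 2), two coordinates (r = 2)
comps = [[(0.0, 1.0), (0.0, 1.0)], [(4.0, 1.0), (4.0, 1.0)]]
val = cond_indep_mixture([0.0, 0.0], [0.6, 0.4], comps)
```

The inner product over coordinates is exactly the conditional-independence assumption; dropping it would require modelling the joint density of each component directly.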
Graph theory techniques such as Voronoi diagrams can also be used to identify clusters by observing closely concentrated connections and then estimating the density function along an interval of connections, as explored in [12]. This method takes the existence of a valley between point intervals as a disconnection, using the Voronoi diagram partition and the Delaunay graph as the edge connection between points. The same idea was developed by [13, 14], although [14] further improved the measure by which valleys are exhibited, using the minimum probability mass necessary to fill the valley. In this paper we focus on heart disease diagnosis using graph theory and KDE to estimate the nonparametric mixture models.
2.2 Nonparametric mixtures using graph theory
When graph theory methods are used in mixture models, clusters are connected components estimated using graph-based algorithms. The number of clusters k can then be chosen from a nearest-neighbourhood graph on the observations X_i built from the kernel density estimator \hat{K}_h. The choice is normally made on the basis of the number of connected components of a level set \{f > c\} in the given graph, a scenario explained by [23]. [24] proposed that a KDE combined with a single-linkage graph can be used to estimate the number of clusters. Many more authors, including [25], have also used density clustering via connected regions, using data-driven bandwidth selection measures and the stability of the identified clusters. In this sense, mixtures are formed by KDEs where an edge between two points indicates that they belong to the same ball of radius \rho, and hence to the same cluster. The number of connected components then corresponds to the number of clusters found in the sample. The same idea was used by [26] in investigating the stability of density-based clustering by altering different tuning parameters of the mixture models, including the kernels.
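The level-set idea above can be sketched in a few lines of Python: keep only points whose estimated density exceeds c, join points closer than a radius, and count connected components. This is a toy sketch with hypothetical points and a stand-in density function, not the graph construction used by pdfCluster:

```python
def level_set_clusters(points, density, c, eps):
    """Count clusters as connected components of the graph whose vertices are
    the points with estimated density >= c (the level set) and whose edges
    join points closer than eps."""
    core = [p for p in points if density(p) >= c]
    parent = list(range(len(core)))  # union-find over the core points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(core)):
        for j in range(i + 1, len(core)):
            dist = sum((a - b) ** 2 for a, b in zip(core[i], core[j])) ** 0.5
            if dist < eps:
                parent[find(i)] = find(j)  # same ball -> same cluster
    return len({find(i) for i in range(len(core))})

pts = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 5.0), (2.5, 2.5)]
# stand-in density: pretend the middle point lies in a low-density valley
dens = lambda p: 0.1 if p == (2.5, 2.5) else 1.0
k = level_set_clusters(pts, dens, c=0.5, eps=1.0)
```

Raising c peels away low-density points and can split one component into several, which is exactly how the number of clusters varies with the level.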
Graph theory can also be used to identify cluster trees and to prune the trees to remove spurious clusters where they exist. [27] explored a plug-in approach to cluster tree estimation: estimate the cluster tree of the feature density by the cluster tree of a density estimate. In [12, 13], nonparametric mixture models are described differently, in an interesting way, via graph theory through the KDE. Clusters are identified from connected points by observing the density function along a given interval in a multidimensional space: a cluster boundary is indicated by a disconnection between Voronoi regions, revealed through a valley of a mode function which is associated with both components and probability. Thus points with high connectivity indicate a mode of the kernel density and hence a single cluster. Connected regions are then identified using Delaunay diagrams, or by pairing in a high-dimensional space. This was demonstrated via the pdfCluster package in R [13, 14], an algorithm which makes use of the density-based silhouette information to differentiate clusters and draw inferences on them, as described by [8]. Thus nonparametric mixture models have the flexibility to identify clusters effectively through many methods, which is why we concentrate on some of these methods on a real dataset. In this context we look at heart disease data using nonparametric mixture models, to help in early and efficient heart disease diagnosis. This can also help alleviate and reduce complex cardiological problems.
2.2.1 Nonparametric mixtures: Graph theory procedure
We explore the procedures used by [12]: if the estimated function is unimodal then we have a connected cluster, otherwise it is disconnected. Suppose we have \chi = (x_1, \ldots, x_n), x_i \in \mathbb{R}^d, to be clustered in a d-dimensional space with unknown bounded and differentiable density function f. Then for each constant c, R(c) = \{x : x \in \mathbb{R}^d, f(x) \ge c\}. In this method a mode function is described such that the function f is replaced by its nonparametric estimate \hat{f}. Therefore, in a multidimensional space where there is no obvious way to determine the connected region, the interest is to focus on the sample analogue of this set, so that we have

S(c) = \{x : x \in \mathbb{R}^d, \hat{f}(x) \ge c\}    (2.3)

Thus we have a multidimensional problem of identifying connected regions. Graph theory is then used to detect the connected components of a given graph G whose vertices are the elements of S(c) and whose edges are key. The task is done via the Voronoi diagram, using either Delaunay triangulation for d \le 3 or pairing for d > 3, according to some measure of distance. The Voronoi diagram is a partition of \mathbb{R}^d given \chi into n regions \Upsilon(x_1), \ldots, \Upsilon(x_n), where \Upsilon(x_i) is the set of all points of \mathbb{R}^d closer to x_i than to any other point in the set, according to some measure of distance. The connected components are identified as unions of connected pairs that share at least one vertex, and in this way cluster cores are determined. The cluster cores are formed by the data lying in the regions around the detected modes. Clusters can also be detected by a minimum spanning tree, as explained by [28], which is a subgraph of the Delaunay triangulation [14].
Fig. 1. Voronoi tessellation and superimposed Delaunay triangulation
Fig. 1 shows an example of how connected points share one facet of the Delaunay triangulation as they belong to adjacent Voronoi regions. When the dimension is high, pairwise connections are implemented, as in our case where d = 14. The basic idea is to examine the behaviour of the KDE \hat{f}(x) when we move along a segment [x_1, x_2], since this behaviour depends on whether the sample values x_1 and x_2 belong to the same connected set of S(c) or not. We can view the set S(c) as a union of the intervals of the groups. If x_1 and x_2 belong to the same interval, then the corresponding portion of the density along the segment (x_1, x_2) has no local minimum. On the contrary, if x_1 and x_2 belong to different subsets of S(c), then at some point along [x_1, x_2] the density exhibits a local minimum, which we shall refer to as the presence of a valley [13]. The density estimates which determine the connected regions are not linked to any particular density function but are purely estimated. The commonly used density estimate is the multidimensional kernel density estimate given before in equation (2.1). The choice of the kernel normally does not have an effect on the estimate, unlike the smoothing parameter. The bandwidth can be fixed or adaptive, as suggested by [27]. In this paper, although both fixed and adaptive bandwidths were developed and analysed for heart disease, only the adaptive bandwidth is presented. Nonparametric mixture
models via graph theory can thus be described through the KDE, where clusters are identified by disconnections between Voronoi diagrams, as indicated through a valley, and points of high connectivity indicate a mode of the kernel density and hence a single cluster. This was demonstrated via the pdfCluster algorithm [13, 14], which makes use of the density-based silhouette information to differentiate clusters and draw inferences on them, as described in [8].
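The valley check along a segment can be sketched as follows in Python, for the one-dimensional case; the sample, bandwidth and step count are hypothetical choices for illustration:

```python
import math

def kde1(x, sample, h):
    """One-dimensional Gaussian KDE."""
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2)
               for s in sample) / (len(sample) * h * math.sqrt(2 * math.pi))

def has_valley(x1, x2, sample, h, steps=50):
    """Detect a 'valley': an interior point of [x1, x2] where the KDE drops
    below its value at both endpoints.  If one exists, x1 and x2 are taken
    to lie in different connected sets of S(c)."""
    ts = [i / steps for i in range(steps + 1)]
    vals = [kde1(x1 + t * (x2 - x1), sample, h) for t in ts]
    return min(vals[1:-1]) < min(vals[0], vals[-1])

sample = [0.0, 0.1, -0.1, 4.0, 4.1, 3.9]          # two groups (hypothetical)
valley = has_valley(0.0, 4.0, sample, h=0.5)      # density dips between the groups
same_group = has_valley(0.0, 0.1, sample, h=0.5)  # no dip within one group
```

In higher dimensions the same test is applied along the segment joining each candidate pair of points, which is the pairwise-connection criterion used when d > 3.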
2.2.2 Density Based Silhouette Information (DBS) criteria
In general, silhouette analysis is a form of internal cluster validation which can be used to study the separation distance between and within the resulting clusters. It validates clustering performance based on the pairwise difference of between- and within-cluster distances, and can also be used to determine the optimal number of clusters by maximizing the value of the index [29]. The silhouette plot displays a measure in the range [-1, 1] which shows how close each point in one cluster is to points in the neighbouring clusters, and thus provides a way to assess parameters like the number of clusters visually. Silhouette values near +1 indicate that the sample is far away from the neighbouring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighbouring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster. The thickness of the silhouette plot can also give an idea of the cluster size. Silhouette information can further describe cluster compactness, i.e. how closely objects are related within a single cluster, although this can also be done via variance analysis. Therefore, silhouette information can be used for both within- and between-cluster diagnosis.
The between and within distances are not always easy to calculate and apply with nonparametric methods. The use of silhouette information in a nonparametric density-based clustering procedure was developed by [8], made possible by incorporating both within-cluster distances and cluster posterior probabilities, the same idea employed by [13, 14]. The dbs values are calculated from the fact that if x_i \in \chi is drawn from a probability density function f, one can evaluate the posterior probability that it belongs to group \upsilon_m, m = 1, \ldots, M, as:

\tau_m(x_i) = \frac{\pi_m f_m(x_i)}{\sum_{m=1}^{M} \pi_m f_m(x_i)}    (2.4)

where \pi_m is the prior probability of \upsilon_m and f_m is the density of group \upsilon_m at x_i. Then the dbs is defined as follows:

dbs_i = \frac{\log\left(\tau_{m_0}(x_i) / \tau_{m_1}(x_i)\right)}{\max_{j=1:n} \left|\log\left(\tau_{m_0}(x_j) / \tau_{m_1}(x_j)\right)\right|}    (2.5)
where m_0 is such that x_i has been classified into \upsilon_{m_0}, and m_1 is the group, m_1 \ne m_0, for which \tau_m is maximum. Thus the dbs is a vector reporting the density-based silhouette information of the clustered data [8, 13, 14]. It can be used as a diagnostic tool for cluster quality evaluation: a high positive dbs value indicates well-classified observations and well-separated clusters, whilst a negative dbs indicates observations that might have been assigned to the wrong clusters. Partitioning within the dbs plot of a cluster may indicate the existence of hidden (spurious) clusters [8].
2.2.3 Bandwidth selection
The selection of an appropriate value of the bandwidth parameter h, which affects the density estimates dramatically, is vital for the best cluster results and estimations in nonparametric methods. Since the choice of kernel is not as influential, we use Gaussian kernels for all the estimations. Bandwidth selection methods have been studied intensively, especially in
non-mixture structures, by authors including [16, 30, 31]. Bandwidth selection in nonparametric mixtures can be a very challenging procedure; although some standard ideas can still be incorporated, there remain challenges of under- or over-smoothing, as explained by [32] in the case of conditional independence. However, when Gaussian kernels are used under the assumption of multivariate normality, [12] suggested that a bandwidth of the form of equation (2.6) can be used in the adaptive Gaussian kernel of equation (2.1), namely:

h_j = \sigma_j \left( \frac{4}{(d+2)\, n} \right)^{\frac{1}{d+4}}, \quad j = 1, \ldots, d,    (2.6)
where the standard deviation \sigma_j of the jth variable is replaced by an estimate. The bandwidths h_j are then multiplied by a shrinkage factor of 3/4 in order to relieve the over-smoothing that results from computing the bandwidths under the assumption of multivariate normality. A square-root law can be used to allow variation for each dimension and on different scales. This brings flexibility in such a way that different components are allowed to have different properties. Moreover, since the bandwidth is iterative, the bandwidth estimations can be done prior to knowledge of the mixture structure.
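The normal-reference rule of equation (2.6), with the 3/4 shrinkage, can be sketched as follows; the data values are hypothetical and the function name is our own:

```python
import math

def normal_reference_bandwidths(data, shrink=0.75):
    """Eq. (2.6): h_j = sigma_j * (4 / ((d + 2) * n))^(1/(d+4)), with sigma_j
    estimated by the sample standard deviation, then multiplied by the 3/4
    shrinkage factor to counter over-smoothing."""
    n, d = len(data), len(data[0])
    factor = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
    hs = []
    for j in range(d):
        col = [row[j] for row in data]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))  # estimate of sigma_j
        hs.append(shrink * sd * factor)
    return hs

data = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0], [4.0, 16.0]]  # hypothetical sample
h = normal_reference_bandwidths(data)
```

Because each h_j scales with its own \sigma_j, dimensions measured on different scales automatically receive proportionally different amounts of smoothing.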
3 Results and Discussion
3.1 Data
The data used in this paper come from the UCI database of the Cleveland Clinic Foundation for heart disease, as given by [33]. The original data are mixed data with 76 raw attributes and 303 observations for each attribute, of which only 14 attributes are used in this paper. These are the number of diagnoses (num), age, sex, chest pain, cholesterol, fasting blood sugar level (fbs), resting blood pressure (trestbps), maximum heart rate (thalach), resting electrocardiographic results (restecg), exercise-induced angina (exang), ST depression induced by exercise (oldpeak), slope of peak exercise (slope), number of major vessels (ca) and defects (thal). The data had some missing values, which we replaced using the mean response method. In the analysis, continuous variables were treated as log-normal and sometimes log-transformed, whilst discrete variables with more than two levels were considered ordinal. Nonparametric cluster analysis is done without prior information assumed or known: the cluster density estimates, the clusters and the inferences on them are based solely on the data, hence the use of nonparametric methods. Kernel estimation methods with a Gaussian kernel were used to estimate the density distributions. Thus the ultimate distribution of the data is a nonparametric mixture model which comprises Gaussian kernel density estimates.
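The mean-response imputation step mentioned above can be sketched in a few lines; this is an illustrative Python version with made-up values (the 'ca' attribute is real, but the entries shown are hypothetical):

```python
def impute_mean(column):
    """Mean-response imputation: replace missing values (None) in a numeric
    attribute by the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

ca = [0.0, 3.0, None, 1.0]  # e.g. the 'ca' attribute with one missing entry
filled = impute_mean(ca)    # the None is replaced by the mean of the observed values
```

Applied column by column, this keeps all 303 observations usable at the cost of shrinking the imputed entries toward the attribute mean.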
3.2 Packages used
Clustering of the data was done nonparametrically using the graph theory methods explained in section 2.2.1. The dissimilarity distances among the points were calculated using the Gower coefficient [28] via the cluster package in R. The cluster package is also used for multidimensional scaling of the data through the cmdscale function, as described by [34], since the data used were mixed data. Nonparametric density approximations were then applied to the mixed heart disease data using Gaussian kernel density estimators, implemented through the pdfCluster package of Azzalini and Menardi [13]. pdfCluster automatically selects the procedure to be used for detecting connected components of the density level sets, depending on the data dimensionality. This is enabled by an internal call to the function kepdf, both to estimate the density underlying the data and to build the connection network when the pairwise connection criterion is selected. In this paper a kernel density estimate with a Gaussian kernel is chosen and built, with the vector of smoothing parameters set to the one that is asymptotically optimal under the
assumption of multivariate normality. We used both a fixed, shrunk smoothing parameter and an adaptive bandwidth, since the data are highly multidimensional. Cluster diagnosis in this method uses the dbs criterion described in section 2.2.2. The number of groups is identified as the number of modes of the estimated density. The analysis was done for different numbers of components, and diagnosis of the clusters was done using the dbs criterion from the pdfCluster package. The dbs method is therefore applied to evaluate the cluster quality of the Cleveland heart disease data under nonparametric density-based clustering, using different numbers of principal components for the same data.
3.3 Analysis covered
Analysis of the Cleveland heart disease data is done using pdfCluster to determine the number of groups possible for heart disease prediction in a nonparametric way. Each cluster is represented by a Gaussian kernel density function, and hence the whole sample data by a mixture. Both the fixed smoothing parameter (not shown here) and the adaptive smoothing parameter via the adaptive bandwidth, as explained in section 2.2.3, were used. Different numbers of components were used to determine the number of clusters nonparametrically via connected regions, as explained in section 2.2.1. The results of the analysis are shown in the next sections (3.4 - 3.6).
3.4 Cluster diagnosis
In order to determine the optimal number of clusters and their quality, the dbs criterion is used for cluster diagnosis, as explained in section 2.2.2. Models with different numbers of components result in a variety of clusters, hence the need to determine the quality and optimality of the clusters. Fig. 2 shows the dbs plots and values for models with different numbers of principal components, resulting in varying median dbs values for each given cluster. The cluster information for models with p = 2, 3, 4 and 5 principal components is given by Fig. 2(a), (b), (c) and (d), respectively. The 2-component model has 2 clusters, both with positive median dbs values and no partitions within either cluster. Both clusters have high positive median dbs values, which indicates that observations have been well classified. They are also well separated, as the absence of partitions within the dbs plot is evidence of no hidden (spurious) clusters within either cluster. The quality of the clusters seems to decrease as the number of principal components increases to 3, as shown by Fig. 2(b), which shows the existence of 3 spurious clusters out of the 5 clusters; furthermore, the 2 other clusters have negative median dbs values, a pointer that most of the observations were wrongly classified. Whilst the 4-component model in Fig. 2(c) has 5 clusters, the 5-component model in Fig. 2(d) has 4 clusters, most of which contain hidden clusters and wrongly classified observations, as indicated by dbs plot partitions and negative median dbs values.
The dbs plots and median dbs values for models with 8 and 10 components are given by Fig. 3(a)
and (b), respectively. The 8 component model has 5 clusters, whilst there are only 2 clusters for
the ten component model. Although the 10 component model gives the same number of clusters as the
2 principal component one, the median dbs values of its clusters are both negative, a strong
indication that observations in these clusters were wrongly classified and ill partitioned.
Therefore, given the median dbs values for each cluster, we conclude that the data fit well in the
model with 2 components and 2 clusters. The widths of the dbs graphs for the 2 component 2 cluster
model are also reasonably bigger than any other, an indication that most observations fall in
these clusters. This is because the clusters are well separated, have no hidden clusters, and
contain well classified observations. Thus, the number of principal components corresponds to the
number of clusters here, although it should be noted that different results can be found in other
situations. It can be deduced that individuals exposed to heart disease risks can be classified
into 2 categories depending on which risks they are exposed to at a given time. These can be high
risk and low risk individuals, as observed by [7] using Poisson mixture models.
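A simplified version of the density based silhouette used for this diagnosis (following the idea in [8], with cluster-wise log-densities supplied directly rather than estimated) might look like:

```python
import numpy as np

def density_based_silhouette(log_dens, labels):
    """Simplified density based silhouette (dbs).

    log_dens : (n, k) log density of each point under each cluster's density
    labels   : (n,) cluster assignment of each point
    Positive dbs suggests a point sits well in its own cluster; negative
    dbs suggests another cluster explains it better (misclassification).
    """
    n, k = log_dens.shape
    own = log_dens[np.arange(n), labels]
    others = log_dens.copy()
    others[np.arange(n), labels] = -np.inf     # mask the own-cluster column
    best_other = others.max(axis=1)
    raw = own - best_other
    return raw / np.abs(raw).max()             # scale to [-1, 1]

# Toy example: three points, two cluster densities; the third point is
# assigned to cluster 0 although cluster 1 explains it better.
log_dens = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]))
labels = np.array([0, 0, 0])
dbs = density_based_silhouette(log_dens, labels)
```

Cluster-wise medians of these dbs values give exactly the kind of summary reported in Fig. 2 and Table 2.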
[Figure 2 appears here. Recoverable panel information: a) two clusters with median dbs 0.11 and 0.21; b) five clusters with median dbs 0.14, 0.21, 0.26, −0.07 and −0.15; c) five clusters with median dbs 0.4, 0.25, 0.04, −0.07 and −0.12; d) four clusters with median dbs 0.15, 0.18, −0.07 and −0.12.]
Fig. 2. Dbs diagrams for different p principal components: a) p = 2, b) p = 3, c) p = 4 and d) p = 5
3.4.1 Cluster scatter plots
The cluster scatter plots show how observations are classified, separated and partitioned within
and between clusters. They also help to view cluster partitioning within a component, beyond what
the dbs diagnostic plot offers. To view these cluster partitions and classifications we use
Fig. 4, which shows the cluster scatter plots for models with different numbers of components.
The cluster scatter plots for the 2 and 3 component models are shown by Fig. 4(a) and (b),
respectively. They show that the 2 component model has 2 clusters which are well separated in the
scatter graph, as the groupings can easily be identified. Another 2 component model, not shown in
this work, using a fixed bandwidth had 3 clusters which were poorly separated. The 3 component
model, on the other hand, has 5 clusters which are very difficult to identify on the scatter
graph. The same pattern repeats as we increase the number of components to 4, as shown by
Fig. 4(c). It shows that it is very difficult to discretely separate observations among the
different clusters, which strongly suggests ill-separated clusters whose observations are most
likely to be wrongly classified. Although the 10 component model has the same number of clusters
as the 2 component model, Fig. 4(d) shows that their elements can rarely be distinguished from
each other. Thus, using the cluster scatter plots we again deduce that the 2 component 2 cluster
model is the best model for heart disease diagnosis.
[Figure 3 appears here: dbs plots for the models with 8 and 10 components.]
Fig. 3. Dbs diagrams for different p principal components: a) p = 8, b) p = 10.
3.5 Cluster summary for different models
In this section we present a summary of all the models considered in the analysis, with different
numbers of components and clusters, as given in Table 1. It shows the different cluster properties
of the nonparametric mixture models analysed using the PdfCluster package in R. The number of
clusters varies between 2 and 5 as the number of components increases. However, increasing the
number of components also tends to increase the number of spurious clusters and of wrongly
classified observations, as indicated by the increase in clusters with negative median dbs values,
thus reducing cluster quality. Deducing from the cluster properties in the table, the best
clusters are produced by the model with 2 components, as it has no spurious clusters and only
positive median dbs values, as we have already seen. Although the 7 and 10 component models
produced the same number of clusters as the 2 component model, both clusters from the 7 component
model are spurious and have very low positive median dbs values compared to the 2 component model.
This indicates poor cluster separation and relatively high observation misclassification. In the
same manner, the 10 component model has negative dbs values for both of its 2 clusters, a strong
indication that the clusters were wrongly classified. We therefore conclude that individuals under
risk of heart disease are best classified by the 2 component nonparametric density model with 2
clusters.
3.6 Best model properties (2 Component 2 Cluster Model)
The 2 component 2 cluster model proved to be the best model for the heart disease data. In this
section we explore some of the properties of the chosen model to gain a better understanding of
the clusters. This involves examining the properties of each of the clusters produced by the
PdfCluster algorithm. Table 2 shows a summary of some of the statistical measures for the model.

The table summarises the dbs information for each cluster as well as for the whole data. It shows
that cluster 2 has higher median and mean dbs values, 0.21 and 0.29 respectively, than cluster 1,
whose values are even lower than those of the whole sample. Cluster 2 also has a high maximum
positive dbs of 1, indicating a compact, well clustered group.
[Figure 4 appears here: pairwise scatter plots of the principal component scores (var 1 to var 4), with each observation labelled by its assigned cluster number.]
Fig. 4. Cluster scatter plots for model with: a) 2 components, b) 3 components, c) 4 components and d) 10 components
3.7 Discussion
In parametric mixture models, components and clusters are assumed to coincide and to share the
same parametric distribution. Although the best nonparametric model for heart disease has 2
clusters and 2 components, the same result produced parametrically by [7], the nonparametric
analysis in section 3.4 shows that the number of components is not always equal to the number of
clusters. Thus, the number of components does not always correspond to the number
Table 1. Cluster summary under adaptive bandwidth

Components   Total Clusters   Spurious Clusters   Negative dbs Clusters
2            2                0                   0
3            5                3                   2
4            5                3                   2
5            4                3                   2
6            5                3                   4
7            2                2                   0
8            5                2                   3
9            5                2                   5
10           2                2                   2
Table 2. 2-2 model DBS cluster summary
Cluster Min. 1st Qu. Median Mean 3rd Qu. Max.
1 0.01311 0.06579 0.11020 0.11450 0.16130 0.28030
2 -0.03721 0.14310 0.21280 0.28590 0.43170 1.00000
Data -0.03721 0.09534 0.15800 0.20220 0.23520 1.00000
of clusters, as is normally assumed in parametric mixture models; rather, different numbers of
components can produce different numbers of clusters. As expected, a change in the smoothing
parameter changes the model outcomes: there were drastic changes in the model output when moving
from a fixed bandwidth to an adaptive, iteratively estimated bandwidth. Use of the adaptive
bandwidth resulted in more rigorous classifications in which clusters are well differentiated.
Whilst the fixed bandwidth seems to have little effect on the optimal clusters obtained from
different components, when the adaptive bandwidth is used an increase in the number of components
results in an increase in the number of clusters. An interesting point to note is how
nonparametric mixture models capture the heterogeneity among observations by grouping them into
clusters, and their ability to say more about cluster compactness and within- and between-cluster
relations, unlike parametric mixtures which rely on distributional assumptions. In this paper the
density based silhouette information was used to determine cluster properties through the median
dbs values, the orientation of the dbs graphs and the cluster scatter plots for the different
components. Whilst high positive dbs values of up to one indicate compactness, negative values
indicate wrong classification. The different clusters have different levels of risk depending on
which characteristics are more evident than others. It was noted that individuals in cluster 2
were at higher risk than those in cluster 1 for the chosen model. Thus nonparametric mixture
models explain the heterogeneous properties of the observations and groups. Further research on
this work could incorporate an analysis of the stability of the clusters when different tuning
parameters are used in the estimation, i.e. how stable the clusters are under different smoothing
parameters and kernels, for example by incorporating different kernel estimators into the
nonparametric clustering.
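The adaptive bandwidth idea discussed above can be sketched in the spirit of Abramson-type sample-point estimators; this is an illustrative assumption rather than the exact PdfCluster implementation, and the function name and pilot construction are hypothetical:

```python
import numpy as np

def adaptive_bandwidths(X, h0, alpha=0.5):
    """Sample-point adaptive bandwidths: shrink the bandwidth where a
    fixed-bandwidth pilot density is high, widen it where it is low."""
    n, d = X.shape
    z = (X[:, None, :] - X[None, :, :]) / h0      # standardized differences
    pilot = np.exp(-0.5 * (z * z).sum(axis=2)).sum(axis=1)
    pilot /= n * np.prod(h0) * (2.0 * np.pi) ** (d / 2.0)
    g = np.exp(np.log(pilot).mean())              # geometric mean of pilot
    lam = (pilot / g) ** (-alpha)                 # local scale factors
    return h0[None, :] * lam[:, None]             # (n, d) bandwidth matrix

# A tight group of points plus one isolated point: the isolated point
# gets a wider bandwidth than the points in the dense region.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
H = adaptive_bandwidths(X, h0=np.array([0.5, 0.5]))
```

Widening bandwidths in sparse regions is what smooths away the spurious modes that a fixed bandwidth can leave behind, consistent with the behaviour observed in the analysis.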
4 Conclusion
In this paper we presented how nonparametric mixture models can be used to diagnose heart disease
efficiently via graph theory techniques and estimation of the cluster densities using Gaussian
kernel density estimators. This was achieved by using Voronoi diagrams and pairwise connections to
identify high-connection points (modes) and low-connection points (valleys). This enabled us to
capture the heterogeneity in the heart disease data, which are mixed count data,
as the connection points (clusters) can then be represented by a mixture density via kernel
density functions. Analysis of these combined processes was done via the cluster and PdfCluster
packages in R. In this case the cluster and component behaviour are determined and estimated
solely from the data, which brings the flexibility and robustness needed to cater for the
heterogeneous qualities of any given population. The use of density based silhouette information
enabled us to determine the compactness and the within- and between-cluster correlations, and
hence the quality of the required number of clusters. The analysis and results therefore enabled
us to classify individuals under heart disease risks clearly into 2 clusters using 2 components.
These clusters have different risk levels: one cluster for those who are most likely to have the
disease (high risk) and one for those unlikely to get the disease (low risk). A closer look at
within- and between-cluster correlations through the dbs criterion and paired connections gave a
clear picture of the relations among observations within a cluster. We noted that cluster 2 had a
more compact structure than cluster 1 due to its highly positive mean dbs value. Individuals in
cluster 2 were those considered to be at high risk; we can thus deduce that individuals at high
risk are more likely to show tightly related symptoms than those who are unlikely to develop the
disease, as shown by the cluster compactness. By classifying individuals under heart disease risk
into either a low or a high risk group, heart disease can be diagnosed effectively and
efficiently, reducing the cardiological challenges that wrong diagnosis can bring. Treatment can
then be administered accordingly. Thus we conclude that when nonparametric mixture models are used
to diagnose heart disease, a 2 cluster model with 2 components can be used, as it diagnoses
individuals with different heart disease risk factors properly.
5 Recommendations
The model can be very useful to physicians and medical personnel because it helps to diagnose
heart disease not only quickly but also efficiently. Since the model can classify a patient into a
high or low risk category, by administering a machine learning programme that captures the 14
attributes used in this work from any given patient, physicians can easily limit the time needed
for diagnosis. This is possible because the clustering is performed at individual level, capturing
individual characteristics, so that individuals can be classified. It is of great interest to note
that, since each cluster shares common individual characteristics, it will be faster for a
physician to identify the attributes associated with those in the high risk or low risk categories
directly, thereby minimising the time to diagnose heart disease. Consequently, effective
monitoring strategies can be devised for each cluster depending on the common characteristics
identified.

In this paper, clusters were identified nonparametrically using graph theory, producing the same
results as [7], where Poisson mixture regression models were used. We therefore recommend that
data scientists implement a different model, either with more attributes or using different
clustering methods such as splines, to check the possible number of clusters that can be produced.
Disclaimer
This manuscript was presented at ICANAS, 2nd International Conference on Advances in Natural and
Applied Sciences, Antalya, Turkey, April 2017. Available link is:
https://www.researchgate.net/publication/316550105_Heart_Disease_Diagnosis_via_Nonparametric_Mixture_Models
Acknowledgement
Gratitude goes to the Turkish Government Scholarship for the financial provision to the corresponding
author.
Competing Interests
Authors have declared that no competing interests exist.
References
[1] Cheng YS, Ray S. Multivariate modality inference using Gaussian Kernel. Open Journal of
Statistics. 2014;4:419-434.
Available:http://dx.doi.org/10.4236/ojs.2014.45041
[2] Rebafka T, Roueff F. Nonparametric estimation of the mixing density using polynomials.
Mathematical Methods of Statistics, Allerton Press. Springer (link). 2015;24(3).
[3] McParland D, Gormley IC. Model based clustering for mixed data: clustMD. Adv Data Anal
Classif. 2016;10:155.
DOI: 10.1007/s11634-016-0238-x
[4] Zhu X, Hunter DR. Clustering via finite nonparametric ICA mixture models. Annals of Applied
Statistics. 2015;1-27.
[5] McLachlan G, Peel D. Finite mixture models. Wiley Series; 2000.
[6] Fraley C, Raftery AE. Model-based clustering, discriminant analysis and density estimation.
Journal of the American Statistical Association. ABI/INFORM Global. 2002;97:611.
[7] Mufudza C, Erol H. Poisson mixture regression models for heart disease prediction.
Computational and Mathematical Methods in Medicine. 2016;10.
Article ID: 4083089
Available:http://dx.doi.org/10.1155/2016/4083089
[8] Menardi G. Density-based Silhouette diagnostics for clustering methods. Stat Comput.
2011;21:295-308.
DOI: 10.1007/s11222-010-9169-0
[9] Racine J, Li Q. Nonparametric estimation of regression functions with both categorical and
continuous data. Journal of Econometrics. 2004;119:99-130.
[10] Li J, Ray S, Lindsay BG. A nonparametric statistical approach to clustering via mode
identification. Journal of Machine Learning Research. 2007;8:1687-1723.
[11] Bonhomme S, Jochmans K, Robin JM. Nonparametric estimation of finite mixtures from
repeated measurements. Journal of the Royal Statistical Society: Series B (Statistical
Methodology). 2016;78:211-229.
DOI: 10.1111/rssb.12110
[12] Azzalini A, Torelli N. Clustering via nonparametric density estimation. Stat Comput.
2007;17:71-80.
DOI: 10.1007/s11222-006-9010-y
[13] Azzalini A, Menardi G. Clustering via nonparametric density estimation: The R package
PdfCluster. Journal of Statistical Software. 2014;57:1-26.
[14] Menardi G, Azzalini A. An advancement in clustering via nonparametric density estimation.
Stat Comput. 2014;24:753-767.
DOI: 10.1007/s11222-013-9400-x.
[15] Mallapragada PK, Jin R, Jain A. Nonparametric mixture models for clustering. Chapter:
Structural, Syntactic and Statistical Pattern Recognition: Lecture Notes in Computer Science.
2010;6218:334-343.
[16] Scott DW, Sain S. Multi-dimensional density estimation. Handbook of Statistics. 2005;24:229-261.
[17] Chauveau D, Hoang VTL. Nonparametric mixture models with conditionally independent
multivariate component densities. 2015;hal-01094837v2.
[18] Lindsay BG, Lesperance ML. A review of semiparametric mixture models. Journal of Statistical
Planning and Inference. 1995;47:29-39.
[19] Hall P, Zhou X. Nonparametric estimation of component distributions in a multivariate
mixture. The Annals of Statistics. 2003;31:201-224.
[20] Benaglia T, Chauveau D, Hunter R. An EM-like algorithm for semi- and nonparametric
estimation in multivariate mixtures. Journal of Computational and Graphical Statistics.
2009a;18:505-526.
[21] Sgouritsa E, et al. Identifying finite mixtures of nonparametric product distributions and
causal inference of confounders. Proceedings of the 29th Annual Conference on Uncertainty in
Artificial Intelligence (UAI). 2013;556-565.
[22] Chauveau D, Hoang VTL. Nonparametric mixture models with conditionally independent
multivariate component densities. Computational Statistics & Data Analysis. 2016;103:1-16.
Available:http://dx.doi.org/10.1016/j.csda.2016.04.013
[23] Cuevas A, et al. Estimating the number of clusters. The Canadian Journal of Statistics.
2000;28:1-17.
[24] Biau G, Cadre B, Pelletier B. A graph based estimator of the number of clusters. ESAIM:
Probability and Statistics. 2007;11:272-280.
DOI: 10.1051/ps:2007019
[25] Rinaldo A, Wasserman L. Generalised density clustering. The Annals of Statistics. 2010;38:678-
2722.
DOI: 10.1214/10-AOS797
[26] Rinaldo A, Singh A, Nugent L. Stability of density based clustering. Journal of Machine
Learning Research. 2012;13:905-948.
[27] Stuetzle W, Nugent R. A generalised single linkage method for estimating the cluster tree of
a density. J. Comput. Graph. Statistics. 2010;19:397-418.
[28] Gower JC, Ross JSG. Minimum spanning trees and single linkage cluster analysis, J.R.Stat.
Soc., Series C (Applied Statistics). 1969;18:54-64.
[29] Liu Y, et al. Understanding of internal clustering validation measures. IEEE International
Conference on Data Mining. 2010;911-916.
[30] Härdle W, et al. Nonparametric and semiparametric models. Springer; 2004.
[31] Benaglia T, Chauveau D, Hunter R. Bandwidth selection in an EM-like algorithm for
nonparametric multivariate mixtures, World Scientific Publishing Co. Nonparametric Statistics
and Mixture Models: A Festschrift in Honor of Thomas P. Hettmansperger, World Scientific
Publishing Co., 2011;15-27. (hal-00353297).
[32] Benaglia T, Chauveau D, Hunter R. An EM-like algorithm for semi- and nonparametric
estimation in multivariate mixtures. Journal of Computational and Graphical Statistics.
2009b;18:505-526.
[33] Detrano R. Heart Disease Data Set. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation.
Available:http://archive.ics.uci.edu/ml/datasets/Heart+Disease
[34] Maechler M, et al. Cluster: Cluster analysis basics and extensions. R package version 2.0.5;
2016.
——————————————————————————————————————————————–
© 2018 Mufudza and Erol; This is an Open Access article distributed under the terms of the Creative
Commons Attribution License http://creativecommons.org/licenses/by/4.0, which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.
Peer-review history:
The peer review history for this paper can be accessed here (Please copy paste the total link in your browser
address bar)
http://www.sciencedomain.org/review-history/25018