
An R Code for Implementing Non-hierarchical Algorithm for Clustering of Probability Density Functions


Abstract

This paper presents, for the first time in the R environment, a code implementing a non-hierarchical algorithm for clustering one-dimensional probability density functions. The code consists of two primary steps: executing the main clustering algorithm and evaluating the clustering quality. It is validated on one simulated data set and two applications. The numerical results obtained are highly compatible with those produced by MATLAB in terms of computational time. Notably, the code mainly serves an educational purpose and aims to extend the availability of the algorithm across several environments, so that those interested in clustering have multiple choices. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Given a sample $x_1, x_2, \ldots, x_N$ of size $N$ in $n$ dimensions, the probability density function is estimated by the kernel method as
\[
\hat{f}(x) = \frac{1}{N h_1 h_2 \cdots h_n} \sum_{i=1}^{N} \prod_{j=1}^{n} K_j\!\left( \frac{x_j - x_{ij}}{h_j} \right),
\]
where
\[
h_j = \left( \frac{4}{N(n+2)} \right)^{\frac{1}{n+4}} \sigma_j
\]
is the bandwidth for the $j$-th dimension, $K_j(\cdot)$ is the kernel function used in the $j$-th dimension, and
\[
\sigma_j = \sqrt{ \frac{ \sum_{i=1}^{N} \left( x_{ij} - \bar{x}_j \right)^2 }{ N - 1 } }
\]
is the sample standard deviation of the $j$-th dimension.
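As an illustration of this estimation step, the following R sketch evaluates a univariate kernel density estimate on a grid with a Gaussian kernel and the bandwidth rule above for $n = 1$; the function and variable names are illustrative and not taken from the published code.

```r
# Kernel density estimate of a univariate sample on a grid, using a
# Gaussian kernel and the bandwidth h = (4 / (N(n + 2)))^(1/(n + 4)) * sigma
# with n = 1, i.e. h = (4 / (3N))^(1/5) * sigma.
estimate_pdf <- function(sample, grid) {
  N <- length(sample)
  sigma <- sd(sample)                     # sample standard deviation
  h <- (4 / (3 * N))^(1 / 5) * sigma      # normal-reference bandwidth, n = 1
  # Average the scaled kernels over the sample at each grid point
  sapply(grid, function(x) mean(dnorm((x - sample) / h)) / h)
}

# Example: one estimated pdf on a common grid
set.seed(1)
x  <- seq(-5, 5, length.out = 512)
f1 <- estimate_pdf(rnorm(200), x)
```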
Consider a set of probability density functions $F = \{ f_1(x), f_2(x), \ldots, f_n(x) \}$, $n > 2$, to be partitioned into $k$ clusters $C_1, \ldots, C_k$ such that
\[
\sum_{i=1}^{k} \#C_i = n,
\]
where $\#C_i$ is the number of pdfs in the $i$-th cluster, and
\[
C_i \cap C_j = \emptyset, \quad i \neq j, \quad i = 1, \ldots, k, \ j = 1, \ldots, k.
\]
The clustering criterion relies on the $L^1$ distance. For $n$ pdfs $f_1(x), f_2(x), \ldots, f_n(x)$ ($n \geq 2$) defined on $\mathbb{R}^m$, let
\[
f_{\max}(x) = \max \{ f_1(x), f_2(x), \ldots, f_n(x) \}.
\]
For $n = 2$, the width of the pair is defined via the $L^1$ distance as
\[
w(f_1, f_2) \equiv \frac{\| f_1 - f_2 \|_1}{2} = \int_{\mathbb{R}^m} f_{\max}(x)\, dx - 1,
\]
and for $n > 2$ the width of the set is
\[
w(f_1, f_2, \ldots, f_n) \equiv \| f_1, f_2, \ldots, f_n \|_1 = \int_{\mathbb{R}^m} f_{\max}(x)\, dx - 1.
\]
If $g$ is the representing pdf of a cluster $\{ g_1, g_2, \ldots, g_n \}$, then for any pdfs $f_1, f_2, \ldots, f_m$,
\[
w\bigl[ \{g\} \cup \{ f_1, f_2, \ldots, f_m \} \bigr] \leq w\bigl[ \{ g_1, g_2, \ldots, g_n \} \cup \{ f_1, f_2, \ldots, f_m \} \bigr],
\]
since the representing pdf, being the average of $g_1, \ldots, g_n$, does not exceed their pointwise maximum.
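A minimal R sketch of this width, assuming each pdf is stored as a column of a matrix evaluated on a common, equally spaced grid (names are illustrative):

```r
# Width of a set of pdfs: w(f1, ..., fn) = integral of max(f1, ..., fn) - 1.
# `f` holds the pdfs as columns, evaluated on the equally spaced grid `x`.
cluster_width <- function(f, x) {
  f <- as.matrix(f)
  fmax <- apply(f, 1, max)     # pointwise maximum of the pdfs
  dx <- x[2] - x[1]            # grid spacing
  sum(fmax) * dx - 1           # rectangle-rule approximation of the integral
}

# For two pdfs this equals half of their L1 distance (up to discretization):
# cluster_width(cbind(f1, f2), x)  ~  0.5 * sum(abs(f1 - f2)) * (x[2] - x[1])
```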
The clustering quality is assessed with the SF index,
\[
\mathrm{SF} = \frac{\displaystyle \sum_{i=1}^{k} \sum_{f \in C_i} \| f - f_{v_i} \|^2}{\displaystyle n \min_{i \neq j} \| f_{v_i} - f_{v_j} \|^2},
\]
where $k$ is the number of clusters, $\| f_{v_i} - f_{v_j} \|$ is the $L^1$ distance between the representing pdfs of clusters $C_i$ and $C_j$, and the representing pdf of cluster $C_j$ is
\[
f_{v_j} = \frac{\sum_{f_i \in C_j} f_i}{n_j},
\]
with $n_j$ the number of pdfs in $C_j$. A smaller SF index indicates a better partition.
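A sketch of this evaluation step in R, under the same storage convention and with `labels` giving the cluster of each pdf; the function names are illustrative and the published code may organize this differently:

```r
# L1 distance between two pdfs evaluated on the equally spaced grid `x`.
l1_distance <- function(fa, fb, x) sum(abs(fa - fb)) * (x[2] - x[1])

# SF index: squared distances of the pdfs to their cluster representatives,
# divided by n times the smallest squared distance between representatives.
# Smaller values indicate a better partition.
sf_index <- function(f, x, labels) {
  k <- max(labels)
  n <- ncol(f)
  # Representing pdf of each cluster: the average of its member pdfs
  reps <- sapply(1:k, function(i) rowMeans(f[, labels == i, drop = FALSE]))
  num <- sum(sapply(1:n, function(j) l1_distance(f[, j], reps[, labels[j]], x)^2))
  seps <- c()
  for (i in 1:(k - 1))
    for (j in (i + 1):k)
      seps <- c(seps, l1_distance(reps[, i], reps[, j], x)^2)
  num / (n * min(seps))
}
```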
The non-hierarchical algorithm partitions $n$ pdfs $f_1, f_2, \ldots, f_n$ ($n > 2$) into $k$ clusters $C_1, C_2, \ldots, C_k$ with $\sum_{i=1}^{k} \#C_i = n$. Let $C^{(t)}_i$ denote cluster $C_i$, $i = 1, \ldots, k$, at the $t$-th iteration. The algorithm proceeds as follows.

Step 1. Randomly assign the $n$ pdfs to $k$ initial clusters $C^{(1)}_1, C^{(1)}_2, \ldots, C^{(1)}_k$.

Step 2. For each pdf $f_j$, compute the widths $w\bigl( f_j \cup C^{(1)}_i \bigr)$, $i = 1, \ldots, k$, $j = 1, \ldots, n$. Suppose $f_j$ currently belongs to $C^{(1)}_h$. If
\[
w\bigl( f_j \cup C^{(1)}_h \bigr) = \min_{i=1,\ldots,k} \bigl\{ w\bigl( f_j \cup C^{(1)}_i \bigr) \bigr\},
\]
then $f_j$ remains in $C^{(1)}_h$. Otherwise, if the minimum is attained at a cluster $C^{(1)}_s$ with $s \neq h$, then $f_j$ is moved from $C^{(1)}_h$ to $C^{(1)}_s$, giving $C^{(2)}_s = C^{(1)}_s \cup \{ f_j \}$ and $C^{(2)}_h = C^{(1)}_h \setminus \{ f_j \}$ while the other clusters are unchanged; the updated partition $C^{(2)}_1, C^{(2)}_2, \ldots, C^{(2)}_k$ is then used when the next pdf $f_j$ is examined.

Step 3. Repeat Step 2 until no pdf changes cluster. The algorithm stops at iteration $m$ when the partition $C^{(m)}_1, C^{(m)}_2, \ldots, C^{(m)}_k$ satisfies, for every $f_j$ belonging to $C^{(m)}_s$,
\[
w\bigl( f_j \cup C^{(m)}_s \bigr) = \min_{i=1,\ldots,k} \bigl\{ w\bigl( f_j \cup C^{(m)}_i \bigr) \bigr\}.
\]
The main clustering function takes three arguments (f, k, x): f is the matrix of estimated pdfs evaluated on a common grid, k is the desired number of clusters, and x is the grid of points at which the pdfs in f are evaluated. The helper function createU(k, n) generates a random initial partition matrix of size k x n, assigning each of the n pdfs to one of the k clusters by rounding uniform draws on the interval (0.5, k + 0.5); a further helper with arguments (f, x) operates on the pdfs f over the grid x.
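Assuming the helpers sketched earlier, a call mirroring this interface might look as follows; the data here are simulated for illustration and are not the paper's data set:

```r
# Seven pdfs estimated on a common grid, then clustered into k = 3 groups
set.seed(2)
x <- seq(-10, 10, length.out = 1000)
samples <- list(rnorm(200, 0),   rnorm(200, 5),   rnorm(200, -5),
                rnorm(200, 0.5), rnorm(200, 5.5), rnorm(200, -4.5),
                rnorm(200, 4.8))
f <- sapply(samples, estimate_pdf, grid = x)   # one estimated pdf per column

labels <- cluster_pdfs(f, k = 3, x = x)        # cluster assignment of each pdf
sf_index(f, x, labels)                         # quality of the resulting partition
```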
Applied to the pdfs f estimated on the grid x with the L1 distance, the code groups the seven simulated pdfs into three clusters: C1 = {f1, f4}, C2 = {f2, f5, f7}, C3 = {f3, f6}.
For the first application, the algorithm was run for several values of k. With k = 2, the pdfs are partitioned into a cluster C1 of three pdfs and a cluster C2 of thirteen pdfs, the latter including f4, f6, f7, f8, f9 and f10. The second application, involving images of resolution 1920 x 1080, is clustered with k = 2; results are also reported for k = 3 and k = 4.