International Journal of Advanced Science and Technology
Vol.113 (2018), pp.133-142
http://dx.doi.org/10.14257/ijast.2018.113.14
ISSN: 2005-4238 IJAST
Copyright 2018 SERSC Australia
A Survey of Data Clustering Methods
Saima Bano and M. N. A. Khan
Shaheed Zulfikar Ali Bhutto Institute of Science and Technology,
Islamabad, Pakistan
sayohunzai@gmail.com, mnak2010@gmail.com
Abstract
Data clustering is one of the most essential, common and interesting tasks for
grouping patterns in areas such as data mining, pattern recognition and artificial
intelligence. The objective of data clustering is to group similar entities. Many
different clustering techniques are available for applications of different kinds. Data
clustering techniques are categorized into two types: partitioning procedures and
hierarchical procedures. Hierarchical clustering creates a tree-like hierarchy of clusters,
and its results are shown as a dendrogram. Partitioning methods construct various
partitions of the objects and evaluate them by some criterion. In this paper, we present a
critical review of a few papers and identify strengths and weaknesses of different
clustering techniques. The purpose of this overview is to compare and evaluate the
clustering techniques and find their pros and cons. This comparison points to the better
approaches for future research in data clustering.
Keywords: Data Clustering, Fuzzy Clustering, Fuzzy C-Means Clustering, K-Means,
K-Medoids
1. Introduction
Data clustering is a procedure in which we group entities into clusters based on
their similar features. A good clustering technique will create high-quality clusters
with high intra-class similarity and low inter-class similarity. The quality of a
clustering depends on the similarity measure used and its implementation, and it is
also measured by the ability of the process to discover some or all of the hidden
patterns. To create clusters of similar objects, a distance measure is used, such as
Euclidean distance, Manhattan (taxicab) distance, Mahalanobis distance, inner
product space, Hamming distance, cosine similarity index or Minkowski distance.
All of these measure the similarity between points, but Euclidean distance is the
one most often used to compare the attributes of objects. Data clustering is used in
many application areas, e.g., data retrieval, image analysis, machine learning, web
search engines, pattern recognition, computing, economics, libraries, insurance, city
planning and earthquake studies.
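As an illustration of the distance measures listed above, the following is a minimal Python sketch using only the standard library. The function names and the sample points are our own choices for illustration and are not taken from the surveyed papers.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (taxicab distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical sample points.
p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), manhattan(p, q), minkowski(p, q, 2))
```

Note that cosine similarity is a similarity (larger means closer), whereas the others are distances (smaller means closer).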
There are many data clustering algorithms for classifying data into similar groups.
Clustering algorithms are normally categorized into two groups: unsupervised linear
clustering algorithms and unsupervised non-linear clustering algorithms. The unsupervised
linear clustering algorithms include the fuzzy c-means, quality threshold, k-means,
hierarchical and Gaussian (EM) clustering algorithms, whereas the MST-based,
density-based and kernel k-means clustering algorithms are unsupervised non-linear
algorithms.
Received (December 27, 2017), Review Result (March 11, 2018), Accepted (March 13, 2018)
Among the well-known clustering algorithms, k-means partitions n objects into k
clusters. Each object is assigned to the cluster with the nearest mean, using a distance
measure. The strengths of k-means are that it is fast, converges to a local optimum and is
flexible, because cluster assignments can change during the process. Its limitation is that
results become poor as the data size grows, because k-means only ever reaches a locally
optimal clustering. The k-medoids algorithm is similar to k-means in that it partitions n
objects into k clusters, but it minimizes the dissimilarities between objects and a
representative point, so it is more robust than k-means. The medoid is the center point of
a cluster in k-medoids: it is an actual reference object rather than the mean value of the
objects.
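The assignment and update steps of k-means described above can be sketched as follows. This is a minimal illustration, not the implementation used in any of the surveyed papers: it assumes Euclidean distance, takes the first k points as initial centroids (an arbitrary, deterministic choice made here for reproducibility), and uses a hypothetical toy data set.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: a local optimum was reached
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical toy data: two well-separated groups.
data = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.8, 8.2), (0.9, 1.1), (8.1, 7.9)]
labels, centroids = kmeans(data, k=2)
print(labels, centroids)
```

A k-medoids variant would differ mainly in the update step, where the new center is restricted to an actual data point that minimizes the total dissimilarity to the other members of its cluster.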
Common clustering problems include interpreting results, handling outliers, choosing the
number of clusters, handling dynamic data and evaluating results. Different clustering
procedures are used to reduce such issues. Clustering procedures are classified into the
following categories:
• Hierarchical procedures
  o Agglomerative hierarchical clustering
  o Divisive hierarchical clustering
• Partitioning procedures
• Density-based procedures
• Model-based clustering procedures
  o Decision trees
  o Neural networks
• Grid-based procedures
• Fuzzy clustering
1.1. Hierarchical Clustering Method
This method creates a set of nested clusters organized as a hierarchical tree and
visualized as a dendrogram, a tree-like diagram that records the sequence of joins or
splits. Hierarchical procedures can be either agglomerative or divisive. Agglomerative
algorithms start with each element as a single cluster and merge them successively into
larger clusters; divisive algorithms start with the entire set and proceed to split it into
successively smaller clusters.
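The agglomerative procedure can be sketched in a few lines of Python. The sketch assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance; both are choices made here for illustration, and the sample points are hypothetical. The recorded merge history corresponds to the joins a dendrogram would display.

```python
import math

def single_linkage(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []                                   # history of joins (dendrogram)
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Hypothetical sample points (indices 0..4).
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
clusters, merges = single_linkage(pts, num_clusters=3)
print(clusters)
print(merges)
```

This naive version recomputes all pairwise cluster distances at every step, so it is only suitable for small inputs.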
1.2. Partitioning Clustering Method
Partitioning procedures commonly result in a set of M clusters, with each item
belonging to exactly one cluster. Each cluster may be represented by a centroid or a
cluster representative; this is a summary description of all the entities contained in the
cluster. The exact form of this description depends on the nature of the entities being
clustered.
1.3. Density-based Clustering Method
Density-based clustering algorithms discover clusters of arbitrary shape. Dense regions of
objects are separated by regions of low density. Points are connected on the basis of a
distance threshold, and they keep being connected as long as the density criterion is
satisfied. The approach handles noise and needs only one scan of the data. Algorithms in
this family include DBSCAN, OPTICS and DENCLUE. CLIQUE, BIRCH and CURE are
often listed alongside them, although CLIQUE is grid-based and BIRCH and CURE are
usually classed as hierarchical methods.
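To make the density criterion concrete, the following is a minimal sketch in the style of DBSCAN. It is a simplified illustration rather than a reference implementation: the parameter names eps and min_pts follow common usage, neighborhoods are found by brute force, and the sample points are hypothetical.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE        # may later be claimed as a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts: # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical data: two dense groups and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
print(labels)
```

The isolated point is labeled as noise rather than being forced into a cluster, which is the noise-handling property mentioned above.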
1.4. Model-based Clustering Methods
Model-based methods optimize the fit between the data and a mathematical model that
is based on an assumption. Model-based clustering assumes that the data were generated
by a model and tries to recover the original model from the data. The clusters are then
defined from the recovered model.
1.5. Grid-based Clustering Method
This method is concerned with the value space that surrounds the data points rather than
with the data points themselves. A grid clustering method first creates a grid structure,
then calculates the density of each cell, and after that identifies the cluster centers. The
main advantage of grid-based clustering is its reduced computational complexity,
particularly when clustering very large data sets. Notable grid-based clustering methods
include WaveCluster, STING and CLIQUE.
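The grid-then-density steps can be sketched as follows for two-dimensional data. This is a simplified illustration and not WaveCluster, STING or CLIQUE: it assumes square cells of a fixed size, a fixed density threshold, and, as its way of identifying clusters, it joins dense cells that touch each other. All names and sample values are hypothetical.

```python
from collections import Counter

def grid_cluster(points, cell_size, min_density):
    """Bin 2-D points into cells, keep dense cells, join adjacent dense cells."""
    # 1. Grid structure: map each point to the integer coordinates of its cell.
    cell_of = [(int(x // cell_size), int(y // cell_size)) for x, y in points]
    # 2. Cell density: number of points that fall into each cell.
    density = Counter(cell_of)
    dense = {c for c, n in density.items() if n >= min_density}
    # 3. Clusters: connected groups of dense cells (8-neighborhood flood fill).
    cell_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = next_label
        stack = [start]
        while stack:
            cx, cy = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in cell_label:
                        cell_label[nb] = next_label
                        stack.append(nb)
        next_label += 1
    # Points in sparse cells get the label -1 (treated as noise).
    return [cell_label.get(c, -1) for c in cell_of]

# Hypothetical sample points.
pts = [(0.1, 0.2), (0.4, 0.3), (1.2, 0.5), (1.4, 0.1),
       (5.5, 5.5), (5.1, 5.9), (9.0, 1.0)]
labels = grid_cluster(pts, cell_size=1.0, min_density=2)
print(labels)
```

After the single pass that bins the points, the work depends on the number of occupied cells rather than on the number of points, which is the source of the reduced computational cost noted above.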
1.6. Fuzzy Clustering
Fuzzy clustering is a soft computing approach in which a data point can belong to more
than one cluster. A well-known fuzzy clustering algorithm is fuzzy c-means, which is
mostly used in image processing tools.
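The idea of partial membership can be illustrated with the two update rules of the standard fuzzy c-means formulation: memberships computed from relative distances to the centers, and centers computed as membership-weighted means. The sketch below assumes Euclidean distance and a fuzzifier m = 2; the function names and sample values are hypothetical.

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """Membership of every point in every cluster; each row sums to 1."""
    power = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        d = [math.dist(p, c) for c in centers]
        if 0.0 in d:
            # The point coincides with a center: full membership goes there.
            hits = d.count(0.0)
            rows.append([1.0 / hits if di == 0.0 else 0.0 for di in d])
            continue
        rows.append([1.0 / sum((di / dk) ** power for dk in d) for di in d])
    return rows

def fcm_centers(points, memberships, m=2.0):
    """Cluster centers as means of the points weighted by membership**m."""
    dims = len(points[0])
    centers = []
    for j in range(len(memberships[0])):
        w = [row[j] ** m for row in memberships]
        centers.append(tuple(
            sum(wi * p[k] for wi, p in zip(w, points)) / sum(w)
            for k in range(dims)))
    return centers

# Hypothetical one-dimensional example embedded in 2-D.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
U = fcm_memberships(pts, centers=[(0.0, 0.0), (4.0, 0.0)])
new_centers = fcm_centers(pts, U)
print(U)
print(new_centers)
```

In a full algorithm these two steps would be alternated until the memberships stop changing by more than a small tolerance.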
2. Related Work
Karaboga et al. [1] discuss the Artificial Bee Colony (ABC) algorithm, an
optimization procedure that simulates the foraging behavior of a honeybee swarm,
and apply it to data classification and clustering. The artificial colony consists of
three types of bees: employed bees, onlookers and scouts. An employed bee is
associated with a specific food source and searches the neighborhood of the source
held in its memory. Onlookers obtain information about food sources from the
employed bees in the hive and select one of the sources according to its nectar.
Scouts are responsible for finding new food sources. The number of employed bees
is equal to the number of food sources around the hive. The clustering problem is
stated as follows: N objects are given, and the aim is to allocate each object to one
of k clusters so that the sum of distances between the objects and the centers of
their clusters is minimized. The ABC algorithm is compared with other well-known
heuristic algorithms such as GA and PSO on the basis of performance. Thirteen
classification problems (such as glass, thyroid and wine, from the Glass
Identification, Thyroid Disease and Wine data sets) available in the UCI Machine
Learning Repository are used to evaluate the performance of the ABC algorithm.
The first 75% of the data is used for training and the remaining 25% for testing.
Performance is also evaluated and tested on the XOR, Decoder-Encoder and 3-bit
parity problems. To form clusters, each pattern is assigned to the class whose center
is closest to it, using the Euclidean distance measure. Averaged over all problems,
the ABC algorithm produced an error percentage of 13.13%, compared with 15.99%
for PSO. The ABC algorithm thus offers better-quality clusters and shows good
performance, and it ranked first overall among the optimization techniques
compared.
Chen et al. [2] discuss parallel spectral clustering; spectral clustering is widely
used in computer vision and information retrieval. The authors compare two
approaches to building a parallel spectral clustering algorithm in a distributed
environment: sparsifying the similarity matrix and the Nyström approximation.
"Spectral" refers to the use of eigenvalues, eigenvectors, singular values and
singular vectors. Spectral clustering is more effective at finding clusters than some
traditional algorithms, such as k-means. The algorithm constructs a similarity matrix
that reflects the relationships between the data points and then uses the information
in that matrix to group the points into k clusters. To construct the sparse similarity
matrix using nearest neighbors, the authors compute the distances between all data
points, modify the sparse matrix so that it is symmetric, and finally compute the
similarities. These three steps are implemented using MapReduce, Google's parallel
computing framework. To reduce memory use, the sparsification approach keeps
only the most useful entries of the similarity matrix, whereas the Nyström
approximation stores only several columns of the similarity matrix. Three data sets
were used in the experiments: Corel (images), RCV1 (documents) and PicasaWeb
(Google's online photo-sharing product). Clusters were also generated with k-means,
and the authors observed that spectral clustering finds better similarities in images.
In the evaluation, the parallel spectral clustering approach achieved speedups on up
to 256 machines and handled larger problems efficiently.
According to Senthilnath et al. [3], the Firefly Algorithm (FA) is well suited to
optimization problems in clustering. FA is a recent nature-inspired optimization
algorithm that simulates the flash pattern and characteristics of fireflies, and it is
used for difficult optimization problems in clustering. The algorithm works in the
following steps: the agents are randomly distributed in the search space, and the
objects are then separated into classes with the goal of finding the cluster centers.
The dataset is divided into a training dataset and a test dataset. Thirteen data sets
(the Balance data set, the Cancer-Int data set, etc.) from the UCI Machine Learning
Repository were used to show the results of the optimization techniques.
Performance is measured using the Classification Error Percentage (CEP) on all 13
benchmark data sets. CEP helps to evaluate which method has generated the optimal
cluster centers. The performance of FA is compared with that of two other
well-known optimization algorithms, Artificial Bee Colony (ABC) and Particle
Swarm Optimization (PSO). The results indicate that FA is accurate and robust
enough to be used efficiently for clustering and that it has better global
performance than the other optimization techniques. The performance of clustering
generally depends on the size and quality of the training data set. FA generates
cluster centers efficiently.
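The evaluation described in the reviews of [1] and [3] can be sketched as follows: each test pattern is assigned to the class whose center is nearest, and CEP is then the percentage of test patterns that were assigned to the wrong class. The sketch assumes Euclidean distance and that cluster centers have already been found by some optimizer; the centers, samples and labels are hypothetical and are not results from the surveyed papers.

```python
import math

def nearest_center_label(sample, centers):
    """Label of the class whose center is closest to the sample."""
    return min(centers, key=lambda label: math.dist(sample, centers[label]))

def classification_error_percentage(true_labels, predicted_labels):
    """CEP = 100 * (misclassified test samples) / (total test samples)."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return 100.0 * wrong / len(true_labels)

# Hypothetical class centers and test set.
centers = {"a": (0.0, 0.0), "b": (10.0, 10.0)}
test_samples = [(1.0, 1.0), (9.0, 9.0), (6.0, 6.0), (0.0, 2.0)]
test_labels = ["a", "b", "a", "a"]

predicted = [nearest_center_label(s, centers) for s in test_samples]
cep = classification_error_percentage(test_labels, predicted)
print(predicted, cep)
```

A lower CEP indicates that the cluster centers found by the optimizer separate the classes better on unseen data.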
Kim et al. [4] note that density-based clustering algorithms such as DBSCAN and
OPTICS are widely used. Density-based clustering procedures discover clusters of
arbitrary shape as dense regions of data points: given density parameters, they find
the clusters that form dense regions. When dealing with large amounts of data,
however, it is hard to parallelize such clustering algorithms using the MapReduce
framework. For large datasets the authors therefore propose a density-based
clustering algorithm, DBCURE, which finds clusters with varying densities and is
suitable for parallelization with MapReduce. To parallelize DBCURE using
MapReduce, the authors developed DBCURE-MR, which finds several clusters
together by expanding every core point in parallel, whereas traditional density-based
algorithms find each cluster individually. Three datasets, CLOVER, WINDOW and
BUTTERFLY, were used to evaluate the clustering. The overall finding of the study
is that DBCURE and DBCURE-MR find clusters efficiently and scale up well with
the MapReduce framework.
Kaymak et al. [5] used fuzzy clustering algorithms to divide a dataset into
groups so that the clusters describe a structure within the data. As in fuzzy logic,
in fuzzy clustering every point has a certain degree of belonging to different
clusters, rather than belonging to just one cluster. Fuzzy clustering is widely used in
fields such as finance and marketing. However, certain issues in fuzzy clustering
algorithms need to be taken care of, such as the volume and shape of the clusters,
the distribution of the data patterns, the initialization of the clustering algorithm and
the selection of the number of clusters in the data.
Marghescu et al. [6] explore advanced supporting tools for predicting currency
crises, based on an experimental study of currency crises in 23 emerging markets
around the world over a period spanning half a century. The authors built a fuzzy
c-means (FCM) model to partition the data points into overlapping groups and then
classified the clusters into early-warning clusters (EWCs) and tranquil clusters
(TCs). The FCM model is used to predict overall economic crises by testing and
evaluating a large number of samples.
Niknam et al. [7] discuss k-means clustering, a simple and efficient technique
for creating k clusters that converges to local optimal solutions. The k-means
technique is highly dependent on the initial positions of the centers and finds only a
local optimal solution. Niknam and Amiri (2010) present a new hybrid evolutionary
approach based on FAPSO (fuzzy adaptive particle swarm optimization), ACO (ant
colony optimization) and k-means, called FAPSO-ACO-K, which finds globally
optimal and correctly centered clusters.
Kocheturov et al. [8] analyze the stock markets of the USA and Sweden by
studying the dynamics of cluster structures in financial markets and then relating
them to crisis and non-crisis periods. Network analysis has become a powerful tool
for studying financial markets over the last 15 years. The authors build a network
structure with a predefined number of connected components from a correlation
matrix of the stock market. The structure is a forest of stars with weighted edges,
where every node represents a stock and the weights are equal to the measured
similarity between stocks.
Nanda et al. [9] presented a data mining methodology for classifying Indian
stock market data into different clusters. The clustering methodology classifies
stocks on certain investment criteria. Bombay Stock Exchange (BSE) data were
gathered from Capitaline Databases Plus, and data for 106 stocks were collected for
the fiscal year 2007-2008. K-means, self-organizing maps (SOM) and fuzzy c-means
clustering approaches were used to cluster the stock market data. The authors
collected mixed data from different sectors such as BSE BANKEX, BSE Auto, BSE
Pharma, BSE IT, BSE Midcaps and BSE. Comparing the portfolio performances
with the BSE Sensex benchmark index, the results show that the K-means method
turns out to be better.
Sastry et al. [10] used clustering techniques to detect differences in product
sales and to identify and compare sales over a particular time period. Clustering is
well suited to grouping items that seem to fall naturally together when there is no
specified class for a new item. The authors used annual sales data for steel products
to analyze sales volume and value with respect to dependent attributes such as
products, customers and quantities sold. The demand for steel products is cyclical
and depends on many factors, such as customer profile, price, discounts and tax
issues. The authors analyzed the sales data with clustering algorithms such as
K-means and EM (expectation-maximization), which revealed many interesting
patterns useful for improving sales revenue and achieving higher sales volumes. The
K-means and EM (partitioning) algorithms are better suited to evaluating sales data
than density-based procedures.
Fallahpour et al. [11] apply a clustering approach to classify 79 selected stocks
of Iran's stock market into a number of clusters. The data were collected during the
currency crisis in Iran's economy, which affected the Iran Stock Exchange severely
in the period from 22/09/2012 to 22/03/2013. Three well-known clustering
procedures were applied, namely K-medoids, K-means and X-means. The techniques
were evaluated using intra-class inertia, which indicates the density of the clusters
produced by each method. The comparison of intra-class inertia showed that the
K-means algorithm gives better quality than the K-medoids and X-means techniques.
The efficient number of clusters was determined using the Silhouette and
Davies-Bouldin indexes. The most desirable clusters from the five stock market
results show that k-means can create efficient portfolios.
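As an illustration of how an index such as Silhouette can guide the choice of clustering, the following sketch computes the mean silhouette coefficient for a given labeling. It assumes Euclidean distance and the usual convention that a point in a singleton cluster contributes zero; the sample points and labelings are hypothetical and unrelated to the stock data in [11].

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton contributes 0
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical points with a sensible and a poor labeling.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(pts, [0, 0, 1, 1])
bad = silhouette_score(pts, [0, 1, 0, 1])
print(good, bad)
```

To select the number of clusters, one would typically run the clustering for several values of k and prefer the k with the highest mean silhouette.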
Stetco et al. [12] classified companies listed on the London Stock Exchange to
identify groups of similarly performing companies based on their historical stock
price records. Fuzzy clustering analysis was carried out using a correlation-based
metric to obtain a more insightful classification of the companies into groups with
fuzzy boundaries, giving a realistic and detailed view of their relationships. The
authors analyzed the discovered clusters in terms of the volatility of their returns,
using both standard deviation and exponentially weighted moving average. This
approach has the potential to be of practical importance because it can detect fuzzy
clusters of correlated stocks that have lower inter-cluster correlation, analyze their
volatility and sample potentially less risky combinations of assets.
Miguéis et al. [13] analyze lifestyle segmentation of customers using data mining
techniques. A good relationship between customers and companies is a vital element
of competitiveness, and the management of loyalty relationships with customers is a
central strategic goal. Companies are therefore improving service levels and striving
to stay at the leading edge in order to ensure a good business relationship with their
customers. The authors extracted information from a large transaction database and
propose a market segmentation technique for retailing based on customers'
lifestyles. They used a variable clustering method to infer customers' lifestyles and
identified typical shopping baskets based on the products that are most frequently
bought together. A lifestyle segment is assigned to each customer based on
purchasing history. The proposed model was implemented in a European retailing
company.
Shim et al. [14] develop a CRM approach using association rules and sequential
patterns for a small-sized online shopping mall. Following the burst of the dot-com
bubble in 2002, many small-sized online shopping malls were established. Many of
them became known in the market because they had good relations with customers
and benefited from the characteristics of the online marketplace, including
significantly reduced menu costs for products and services, reduced search costs and
easy access to services and products from around the world. However, some online
shopping malls did not succeed and had to close, several of them because of poor
customer relationship management strategies. The authors analyze the customer
transaction data of an online shopping mall and derive sequential patterns and
association rules. First, they define the VIP customers based on recency, frequency
and monetary (RFM) values. Then they design a model that categorizes customers
into VIP and non-VIP. The authors used various data mining techniques, such as
decision tree, artificial neural network, logistic regression and bagging with each of
these as a base classifier. Lastly, they identify patterns and rules for VIP customers
from the transaction data using association rules and sequential patterns and then
design CRM strategies for the online shopping mall. References [16-25] outline
various software engineering and machine learning techniques in different domains.
3. Critical Evaluation
A comparative evaluation of the clustering techniques discussed above is shown in
Table 1.
Table 1. Critical Evaluation of Clustering Methods
| Ref # | Proposed technique | Strengths | Limitations | Possible improvement |
|---|---|---|---|---|
| [1] | Artificial Bee Colony (ABC) algorithm | The ABC algorithm is fast and robust, enhances accuracy and is used for optimizing multivariable functions. | Local search performance depends on neighborhood search and greedy selection, and the global search performance of the algorithm depends on a random search process. | To establish the true strengths of the ABC algorithm, it would be better to compare it with other optimization techniques such as Genetic Algorithms and Ant Colony Optimization. |
| [3] | Firefly Algorithm (FA) | FA finds cluster centers accurately and robustly, and its performance is better than that of other optimization techniques. | FA depends on the population size, which affects the computation time of the algorithm. | To get better results, FA could be compared with other optimization techniques such as ant colony optimization and genetic algorithms. |
| [4] | Parallel clustering algorithm called DBCURE-MR | DBCURE-MR speeds up performance by an order of magnitude, and it is effective in experiments with real-life datasets. | Time complexity is high. | Because large memory was not available, the buffer size was kept to only 500 MB on every machine in the simulation. To get a better result, the memory size on each machine should be enlarged. |
| [5] | Extended fuzzy c-means (E-FCM) and extended Gustafson-Kessel (E-GK) algorithms | The extended fuzzy c-means algorithm aims to reduce the sensitivity of the resulting clustering. | The algorithm may misclassify patterns because learning is unsupervised. | Real-world applications of the extended clustering algorithms need to be explored to truly analyze the properties of the algorithms. |
| [6] | Fuzzy c-means (FCM) model | The FCM model is used to predict overall economic crises by testing and evaluating a large number of samples. | The high rate of false alarms in the test sets is a major weakness of this model. | Currently, the model is tested for only three Asian countries: Indonesia, Korea and Taiwan. To get a better estimate of the results, data about currency crises in countries from other continents could be included. |
| [7] | FAPSO (fuzzy adaptive particle swarm optimization), ACO (ant colony optimization) and k-means algorithms, called FAPSO-ACO-K | Finds a better cluster partition and solves the nonlinear partitional clustering problem. | The algorithm still has difficulty with hard optimization problems. | To get a better understanding of its potential use, the proposed algorithm could be compared with other evolutionary algorithms such as artificial bee colony and the firefly algorithm. |
| [8] | Dynamic cluster structures | Dynamic cluster structures proved to be more stable during crisis periods such as the world financial crisis, the subprime mortgage crisis, the dot-com crisis and the banking crisis in Sweden. | The modularity-based approach reveals that the highest values of modularity and a relatively small number of communities help detect abrupt changes in the markets, such as the dot-com crisis, which is related to both types of markets. The world financial crisis is not observable as well from these trends, as there are only local quality threshold (Qt) maxima and minima of the community number. Thus the main difference between the modularity and the authors' measures is that the former gives a more contrasting picture of the crises. Traditionally, dynamic clustering algorithms are computed for large datasets. It is difficult to cluster a tiny dataset because of the loss of statistical characteristics and probability features. | The authors studied the Swedish financial market with 266 companies. To address the world financial crisis, PMP should be applied to more companies and market performance should be tested with other countries. |
| [9] | Clustering-based stock selection method | The proposed cluster-based approach considerably reduces the time required for creating a portfolio. Validity indexes were used in each case to find the optimal number of clusters. | The clustering algorithms were run on a limited dataset. The validity indexes were not reliable in some cases. | To obtain a more nearly optimal number of clusters in stock market data, it would be better to analyze the data with other clustering techniques. |
| [10] | K-Means and EM (expectation-maximization) | Reduces the risk of redundancy errors and allows global integration of product sales. | Systems are restrictive and not flexible in implementation and usage. ERPs are difficult to adjust to the specific workflow, which is the main cause of their failure. Systems can be difficult to use. | For a better analysis of sales data and the resulting revenue collection, the approach needs to be tested on other software such as SAP. |
| [11] | K-medoids, K-means and X-means | The Markowitz model creates optimized portfolios using the best applied techniques, and k-means meets the portfolio risk minimization goal through portfolio diversification. | With the applied method, k-means is not guaranteed to reach the global optimum. The Markowitz model is based on diversification; when portfolios are not well diversified, k-means is not able to return efficient stock portfolios. | Other well-known clustering procedures such as fuzzy clustering and expectation-maximization can be used to study how to minimize the effect of currency crises on the stock exchange. |
| [12] | Principal Component Analysis (PCA), compression and fuzzy clustering | The authors computed 43 eigenvectors that account for 90% of the variability in the stock data. Using PCA removes noise and improves data analysis performance. | 1. If the time scale is kept small (such as hourly or daily), then not only does the data become computationally more costly to analyze, but global trends might also be difficult to discover. In contrast, larger time intervals may erase local differences that may be useful for classification. Stock price is not a good indicator of performance, as it does not reflect the size and revenue of a company; neither is it a good comparator when considering multiple stocks. 2. In fuzzy clustering, data points belong to more than a single cluster, and correlation cannot be used directly as a metric in cluster analysis because it does not satisfy the non-negativity condition of metric functions. | To identify better stock prices over a long period, model-based techniques can be used. |
4. Conclusion and Future Work
A review of the clustering techniques proposed in the literature shows that each
technique has its own advantages and disadvantages. We would like to propose a
new hybrid technique based on evolutionary and optimization methods such as
Genetic Algorithms and Ant Colony Optimization, and to compare it with other
techniques in order to obtain better data clustering results. Real-world applications
of extended clustering algorithms need to be explored to analyze the properties of
the algorithms. Building on the available techniques is intended to overcome the
issues of performance, scalability and dimensionality. Our focus will be on
improving data clustering techniques with respect to performance and scalability.
The key contribution of this paper is an assessment of the data clustering techniques
used in data mining and pattern recognition. The reason behind the review is to
implement a new hybrid technique in an operational setting, keeping in view the
strengths and limitations of different data clustering techniques; the practicality and
productivity of current techniques are also critically analyzed.
References
[1] D. Karaboga and C. Ozturk, “A novel clustering approach: Artificial Bee Colony (ABC) algorithm”,
Applied Soft Computing, vol. 11, no. 1, (2011), pp. 652-657.
[2] W. Y. Chen, Y. Song, H. Bai, C. J. Lin and E. Y. Chang, “Parallel spectral clustering in distributed
systems”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, (2011),
pp. 568-586.
[3] J. Senthilnath, S. N. Omkar and V. Mani, “Clustering using firefly algorithm: performance study”,
Swarm and Evolutionary Computation, vol. 1, no. 3, (2011), pp. 164-171.
[4] Y. Kim, K. Shim, M. S. Kim and S. Lee, “DBCURE-MR: an efficient density-based clustering algorithm
for large data using MapReduce”, Information Systems, vol. 42, (2014), pp. 15-35.
[5] U. Kaymak and M. Setnes, “Extended fuzzy clustering algorithms”, ERIM Report Series Reference No.
ERS-2001-51-LIS, (2000).
[6] D. Marghescu, P. Sarlin and S. Liu, “Early-warning analysis for currency crises in emerging markets: A
revisit with fuzzy clustering”, Intelligent Systems in Accounting, Finance and Management, vol. 17, no.
3-4, (2010), pp. 143-165.
[7] T. Niknam and B. Amiri, “An efficient hybrid approach based on PSO, ACO and k-means for cluster
analysis”, Applied Soft Computing, vol. 10, no. 1, (2010), pp. 183-197.
[8] A. Kocheturov, M. Batsyn and P. M. Pardalos, “Dynamics of cluster structures in a financial market
network”, Physica A: Statistical Mechanics and its Applications, vol. 413, (2014), pp. 523-533.
[9] S. R. Nanda, B. Mahanty and M. K. Tiwari, “Clustering Indian stock market data for portfolio
management”, Expert Systems with Applications, vol. 37, no. 12, (2010), pp. 8793-8798.
[10] S. H. Sastry, P. Babu and M. S. Prasada, “Analysis & Prediction of Sales Data in SAP-ERP System
using Clustering Algorithms”, arXiv preprint arXiv:1312.2678, (2013).
[11] S. Fallahpour, M. H. Zadeh and E. N. Lakvan, “Use of Clustering Approach for Portfolio Management”,
International SAMANM Journal of Finance and Accounting, vol. 2, no. 1, (2014).
[12] A. Stetco, X. Zeng and J. Keane, “Fuzzy cluster analysis of financial time series and their volatility
assessment”, In Systems, Man, and Cybernetics (SMC), IEEE International Conference, (2013), pp. 91-96.
[13] V. L. Miguéis, A. S. Camanho and J. F. e Cunha, “Customer data mining for lifestyle segmentation”,
Expert Systems with Applications, vol. 39, no. 10, (2012), pp. 9359-9366.
[14] B. Shim, K. Choi and Y. Suh, “CRM strategies for a small-sized online shopping mall based on
association rules and sequential patterns”, Expert Systems with Applications, vol. 39, no. 9, (2012), pp.
7736-7742.
[15] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer and R. Wirth, “CRISP-DM”, NCR,
SPSS, DaimlerChrysler. USA, (2000), pp. 1-76.
[16] M. N. A. Khan and S. Ullah, “A log aggregation forensic analysis framework for cloud computing
environments”, Computer Fraud & Security, vol. 2017, no. 7, (2017) July, pp. 11-16.
[17] M. N. A. Khan and I. Wakeman, “Machine learning for post-event timeline reconstruction”, In First
Conference on Advances in Computer Security and Forensics, Liverpool, UK, (2006), pp. 112-121.
[18] S. Rahman and M. N. A. Khan, “Review of live forensic analysis techniques”, International Journal of
Hybrid Information Technology, vol. 8, no. 2, (2015), pp. 379-388.
[19] M. N. A. Khan, C. R. Chatwin and R. C. Young, “Extracting Evidence from Filesystem Activity using
Bayesian Networks”, International Journal of Forensic Computer Science, vol. 1, (2007), pp. 50-63.
[20] M. N. A. Khan, “Performance analysis of Bayesian networks and neural networks in classification of file
system activities”, Computers & Security, vol. 31, no. 4, (2012), pp. 391-401.
[21] M. N. A. Khan, C. R. Chatwin and R. C. Young, “A framework for post-event timeline reconstruction
using neural networks”, Digital Investigation, vol. 4, no. 3-4, (2007), pp. 146-157.
[22] M. S. Bashir and M. N. A. Khan, “Triage in live digital forensic analysis”, International Journal of
Forensic Computer Science, vol. 1, (2013), pp. 35-44.
[23] M. Rafique and M. N. A. Khan, “Exploring static and live digital forensics: Methods, practices and
tools”, International Journal of Scientific & Engineering Research, vol. 4, no. 10, (2013), pp. 1048-1056.
[24] R. Shehzad and M. N. A. Khan, “Integrating knowledge management with business intelligence
processes for enhanced organizational learning”, International Journal of Software Engineering and Its
Applications, vol. 7, no. 2, (2013), pp. 83-91.
[25] M. Khalid, S. ul Haq and M. N. A. Khan, “An assessment of extreme programming based requirement
engineering process”, International Journal of Modern Education and Computer Science, vol. 5, no. 2,
(2013), pp. 41.