Artificial Intelligence Review
https://doi.org/10.1007/s10462-020-09918-2
A survey onparallel clustering algorithms forBig Data
ZinebDar1 · YasmineLamari1· SaidChahSlaoui1
© Springer Nature B.V. 2020
Abstract
Data clustering is one of the most studied data mining tasks. It aims, through various methods, to discover previously unknown groups within the data sets. In the past years, considerable progress has been made in this field leading to the development of innovative and promising clustering algorithms. These traditional clustering algorithms present some serious issues in connection with the speed-up, the throughput, and the scalability. Thus, they can no longer be directly used in the context of Big Data, where data are mainly characterized by their volume, velocity, and variety. In order to overcome their limitations, the research today is heading to the parallel computing concept by giving rise to the so-called parallel clustering algorithms. This paper presents an overview of the latest parallel clustering algorithms categorized according to the computing platforms used to handle the Big Data, namely, the horizontal and vertical scaling platforms. The former category includes peer-to-peer networks, MapReduce, and Spark platforms, while the latter category includes Multi-core processors, Graphics Processing Unit, and Field Programmable Gate Arrays platforms. In addition, it includes a comparison of the performance of the reviewed algorithms based on some common criteria of clustering validation in the Big Data context. Therefore, it provides the reader with an overall vision of the current parallel clustering techniques.
Keywords Algorithms· Big Data· Clustering· Data mining· DBSCAN· FPGA· GPU·
k-means· MapReduce· MPI· Multi-cores CPU· Spark
1 Introduction
With the advent of the Big Data phenomenon, the data analysis techniques are currently
being modernized in order to address the emerging challenges. Data clustering is no excep-
tion to this trend. This long-established data mining technique is used to partition a set
of data instances into homogeneous subsets, such that each subset is formed by instances that are similar to one another and, at the same time, dissimilar to instances belonging to other subsets (Han et al. 2012). The primary objective is to discover previously unknown groups, which is a sought-after result in several problems in everyday life. This can be achieved through different categories of clustering methods such as hierarchical methods, partitioning methods, density-based methods, grid-based methods, or other clustering techniques (Fahad et al. 2014).
1.1 Challenges
Most traditional clustering algorithms are specialized and operate under specific conditions
to solve a particular type of problem. Besides, they are outdated and impractical in the
context of the Big Data due to their computational costs and their inability to handle het-
erogeneous data. They also suffer from their dependence on prior knowledge, data order,
and complex input parameters. In order to overcome these limitations, the research today is
heading to the parallel computing concept by giving rise to the so-called parallel clustering
algorithms. As the designation implies, this kind of algorithm can partition data sets into
several chunks, and then for each chunk, execute activities simultaneously on one or on
many processing devices. The intermediate clustering results are usually aggregated at the
end to produce the final result. The concept of parallelism aims to improve the speed-up,
the throughput, and the scalability of the clustering process so that it becomes effective to
meet the challenges of Big Data.
These challenges arise primarily from the characteristics that define the Big Data,
namely the volume, the velocity, and the variety of data, which stand for the 3V’s defini-
tion proposed by the Gartner company in Laney (2001). The first dimension refers to the
quantity of generated and stored data. Recently, the volume of data has been growing from petabytes toward zettabytes and continues to increase. Then, the second dimension represents the frequency at which data arrives and is updated, since data is constantly in motion.
The third dimension to be also considered is the variety of data, since data is collected
from multiple sources and in different forms. These dimensions describe the Big Data phe-
nomenon, and they serve also as criteria for evaluating algorithms that attempt to solve the
Big Data issues such as clustering algorithms.
1.2 Scope ofthearticle
This paper presents an overview of the latest parallel clustering algorithms categorized
according to the computing platforms used to handle the Big Data. Indeed, at some point,
we need to know what has been accomplished and what remains to be done regarding the
clustering of Big Data. Accordingly, this paper aims to provide the reader with an overall
vision of the parallel methods of clustering Big Data that have been developed recently as
well as the current research trends in this relevant field. In this respect, we opted for a recent classification of the different platforms for Big Data analytics, which is proposed
in Singh and Reddy (2014). According to this classification, two categories of platforms
of Big Data can be distinguished: the horizontal scaling platforms and the vertical scal-
ing platforms. The first category gathers systems that distribute the workload across many servers or commodity machines; it includes peer-to-peer networks, MapReduce, and Spark platforms. The second category brings together systems that work on a single server and allow adding resources such as processors, memory, and fast
A survey onparallel clustering algorithms forBig Data
1 3
hardware. This category includes High Performance Computing Clusters (HPC), Multi-
core processors, Graphics Processing Unit (GPU), and Field Programmable Gate Arrays
(FPGA) platforms. Figure1 illustrates the classification of different Big Data platforms.
1.3 Contributions
This section presents the fundamental contributions of this paper as follows:
1. Presents a technical overview of different Big Data platforms.
2. Surveys the new parallel clustering algorithms and their categorization according to the
platforms adopted to ensure their parallelization.
3. Introduces a comparison of the algorithms studied in terms of clustering validation
criteria and Big Data characteristics.
1.4 Organization
The remainder of this survey is organized as follows: Sect. 2 reviews the most popular platforms of Big Data. Section 3 presents the most recent parallel clustering methods classified based on the platform used. Section 4 compares the studied parallel clustering algorithms.
Fig. 1 Classification of Big data platforms
Z.Dafir et al.
1 3
Finally, Sect.5 summarizes the main observations of this survey and the corresponding
recommendations.
2 Overview ofthedierent platforms ofBig Data
This section presents a brief overview of the most popular platforms of Big Data. First, the
horizontal scaling platforms are presented, including MapReduce, Spark, and Peer-to-peer
networks. Then, the vertical scaling platforms are presented, which comprise Graphics
Processing Unit, Multi-core CPU, and Field Programmable Gate Arrays. These platforms
have been the basis for several designs and clustering algorithms of which some relevant
works are examined in this paper.
2.1 MapReduce
MapReduce is a popular parallel programming model, which was first introduced by
Google in Dean and Ghemawat (2004). It was designed to read, process, and write a huge
amount of data. This programming model consists of two main functions: the Map and the
Reduce functions. The Map function takes as input a logical record (also called a chunk of
data) and produces a set of intermediate key/value pairs. Once the Map phase is completed,
the Reduce phase starts processing the intermediate key/value pairs produced in the previ-
ous phase. Indeed, the Reduce function takes as input the set of intermediate key/value
pairs that share the same key, and then, merges together all values assigned to the input key
in order to produce the set of values associated with the corresponding key. There are two
other optional functions used to refine the execution of the programs: the Partitioner and
the Combiner functions. The Partitioner function splits the intermediate keys according to
the number of Reduce tasks or the number of output files which is specified by the devel-
oper. The Combiner function aims to summarize intermediate results produced by each
Map task in order to avoid potential repetitions, thus optimizing the transfer of data to the
Reduce task over the network. All these functions are programmable by the developer. Figure 2 describes the operational flowchart of the MapReduce programming model (Dean and Ghemawat 2004).
Fig. 2 Flowchart of MapReduce model
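To make these roles concrete, the following minimal sketch emulates the Map, Combiner, and Reduce phases in plain Python for a word-count job; the in-memory shuffle and the function names are illustrative simplifications, not the Hadoop API.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit an intermediate (key, value) pair for each word in the chunk.
    return [(word, 1) for word in chunk.split()]

def combine_phase(pairs):
    # Combiner: pre-aggregate the pairs produced by one Map task to cut network traffic.
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

def reduce_phase(key, values):
    # Reduce: merge all values sharing the same key into a final result.
    return key, sum(values)

chunks = ["big data clustering", "parallel clustering for big data"]
shuffled = defaultdict(list)
for chunk in chunks:                          # one Map (and Combiner) task per chunk
    for key, value in combine_phase(map_phase(chunk)):
        shuffled[key].append(value)           # shuffle: group intermediate values by key
results = [reduce_phase(key, values) for key, values in shuffled.items()]
print(sorted(results))
```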
A survey onparallel clustering algorithms forBig Data
1 3
2.2 Spark
Apache Spark is a big data processing framework designed for data-intensive applica-
tions and executed on commodity clusters (Zaharia et al. 2010). Unlike MapReduce, the Spark framework supports iterative jobs and allows running queries on big datasets by
loading only the useful dataset into memory. In this way, the execution time is reduced
considerably.
Spark introduces three fundamental aspects: resilient distributed datasets (RDDs), par-
allel operations, and shared variables. RDD is a collection of objects shared by a set of
machines that can be recovered in case of loss; it can be also stored in memory to reuse it
in multiple parallel MapReduce jobs. The second aspect is the parallel operations which
can be performed on RDDs. It includes the Reduce, Collect and Foreach operations. The
last aspect consists of broadcast variables and accumulators.
Spark is flexible, easy to use and does not need any abstraction to program. It pro-
cesses the data in real-time using the Spark Streaming module and caches partial results
in memory using distributed workers. Moreover, Spark is efficient and exceeds the Hadoop
MapReduce framework by 10× in interactive machine learning workloads while preserving
the fault tolerance and scalability of MapReduce. However, Spark shows some limitations,
such as the fact that it requires large resources, and it is also expensive in terms of memory.
Figure3 depicts the operational flowchart of the Spark model (Zaharia etal. 2010).
2.3 Peer‑to‑peer networks
Peer-to-peer (P2P) networking represents a distributed architecture that divides tasks
among peers. The definition of P2P networking includes any type of network architec-
ture which does not need a server to control the transfer of information between contrib-
utors, and which makes a part of their resources accessible by other contributors in the
same network (Milojicic etal. 2002). Hence, P2P systems allow valuable externalities,
lower cost of ownership and sharing, and finally anonymity. The most used scheme in
this platform is the Message Passing Interface (MPI). The fundamental idea behind the
standard MPI is to provide the necessary abstractions in order to ensure the communica-
tion between peers. It is also characterized by its ability to keep processes alive during
the running of the system, which means that there is no need to read the data several
times from the disk. This quality makes the MPI suitable for iterative jobs.
The architecture of a P2P network makes it possible to add nodes dynamically and thus to scale up efficiently to the larger size that may be needed to process applications with a huge amount of data, as shown in Fig. 4. This kind of platform is known to be resistant to failures, which means that when a node experiences a local failure, it has no impact on the remaining nodes. Another advantage lies in the fact that the capacity of a P2P system increases with the addition of new nodes. However, adding new nodes to the system can slow down the transfer of data to the connected users. P2P systems also suffer from security issues and require high bandwidth usage.
Fig. 4 A typical architecture of the peer-to-peer network
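A minimal sketch of this message-passing style, assuming the mpi4py bindings, is shown below: a master process scatters chunks of data to the peers and gathers back their local summaries. It is a generic illustration rather than the scheme of any surveyed algorithm.

```python
# Run with, e.g.: mpiexec -n 4 python mpi_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    # Master: split the data into one chunk per process (including itself).
    data = np.random.rand(1000, 2)
    chunks = np.array_split(data, size)
else:
    chunks = None

# Scatter the chunks, let every process compute a local summary, then gather them.
chunk = comm.scatter(chunks, root=0)
local_centroid = chunk.mean(axis=0)
centroids = comm.gather(local_centroid, root=0)

if rank == 0:
    print("per-process centroids:", centroids)
```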
A survey onparallel clustering algorithms forBig Data
1 3
2.4 Graphics processing unit
A Graphics Processing Unit (GPU) is a single-chip processor which was initially dedicated to processing 2D and 3D calculations. As input, it receives geometry primitives from
the CPU in a 3D form. Then it transforms them from individual vertices into pixels, which
are shaded and mapped onto the screen. In order to produce the final image, the processed
pixels are combined to form an output destined to be visualized in a display device. These
stages form the so-called graphics pipeline, as shown in Fig. 5 (Owens et al. 2008). The GPUs follow the single program multiple data (SPMD) as a programming model and the single instruction, multiple data (SIMD) as parallel architecture.

Fig. 5 A typical structure of graphics pipeline
Today, modern GPUs are gaining increasing attention due to their massively parallel
processing architecture that accelerates the performance of applications requiring high
floating point calculations. Indeed, GPUs are not beneficial only for graphics applications;
they are also used to perform non-specialized calculations giving rise to the general pur-
pose graphics processing unit (GPGPU) computing. In order to take full advantage of the
parallelism offered by GPUs, new parallel programming languages, such as CUDA (Nick-
olls etal. 2008) and OpenCL (Stone etal. 2010), have emerged. These languages simplify
and improve the performance of the linear algebra on the GPUs (Owens etal. 2008).
GPUs have demonstrated energy and cost efficiency for arithmetically intense and streaming-memory problems. However, their major drawback lies in their limited memory capacity, which often necessitates complex memory management.
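As a hedged example of GPGPU computing from a high-level language, the sketch below uses the CuPy library (assumed to be installed together with a CUDA-capable device) to assign points to their nearest centroid entirely on the GPU; the array sizes are arbitrary.

```python
import cupy as cp

# Toy data: 100,000 points and 8 centroids, allocated directly in GPU memory.
points = cp.random.rand(100_000, 2).astype(cp.float32)
centroids = cp.random.rand(8, 2).astype(cp.float32)

# Squared Euclidean distances via broadcasting: shape (n_points, n_centroids).
diff = points[:, None, :] - centroids[None, :, :]
distances = cp.sum(diff * diff, axis=2)

# Nearest-centroid assignment computed on the GPU; labels copied back to the host.
labels = cp.argmin(distances, axis=1)
print(cp.asnumpy(labels)[:10])
```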
2.5 Multi‑core CPU
A Multi-core platform is a processor that integrates multiple cores in a single chip. Gener-
ally, there are three commonly recognized architectures of multi-core CPU (Akhter and
Roberts 2006). The first architecture shares the on-chip cache between execution units,
while the second architecture provides a dedicated cache for each execution core. The third
architecture adopts a hybrid approach that subdivides the cache into two types of layers:
layers dedicated to a particular execution core and others shared by all execution cores.
Figure6 describes a typical architecture of multi-core platform.
In order to take advantage of the parallelism offered by the multi-core platforms, one has
to consider the allocation of the work among all available processors (Gepner and Kowalik
2006). The multi-threading model is a common way that allows the parallel execution on
a multi-core platform. This is achieved by dividing the work into separate execution units
which can run on the different processors at the same time.
Multi-core platforms offer high performance and low heat generation. They are useful for applications with a high degree of parallelism that can profit from all the available cores; otherwise, multi-core platforms are not a practical choice. This entails a substantial effort to parallelize an application as far as possible, which is not always feasible.
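The snippet below is a minimal sketch of this division of work on a multi-core machine using Python's standard concurrent.futures module: the input is split into chunks and each chunk is handled by a separate worker process; the chunking scheme and the per-chunk function are purely illustrative.

```python
from concurrent.futures import ProcessPoolExecutor
import os

def process_chunk(chunk):
    # Placeholder for per-chunk work (e.g., a local clustering pass).
    return sum(x * x for x in chunk)

def split_into_chunks(data, n_chunks):
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = split_into_chunks(data, os.cpu_count() or 4)
    # One worker process per core; partial results are combined on the main process.
    with ProcessPoolExecutor() as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    print(sum(partial_results))
```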
2.6 Field programmable gate arrays
A Field Programmable Gate Array (FPGA) is an integrated circuit device. It is composed
of an array of programmable logic blocks, in addition to a hierarchy of reconfigurable inter-
connects and I/O blocks (Brown etal. 1992). Different architectures of logic blocks can
be conceived in order to form a complex circuit. Such circuits contain several other sub-
circuits and have more than one output. These blocks are interconnected via reconfigurable interconnects, which consist of wire segments and programmable switches. Like the logic
blocks, the structure of programmable switches can be conceived in different ways. The
configurations of all components of the FPGA are described using a hardware description
language (HDL).
The main advantages of FPGAs lie in the reduced costs of prototypes they offer, in
addition to their expandability and flexibility. In fact, the flexibility is regarded as both an
advantage and a drawback since it makes the FPGAs larger, slower, and more power con-
suming (Farooq etal. 2012).
FPGAs are present in a wide range of applications and can, in principle, implement any computational problem; they are especially suited to applications that can exploit the parallelism available on FPGAs.
Fig. 6 A typical architecture of multi-core platform
A survey onparallel clustering algorithms forBig Data
1 3
3 Parallel clustering algorithms
This section exposes the most recent and relevant parallel algorithms for clustering Big
Data. The aim is to explore a variety of types of clustering, such as partitioning-based
clustering, density-based clustering, bio-inspired methods and many other techniques of
clustering. First, we describe the parallel clustering algorithms based on horizontal scaling
platforms. Then, we describe the parallel clustering algorithms based on vertical scaling
platforms. Figure7 presents the main works reviewed in this paper.
3.1 Horizontal scaling platforms‑based clustering algorithms
As mentioned above, the horizontal scaling platforms considered in this survey are MapRe-
duce, Spark, and Peer-to-Peer networks. This section covers a selection of clustering algo-
rithms that are implemented using this kind of platforms.
3.1.1 Clustering algorithms using MapReduce
The work proposed in Cui etal. (2014) is a processing model in MapReduce which elimi-
nates the iteration dependence of the k-means algorithm through a sampling technique. The
main idea of this optimized k-means is to estimate the iterations using a sampling tech-
nique in order to obtain only some subsets from big datasets. Then, by processing these
subsets, the sets of centers are constructed and used to cluster the original datasets. The
proposed algorithm consists of three MapReduce jobs. The first job is responsible for sam-
pling the original dataset. The second job performs the samples clustering step in mappers,
and then the merging step in one reducer in order to produce k final centers from the inter-
mediates centers. For this purpose, the authors introduced two novel methods for merging:
weight-based merge clustering (WMC) and distribution-based merge clustering (DMC).
At last, the third job generates the Voronoi diagram using k points from the previous job,
partition the original dataset, and then, obtain the final clustering result. The experiments
on synthetic and real datasets show that the proposed method performs better compared
with other parallel versions of the k-means algorithm.
In the same context, Li etal. (2014) suggested Multiplex k-means, a parallel execu-
tion of multiple k-means processes using MapReduce. In the proposed method, several pro-
cesses are launched serially, then only the best result is considered. The execution of these
processes is costly in terms of time and resources, unlike the proposed Mux k-means algo-
rithm, which simultaneously runs multiple k-means using different centroid groups, and
keeps the best solution at the end. The proposed algorithm involves four steps. It first runs
the k-means processes using Map and Reduce operations. The Map operation calculates
the distance between the current point and all existing centroids, while the Reduce opera-
tion is responsible for updating the centroids. After each iteration, the quality of clustering
result is evaluated based on the Total Within-Cluster Variation (TWCV) value. The TWCV metric is the sum of the squared distances between each point and the centroid of its group. The lower the TWCV value, the higher the clustering quality. Then, in the
second step, the groups with a lower value of TWCV are pruned. The third step consists
of swapping similar centroids. Finally, the last step consists of generating new centroids
using two methods: the Random Search with a definite Scope (RSDS) and the Average of
Dissimilar Group Pairs (ADGP). This process is repeated until the centers become steady.
The algorithm was implemented using the Hadoop MapReduce framework and tested with
real-life datasets. The experiments show that the Mux-k-means can achieve better results
than when using other serial versions of the k-means algorithm.
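For reference, the TWCV criterion can be written as follows, using squared Euclidean distances (the usual convention), where C_j denotes the j-th group of points and μ_j its centroid; this notation is assumed here rather than quoted from the paper.

```latex
\mathrm{TWCV} \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}
```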
MR-DBSCAN (He etal. 2014) is a MapReduce-based implementation of the well-
known DBSCAN algorithm. This work introduces a new cost-based data partition-
ing method in order to take into account the density of points. The proposed method is
a 3-stage end-to-end solution. The first stage, which is called data partitioning, produces
small partitions from the whole dataset according to the spatial proximity. The second
stage, which is called local clustering, groups partitions independently. Finally, the third
stage, which is called global merging, aggregates the produced partitions in the previous
stage in order to form the final result. This latter is achieved in two steps: build merge map-
ping and relabel data. The build merge mapping step first determines all pairs of intersect-
ing partitions, then computes the global clusters and builds a mapping from local to global
clusters. And the relabel data step adjusts the intermediate results of local clustering by
replacing local cluster ID’s with global ones and determines the type of all points. The
MR-DBSCAN was evaluated using two large real-life datasets. A set of experiments were
carried out in order to study the performance changes when varying some sensitive param-
eters related to the DBSCAN algorithm, such as MinPts, ε, and the partitioning method. The
results of experiments confirm the efficiency and scalability of the proposed method.
MR-ABC (Banharnsakun 2017) is a MapReduce-based implementation of the arti-
ficial bee colony (ABC) for large-scale data clustering. The proposed method optimizes
A survey onparallel clustering algorithms forBig Data
1 3
the assignment of large data to clusters through the minimization of the sum of squared
Euclidean distance. The main operations involved in this method are to update centroids
of clusters and to evaluate the fitness. First, the proposed method generates the initial solu-
tions. Then, it updates the new centroid values for the employed bees. After that, the fitness
is calculated and evaluated based on the sum of the squared Euclidean distance. Since this
task is time-consuming, the author uses the MapReduce model in order to calculate the
fitness value. Then, each onlooker bee selects the centroid values which produce a higher
fitness from employed bees and updates them. This process reiterates until the number of
iterations reaches a threshold value. In order to evaluate the MR-ABC algorithm, experi-
ments were conducted on synthetic and real-life datasets. The results obtained show that
the proposed method outperforms the PKMeans (Zhao etal. 2009) and the parallel K-PSO
(Wang etal. 2012) algorithms in terms of quality. Besides, the running time and speed-up
results of MR-ABC with 10 Hadoop cluster nodes demonstrated that the MR-ABC can pro-
cess large amounts of data in reasonable time.
In Yang and Li (2013), the authors studied the Ant Colony optimization algorithm
in order to propose a new approach for big data semantic clustering (MBSC) based on
MapReduce. The parallel implementation of Ant Colony Clustering method based on
semantic content involves a single MapReduce job. The Map task splits the data records
into data chunks. Then, it defines the key-value pairs as the length of traversal path without
dropping records and the set of traversed nodes. In each step, the Map function reads the
pheromone value and calculates the swarm similarity, which is transformed to probability
value, in order to decide either to drop or to select a record. In this way, the data records
with the same similarity are gathered in the same cluster. The Reduce task collects solutions from all data chunks given by ants and then updates the pheromone value for the next iteration. The process is repeated until the final result is reached. The comparison shows that the MBSC algorithm is more efficient than k-means in terms of time when considering different MBSC parameters.
In Jin etal. (2013), the authors proposed a parallel method of spectral clustering using
MapReduce. The spectral clustering is based on the evaluation of the sparse matrix eigen-
value. The calculation of the similarity matrix and other parameters used in this algorithm
are expensive, hence the need to use MapReduce in order to reduce the computation time.
The proposed parallelization of the spectral clustering consists of three steps. The first step
calculates the similarity matrices, which are simplified by the use of the Map and Reduce
operations. The second step calculates the k smallest eigenvectors using the Lanczos algorithm (Lanczos 1950). Finally, the third step parallelizes the k-means algorithm. The
parallelization of k-means is divided into two fundamental steps. The Map function calcu-
lates the nearest centroid for each point. Then, the combiner function partially merges sam-
ples with the same centroid and sends the result to the Reduce function in the form of pairs,
where the key is the centroid and the value is the list of points having the same centroid.
The Reduce function collects the points with the same centroid and updates the centroids
values by calculating the average of each set of points assigned to the centroids. The pro-
cess reiterates until stable centroids are obtained. The experiments were performed on a classic dataset and evaluated several properties (correctness validation, speed-up ratio, and scalability analysis).
processing large datasets.
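The map/combine/reduce decomposition of one k-means iteration described above can be sketched in plain Python as follows; this is a schematic single-machine emulation under assumed data structures, not the authors' MapReduce implementation.

```python
import numpy as np
from collections import defaultdict

def map_assign(point, centroids):
    # Map: emit (nearest centroid index, point) for one data point.
    distances = np.linalg.norm(centroids - point, axis=1)
    return int(np.argmin(distances)), point

def combine_partial(pairs):
    # Combiner: per split, pre-aggregate the sum and count of points per centroid.
    sums, counts = defaultdict(lambda: 0.0), defaultdict(int)
    for idx, point in pairs:
        sums[idx] = sums[idx] + point
        counts[idx] += 1
    return [(idx, (sums[idx], counts[idx])) for idx in sums]

def reduce_update(partials, k, dim):
    # Reduce: merge the partial sums and recompute each centroid as an average.
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for idx, (s, c) in partials:
        sums[idx] += s
        counts[idx] += c
    return sums / np.maximum(counts[:, None], 1)

points = np.random.rand(1000, 2)
centroids = points[np.random.choice(len(points), 3, replace=False)]
for _ in range(10):  # iterate until the centroids stabilize (fixed count here)
    pairs = [map_assign(p, centroids) for p in points]
    centroids = reduce_update(combine_partial(pairs), k=3, dim=2)
print(centroids)
```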
In Sun etal. (2014), the authors developed a parallel method combining the information
bottleneck (IB) theory with centroid-based clustering. Their main contributions include
the use of the IB theory-based hierarchy clustering to determine the centroid of each Map
computational node, in addition to the use of an objective method in order to determine the
Z.Dafir et al.
1 3
number of clusters. The parallel IB theory is based on MapReduce. The MapReduce job
is designed with multiple Map tasks and a single Reduce task. In this method, the data is
divided into partitions, and then each partition is treated independently and in parallel by a
Map computational node. In each Map computational node, the IB theory-based clustering
method is applied to each partition in order to obtain the sub-centroid. All sub-centroids
are gathered in the Reduce task in order to create a new dataset. Then, the IB theory-based
clustering method is once again applied to the new dataset in order to generate the ini-
tial centroid of the global dataset. Once the initial center is calculated, the parallel cen-
troid clustering based on an iterative MapReduce model, called Twister (Ekanayake etal.
2010), is applied. It is designed with multiple Map tasks and a single Reduce task, which
returns its output to the Map tasks iteratively. The parallel centroid clustering proceeds as follows. First, the initial sample dataset is partitioned and the initial centroids obtained
previously are mapped to each computational node. In each Map computational node, the
sub-centroids are recalculated with centroid-based clustering method. All sub-centroids are
gathered in the Reduce computational node and the global centroid is updated. Then, the
new centroids are sent to the main computational node to be used in the next iteration. The
process stops when a certain difference, which is measured with Kull-back divergence, is
less than the fixed threshold value. In order to visualize the clustering results, the multidi-
mensional scaling (MDS) has been applied, as a dimension reduction method, on the DNA
data used in the experiments. The results show that the developed method is better than a
typical parallel k-means implementation.
The Best of both Worlds (BoW) approach, which is proposed in Cordeiro etal. (2011),
combines two parallel clustering methods: the Parallel Clustering (ParC) and the Sample-
and-Ignore (SnI), in order to reduce the I/O and network costs. The ParC method is exe-
cuted in five phases using three different strategies for partitioning data. In the first phase,
the data is distributed to the mappers from the distributed file system. Each mapper com-
putes the key value of the received data element. In the next phase, each reducer processes
the elements with the same key. And then, it normalizes them and runs the plugged-in
clustering algorithm in order to obtain the β-clusters from each reducer. In the following phase, the algorithm finds a pair composed of the reducer description as key and the cluster description as value. The last phase is executed serially by putting together the β-cluster
pairs which overlap in the space. The SnI method consists of executing the ParC algorithm
after applying a pre-processing step. This step starts by sampling the dataset and finding initial clusters, so that only the unclassified elements need to be clustered using the ParC algorithm. The
BoW method chooses the least expensive clustering strategy in terms of time and other
parameters using a cost-based optimization. The proposed approach has been developed
using Hadoop MapReduce framework, and tested with up to 1024 cores in parallel, on real-
life and synthetic datasets. The results show that the BoW approach is scalable, and gives
approximately the same results compared to the serial clustering algorithms.
3.1.2 Clustering algorithms using Spark
A recent algorithm, called Parallel Kernel Kmeans, has been introduced in Tsapanos etal.
(2016). The proposed method, which expands a previous work called the Trimmed Kernel
k-Means (Tsapanos etal. 2015), handles clustering of large datasets using Spark frame-
work. The proposed method begins by sub-sampling the data to perform the clustering
of large datasets. Then, it calculates the Kernel matrix iteratively using MapReduce and
writes it to the disk for future use. Thereafter, it reads the Kernel matrix from the disk in
A survey onparallel clustering algorithms forBig Data
1 3
order to compute the Kernel matrix trimming. The calculation of this matrix consists of
two main steps. The first step finds the winning cardinality of each node and trims rows accordingly using the Map function, while the second step sums up in the same vector the
votes of the cardinality of each cluster using the Reduce function. The Map and Reduce
functions are used again to remove the winning votes. This process reiterates until finding
the cardinality of all nodes. The next step applies the Kernel k-means to adjacency lists
given by the calculated Kernel matrix trimming. The process of Kernel k-means proceeds
as follows: first, it associates each sample to a cluster randomly and then calculates the par-
tial sum of the entries retrieved from the adjacency list by a mapping function. The reduce
function calculates total sums from every cluster. Thereafter, the map function performs
the distance computations in order to define the new cluster assignment for each node.
The last distributed operation is the nearest neighbour assignment. The performance of the
proposed method was evaluated using the Youtube Faces dataset, and compared with the
approximate Kernel k-means and the Trimmed Kernel k-means algorithms. The clustering
results show that the Parallel Kernel k-means algorithm is more efficient than the Trimmed
Kernel k-means algorithm, and yields results close to those given by the approximate Ker-
nel k-means algorithm.
The work suggested in Mallios etal. (2016) is a framework for clustering and classi-
fication large amount of data using Spark.The k-means and ID3 algorithms were studied
and implemented using the proposed framework. The proposed framework is a parallel
execution of multiple round-trips performed by the master node and several workers. Each
round-trip consists of three main phases: Local Pre-processing, Global Distributed Pro-
cessing, Local Post-processing. In the first phase, the master node requests to the workers
to compute intermediate results from the stored data as part of the Map function. In the
second phase, the workers swap the results obtained, and then collect the global informa-
tion. In the last phase, each worker completes its own computations based on the interme-
diate results obtained from the previous phase. After having completed these phases, the
master node runs a new round-trip. The data, which is swapped between the master node
and the workers or between workers, operates in three modes: No-restriction mode, Par-
tially Restricted Data Exchange Mode and Strict Restricted Data Exchange Mode. Addi-
tionally, the data processed by the workers can be executed differently according to the
algorithm used in the proposed framework. The results show that the algorithms tested
in the proposed framework were efficient in terms of time and scalability. Moreover, the k-means algorithm implemented using the proposed framework exceeds by 31% the quality of results provided by the k-means algorithm using Spark.
There are several other works related to the implementation of k-means using Spark,
such as Wang etal. (2016), Zayani etal. (2016) and Sinha and Jana (2016).
In Wang etal. (2016), the authors conceived a parallel k-means algorithm using
Spark. The proposed design can select the appropriate variant of the k-means algorithm
and the distance function to use. The algorithm reads data from the HDFS and writes it
in RDDs blocks using two different strategies. In the first strategy, each data instance
is loaded directly into the RDD block in the form of a set of dense vectors, while in
the second strategy it is represented by a set of sparse vectors. The next step computes
the distance between each two sparse vectors in parallel. Finally, the last step performs
two different methods for updating the centroids according to the clustering type used
(crisp clustering or fuzzy clustering). For the crisp clustering, the assigning and the
distance calculation steps are performed as part of the Map function, whereas the cen-
troids updating step is performed as part of the Reduce function. This latter consists of
Z.Dafir et al.
1 3
collecting the instances with the same centroid and then computing their average value.
For the fuzzy clustering, the centroids updating step is performed using a predefined
equation.
Another work related to the k-means algorithm is proposed by Zayani etal. (2016),
which is called the Parallel Overlapping k-means method (POKM). It was designed to per-
form non-disjoint partitioning of large-scale data. The proposed work is based on a geo-
metrical method, called Overlapping k-means (OKM), which consists of two steps: assign-
ing each instance to multiple clusters and updating centroids. These steps are repeated until
reaching the convergence criterion. Following the same manner, the Spark-based version
is based on two steps: the data assignment and prototypes calculation steps. Once the input
data is partitioned, the first step would be to apply a Map function for each block, which
assigns the instances to multiple clusters, in order to parallelize the assigning procedure
locally. Then, another function operates on the global clusters, which are produced previ-
ously, by sorting the prototypes of clusters according to their proximity to the processed
instance. This process assigns the clusters from the nearest to the farthest as long as a
predefined criterion decreases. The Prototypes calculation step is performed by Map and
Reduce tasks. The Map task performs local computations for each cluster and sends as
output a set of key-value pairs, where the key is the index of the prototype cluster and value
is a couple of computed parameters. The Reduce task updates the prototype clusters and
returns the final ones.
In the same context, Sinha and Jana (2016) presented a method based on the classical k-means algorithm to process big data streams using the Spark framework. The proposed
algorithm consists of four main processes. The first process consists of generating initial
clusters using a large value of k. The second process applies the k-means algorithm to the k
clusters. The next process consists of merging the centroids that are spaced from each other
by less than a certain threshold value. The last process returns the final clusters by merging
the previously obtained clusters. This algorithm is characterized by being free of over-resolution.
The reviewed methods demonstrated their efficiency in terms of scalability and quality
of clustering large data.
The work introduced in Bharill etal. (2016) is a scalable random sampling and an itera-
tive optimization fuzzy c-means algorithm. The proposed method, which is called SRSIO-
FCM, divides randomly the dataset into many subsets. Then, it generates the centers of
clusters randomly in order to cluster the first subset. The cluster centers and the member-
ship information of the first subset are calculated, and then they are given as input to clus-
ter the second subset. In turn, the cluster centers and membership information of the sec-
ond subset are calculated. After the first two iterations that present a particular case, the
membership information of all the processed subsets are combined in order to compute the
new cluster centers, which will be given as input to cluster the next subset. The following
subsets are clustered in the same manner. The authors proposed a parallel implementation
of the LFCM algorithm (Havens etal. 2012) using Spark, which they called the SLFCM
algorithm (Scalable Literal Fuzzy c-Means), in order to handle the parallel computation of
membership information and centers of clusters. It performs Map and ReduceByKey oper-
ations, where the Map phase is responsible for the membership degree calculation of a data
point with respect to each center, and the ReduceByKey phase is responsible for updating
the values of the center of cluster based on the output of the Map phase. The performance
of the SRSIO-FCM algorithm was compared with the Spark based-implementations of two
well-known fuzzy clustering algorithms. The experiments on some big datasets demon-
strated that the proposed method achieves almost the same quality of clustering results in
less time.
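For context, the membership degree computed in the Map phase typically takes the standard fuzzy c-means form below, where u_ij is the membership of point x_i in cluster j, v_j is a cluster center, c is the number of clusters, and m > 1 is the fuzzifier; the formula is the textbook one and is given here as background rather than quoted from the SRSIO-FCM paper.

```latex
u_{ij} \;=\; \left( \sum_{l=1}^{c} \left( \frac{\lVert x_i - v_j \rVert}{\lVert x_i - v_l \rVert} \right)^{\frac{2}{m-1}} \right)^{-1}
```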
A survey onparallel clustering algorithms forBig Data
1 3
The work proposed in Gouineau etal. (2016) is a distributed density clustering algo-
rithm to analyze very large datasets, called Patchwork. It was designed using the MapRe-
duce model to parallelize the calculations and implemented with Apache Spark. The Patch-
work algorithm belongs to the family of density clustering algorithms, which are known for their linear computational complexity and near-linear horizontal scalability. It consists of two funda-
mental steps. The first step consists of dividing the multidimensional feature space into a
grid in order to determine dense regions and then finding the ID of the cell (hypercube in
D-dimensional feature space) for each point using the Map function. After that, the Reduce
function collects the tuples with the same cell ID and constructs collections of cells with
their density. The second step consists of creating clusters using the collections of cells and
sorting them by decreasing density until all cells are processed. An optional filter can be used to help find clusters with enough cells. The proposed algorithm was compared with
Spark MLLib k-means and Spark DBSCAN using four Synthetic datasets. It is efficient and
yields 40× better results than those produced by Spark MLLib k-means.
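The grid step of this kind of density clustering can be sketched as follows: each point is mapped to the ID of its hypercube cell and cell densities are then aggregated; the cell width, dimensionality, and density threshold are arbitrary assumptions for illustration.

```python
import numpy as np
from collections import Counter

def cell_id(point, cell_width):
    # Map step: a hypercube is identified by the integer grid coordinates of the point.
    return tuple(int(np.floor(coordinate / cell_width)) for coordinate in point)

def cell_densities(points, cell_width):
    # Reduce step: count how many points fall into each cell of the grid.
    return Counter(cell_id(p, cell_width) for p in points)

points = np.random.rand(10_000, 3)
densities = cell_densities(points, cell_width=0.1)
dense_cells = {cell: count for cell, count in densities.items() if count >= 20}
print(len(dense_cells), "dense cells")
```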
In Liu etal. (2017), the authors presented a parallel implementation of the density peaks
clustering algorithm (Rodriguez and Laio 2014) using the Spark’s API for graphs compu-
tation, called GraphX. The authors noticed that the original method needs to calculate the
distances between all pairs of data points which results in high computational cost. In order
to overcome this issue, they have proposed a parallel version of the density peaks cluster-
ing algorithm using the Spark RDD model. The process starts by initializing the thresholds
values of the local density and the distance from points of higher density and then gener-
ating a graph. The graph construction requires first importing separately vertex and edge
data stored in the HDFS to vertex RDD and edge RDD, then computing the distance and
updating its value in each edge, and finally combining vertex RDD and edge RDD to form
a Graph. The next step consists of computing the truncated distance, the local density, and
then the distance from points of higher density. The last step performs the clustering of
data points by selecting cluster centers, isolated points, and then performs the classification
based on the thresholds fixed previously. The experimental results demonstrated that the
proposed implementation can be up to 10× faster than the density peaks clustering algo-
rithm when implemented using MapReduce.
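For orientation, the two quantities at the heart of density peaks clustering, the local density ρ and the distance δ to the nearest point of higher density, can be computed naively as in the sketch below, which is quadratic in the number of points (precisely the cost that the parallel GraphX version seeks to reduce); the cutoff distance is an assumed parameter.

```python
import numpy as np

def density_peaks_quantities(points, cutoff):
    n = len(points)
    # Pairwise Euclidean distances (the O(n^2) step parallelized in the surveyed work).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Local density: number of neighbours closer than the cutoff distance.
    rho = (dist < cutoff).sum(axis=1) - 1
    # Delta: distance to the nearest point with a strictly higher local density.
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = dist[i].max() if higher.size == 0 else dist[i, higher].min()
    return rho, delta

points = np.random.rand(500, 2)
rho, delta = density_peaks_quantities(points, cutoff=0.05)
centers = np.argsort(rho * delta)[-3:]   # candidate centers: high rho and high delta
print(centers)
```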
Remaining within the context of the density-based clustering methods, several studies
have been conducted on the DBSCAN algorithm in order to propose parallel implementa-
tions based on Spark, such as Luo etal. (2016), Han etal. (2016) and Cordova and Moh
(2015).
The S_DBSCAN algorithm, introduced in Luo etal. (2016), starts by partitioning the
raw data based on random samples. It produces partial clustering results by performing
local DBSCAN algorithm in parallel. At this stage, the Map task generates partial clusters,
while the ReduceByKey task saves each partial cluster as a new RDD to the HDFS, then
computes the centroid for each partial cluster. Finally, it merges the independent clustering
results obtained in the previous steps based on the centroid.
The proposed implementation in Han etal. (2016) starts by reading data from the HDFS
and then generates RDDs in order to transform them into data points. In the next step, multiple executors independently build partial clusters and send them back to the driver. When
all the partial clusters are collected, the merging process starts identifying the clusters
that need to be merged based on a new concept introduced by the authors, called SEEDs.
SEEDs are points that do not belong to the current partition and they are placed by execu-
tors to serve as markers in order to identify master partial clusters and then merge them.
The proposed method is distinguished by the use of a Java-based kd-tree implementation in
order to optimize the complexity of the task of searching for the neighbours of points.
Z.Dafir et al.
1 3
The RDD-DBSCAN algorithm, introduced in Cordova and Moh (2015), consists of five
steps. This algorithm starts the same way as the Apache Spark-based implementations of
DBSCAN described above. It divides data and then performs local DBSCAN algorithm on
each partition. After that, it identifies clusters which span multiple partitions and generates
global cluster identifiers. Finally, all points are relabelled using the newly-found, globally-
unique identifiers.
The three methods were assessed based on their accuracy and speed up. They have
proven their efficiency in processing large data. The algorithm, proposed in Han et al.
(2016), can achieve a performance 16× faster than the MapReduce implementation of the
DBSCAN algorithm. The experiments of the RDD-DBSCAN algorithm demonstrated that
the communication costs do not significantly affect its performance.
3.1.3 Clustering algorithms using peer‑to‑peer networks
Recently, Azimi etal. (2017) introduced a new gossip-based distributed clustering algo-
rithm for peer-to-peer unstructured networks (GBDC-P2P). The GBDC-P2P is based on
K-medoids and k-means algorithms for extracting the representative data and discovering
the final clustering results, in addition to the CYCLON algorithm (Voulgaris etal. 2005) to
ensure the interactions between peers. It starts by selecting M representatives in each peer
among its internal data. In the next step, which concerns the gossip-based interactions,
each peer sends the representatives to the neighbour peers. This process of interactions
between peers is based on the CYCLON algorithm. Then, the summarization step sum-
marizes the external data of peers once their memory is full. This step is performed only
when necessary. The property of the data summarization is provided by the K-medoids
algorithm. The adaptation of the GBDC-P2P algorithm to dynamic network conditions is
ensured using an age variable for each external data of peers. At each round of gossip-
based operations, all peers increment the age variables of their external data. This mecha-
nism allows replacing the old external data of peers. Finally, an improved version of the
k-means algorithm, called Persistent k-means, is proposed and it is performed in each peer
in order to calculate the final clustering results. The proposed algorithm was compared to
other methods in the literature in order to demonstrate its efficiency. The clustering results
obtained by the GBDC-P2P algorithm are very close to those obtained by the centralized
k-means algorithm.
In Gehweiler and Meyerhenke (2010), a distributed heuristic, called DIDIC, is pro-
posed for clustering a virtual P2P supercomputer. It consists of three phases, namely, the establishment of the initial situation, the elimination of global knowledge with suitable diffusive processes, and the determination of the clustering. In the first phase, if no initial configuration is provided, then a random configuration is considered. In the second phase, the diffusive clustering process is carried out. Two diffusion systems are used per cluster to represent the same load colour. The primary diffusion system exploits the properties of diffusion and random walks in order to identify dense graph regions, while the secondary system sends load to nodes belonging to the cluster in order to accelerate the formation of large cluster-connected components. The last phase aims to flood areas with new clusters, integrating the isolated nodes, when the diffusion process strongly indicates this. The authors adopted the bulk synchronous
parallel (BSP) model (Valiant 1990), which provides an abstract view of technical struc-
ture and communication features of the hardware, in order to ensure the parallelism of
the proposed method. The clustering results of the DIDIC heuristic were compared to
A survey onparallel clustering algorithms forBig Data
1 3
results obtained with the MCL algorithm (Enright etal. 2002), which tend to be close
and sometimes slightly better in favour of the proposed method.
In Yıldırım and Özdoğan (2011), the authors proposed a MPI-based implementation
of a grid-based clustering algorithm based on the wavelet transform, which is known
as the WaveCluster algorithm (Sheikholeslami etal. 2000). The WaveCluster algorithm
transforms the original feature space by adopting the wavelet transform, thus forming
a new space in which the dense regions must be sought. As a result, it produces sets of
clusters at different resolutions and scales. In order to ensure the parallelism of the WaveCluster algorithm, the authors adopted the replicated approach (Skillicorn 1999), which
consists in splitting the data set over processors that execute nearly identical code seg-
ments of the algorithm. Therefore, processors need to exchange their local results and to
check the correctness. Furthermore, they adopted the master/slave model with the Sin-
gle Process, Multiple Data (SPMD) technique on a distributed-memory multiprocessor
system. The MPI was used to manage the communication requirements between master
and slave nodes. Experiments were conducted on a PC cluster of 8 compute nodes with 32 processors in total, using Fast Ethernet as the underlying communication hardware.
The clustering results demonstrated that the proposed method yields a high speed up
and scales linearly.
p-PIC (Yan etal. 2013) is a MPI-based implementation of the power iteration clus-
tering (PIC) algorithm (Lin and Cohen 2010). The original algorithm produces clus-
tering results by embedding data points in a low-dimensional subspace derived from
the similarity matrix. It consists of three main operations: the similarity matrix calcu-
lation and normalization, the iterative matrix-vector multiplication, and the clustering.
The authors of the p-PIC algorithm noticed that the original method depends on the
memory resources because it stores data and its associated similarity matrix in memory.
Therefore, they suggested enhancing the two first operations involved in this method
by exploring the parallelization strategies in order to reduce the computation and com-
munication costs. Indeed, each processor stores only two cases in memory at a time; in addition, the similarity matrix is split according to the row-wise block-striped matrix-vector multiplication strategy and stored across the processors executing in parallel.
The experimental results on a local cluster and Amazon EC2 cloud show that the p-PIC
algorithm reaches almost a linear speed-up on all tested datasets.
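The core of PIC is a simple power-iteration update on the row-normalized similarity matrix; the serial sketch below illustrates it (the Gaussian similarity function and the fixed iteration count are assumptions), and it is exactly this matrix that p-PIC splits row-wise across processors.

```python
import numpy as np

def pic_embedding(points, n_iter=50, sigma=0.5):
    # Affinity matrix from a Gaussian kernel, then row normalization (W = D^-1 A).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    affinity = np.exp(-dist ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)
    w = affinity / affinity.sum(axis=1, keepdims=True)

    # Power iteration: repeated matrix-vector products produce a one-dimensional
    # embedding in which points of the same cluster end up with similar values.
    v = np.random.default_rng(0).random(len(points))
    for _ in range(n_iter):
        v = w @ v
        v /= np.abs(v).sum()          # L1 normalization keeps the vector bounded
    return v

points = np.random.rand(300, 2)
embedding = pic_embedding(points)
# A standard k-means on `embedding` (one value per point) yields the final clusters.
print(embedding[:10])
```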
A MPI-based implementation of the DBSCAN algorithm is presented in Savvas and
Tselios (2016). The authors designed an approach based on three phases, namely, the
splitting, the DBSCAN execution, and the final clusters forming. Furthermore, they
adopted the Single Instruction Multiple Data (SIMD) technique, which allows executing
the same task by each computational node of the system on different data. The mas-
ter node is responsible for splitting and assigning the data to the worker nodes, while
the worker nodes perform independently the DBSCAN algorithm. Consequently, each
worker node produces its own local clusters in the form of a collection of pairs, wherein
the centroids and the radius of the clusters are saved. Then, each worker node sends its
collection to its neighbor node in order to explore possible aggregations of clusters.
This is achieved by computing the intersections of circles then examining these inter-
sections. If the circles do not intersect, no action is taken since they form different clusters. This process is repeated until all the information reaches the final worker node, which emits the clusters as a final result to the master node. The
experiments were conducted on 33 computational nodes and they demonstrated that the
proposed approach reduces the time complexity and yields identical results compared to
the original sequential DBSCAN.
Z.Dafir et al.
1 3
3.2 Vertical scaling platforms‑based clustering algorithms
In this section, we present the recent clustering methods based on vertical scaling plat-
forms, namely the GPU, Multi-core CPU, and FPGA platforms.
3.2.1 Clustering algorithms using multi‑core CPU
Authors of Hadian and Shahrivari (2014) suggested a parallel variant of the k-means algo-
rithm using multiple CPU cores of a single machine. The proposed algorithm divides
the datasets into chunks and then distributes them to the processing units. The process-
ing units perform the clustering on the chunks of data in parallel. The proposed algorithm
consists of two thread sets: Master thread and chunk-clustering threads. The master thread
is responsible for reading datasets and organizing them into chunks that have a predefined
size. After that, the master thread dispatches the chunks through a queue. The
next step consists of performing the clustering by multiple threads using k-means++ and
then storing the centroids obtained from each chunk in the global list of centroids. Once
the clustering of all chunks is completed, the global list of centroids is loaded by the mas-
ter thread in order to run the next step, which consists of clustering the centroids using
k-means++ to produce the final centroids. The proposed algorithm was evaluated using a
12-cores machine, and compared with the k-means, k-means++, and stream-based algo-
rithms. The results show that this algorithm can achieve a near-linear scalability, and yields
the same quality of results as k-means++. In addition, it runs much faster than stream-
based algorithms.
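A serial sketch of this two-phase idea (cluster each chunk, then cluster the per-chunk centroids) is given below, using scikit-learn's KMeans as a stand-in for the k-means++ routine mentioned above; the chunk size and k are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

def chunked_kmeans(data, k, chunk_size=10_000):
    # Phase 1: cluster each chunk independently and collect its centroids
    # (in the surveyed algorithm this is done by parallel worker threads).
    intermediate_centroids = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        model = KMeans(n_clusters=k, init="k-means++", n_init=5).fit(chunk)
        intermediate_centroids.append(model.cluster_centers_)

    # Phase 2: cluster the collected centroids to obtain the k final centroids.
    all_centroids = np.vstack(intermediate_centroids)
    return KMeans(n_clusters=k, init="k-means++", n_init=5).fit(all_centroids).cluster_centers_

data = np.random.rand(50_000, 2)
print(chunked_kmeans(data, k=4))
```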
Erdem and Gündem (2014) adapted an uncertain data clustering algorithm, called fast density-based spatial clustering of applications with noise (FDBSCAN) (Kriegel and Pfeifle 2005), to the multi-core platform. The idea behind the original FDBSCAN algo-
rithm is to integrate the fuzzy distance functions directly into the traditional DBSCAN
algorithm in order to measure the similarity between fuzzy objects. The adaptation of this
clustering algorithm to the multi-core platform resulted in the so-called M-FDBSCAN,
which has demonstrated a significant acceleration of processing. The newly introduced
algorithm operates in three steps. First, it splits the 2-dimensional fuzzy data object dataset
into subsets according to the number of available cores in the multi-core system. Then, it
applies the FDBSCAN algorithm to each subset concurrently, which partially determines the final cluster regions. Finally, the last step in this algorithm consists
of merging the subsets pairs to get the final cluster regions. The M-FDBSCAN algorithm
is implemented using OpenMP in C to achieve its parallelism. Experiments conducted on
synthetic datasets demonstrated that the proposed algorithm scales linearly when increas-
ing the number of cores and it outperforms the FDBSCAN algorithm when processing
huge amounts of data.
Earlier, the work published in Kraus and Kestler (2010) suggests parallel implementations of the k-means and k-modes algorithms using a multi-core platform with trans-
actional memory in order to process a large amount of data. The k-means and k-modes
algorithms were implemented as follows: The first step consists of distributing the initial
datasets to the cores. Then, each data point is assigned to the nearest centroid. The next
step consists of updating the centroids by k data threads. The operations assigning data and
updating centroids are performed in parallel using a software transactional memory (STM),
named deferred-update STM. Each data point is processed by running simultaneously the
A survey onparallel clustering algorithms forBig Data
1 3
threads. The data must be verified before sending it to the shared memory. If any change
has been made to the data, the transaction is sent. Otherwise, the transaction is rejected.
The parallel multi-core k-means algorithm (McKmeans) was compared to a single-core
k-means implementation in R on simulated artificial data, using a dual quad-core computer,
and then to the network-based ParaKMeans tested on the web with the same datasets.
These comparisons show the high performance of the proposed algorithm, which achieves
a speed-up of up to 10× over the single-core implementations.
3.2.2 Clustering algorithms using graphics processing unit
Three parallel implementations of the k-means algorithm are suggested in Cuomo etal.
(2017) for the purposes of reducing the execution time using GPUs and CUDA. These
implementations follow a hybrid approach, in which the k-means algorithm is performed
partly in host and partly in device. Data transfer between host and device is efficiently car-
ried out using coalescing memory accesses and pitch for matrices memory allocation. The
proposed parallel approach calls the device at each iteration. This requires fixing some
parameters and checking the dimension of the matrix of data points. Some dimensions
need the data points to be divided in chunks and then process them one by one. Then, the
host calculates the new centroids using the information of the new indices produced by
the device, and finally it checks the convergence. The authors observed that the k-means
algorithm needs three data structures to store the set of data points, the centroids, and the
indices that indicate the membership of each data point. In order to address this observa-
tion they proposed a lighter data structure that can reduce the data transferring time. In
their first implementation, they proposed to use a single matrix to store the data and to update the
results of the labelling stage on the CPU, while in the second implementation they used
two different data structures to store them. The third implementation, which is considered
the most refined solution, adopts a parallel technique to compute the squared Euclidean
distance. This refined implementation achieves a speed-up of up to 88× over the CPU version.
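The hybrid host/device iteration can be illustrated schematically as below, assuming CuPy as the device-side array library; this is only a rough analogue of the general idea, not the authors' CUDA implementation.

```python
# A schematic hybrid host/device k-means iteration (CuPy assumed; see note above).
import numpy as np
import cupy as cp

def hybrid_kmeans(X, k, iters=20, tol=1e-4):
    centroids = X[np.random.choice(len(X), k, replace=False)]
    d_X = cp.asarray(X)                                # one-off host-to-device copy
    for _ in range(iters):
        d_C = cp.asarray(centroids)                    # the device is called at each iteration
        # Device side: squared Euclidean distances and per-point nearest centroid.
        d2 = ((d_X[:, None, :] - d_C[None, :, :]) ** 2).sum(axis=2)
        labels = cp.asnumpy(cp.argmin(d2, axis=1))     # membership indices back to the host
        # Host side: recompute the centroids from the new indices and check convergence.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels
```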
Authors of Huang etal. (2015) proposed PaStream, a parallel algorithm for clustering
data streams based on NVIDIA GPUs. This single-pass algorithm discovers clusters with
arbitrary shapes and detects outliers. The PaStream algorithm is implemented following
a framework for clustering fast evolving data streams, called CluStream (Aggarwal etal.
2003). The PaStream algorithm requires an initialization step, in which some micro-clus-
ters are generated using the k-means algorithm. Then, it involves two phases: an on-line
phase, in which data points are gathered into micro-clusters, and an off-line phase, in
which micro-clusters are aggregated into macro-clusters. In the on-line phase, for each data
point, three cases must be considered. In the first case, an existing micro-cluster absorbs
the data point. In the second case, the data point forms a new micro-cluster, while in the
third case, the two closest micro-clusters are merged into one. This decision is made accord-
ing to the absorption rate. Furthermore, the authors have defined a merging factor, which
determines when the close micro-clusters should be merged in order to optimize memory
and time. In the off-line phase, the Euclidean distance between all centers is calculated,
then, the decision graph is plotted based on the local distance and the local density for
each micro-cluster. It should be noted that this algorithm operates on groups of data points
rather than on individual points as they arrive, which helps reduce the time required to transfer data
from CPU to GPU. The experiments on real and synthetic datasets show that the PaStream algorithm
outperforms its version based on CPU in terms of time and quality of the produced result.
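A highly simplified analogue of the on-line phase decision logic is sketched below; the thresholds absorb_radius and max_clusters are illustrative placeholders for the absorption rate and merging factor used by PaStream, and the off-line macro-clustering phase is omitted.

```python
# Simplified on-line phase decision for one incoming point (assumptions noted above).
import numpy as np

def online_step(point, centers, counts, absorb_radius=0.1, max_clusters=50):
    if len(centers) > 0:
        d = np.linalg.norm(centers - point, axis=1)
        i = int(np.argmin(d))
        if d[i] <= absorb_radius:                      # case 1: an existing micro-cluster absorbs it
            counts[i] += 1
            centers[i] += (point - centers[i]) / counts[i]
            return centers, counts
    if len(centers) >= max_clusters:                   # case 3: merge the two closest micro-clusters
        dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
        np.fill_diagonal(dists, np.inf)
        a, b = np.unravel_index(np.argmin(dists), dists.shape)
        merged = (centers[a] * counts[a] + centers[b] * counts[b]) / (counts[a] + counts[b])
        centers[a], counts[a] = merged, counts[a] + counts[b]
        centers, counts = np.delete(centers, b, axis=0), np.delete(counts, b)
    # case 2: the point forms a new micro-cluster
    centers = np.vstack([centers, point]) if len(centers) else point[None, :]
    counts = np.append(counts, 1)
    return centers, counts
```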
In Zhang etal. (2010) a parallel implementation of a nature-inspired algorithm for docu-
ment clustering is introduced. The original method is called the flocking-based document
clustering algorithm (Cui etal. 2006). In order to reduce the execution time and the com-
plexity of this algorithm, the authors exploited the computational power of Beowulf-like
clusters equipped with GPUs. In a typical flocking model, the behaviour of an individual is
based only on its neighbour flock mates within a certain range. This behaviour is described
by three rules: separation, alignment, and cohesion. In order to adapt this model to the
document clustering problem, the document is considered as an individual that participates
in the flocking formation. Then, the three rules are combined for similar neighbour docu-
ments, while only the separation rule is applied for non-similar neighbour documents. The
authors developed a special GPU cluster programming model to implement the proposed
method. It consists of a distributed object interface to unify CUDA memory management
and explicit message passing routines, a mechanism to spawn a flexible number of host
threads for parallelization that may exceed the number of GPUs in the system, in addition
to an interface for advanced users to control thread scheduling in clusters. The experiments
show that the GPU-based implementation scales up over one million documents processed
simultaneously in a sixteen-node moderate GPU cluster. Moreover, the proposed method
can reach up to 50× speed-up compared to its CPU-based implementation.
Async-EM (Altinigneli etal. 2013) is a parallel variant of the Expectation–Maximiza-
tion (EM) algorithm based on the GPU platform. Their main contributions concern the
synchronization of cores and the organization of memory access. Indeed, the authors
noticed that several updates of the global cluster representatives cause an inefficient use of
memory bandwidth and synchronization overhead. Therefore, to avoid these issues, they
proposed to update the global cluster representatives only when a certain number of mem-
bership changes have occurred. This is achieved through the idea of the asynchronous
model updates combined with an efficient technique, called the model of consolidation.
The model of consolidation is responsible for merging the different sets of cluster repre-
sentatives when the local model updates are exchanged, and it exploits the special charac-
teristics of the memory hierarchy of modern GPUs. The experiments on real and synthetic
datasets show that the Async-EM. This method outperforms the incremental-EM and the
Batch-EM algorithms in terms of the convergence, the modelling error, and the execution
time performances. A comparison between the GPU and the CPU performances of the pro-
posed approach has also shown its effectiveness.
The work proposed in Melo et al. (2016) is part of the density-based clustering. It
focuses on the parallelization of the OPTICS algorithm using GPU and based on data
indexing strategy. The OPTICS algorithm creates an augmented ordering of the dataset
representing its density-based clustering structure (Ankerst etal. 1999). The proposed par-
allel implementation of the OPTICS algorithm consists of two phases: the graph construc-
tion and the OPTICS process. The proposed approach adopts a data representation in a
graph form following the METIS data structure, which is presented in Karypis and Kumar
(1998), and it uses three vectors to store the vertices, the nodes in the adjacency lists, and
the distances. The process consists of four main steps: the vertex degree calculation, the
adjacency index calculation, the adjacency list assembly, and sorting. These
steps, in addition to the construction of data structure and the storage at the end of the adja-
cency lists of nodes, are each processed in parallel using GPU. The experiments show that
the proposed approach reduces significantly the complexity of the OPTICS algorithm and
makes it faster than the serial CPU-based version.
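The three-vector graph layout can be pictured with the following small, sequential sketch; the field names and the use of SciPy are illustrative, since the original work assembles this structure in parallel on the GPU.

```python
# An illustrative, sequential construction of the three-vector graph layout.
import numpy as np
from scipy.spatial.distance import cdist

def build_indexed_graph(points, eps):
    d = cdist(points, points)
    np.fill_diagonal(d, np.inf)
    neighbours = [np.where(row <= eps)[0] for row in d]       # eps-neighbourhood of each vertex
    degrees = np.array([len(n) for n in neighbours])           # vertex degree calculation
    adj_index = np.concatenate(([0], np.cumsum(degrees)))      # adjacency index calculation
    adj_list = (np.concatenate(neighbours) if degrees.sum()    # adjacency list assembly
                else np.array([], dtype=int))
    adj_dist = np.array([d[i, j] for i, nbrs in enumerate(neighbours) for j in nbrs])
    return degrees, adj_index, adj_list, adj_dist               # (sorting step not shown)
```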
A survey onparallel clustering algorithms forBig Data
1 3
In the same context, authors of Deng etal. (2015) have proposed a new trajectory clus-
tering approach based on the POPTICS algorithm (Patwary etal. 2013). In order to adapt
the POPTICS algorithm to the trajectory data, the authors applied a spatio-temporal dis-
tance to measure the similarity between trajectories and an indexing approach based on
the STR-tree structure. The latter, called Tra-POPTICS, is based on shared memory and involves
three main steps. In the first step, each CPU thread processes a local disjoint subset of trajectory
data: it finds the neighbours of each trajectory, computes the core distance of each trajectory,
and builds a local minimum spanning tree (MST) using Prim's algorithm. The next step generates the global MST, and finally, the
last step extracts the clusters from the global MST. As part of this study, a parallel ver-
sion of the Tra-POPTICS algorithm, called G-Tra-POPTICS, has been designed using the
Hyper-Q feature of the Kepler GPU and massive numbers of GPU threads. The experiments on real trajec-
tory dataset demonstrated that the Tra-POPTICS algorithm and its parallel version have
reached the quality of clustering result produced by a variant of OPTICS algorithm called
T-OPTICS (Nanni and Pedreschi 2006). In addition, the GPU-based version outperforms
the Tra-POPTICS algorithm regarding the computational time.
Another work related to the density-based clustering is presented in Andrade et al.
(2013). It is a GPU-based implementation of the DBSCAN algorithm, called G-DBSCAN,
which uses a simple graph-based data indexing technique. The proposed approach con-
sists of two main steps, namely, the graph construction and the identification of clusters
using the breadth-first search algorithm (Harish and Narayanan 2007). The first step aims
to construct a graph in order to represent the data. Indeed, each object in the dataset is
represented as a node in the graph, and when the similarity measure between two objects is
less than a threshold, which is given as an input parameter, an edge is added between them.
Therefore, this first step performs the calculation of vertices degree, the calculation of the
adjacency lists indices, and finally the assembly of adjacency lists. The second step aims to
identify the clusters by traversing the graph created in the previous step using the breadth-
first search algorithm. According to the experiments, the G-DBSCAN algorithm is 100×
faster than the serial implementation of the DBSCAN algorithm.
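A compact, CPU-side sketch of these two steps is given below; core-point handling is simplified and both steps run sequentially here, whereas G-DBSCAN parallelizes them on the GPU.

```python
# Sequential sketch of G-DBSCAN's two steps: radius-graph construction, then BFS labelling.
from collections import deque
import numpy as np

def g_dbscan_sketch(points, radius, min_pts):
    d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    adjacency = [np.where(row <= radius)[0] for row in d]       # step 1: graph construction
    labels = np.full(len(points), -1)
    cluster = 0
    for start in range(len(points)):
        if labels[start] != -1 or len(adjacency[start]) < min_pts:
            continue
        labels[start] = cluster
        queue = deque([start])                                   # step 2: breadth-first search
        while queue:
            v = queue.popleft()
            if len(adjacency[v]) < min_pts:                      # border point: labelled, not expanded
                continue
            for u in adjacency[v]:
                if labels[u] == -1:
                    labels[u] = cluster
                    queue.append(u)
        cluster += 1
    return labels                                                # -1 marks noise points
```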
CUDA-MCL (Bustamam etal. 2012) is a parallel variant of the Markov clustering algo-
rithm (MCL) (Van Dongen 2008) applied to the protein-protein interaction networks. The
MCL algorithm discovers cluster structure in graphs by adopting the concept of random
walks. The proposed parallel implementation is based on GPU programming via CUDA.
The MCL algorithm is based on two algebraic operations on the Markov matrix, namely,
the expansion and the inflation. Therefore, the performance of this algorithm depends on
the size of the Markov matrix. The main contributions of the authors include the enhance-
ment of the performance of the original MCL by implementing parallel tasks for the expan-
sion and inflation operations, in addition to the storage optimization using sparse matrix
data structures. The proposed algorithm involves three CUDA kernels: one computes the parallel
MCL expansion, a second computes the parallel MCL inflation, and the last computes the parallel
local and global chaos. Compared to the original CPU-based MCL algorithm, the proposed approach
is faster on large datasets.
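For illustration, the two algebraic operations that the CUDA kernels parallelize can be written down on a SciPy sparse matrix as follows; this is a plain CPU sketch of standard MCL steps, not the authors' GPU code.

```python
# Plain CPU sketch of one MCL iteration: expansion, inflation, column re-normalization.
import numpy as np
from scipy.sparse import csr_matrix, diags

def mcl_iteration(M, inflation=2.0):
    M = M @ M                                       # expansion: two-step random walk
    M = M.power(inflation)                          # inflation: strengthen strong transitions
    col_sums = np.asarray(M.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0
    return (M @ diags(1.0 / col_sums)).tocsr()      # keep the matrix column-stochastic

# Usage: start from a column-stochastic transition matrix and iterate until it stabilizes
# (the local/global "chaos" measure mentioned above plays the role of a stopping criterion).
A = csr_matrix(np.array([[0.5, 0.5, 0.0],
                         [0.5, 0.5, 0.5],
                         [0.0, 0.0, 0.5]]))
for _ in range(10):
    A = mcl_iteration(A)
```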
The Calculation-On-Demand CAST with GPU (COD-CAST-GPU) algorithm (Lin etal.
2014) is a parallel design of the Clustering Affinity Search Technique (CAST) (Ben-Dor
etal. 1999). As its name suggests, it is based on the GPU platform and the individual mem-
ory of graphics card. The CAST algorithm requires as input the similarity matrix and the
affinity threshold. Therefore, the size of such matrix is a critical point which causes the
storage problems. The idea behind the proposed algorithm is to calculate the similarity
Z.Dafir et al.
1 3
between data nodes only when it is needed in order to avoid the prior calculation of the
similarity matrix. In addition, in order to accelerate the CAST algorithm and gain in terms
of performance, the authors exploited the capabilities of the GPU. The proposed algorithm
starts by selecting a random node as a new cluster. Then, it uses two operations: ADD and
REMOVE to form this new cluster. After each movement, the affinity values are updated in
parallel, and this is where the GPU comes in. The technique adopted by the authors showed
remarkable improvements in terms of performance compared to the original algorithm.
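A loose, sequential sketch of the calculation-on-demand idea is shown below; affinities are computed only when needed, the REMOVE operation is omitted for brevity, and the similarity function and threshold handling are illustrative simplifications of CAST.

```python
# Sequential toy version of the on-demand CAST idea (ADD only; see caveats above).
import numpy as np

def cod_cast_sketch(points, threshold):
    unassigned, clusters = set(range(len(points))), []
    sim = lambda i, j: 1.0 / (1.0 + np.linalg.norm(points[i] - points[j]))  # computed on demand
    while unassigned:
        seed = unassigned.pop()                       # open a new cluster from an arbitrary node
        cluster = [seed]
        affinity = {i: sim(i, seed) for i in unassigned}
        changed = True
        while changed and affinity:
            changed = False
            best = max(affinity, key=affinity.get)    # candidate with the highest affinity
            if affinity[best] / len(cluster) >= threshold:
                cluster.append(best)                  # ADD operation
                unassigned.discard(best)
                del affinity[best]
                for i in affinity:                    # affinity update (performed in parallel on the
                    affinity[i] += sim(i, best)       # GPU in COD-CAST-GPU; sequential here)
                changed = True
        clusters.append(cluster)
    return clusters
```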
3.2.3 Clustering algorithms using eld programmable gate arrays
Authors of Scicluna and Bouganis (2015) proposed to accelerate the well-known DBSCAN
algorithm using an FPGA-based implementation in order to achieve high-performance
clustering. The proposed FPGA-based implementation is only partially parallelized: the
authors observed that the step obtaining the extended neighbourhood of points for a cluster
consumes the longest portion of the execution time and, since this step has no data dependencies,
it is well suited to parallel execution. The proposed method is a fully
configurable IP core, thus it contains many parameters to be adjusted, such as the size and
dimensions of the input data, internal precision, pipeline depths, and the level of parallel-
ism. Performance evaluation based on 2D point clustering shows that the execution time of
the FPGA-based implementation is not impacted by the increase in the number of dimensions.
The proposed method can reach mean speed-ups of 31× in real-life datasets and 202× in
synthetic datasets when compared to the state-of-the-art methods.
The study introduced in Shi etal. (2014) is also related to the DBSCAN algorithm. It is
a FPGA-based implementation of this algorithm while adopting task-level and data-level
parallelism. Their main contributions include the design of a data-reuse pipeline structure
that overcomes the extra memory accesses caused by the data dependencies in the parallel
algorithm, in addition to a robust collision-check mechanism that prevents wrong clustering
results in some special conditions. The experiments on synthetic
datasets show that the proposed parallel architecture can reach up to an 86× speed-up compared
to a software implementation on a general-purpose processor and up to 2.9× compared to
a software implementation on the graphic processor.
SAKMA (Jia etal. 2015) is a FPGA-based implementation of the k-means algorithm.
This architecture accelerates the whole k-means algorithm through certain approaches
such as the pipeline, the tile technique, the duplication parallelism, and the hardware adder
tree structures. The SAKMA architecture contains a processing system part in software
and a processing logic part in hardware. The FPGA-based implementation of the k-means
algorithm is carried out by the IP Core. The frequent off-chip memory access is a com-
mon problem that the authors resolved using the tiled technique. This technique aims to
divide the large volume of memory blocks into small tiles that can be buffered on-chip.
The experimental results on real biological and synthetic datasets show that the proposed
method can reach a speed-up of 20.5× with an affordable hardware cost when compared
with other state-of-the-art methods.
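A simple software analogue of this tiling idea is sketched below, where the "on-chip buffer" is just a NumPy slice and the tile size is an illustrative value.

```python
# Software analogue of the tile technique for the assignment step of k-means.
import numpy as np

def tiled_assignment(points, centroids, tile_size=1024):
    labels = np.empty(len(points), dtype=np.int64)
    for start in range(0, len(points), tile_size):
        tile = points[start:start + tile_size]                 # buffer one small tile
        d2 = ((tile[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels[start:start + tile_size] = np.argmin(d2, axis=1)
    return labels
```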
In the same context, authors of Winterstein et al. (2013) proposed to accelerate
the k-means algorithm using the FPGA and a binary tree data structure. Indeed, the
authors adopted the filtering algorithm (Kanungo etal. 2002) which intends to reduce
the computational load using a kd-tree as the main data structure. This algorithm con-
sists of constructing a tree from the set of points, then it traverses the tree iteratively
and updates the centers. The main challenge is how to pipeline and to parallelize the
A survey onparallel clustering algorithms forBig Data
1 3
kd-tree processing using multiple banks of distributed on-chip memory. Besides, the
on-chip dynamic memory allocation is used in order to efficiently utilize the memory
resources by allowing the allocation of the average amount of memory required dur-
ing runtime. The proposed method needs 5× fewer computational FPGA resources
compared to the conventional k-means algorithm implemented in parallel for the same
throughput constraint.
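As a much-simplified software stand-in for this idea, the sketch below uses a k-d tree (built over the current centers with SciPy) to avoid the full point-to-centre distance matrix; the actual filtering algorithm of Kanungo et al. builds the tree over the points and prunes candidate centers during traversal, which this sketch does not reproduce.

```python
# Simplified tree-assisted k-means step (a stand-in for the filtering algorithm; see above).
import numpy as np
from scipy.spatial import cKDTree

def kdtree_kmeans_step(points, centers):
    _, labels = cKDTree(centers).query(points)        # nearest centre per point via the k-d tree
    new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(len(centers))])
    return new_centers, labels
```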
The work proposed in An etal. (2012) is part of the clustering-based prototype
learning algorithms. It is a k-means-based multi-prototype learning system strengthened
by an FPGA-based coprocessor for the nearest Euclidean distance
searching. This technique overcomes the high computational cost of the near-
est neighbour searching implementations for the k-means clustering algorithm and the
one nearest neighbour (1-NN) classification algorithm. The aim of the proposed sys-
tem is to construct prototypes using the k-means algorithm iteratively until its con-
vergence to stable centroids, then, the prototypes are used in order to recognize the
test samples by applying the 1-NN classifier to search for the nearest Euclidean dis-
tance among these prototypes. Finally, the k prototypes with the best training result
are selected to represent the final recognition. The experiments on a handwritten digits
dataset show that the proposed learning system can reach an accuracy rate of 97.91%
with 930 prototypes.
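The overall prototype-learning flow can be sketched as follows, with scikit-learn's KMeans building the per-class prototypes and a brute-force 1-NN search (the part the FPGA coprocessor accelerates) doing the recognition; dataset and parameter names are illustrative.

```python
# Sketch of k-means prototype construction followed by 1-NN recognition.
import numpy as np
from sklearn.cluster import KMeans

def train_prototypes(X, y, prototypes_per_class=10):
    protos, proto_labels = [], []
    for c in np.unique(y):
        km = KMeans(n_clusters=prototypes_per_class, n_init=1).fit(X[y == c])
        protos.append(km.cluster_centers_)
        proto_labels.append(np.full(prototypes_per_class, c))
    return np.vstack(protos), np.concatenate(proto_labels)

def classify_1nn(samples, protos, proto_labels):
    # Brute-force nearest-Euclidean-distance search over the prototypes.
    d2 = ((samples[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return proto_labels[np.argmin(d2, axis=1)]
```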
Another contribution to the k-means algorithm is proposed in Hussain etal. (2011)
which aims to design an FPGA-based implementation of the k-means algorithm for
microarray datasets, which are known for their large sizes. The proposed approach
requires a careful analysis of samples of Micro-array data in order to fix the values
of some critical parameters, such as data size, dynamic range, precision, and memory
requirement. The k-means algorithm is implemented using three main blocks. The first
block is responsible for the calculation of distances, the second block for assigning
data points, and the third block for recalculating the
centers. Respecting this order, each block executes its operations in parallel and as
soon as it finishes it activates the following block. This process reiterates until the
convergence. The experiments on a sample of a real-life dataset show that the proposed
method can reach a speed-up of up to 51.7× compared to a software model and is 206.8×
more energy efficient than the CPU implementation.
Authors of Sotiropoulou etal. (2014) proposed 2D-clustering implementation based
on multi-core FPGA for real-time image processing. The proposed clustering imple-
mentation operates on zero-suppressed data and it consists of a pipeline of three mod-
ules: the hit decoder module, the grid clustering module, and the centroid calculation
module. The first module is a preprocessing step which transforms the incoming data
to a recognized form. The second module identifies the clusters using an innovative
moving window technique to reduce the FPGA resources required for this process.
Finally, the third module is a post-processing step which performs the data reduction
process. In other words, the cluster data are replaced with a single set of centroid coor-
dinates. In order to parallelize this implementation, the authors proposed to instanti-
ate multiple grid clustering modules which work independently on data from separate
pixel modules. This is achieved through two logic modules, the parallel data distribu-
tor module, and the data merger module. The experiments concerned the 2D-clustering
single flow implementation and its parallel version up to 16 clustering engines. A com-
parison with a previous version showed that the proposed parallel implementation uses
64× fewer logic resources.
Z.Dafir et al.
1 3
4 Comparison oftheparallel clustering algorithms
The research carried out in the context of this survey has highlighted several noteworthy
observations. The first point concerns the fact that the majority of the recently proposed
parallel clustering algorithms focus on some well-known algorithms, such
as the k-means (Hartigan and Wong 1979), DBSCAN (Ester et al. 1996), and OPTICS
(Ankerst etal. 1999) algorithms. The nature of these algorithms makes them suitable for
parallelism in different designs and based on various platforms to handle Big Data.
4.1 Experimental setup
Table1 provides a general idea of the performance of the k-means algorithm when imple-
mented using different Big Data platforms. Each parallel k-means version is firstly classi-
fied according to the Big Data platform used to attain the parallelism of the original algo-
rithm. Then, we present the initial parameters, the type of data processed and the criteria
of clustering validation, namely, the quality of the obtained results, the running time con-
sumed, and the speed-up achieved. The quality of the results depends on the data type pro-
cessed and therefore the measures used differ in some cases. Finally, we present criteria
related to the ability of the proposed algorithm to handle Big Data through the volume, the
velocity, and the variety of the processed data.
4.2 Parallel k‑means algorithms
Although k-means is known to be popular and easy to implement, it suffers from some
weaknesses such as the determination of the suitable number of clusters k, in addition to
the scalability issues when treating sparse values. k-means is an iterative algorithm which
involves three main steps: initialization, clusters assignments, and centroids updates. There
are several designs that may be put forward to perform the clusters assignments and cen-
troids updates steps in parallel. From Table 1, it can be observed that the MapReduce-based
implementation of the k-means algorithm can process up to 4 billion data points in only
a few minutes.
4.3 Parallel DBSCAN algorithms
DBSCAN is as important as k-means. It aims to discover the clusters and the noise in a
spatial database requiring two parameters: Eps and MinPts. The Eps parameter denotes the
maximum radius of the neighbourhood from a point, while the MinPts parameter denotes
the minimum number of points required to form a dense region. Once these input param-
eters are set, DBSCAN performs clustering in a two-step approach. First, it starts by select-
ing an arbitrary point from the dataset that fulfils the core point condition as a seed. Then,
it groups all points that are density-reachable from the selected seed including itself, thus
obtaining a cluster.
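For concreteness, a minimal non-parallel baseline run of DBSCAN with these two parameters might look as follows (using scikit-learn); the parallel variants surveyed here distribute this same computation over chunks of the dataset.

```python
# Baseline (non-parallel) DBSCAN run with the two parameters discussed above.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(10_000, 2)
labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(X)      # eps = Eps, min_samples = MinPts
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # label -1 marks noise
print(n_clusters)
```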
DBSCAN offers several advantages, including finding clusters of arbitrary shapes and
handling outliers automatically. Unlike k-means, it does not need the number of clusters
to be set. However, the estimation of its input parameters is a complex and critical
task. This algorithm also suffers from scalability issues like the rest of the traditional
clustering algorithms.
A survey onparallel clustering algorithms forBig Data
1 3
Table 1 Comparison of the recent parallel clustering algorithms of k-means

Horizontal scaling:

MapReduce: Cui et al. (2014). Input parameters: K (number of clusters). Datasets type: real, synthetic. Quality: DBI of 0.0133 for WMC and 0.0128 for DMC (N=4b, K=100, 16 reducers). Running time: 5.38 min for WMC and 5.29 min for DMC (N=4b, K=100, 16 reducers). Volume: up to 4 billion points. Velocity: no. Variety: data points.

Spark: Tsapanos et al. (2016). Input parameters: ε (threshold). Datasets type: real. Quality: NMI of 0.8412 (N=621k, ε=0.5). Running time: 9 h (N=621k, ε=0.5). Speed-up: 2.6× (N=621k, ε=0.5). Volume: up to 2.8 million points. Velocity: no. Variety: face videos.

Peer-to-peer: Azimi et al. (2017). Input parameters: K (number of clusters). Datasets type: real, synthetic. Quality: AC of 80.14% (N=35k, K=7). Volume: up to 1 million points. Velocity: no. Variety: data points.

Vertical scaling:

GPU: Cuomo et al. (2017). Input parameters: K (number of clusters), D (input size). Datasets type: real, synthetic. Speed-up: 88× (N=500k, K=128, D=128). Volume: up to 500,000 points. Velocity: no. Variety: data points.

Multi-core CPU: Hadian and Shahrivari (2014). Input parameters: K (number of clusters), chunk_size (size of chunks). Datasets type: real, synthetic. Quality: SSE of 7.42E+12 (N=11m, K=10). Running time: 12.9 s (N=11m, K=10). Speed-up: 8.2× (N=11m, K=100, 12 cores, chunk_size=10,000). Volume: up to 11,620,300 points. Velocity: no. Variety: data points.

FPGA: Jia et al. (2015). Input parameters: K (number of clusters). Datasets type: real, synthetic. Running time: 238.86 s (N=10k, K=300). Speed-up: 20.5× (N=3000, K=60). Volume: up to 20,000 points. Velocity: no. Variety: data points.

AC, accuracy; DBI, Davies–Bouldin index; DMC, distribution-based merge clustering; N, the size of the tested dataset; NMI, normalized mutual information; SSE, sum of squares error measure; WMC, weight-based merge clustering
A survey onparallel clustering algorithms forBig Data
1 3
The two steps involved in this algorithm can easily be performed in
parallel on chunks of the dataset. Table 2 presents the performance of the DBSCAN algorithm
when implemented using different Big Data platforms. The experiments on the parallel
implementations of DBSCAN demonstrate that they reach an interesting speed-up over the
original algorithm, attaining about 11× for the MapReduce-based implementation on 1.2 billion records.
4.4 Observations andopen issues
It must also be noted that most of the parallel clustering algorithms proposed in the literature
do not handle real-time data and focus on a single type of data, which limits their
adequacy to process Big Data. Indeed, most of the data nowadays are unceasingly produced
in real time, which necessitates continuous processing. Such processing should minimize
the storage and computation costs in order to analyse large-scale real-time data. Despite
the difficulties encountered with this kind of data, the parallel clustering methods proposed
in Huang et al. (2015) and Sotiropoulou et al. (2014) handle data streams and
real-time data efficiently using the GPU and FPGA platforms, respectively.
The variety of data is also an important dimension of the Big Data, which is unfortu-
nately difficult to handle in the context of clustering. Indeed, most of the parallel clustering
algorithms are designed for numerical data. A few others are specialized in other types of
data, such as text or multimedia data. However, Big Data also takes into account het-
erogeneous structured, semi-structured, and even unstructured data as it represents the larg-
est proportion. Indeed, clustering unstructured data is a challenging task due to the absence
of a recognizable representation. To the best of our knowledge, there is a lack of parallel
algorithms for clustering multi-view, heterogeneous, or multi-modal big data.
It should also be noted that most of the reviewed algorithms require fixing some initial
parameters, which involves a complex beforehand study to decide the appropriate values.
Such studies are often calling on information about the data distribution, in addition to
considerable effort and time, which is not always feasible. However, there are very few
parameter-free algorithms, which are easy to use, particularly in the context of Big Data,
where excessive human interaction should be prevented or minimized when processing
data. Therefore, there is a persistent need for parallel clustering algorithms that can address
all of the observations mentioned above.
5 Conclusions
This paper proposed an in-depth review of the latest parallel clustering algorithms sorted
according to the Big Data platforms used. There are two fundamental categories of plat-
forms, which can handle large-scale data processing. In the first category, we addressed the
clustering algorithms based on MapReduce, Spark, and Peer-to-Peer networks; these platforms
form part of the horizontal scaling platforms. In the second category, known as the
vertical scaling platforms, we focused on the clustering algorithms conceived
with Multi-core processors, GPU, and FPGA. All the reviewed algorithms were analysed
according to the strategies adopted to ensure the parallelism. This work also includes a
detailed comparison of the discussed clustering algorithms based on some common criteria
for validating clustering results in the Big Data context.
Z.Dafir et al.
1 3
Table 2 Comparison of the recent parallel clustering algorithms of DBSCAN

Horizontal scaling:

MapReduce: He et al. (2014). Input parameters: ε (distance), Minpts (minimum cluster size). Datasets type: real. Running time: 58 s (64 tasks, N=1.2b, ε=0.001, Minpts=500). Speed-up: about 11.29× (64 tasks, N=1.2b, ε=0.001, Minpts=500). Volume: up to 1.2 billion records. Velocity: no. Variety: spatial data.

Spark: Han et al. (2016). Input parameters: ε (distance), Minpts (minimum cluster size). Datasets type: synthetic. Running time: 1493 min (N=1m, 512 cores, ε=25, Minpts=5). Speed-up: about 137× (N=1m, 512 cores, ε=25, Minpts=5). Volume: up to 1 million points. Velocity: no. Variety: data points.

Peer-to-peer: Savvas and Tselios (2016). Input parameters: ε (distance), Minpts (minimum cluster size). Datasets type: synthetic. Running time: 67.55 s (N=100k, 33 nodes). Speed-up: about 17.5× (N=100k, 33 nodes). Volume: up to 100,000 points. Velocity: no. Variety: 2D data points.

Vertical scaling:

GPU: Andrade et al. (2013). Input parameters: R (proximity radius), MinPts (minimum cluster size). Datasets type: synthetic. Running time: 82.9 s (N=700k, MinPts=4, R=0.05). Speed-up: 111.6× (N=700k, MinPts=4, R=0.05). Volume: up to 700,000 points. Velocity: no. Variety: 2D data objects.

Multi-core CPU: Erdem and Gündem (2014). Input parameters: ε (minimum distance), μ (minimum cluster size), c (number of cores). Datasets type: synthetic. Running time: 9.4 s (N=50k, ε=120, μ=7, c=24). Volume: up to 50,000 points. Velocity: no. Variety: 2D fuzzy data objects.

FPGA: Scicluna and Bouganis (2015). Input parameters: ε (distance), Minpts (minimum cluster size). Datasets type: real. Running time: 211.88 ms (N=19,504, parallel elements=300, Minpts=80, ε=25). Speed-up: 33.77× (N=19,504, parallel elements=300, Minpts=80, ε=25). Volume: up to 25,000 points. Velocity: no. Variety: data points.

N, the size of the tested dataset
Z.Dafir et al.
1 3
After this thorough study, it is observed that most of the reviewed approaches concern
some well-known clustering algorithms, such as k-means, DBSCAN, and OPTICS. This
choice lies in the fact that these algorithms are widely studied and suitable for parallel-
ism in different ways. Also, it was noted that some Big Data platforms are becoming less
commonly used in clustering, such as the Peer-to-Peer networks. This is due to the rapid
advancements in the field of parallel and distributed computing. These efforts gave rise
to new programming models and more powerful hardware that exceed the limitations of
the old platforms. It is important to point out that most of the parallel clustering algorithms
proposed in the literature do not handle real-time data and focus on a single type of data,
which limits their capability to process Big Data. Consequently, real-time and heterogeneous
data processing remain challenging issues in the context of clustering.
References
Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, VLDB '03, vol 29. VLDB Endowment, Berlin, pp 81–92
Akhter S, Roberts J (2006) Multi-core programming: increasing performance through software multi-threading, 1st edn. Books by engineers, for engineers. Intel Press, Hillsboro
Altinigneli MC, Plant C, Böhm C (2013) Massively parallel expectation maximization using graphics processing units. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '13. ACM, Chicago, pp 838–846. https://doi.org/10.1145/2487575.2487628
An F, Koide T, Mattausch HJ (2012) A k-means-based multi-prototype high-speed learning system with FPGA-implemented coprocessor for 1-NN searching. IEICE Trans Inf Syst E95–D(9):2327–2338
Andrade G, Ramos G, Madeira D, Sachetto R, Ferreira R, Rocha L (2013) G-DBSCAN: a GPU accelerated algorithm for density-based clustering. Procedia Comput Sci 18(Supplement C):369–378. https://doi.org/10.1016/j.procs.2013.05.200
Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) Optics: ordering points to identify the clustering structure. In: Proceedings of the 1999 ACM SIGMOD international conference on management of data, SIGMOD '99. ACM, Philadelphia, pp 49–60. https://doi.org/10.1145/304182.304187
Azimi R, Sajedi H, Ghayekhloo M (2017) A distributed data clustering algorithm in p2p networks. Appl Soft Comput 51(Supplement C):147–167. https://doi.org/10.1016/j.asoc.2016.11.045
Banharnsakun A (2017) A mapreduce-based artificial bee colony for large-scale data clustering. Pattern Recognit Lett 93(Supplement C):78–84. https://doi.org/10.1016/j.patrec.2016.07.027
Ben-Dor A, Shamir R, Yakhini Z (1999) Clustering gene expression patterns. J Comput Biol 6(3–4):281–297. https://doi.org/10.1089/106652799318274
Bharill N, Tiwari A, Malviya A (2016) Fuzzy based scalable clustering algorithms for handling big data using apache spark. IEEE Trans Big Data 2(4):339–352. https://doi.org/10.1109/TBDATA.2016.2622288
Brown SD, Francis RJ, Rose J, Vranesic ZG (1992) Field-programmable gate arrays. Kluwer international series in engineering and computer science. Springer, Boston. https://doi.org/10.1007/978-1-4615-3572-0
Bustamam A, Burrage K, Hamilton NA (2012) Fast parallel Markov clustering in bioinformatics using massively parallel computing on GPU with CUDA and ELLPACK-R sparse format. IEEE/ACM Trans Comput Biol Bioinform 9(3):679–692. https://doi.org/10.1109/TCBB.2011.68
Cordova I, Moh TS (2015) DBSCAN on resilient distributed datasets. In: International conference on high performance computing simulation (HPCS). IEEE, Amsterdam, pp 531–540. https://doi.org/10.1109/HPCSim.2015.7237086
Cui X, Gao J, Potok TE (2006) A flocking based algorithm for document clustering analysis. J Syst Archit 52(8):505–515. https://doi.org/10.1016/j.sysarc.2006.02.003
Cui X, Zhu P, Yang X, Li K, Ji C (2014) Optimized big data k-means clustering using MapReduce. J Supercomput 70(3):1249–1259. https://doi.org/10.1007/s11227-014-1225-7
Cuomo S, De Angelis V, Farina G, Marcellino L, Toraldo G (2017) A GPU-accelerated parallel k-means algorithm. Comput Electr Eng. https://doi.org/10.1016/j.compeleceng.2017.12.002
A survey onparallel clustering algorithms forBig Data
1 3
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design and implementation, OSDI'04, vol 6. USENIX Association, Berkeley
Deng Z, Hu Y, Zhu M, Huang X, Du B (2015) A scalable and fast optics for clustering trajectory big data. Cluster Comput 18(2):549–562. https://doi.org/10.1007/s10586-014-0413-9
Ekanayake J, Li H, Zhang B, Gunarathne T, Bae SH, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818
Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30(7):1575–1584
Erdem A, Gündem Tİ (2014) M-FDBSCAN: a multicore density-based uncertain data clustering algorithm. Turk J Electr Eng Comput Sci 22:143–154. https://doi.org/10.3906/elk-1202-83
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining, KDD'96. AAAI Press, Portland, pp 226–231
Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya AY, Foufou S, Bouras A (2014) A survey of clustering algorithms for Big Data: taxonomy and empirical analysis. IEEE Trans Emerg Top Comput 2(3):267–279. https://doi.org/10.1109/TETC.2014.2330519
Farooq U, Marrakchi Z, Mehrez H (2012) FPGA architectures: an overview. In: Tree-based heterogeneous FPGA architectures, chap. 2. Springer, New York, pp 7–48. https://doi.org/10.1007/978-1-4614-3594-5_2
Ferreira Cordeiro RL, Traina Junior C, Machado Traina AJ, López J, Kang U, Faloutsos C (2011) Clustering very large multi-dimensional datasets with mapreduce. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '11. ACM, San Diego, pp 690–698. https://doi.org/10.1145/2020408.2020516
Gehweiler J, Meyerhenke H (2010) A distributed diffusive heuristic for clustering a virtual p2p supercomputer. In: IEEE international symposium on parallel distributed processing, workshops and Phd forum (IPDPSW). IEEE, Atlanta, pp 1–8. https://doi.org/10.1109/IPDPSW.2010.5470922
Gepner P, Kowalik MF (2006) Multi-core processors: new way to achieve high system performance. In: International symposium on parallel computing in electrical engineering (PARELEC'06). Bialystok, Poland, pp 9–13. https://doi.org/10.1109/PARELEC.2006.54
Gouineau F, Landry T, Triplet T (2016) Patchwork, a scalable density-grid clustering algorithm. In: Proceedings of the 31st annual ACM symposium on applied computing, SAC '16. ACM, Pisa, pp 824–831. https://doi.org/10.1145/2851613.2851643
Hadian A, Shahrivari S (2014) High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs. J Supercomput 69(2):845–863. https://doi.org/10.1007/s11227-014-1185-y
Han D, Agrawal A, Liao WK, Choudhary A (2016) A novel scalable DBSCAN algorithm with Spark. In: IEEE international parallel and distributed processing symposium workshops (IPDPSW). IEEE, Chicago, pp 1393–1402. https://doi.org/10.1109/IPDPSW.2016.57
Han J, Kamber M, Pei J (2012) Cluster analysis: basic concepts and methods. In: Data mining, The Morgan Kaufmann series in data management systems, 3rd edn, chap. 10. Morgan Kaufmann, pp 443–495. https://doi.org/10.1016/B978-0-12-381479-1.00010-1
Harish P, Narayanan PJ (2007) Accelerating large graph algorithms on the GPU using CUDA. In: High performance computing - HiPC 2007. Lecture notes in computer science. Springer, Berlin, pp 197–208. https://doi.org/10.1007/978-3-540-77220-0_21
Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. Appl Stat 28(1):100. https://doi.org/10.2307/2346830
Havens TC, Bezdek JC, Leckie C, Hall LO, Palaniswami M (2012) Fuzzy c-means algorithms for very large data. IEEE Trans Fuzzy Syst 20(6):1130–1146. https://doi.org/10.1109/TFUZZ.2012.2201485
He Y, Tan H, Luo W, Feng S, Fan J (2014) MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data. Front Comput Sci 8(1):83–99. https://doi.org/10.1007/s11704-013-3158-3
Huang P, Li X, Yuan B (2015) A parallel GPU-based approach to clustering very fast data streams. In: Proceedings of the 24th ACM international on conference on information and knowledge management, CIKM '15. ACM, Melbourne, pp 23–32. https://doi.org/10.1145/2806416.2806545
Hussain HM, Benkrid K, Seker H, Erdogan AT (2011) FPGA implementation of k-means algorithm for bioinformatics application: an accelerated approach to clustering microarray data. In: NASA/ESA conference on adaptive hardware and systems (AHS). IEEE, San Diego, pp 248–255. https://doi.org/10.1109/AHS.2011.5963944
Z.Dafir et al.
1 3
Jia F, Wang C, Li X, Zhou X (2015) SAKMA: specialized FPGA-based accelerator architecture for data-intensive k-means algorithms. In: Algorithms and architectures for parallel processing. Springer, Cham, pp 106–119. https://doi.org/10.1007/978-3-319-27122-4_8
Jin R, Kou C, Liu R, Li Y (2013) Efficient parallel spectral clustering algorithm design for large data sets under cloud computing environment. J Cloud Comput Adv Syst Appl 2(1):18. https://doi.org/10.1186/2192-113X-2-18
Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2002) An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Mach Intell 24(7):881–892. https://doi.org/10.1109/TPAMI.2002.1017616
Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392. https://doi.org/10.1137/S1064827595287997
Kraus JM, Kestler HA (2010) A highly efficient multi-core algorithm for clustering extremely large datasets. BMC Bioinform 11(1):169. https://doi.org/10.1186/1471-2105-11-169
Kriegel HP, Pfeifle M (2005) Density-based clustering of uncertain data. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining. ACM, Chicago, pp 672–677. https://doi.org/10.1145/1081870.1081955
Lanczos C (1950) An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Government Press Office, Los Angeles
Laney D (2001) 3D data management: controlling data volume, velocity, and variety. Technical Report 949, Gartner
Li C, Zhang Y, Jiao M, Yu G (2014) Mux-Kmeans: multiplex Kmeans for clustering large-scale data set. In: Proceedings of the 5th ACM workshop on scientific cloud computing, ScienceCloud '14. ACM, Vancouver, pp 25–32. https://doi.org/10.1145/2608029.2608033
Lin F, Cohen WW (2010) Power iteration clustering. In: Proceedings of the 27th international conference on machine learning (ICML-10). Omnipress, Haifa, pp 655–662
Lin KW, Lin CH, Hsiao CY (2014) A parallel and scalable CAST-based clustering algorithm on GPU. Soft Comput 18(3):539–547. https://doi.org/10.1007/s00500-013-1074-y
Liu R, Li X, Du L, Zhi S, Wei M (2017) Parallel implementation of density peaks clustering algorithm based on spark. Procedia Comput Sci 107(Supplement C):442–447. https://doi.org/10.1016/j.procs.2017.03.138
Luo G, Luo X, Gooch TF, Tian L, Qin K (2016) A parallel DBSCAN algorithm based on spark. In: IEEE international conferences on big data and cloud computing, social computing and networking, sustainable computing and communications. IEEE, Atlanta, pp 548–553. https://doi.org/10.1109/BDCloud-SocialCom-SustainCom.2016.85
Mallios X, Vassalos V, Venetis T, Vlachou A (2016) A framework for clustering and classification of big data using spark. In: Debruyne C, Panetto H, Meersman R, Dillon T, Kühn E, O'Sullivan D, Ardagna CA (eds) On the move to meaningful internet systems: OTM 2016 conferences, vol 10033. Springer, Cham, pp 344–362. https://doi.org/10.1007/978-3-319-48472-3_20
Melo D, Toledo S, Mourao F, Sachetto R, Andrade G, Ferreira R, Parthasarathy S, Rocha L (2016) Hierarchical density-based clustering based on GPU accelerated data indexing strategy. Procedia Comput Sci 80:951–961. https://doi.org/10.1016/j.procs.2016.05.389
Milojicic DS, Kalogeraki V, Lukose R, Nagaraja K, Pruyne J, Richard B, Rollins S, Xu Z (2002) Peer-to-peer computing. Technical Report HPL-2002-57, HP Labs
Nanni M, Pedreschi D (2006) Time-focused clustering of trajectories of moving objects. J Intell Inf Syst 27(3):267–289. https://doi.org/10.1007/s10844-006-9953-7
Nickolls J, Buck I, Garland M (2008) Scalable parallel programming. In: IEEE hot chips 20 symposium (HCS). IEEE, pp 40–53
Owens J, Houston M, Luebke D, Green S, Stone J, Phillips J (2008) GPU computing. Proc IEEE 96(5):879–899. https://doi.org/10.1109/JPROC.2008.917757
Patwary MA, Palsetia D, Agrawal A, Liao WK, Manne F, Choudhary A (2013) Scalable parallel optics data clustering using graph algorithmic techniques. In: Proceedings of the international conference on high performance computing, networking, storage and analysis, SC '13. ACM, Denver, pp 49:1–49:12. https://doi.org/10.1145/2503210.2503255
Rodriguez A, Laio A (2014) Clustering by fast search and find of density peaks. Science 344(6191):1492–1496. https://doi.org/10.1126/science.1242072
Savvas IK, Tselios D (2016) Parallelizing DBSCAN algorithm using MPI. In: IEEE 25th international conference on enabling technologies: infrastructure for collaborative enterprises (WETICE). IEEE, Paris, pp 77–82. https://doi.org/10.1109/WETICE.2016.26
Scicluna N, Bouganis CS (2015) ARC 2014: a multidimensional FPGA-based parallel DBSCAN architecture. ACM Trans Reconfig Technol Syst 9(1):2:1–2:15. https://doi.org/10.1145/2724722
A survey onparallel clustering algorithms forBig Data
1 3
Sheikholeslami G, Chatterjee S, Zhang A (2000) Wavecluster: a wavelet-based clustering approach for spatial data in very large databases. VLDB J Int J Very Large Data Bases 8(3–4):289–304. https://doi.org/10.1007/s007780050009
Shi S, Yue Q, Wang Q (2014) FPGA based accelerator for parallel DBSCAN algorithm. Comput Model New Technol 18(2):135–142
Singh D, Reddy CK (2014) A survey on platforms for big data analytics. J Big Data 2(1):8. https://doi.org/10.1186/s40537-014-0008-6
Sinha A, Jana PK (2016) A novel k-means based clustering algorithm for big data. In: International conference on advances in computing, communications and informatics (ICACCI). IEEE, pp 1875–1879. https://doi.org/10.1109/ICACCI.2016.7732323
Skillicorn D (1999) Strategies for parallel data mining. IEEE Concurr 7(4):26–35. https://doi.org/10.1109/4434.806976
Sotiropoulou CL, Gkaitatzis S, Annovi A, Beretta M, Giannetti P, Kordas K, Luciano P, Nikolaidis S, Petridou C, Volpi G (2014) A multi-core FPGA-based 2D-clustering implementation for real-time image processing. IEEE Trans Nuclear Sci 61(6):3599–3606. https://doi.org/10.1109/TNS.2014.2364183
Stone JE, Gohara D, Shi G (2010) OpenCL: a parallel programming standard for heterogeneous computing systems. Comput Sci Eng 12(3):66
Sun Z, Fox G, Gu W, Li Z (2014) A parallel clustering method combined information bottleneck theory and centroid-based clustering. J Supercomput 69(1):452–467. https://doi.org/10.1007/s11227-014-1174-1
Tsapanos N, Tefas A, Nikolaidis N, Pitas I (2015) A distributed framework for trimmed kernel k-means clustering. Pattern Recognit 48(8):2685–2698. https://doi.org/10.1016/j.patcog.2015.02.020
Tsapanos N, Tefas A, Nikolaidis N, Pitas I (2016) Efficient mapreduce kernel k-means for big data clustering. In: Proceedings of the 9th hellenic conference on artificial intelligence, SETN '16. ACM, Thessaloniki, pp 28:1–28:5. https://doi.org/10.1145/2903220.2903255
Valiant LG (1990) A bridging model for parallel computation. Commun ACM 33(8):103–111. https://doi.org/10.1145/79173.79181
Van Dongen S (2008) Graph clustering via a discrete uncoupling process. SIAM J Matrix Anal Appl 30(1):121–141. https://doi.org/10.1137/040608635
Voulgaris S, Gavidia D, van Steen M (2005) Cyclon: inexpensive membership management for unstructured p2p overlays. J Netw Syst Manag 13(2):197–217. https://doi.org/10.1007/s10922-005-4441-x
Wang J, Yuan D, Jiang M (2012) Parallel K-PSO based on MapReduce. In: IEEE 14th international conference on communication technology. IEEE, Chengdu, pp 1203–1208. https://doi.org/10.1109/ICCT.2012.6511380
Wang B, Yin J, Hua Q, Wu Z, Cao J (2016) Parallelizing k-means-based clustering on spark. In: International conference on advanced cloud and Big Data (CBD). IEEE, Chengdu, pp 31–36. https://doi.org/10.1109/CBD.2016.016
Winterstein F, Bayliss S, Constantinides GA (2013) FPGA-based k-means clustering using tree-based data structures. In: The 23rd international conference on field programmable logic and applications. IEEE, Porto, pp 1–6. https://doi.org/10.1109/FPL.2013.6645501
Yan W, Brahmakshatriya U, Xue Y, Gilder M, Wise B (2013) p-PIC: parallel power iteration clustering for big data. J Parallel Distrib Comput 73(3):352–359. https://doi.org/10.1016/j.jpdc.2012.06.009
Yang J, Li X (2013) MapReduce based method for big data semantic clustering. In: IEEE international conference on systems, man, and cybernetics. IEEE, pp 2814–2819. https://doi.org/10.1109/SMC.2013.480
Yıldırım AA, Özdoğan C (2011) Parallel wavecluster: a linear scaling parallel clustering algorithm implementation with application to very large datasets. J Parallel Distrib Comput 71(7):955–962. https://doi.org/10.1016/j.jpdc.2011.03.007
Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing, HotCloud'10. USENIX Association, Berkeley
Zayani A, Ben N'Cir CE, Essoussi N (2016) Parallel clustering method for non-disjoint partitioning of large-scale data based on spark framework. In: IEEE international conference on big data (Big Data). IEEE, Washington, DC, pp 1064–1069. https://doi.org/10.1109/BigData.2016.7840708
Zhang Y, Mueller F, Cui X, Potok T (2010) Large-scale multi-dimensional document clustering on GPU clusters. In: IEEE international symposium on parallel distributed processing (IPDPS). IEEE, pp 1–10. https://doi.org/10.1109/IPDPS.2010.5470429
Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on MapReduce. In: Cloud computing. Lecture notes in computer science. Springer, Berlin, pp 674–679. https://doi.org/10.1007/978-3-642-10665-1_71
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
... The feature vector is mapped into the feature space and the observation is assigned to one and only one class depending on the relative position to the frontiers ( Figure 2). In the case of unsupervised classifiers (also known as clustering) [20][21][22][23], given a set of observations (xi, yi), all feature vectors xi ∈ R m are known but the labels only are known at the end of the training process. The training process groups the observations into clusters (or classes) according to some criterion. ...
... This classifier reveals the number of different clusters (or groups) that can be recognized in the training dataset. In the case of unsupervised classifiers (also known as clustering) [20][21][22][23], given a set of observations (x i , y i ), all feature vectors x i ∈ R m are known but the labels only are known at the end of the training process. The training process groups the observations into clusters (or classes) according to some criterion. ...
Article
Full-text available
This paper presents an innovative technique, Advanced Predictor of Electrical Parameters, based on machine learning methods to predict the degradation of electronic components under the effects of radiation. The term degradation refers to the way in which electrical parameters of the electronic components vary with the irradiation dose. This method consists of two sequential steps defined as ‘recognition of degradation patterns in the database’ and ‘degradation prediction of new samples without any kind of irradiation’. The technique can be used under two different approaches called ‘pure data driven’ and ‘model based’. In this paper, the use of Advanced Predictor of Electrical Parameters is shown for bipolar transistors, but the methodology is sufficiently general to be applied to any other component.
... The initial dataset can be divided into smaller sections, each to be processed by an individual computing unit. This technique, known as data parallelism, proves particularly useful when handling datasets that exceed the memory capacity of a single machine [30]. ...
... For extremely large datasets that exceed the memory of a single machine, distributed computing frameworks such as Apache Hadoop or Apache Spark are beneficial. These frameworks facilitate the distribution of data and computation across several nodes in a cluster, accommodating larger datasets than would be possible on a single machine [30]. ...
Article
Full-text available
This paper introduces a novel formulation of the clustering problem, namely, the minimum sum-of-squares clustering of infinitely tall data (MSSC-ITD), and presents HPClust, an innovative set of hybrid parallel approaches for its effective solution. By utilizing modern high-performance computing techniques, HPClust enhances key clustering metrics: effectiveness, computational efficiency, and scalability. In contrast to vanilla data parallelism, which only accelerates processing time through the MapReduce framework, our approach unlocks superior performance by leveraging the multi-strategy competitive–cooperative parallelism and intricate properties of the objective function landscape. Unlike other available algorithms that struggle to scale, our algorithm is inherently parallel in nature, improving solution quality through increased scalability and parallelism and outperforming even advanced algorithms designed for small- and medium-sized datasets. Our evaluation of HPClust, featuring four parallel strategies, demonstrates its superiority over traditional and cutting-edge methods by offering better performance in the key metrics. These results also show that parallel processing not only enhances the clustering efficiency, but the accuracy as well. Additionally, we explore the balance between computational efficiency and clustering quality, providing insights into optimal parallel strategies based on dataset specifics and resource availability. This research advances our understanding of parallelism in clustering algorithms, demonstrating that a judicious hybridization of advanced parallel approaches yields optimal results for MSSC-ITD. Experiments on the synthetic data further confirm HPClust’s exceptional scalability and robustness to noise.
... Parallel clustering algorithms customized for big data platforms have been well reviewed in [56]. As examined in a thorough analysis [57], this examines scalable clustering algorithms based on approaches, providing insights into both streaming and non-streaming algorithms and setting them apart from conventional methods. ...
Article
Full-text available
For organizing and analyzing massive amounts of data and revealing hidden patterns and structures, clustering is a crucial approach. This paper examines unique strategies for rapid clustering, highlighting the problems and possibilities in this area. The paper includes a brief introduction to clustering, discussing various clustering algorithms, improvements in handling various data types, and appropriate evaluation metrics. It then highlights the unsupervised nature of clustering and emphasizes its importance in many different fields, including customer segmentation, market research, and anomaly detection. This review emphasizes ongoing efforts to address these issues through research and suggests exciting directions for future investigations. By examining the advancements, challenges, and future opportunities in clustering, this research aims to increase awareness of cutting-edge approaches and encourage additional innovations in this essential field of data analysis and pattern identification. It highlights the need for resilience to noise and outliers, domain knowledge integration, scalable and efficient algorithms, and interpretable clustering technologies. In addition to managing high-dimensional data, creating incremental and online clustering techniques, and investigating deep learning-based algorithms, the study suggests future research areas. Additionally featured are real-world applications from several sectors. Although clustering approaches have made a substantial contribution, more research is necessary to solve their limitations and fully realize their promise for data analysis.
... However, clustering-based intrusion detection methods face challenges when dealing with scalability issues in the analysis of large volumes of network traffic. To address this, several parallel clustering methods have been developed in the literature to handle large-scale data [9]. Many of these methods leverage the MapReduce framework [11] for data processing. ...
Article
Full-text available
Given the exponential growth of available data in large networks, the existence of rapid, transparent and explainable intrusion detection systems has become of high necessity to effectively discover attacks in such huge networks. To deal with this challenge, we propose a novel explainable intrusion detection system based on Spark, Particle Swarm Optimization (PSO) clustering and eXplainable Artificial Intelligence (XAI) techniques. Spark is used as a parallel processing model for the effective processing of large-scale data, PSO is integrated for improving the quality of the intrusion detection system by avoiding sensitive initialization and premature convergence of the clustering algorithm and finally, XAI techniques are used to enhance interpretability and explainability of intrusion recommendations by providing both micro and macro explanations of detected intrusions. Experiments are conducted on several large collections of real datasets to show the effectiveness of the proposed intrusion detection system in terms of explainability, scalability and accuracy. The proposed system has shown high transparency in assisting security experts and decision-makers to understand and interpret attack behavior.
... The record deduplication was efficiently conducted using DBSCAN clustering, which has the significant advantage of being able to form clusters of arbitrary shape while maintaining fast performance, characterized by a time complexity of O(|E| log(|E|)) [5]. It is worth noting that scalability can be achieved by using one of the parallel versions of DBSCAN, as discussed in [3]. ...
Chapter
Full-text available
Record linkage is the process of matching records from multiple data sources that refer to the same entities. When applied to a single data source, this process is known as deduplication. With the increasing size of data sources, recently referred to as Big Data, the complexity of the matching process has become one of the major challenges for record linkage and deduplication. In recent decades, several blocking, indexing, and filtering techniques have been developed. Their purpose is to reduce the number of record pairs to be compared by removing obvious non-matching pairs in the deduplication process, while maintaining high matching quality. Currently developed algorithms and traditional techniques are not efficient, as they still lose a significant proportion of true matches when removing comparison pairs. This paper proposes more efficient algorithms for removing non-matching pairs, with an explicitly proven mathematical lower bound on the recently used state-of-the-art approximate string matching method, Fuzzy Jaccard Similarity. The algorithm also performs classification efficiently using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) in log-linear time complexity O(|E| log(|E|)).
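A minimal single-machine sketch of the DBSCAN-based deduplication step, assuming a plain token-level Jaccard distance in place of the paper's Fuzzy Jaccard Similarity, and an explicit pairwise distance matrix (which the paper's blocking and filtering techniques are designed to avoid); the eps and min_samples values are illustrative.

```python
# Sketch: cluster near-duplicate records with DBSCAN over a precomputed
# token-Jaccard distance matrix (a stand-in for the paper's Fuzzy Jaccard).
import numpy as np
from sklearn.cluster import DBSCAN

records = ["John Smith, 12 Oak St", "Jon Smith, 12 Oak Street",
           "Mary Jones, 5 Elm Rd", "M. Jones, 5 Elm Road"]

def jaccard_distance(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(ta & tb) / len(ta | tb)

n = len(records)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = jaccard_distance(records[i], records[j])

# Records in the same cluster are candidate duplicates; label -1 marks
# records considered unique (noise).
labels = DBSCAN(eps=0.6, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)
```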
Article
Since its early formulations, co-clustering has gained popularity and interest both within and outside the machine learning community as a powerful learning paradigm for clustering high-dimensional data with good explainability properties. The simultaneous partitioning of all the modes of the input data tensors (rows and columns in a data matrix) is both a method for improving clustering on one mode while performing dimensionality reduction on the other mode(s), and a tool for providing an actionable interpretation of the clusters in the main mode as summaries of the features in each other mode(s). Hence, it is useful in many complex decision systems and data science applications. In this paper, we survey the co-clustering literature by reviewing the main co-clustering methods, with a special focus on the work done in the last twenty-five years. We identify, describe, and compare the main algorithmic categories, and provide a practical characterization with respect to similar unsupervised techniques. We also try to explain why it is still a powerful tool despite the apparent recent decrease in interest shown by the machine learning community. To this purpose, we review the most recent trends in co-clustering research and outline the open problems and promising future research perspectives.
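As a concrete example of the simultaneous row/column partitioning discussed above, the following sketch applies scikit-learn's SpectralCoclustering to a small block-structured matrix; this is one representative co-clustering method, not a summary of the surveyed algorithms.

```python
# Sketch: SpectralCoclustering partitions the rows and columns of a data
# matrix simultaneously, recovering its block structure.
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
# Block-structured matrix: two groups of rows take high values on two
# disjoint groups of columns.
X = rng.random((6, 8)) * 0.1
X[:3, :4] += 1.0
X[3:, 4:] += 1.0

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
print(model.row_labels_)     # cluster of each row
print(model.column_labels_)  # cluster of each column
```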
Chapter
The field of Quantum Reinforcement Learning (QRL) has gained much attention in recent years within the domain of artificial intelligence (AI) and machine learning (ML). With the widespread acceptance of AI agents across diverse industries, it is imperative to augment their potential, effectiveness, and flexibility. Quantum Reinforcement Learning (QRL) presents an innovative prospect that can overcome the constraints imposed by conventional AI algorithms and offer revolutionary resolutions to intricate real-world predicaments. The significance of this research lies in its ability to address a primary obstacle in artificial intelligence, which is the development of efficient decision-making and learning mechanisms for AI agents. Current classical reinforcement learning algorithms have made significant advances, but they frequently encounter computational hurdles, particularly in situations where instantaneous decisions are crucial. QRL plays a significant role in revolutionising the design of systems that possess autonomous learning capabilities akin to human cognition. It holds immense promise with its potential for exponential acceleration that could transform how AI agents learn, adjust, and make decisions. Consequently, this research plays a critical role in examining the integration of quantum principles with reinforcement learning to unlock novel domains within artificial intelligence. It promotes the progression of quantum computing through the provision of pragmatic applications in the field of artificial intelligence. This allows for the creation of AI agents optimised by quantum computation, which can effectively perform a diverse range of tasks. It transcends the theoretical realm and fosters tangible solutions that can reshape industries and economies. However, it carries a great deal of risk if not monitored, managed, and controlled.
Article
Full-text available
Clustering algorithms are widely used in data mining. They attempt to classify elements into several clusters such that elements in the same cluster are similar to each other, while elements belonging to different clusters are not. The recently published density peaks clustering algorithm overcomes the disadvantage of distance-based algorithms, which can only find clusters of nearly circular shape: it can discover clusters of arbitrary shape and is insensitive to noise. However, it needs to calculate distances between all pairs of data points and does not scale to Big Data. In order to reduce the computational cost of the algorithm, we propose an efficient distributed density peaks clustering algorithm based on Spark's GraphX. This paper demonstrates the effectiveness of the method on two different data sets. The experimental results show that our system improves performance significantly (up to 10x) compared to a MapReduce implementation. We also evaluate our system's expansibility and scalability.
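A sequential sketch of the two quantities density peaks clustering relies on, namely the local density rho and the distance delta to the nearest denser point, which the cited work distributes over Spark's GraphX; the cutoff distance dc and the data are illustrative.

```python
# Sketch: compute the density peaks scores rho (local density within cutoff
# dc) and delta (distance to the nearest point of higher density).
import numpy as np

def density_peaks_scores(X, dc):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (d < dc).sum(axis=1) - 1          # neighbours within the cutoff dc
    delta = np.empty(len(X))
    for i in range(len(X)):
        denser = np.where(rho > rho[i])[0]
        # For the densest point, delta falls back to its largest distance.
        delta[i] = d[i, denser].min() if len(denser) else d[i].max()
    return rho, delta

X = np.random.default_rng(0).normal(size=(200, 2))
rho, delta = density_peaks_scores(X, dc=0.5)
# Points with both high rho and high delta are candidate cluster centres.
print(np.argsort(rho * delta)[-3:])
```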
Article
Full-text available
A huge amount of digital data containing useful information, called Big Data, is generated every day. To mine such useful information, clustering is a widely used data analysis technique. A large number of Big Data analytics frameworks have been developed to scale clustering algorithms for big data analysis. One such framework, Apache Spark, works well for iterative algorithms by supporting in-memory computation and scalability. We focus on the design and implementation of partitional clustering algorithms on Apache Spark, which are suited to clustering large datasets due to their low computational requirements. In this paper, we propose the Scalable Random Sampling with Iterative Optimization Fuzzy c-Means algorithm (SRSIO-FCM), implemented on an Apache Spark cluster to handle the challenges associated with big data clustering. Experimental studies on various big datasets have been conducted. The performance of SRSIO-FCM is compared with the proposed scalable versions of the Literal Fuzzy c-Means (LFCM) and Random Sampling plus Extension Fuzzy c-Means (rseFCM) algorithms implemented on the Apache Spark cluster. The comparative results are reported in terms of time and space complexity, run time, and clustering quality, showing that SRSIO-FCM runs in much less time without compromising clustering quality.
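For reference, a minimal NumPy sketch of one Fuzzy c-Means iteration (membership and centroid updates with fuzzifier m); SRSIO-FCM additionally applies random sampling and runs these updates on Spark partitions, which is not reproduced here.

```python
# Sketch: one Fuzzy c-Means iteration. u[i, j] is the membership of point i
# in cluster j; centroids are the membership-weighted means.
import numpy as np

def fcm_step(X, centroids, m=2.0, eps=1e-10):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + eps
    # u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
    u = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
    um = u ** m
    new_centroids = (um.T @ X) / um.sum(axis=0)[:, None]
    return u, new_centroids

X = np.random.default_rng(0).normal(size=(500, 3))
centroids = X[:4].copy()
for _ in range(20):
    u, centroids = fcm_step(X, centroids)
print(centroids)
```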
Article
Clustering approaches are widely used methodologies to analyse large data sets. The K-means algorithm is well known to be too computationally intensive for large data analytics problems. In this work, we focus on a parallel technique to reduce the execution time when K-means is used to cluster large datasets. We exploit the computational power of Graphics Processing Units (GPUs), a massively parallel architecture. We optimize the proposed implementation to handle (i) the space limitation of GPUs and (ii) the host-device data transfer time. Experimental results, on real and synthetic data, show that our parallelization approach gives good results in terms of execution time and speed-up.
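A sketch of the memory-bounded assignment step underlying such implementations: points are processed in tiles so that the distance matrix never exceeds a fixed budget. It is written with NumPy for clarity; on a GPU the same pattern could be expressed with an array library such as CuPy, which is an assumption rather than the authors' implementation.

```python
# Sketch: K-means assignment computed tile by tile, so only a (tile, k)
# distance block is held in memory at once instead of the full (n, k) matrix.
import numpy as np

def assign_in_tiles(X, centroids, tile=4096):
    labels = np.empty(len(X), dtype=np.int64)
    for start in range(0, len(X), tile):
        chunk = X[start:start + tile]
        d = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels[start:start + tile] = d.argmin(axis=1)
    return labels

X = np.random.default_rng(0).normal(size=(10000, 8))
centroids = X[:5].copy()
print(np.bincount(assign_in_tiles(X, centroids)))
```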
Conference Paper
Clustering large-scale data has become an important challenge that motivates several recent works. While the emphasis has been on the organization of massive data into disjoint groups, this work considers the identification of non-disjoint groups rather than disjoint ones. In this setting, a data object may belong simultaneously to several groups, since many real-world applications of clustering require non-disjoint partitioning to fit the data structure. For this purpose, we propose the Parallel Overlapping k-means method (POKM), which performs parallel clustering processes leading to a non-disjoint partitioning of the data. The proposed method is implemented within the Spark framework to distribute the work over the different computation nodes. Experiments performed on simulated and real-world multi-labeled datasets show both faster execution times and higher clustering quality compared to existing methods.
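A simplified sketch of non-disjoint assignment, in which a point joins every cluster whose centroid lies within a slack factor of its nearest-centroid distance; this only illustrates overlapping partitioning and is not the POKM update rule.

```python
# Sketch: non-disjoint (overlapping) assignment. The returned boolean matrix
# may contain several True entries per row, i.e. a point may join several clusters.
import numpy as np

def overlapping_assign(X, centroids, slack=1.5):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    nearest = d.min(axis=1, keepdims=True)
    return d <= slack * nearest

X = np.random.default_rng(0).normal(size=(100, 2))
centroids = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0]])
membership = overlapping_assign(X, centroids)
print(membership.sum(axis=1)[:10])   # number of clusters each point belongs to
```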
Article
Clustering is one of the important data mining tasks, especially for large and distributed data analysis. Distributed computing environments such as Peer-to-Peer (P2P) networks involve separated and scattered data sources distributed among the peers. Owing to the unpredictable growth and dynamic nature of P2P networks, peers' data are constantly changing. Due to the high volume of computation and communication, as well as privacy concerns, the processing of such data should be performed in a distributed way without central management. Today, most applications of P2P systems focus on unstructured P2P systems. In unstructured P2P networks, spreading gossip is a simple and efficient method of communication that can adapt to the dynamic conditions of these networks. Recently, several algorithms with different pros and cons have been proposed for data clustering in P2P networks. In this paper, by combining a novel method for extracting representative data, a gossip-based protocol, and a new centralized clustering method, a Gossip-Based Distributed Clustering algorithm for P2P networks, called GBDC-P2P, is proposed. The GBDC-P2P algorithm is suitable for data clustering in unstructured P2P networks and adapts to the dynamic conditions of these networks. In GBDC-P2P, peers perform the data clustering operation in a distributed manner only through communication with their neighbours. GBDC-P2P does not rely on a central server and operates asynchronously. Evaluation results demonstrate the superior performance of the GBDC-P2P algorithm, and a comparative analysis with other well-established methods illustrates the efficiency of the proposed method.
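A toy simulation of the gossip step: in each round every peer exchanges its local representative points with a random neighbour and keeps a bounded merged summary. The representative extraction and the final centralized clustering of GBDC-P2P are not reproduced, and all parameters are illustrative.

```python
# Sketch: gossip-style exchange of local representatives between neighbouring
# peers on a ring topology; after a few rounds each peer's summary reflects
# data spread across the network.
import numpy as np

rng = np.random.default_rng(0)
n_peers, reps_per_peer = 8, 5
# Each peer starts with representatives drawn from its own local data.
peers = [rng.normal(loc=i % 3, size=(reps_per_peer, 2)) for i in range(n_peers)]
neighbours = {i: [(i - 1) % n_peers, (i + 1) % n_peers] for i in range(n_peers)}

for _ in range(10):                       # gossip rounds
    for i in range(n_peers):
        j = rng.choice(neighbours[i])     # pick a random neighbour
        merged = np.vstack([peers[i], peers[j]])
        # Keep a bounded summary: subsample back to the representative budget.
        keep = rng.choice(len(merged), size=reps_per_peer, replace=False)
        peers[i] = merged[keep]

# After gossiping, any peer's summary approximates the global data spread.
print(peers[0].mean(axis=0))
```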
Conference Paper
Nowadays, massive data sets are generated in many modern applications ranging from economics to bioinformatics, and from social networks to scientific databases. Typically, such data need to be processed by machine learning algorithms, which entails high processing cost and usually requires the execution of iterative algorithms. Spark has recently been proposed as a framework that supports iterative algorithms over massive data efficiently. In this paper, we design a framework for clustering and classification of big data suitable for Spark. Our framework supports different restrictions on the data exchange model that are applicable in different settings. We integrate the k-means and ID3 algorithms into our framework, leading to interesting variants that apply under the different restrictions on the data exchange model. We implemented our algorithms over the open-source computing framework Spark and evaluated our approach on a cluster of 37 nodes, thus demonstrating the scalability of our techniques. Our experimental results show that we outperform the k-means algorithm provided by Spark by up to 31%, while centralized k-means is at least one order of magnitude worse.
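For comparison, a minimal PySpark run of the k-means implementation shipped with Spark's MLlib, which is the baseline such frameworks are typically evaluated against; the data and parameters here are placeholders.

```python
# Sketch: distributed k-means using Spark's built-in MLlib implementation.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

model = KMeans(k=2, seed=1).fit(df)      # distributed Lloyd iterations
print(model.clusterCenters())
predictions = model.transform(df)        # adds a 'prediction' column
predictions.show()
spark.stop()
```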