HPDBSCAN – Highly Parallel DBSCAN
Markus Götz
m.goetz@fz-juelich.de

Christian Bodenstein
c.bodenstein@fz-juelich.de

Morris Riedel
m.riedel@fz-juelich.de

Jülich Supercomputing Center
Leo-Brandt-Straße
52428 Jülich, Germany

University of Iceland
Sæmundargötu 2
101 Reykjavik, Iceland
ABSTRACT
Clustering algorithms in the field of data-mining are used
to aggregate similar objects into common groups. One of
the best-known of these algorithms is called DBSCAN. Its
distinct design enables the search for an a priori unknown number of arbitrarily shaped clusters, and at the same time allows noise to be filtered out. Due to its sequential formulation, the parallelization of DBSCAN poses a challenge. In this paper we present a new parallel approach which we call HPDBSCAN. It employs three major techniques in order to break the sequentiality, improve workload balancing and speed up neighborhood searches in distributed parallel processing environments: i) a computation split heuristic for domain decomposition, ii) a data index preprocessing step, and iii) a rule-based cluster merging scheme.
As a proof-of-concept we implemented HPDBSCAN as an
OpenMP/MPI hybrid application. Using real-world data
sets, such as a point cloud from the old town of Bremen,
Germany, we demonstrate that our implementation is able
to achieve a significant speed-up and scale-up in common
HPC setups. Moreover, we compare our approach with pre-
vious attempts to parallelize DBSCAN showing an order of
magnitude improvement in terms of computation time and
memory consumption.
Categories and Subject Descriptors
A.2.9 [General and reference]: Cross-computing tools
and techniques—Performance; F.5.8 [Theory of compu-
tation]: Design and analysis of algorithms—Parallel algo-
rithms; H.3.8 [Information systems]: Information sys-
tems applications—Data mining; I.2.1 [Computing method-
ologies]: Parallel computing methodologies—Parallel algo-
rithms
General Terms
Algorithms, Performance
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SC2015 2015 Austin, Texas USA
Copyright 2015 ACM X-XXXXX-XX-X/XX/XX ...$15.00.
Keywords
High performance computing, scalable clustering, parallel
DBSCAN, HPDBSCAN, OpenMP/MPI hybrid
1. INTRODUCTION
Cluster analysis is a data-mining technique that divides a
set of objects into disjoint subgroups, each containing sim-
ilar items. The resulting partition is called a clustering. A
clustering algorithm discovers these groups in the data by
maximizing a similarity measure within one group of items—
or cluster—and by minimizing it between individual clus-
ters. In contrast to supervised learning approaches, such
as classification or regression, clustering is an unsupervised
learning method. This means that it tries to find the mentioned structures without any a priori knowledge about the
actual ground-truth. Typical fields of application for cluster
analysis include sequence analysis in bio-informatics, tissue
analysis in neuro-biology, or satellite image segmentation.
Clustering algorithms can be divided into four classes:
partitioning-based, hierarchy-based, density-based and grid-
based [17]. In this paper we will discuss aspects of the
two latter classes. Specifically, we are going to talk about
the density-based clustering algorithm DBSCAN (density-based spatial clustering of applications with noise) [9], and
how to efficiently parallelize it using computationally ef-
ficient techniques of grid-based clustering algorithms. Its
principal idea is to find dense areas, the cluster cores, and
to expand these recursively in order to form clusters. The
algorithm’s formulation has an inherent sequential control
flow dependency at the point of the recursive expansion,
making it challenging to parallelize.
In our approach we break the interdependency by adopt-
ing core ideas of grid-based clustering algorithms. We over-
lay the input data with a regular hypergrid, which we use
to perform the actual DBSCAN clustering. The overlaid
grid has two main advantages. Firstly, we can use the grid
as a spatial index structure to reduce the search space for
neighborhood queries; and secondly, we can separate the en-
tire clustering space along the cell borders and distribute it
among all compute nodes. Due to the fact that the cells are constructed in a regular fashion, we can redistribute the data points, using halo areas, so that no intermediate communication is required during the parallel computation step.
In spatially skewed datasets regular cell space splits would
lead to an imbalanced computational workload, since most of
the points would reside in dense cell subspaces assigned to a
small number of compute nodes. To mitigate this we propose
a cost heuristic that allows us to identify data-dependent
split points on the fly. Finally, the local computation results
are merged through a rule-based cluster merging scheme,
with linear complexity.
The remainder of this paper is organized as follows. The
next section surveys related work. Section 3 describes de-
tails of the original DBSCAN algorithm. Subsequently, Sec-
tion 4 discusses our parallelized version of DBSCAN by the
name of HPDBSCAN and shows its algorithmic equivalence.
Section 5 presents details of our hybrid OpenMP/MPI im-
plementation. Evaluations are shown in Section 6, where we
also present our benchmark method, datasets and the layout
of the test environment. We conclude the paper in Section 7
and give an overview of possible future work.
2. RELATED WORK
There are a number of previous research studies dealing
with the parallelization of DBSCAN. To the best of our
knowledge, the first attempt was made by Xiaowei Xu in
collaboration with Kriegel et al. [27]. In their approach,
single neighborhood search queries are parallelized using a distributed version of the R-tree, DBSCAN's original spatial index data structure. They adopt a master-slave
model, where the index is built on the master node and
the whole data set is split among the slaves according to
the bounding rectangles of the index. Subsequently, they
merge the local cluster results by reclustering the bordering
regions of the splits. Zhou et al. [28] and Arlia et al. [3]
present similar approaches, where they accelerate the neigh-
borhood queries by replicating the entire index on each of
the slave nodes, assuming the index fits entirely into the
main memory. Brecheisen et al. [5] have published a paral-
lel version of DBSCAN that approximates the cluster using
another clustering algorithm called OPTICS. Each of the
cluster candidates found in this manner is sent to a slave node in order to separate the actual cluster points from the guessed ones. The local results are then merged by the master
into one coherent view. This approach, however, fails to
scale for big databases, since the pre-filtering has to be done
on the master, in main memory. Chen et al. [7] propose an-
other distributed DBSCAN algorithm, called P-DBSCAN
that is based on a priority R-Tree. Unfortunately, the paper
does not state how the data is distributed or how the clus-
ters are formed. An in-depth speed and scale-up evaluation
is also not performed. A paper by Fu et al. [13] demon-
strates the first Map-Reduce implementation of DBSCAN.
The core idea of this approach is the same as the first par-
allelization attempt of Xu, that is, to parallelize singular
neighborhood queries—this time in the form of map tasks. He
et al. [19] present another implementation of a parallel DB-
SCAN based on the Map-Reduce paradigm. They are the
first to introduce the notion of a cell-based preprocessing
step in order to perform a fully distributed clustering with-
out the need to replicate the entire dataset or to communi-
cate in between. Finally, Patwary et al. [25] have published
research work that shows a parallel DBSCAN that scales
up to hundreds of cores. Their main contribution is a quick
merging algorithm based on a disjoint-set data structure.
However, they either need to fit the entire dataset into main
memory or need a manual preprocessing step that splits the
data within a distributed computing environment.
3. THE DBSCAN ALGORITHM
Figure 1: DBSCAN clustering with minPoints = 4 (illustrating core, border, and noise points, the search radius ε, and the DDR, DR, and DC relations)
DBSCAN is a density-based clustering algorithm that was
published in 1996 by Ester et al. [9]. Its principal idea is to find dense areas and to expand these recursively in order to find clusters. A dense region is thereby formed by a point that has at least minPoints neighboring points within a given search radius ε. This dense area is also called the core of a cluster. For each of the found neighbor points the density criterion is reapplied and the cluster is consequently expanded. All points that do not form a cluster core and that are not "absorbed" through expansion are regarded as noise.
A formal definition of the algorithm is as follows. Let X be the entire dataset of points to be clustered and p, q ∈ X two arbitrary points of this set. Then the following definitions describe DBSCAN with respect to its parameters ε and minPoints. Figure 1 illustrates these notions.
Definition 1. Epsilon neighborhood (Nε)—The epsilon neighborhood Nε of p denotes all points q of the dataset X which have a distance dist(p, q) that is less than or equal to ε, or formally: Nε(p) = {q | dist(p, q) ≤ ε}. In practice, the Euclidean distance is often used for dist, making the epsilon neighborhood of p equal to the geometrically surrounding hypersphere with radius ε.

Definition 2. Core point—p is considered a core point if the epsilon neighborhood of p contains at least minPoints points including itself: Core(p) ⇔ |Nε(p)| ≥ minPoints.

Definition 3. Directly density-reachable (DDR)—A point q is directly density-reachable from a point p, if q lies within p's epsilon neighborhood and p is a core point, i.e., DDR(p, q) ⇔ q ∈ Nε(p) ∧ Core(p).

Definition 4. Density-reachable (DR)—Two points p0 = p and pn = q are called density-reachable, if there exists a chain of directly density-reachable points—{pi | 0 ≤ i < n ∧ DDR(pi, pi+1)}—linking them with one another.

Definition 5. Border point—Border points are special cluster points that are usually located at the outer edges of a cluster. They do not fulfill the core point criterion but are still included in the cluster due to direct density-reachability. Formally, this can be expressed as Border(p) ⇔ |Nε(p)| < minPoints ∧ ∃q : DDR(q, p).

Definition 6. Density-connected (DC)—Two points p and q are called density-connected, if there is a third point r, such that r can density-reach p and q: DC(p, q) ⇔ ∃r ∈ X : DR(r, p) ∧ DR(r, q). Note that density-connectivity is a weaker condition than density-reachability. Two border points can be density-connected, even though they are not density-reachable by definition due to not fulfilling the core point criterion.

Definition 7. Cluster—A cluster is a subset of the whole dataset, where each of the points is density-connected to all the others and that contains at least one dense region, or in other words a core point. This can be denoted as ∅ ⊂ C ⊆ X with ∀p, q ∈ C : DC(p, q) and ∃p ∈ C : Core(p).

Definition 8. Noise—Noise points are special points that do not belong to any epsilon neighborhood, such that Noise(p) ⇔ ¬∃q : DDR(q, p).
Listing 1 sketches pseudo code for a classic DBSCAN im-
plementation. Some of the type and function definitions are
left out, as their meaning can easily be inferred.
DBSCAN's main properties that distinguish it from more traditional clustering algorithms, such as k-means [17] for instance, are: i) it can detect arbitrarily shaped clusters that can even protrude into, or surround, one another; ii) the cluster count does not have to be known a priori; and iii) it has a notion of noise inside the data.
Finding actual values for ε and minPoints is dependent
on the clustering problem and its application domain. Ester
et al. [9] propose a simple algorithm for estimating ε. The
core idea is to determine the “thinnest” cluster area through
either visualization or a sorted 4-dist graph, and then choose
ε to be equal to that width.
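As a concrete illustration of this heuristic, a brute-force sketch of the sorted 4-dist computation could look as follows; the point layout, the helper names and the O(n²) nearest-neighbor search are our own simplifications, not part of the original algorithm description. Plotting the returned values and picking the "knee" yields a candidate for ε.

// Minimal sketch of the sorted k-dist heuristic for choosing epsilon
// (k = 4 matches the 4-dist graph mentioned above). Brute force O(n^2).
#include <algorithm>
#include <cmath>
#include <vector>

using Point = std::vector<double>;

double dist(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) sum += (a[d] - b[d]) * (a[d] - b[d]);
    return std::sqrt(sum);
}

// Distance of every point to its k-th nearest neighbor, sorted descending.
std::vector<double> kDistGraph(const std::vector<Point>& data, std::size_t k) {
    std::vector<double> kdist;
    kdist.reserve(data.size());
    for (const Point& p : data) {
        std::vector<double> d;
        d.reserve(data.size() - 1);
        for (const Point& q : data)
            if (&p != &q) d.push_back(dist(p, q));  // skip the point itself
        std::nth_element(d.begin(), d.begin() + (k - 1), d.end());
        kdist.push_back(d[k - 1]);                  // k-th smallest distance
    }
    std::sort(kdist.rbegin(), kdist.rend());        // descending order
    return kdist;
}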
def DBSCAN(X, eps, minPoints):
    clusters = list()
    for p in X:
        if visited(p):
            continue
        markAsVisited(p)
        Np = query(p, X, eps)
        if length(Np) < minPoints:
            markAsNoise(p)
        else:
            C = Cluster()
            add(clusters, C)
            expand(p, Np, X, C, eps, minPoints)
    return clusters

def expand(p, Np, X, C, eps, minPoints):
    add(p, C)
    for o in Np:
        if notVisited(o):
            markAsVisited(o)
            No = query(o, X, eps)
            if length(No) >= minPoints:
                Np = join(Np, No)
        if hasNoCluster(o):
            add(o, C)
Listing 1: Classic DBSCAN pseudocode
4. HPDBSCAN
In this section we present Highly Parallel DBSCAN, or in
short HPDBSCAN. Our approach to parallelize DBSCAN
consists of four major stages. In the first step the entire
dataset is loaded in equal-sized chunks by all processors in
parallel. Then, the data is preprocessed. This entails as-
signing each of the d-dimensional points in the dataset to
a virtual, unique spatial cell corresponding to their location
within the data space, with respect to the given distance
function. This allows us to sort the data points according
to their proximity, and to redistribute them to distinct com-
putation units of the parallel computing system. In order
to balance the computational load for each of the process-
ing units, we estimate the load using a simple cost heuristic
accommodating the grid overlay.
After this division phase, we perform the clustering of the
redistributed points in the second step locally on each of the
processing units, i.e., we assign a temporary cluster label to
each of the data points.
Subsequently, these have to be merged into one global
result view in step three. Whenever the temporary label
assigned by a processing unit disagrees with the ones in the
halo areas of the neighboring processors, we generate cluster
relabeling rules.
In the fourth step, the rules are broadcasted and applied
locally. Figure 2 shows a schematic overview of the pro-
cess using the fundamental modeling concepts (FMC) nota-
tion [21]. The following sections scrutinize each of these substeps theoretically.
Figure 2: Schematic overview of HPDBSCAN. In FMC notation, the input data and the parameters ε and minPoints pass through a preprocessing stage (overlay hypergrid, estimate splits, sort and distribute) and a clustering stage (local DBSCAN, merge halos, cluster relabeling) that produces the final cluster labels.
4.1 Grid-based data preprocessing and index
The original DBSCAN paper proposes the use of R-trees [4]
in order to reduce the neighborhood search complexity from O(n²) to O(log(n)). The construction of the basic R-tree
cannot be performed in parallel as it requires the entire
dataset to be known. Therefore, other researchers [4, 27]
propose to either just replicate the entire dataset, and per-
form linear neighborhood scans in parallel for each data
item, or to use distributed versions of the R- or k-d-trees.
However, He et al. [19] point out that these approaches do
not scale in terms of memory consumption or communica-
tion cost with respect to large datasets and number of par-
allel processors.
Therefore, we have selected a far more scalable approach
for HPDBSCAN that is based on grid-based clustering algo-
rithms like, e.g., STING [17], and common spatial problems
in HPC, like for example HACC in particle physics [16]. Its
core idea is that the d-dimensional bounding box of the en-
tire dataset, with respect to dist, is overlaid by a regular,
non-overlapping hypergrid structure, which is then decom-
posed into subspaces by splitting the grid along the grid cell
boundaries. Each of the resulting subspaces is then exclu-
sively assigned to a parallel processor that is responsible for
computing the local clustering. In order to be able to do
so in a scalable fashion, all the data points within a par-
ticular subspace should be in the local memory of the re-
spective parallel processor, so that communication overhead
is avoided. However, in most cases the data points will be
distributed in arbitrary order within the dataset. Therefore,
the data has to be indexed first and then redistributed to
the parallel processor responsible for clustering the respec-
tive subspace.
In HPDBSCAN the indexing is performed by employing
a hashmap with offset and pointers into the data mem-
ory. For this, all parallel processors read an arbitrary, non-
overlapping, equally-sized chunk of the complete dataset
first. Then each data item of a chunk is uniquely associated
with the cell of the overlaid grid that it spatially occupies,
and vice versa—every grid cell contains all the data items
that its bounding box covers. This in turn enables us to or-
der all of the local data items with respect to their grid cell
so that they are consecutively placed in memory. Finally,
an indexing hashmap can be constructed with the grid cells
being the key, and the tuple of pointer into the memory and
number of items in this cell the value. An indexing approach
like this has an additional memory overhead of O(log(n))
similar to other approaches like R- or k-d-trees. Figure 3
shows the indexing approach exemplified by the dataset, in-
troduced in Section 3, for the data chunk of a processing
unit called processor 1.
Figure 3: Sorted data chunks locally indexed by each processor using hashmaps pointing into the memory (shown for the example dataset of Section 3, split between processor 1 and processor 2 on an ε-sized grid with cells numbered 1 to 12)
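A condensed sketch of this indexing step is given below, assuming points are stored as flat coordinate vectors and cells are flattened to a single integer id; both the layout and the names are our own simplifications of the scheme described above.

// Sketch of the grid indexing step: assign each point to an epsilon-sized cell,
// sort points by cell id, and build a hashmap from cell id to (offset, count)
// into the sorted point array. Bounding-box handling is simplified.
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct GridIndex {
    std::vector<std::vector<double>> points;  // sorted consecutively by cell id
    std::unordered_map<std::uint64_t, std::pair<std::size_t, std::size_t>> cells;  // cell -> (offset, count)
};

std::uint64_t cellOf(const std::vector<double>& p, const std::vector<double>& minCorner,
                     const std::vector<std::size_t>& cellsPerDim, double eps) {
    std::uint64_t cell = 0;
    for (std::size_t d = 0; d < p.size(); ++d) {  // row-major flattening of the d-dimensional grid
        std::uint64_t c = static_cast<std::uint64_t>((p[d] - minCorner[d]) / eps);
        cell = cell * cellsPerDim[d] + c;
    }
    return cell;
}

GridIndex buildIndex(std::vector<std::vector<double>> points, const std::vector<double>& minCorner,
                     const std::vector<std::size_t>& cellsPerDim, double eps) {
    auto key = [&](const std::vector<double>& p) { return cellOf(p, minCorner, cellsPerDim, eps); };
    std::sort(points.begin(), points.end(),
              [&](const auto& a, const auto& b) { return key(a) < key(b); });
    GridIndex index{std::move(points), {}};
    for (std::size_t i = 0; i < index.points.size(); ) {
        std::uint64_t c = key(index.points[i]);
        std::size_t begin = i;
        while (i < index.points.size() && key(index.points[i]) == c) ++i;
        index.cells[c] = {begin, i - begin};  // enables amortized O(1) cell lookups
    }
    return index;
}

With the hashmap in place, looking up the points of a cell reduces to a single hash lookup followed by a scan over a contiguous memory region.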
Using that index the data redistribution can be performed
in a straightforward fashion. The local data points of a par-
allel processor that do not lie within its assigned subspace
are simply transferred to the respective parallel processor
“owning” them. Afterwards all parallel processors have to
rebuild their local data indices in order to encompass the
received data. An efficient way of doing this is to send the
section of the data index structure along with the data to
the recipients. Due to the fact that the received and the
local data are pre-sorted, the sent data index section and
its memory pointers can be used to quickly merge them us-
ing, e.g., the merge-step of mergesort. The downside of the
data redistribution approach is that it requires an additional memory overhead of O(n/p) per parallel processor, with p being the number of parallel processors, to be able to restore
the initial data arrangement after the clustering. However,
since the additional overhead has linear complexity, it is
maintainable even for large scale problems.
Using the described index structure, cell-neighborhood
queries execute in amortized computation time of O(1). The
cell-neighborhood Ncell thereby consists of all cells that are
directly bordering the searched cell, its diagonals, as well
as the cell itself with respect to all dimensions. For the
cell labeled 6 in Figure 3 the cell-neighborhood is the set {1, 2, 3, 5, 6, 7, 9, 10, 11}. A formal definition follows.
Definition 9. Cell neighborhood—The cell neighborhood NCell(c) of a given cell c denotes all cells d from the space of all available grid cells C that have a Chebyshev distance distChebyshev [6] of zero or one to c, i.e., NCell(c) = {d | d ∈ C ∧ distChebyshev(c, d) ≤ 1}.
The actual epsilon neighborhood is then constructed from
all points within the direct cell-neighborhood, filtered using
the distance function dist. Šidlauskas et al. [26] show that a spatial grid index like this is superior to R-trees and k-d-
trees on index creation and queries, in terms of computation
time, under the assumption that the cell granularity is op-
timal with respect to future neighborhood searches. Due
to the fact that DBSCAN ’s search radius is constant, the
cells can trivially be determined to be hypercubes with the
side length of ε. From a technical perspective it has the
additional advantage that each of the d parts of the entire
cell-neighborhood vector are consecutive in memory. This
in turn enables data pre-fetching and the reuse of cell neigh-
borhoods, thus reducing the number of cache misses.
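To illustrate such a query, the sketch below enumerates the 3^d cells at Chebyshev distance at most one and applies the dist filter of Definition 1; the std::map-based cell container and the coordinate-vector cell keys are simplifications of ours, not the actual hashmap index.

// Sketch of an epsilon-neighborhood query on the regular grid: enumerate all
// cells at Chebyshev distance <= 1 (3^d candidates), then filter the contained
// points with the actual distance function.
#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

using Point = std::vector<double>;
using CellCoord = std::vector<std::int64_t>;
using CellMap = std::map<CellCoord, std::vector<Point>>;  // cell coordinates -> points in cell

double dist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) s += (a[d] - b[d]) * (a[d] - b[d]);
    return std::sqrt(s);
}

std::vector<Point> epsilonNeighborhood(const Point& p, const CellCoord& cell,
                                       const CellMap& grid, double eps) {
    std::vector<Point> result;
    CellCoord neighbor(cell.size());
    std::size_t combos = 1;
    for (std::size_t d = 0; d < cell.size(); ++d) combos *= 3;  // 3^d neighbor cells
    for (std::size_t i = 0; i < combos; ++i) {
        std::size_t rest = i;
        for (std::size_t d = 0; d < cell.size(); ++d) {          // decode offset in {-1, 0, 1}^d
            neighbor[d] = cell[d] + static_cast<std::int64_t>(rest % 3) - 1;
            rest /= 3;
        }
        auto it = grid.find(neighbor);
        if (it == grid.end()) continue;
        for (const Point& q : it->second)
            if (dist(p, q) <= eps) result.push_back(q);          // filter of Definition 1
    }
    return result;
}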
In order to be able to answer all range queries within its
assigned subspace, a parallel processor needs an additional one-cell-thick layer of redundant data items surrounding the grid splits, which allows it to compute the cell neighborhood even at the edges of said splits. In parallel codes this
is commonly referred to as halos or ghost cells. An efficient
way of providing these halo cells is to transfer them along
with the actual data during the data redistribution phase.
This way the parallel processor will also index them along
with the other data. Halo cells do not change the actual
split boundaries in which a parallel processor operates and
can be removed after the local clustering.
4.2 Cost heuristic
In the previous section we introduced the notion of sub-
dividing the data space in order to efficiently parallelize
HPDBSCAN and its spatial indexing, especially in distributed computing environments. However, we have not in-
troduced a way to determine the boundaries of these splits.
One of the most naïve approaches is to subdivide the space into equally-sized chunks in all dimensions, so that the resulting number of chunks equals the number of available cores.
While the latter part of the assumption is sensible as it min-
imizes the communication overhead, the former is not. Con-
sider a spatially skewed dataset as shown in Figure 4. The
sketched dotted boundary, chunking the data space equally
for two parallel processors, results in a highly unbalanced
computational load, where one core needs to cluster almost
all the data points and the other idles most of the time. Due
to the fact that computing the dist function, while filtering
the cell neighborhood, is for many distance functions the
most processing intensive part of DBSCAN, this distribu-
tion pattern is particularly undesirable. It should also be
clear that this is not only an issue of the presented example,
but other spatially skewed datasets and larger processing
core counts as well.
Therefore, we employ a cost heuristic to determine a more
balanced data space subdivision. For this, we exploit the
computation complexity properties of the cell neighborhood
Figure 4: Impact of naive and heuristic-based hypergrid decompositions on compute load balancing; halo cells are marked with a hatched pattern. In the naive split the costs are Cost(processor 1) = 368 and Cost(processor 2) = 8, while the heuristic-based split yields Cost(processor 1) = 190 and Cost(processor 2) = 186.
query. For each data item we have to perform n computations of the distance function dist, where n is the number of data items in the cell neighborhood. Since we have to do that for all m data items within a cell, the total number of comparisons for one cell is n · m. The sum of all
comparisons, i.e. the cost scores, for all cells gives us the
total “difficulty” of the whole clustering problem, at least in
terms of the dist function evaluations. Then, we can assign
to each parallel processor a consecutive chunk of cells, the
cost of which is about a p-th part of the total score with p
being the number of available parallel processing cores. The
formal definitions are as follows.
Definition 10. Cell cost—The cell cost CostCell(c) of a cell c is the product of the number of items in it and the number of data points in the cell neighborhood: CostCell(c) = |c| · |NCell(c)|.
Definition 11. Total cost—The total cost CostTotal is the sum of the costs of all individual cells: CostTotal = Σ_{c ∈ Cells} CostCell(c).
Since the data items are already pre-sorted due to the spa-
tial preprocessing step, the hypergrid subdivision can be per-
formed by iteratively accumulating cell cost scores until the
per-core threshold is reached or exceeded. Moreover, the cell itself can be subdivided to gain more fine-grained control. For this, the cost score of the cell is not added entirely but in n steps, one for each data item in the cell, where n is the number of items in the cell neighborhood. Figure 4 shows an
example of a dataset, its overlaid hypergrid, the annotated cell cost values and the resulting subdivision. These subdivisions can easily be computed in parallel by computing the
cell score locally, reducing them to a global histogram and fi-
nally determining the boundaries according to the explained
accumulative algorithm.
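The following sketch illustrates the accumulative split computation under the assumption that the per-cell point and neighborhood counts have already been gathered in spatial order; the CellStat structure and all names are illustrative.

// Sketch of the cost-based split heuristic: score every cell by the number of
// distance computations it will trigger (Definitions 10 and 11), then walk the
// spatially ordered cells and cut whenever a processor's share is reached.
#include <cstddef>
#include <cstdint>
#include <vector>

struct CellStat {
    std::uint64_t id;          // cell identifier, in spatial (sorted) order
    std::size_t points;        // |c|, points inside the cell
    std::size_t neighborhood;  // data points in the 3^d cell neighborhood, |N_Cell(c)|
};

std::vector<std::size_t> computeSplits(const std::vector<CellStat>& cells, std::size_t processors) {
    std::size_t total = 0;
    for (const CellStat& c : cells) total += c.points * c.neighborhood;  // Cost_Total

    std::vector<std::size_t> splits;  // index of the first cell owned by processors 1..p-1
    std::size_t accumulated = 0, assigned = 1;
    std::size_t threshold = total / processors;
    for (std::size_t i = 0; i < cells.size() && assigned < processors; ++i) {
        accumulated += cells[i].points * cells[i].neighborhood;
        if (accumulated >= assigned * threshold) {  // this processor's share is full
            splits.push_back(i + 1);
            ++assigned;
        }
    }
    return splits;
}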
4.3 Local DBSCAN
Having redistributed the chunks among the compute nodes
in a balanced fashion, the local DBSCAN execution follows.
To break the need for sequential computation, implied by the
recursive cluster expansion, this stage is converted to a par-
allelizable version with a single loop iterating over all data
points. This enables further fine-grained parallelization of the algorithm using shared-memory parallelization
approaches such as threads for example. The performance-
focused algorithm redesign is twofold at this stage. Besides
the parallelization of the iterations, the amount of compu-
tation per iteration is also minimized. Due to the cell-wise
sorting and indexing of data points within the local data
chunk, all points occupying one cell are stored consecutively
in memory. This ensures that each cell-neighborhood must
be computed at most once per thread, as each of them can
be cached until all queries from within the same cell are vis-
ited. Listing 2 presents the pseudocode of the converted,
iterative local DBSCAN.
def localDBSCAN(X, eps, minPts):
    rules = Rules()
    @parallel
    for p in X:
        Cp, Np = query(p, X, eps)
        if length(Np) >= minPts:
            markAsCore(p)
            add(Cp, p)
            for q in Np:
                Cq = getCluster(q)
                if isCore(q):
                    markAsSame(rules, Cp, Cq)
                add(Cp, q)
        elif notVisited(p):
            markAsNoise(p)
    return rules
Listing 2: Local DBSCAN pseudocode
For each of the points the epsilon neighborhood query is
performed independently, i.e., not as a recursive expansion. When a query for a point p returns at least minPoints data points, of which none is yet labeled, p is marked as a core of a cluster. The newly created cluster is then labeled using p's data point index, which is globally unique for the entire sorted dataset. If the epsilon neighborhood, numbering at least minPoints, contains a point q that is already assigned to a cluster, the point p is added to that cluster and inherits the cluster label from q. In case of multiple cluster labels present in the neighborhood, the core p inherits any
one of the cluster labels and notes information indicating
that each of the encountered subclusters actually are one,
as they are inherently density connected. That information
is vital to formulate merger rules for the subsequent merging
of local cluster fragments and unification of cluster labels in
the global scope (see section 4.4).
In all of the above cases, the remainder of non-core points
in the epsilon neighborhood, which may also include halo
area points, is added to the cluster of p. If p has less than minPoints data points in its neighborhood, it is marked as
visited and labeled as noise. The below proof shows that
replacing the iterative cluster relabeling is equivalent to the
original recursive expansion.
Theorem 1. Given points p ∈ Cp and q ∈ Cq: (Core(p) ∨ Core(q)) ∧ DDR(p, q) ⟹ ∃C : Cp ∪ Cq ⊆ C ∧ p, q ∈ C

Proof. If neither p nor q is core, or they are mutually not DDR, the assumption is false and the implication trivially true. If p, q or both are cores, and they are DDR, then by definition they are also DR and therefore DC, with the linking point r being either p or q. Given the density connection DC between p and q, they belong to the same cluster C. By extension, any point belonging to Cp or Cq also belongs to C.
The result of local DBSCAN is a list of subclusters along
with the points and cores they contain, a list of noise points,
and a set of rules describing which cluster labels are equiva-
lent. This information is necessary and sufficient for the next
step of merging the overlapping clusters with contradictory
labels within the nodes’ halos.
4.4 Rule-based cluster merging
The relabeling rules created by the individual nodes alone are insufficient for merging cluster fragments from separate dataset
chunks. The label-mapping rules across different nodes are
created based on the labels of halo points. Upon the com-
pletion of the local DBSCAN, each halo zone is passed to
the node that owns the actual neighboring data chunk. Sub-
sequently, the comparison of local and halo point labels fol-
lows, resulting analogously in a set of relabeling rules for
neighboring chunks, which may create transitive cluster la-
bel chains. These rules are then serialized and broadcast to
all other nodes. Only then is the minimization of all local
and inter-chunk label-mapping rules possible, and all tran-
sitive labels can be removed. Thus each compute node is
equipped with a list of direct mappings from each existing
subcluster label to a single global cluster label.
Each compute node then proceeds to relabel the owned
clusters using the merger rules. At that stage each data
point, now having assigned a cluster label, is sent back to
the compute node that originally loaded it from the dataset.
Recreation of the order of all data points is enabled by the
initial ordering information created during the data redis-
tribution phase. The distributed HPDBSCAN execution is
complete and the result is a list of cluster ids or noise mark-
ers per data item.
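A minimal sketch of this rule minimization, assuming the rules are held as a simple hashmap from a subcluster label to an equivalent smaller label (the actual HPDBSCAN rule container may differ), could look as follows:

// Sketch of the merger-rule minimization: relabeling rules map a subcluster
// label to an equivalent label; transitive chains such as 42 -> 17 -> 5 are
// flattened so every key maps directly to its final global label.
#include <cstdint>
#include <unordered_map>

using Label = std::int64_t;
using Rules = std::unordered_map<Label, Label>;

// Follow the rule chain for one label until it terminates, compressing the path.
Label resolve(Rules& rules, Label label) {
    Label root = label;
    while (rules.count(root) && rules[root] != root) root = rules[root];
    while (rules.count(label) && rules[label] != root) {  // path compression
        Label next = rules[label];
        rules[label] = root;
        label = next;
    }
    return root;
}

// Turn the broadcast rule set into direct old-label -> global-label mappings.
void minimize(Rules& rules) {
    for (auto& entry : rules) entry.second = resolve(rules, entry.first);
}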
5. IMPLEMENTATION
In this section we present our prototypical realization of
HPDBSCAN and specifics of distinct technical details. The
C++ source code can be obtained freely from our source
code repository [23]. It depends on the parallel programming
APIs Open Multiprocessing (OpenMP) [8] in version 4.0+
and Message Passing Interface (MPI) [15] in version 1.1+.
Additionally, the command-line version requires the I/O li-
brary Hierarchical Data Format 5 (HDF5) [18] in order to
pass the data and store computational results.
5.1 Data distribution and gathering
As explained in section 4.1, the data items of the datasets
are redistributed in the preprocessing step, in order to achieve
data locality. Implementing this behavior in shared-memory
architectures is trivially not required, due to the fact that
all processors can access the same memory. For distributed
environments, however, this step is needed and can be quite
challenging to realize—especially in a scalable fashion.
Since HPDBSCAN sorts the data points during the in-
dexing phase and lays them out consecutively in memory,
we are able to exploit collective communication operations
of the MPI. We first send the local histograms of data points
from each compute node to the one that owns the respec-
tive bounds during the local DBSCAN execution. This can
be implemented either by an MPI_Reduce or, alternatively,
by an MPI_Alltoall and a subsequent summation of the
array. After that, each of the compute nodes allocates lo-
cal memory, and the actual data points are exchanged using
an MPI_Alltoallv call. Using the received histograms, the
compute nodes are also able to memorize the initial ordering
of the data points, in a flat array, for example.
Vice versa, the gather step can be implemented analo-
gously. Instead of sorting the local data items by their as-
signed grid cell, they are now re-organized by their initial
global position in the dataset. After that, they can be ex-
changed again, using the MPI collectives, and stored. Note
that in this step the computed cluster labels are transferred
along with the data points in order to avoid additional communication rounds.
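Under the assumptions above, i.e., point coordinates flattened into a contiguous buffer of doubles that is already sorted by owning rank, the exchange can be sketched with the two MPI collectives as follows; the function and variable names are illustrative:

// Sketch of the redistribution step: exchange per-rank element counts first
// (MPI_Alltoall), then the actual coordinates in one call (MPI_Alltoallv).
#include <mpi.h>
#include <numeric>
#include <vector>

std::vector<double> redistribute(const std::vector<double>& sendBuffer,
                                 const std::vector<int>& sendCounts,  // doubles per target rank
                                 MPI_Comm comm) {
    int size = 0;
    MPI_Comm_size(comm, &size);

    // 1) let every rank know how much data it will receive from each peer
    std::vector<int> recvCounts(size, 0);
    MPI_Alltoall(sendCounts.data(), 1, MPI_INT, recvCounts.data(), 1, MPI_INT, comm);

    // 2) compute send/receive displacements (exclusive prefix sums)
    std::vector<int> sendDispls(size, 0), recvDispls(size, 0);
    std::partial_sum(sendCounts.begin(), sendCounts.end() - 1, sendDispls.begin() + 1);
    std::partial_sum(recvCounts.begin(), recvCounts.end() - 1, recvDispls.begin() + 1);

    // 3) exchange the point coordinates in one collective call
    std::vector<double> recvBuffer(recvDispls.back() + recvCounts.back());
    MPI_Alltoallv(sendBuffer.data(), sendCounts.data(), sendDispls.data(), MPI_DOUBLE,
                  recvBuffer.data(), recvCounts.data(), recvDispls.data(), MPI_DOUBLE, comm);
    return recvBuffer;
}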
5.2 Lock-free cluster labeling
To ensure that the cluster labels are unique within a chunk
as well as globally, each cluster label c is determined by the lowest index of a core point inside a cluster: c = min{index(p) | p ∈ C ∧ Core(p)}. The index(p) function returns the position of a data point p within the globally sorted dataset, redistributed to the compute nodes. In addition to
ensuring global uniqueness, this mechanism also maximizes
the size of consistently labeled cluster fragments within the
same compute node, as each consecutive iteration over the
points increments the current point’s index. Whenever a
core is found in the epsilon neighborhood, the current point
inherits its cluster label, even if it is a core itself.
A data race may occur, when the current epsilon neighbor-
hoods of multiple parallel threads overlap. In that case each
thread may attempt to assign a label to a point within their
neighborhood intersection. The naïve approach of locking
the data structures storing the cluster label and core infor-
mation is not scalable.
The better alternative of using atomic operations, here
atomic min, requires encoding the values to operate on, with
a single native data type. For this, we use signed long inte-
ger type values, and compress all flags and labels described
by DBSCAN ’s original definition, i.e., “visited”, “core” and
“noise” flags, and a “cluster label”, to that data type. As the
iterations are performed for each data point exactly once,
the “visited” flag, is made redundant and abandoned. The
cluster label value is stored using the absolute value of the
lowest core point index it contains. The sign bit is used
to encode the “core” flag, such that each core of cluster c
is marked by value -c, and each non-core point—by value
c. As cluster labels are created using point indexes, their
value never exceeds |X|. The noise label can then be en-
coded using any value from outside the range [−|X|,|X|].
For this, we have selected the upper bound of the value
range—the maximal positive signed long integer. As long as range(signed long int) ≥ |X| + 1, signed long integers
are sufficient to encode the cluster labels as well as the core
and noise flags. In that way, minimizing the cluster label
is possible via simple atomic min implementation to set the
cluster label and core flag at once. Some processor architec-
tures, e.g., Intel x86, do not provide an atomic min instruc-
tion. Instead, a spinlock realization using basic atomic read
and compare-and-swap instruction, as shown in Listing 3, is
used.
def atomicMin(address, val):
    prev = atomicRead(address)
    while prev > val:
        swapped = CAS(address, prev, val)
        if swapped: break
        prev = atomicRead(address)
Listing 3: Spinlock atomic min
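A self-contained C++ sketch of the label encoding and of the CAS-based atomic minimum from Listing 3 is shown below; the std::atomic-based realization and the helper names are illustrative, not the exact HPDBSCAN implementation.

// Sketch of the label encoding and lock-free update described above: the sign
// bit carries the "core" flag, the magnitude is the cluster label, and the
// largest positive value marks noise. The atomic minimum is emulated with a
// compare-and-swap loop.
#include <atomic>
#include <cstdint>
#include <limits>

constexpr std::int64_t NOISE = std::numeric_limits<std::int64_t>::max();

inline std::int64_t encode(std::int64_t clusterLabel, bool core) {
    return core ? -clusterLabel : clusterLabel;  // sign bit encodes the core flag
}

inline std::int64_t clusterOf(std::int64_t encoded) {
    return encoded == NOISE ? NOISE : (encoded < 0 ? -encoded : encoded);
}

// Lower the stored label to `value` if it is smaller; equivalent to an atomic min.
inline void atomicMin(std::atomic<std::int64_t>& label, std::int64_t value) {
    std::int64_t previous = label.load();
    while (previous > value && !label.compare_exchange_weak(previous, value)) {
        // on failure compare_exchange_weak reloads `previous`, so simply retry
    }
}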
5.3 Parallelization of the local DBSCAN loop
The iterative conversion of DBSCAN allows us to divide
the computation of the loop iterations among all threads of
a compute node. Because the density of data points within
a chunk can be highly skewed, a naive chunking approach is
suboptimal (see Section 4.2), and can lead to a highly unbal-
anced work load. To mitigate this, a work stealing approach
is advisable. Our HPDBSCAN implementation realizes threading using OpenMP's parallel for pragma. The closest representa-
tive of work stealing in OpenMP is the schedule(dynamic)
clause, added to the parallel for pragma. Optimal perfor-
mance is achieved, when the dynamically pulled workload is
small enough—so that the workload imbalances are split and
fairly divided, and at the same time large enough—so that
not too many atomic min operations (whether supported
by hardware or not) are performed simultaneously on the
same memory location. This number is highly dependent
on environment details, such as the clustered problem and
the execution hardware. Through empirical tests, however,
we determined a reasonable dynamic chunk size of 40.
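A minimal sketch of the resulting loop parallelization, with the empirically chosen chunk size of 40 and an illustrative processPoint callable standing in for the neighborhood query and labeling work, could look as follows:

// Sketch of the local clustering loop with OpenMP dynamic scheduling:
// threads pull chunks of 40 iterations at a time, approximating work stealing.
#include <cstddef>
#include <functional>

void localDBSCAN(std::size_t numberOfPoints,
                 const std::function<void(std::size_t)>& processPoint) {
    #pragma omp parallel for schedule(dynamic, 40)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(numberOfPoints); ++i) {
        processPoint(static_cast<std::size_t>(i));  // per-point DBSCAN work
    }
}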
6. EXPERIMENTAL EVALUATION
In this section we will describe the methodology and find-
ings of the experiments conducted to evaluate the parallel
DBSCAN approach described above. The main focus of
the investigation is the performance evaluation of the imple-
mentation with respect to computation time, memory con-
sumption and the parallel programming metrics: speed- and
scale-up [12].
6.1 Hardware setup
To verify the computation time and speed up of our im-
plementation, we have performed tests on the Juelich Dedi-
cated Graphic Environment (JUDGE) [10]. It consists of 206
IBM System x iDataPlex dx360 M3 compute nodes, where
each node has 12 compute cores combined through two Intel
Xeon X5650 (Westmere) hex-core processors clocked at 2.66
GHz. A compute node has 96 GB of DDR-2 main memory.
JUDGE is connected to a parallel, network-attached GPFS-
storage system, called Juelich Storage Cluster (JUST) [11].
Even though the system has a total core count of 2,472,
we were only able to acquire a maximum of 64 nodes (768
cores) for our benchmark, as JUDGE is used as a production
cluster for other scientific applications. Our hardware allo-
cation, though, was solely dedicated for us, which ensured
that no other computations interfered with our tests. The
plugged-in Westmere processors allow the use of 24 virtual processors when hyperthreading is enabled. For the test runs, however, we disabled this feature as it can falsify or destabilize measurement correctness, as Leng et al. [22] have shown. In a multithreading scenario we therefore use a maximum of 12 threads per node.
6.2 Software setup
The operating system running on JUDGE is a SUSE Linux
SLES 11 with the kernel version 2.6.32.59-0.7. All appli-
cations in the test have been compiled with gcc 4.9.2 us-
ing the optimization level O3. The MPI distribution on
JUDGE is MPICH2 in version 1.2.1p1. For the compila-
tion of HPDBSCAN, a working HDF5 development library
including headers and C++ bindings is required. For our
benchmarks we used the HDF Group's reference implementation, version 1.8.14, pre-installed on JUDGE.

Dataset             Points      Dims.  Size (MB)  ε     minPts
Tweets [t]          16,602,137  2      253.34     0.01  40
Twitter small [ts]  3,704,351   2      56.52      0.01  40
Bremen [b]          81,398,810  3      1863.68    100   10000
Bremen small [bs]   2,543,712   3      48.51      100   312

Table 1: HPDBSCAN benchmark datasets properties

Later in
this section we present a comparison of HPDBSCAN with
PDSDBSCAN-D created by Patwary et al. [25]. The lat-
ter needs the parallel netCDF I/O library. We have ob-
tained and compiled pnetCDF from the project’s web page
at Northwestern University with version 1.5.0 [24].
6.3 Datasets
Despite DBSCAN's popularity, its parallelization attempts were mainly evaluated using synthetic datasets. To their advantage, they can provide an arbitrarily large number of data points and dimensions. The downside, however, is that they are not representative of actual real-world applications. They might have inherent regular patterns from, e.g.,
pseudo random number generators that will silently bias the
implementation’s performance. For this reason, we decided
to resort to actual real-world data and its potential skew.
An overview of the chosen examples is depicted in Table 1.
We acknowledge that an evaluation of higher-dimensional
datasets is of great interest for some clustering applications, such as, for instance, genomics in bio-informatics, but suitable datasets could not be obtained at the time of writing.
6.3.1 Geo-tagged collection of tweets
This set was collected and made available to us by Junjun
Yin from the National Center for Supercomputing Applications (NCSA). The dataset was obtained using the free Twit-
ter streaming API and contains exactly one percent of all
geo-tagged tweets from the United Kingdom in June 2014.
It was initially created to investigate the possibility of min-
ing people’s trajectories and to identify hotspots and points
of interest (clusters of people) through monitoring tweet den-
sity. The full collection spans roughly 16.6 million tweets. A
smaller subset of this was generated by filtering the entire set
for the first week of June only. Both datasets are available at
the scientific storage and sharing platform B2SHARE [14].
6.3.2 Point cloud of Bremen’s old town
This data was collected and made available by Dorit Borrmann and Andreas Nüchter from the Institute of Computer
Science at the Jacobs University Bremen, Germany. It is a
3D point cloud of the old town of Bremen. A point cloud is a set of points in a common coordinate system that often models the surface of objects. This particular point
cloud of Bremen was recorded using a laser scanner sys-
tem mounted onto an autonomous robotic vehicle. The vehicle stopped at 11 different locations, each time performing a full 360° scan of the surrounding area. Given the GPS tri-
angulated position and perspective of the camera, the sub-point clouds were combined into one monolith. The raw data is available from Borrmann and Nüchter's webpage [20]. An
already combined version in HDF5 format, created by us,
can be obtained from B2SHARE [14]. DBSCAN can be
applied here in order to clean the dataset from noise or out-
liers, such as faulty scans or unwanted reflections of moving
objects. Moreover, DBSCAN can also be used to find dis-
tinct objects, represented as clusters, in the point cloud like
houses, roads or people. The whole point cloud contains
roughly 81.3 million data points. A smaller variant was gen-
erated by randomly sampling 1/32 of the points; it is also available on B2SHARE [14].
6.4 Speed up evaluation of HPDBSCAN
We benchmark our HPDBSCAN application’s speed up
using both the full Twitter (t) and the full Bremen (b) datasets. Our principal methodological approach is thereby
as follows. Each benchmark is run five times, measuring
the application’s walltime at the beginning and end of the
main() function of the process with the MPI rank 0 and the
OpenMP thread number 0. After these five runs we double
the number of nodes and cores, starting from one node and
12 cores, up to the maximum of 768 cores. In addition to
that we have run a base measurement with exactly one core
on one node. For each “five-pack” benchmark run we report
the minimum, maximum, mean µ, standard deviation σ, and coefficient of variation (CV), defined as ν = σ/µ [1]. The
speed up coefficient is calculated in comparison to the sin-
gle core run, based on the mean values of the measurements
for each processor count configuration. Both datasets are
processed using the OpenMP/MPI hybrid features of our
application. That means that we spawn an MPI process for
each node available and parallelize locally on the nodes using
OpenMP. For the Bremen point cloud we have additionally
parallelized the computation with MPI alone, i.e., we use
one MPI process per core, enabling direct comparison of the
hybrid and fully distributed versions.
Nodes                 1         1        2        ...  32       64
Cores                 1         12       24       ...  384      768

OpenMP+MPI hybrid b, time (s)
  Mean µ              79372.29  8037.71  4271.64  ...  327.07   172.53
  StDev σ             17.6011   71.2829  16.2092  ...  2.5971   1.3801
  CV ν                0.00022   0.00886  0.00379  ...  0.0079   0.0079
  Min                 79342.08  7937.48  4253.45  ...  322.71   170.77
  Max                 79385.57  8129.85  4293.86  ...  329.65   174.47
  Speed-Up            1         9.9      18.6     ...  242.7    460.0

MPI b, time (s)
  Mean µ              79372.29  8028.67  4403.96  ...  515.21   354.99
  StDev σ             17.6012   9.5769   7.1526   ...  94.7806  42.0006
  CV ν                0.00022   0.00119  0.00162  ...  0.18396  0.11832
  Min                 79342.08  8019.10  4395.78  ...  471.10   302.27
  Max                 79385.57  8040.83  4415.45  ...  684.74   420.01
  Speed-Up            1         9.9      18.0     ...  154.1    232.7

OpenMP+MPI hybrid t, time (s)
  Mean µ              2079.26   212.77   115.66   ...  10.04    7.88
  StDev σ             1.06455   0.56826  0.35893  ...  0.42128  1.03302
  CV ν                0.00051   0.00267  0.00310  ...  0.04194  0.13106
  Min                 2078.16   212.05   115.34   ...  9.76     7.14
  Max                 2080.47   213.43   116.17   ...  10.78    9.70
  Speed-Up            1         9.8      18.0     ...  207.0    263.8

Table 2: Measured and calculated values of the HPDBSCAN speed-up evaluation
The results in Table 2 and Figure 5 show that we are
able to gain substantial speed up for both data sets. It
peaks for Bremen at 460.0 using 768 cores, and in the Twit-
ter analysis case at slightly more than half of this value at
263.8. For the MPI-only clustering of the Bremen dataset
the speed up value falls short of the hybrid implementation,
being only roughly half of it with 232.7 using 768 cores.
There are two noteworthy facts that can be observed in the
measurement data. The first and obvious one is that the hy-
brid implementation outperforms the fully distributed MPI
runs by a factor of two. The access to a shared cell in-
dex and the reduced number of nodes to communicate it to,
significantly reduces communication overhead and enables
faster processing time. Secondly, one can observe a steady
decrease in the efficiency of additional cores used for the
clustering. This seems to be especially true for the tweet collection compared to the Bremen dataset. This observa-
tion can be explained best through Amdahl’s law [2]. In
the benchmark we use a constant problem size, disallowing
infinite speed up performance gains. Instead, we approach
the asymptote of the single threaded program parts. Due
to the fact that the tweet collection is smaller in size, we ap-
proach this boundary earlier than with the Bremen data for
instance. Additional network communication overhead with
larger processor counts, atomicMin() clashes as well as load
imbalances are good examples of simultaneously growing se-
rial program parts. Moreover, the growing CV value is a
good indicator for the increasing influence of external fac-
tors onto the measurements, like varying operating system
scheduling.
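For reference, the standard form of Amdahl's law referred to here bounds the achievable speed up as follows, where f denotes the parallelizable fraction of the runtime and p the processor count (the symbols are ours, not the paper's):

S(p) = \frac{1}{(1 - f) + \frac{f}{p}}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{1 - f}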
Figure 5: Speed up curves of the HPDBSCAN application analyzing the Bremen and Twitter datasets (speed up versus number of cores for Hybrid b, Hybrid t and MPI b, compared with linear speed up)
6.5 Scale up evaluation of HPDBSCAN
In this section we investigate HPDBSCAN ’s scalability
properties. Our principal measuring methodology remains
unchanged. Instead of the speed up coefficient, we report
the efficiency value e_p = t_1 / t_p for each benchmark run, which is the ratio of the execution time with a single core to the execution time with p processing cores. Perfect scalabil-
ity is achieved, when the efficiency equals one or is almost
one. Yet, it requires doubling the dataset size, whenever we
double the processor count.
As a base for this we use the small Bremen dataset bs.
In particular, for each run, we copy the entire dataset p times, where p is equal to the number of used processors. Then,
each copy is shifted along the first axis of the dataset by the
copy’s index, times the axis range, and concatenated with
the others. This way, we get multiple (p) Bremen old towns
next to one another. We chose this approach to get a better
grasp of the overhead of our implementation by presenting
the same problem to each available MPI process. In contrast
to that, a random sampling of the whole Bremen dataset,
for instance, would have altered the problem difficulty.
The results of our test can be seen in Table 3 and Fig-
ure 6.

Cores                1        2        4        8        16       32
OpenMP, time (s)
  Mean µ             99.5044  102.644  107.462  121.35   -        -
  StDev σ            0.0991   0.0439   0.1515   5.6788   -        -
  CV ν               0.00099  0.00042  0.00141  0.04679  -        -
  Min                99.41    102.58   107.2    117.81   -        -
  Max                99.66    102.69   107.59   131.45   -        -
MPI, time (s)
  Mean µ             99.50    101.86   103.74   105.48   107.12   109.19
  StDev σ            0.0991   0.0884   0.1131   0.1881   0.5399   0.6653
  CV ν               0.00099  0.00086  0.00109  0.00178  0.00504  0.00609
  Min                99.41    101.79   103.65   105.3    106.56   108.52
  Max                99.66    102      103.91   105.76   107.77   110.15

Cores                12       24       48       96       192      384
OpenMP/MPI hybrid, time (s)
  Mean µ             10.46    13.3     14.02    14.786   16.31    19.76
  StDev σ            0.0974   0.1425   0.2420   0.2548   0.58307  3.15081
  CV ν               0.00931  0.01071  0.24207  0.25481  0.03578  0.15943
  Min                10.38    13.12    13.76    14.57    15.67    17.909
  Max                10.63    13.45    14.28    15.16    17.12    25.26

Table 3: Measured and calculated values of the HPDBSCAN scale-up evaluation of the Bremen data

Figure 6: Scale up curves (efficiency versus number of cores, or nodes for the hybrid runs) of HPDBSCAN analyzing the small Bremen dataset with OpenMP bs, MPI bs and Hybrid bs

In all of the three scenarios a near constant efficiency
value can be achieved, indicating good scalability. While the
MPI-only and OpenMP/MPI hybrid benchmark runs only
have a slightly increasing execution time curve, we can ob-
serve a clear peak for the OpenMP benchmark with four and
more cores. Through a separate test, we can attribute this
increase to higher contention in the spinlock of our atom-
icMin() implementation, introduced in Section 5.
6.6 Comparison of HPDBSCAN and PDSDBSCAN
As discussed in related work there are a number of other
parallel versions of DBSCAN. Most of them report varying combinations of values for the computation time, memory consumption, speed up and scalability of their implementations.
Almost all of them provide neither their benchmark datasets nor the source code of their implementations. This, in turn, prevents us from verifying their results or performing a direct comparison of our approaches. To the best of our knowledge, the only exception is Patwary et al. [25]. Their
datasets can also not be recreated, but they made the C++ implementation of their parallel disjoint-set data structure DBSCAN, PDSDBSCAN, open source. This allows us to
compare our approach with theirs at least using our bench-
mark datasets. Patwary et al. offer two versions of PDSDB-
SCAN. One targets shared-memory architectures only and
is based on OpenMP. The other can also be used in dis-
tributed environments and is implemented using MPI. In
order to distinguish between these two, the suffixes -S or -D
are added respectively.
In order to compare both parallel DBSCAN approaches,
we have performed another speed up benchmark according
to the introduced methodology on the small Twitter dataset.
Due to their technical similarities our two “contestants” are
HPDBSCAN using MPI processes only and PDSDBSCAN-
D. Thereby, we scale the process count from one to a max-
imum of 32, each being executed on a separate compute
node. Even though we have executed five runs for each level
of used processors, we report here only the mean value for
execution time, memory consumption and speed up, because
of space considerations.
Nodes            1       2       4        8        16        32
HPDBSCAN MPI
  time (s)       114.39  58.99   30.14    15.71    8.37      6.07
  Speed-Up       1.00    1.94    3.80     7.28     13.67     18.85
  Memory (MB)    251064  345276  433340   678248   1101000   2111000
PDSDBSCAN-D
  time (s)       288.35  162.47  105.94   89.87    85.37     88.42
  Speed-Up       1       1.77    2.72     3.21     3.38      3.36
  Memory (MB)    500512  725104  1370000  4954000  19724000  59685000

Table 4: Comparison of HPDBSCAN and PDSDBSCAN-D using the Twitter dataset
Table 4 and Figure 7 present the obtained results. HPDB-
SCAN shows a constant, near linear speed-up curve, whereas
PDSDBSCAN-D starts similarly, but soon flattens, stabiliz-
ing at a speed-up of around 3.5. The curve for the memory
consumption is inverse. HPDBSCAN shows a linear increase
again, seemingly being dependent on the number of used
processing cores, which can be explained by the larger num-
ber of replicated points in the halo areas. PDSDBSCAN-D,
however, presents an exponential memory consumption. An
investigation of the source code reveals that each MPI pro-
cess always loads the entire datafile into main memory, effec-
tively limiting its capabilities to scale with larger datasets.
This is also the reason why we have used the small Twit-
ter dataset ts for this experiment, as larger datasets have
caused out-of-memory exceptions. As a consequence, we
have not been able to reproduce the performance capabili-
ties of PDSDBSCAN-D.
Figure 7: Speed up and memory usage of HPDBSCAN compared to PDSDBSCAN-D on the small Twitter dataset ts (speed up and memory use in MB versus number of nodes)
7. CONCLUSION
In this paper, we have presented HPDBSCAN —a scal-
able version of the density-based clustering algorithm DB-
SCAN. We have overcome the algorithm’s inherent sequen-
tial control flow dependencies through a divide-and-conquer
approach, using techniques from cell-based clustering algo-
rithms. Specifically, we employ a regular hypergrid as the
spatial index in order to minimize the neighborhood-search
spaces and to partition the entire cluster analysis into local
subtasks, without requiring further communication. Using
a rule-based merging scheme, we combine the found local
cluster-labels into a global view. In addition to that, we also propose a cost heuristic that balances the computational workload by dividing the previously mentioned cells among the compute nodes according to their computational complexity. We have implemented
HPDBSCAN as an open-source OpenMP/MPI hybrid appli-
cation in C++, which can be deployed in shared-memory as
well as distributed-memory computing environments. Our
experimental evaluation of the application has proven the al-
gorithm’s scalability in terms of memory consumption and
computation time, outperforming PDSDBSCAN, the first
parallel HPC implementation. The presented cell-based spa-
tial index can easily be transferred to other clustering and
neighborhood-search problems with constant search range.
In future work we plan to demonstrate this on the basis of
parallelizing other clustering algorithms, such as OPTICS
and SUBCLU.
8. REFERENCES
[1] H. Abdi. Coefficient of variation. Encyclopedia of
research design, pages 169–171, 2010.
[2] G. M. Amdahl. Validity of the single processor
approach to achieving large scale computing
capabilities. In Proceedings of the spring joint
computer conference 1967, pages 483–485. ACM, 1967.
[3] D. Arlia and M. Coppola. Experiments in parallel
clustering with dbscan. In Euro-Par 2001 Parallel
Processing, pages 326–331. Springer, 2001.
[4] J. L. Bentley and J. H. Friedman. Data structures for
range searching. ACM Computing Surveys (CSUR),
11(4):397–409, 1979.
[5] S. Brecheisen, H.-P. Kriegel, and M. Pfeifle. Parallel
density-based clustering of complex objects. In
Advances in Knowledge Discovery and Data Mining,
pages 179–188. Springer, 2006.
[6] C. Cantrell. Modern mathematical methods for
physicists and engineers. CUP, 2000.
[7] M. Chen, X. Gao, and H. Li. Parallel DBSCAN with
Priority R-Tree. In Information Management and
Engineering (ICIME), 2010 The 2nd IEEE
International Conference, pages 508–511. IEEE, 2010.
[8] L. Dagum and R. Menon. OpenMP: an industry
standard API for shared-memory programming.
Computational Science & Engineering, IEEE,
5(1):46–55, 1998.
[9] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A
density-based algorithm for discovering clusters in
large spatial databases with noise. In Kdd, volume 96,
pages 226–231, 1996.
[10] Forschungszentrum Jülich GmbH. Juelich Dedicated GPU Environment.
http://www.fz-juelich.de/ias/jsc/EN/Expertise/
Supercomputers/JUDGE/JUDGE_node.html.
[11] Forschungszentrum Jülich GmbH. Juelich Storage
Cluster. http://www.fz-juelich.de/ias/jsc/EN/
Expertise/Datamanagement/OnlineStorage/JUST/
JUST_node.html.
[12] I. Foster. Designing and building parallel programs.
Addison Wesley Publishing Company, 1995.
[13] Y. X. Fu, W. Z. Zhao, and H. F. Ma. Research on
parallel dbscan algorithm design based on mapreduce.
Advanced Materials Research, 301:1133–1138, 2011.
[14] Götz, Markus and Bodenstein, Christian. HPDBSCAN
Benchmark test files. http://hdl.handle.net/11304/
6eacaa76-c275-11e4-ac7e-860aa0063d1f.
[15] W. Gropp, E. Lusk, and A. Skjellum. Using MPI:
portable parallel programming with the
message-passing interface, volume 1. MIT press, 1999.
[16] S. Habib, V. Morozov, H. Finkel, A. Pope,
K. Heitmann, K. Kumaran, T. Peterka, J. Insley,
D. Daniel, P. Fasel, et al. The universe at extreme
scale: multi-petaflop sky simulation on the bg/q. In
Proceedings of the International Conference on High
Performance Computing, Networking, Storage and
Analysis, page 4. IEEE Computer Society Press, 2012.
[17] J. Han, M. Kamber, and J. Pei. Data Mining: concepts
and techniques - 3rd ed. Morgan Kaufmann, 2011.
[18] HDF Group. Hierarchical Data Format 5. http://www.hdfgroup.org/HDF5.
[19] Y. He, H. Tan, W. Luo, H. Mao, D. Ma, S. Feng, and J. Fan. MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In 2011 IEEE 17th International Conference on Parallel and Distributed Systems, pages 473–480, 2011.
[20] Jacobs University Bremen. 3D Scan Repository. http://kos.informatik.uni-osnabrueck.de/3Dscans/.
[21] A. Knöpfel, B. Gröne, and P. Tabeling. Fundamental modeling concepts, volume 154. Wiley, UK, 2005.
[22] T. Leng, R. Ali, J. Hsieh, V. Mashayekhi, and
R. Rooholamini. An empirical study of
hyper-threading in high performance computing
clusters. Linux HPC Revolution, 2002.
[23] M. Götz. HPDBSCAN implementation. https://bitbucket.org/markus.goetz/hpdbscan.
[24] Northwestern University. Parallel netCDF. http://cucis.ece.northwestern.edu/projects/PnetCDF/.
[25] M. M. A. Patwary, D. Palsetia, A. Agrawal, et al. A new scalable parallel dbscan algorithm using the disjoint-set data structure. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1–11. IEEE, 2012.
[26] D. Šidlauskas, S. Šaltenis, C. W. Christiansen, J. M. Johansen, and D. Šaulys. Trees or grids?: indexing moving objects in main memory. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 236–245. ACM, 2009.
[27] X. Xu, J. Jäger, and H.-P. Kriegel. A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Mining and Knowledge Discovery, 3:263–290, 1999.
[28] A. Zhou, S. Zhou, J. Cao, Y. Fan, and Y. Hu.
Approaches for scaling dbscan algorithm to large
spatial databases. Journal of computer science and
technology, 15(6):509–526, 2000.