HPDBSCAN – Highly Parallel DBSCAN

Markus Götz (m.goetz@fz-juelich.de), Christian Bodenstein (c.bodenstein@fz-juelich.de), Morris Riedel (m.riedel@fz-juelich.de)
Jülich Supercomputing Center, Leo-Brandt-Straße, 52428 Jülich, Germany
University of Iceland, Sæmundargötu 2, 101 Reykjavik, Iceland
ABSTRACT
Clustering algorithms in the field of data mining are used to aggregate similar objects into common groups. One of the best-known of these algorithms is called DBSCAN. Its distinct design enables the search for an a priori unknown number of arbitrarily shaped clusters while at the same time allowing noise to be filtered out. Due to its sequential formulation, the parallelization of DBSCAN is challenging. In this paper we present a new parallel approach which we call HPDBSCAN. It employs three major techniques in order to break the sequentiality, enable workload balancing, and speed up neighborhood searches in distributed parallel processing environments: i) a computation split heuristic for domain decomposition, ii) a data index preprocessing step, and iii) a rule-based cluster merging scheme.
As a proof of concept we have implemented HPDBSCAN as an OpenMP/MPI hybrid application. Using real-world datasets, such as a point cloud of the old town of Bremen, Germany, we demonstrate that our implementation is able to achieve significant speed-up and scale-up in common HPC setups. Moreover, we compare our approach with previous attempts to parallelize DBSCAN, showing an order of magnitude improvement in terms of computation time and memory consumption.
Categories and Subject Descriptors
A.2.9 [General and reference]: Cross-computing tools and techniques—Performance; F.5.8 [Theory of computation]: Design and analysis of algorithms—Parallel algorithms; H.3.8 [Information systems]: Information systems applications—Data mining; I.2.1 [Computing methodologies]: Parallel computing methodologies—Parallel algorithms
General Terms
Algorithms, Performance
Keywords
High performance computing, scalable clustering, parallel
DBSCAN, HPDBSCAN, OpenMP/MPI hybrid
1. INTRODUCTION
Cluster analysis is a data-mining technique that divides a
set of objects into disjoint subgroups, each containing sim-
ilar items. The resulting partition is called a clustering. A
clustering algorithm discovers these groups in the data by
maximizing a similarity measure within one group of items—
or cluster—and by minimizing it between individual clus-
ters. In contrast to supervised learning approaches, such
as classification or regression, clustering is an unsupervised learning method. This means that it tries to find the aforementioned structures without any a priori knowledge of the actual ground truth. Typical fields of application for cluster
analysis include sequence analysis in bio-informatics, tissue
analysis in neuro-biology, or satellite image segmentation.
Clustering algorithms can be divided into four classes:
partitioning-based, hierarchy-based, density-based and grid-
based [17]. In this paper we will discuss aspects of the latter two classes. Specifically, we are going to talk about
the density-based clustering algorithm DBSCAN —density-
based spatial clustering of applications with noise [9]—and
how to efficiently parallelize it using computationally ef-
ficient techniques of grid-based clustering algorithms. Its
principal idea is to find dense areas, the cluster cores, and
to expand these recursively in order to form clusters. The
algorithm’s formulation has an inherent sequential control
flow dependency at the point of the recursive expansion,
making it challenging to parallelize.
In our approach we break the interdependency by adopt-
ing core ideas of grid-based clustering algorithms. We over-
lay the input data with a regular hypergrid, which we use
to perform the actual DBSCAN clustering. The overlaid
grid has two main advantages. Firstly, we can use the grid
as a spatial index structure to reduce the search space for
neighborhood queries; and secondly, we can separate the en-
tire clustering space along the cell borders and distribute it
among all compute nodes. Due to the fact that the cells
are constructed in a regular fashion, we can redistribute the data points, using halo areas, so that no intermediate communication is required during the parallel com-
putation step.
In spatially skewed datasets regular cell space splits would
lead to an imbalanced computational workload, since most of
the points would reside in dense cell subspaces assigned to a
small number of compute nodes. To mitigate this we propose
a cost heuristic that allows us to identify data-dependent
split points on the fly. Finally, the local computation results
are merged through a rule-based cluster merging scheme with linear complexity.
The remainder of this paper is organized as follows. The
next section surveys related work. Section 3 describes de-
tails of the original DBSCAN algorithm. Subsequently, Sec-
tion 4 discusses our parallelized version of DBSCAN by the
name of HPDBSCAN and shows its algorithmic equivalence.
Section 5 presents details of our hybrid OpenMP/MPI im-
plementation. Evaluations are shown in Section 6, where we
also present our benchmark method, datasets and the layout
of the test environment. We conclude the paper in Section 7
and give an overview of possible future work.
2. RELATED WORK
There are a number of previous research studies dealing
with the parallelization of DBSCAN. To the best of our
knowledge, the first attempt was made by Xiaowei Xu in
collaboration with Kriegel et al. [27]. In their approach, single neighborhood search queries are parallelized using a distributed version of the R-tree—DBSCAN 's initial
spatial index data structure. They adopt a master-slave
model, where the index is built on the master node and
the whole data set is split among the slaves according to
the bounding rectangles of the index. Subsequently, they
merge the local cluster results by reclustering the bordering
regions of the splits. Zhou et al. [28] and Arlia et al. [3]
present similar approaches, where they accelerate the neigh-
borhood queries by replicating the entire index on each of
the slave nodes, assuming the index fits entirely into the
main memory. Brecheisen et al. [5] have published a paral-
lel version of DBSCAN that approximates the cluster using
another clustering algorithm called OPTICS. Each of the
cluster candidates found in this manner is sent to a slave
node in order to filter out the actual from the guessed clus-
ter points. The local results are then merged by the master
into one coherent view. This approach, however, fails to
scale for big databases, since the pre-filtering has to be done
on the master, in main memory. Chen et al. [7] propose an-
other distributed DBSCAN algorithm, called P-DBSCAN
that is based on a priority R-Tree. Unfortunately, the paper
does not state how the data is distributed or how the clus-
ters are formed. An in-depth speed and scale-up evaluation
is also not performed. A paper by Fu et al. [13] demon-
strates the first Map-Reduce implementation of DBSCAN.
The core idea of this approach is the same as in Xu's first parallelization attempt, that is, to parallelize individual neighborhood queries—this time in the form of map tasks. He
et al. [19] present another implementation of a parallel DB-
SCAN based on the Map-Reduce paradigm. They are the
first to introduce the notion of a cell-based preprocessing
step in order to perform a fully distributed clustering without the need to replicate the entire dataset or to communicate in between. Finally, Patwary et al. [25] have published
research work that shows a parallel DBSCAN that scales
up to hundreds of cores. Their main contribution is a quick
merging algorithm based on a disjoint-set data structure.
However, they either need to fit the entire dataset into main
memory or need a manual preprocessing step that splits the
data within a distributed computing environment.
3. THE DBSCAN ALGORITHM
Figure 1: DBSCAN clustering with minPoints = 4, illustrating core, border and noise points, the search radius ε, and the relations DDR, DR and DC
DBSCAN is a density-based clustering algorithm that was published in 1996 by Ester et al. [9]. Its principal idea is to find dense areas and to expand these recursively in order to form clusters. A dense region is thereby formed by a point that has at least minPoints neighboring points within a given search radius ε. This dense area is also called the core of a cluster. For each of the found neighboring points the density criterion is reapplied and the cluster is consequently expanded. All points that do not form a cluster core and that are not "absorbed" through expansion are regarded as noise.
A formal definition of the algorithm is as follows. Let X be the entire dataset of points to be clustered and p, q ∈ X two arbitrary points of this set. Then the following definitions describe DBSCAN with respect to its parameters ε and minPoints. Figure 1 illustrates these notions.
Definition 1. Epsilon neighborhood (Nε)—The epsilon neighborhood Nε of p denotes all points q of the dataset X which have a distance dist(p, q) that is less than or equal to ε, or formally: Nε(p) = {q | dist(p, q) ≤ ε}. In practice, the Euclidean distance is often used for dist, making the epsilon neighborhood of p equal to the geometrically surrounding hypersphere with radius ε.
Definition 2. Core point—p is considered a core point if the epsilon neighborhood of p contains at least minPoints points, including itself: Core(p) = |Nε(p)| ≥ minPoints.
Definition 3. Directly density-reachable (DDR)—A point q is directly density-reachable from a point p if q lies within p's epsilon neighborhood and p is a core point, i.e., DDR(p, q) = q ∈ Nε(p) ∧ Core(p).
Definition 4. Density-reachable (DR)—Two points p0 = p and pn = q are called density-reachable if there exists a chain of directly density-reachable points—{pi | 0 ≤ i < n ∧ DDR(pi, pi+1)}—linking them with one another.
Definition 5. Border point—Border points are special cluster points that are usually located at the outer edges of a cluster. They do not fulfill the core point criterion but are still included in the cluster due to direct density-reachability. Formally, this can be expressed as Border(p) = |Nε(p)| < minPoints ∧ ∃q : DDR(q, p).
Definition 6. Density-connected (DC)—Two points p and q are called density-connected if there is a third point r such that r can density-reach both p and q: DC(p, q) = ∃r ∈ X : DR(r, p) ∧ DR(r, q). Note that density-connectivity is a weaker condition than density-reachability. Two border points can be density-connected even though they are not density-reachable from one another, since they do not fulfill the core point criterion.
Definition 7. Cluster—A cluster is a subset of the whole dataset in which each point is density-connected to all the others and which contains at least one dense region, in other words a core point. This can be denoted as ∅ ⊂ C ⊆ X with ∀p, q ∈ C : DC(p, q) and ∃p ∈ C : Core(p).
Definition 8. Noise—Noise points are special points that are not directly density-reachable from any core point and thus belong to no cluster: Noise(p) = ¬∃q : DDR(q, p).
Listing 1 sketches pseudo code for a classic DBSCAN im-
plementation. Some of the type and function definitions are
left out, as their meaning can easily be inferred.
DBSCAN 's main properties that distinguish it from more traditional clustering algorithms, such as k-means [17], are: i) it can detect arbitrarily shaped clusters that can even protrude into, or surround, one another; ii) the cluster count does not have to be known a priori; and iii) it has a notion of noise inside the data.
Finding actual values for ε and minPoints depends on the clustering problem and its application domain. Ester et al. [9] propose a simple algorithm for estimating ε. The core idea is to determine the "thinnest" cluster area through either visualization or a sorted 4-dist graph, and then choose ε to be equal to that width.
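To make this estimation concrete, the following minimal C++ sketch computes the sorted k-dist values by brute force (k = 4 reproduces the 4-dist graph). The function name sortedKDist and the flat point representation are our own illustrative assumptions rather than part of DBSCAN itself, the quadratic scan is only practical for small samples, and the sample is assumed to contain more than k points.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Distance of every point to its k-th nearest neighbor, sorted in descending
// order. Plotting these values and locating the "knee" suggests a value for eps.
std::vector<double> sortedKDist(const std::vector<std::vector<double>>& points, std::size_t k) {
    std::vector<double> kDist;
    kDist.reserve(points.size());

    for (const auto& p : points) {
        std::vector<double> distances;
        distances.reserve(points.size());
        for (const auto& q : points) {
            if (&p == &q) continue;  // skip the point itself
            double sum = 0.0;
            for (std::size_t d = 0; d < p.size(); ++d) {
                sum += (p[d] - q[d]) * (p[d] - q[d]);
            }
            distances.push_back(std::sqrt(sum));
        }
        // Distance to the k-th nearest neighbor of p.
        std::nth_element(distances.begin(), distances.begin() + (k - 1), distances.end());
        kDist.push_back(distances[k - 1]);
    }
    std::sort(kDist.rbegin(), kDist.rend());  // descending, as in the original 4-dist plot
    return kDist;
}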
def DBSCAN(X, eps, minPoints):
    clusters = list()
    for p in X:
        if visited(p):
            continue
        markAsVisited(p)
        Np = query(p, X, eps)
        if length(Np) < minPoints:
            markAsNoise(p)
        else:
            C = Cluster()
            add(clusters, C)
            expand(p, Np, X, C, eps, minPoints)
    return clusters

def expand(p, Np, X, C, eps, minPoints):
    add(p, C)
    for o in Np:
        if notVisited(o):
            markAsVisited(o)
            No = query(o, X, eps)
            if length(No) >= minPoints:
                Np = join(Np, No)
        if hasNoCluster(o):
            add(o, C)

Listing 1: Classic DBSCAN pseudocode
4. HPDBSCAN
In this section we present Highly Parallel DBSCAN, or in
short HPDBSCAN. Our approach to parallelize DBSCAN
consists of four major stages. In the first step the entire
dataset is loaded in equal-sized chunks by all processors in
parallel. Then, the data is preprocessed. This entails as-
signing each of the d-dimensional points in the dataset to
a virtual, unique spatial cell corresponding to their location
within the data space, with respect to the given distance
function. This allows us to sort the data points according
to their proximity, and to redistribute them to distinct com-
putation units of the parallel computing system. In order
to balance the computational load for each of the process-
ing units, we estimate the load using a simple cost heuristic
accommodating the grid overlay.
After this division phase, we perform the clustering of the
redistributed points in the second step locally on each of the
processing units, i.e., we assign a temporary cluster label to
each of the data points.
Subsequently, these have to be merged into one global
result view in step three. Whenever the temporary label
assigned by a processing unit disagrees with the ones in the
halo areas of the neighboring processors, we generate cluster
relabeling rules.
In the fourth step, the rules are broadcast and applied locally. Figure 2 shows a schematic overview of the process using the Fundamental Modeling Concepts (FMC) notation [21]. The following sections examine each of these substeps in detail.
Figure 2: Schematic overview of HPDBSCAN (inputs: data, ε, minPoints; preprocessing: overlay hypergrid, sort and distribute, estimate splits; clustering: local DBSCAN, merge halos; cluster relabeling; output: cluster labels)
4.1 Grid-based data preprocessing and index
The original DBSCAN paper proposes the use of R-trees [4] in order to reduce the overall neighborhood search complexity from O(n²) to O(n · log(n)). The construction of the basic R-tree cannot be performed in parallel, as it requires the entire dataset to be known. Therefore, other researchers [4, 27] propose to either just replicate the entire dataset and perform linear neighborhood scans in parallel for each data item, or to use distributed versions of R- or k-d-trees. However, He et al. [19] point out that these approaches do not scale in terms of memory consumption or communication cost with respect to large datasets and numbers of parallel processors.
Therefore, we have selected a far more scalable approach for HPDBSCAN that is based on grid-based clustering algorithms like, e.g., STING [17], and on common spatial problems in HPC, such as the HACC particle simulation code [16]. Its core idea is that the d-dimensional bounding box of the entire dataset, with respect to dist, is overlaid with a regular, non-overlapping hypergrid structure, which is then decomposed into subspaces by splitting the grid along the grid cell boundaries. Each of the resulting subspaces is then exclusively assigned to a parallel processor that is responsible for computing the local clustering. In order to be able to do so in a scalable fashion, all data points within a particular subspace should be in the local memory of the respective parallel processor, so that communication overhead is avoided. In most cases, however, the data points will be distributed in arbitrary order within the dataset. Therefore, the data has to be indexed first and then redistributed to the parallel processor responsible for clustering the respective subspace.
In HPDBSCAN the indexing is performed by employing a hashmap with offsets and pointers into the data memory. For this, all parallel processors first read an arbitrary, non-overlapping, equally-sized chunk of the complete dataset. Then each data item of a chunk is uniquely associated with the cell of the overlaid grid that it spatially occupies, and vice versa—every grid cell contains all the data items that its bounding box covers. This in turn enables us to order all of the local data items with respect to their grid cell, so that they are consecutively placed in memory. Finally, an indexing hashmap can be constructed with the grid cells as keys and the tuple of a pointer into the memory and the number of items in this cell as the value. An indexing approach like this has an additional memory overhead of O(log(n)), similar to other approaches like R- or k-d-trees. Figure 3 shows the indexing approach exemplified by the dataset introduced in Section 3, for the data chunk of a processing unit called processor 1.
Figure 3: Sorted data chunks locally indexed by each processor using hashmaps pointing into the memory
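To make the preprocessing step concrete, a minimal C++ sketch of the cell assignment, sorting and hashmap construction could look as follows. The Point structure, the linearization of cell coordinates and the helper name buildCellIndex are illustrative assumptions rather than the actual HPDBSCAN code, and clamping of points on the upper domain boundary is omitted for brevity.

#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

struct Point {
    std::vector<double> coords;
    std::size_t cell = 0;  // linearized grid cell id, filled in below
};

// Maps every point of a chunk to its eps-sized grid cell, sorts the chunk by
// cell id and returns a hashmap from cell id to (offset, count) into the
// now consecutively ordered point array.
std::unordered_map<std::size_t, std::pair<std::size_t, std::size_t>>
buildCellIndex(std::vector<Point>& chunk, const std::vector<double>& minimum,
               const std::vector<std::size_t>& cellsPerDim, double eps) {
    for (auto& p : chunk) {
        std::size_t cell = 0, stride = 1;
        for (std::size_t d = 0; d < p.coords.size(); ++d) {
            auto c = static_cast<std::size_t>((p.coords[d] - minimum[d]) / eps);
            cell += c * stride;  // dimension d contributes with stride prod(cellsPerDim[0..d-1])
            stride *= cellsPerDim[d];
        }
        p.cell = cell;
    }
    // Points of the same cell become consecutive in memory.
    std::sort(chunk.begin(), chunk.end(),
              [](const Point& a, const Point& b) { return a.cell < b.cell; });

    std::unordered_map<std::size_t, std::pair<std::size_t, std::size_t>> index;
    for (std::size_t i = 0; i < chunk.size(); ++i) {
        auto& entry = index.emplace(chunk[i].cell, std::make_pair(i, std::size_t{0})).first->second;
        ++entry.second;  // the offset keeps the first occurrence, the count grows
    }
    return index;
}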
Using this index, the data redistribution can be performed in a straightforward fashion. The local data points of a parallel processor that do not lie within its assigned subspace are simply transferred to the respective parallel processor "owning" them. Afterwards, all parallel processors have to rebuild their local data indices in order to encompass the received data. An efficient way of doing this is to send the corresponding section of the data index structure along with the data to the recipients. Due to the fact that both the received and the local data are pre-sorted, the sent data index section and its memory pointers can be used to quickly merge them using, e.g., the merge step of mergesort. The downside of the data redistribution approach is that it requires an additional memory overhead of O(n/p) per parallel processor, with p being the number of parallel processors, in order to be able to restore the initial data arrangement after the clustering. However, since this additional overhead has linear complexity, it is manageable even for large-scale problems.
Using the described index structure, cell-neighborhood queries execute in amortized computation time of O(1). The cell neighborhood NCell thereby consists of all cells that directly border the searched cell, including its diagonal neighbors, as well as the cell itself, with respect to all dimensions. For the cell labeled 6 in Figure 3 the cell neighborhood is the set {1, 2, 3, 5, 6, 7, 9, 10, 11}. A formal definition follows.

Definition 9. Cell neighborhood—The cell neighborhood NCell(c) of a given cell c denotes all cells d from the space of all available grid cells C that have a Chebyshev distance distChebyshev [6] of zero or one to c, i.e., NCell(c) = {d | d ∈ C ∧ distChebyshev(c, d) ≤ 1}.
The actual epsilon neighborhood is then constructed from all points within the direct cell neighborhood, filtered using the distance function dist. Šidlauskas et al. [26] show that a spatial grid index like this is superior to R-trees and k-d-trees for index creation and queries, in terms of computation time, under the assumption that the cell granularity is optimal with respect to future neighborhood searches. Due to the fact that DBSCAN 's search radius is constant, the cells can trivially be determined to be hypercubes with a side length of ε. From a technical perspective this has the additional advantage that each of the d parts of the entire cell-neighborhood vector is consecutive in memory. This in turn enables data pre-fetching and the reuse of cell neighborhoods, thus reducing the number of cache misses.
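Building on such an index, the ε-neighborhood query can be sketched as below. The block restates the Point structure and index type from the previous sketch to stay self-contained, ignores bounds handling at the grid edges and hard-codes the Euclidean distance, so it illustrates the cell-based lookup rather than reproducing the literal HPDBSCAN routine.

#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

struct Point {
    std::vector<double> coords;
    std::size_t cell = 0;
};
using CellIndex = std::unordered_map<std::size_t, std::pair<std::size_t, std::size_t>>;

// Returns the indices of all points within distance eps of p by scanning only
// the 3^d cells with Chebyshev distance <= 1 to p's cell in grid space.
std::vector<std::size_t> epsilonNeighborhood(const std::vector<Point>& chunk,
                                             const CellIndex& index,
                                             const std::vector<std::size_t>& cellsPerDim,
                                             const Point& p, double eps) {
    // Enumerate the cell neighborhood (no bounds checks at the grid edges).
    std::vector<std::size_t> cells{p.cell};
    std::size_t stride = 1;
    for (std::size_t d = 0; d < cellsPerDim.size(); ++d) {
        std::vector<std::size_t> extended;
        for (std::size_t cell : cells) {
            extended.push_back(cell);            // same cell coordinate in dimension d
            extended.push_back(cell - stride);   // left neighbor
            extended.push_back(cell + stride);   // right neighbor
        }
        cells.swap(extended);
        stride *= cellsPerDim[d];
    }

    // Filter the candidate points with the actual distance function.
    std::vector<std::size_t> neighbors;
    for (std::size_t cell : cells) {
        auto entry = index.find(cell);
        if (entry == index.end()) continue;
        const std::size_t offset = entry->second.first;
        const std::size_t count = entry->second.second;
        for (std::size_t i = offset; i < offset + count; ++i) {
            double sum = 0.0;
            for (std::size_t d = 0; d < p.coords.size(); ++d) {
                const double diff = p.coords[d] - chunk[i].coords[d];
                sum += diff * diff;
            }
            if (std::sqrt(sum) <= eps) neighbors.push_back(i);
        }
    }
    return neighbors;
}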
In order to be able to answer all range queries within its assigned subspace, a parallel processor needs an additional one-cell-thick layer of redundant data items surrounding the grid splits, which allows it to compute the cell neighborhood even at the edges of said splits. In parallel codes this is commonly referred to as halo or ghost cells. An efficient
way of providing these halo cells is to transfer them along
with the actual data during the data redistribution phase.
This way the parallel processor will also index them along
with the other data. Halo cells do not change the actual
split boundaries in which a parallel processor operates and
can be removed after the local clustering.
4.2 Cost heuristic
In the previous section we introduced the notion of subdividing the data space in order to efficiently parallelize HPDBSCAN and its spatial indexing, especially in distributed computing environments. However, we have not yet introduced a way to determine the boundaries of these splits.
One of the most naïve approaches is to subdivide the space into equally-sized chunks in all dimensions, so that the resulting number of chunks equals the number of available cores.
While the latter part of the assumption is sensible, as it minimizes the communication overhead, the former is not. Consider a spatially skewed dataset as shown in Figure 4. The sketched dotted boundary, chunking the data space equally between two parallel processors, results in a highly unbalanced computational load, where one core needs to cluster almost all the data points while the other idles most of the time. Due to the fact that computing the dist function, while filtering the cell neighborhood, is for many distance functions the most processing-intensive part of DBSCAN, this distribution pattern is particularly undesirable. It should also be clear that this is not only an issue in the presented example, but also in other spatially skewed datasets and at larger processing core counts.
Therefore, we employ a cost heuristic to determine a more
balanced data space subdivision. For this, we exploit the
computation complexity properties of the cell neighborhood
query. For each data item we have to perform n computations of the distance function dist, where n is the number of data items in the cell neighborhood. Since we have to do that for all m data items within a cell, the total number of comparisons for one cell is n ∗ m. The sum of all comparisons, i.e. the cost scores, over all cells gives us the total "difficulty" of the whole clustering problem, at least in terms of dist function evaluations. We can then assign to each parallel processor a consecutive chunk of cells, the cost of which is about a p-th part of the total score, with p being the number of available parallel processing cores. The formal definitions are as follows.

Figure 4: Impact of naive and heuristic-based hypergrid decompositions on compute load balancing. Halo cells are marked with a hatched pattern (naive split: processor 1 cost 368 vs. processor 2 cost 8; heuristic split: 190 vs. 186).
Definition 10. Cell cost—The cell cost CostCell(c) of a cell c is the number of items in it multiplied by the number of data points in its cell neighborhood: CostCell(c) = |c| ∗ |NCell(c)|.

Definition 11. Total cost—The total cost CostTotal is equal to the sum of the costs of all individual cells: CostTotal = Σc∈Cells CostCell(c).
Since the data items are already pre-sorted due to the spatial preprocessing step, the hypergrid subdivision can be performed by iteratively accumulating cell cost scores until the per-core threshold is reached or exceeded. Moreover, a cell itself can be subdivided to gain more fine-grained control. For this, the cost score of the cell is not added in its entirety but in n-sized steps for each data item in the cell, where n is the number of items in the cell neighborhood. Figure 4 shows an example of a dataset, its overlaid hypergrid, the annotated cell cost values and the resulting subdivision. These subdivisions can easily be computed in parallel by calculating the cell scores locally, reducing them to a global histogram, and finally determining the boundaries according to the accumulation algorithm explained above.
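A possible sketch of this accumulative boundary computation is given below. It operates on the globally reduced, linearized cell-cost histogram, omits the finer per-item subdivision of individual cells described above, and the function name computeSplits is hypothetical.

#include <cstddef>
#include <numeric>
#include <vector>

// Given the per-cell cost scores in linearized cell order, accumulates cells
// until roughly a p-th of the total cost is reached and records the boundary.
// Processor r is afterwards responsible for the cells [splits[r], splits[r + 1]).
std::vector<std::size_t> computeSplits(const std::vector<std::size_t>& cellCosts,
                                       std::size_t processors) {
    const std::size_t total = std::accumulate(cellCosts.begin(), cellCosts.end(), std::size_t{0});
    const std::size_t threshold = (total + processors - 1) / processors;  // ceil(total / p)

    std::vector<std::size_t> splits{0};
    std::size_t accumulated = 0;
    for (std::size_t cell = 0; cell < cellCosts.size(); ++cell) {
        accumulated += cellCosts[cell];
        if (accumulated >= threshold && splits.size() < processors) {
            splits.push_back(cell + 1);  // the next processor's range starts after this cell
            accumulated = 0;
        }
    }
    splits.resize(processors, cellCosts.size());  // pad in case fewer boundaries were found
    splits.push_back(cellCosts.size());           // closing boundary of the last range
    return splits;
}

Here, splits[r] and splits[r + 1] delimit the consecutive range of grid cells assigned to parallel processor r.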
4.3 Local DBSCAN
Having redistributed the chunks among the compute nodes
in a balanced fashion, the local DBSCAN execution follows.
To break the need for sequential computation, implied by the
recursive cluster expansion, this stage is converted to a par-
allelizable version with a single loop iterating over all data
points. This enables further, fine-grained parallelization of the algorithm using shared-memory approaches such as threads. The performance-
focused algorithm redesign is twofold at this stage. Besides
the parallelization of the iterations, the amount of compu-
tation per iteration is also minimized. Due to the cell-wise
sorting and indexing of data points within the local data
chunk, all points occupying one cell are stored consecutively
in memory. This ensures that each cell-neighborhood must
be computed at most once per thread, as each of them can
be cached until all queries from within the same cell are vis-
ited. Listing 2 presents the pseudocode of the converted,
iterative local DBSCAN.
def localDBSCAN(X, eps, minPts):
    rules = Rules()
    @parallel
    for p in X:
        Cp, Np = query(p, X, eps)
        if length(Np) >= minPts:
            markAsCore(p)
            add(Cp, p)
            for q in Np:
                Cq = getCluster(q)
                if isCore(q):
                    markAsSame(rules, Cp, Cq)
                add(Cp, q)
        elif notVisited(p):
            markAsNoise(p)
    return rules

Listing 2: Local DBSCAN pseudocode
For each of the points the epsilon neighborhood query is performed independently, i.e., not as a recursive expansion. When a query for a point p returns at least minPoints data points, none of which is yet labeled, p is marked as a core of a cluster. The newly created cluster is then labeled using p's data point index, which is globally unique for the entire sorted dataset. If the epsilon neighborhood, numbering at least minPoints, contains a point q that is already assigned to a cluster, the point p is added to that cluster and inherits the cluster label from q. In case of multiple cluster labels present in the neighborhood, the core p inherits any one of them and records that the encountered subclusters are actually one, as they are inherently density-connected. That information is vital for formulating merger rules for the subsequent merging of local cluster fragments and the unification of cluster labels in the global scope (see Section 4.4).
In all of the above cases, the remaining non-core points in the epsilon neighborhood, which may also include halo-area points, are added to the cluster of p. If p has fewer than minPoints data points in its neighborhood, it is marked as visited and labeled as noise. The proof below shows that the iterative relabeling is equivalent to the original recursive expansion.
Theorem 1. Given points p ∈ Cp and q ∈ Cq: (Core(p) ∨ Core(q)) ∧ DDR(p, q) ⇒ ∃C : Cp ∪ Cq ⊆ C ∧ p, q ∈ C

Proof. If neither p nor q is core, or they are not mutually DDR, the assumption is false and the implication trivially true. If p, q or both are cores, and they are DDR, then by definition they are also DR and therefore DC, with the linking point r being either p or q. Given the density connection DC between p and q, they belong to the same cluster C. By extension, any point belonging to Cp or Cq also belongs to C.
The result of local DBSCAN is a list of subclusters along
with the points and cores they contain, a list of noise points,
and a set of rules describing which cluster labels are equiva-
lent. This information is necessary and sufficient for the next
step of merging the overlapping clusters with contradictory
labels within the nodes’ halos.
4.4 Rule-based cluster merging
The relabeling rules created by individual nodes are insufficient for merging cluster fragments from separate dataset
chunks. The label-mapping rules across different nodes are
created based on the labels of halo points. Upon the com-
pletion of the local DBSCAN, each halo zone is passed to
the node that owns the actual neighboring data chunk. Sub-
sequently, the comparison of local and halo point labels fol-
lows, resulting analogously in a set of relabeling rules for
neighboring chunks, which may create transitive cluster la-
bel chains. These rules are then serialized and broadcast to
all other nodes. Only then is the minimization of all local
and inter-chunk label-mapping rules possible, and all tran-
sitive labels can be removed. Thus each compute node is
equipped with a list of direct mappings from each existing
subcluster label to a single global cluster label.
Each compute node then proceeds to relabel the owned
clusters using the merger rules. At that stage each data point, now having a cluster label assigned, is sent back to the compute node that originally loaded it from the dataset.
Recreation of the order of all data points is enabled by the
initial ordering information created during the data redis-
tribution phase. The distributed HPDBSCAN execution is
complete and the result is a list of cluster ids or noise mark-
ers per data item.
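One way to realize such a rule set, including the removal of transitive label chains, is a disjoint-set (union-find) structure over the sub-cluster labels. The class below is a simplified sketch under that assumption; the choice of the smallest label as the global representative mirrors the labeling convention of Section 5.2, but the actual HPDBSCAN implementation may differ.

#include <unordered_map>
#include <utility>

// Collects "label a and label b denote the same cluster" rules and resolves
// transitive chains, so that find() maps every sub-cluster label to a single
// global representative (the smallest label of its equivalence class).
class RelabelRules {
public:
    void markAsSame(long a, long b) {
        long ra = find(a);
        long rb = find(b);
        if (ra == rb) return;
        if (rb < ra) std::swap(ra, rb);
        parent_[rb] = ra;  // attach the larger representative below the smaller one
    }

    long find(long label) {
        long parent = parent_.emplace(label, label).first->second;  // unseen labels map to themselves
        if (parent == label) return label;
        long root = find(parent);  // follow the transitive chain
        parent_[label] = root;     // path compression removes the chain for future lookups
        return root;
    }

private:
    std::unordered_map<long, long> parent_;
};

After the broadcast of the halo-induced rules, each node can map any temporary label l to its global cluster id via find(l).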
5. IMPLEMENTATION
In this section we present our prototypical realization of HPDBSCAN and discuss specific technical details. The C++ source code can be obtained freely from our source code repository [23]. It depends on the parallel programming
APIs Open Multiprocessing (OpenMP) [8] in version 4.0+
and Message Passing Interface (MPI) [15] in version 1.1+.
Additionally, the command-line version requires the I/O li-
brary Hierarchical Data Format 5 (HDF5) [18] in order to
pass the data and store computational results.
5.1 Data distribution and gathering
As explained in section 4.1, the data items of the datasets
are redistributed in the preprocessing step, in order to achieve
data locality. In shared-memory architectures this step is trivially not required, since all processors can access the same memory. For distributed environments, however, it is needed and can be quite challenging to realize—especially in a scalable fashion.
Since HPDBSCAN sorts the data points during the in-
dexing phase and lays them out consecutively in memory,
we are able to exploit collective communication operations
of the MPI. We first send the local histograms of data points
from each compute node to the one that owns the respec-
tive bounds during the local DBSCAN execution. This can
be implemented either by an MPI_Reduce or, alternatively,
by an MPI_Alltoall and a subsequent summation of the
array. After that, each of the compute nodes allocates lo-
cal memory, and the actual data points are exchanged using
an MPI_Alltoallv call. Using the received histograms, the
compute nodes are also able to memorize the initial ordering
of the data points, in a flat array, for example.
The gather step can be implemented analogously. Instead of sorting the local data items by their as-
signed grid cell, they are now re-organized by their initial
global position in the dataset. After that, they can be ex-
changed again, using the MPI collectives, and stored. Note
that in this step the computed cluster labels are transferred
along with the data points in order to avoid additional communication rounds.
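The exchange described above can be sketched with the two collectives as follows. The flat double buffer, the per-rank counts and the helper name exchangePoints are simplifying assumptions for illustration and not the exact HPDBSCAN routines.

#include <mpi.h>
#include <numeric>
#include <vector>

// Exchanges a pre-sorted, flat coordinate buffer so that every rank receives
// exactly the values that fall into its assigned grid-cell range.
// sendCounts[r] holds the number of doubles destined for rank r.
std::vector<double> exchangePoints(std::vector<double>& sendBuffer,
                                   std::vector<int>& sendCounts, MPI_Comm comm) {
    int size = 0;
    MPI_Comm_size(comm, &size);

    // First let every rank know how much data it will receive from each peer.
    std::vector<int> recvCounts(size, 0);
    MPI_Alltoall(sendCounts.data(), 1, MPI_INT, recvCounts.data(), 1, MPI_INT, comm);

    // Displacements are the exclusive prefix sums of the counts.
    std::vector<int> sendDispls(size, 0), recvDispls(size, 0);
    std::partial_sum(sendCounts.begin(), sendCounts.end() - 1, sendDispls.begin() + 1);
    std::partial_sum(recvCounts.begin(), recvCounts.end() - 1, recvDispls.begin() + 1);

    // The variable-sized all-to-all moves the actual point data in one collective.
    std::vector<double> recvBuffer(recvDispls.back() + recvCounts.back());
    MPI_Alltoallv(sendBuffer.data(), sendCounts.data(), sendDispls.data(), MPI_DOUBLE,
                  recvBuffer.data(), recvCounts.data(), recvDispls.data(), MPI_DOUBLE, comm);
    return recvBuffer;
}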
5.2 Lock-free cluster labeling
To ensure that the cluster labels are unique within a chunk as well as globally, each cluster label c is determined by the lowest index of a core point inside the cluster: c = min{index(p) | p ∈ C ∧ Core(p)}. The index(p) function returns the position of a data point p within the globally sorted dataset redistributed to the compute nodes. In addition to ensuring global uniqueness, this mechanism also maximizes the size of consistently labeled cluster fragments within the same compute node, as each consecutive iteration over the points increments the current point's index. Whenever a core is found in the epsilon neighborhood, the current point inherits its cluster label, even if it is a core itself.
A data race may occur, when the current epsilon neighbor-
hoods of multiple parallel threads overlap. In that case each
thread may attempt to assign a label to a point within their
neighborhood intersection. The naïve approach of locking the data structures storing the cluster label and core information is not scalable.
The better alternative of using atomic operations, here an atomic minimum, requires encoding the values to operate on with a single native data type. For this, we use signed long inte-
ger type values, and compress all flags and labels described
by DBSCAN ’s original definition, i.e., “visited”, “core” and
“noise” flags, and a “cluster label”, to that data type. As the
iterations are performed for each data point exactly once,
the “visited” flag, is made redundant and abandoned. The
cluster label value is stored using the absolute value of the
lowest core point index it contains. The sign bit is used
to encode the “core” flag, such that each core of cluster c
is marked by value -c, and each non-core point—by value
c. As cluster labels are created using point indexes, their
value never exceeds |X|. The noise label can then be en-
coded using any value from outside the range [−|X|,|X|].
For this, we have selected the upper bound of the value
range—the maximal positive signed long integer. As long
as range(signed long int)>=|X|+ 1, signed long integers
are sufficient to encode the cluster labels as well as the core
and noise flags. In that way, minimizing the cluster label is possible via a simple atomic min implementation that sets the cluster label and core flag at once. Some processor architectures, e.g., Intel x86, do not provide an atomic min instruction. Instead, a spinlock realization using basic atomic read and compare-and-swap instructions, as shown in Listing 3, is used.
def atomicMin(address, val):
    prev = atomicRead(address)
    while prev > val:
        swapped = CAS(address, prev, val)
        if swapped: break
        prev = atomicRead(address)

Listing 3: Spinlock atomic min
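In C++ the same operation can be expressed with std::atomic and a compare-and-swap loop. The following sketch is a generic illustration of the technique rather than the exact HPDBSCAN code.

#include <atomic>

// Atomically lowers `label` to `value` if `value` is smaller, mirroring the
// compare-and-swap loop of Listing 3. The sign bit of the stored value can
// carry the core flag as described above.
inline void atomicMin(std::atomic<long>& label, long value) {
    long previous = label.load();
    while (previous > value &&
           !label.compare_exchange_weak(previous, value)) {
        // On failure compare_exchange_weak reloads `previous` with the current
        // value, so the loop re-checks whether `value` is still the smaller one.
    }
}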
5.3 Parallelization of the local DBSCAN loop
The iterative conversion of DBSCAN allows us to divide
the computation of the loop iterations among all threads of
a compute node. Because the density of data points within
a chunk can be highly skewed, a naive chunking approach is
suboptimal (see Section 4.2), and can lead to a highly unbal-
anced work load. To mitigate this, a work stealing approach
is advisable. Our HPDBSCAN implementation uses OpenMP's parallel for pragma for threading. The closest representa-
tive of work stealing in OpenMP is the schedule(dynamic)
clause, added to the parallel for pragma. Optimal perfor-
mance is achieved when the dynamically pulled workload is
small enough—so that the workload imbalances are split and
fairly divided, and at the same time large enough—so that
not too many atomic min operations (whether supported
by hardware or not) are performed simultaneously on the
same memory location. This number is highly dependent
on environment details, such as the clustered problem and
the execution hardware. Through empirical tests, however,
we determined a reasonable dynamic chunk size of 40.
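As a minimal illustration, the loop annotation might look like the sketch below. processPoint is a hypothetical stand-in for the per-point neighborhood query and label update, and the chunk size of 40 is the empirically determined value mentioned above.

#include <cstddef>

// Hypothetical stand-in for the real per-point work, i.e. the cell-based
// neighborhood query and the lock-free label update described earlier.
inline void processPoint(std::size_t /*pointIndex*/) {}

// schedule(dynamic, 40) lets idle threads pull small batches of 40 iterations,
// which evens out the skewed per-point neighborhood costs.
void localDBSCAN(std::size_t numberOfPoints) {
    #pragma omp parallel for schedule(dynamic, 40)
    for (std::size_t i = 0; i < numberOfPoints; ++i) {
        processPoint(i);
    }
}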
6. EXPERIMENTAL EVALUATION
In this section we will describe the methodology and find-
ings of the experiments conducted to evaluate the parallel
DBSCAN approach described above. The main focus of
the investigation is the performance evaluation of the imple-
mentation with respect to computation time, memory con-
sumption and the parallel programming metrics speed-up and scale-up [12].
6.1 Hardware setup
To verify the computation time and speed up of our im-
plementation, we have performed tests on the Juelich Dedi-
cated Graphic Environment (JUDGE) [10]. It consists of 206
IBM System x iDataPlex dx360 M3 compute nodes, where
each node has 12 compute cores combined through two Intel
Xeon X5650 (Westmere) hex-core processors clocked at 2.66
GHz. A compute node has 96 GB of DDR-2 main memory.
JUDGE is connected to a parallel, network-attached GPFS-
storage system, called Juelich Storage Cluster (JUST) [11].
Even though the system has a total core count of 2,472, we were only able to acquire a maximum of 64 nodes (768 cores) for our benchmark, as JUDGE is used as a production cluster for other scientific applications. Our hardware allocation, though, was dedicated solely to us, which ensured that no other computations interfered with our tests. The installed Westmere processors allow the use of 24 virtual processors per node when hyperthreading is enabled. For the test runs, however, we disabled this feature, as it can falsify or destabilize measurement correctness, as Leng et al. [22] have shown. In a multithreading scenario we therefore use a maximum of 12 threads per node.
6.2 Software setup
The operating system running on JUDGE is SUSE Linux
SLES 11 with the kernel version 2.6.32.59-0.7. All appli-
cations in the test have been compiled with gcc 4.9.2 us-
ing the optimization level O3. The MPI distribution on
JUDGE is MPICH2 in version 1.2.1p1. For the compila-
tion of HPDBSCAN, a working HDF5 development library
including headers and C++ bindings is required. For our
benchmarks we used the HDF Group's reference implementation, version 1.8.14, pre-installed on JUDGE.

Dataset             Points      Dims.  Size (MB)  ε     minPts
Tweets [t]          16,602,137  2      253.34     0.01  40
Twitter small [ts]  3,704,351   2      56.52      0.01  40
Bremen [b]          81,398,810  3      1863.68    100   10000
Bremen small [bs]   2,543,712   3      48.51      100   312

Table 1: HPDBSCAN benchmark dataset properties

Later in
this section we present a comparison of HPDBSCAN with
PDSDBSCAN-D created by Patwary et al. [25]. The latter needs the parallel netCDF I/O library. We have obtained and compiled pnetCDF, version 1.5.0, from the project's web page at Northwestern University [24].
6.3 Datasets
Despite DBSCAN 's popularity, its parallelization attempts have mainly been evaluated using synthetic datasets. To their advantage, such datasets can provide an arbitrarily large number of data points and dimensions. The downside, however, is that they are not representative of actual real-world applications. They might contain inherent regular patterns from, e.g., pseudo-random number generators that silently bias the implementation's performance. For this reason, we decided to resort to actual real-world data and its potential skew. An overview of the chosen datasets is given in Table 1. We acknowledge that an evaluation on higher-dimensional datasets is of great interest for some clustering applications, such as, for instance, genomics in bio-informatics, but suitable data could not be obtained at the time of writing.
6.3.1 Geo-tagged collection of tweets
This set was collected and made available to us by Junjun
Yin from the National Center for Supercomputing Applications (NCSA). The dataset was obtained using the free Twitter streaming API and contains exactly one percent of all
geo-tagged tweets from the United Kingdom in June 2014.
It was initially created to investigate the possibility of min-
ing people’s trajectories and to identify hotspots and points
of interest (clusters of people) through monitoring tweet den-
sity. The full collection spans roughly 16.6 million tweets. A
smaller subset of this was generated by filtering the entire set
for the first week of June only. Both datasets are available at
the scientific storage and sharing platform B2SHARE [14].
6.3.2 Point cloud of Bremen’s old town
This data was collected and made available by Dorit Borrmann and Andreas Nüchter from the Institute of Computer Science at Jacobs University Bremen, Germany. It is a
3D point cloud of the old town of Bremen. A point cloud is a set of points in a common coordinate system that often models the surface of objects. This particular point cloud of Bremen was recorded using a laser scanner system mounted onto an autonomous robotic vehicle. The vehicle stopped at 11 different locations, each time performing a full 360° scan of the surrounding area. Given the GPS-triangulated position and the perspective of the camera, the sub-point clouds were combined into one monolith. The raw data is available from Borrmann and Nüchter's webpage [20]. An
already combined version in HDF5 format, created by us,
can be obtained from B2SHARE [14]. DBSCAN can be
applied here in order to clean the dataset of noise and outliers, such as faulty scans or unwanted reflections of moving objects. Moreover, DBSCAN can also be used to find distinct objects in the point cloud, represented as clusters, such as houses, roads or people. The whole point cloud contains roughly 81.3 million data points. A smaller variant, also available on B2SHARE [14], was generated by randomly sampling 1/32 of the points.
6.4 Speed up evaluation of HPDBSCAN
We benchmark our HPDBSCAN application’s speed up
using both, the full Twitter (t) and the full Bremen (b)
dataset. Our principal methodological approach is thereby
as follows. Each benchmark is ran five times, measuring
the application’s walltime at the beginning and end of the
main() function of the process with the MPI rank 0 and the
OpenMP thread number 0. After these five runs we double
the number of nodes and cores, starting from one node and
12 cores, up to the maximum of 768 cores. In addition to
that we have run a base measurement with exactly one core
on one node. For each "five-pack" benchmark run we report the minimum, maximum, mean µ, standard deviation σ, and coefficient of variation (CV), defined as ν = σ/µ [1]. The speed-up coefficient is calculated relative to the single-core run, based on the mean values of the measurements for each processor-count configuration. Both datasets are
processed using the OpenMP/MPI hybrid features of our
application. That means that we spawn an MPI process for
each node available and parallelize locally on the nodes using
OpenMP. For the Bremen point cloud we have additionally
parallelized the computation with MPI alone, i.e., we use
one MPI process per core, enabling direct comparison of the
hybrid and fully distributed versions.
Nodes                    1         1         2    ...     32       64
Cores                    1        12        24          384      768

OpenMP+MPI hybrid, dataset b, time (s)
  Mean µ          79372.29   8037.71   4271.64       327.07   172.53
  StDev σ          17.6011   71.2829   16.2092       2.5971   1.3801
  CV ν             0.00022   0.00886   0.00379       0.0079   0.0079
  Min             79342.08   7937.48   4253.45       322.71   170.77
  Max             79385.57   8129.85   4293.86       329.65   174.47
  Speed-Up               1       9.9      18.6        242.7    460.0

MPI, dataset b, time (s)
  Mean µ          79372.29   8028.67   4403.96       515.21   354.99
  StDev σ          17.6012    9.5769    7.1526      94.7806  42.0006
  CV ν             0.00022   0.00119   0.00162      0.18396  0.11832
  Min             79342.08   8019.10   4395.78       471.10   302.27
  Max             79385.57   8040.83   4415.45       684.74   420.01
  Speed-Up               1       9.9      18.0        154.1    232.7

OpenMP+MPI hybrid, dataset t, time (s)
  Mean µ           2079.26    212.77    115.66        10.04     7.88
  StDev σ          1.06455   0.56826   0.35893      0.42128  1.03302
  CV ν             0.00051   0.00267   0.00310      0.04194  0.13106
  Min              2078.16    212.05    115.34         9.76     7.14
  Max              2080.47    213.43    116.17        10.78     9.70
  Speed-Up               1       9.8      18.0        207.0    263.8

Table 2: Measured and calculated values of the HPDBSCAN speed-up evaluation
The results in Table 2 and Figure 5 show that we are
able to gain substantial speed up for both data sets. It
peaks for Bremen at 460.0 using 768 cores, and in the Twit-
ter analysis case at slightly more than half of this value at
263.8. For the MPI-only clustering of the Bremen dataset
the speed up value falls short of the hybrid implementation,
being only roughly half of it with 232.7 using 768 cores.
There are two noteworthy facts that can be observed in the
measurement data. The first and obvious one is that the hybrid implementation outperforms the fully distributed MPI runs by a factor of two. The access to a shared cell index and the reduced number of nodes to communicate with significantly reduce the communication overhead and enable faster processing. Secondly, one can observe a steady decrease in the efficiency of additional cores used for the clustering. This seems to be especially true for the tweet collection compared to the Bremen dataset. This observa-
tion can be explained best through Amdahl’s law [2]. In
the benchmark we use a constant problem size, which precludes unbounded speed-up gains. Instead, we approach the asymptote set by the single-threaded program parts. Due to the fact that the tweet collection is smaller in size, we approach this boundary earlier than with the Bremen data, for instance. Additional network communication overhead with larger processor counts, atomicMin() contention, as well as load imbalances are good examples of simultaneously growing serial program parts. Moreover, the growing CV value is a good indicator of the increasing influence of external factors on the measurements, such as varying operating system scheduling.
Figure 5: Speed-up curves of the HPDBSCAN application analyzing the Bremen and Twitter datasets (hybrid b, hybrid t, MPI b, and linear reference)
6.5 Scale up evaluation of HPDBSCAN
In this section we investigate HPDBSCAN ’s scalability
properties. Our principal measuring methodology remains
unchanged. Instead of the speed-up coefficient, we report the efficiency value ep = t1/tp for each benchmark run, which is the ratio of the execution time with a single core to the execution time with p processing cores. Perfect scalability is achieved when the efficiency equals one or is almost one. It requires, however, doubling the dataset size whenever we double the processor count.
As a base for this we use the small Bremen dataset bs.
In particular, for each run, we copy the entire data p times, where p is equal to the number of used processors. Then,
each copy is shifted along the first axis of the dataset by the
copy’s index, times the axis range, and concatenated with
the others. This way, we get multiple (p) Bremen old towns
next to one another. We chose this approach to get a better
grasp of the overhead of our implementation by presenting
the same problem to each available MPI process. In contrast
to that, a random sampling of the whole Bremen dataset,
for instance, would have altered the problem difficulty.
The results of our test can be seen in Table 3 and Figure 6. In all of the three scenarios a near constant efficiency value can be achieved, indicating good scalability. While the MPI-only and OpenMP/MPI hybrid benchmark runs show only a slightly increasing execution time curve, we can observe a clear peak for the OpenMP benchmark with four and more cores. Through a separate test, we can attribute this increase to increased contention in the spinlock of our atomicMin() implementation, introduced in Section 5.

Cores                    1         2         4         8        16        32

OpenMP, time (s)
  Mean µ           99.5044   102.644   107.462    121.35         -         -
  StDev σ           0.0991    0.0439    0.1515    5.6788         -         -
  CV ν             0.00099   0.00042   0.00141   0.04679         -         -
  Min                99.41    102.58     107.2    117.81         -         -
  Max                99.66    102.69    107.59    131.45         -         -

MPI, time (s)
  Mean µ             99.50    101.86    103.74    105.48    107.12    109.19
  StDev σ           0.0991    0.0884    0.1131    0.1881    0.5399    0.6653
  CV ν             0.00099   0.00086   0.00109   0.00178   0.00504   0.00609
  Min                99.41    101.79    103.65     105.3    106.56    108.52
  Max                99.66       102    103.91    105.76    107.77    110.15

Cores                   12        24        48        96       192       384

OpenMP/MPI hybrid, time (s)
  Mean µ             10.46      13.3     14.02    14.786     16.31     19.76
  StDev σ           0.0974    0.1425    0.2420    0.2548   0.58307   3.15081
  CV ν             0.00931   0.01071   0.24207   0.25481   0.03578   0.15943
  Min                10.38     13.12     13.76     14.57     15.67    17.909
  Max                10.63     13.45     14.28     15.16     17.12     25.26

Table 3: Measured and calculated values of the HPDBSCAN scale-up evaluation on the Bremen data

Figure 6: Scale-up curves of HPDBSCAN analyzing the small Bremen dataset bs (OpenMP, MPI, and OpenMP/MPI hybrid)
6.6 Comparison of HPDBSCAN and PDSDBSCAN
As discussed in related work there are a number of other
parallel versions of DBSCAN. Most of them report different
value permutations for the computation time, memory con-
sumption, speed up and scalability of their implementations.
Almost all of them provide neither their benchmark datasets nor the source code of their implementations. This, in turn, prevents us from verifying their results or from performing a direct comparison of our approaches. To the best of our knowledge the only exception are Patwary et al. [25]. Their
datasets can also not be recreated, but they made the C++
implementation of their parallel disjoint-set data structure
DBSCAN —PDSDBSCAN —open-source. This allows us to
compare our approach with theirs at least using our bench-
mark datasets. Patwary et al. offer two versions of PDSDB-
SCAN. One targets shared-memory architectures only and
is based on OpenMP. The other can also be used in dis-
tributed environments and is implemented using MPI. In
order to distinguish between these two, the suffixes -S or -D
are added respectively.
In order to compare both parallel DBSCAN approaches,
we have performed another speed up benchmark according
to the introduced methodology on the small Twitter dataset.
Due to their technical similarities our two “contestants” are
HPDBSCAN using MPI processes only and PDSDBSCAN-
D. We scale the process count from one to a maximum of 32, each process being executed on a separate compute node. Even though we have executed five runs for each processor count, we report here only the mean values for execution time, memory consumption and speed-up, due to space considerations.
Nodes                  1        2        4        8         16         32

HPDBSCAN MPI
  time (s)        114.39    58.99    30.14    15.71       8.37       6.07
  Speed-Up          1.00     1.94     3.80     7.28      13.67      18.85
  Memory (MB)     251064   345276   433340   678248    1101000    2111000

PDSDBSCAN-D
  time (s)        288.35   162.47   105.94    89.87      85.37      88.42
  Speed-Up             1     1.77     2.72     3.21       3.38       3.36
  Memory (MB)     500512   725104  1370000  4954000   19724000   59685000

Table 4: Comparison of HPDBSCAN and PDSDBSCAN-D using the Twitter dataset ts
Table 4 and Figure 7 present the obtained results. HPDBSCAN shows a steady, near-linear speed-up curve, whereas PDSDBSCAN-D starts similarly but soon flattens, stabilizing at a speed-up of around 3.5. The picture for the memory consumption is inverted. HPDBSCAN again shows a linear increase, seemingly dependent on the number of used processing cores, which can be explained by the larger number of replicated points in the halo areas. PDSDBSCAN-D, however, exhibits exponential growth in memory consumption. An investigation of the source code reveals that each MPI process always loads the entire data file into main memory, effectively limiting its capability to scale to larger datasets. This is also the reason why we have used the small Twitter dataset ts for this experiment, as larger datasets have caused out-of-memory exceptions. As a consequence, we have not been able to reproduce the performance capabilities of PDSDBSCAN-D.
Figure 7: Speed-up and memory usage of HPDBSCAN compared to PDSDBSCAN-D on the Twitter dataset ts
7. CONCLUSION
In this paper, we have presented HPDBSCAN —a scal-
able version of the density-based clustering algorithm DB-
SCAN. We have overcome the algorithm’s inherent sequen-
tial control flow dependencies through a divide-and-conquer
approach, using techniques from cell-based clustering algo-
rithms. Specifically, we employ a regular hypergrid as the
spatial index in order to minimize the neighborhood-search
spaces and to partition the entire cluster analysis into local
subtasks, without requiring further communication. Using
a rule-based merging scheme, we combine the found local cluster labels into a global view. In addition to that, we also propose a cost heuristic that balances the computational workload by dividing the previously mentioned cells among the compute nodes according to their computational complexity. We have implemented HPDBSCAN as an open-source OpenMP/MPI hybrid application in C++, which can be deployed in shared-memory as well as distributed-memory computing environments. Our experimental evaluation of the application has demonstrated the algorithm's scalability in terms of memory consumption and computation time, outperforming PDSDBSCAN, a previous parallel HPC implementation. The presented cell-based spatial index can easily be transferred to other clustering and neighborhood-search problems with a constant search range.
In future work we plan to demonstrate this on the basis of
parallelizing other clustering algorithms, such as OPTICS
and SUBCLU.
8. REFERENCES
[1] H. Abdi. Coefficient of variation. Encyclopedia of
research design, pages 169–171, 2010.
[2] G. M. Amdahl. Validity of the single processor
approach to achieving large scale computing
capabilities. In Proceedings of the spring joint
computer conference 1967, pages 483–485. ACM, 1967.
[3] D. Arlia and M. Coppola. Experiments in parallel
clustering with dbscan. In Euro-Par 2001 Parallel
Processing, pages 326–331. Springer, 2001.
[4] J. L. Bentley and J. H. Friedman. Data structures for
range searching. ACM Computing Surveys (CSUR),
11(4):397–409, 1979.
[5] S. Brecheisen, H.-P. Kriegel, and M. Pfeifle. Parallel
density-based clustering of complex objects. In
Advances in Knowledge Discovery and Data Mining,
pages 179–188. Springer, 2006.
[6] C. Cantrell. Modern mathematical methods for
physicists and engineers. CUP, 2000.
[7] M. Chen, X. Gao, and H. Li. Parallel DBSCAN with
Priority R-Tree. In Information Management and
Engineering (ICIME), 2010 The 2nd IEEE
International Conference, pages 508–511. IEEE, 2010.
[8] L. Dagum and R. Menon. OpenMP: an industry
standard API for shared-memory programming.
Computational Science & Engineering, IEEE,
5(1):46–55, 1998.
[9] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A
density-based algorithm for discovering clusters in
large spatial databases with noise. In Kdd, volume 96,
pages 226–231, 1996.
[10] Forschungszentrum Jülich GmbH. Juelich Dedicated GPU Environment. http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUDGE/JUDGE_node.html.
[11] Forschungszentrum Jülich GmbH. Juelich Storage Cluster. http://www.fz-juelich.de/ias/jsc/EN/Expertise/Datamanagement/OnlineStorage/JUST/JUST_node.html.
[12] I. Foster. Designing and Building Parallel Programs. Addison Wesley Publishing Company, 1995.
[13] Y. X. Fu, W. Z. Zhao, and H. F. Ma. Research on
parallel dbscan algorithm design based on mapreduce.
Advanced Materials Research, 301:1133–1138, 2011.
[14] Götz, Markus and Bodenstein, Christian. HPDBSCAN Benchmark test files. http://hdl.handle.net/11304/6eacaa76-c275-11e4-ac7e-860aa0063d1f.
[15] W. Gropp, E. Lusk, and A. Skjellum. Using MPI:
portable parallel programming with the
message-passing interface, volume 1. MIT press, 1999.
[16] S. Habib, V. Morozov, H. Finkel, A. Pope,
K. Heitmann, K. Kumaran, T. Peterka, J. Insley,
D. Daniel, P. Fasel, et al. The universe at extreme
scale: multi-petaflop sky simulation on the bg/q. In
Proceedings of the International Conference on High
Performance Computing, Networking, Storage and
Analysis, page 4. IEEE Computer Society Press, 2012.
[17] J. Han, M. Kamber, and J. Pei. Data Mining: concepts
and techniques - 3rd ed. Morgan Kaufmann, 2011.
[18] HDF Group. Hierachical Data Format 5.
http://www.hdfgroup.org/HDF5.
[19] Y. He, H. Tan, W. Luo, H. Mao, D. Ma, S. Feng, and
J. Fan. MR-DBSCAN: an efficient parallel
density-based clustering algorithm using mapreduce.
In Parallel and Distributed Systems, 2011 IEEE 17th
International Conference, pages 473–480, 2011.
[20] Jacobs University Bremen. 3D Scan Repository. http://kos.informatik.uni-osnabrueck.de/3Dscans/.
[21] A. Knöpfel, B. Gröne, and P. Tabeling. Fundamental modeling concepts, volume 154. Wiley, UK, 2005.
[22] T. Leng, R. Ali, J. Hsieh, V. Mashayekhi, and
R. Rooholamini. An empirical study of
hyper-threading in high performance computing
clusters. Linux HPC Revolution, 2002.
[23] Markus Götz. HPDBSCAN implementation. https://bitbucket.org/markus.goetz/hpdbscan.
[24] Northwestern University. Parallel netCDF. http://cucis.ece.northwestern.edu/projects/PnetCDF/.
[25] M. M. A. Patwary, D. Palsetia, A. Agrawal, et al. A
new scalable parallel dbscan algorithm using the
disjoint-set data structure. In High Performance
Computing, Networking, Storage and Analysis (SC),
International Conference for, pages 1–11. IEEE, 2012.
[26] D. Šidlauskas, S. Šaltenis, C. W. Christiansen, J. M. Johansen, and D. Šaulys. Trees or grids?: indexing moving objects in main memory. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 236–245. ACM, 2009.
[27] X. Xu, J. Jäger, and H.-P. Kriegel. A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Mining and Knowledge Discovery, 3:263–290, 1999.
[28] A. Zhou, S. Zhou, J. Cao, Y. Fan, and Y. Hu.
Approaches for scaling dbscan algorithm to large
spatial databases. Journal of computer science and
technology, 15(6):509–526, 2000.