A generalized GPU-based connected component
labeling algorithm
Yukihiro Komura
RIKEN, Advanced Institute for Computational Science, 7-1-26 Minatojima-minami-machi, Chuo-ku, Kobe, Hyogo 650-0047, Japan
Email: yukihiro.komura.ss@alum.riken.jp
Abstract
We propose a generalized GPU-based connected component labeling (CCL) algorithm that can be applied to various lattices and to non-lattice environments in a uniform fashion. We extend our recent GPU-based CCL algorithm, which does not use conventional iteration, to the generalized method. As an application of this algorithm, we deal with the bond percolation problem. We investigate bond percolation on the honeycomb and triangular lattices to confirm the correctness of the algorithm. Moreover, we deal with bond percolation on the Bethe lattice as a substitute for a network structure, and demonstrate the performance of the algorithm on those lattices.
Keywords: connected component labeling, percolation theory, Bethe lattice,
parallel computing, GPU
1. Introduction
The connected component labeling (CCL) algorithm has been used in image processing [1, 2, 3, 4], in applications in physics [5, 6, 7, 8], in percolation theory, and in problems involving porous rocks [9]. The majority of CCL algorithms were designed for lattice environments; only a few studies have addressed CCL algorithms for non-lattice environments, in which the positions of the sites are arbitrary rather than being restricted to the discrete points of a regular lattice. Non-lattice environments exist not only in the percolation theory of disordered discs and spheres, but also in networks.
The performance of a single CPU core has remained almost unchanged for a decade. However, the number of cores has increased each year, and recent advances in application performance have been realized by exploiting concepts such as multiple cores and many threads. Graphics accelerators are a common example of many-thread devices: they have evolved into highly parallel processors with very high memory bandwidth, and research has clearly shown that they can dramatically improve computing performance. The most widely used programming model for accelerators is CUDA [10], a parallel-computing platform and programming model developed by NVIDIA that is essentially C/C++ with several extensions allowing functions to be executed directly on the NVIDIA GPU.
Connected component labeling algorithms for a single GPU have been proposed by many researchers. The majority of those CCL algorithms have been designed for lattice environments and use a two-stage approach that divides the lattice into sub-blocks that are treated independently and then merged. However, sub-block decomposition is difficult in non-lattice environments, so we need GPU-based CCL algorithms that work without it.
Several CCL algorithms for a single GPU avoid the sub-block decomposition. Hawick et al. [3] proposed a CCL algorithm called "label equivalence", and Kalentev et al. [4] improved it. Both of these algorithms rely on an iterative method of comparison with nearest-neighbor sites. More recently, the present author [7] proposed a single-GPU CCL algorithm that does not use conventional iteration; its computation times have proved to be about half those of the previous method [4] in the application of the Swendsen-Wang multi-cluster spin-flip algorithm [11].
The CCL algorithms described above focus on the case of square and simple cubic lattices, and the algorithm in [7] is specialized to that case. In this paper, we propose a generalized GPU-based CCL algorithm that can be applied to various lattices and to non-lattice environments in a uniform fashion. This generalized method extends our recent GPU-based CCL algorithm [7], which does not use conventional iteration. To confirm the correctness and performance of the algorithm, we deal with bond percolation problems. This paper is organized as follows: In Section 2, we briefly review the GPU-based CCL algorithm that serves as a starting point [7]. In Section 3, we describe the new generalized algorithm. In Section 4, we first show the results of bond percolation on the honeycomb and triangular lattices to confirm the correctness. We then adapt the generalized algorithm to bond percolation on the Bethe lattice as a substitute for a network structure, and describe the resulting performance. Finally, a summary and discussion are presented in Section 5.
2. Connected Component Labeling
Connected component labeling algorithms assign a proper cluster label to each site on the basis of local connection information. Numerous CCL algorithms have been proposed. The Hoshen–Kopelman algorithm [12] and the CCL algorithm proposed by Suzuki et al. [1] are the best-known CCL algorithms that use a single CPU. The majority of single-CPU CCL algorithms are specialized to the case of square and simple cubic lattices. Moreover, those algorithms are sequential, and sequential algorithms cannot be applied directly to GPU computation; a suitable new algorithm is therefore required.
We briefly review the CCL algorithm without conventional iteration [7]. The cornerstone of this algorithm is the method of label reduction proposed by Wende et al. [13]; the procedure is illustrated in Fig. 1 of [7]. The labeling consists of four steps: (i) initialization, (ii) analysis, (iii) label reduction, and (iv) analysis. This method does not require an iterative method of comparison with nearest-neighbor sites.
site  NN list            site  NN list
 0    3, 10               11   4, 20
 1    4, 5, 15            12   16, 19, 21
 2    17, 18              13   5, 15
 3    0, 8, 9, 21         14   19, 21
 4    1, 11               15   1, 13
 5    1, 13, 20           16   12, 21
 6    8, 9                17   2, 18
 7    20                  18   2, 17
 8    3, 6                19   12, 14
 9    3, 6, 10            20   5, 7, 11
10    0, 9                21   3, 12, 14, 16
Figure 1: Example of the nearest-neighbor (NN) list for a complex network. The number at each site is its index, and each site has an NN list.
The number of such comparisons is one for a square lattice and two for a simple cubic lattice if periodic boundary conditions are not used. In the initialization function, a label is assigned to each site based on its connections: each site has label[i] = min, where min is the lowest site number among the connected sites. The analysis function tracks the label from a given site to a new site determined by the value of the label at the given site: all sites compute label[i] = label[label[i]] until the labels remain unchanged. In the label reduction step, we apply Algorithm 1 of [7] to all sites. The reduction method is used in only one direction for square lattices and in only two directions for simple cubic lattices. To resolve conflicts in the label update process, an atomic function is used in Algorithm 1 of [7], and each chain of labels, i.e., label[label], is constructed automatically. Finally, the analysis function is executed again. The cluster labeling algorithm in [7] uses a while loop within the label reduction step; however, the number of sites for which the while loop is executed is kept to a minimum.
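The analysis step maps naturally onto one GPU thread per site. The following CUDA sketch shows one way such an analysis kernel could look; the kernel name and configuration are illustrative assumptions, not code taken from [7].

    // Analysis step: each thread follows its site's label chain to the root.
    // Illustrative kernel; names and configuration are assumptions.
    __global__ void analysis(int *label, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        int l = label[i];
        while (l != label[l])   // follow the chain until the label is unchanged
            l = label[l];
        label[i] = l;           // write back the root label
    }

Each thread shortens its own chain independently, so the kernel needs no synchronization beyond the kernel launch itself.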
3. Generalized GPU-based cluster labeling algorithm
We now turn to the proposed algorithm. The algorithm in [7] is specialized to the case of square and simple cubic lattices; here we extend it to a generalized algorithm.
Figure 2: Initialization step of the generalized algorithm. The left panel shows the state of the NN list in Fig. 1. Each site generates its residual list from the NN list by keeping only those neighbors that are numbered lower than the site itself. Each site then has its label set to the lowest value in the residual list, label[i] = min, and this value is removed from the residual list. If the residual list is empty, the site is labeled with its own site number, i.e., label[i] = i. The right panel shows the state after the initialization step [step (i)].
Figure 3: Label reduction step of the generalized algorithm. The left panel shows the state after the first analysis step [step (ii)]. The right panel shows the state after label reduction [step (iii)], in which each site has executed Algorithm 1.
Figure 4: Final analysis step of the generalized algorithm. The left panel shows the state
after the end of the label reduction step [step (iii)]. The right panel shows the state after the
second analysis step [step (iv)].
The proposed algorithm can be applied to both lattice and non-lattice environments in a uniform fashion as long as a nearest-neighbor (NN) list is supplied. Figure 1 presents an example of such a list: each site has a list of its immediate neighbors. In this paper, we prepare the NN-list array in advance. The total size of the NN list equals the product of the number of sites and the largest NN-list size over all sites, and each site i accesses its NN list as NN_list[i], NN_list[i+N], NN_list[i+2N], ..., where N is the number of sites.
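This strided (structure-of-arrays) layout means that, within a warp, consecutive threads read consecutive addresses at each step k, giving coalesced memory access. Below is a minimal sketch of the traversal, assuming a padded array with a fixed maximum number of neighbors max_nn and the value -1 marking unused slots; both details are assumptions, not specified in the text.

    // Illustrative traversal of site i's neighbors in the strided NN list.
    // Assumes NN_list has size N * max_nn and is padded with -1.
    __device__ void visit_neighbors(const int *NN_list, int N, int max_nn, int i)
    {
        for (int k = 0; k < max_nn; k++) {
            int j = NN_list[i + k * N];  // coalesced across a warp
            if (j < 0) break;            // -1 marks the end of the list
            /* ... process neighbor j ... */
        }
    }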
The generalized CCL algorithm consists of the same four steps: (i) initialization, (ii) analysis, (iii) label reduction, and (iv) analysis. In the initialization step, we generate the residual list of each site from the NN list by keeping only those neighbors that are numbered lower than the site. Each site then has its label set to the lowest site number in the residual list, label[i] = min, and this number is removed from the residual list. If the residual list is empty, the site is labeled with its own site number, i.e., label[i] = i. Figure 2 shows an example of this procedure. The analysis function tracks the label from a given site to a new site determined by the value of the label at the given site: all sites compute label[i] = label[label[i]] until the labels remain unchanged. In the label reduction step, each site executes Algorithm 1, using the pairs formed by the site and each of the remaining sites in its residual list. The label reduction method allows each chain of labels, i.e., label[label], to be constructed automatically. Figure 3 illustrates this procedure with the actual output of our algorithm for the complex network in Fig. 1. Finally, each site is assigned its proper cluster label by executing the analysis function again, as illustrated in Fig. 4. A sketch of the initialization step in CUDA is given below.
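As a concrete illustration, the following CUDA sketch implements the initialization step under the padded NN-list layout assumed above; the kernel and array names are hypothetical.

    // Initialization step (i): build the residual list and the initial label.
    __global__ void initialize(const int *NN_list, int *residual, int *label,
                               int N, int max_nn)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        int min_label = i;  // default: the site labels itself
        int count = 0;
        for (int k = 0; k < max_nn; k++) {
            int j = NN_list[i + k * N];
            if (j < 0) break;
            if (j < i) {                      // keep only lower-numbered neighbors
                if (j < min_label) min_label = j;
                residual[i + count * N] = j;  // provisional residual entry
                count++;
            }
        }
        // remove the minimum from the residual list by compacting over it
        int out = 0;
        for (int k = 0; k < count; k++) {
            int j = residual[i + k * N];
            if (j != min_label) residual[i + (out++) * N] = j;
        }
        if (out < max_nn) residual[i + out * N] = -1;  // terminate the list
        label[i] = min_label;
    }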
Unlike CCL algorithms that use a single CPU, this method does not assign cluster labels serially: the label of each cluster is the minimum site number within that cluster. However, we can renumber the cluster labels into sequential labels by using, for example, the method shown in Fig. 4 of [14].
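For reference, one simple way to perform such a renumbering is to let each root site draw a sequential index from a global atomic counter and then have every site copy the new index of its root. The sketch below illustrates only this general idea; it is not the specific scheme of [14].

    // Hypothetical two-kernel renumbering. After the final analysis step,
    // label[i] is the root of site i's cluster, and a root r has label[r] == r.
    __global__ void number_roots(const int *label, int *new_label,
                                 int *counter, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        if (label[i] == i)                         // site i is a cluster root
            new_label[i] = atomicAdd(counter, 1);  // draw the next sequential id
    }

    __global__ void apply_numbering(const int *label, const int *new_label,
                                    int *out, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        out[i] = new_label[label[i]];  // adopt the root's sequential id
    }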
for j in residual list of site i do
    label_1 ← label[i];
    while label_1 ≠ label[label_1] do
        label_1 ← label[label_1];
    end
    label_2 ← label[j];
    while label_2 ≠ label[label_2] do
        label_2 ← label[label_2];
    end
    flag ← true;
    if label_1 ≠ label_2 then
        flag ← false;
    end
    if label_1 < label_2 then
        swap label_1 and label_2;
    end
    while flag = false do
        label_3 ← atomicMin(&label[label_1], label_2);
        if label_3 = label_2 then
            flag ← true;
        else if label_3 > label_2 then
            label_1 ← label_3;
        else if label_3 < label_2 then
            label_1 ← label_2;
            label_2 ← label_3;
        end
    end
end

Algorithm 1: Pseudo-code of the label reduction method used in this paper. Each site executes this algorithm in the label reduction step [step (iii)]. Here, i is a site, j is a nearest neighbor of site i, and label is the label at each site.
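In CUDA, the atomicMin call in Algorithm 1 maps directly onto the hardware atomic of the same name. The following sketch shows a label reduction kernel along these lines, again assuming the padded residual-list layout introduced above; the kernel name and layout are assumptions.

    // Label reduction step (iii): merge the label chains of site i and of each
    // site j remaining in its residual list, following Algorithm 1.
    __global__ void label_reduction(int *label, const int *residual,
                                    int N, int max_nn)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        for (int k = 0; k < max_nn; k++) {
            int j = residual[i + k * N];
            if (j < 0) break;
            int l1 = label[i];
            while (l1 != label[l1]) l1 = label[l1];  // root of i's chain
            int l2 = label[j];
            while (l2 != label[l2]) l2 = label[l2];  // root of j's chain
            if (l1 == l2) continue;                  // already the same cluster
            if (l1 < l2) { int t = l1; l1 = l2; l2 = t; }  // keep l1 > l2
            bool done = false;
            while (!done) {
                int l3 = atomicMin(&label[l1], l2);  // try to link l1 -> l2
                if (l3 == l2) {
                    done = true;                     // link established
                } else if (l3 > l2) {
                    l1 = l3;                         // retry from the new value
                } else {                             // l3 < l2: l1 already lower
                    l1 = l2;
                    l2 = l3;
                }
            }
        }
    }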
Figure 5: Spanning probability R_span(p) for bond percolation on the honeycomb lattice, for total lattice sizes N = 262144, 1048576, 4194304, and 16777216.
4. Results
As an application of the generalized CCL algorithm, we study bond percolation problems. Bond percolation creates bonds between neighboring sites with probability p. In bond percolation, one considers whether there is a cluster that spans the entire lattice; if there is such a spanning cluster, the system is regarded as percolating. There is a threshold value p_c that separates percolating and non-percolating behavior: as the system size approaches infinity, the probability that a spanning cluster is produced is zero for p < p_c and unity for p > p_c.
We test the correctness and performance of the proposed code on an NVIDIA TITAN X machine with the CUDA version 7.5 compiler. For the random-number generator, we use the XORWOW pseudorandom generator [15] with double precision, via the device API of the cuRAND library [16].
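For illustration, the bond variables can be generated on the device along the following lines. This is only a sketch of the cuRAND device-API usage described above; the kernel and array names are hypothetical, and creating a fresh generator state per call is the simplest, not the fastest, arrangement.

    #include <curand_kernel.h>

    // Activate each candidate bond with probability p using the XORWOW
    // generator of the cuRAND device API (curandState defaults to XORWOW).
    __global__ void generate_bonds(unsigned char *bond, int n_bonds,
                                   double p, unsigned long long seed)
    {
        int b = blockIdx.x * blockDim.x + threadIdx.x;
        if (b >= n_bonds) return;
        curandState state;
        curand_init(seed, b, 0, &state);   // distinct subsequence per bond
        bond[b] = (curand_uniform_double(&state) < p);  // open with prob. p
    }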
We first deal with bond percolation on the honeycomb and triangular lattices to check the correctness of our algorithm.
Figure 6: Spanning probability R_span(p) for bond percolation on the triangular lattice, for total lattice sizes N = 262144, 1048576, 4194304, and 16777216.
Figure 7: Average computational time for a single realization on the honeycomb and triangular lattices at probabilities p = 0.5p_c, p_c, 1.5p_c, plotted against the total lattice size N on a double logarithmic scale. We use the threshold values p_c^{hc} = 1 - 2 sin(π/18) for the honeycomb lattice and p_c^{tr} = 2 sin(π/18) for the triangular lattice.
For the measured quantity, we use the spanning probability R_span(p). Figure 5 shows the spanning probability R_span(p) for the honeycomb lattice, and Fig. 6 shows that for the triangular lattice. The total lattice sizes N are 262144, 1048576, 4194304, and 16777216. From these figures, we can see that all curves intersect at one point. From the crossing point, we estimate the threshold value on the honeycomb lattice to be p_c^{hc} = 0.6527 ± 0.0002 and that on the triangular lattice to be p_c^{tr} = 0.3472 ± 0.0002. These estimates are compatible with the exact values from [17], namely p_c^{hc} = 1 - 2 sin(π/18) ≈ 0.6527 and p_c^{tr} = 2 sin(π/18) ≈ 0.3473. These results show that our algorithm works correctly.
Secondly, we demonstrate the performance for bond percolation on the honeycomb and triangular lattices. Figure 7 gives a double logarithmic plot of the average computational time for a single realization on the two lattices. Because the computational time depends on the probability p, we show the average computational times at the probabilities p = 0.5p_c, p_c, and 1.5p_c on these lattices, using p_c^{hc} = 1 - 2 sin(π/18) as the threshold value for the honeycomb lattice and p_c^{tr} = 2 sin(π/18) for the triangular lattice; the total lattice sizes N are again 262144, 1048576, 4194304, and 16777216. From Fig. 7, we can see that the average computational time is proportional to the total lattice size at all probabilities. For each lattice, the computational time at 0.5p_c is the shortest of the three probabilities. The computational time for the honeycomb lattice with N = 16777216 at 0.5p_c^{hc} is 5.8 ms, and that for the triangular lattice with N = 16777216 at 0.5p_c^{tr} is 7.2 ms. Moreover, the computational time for the honeycomb lattice with N = 16777216 at 1.5p_c^{hc} is about 1.4 times greater than at 0.5p_c^{hc}, and that for the triangular lattice with N = 16777216 at 1.5p_c^{tr} is about 1.7 times greater than at 0.5p_c^{tr}. The differences between computational times at different probabilities are due to the growth of the residual lists, and the difference between the two lattices is due to their coordination numbers.
Next, we deal with bond percolation on the Bethe lattice as a substitute for a network structure.
Figure 8: An example of one realization of the standard position on the Bethe lattice with coordination number z = 3 at probability p = 1.5p_c. The number at each site is its index, and connections between sites in the same cluster are drawn in the same color.
Figure 9: An example of one realization of the random position on the Bethe lattice with coordination number z = 3 at probability p = 1.5p_c. The number at each site is its index, and connections between sites in the same cluster are drawn in the same color.
Figure 10: Average computational times for a single realization on the Bethe lattice with coordination number z = 3, for standard and random site positions, at probabilities p = 0.5p_c, p_c, 1.5p_c, plotted against the total lattice size N on a double logarithmic scale.
The Bethe lattice is a connected cycle-free graph in which each site is connected to z neighbors. The threshold value for bond percolation on the Bethe lattice is related to the coordination number z: for coordination number z, it is p_c = 1/(z - 1). Site numbers are usually assigned outward from a central site, which we call the standard position in this paper. The generalized CCL algorithm performs exceptionally well with this numbering because every residual list is empty after the initialization step at all probabilities: each site other than the center has exactly one lower-numbered neighbor. We therefore deal with two Bethe lattices with coordination number z = 3: one with standard position, where the site numbers are assigned from a central site, and the other with random position, where the site numbers are assigned at random. Figure 8 shows an example of one realization of the standard position on the Bethe lattice with z = 3 at probability p = 1.5p_c, and Fig. 9 shows a corresponding realization of the random position. The threshold value of the Bethe lattice with coordination number z = 3 is p_c = 1/2. Both examples are actual outputs of our implementation.
We examine the computational times of the generalized CCL algorithm for bond percolation on the two Bethe lattices. Figure 10 shows a double logarithmic plot of the average time required for one realization on the two Bethe lattices at probabilities p = 0.5p_c, p_c, 1.5p_c. We measure the average computational times for total lattice sizes N = 393214, 786430, 1572862, 3145726, 6291454, and 12582910. In the case of the standard position, the computational times are almost independent of the probability, and the computational time is proportional to the total lattice size. In contrast, in the case of the random position, the computational time depends on the probability. The computational times for the random position are proportional to the total lattice size for N ≤ 786430 at all probabilities, but deviate from this proportionality for larger N at all probabilities. This change is related to the amount of L2 cache, which is 3 MB on the TITAN X. The computational time is shortest when the probability is 0.5p_c. The computational time for the random position with N = 12582910 is 7.1 ms at 0.5p_c and 21 ms at 1.5p_c. Moreover, the computational time for the random position with N = 12582910 at 1.5p_c is about 8.0 times greater than that for the standard position with N = 12582910. Because the site numbers are assigned at random, the remaining sites in the residual list of each site are also scattered at random; this randomness causes a load imbalance. Thus, the computational times for the random position are greater than those for the standard position. Nevertheless, these results show that the generalized CCL algorithm can be applied directly to any structure without reordering.
5. Summary and discussion
We have proposed a generalized GPU-based CCL algorithm that can be applied to various lattices and to non-lattice environments. Because the algorithm in [7] is specialized to the case of square and simple cubic lattices, we have extended our previous labeling algorithm [7], which does not use conventional iteration, to the generalized method. The proposed algorithm can be applied directly to any structure without reordering.
We chose the bond percolation problem as a test application. We first dealt with bond percolation on the honeycomb and triangular lattices to confirm the correctness of our algorithm; using the spanning probability as the measured quantity, we reproduced the well-known exact thresholds. Secondly, we showed the performance for bond percolation on the honeycomb and triangular lattices. Because the computational time depends on the probability p, we reported the average computational times at the probabilities p = 0.5p_c, p_c, 1.5p_c. The computational time for the honeycomb lattice with N = 16777216 at 0.5p_c is 5.8 ms, and that for the triangular lattice with N = 16777216 at 0.5p_c is 7.2 ms. Finally, we dealt with bond percolation on the Bethe lattice as a substitute for a network structure. Because every residual list is empty at all probabilities when the site numbers are assigned outward from a central site, we dealt with two Bethe lattices with coordination number z = 3. The computational times for the random position, where the site numbers are assigned at random, are greater than those for the standard position, where the site numbers are assigned from a central site, at all probabilities. Nevertheless, the generalized CCL algorithm can be applied directly to any structure without reordering.
We finally emphasize the efficiency of the proposed algorithm. It is very powerful: it can be adapted both to a variety of lattices and to non-lattice environments in a uniform fashion as long as an NN list is supplied. That is, the algorithm can be adapted to any structure by simply replacing the NN list. Moreover, it is well suited to the GPU architecture, in which memory access latency is hidden by computation rather than by large data caches. The algorithm relies on indirect and random memory access and frequently performs simultaneous updates with atomic functions. Thus, it is unsuitable for traditional multiprocessor architectures, because it cannot be vectorized and frequently triggers cache-coherence traffic. We hope that many researchers will take a greater interest in developing parallel algorithms from the standpoint of computer architecture.
Acknowledgments
This work was supported by KAKENHI grant 15K21623 from the Japan
Society for the Promotion of Science.
References
[1] K. Suzuki, I. Horiba, N. Sugie, Fast connected-component labeling based on sequential local operations in the course of forward raster scan followed by backward raster scan, in: Proceedings of the 15th International Conference on Pattern Recognition, 2000, pp. 434-437.
[2] L. He, Y. Chao, K. Suzuki, K. Wu, Fast connected-component labeling, Pattern Recognition 42 (2009) 1977-1987.
[3] K.A. Hawick, A. Leist, D. P. Playne, Parallel Graph Component Labelling
with GPUs and CUDA, Parallel Computing 36 (2010) 655-678.
[4] O. Kalentev, A. Rai, S. Kemnitz, R. Schneider, Connected component la-
beling on a 2D grid using CUDA, J. Parallel Distrib. Comput. 71 (2011)
615-620.
[5] Y. Komura, Y. Okabe, GPU-based Swendsen-Wang multi-cluster algorithm
for the simulation of two-dimensional classical spin systems, Comput. Phys.
Comm. 183 (2012) 1155-1161.
[6] Y. Komura, Y. Okabe, CUDA programs for the GPU computing of the
Swendsen-Wang multi-cluster spin flip algorithm: 2D and 3D Ising, Potts,
and XY models, Comput. Phys. Comm. 185 (2014) 1038-1043.
[7] Y. Komura, GPU-based cluster-labeling algorithm without the use of con-
ventional iteration : Application to the Swendsen-Wang multi-cluster spin
flip algorithm, Comput. Phys. Comm. 194 (2015) 54-58.
[8] Y. Komura, Y. Okabe, Improved CUDA programs for GPU computing of
Swendsen-Wang multi-cluster spin flip algorithm: 2D and 3D Ising, Potts,
and XY models, Comput. Phys. Comm. 200 (2016) 400-401.
[9] A. Al-Futaisi, T. W. Patzek, Extension of Hoshen–Kopelman algorithm to non-lattice environments, Physica A: Statistical Mechanics and its Applications 321 (2003) 665-678.
[10] NVIDIA CUDA Zone, https://developer.nvidia.com/cuda-zone.
[11] R.H. Swendsen, J.S. Wang, Nonuniversal critical dynamics in Monte Carlo
simulations, Phys. Rev. Lett. 58 (1987) 86-88.
[12] J. Hoshen, R. Kopelman, Percolation and cluster distribution. I. Cluster
multiple labeling technique and critical concentration algorithm, Phys. Rev.
B 14 (1976) 3438-3445.
[13] F. Wende, T. Steinke, Swendsen-Wang multi-cluster algorithm for the 2D/3D Ising model on Xeon Phi and GPU, in: Proceedings of SC '13: The International Conference on High Performance Computing, Networking, Storage and Analysis, Article No. 83, 2013.
[14] Y. Komura, Multi-GPU-based Swendsen–Wang multi-cluster algorithm
with reduced data traffic, Comput. Phys. Comm. 195 (2015) 84-94.
[15] G. Marsaglia, Xorshift RNGs, Journal of Statistical Software 8 (2003) 1-6.
[16] cuRAND, CUDA Toolkit Documentation, http://docs.nvidia.com/cuda/curand/.
[17] M. F. Sykes, J. W. Essam, Exact Critical Percolation Probabilities for Site
and Bond Problems in Two Dimensions, J. Math. Phys. 5 (1964) 1117-1127.