
A generalized GPU-based connected component labeling algorithm

Yukihiro Komura*

RIKEN, Advanced Institute for Computational Science, 7-1-26 Minatojima-minami-machi, Chuo-ku, Kobe, Hyogo 650-0047, Japan

*Corresponding author. Email address: yukihiro.komura.ss@alum.riken.jp

Abstract

We propose a generalized GPU-based connected component labeling (CCL) algorithm that can be applied both to various lattices and to non-lattice environments in a uniform fashion. We extend our recent GPU-based CCL algorithm, which does not use conventional iteration, to this generalized method. As an application of the algorithm, we treat the bond percolation problem. We investigate bond percolation on the honeycomb and triangle lattices to confirm the correctness of the algorithm. Moreover, we treat bond percolation on the Bethe lattice as a substitute for a network structure, and demonstrate the performance of the algorithm on those lattices.

Keywords: connected component labeling, percolation theory, Bethe lattice, parallel computing, GPU

1. Introduction

The connected component labeling (CCL) algorithm has been used in image processing [1, 2, 3, 4], in applications in physics [5, 6, 7, 8], in percolation theory, and in problems involving porous rocks [9]. The majority of CCL algorithms were designed for lattice environments; only a few studies have considered CCL algorithms for non-lattice environments, in which the positions of the sites are arbitrary rather than being restricted to the discrete points of a regular lattice. Non-lattice environments arise not only in the percolation theory of disordered discs and spheres, but also in networks.

The performance of a single CPU core has remained almost unchanged for a decade. The number of cores, however, has increased each year, and recent advances in application performance have been achieved by exploiting multiple cores and many threads. Graphics accelerators are a common example of many-thread devices: they have evolved into highly parallel processors with very high memory bandwidth, and research has clearly shown that they can dramatically improve computing performance. The most widely used programming model for accelerators is CUDA [10], a parallel-computing platform and programming model developed by NVIDIA. It is essentially C/C++ with several extensions that allow functions to be executed directly on the NVIDIA GPU.

Connected component labeling algorithms for a single GPU have been proposed by many researchers. The majority of those CCL algorithms are designed for lattice environments and use a two-stage approach that divides the lattice into sub-blocks, which are treated independently and then merged. However, such a sub-block decomposition is difficult in non-lattice environments, so GPU-based CCL algorithms that do not rely on sub-block decomposition are needed for computation in non-lattice environments.

Several CCL algorithms for a single GPU that avoid the sub-block decomposition have been proposed. Hawick et al. [3] proposed a CCL algorithm called "label equivalence", and Kalentev et al. [4] improved it. Both of these algorithms rely on an iterative comparison with nearest-neighbor sites. More recently, the present author [7] proposed a single-GPU CCL algorithm that does not use this conventional iteration; its computation times proved to be about half those of the previous method [4] when applied to the Swendsen-Wang multi-cluster spin-flip algorithm [11].

The CCL algorithms described above focus on square and simple cubic lattices, and the algorithm in [7] is specialized to those cases. In this paper, we propose a generalized GPU-based CCL algorithm that can be applied both to various lattices and to non-lattice environments in a uniform fashion. The generalized method extends our recent GPU-based CCL algorithm [7], which does not use conventional iteration. To confirm the correctness and performance of this algorithm, we treat bond percolation problems. This paper is organized as follows. In Section 2, we briefly review the GPU-based CCL algorithm that serves as our starting point [7]. In Section 3, we describe the new generalized algorithm. In Section 4, we first show the results of bond percolation on the honeycomb and triangle lattices to confirm correctness; we then adapt the generalized algorithm to bond percolation on the Bethe lattice as a substitute for a network structure and report the resulting performance. Finally, a summary and discussion are presented in Section 5.

2. Connected Component Labeling

Connected component labeling algorithms assign a proper cluster label to each site on the basis of local connection information. The Hoshen-Kopelman algorithm [12] and the CCL algorithm proposed by Suzuki et al. [1] are the best-known single-CPU CCL algorithms. Most single-CPU CCL algorithms are specialized to square and simple cubic lattices, and they are realized by sequential computation. Sequential algorithms cannot be applied to GPU computation, so most single-CPU CCL algorithms cannot be ported directly to the GPU; a suitable new algorithm is required.

We briefly review the CCL algorithm without conventional iteration [7]. The cornerstone of this algorithm is the method of label reduction proposed by Wende et al. [13]; the procedure is illustrated in Fig. 1 of [7]. The labeling consists of four steps: (i) initialization, (ii) analysis, (iii) label reduction, and (iv) analysis. This method does not require an iterative comparison with nearest-neighbor sites: the number of such comparisons is 1 for a square lattice and 2 for a simple cubic lattice if periodic boundary conditions are not used.

site  NN list       site  NN list
0     3,10          11    4,20
1     4,5,15        12    16,19,21
2     17,18         13    5,15
3     0,8,9,21      14    19,21
4     1,11          15    1,13
5     1,13,20       16    12,21
6     8,9           17    2,18
7     20            18    2,17
8     3,6           19    12,14
9     3,6,10        20    5,7,11
10    0,9           21    3,12,14,16

Figure 1: Example of the nearest-neighbor (NN) list for a complex network. The number at each site is its index, and each site has an NN list.

In the initialization function, a label is assigned to each site based on its connections: each site has label[i] = min, where min is the lowest site number among the connected sites. The analysis function tracks the label from a given site to a new site determined by the value of the label at the given site: every site computes label[i] = label[label[i]] until label remains unchanged. In the label reduction step, we apply Algorithm 1 of [7] to all sites; the reduction is needed in only one direction for square lattices and in only two directions for simple cubic lattices. To resolve conflicts in the label update process, an atomic function is used in Algorithm 1 of [7], and each chain of labels, i.e., label[label], is constructed automatically. Finally, the analysis function is executed again. The cluster labeling algorithm in [7] uses a while loop within the label reduction step, but the number of sites for which the while loop is executed is kept to a minimum.
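To make the analysis step concrete, the following is a minimal CUDA sketch of it; the kernel name and one-thread-per-site launch configuration are our own illustration, not code from [7]. Each thread follows its site's label chain to the root and writes the root back.

__global__ void analysis(int *label, int N)
{
    // i is the site handled by this thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int l = label[i];
    // follow the chain label -> label[label] until it no longer changes
    while (l != label[l]) l = label[l];
    label[i] = l;  // store the root so that later lookups are short
}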

3. Generalized GPU-based cluster labeling algorithm

We now turn to the proposed algorithm. The algorithm in [7] is specialized to square and simple cubic lattices; we extend it here to a generalized algorithm that can be applied both to lattices and to non-lattice environments in a uniform fashion, as long as a nearest-neighbor (NN) list is supplied.


Left panel (NN list of Fig. 1):

site  NN list       site  NN list
0     3,10          11    4,20
1     4,5,15        12    16,19,21
2     17,18         13    5,15
3     0,8,9,21      14    19,21
4     1,11          15    1,13
5     1,13,20       16    12,21
6     8,9           17    2,18
7     20            18    2,17
8     3,6           19    12,14
9     3,6,10        20    5,7,11
10    0,9           21    3,12,14,16

Right panel (after initialization):

site  label  residual list    site  label  residual list
0     0      -                11    4      -
1     1      -                12    12     -
2     2      -                13    5      -
3     0      -                14    14     -
4     1      -                15    1      13
5     1      -                16    12     -
6     6      -                17    2      -
7     7      -                18    2      17
8     3      6                19    12     14
9     3      6                20    5      7,11
10    0      9                21    3      12,14,16

Figure 2: Initialization step of the generalized algorithm. The left panel shows the state of the NN list in Fig. 1. Each site generates its residual list from the NN list by keeping only those neighbors that are numbered lower than the site. Moreover, each site has its label set to the lowest value in the residual list, label[i] = min, and this number is removed from the residual list. If the residual list is empty, the site is labeled with its own site number, i.e., label[i] = i. The right panel is the state after the initialization step [step (i)].

Left panel (after the first analysis step):

site  label  residual list    site  label  residual list
0     0      -                11    1      -
1     1      -                12    12     -
2     2      -                13    1      -
3     0      -                14    14     -
4     1      -                15    1      13
5     1      -                16    12     -
6     6      -                17    2      -
7     7      -                18    2      17
8     0      6                19    12     14
9     0      6                20    1      7,11
10    0      9                21    0      12,14,16

Right panel (after label reduction):

site  label  residual list    site  label  residual list
0     0      -                11    1      -
1     1      -                12    0      -
2     2      -                13    1      -
3     0      -                14    0      -
4     1      -                15    1      13
5     1      -                16    12     -
6     0      -                17    2      -
7     1      -                18    2      17
8     0      6                19    12     14
9     0      6                20    1      7,11
10    0      9                21    0      12,14,16

Figure 3: Label reduction step of the generalized algorithm. The left panel shows the state after the end of the first analysis step [step (ii)]. The right panel shows the state after label reduction [step (iii)], in which each site executes Algorithm 1.

Left panel (after label reduction):

site  label  residual list    site  label  residual list
0     0      -                11    1      -
1     1      -                12    0      -
2     2      -                13    1      -
3     0      -                14    0      -
4     1      -                15    1      13
5     1      -                16    12     -
6     0      -                17    2      -
7     1      -                18    2      17
8     0      6                19    12     14
9     0      6                20    1      7,11
10    0      9                21    0      12,14,16

Right panel (after the second analysis step):

site  label  residual list    site  label  residual list
0     0      -                11    1      -
1     1      -                12    0      -
2     2      -                13    1      -
3     0      -                14    0      -
4     1      -                15    1      13
5     1      -                16    0      -
6     0      -                17    2      -
7     1      -                18    2      17
8     0      6                19    0      14
9     0      6                20    1      7,11
10    0      9                21    0      12,14,16

Figure 4: Final analysis step of the generalized algorithm. The left panel shows the state after the end of the label reduction step [step (iii)]. The right panel shows the state after the second analysis step [step (iv)].


Figure 1 presents an example of such a list: each site has a list of its immediate neighbors. In this paper, we prepare the NN-list array in advance. The total size of the NN-list array equals the number of sites multiplied by the largest NN-list size over all sites, and site i accesses its neighbors as NN_list[i], NN_list[i+N], NN_list[i+2N], ..., where N is the number of sites.
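As an illustration of this memory layout, the following host-side C++ sketch packs per-site neighbor lists into a single padded array in the order described above; the helper name and the use of -1 as an empty-slot marker are our own assumptions, not the paper's code.

#include <algorithm>
#include <vector>

std::vector<int> pack_nn_list(const std::vector<std::vector<int>> &nn)
{
    const int N = (int)nn.size();
    int max_deg = 0;  // largest NN-list size over all sites
    for (const auto &list : nn)
        max_deg = std::max(max_deg, (int)list.size());
    // total size = number of sites * largest NN-list size; -1 marks unused slots
    std::vector<int> NN_list((size_t)N * max_deg, -1);
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < (int)nn[i].size(); ++k)
            NN_list[i + (size_t)k * N] = nn[i][k];  // k-th neighbor of site i at i + k*N
    return NN_list;
}

With this layout, consecutive threads read consecutive addresses for each k, which yields coalesced global-memory accesses on the GPU.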

The generalized CCL algorithm consists of the same four steps: (i) initialization, (ii) analysis, (iii) label reduction, and (iv) analysis. In the initialization step, we generate the residual list of each site from its NN list by keeping only those neighbors that are numbered lower than the site. Each site then has its label set to the lowest site number in the residual list, label[i] = min, and this number is removed from the residual list. If the residual list is empty, the site is labeled with its own site number, i.e., label[i] = i. Figure 2 shows an example of this procedure. The analysis function tracks the label from a given site to a new site determined by the value of the label at the given site: every site computes label[i] = label[label[i]] until label remains unchanged. In the label reduction step, each site executes Algorithm 1; the calculation at each site uses the pairs formed by the site and the remaining sites in its residual list. The label reduction method allows each chain of labels, i.e., label[label], to be constructed automatically. Figure 3 illustrates this procedure with the actual output of our algorithm for the complex network in Fig. 1. Finally, each site is assigned the proper cluster label by executing the analysis function again, as illustrated in Fig. 4.
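For concreteness, a minimal CUDA sketch of the initialization step is given below. The array names, the -1 padding convention, and the column-major residual-list storage are our own assumptions; this is not the paper's published code.

__global__ void initialize(const int *NN_list, int *label,
                           int *residual, int *res_count, int N, int max_deg)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int min_site = i;   // default label: the site's own index
    int count = 0;      // length of the residual list of site i
    for (int k = 0; k < max_deg; ++k) {
        int j = NN_list[i + k * N];
        if (j < 0 || j >= i) continue;        // keep only lower-numbered neighbors
        if (j < min_site) {
            if (min_site != i)                // previous minimum goes back to the list
                residual[i + (count++) * N] = min_site;
            min_site = j;                     // new lowest neighbor becomes the label
        } else {
            residual[i + (count++) * N] = j;  // all other lower neighbors are kept
        }
    }
    label[i] = min_site;
    res_count[i] = count;
}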

Unlike CCL algorithms that use a single CPU, this method does not assign cluster labels sequentially: the label of each cluster is the minimum site number in that cluster. However, we can renumber the cluster labels into consecutive labels, for example by using the method shown in Fig. 4 of [14].
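One common GPU renumbering scheme, sketched below under our own assumptions (it is not necessarily the method of [14]), lets each cluster root draw the next free label with an atomic counter, after which every site copies its root's new label:

__global__ void assign_sequential(const int *label, int *new_label,
                                  int *counter, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    if (label[i] == i)                        // site i is the root of its cluster
        new_label[i] = atomicAdd(counter, 1); // draw the next sequential label
}

__global__ void apply_sequential(const int *label, const int *new_label,
                                 int *out, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    out[i] = new_label[label[i]];  // after the final analysis, label[i] is the root
}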


Algorithm 1: Pseudocode of the label reduction method used in this paper. Each site executes this algorithm in the label reduction step [step (iii)]. Here, i is a site, j is a nearest neighbor of site i, and label is the label at each site.

for j in the residual list of site i do
    label_1 <- label[i]
    while label_1 != label[label_1] do
        label_1 <- label[label_1]
    end
    label_2 <- label[j]
    while label_2 != label[label_2] do
        label_2 <- label[label_2]
    end
    flag <- true
    if label_1 != label_2 then
        flag <- false
    end
    if label_1 < label_2 then
        tmp <- label_1; label_1 <- label_2; label_2 <- tmp
    end
    while flag = false do
        label_3 <- atomicMin(&label[label_1], label_2)
        if label_3 = label_2 then
            flag <- true
        else if label_3 > label_2 then
            label_1 <- label_3
        else if label_3 < label_2 then
            label_1 <- label_2; label_2 <- label_3
        end
    end
end
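A direct CUDA realization of Algorithm 1 might look as follows. This is a sketch under the storage assumptions used earlier (residual lists stored column-major with per-site counts), not the paper's published code.

__global__ void label_reduction(int *label, const int *residual,
                                const int *res_count, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    for (int k = 0; k < res_count[i]; ++k) {
        int j = residual[i + k * N];  // j is a number in the residual list of site i
        int label_1 = label[i];
        while (label_1 != label[label_1]) label_1 = label[label_1];  // root of i
        int label_2 = label[j];
        while (label_2 != label[label_2]) label_2 = label[label_2];  // root of j
        bool flag = (label_1 == label_2);
        if (label_1 < label_2) { int tmp = label_1; label_1 = label_2; label_2 = tmp; }
        while (!flag) {
            // hang the larger root below the smaller one; atomicMin returns the old value
            int label_3 = atomicMin(&label[label_1], label_2);
            if (label_3 == label_2)      flag = true;        // link already in place
            else if (label_3 > label_2)  label_1 = label_3;  // retry with the displaced root
            else { label_1 = label_2; label_2 = label_3; }   // a smaller root appeared
        }
    }
}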


[Plot: spanning probability <R_span(p)> versus bond probability p in the range p = 0.64-0.66 for N = 262144, 1048576, 4194304, and 16777216.]

Figure 5: Spanning probability for bond percolation on the honeycomb lattice.

4. Results

As an application of the generalized CCL algorithm, we study bond percolation problems. Bond percolation creates bonds between neighboring sites with probability p, and one asks whether there is a cluster that spans the entire lattice. If such a spanning cluster exists, the system is regarded as percolating. There is a threshold value p_c that separates percolating from non-percolating behavior: as the system size approaches infinity, the probability that a spanning cluster is produced tends to zero for p < p_c and to unity for p > p_c.

We test the correctness and performance of the proposed code on an NVIDIA TITAN X machine with the CUDA version 7.5 compiler. For the random-number generator, we use the XORWOW pseudorandom generator [15] with double precision via the device API of the cuRAND library [16].
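As an illustration of drawing a bond configuration on the device with the cuRAND device API, the following sketch marks each bond open with probability p; the bond indexing and the kernel itself are our own illustration, not the paper's code (XORWOW is cuRAND's default device generator).

#include <curand_kernel.h>

__global__ void generate_bonds(unsigned char *open, int n_bonds,
                               double p, unsigned long long seed)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= n_bonds) return;
    curandState state;                // XORWOW state
    curand_init(seed, b, 0, &state);  // one subsequence per bond
    // curand_uniform_double returns a value in (0, 1]
    open[b] = (curand_uniform_double(&state) < p) ? 1 : 0;
}

The open bonds then define the NN list that is fed to the labeling algorithm.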

We first treat bond percolation on the honeycomb and triangle lattices to check the correctness of our algorithm. As the measured quantity, we use the spanning probability R_span(p).


[Plot: spanning probability <R_span(p)> versus bond probability p in the range p = 0.34-0.35 for N = 262144, 1048576, 4194304, and 16777216.]

Figure 6: Spanning probability for bond percolation on the triangle lattice.

[Double logarithmic plot: average computational time (ms) versus N for the honeycomb and triangle lattices at p = 0.5p_c, p_c, and 1.5p_c.]

Figure 7: Average computational time for a single realization on the honeycomb and triangle lattices for probabilities p = 0.5p_c, p_c, and 1.5p_c. We use the threshold value p_c^hc = 1 − 2 sin(π/18) for the honeycomb lattice and p_c^tr = 2 sin(π/18) for the triangle lattice.


Figure 5 shows the spanning probability R_span(p) for the honeycomb lattice, and Fig. 6 shows the spanning probability for the triangle lattice. The total lattice sizes N are 262144, 1048576, 4194304, and 16777216. These figures show that all curves intersect at one point. From the crossing point, we estimate the threshold value on the honeycomb lattice to be p_c^hc = 0.6527 ± 0.0002 and that on the triangle lattice to be p_c^tr = 0.3472 ± 0.0002. These estimates are compatible with the exact values from [17], namely p_c^hc = 1 − 2 sin(π/18) and p_c^tr = 2 sin(π/18). These results show that our algorithm works correctly.

Secondly, we demonstrate the performance for bond percolation on the honeycomb and triangle lattices. Figure 7 gives a double logarithmic plot of the average computational time for a single realization on each lattice. Because the computational time depends on the probability p, we show the average computational times at p = 0.5p_c, p_c, and 1.5p_c, using p_c^hc = 1 − 2 sin(π/18) as the threshold value for the honeycomb lattice and p_c^tr = 2 sin(π/18) for the triangle lattice; the total lattice sizes N are 262144, 1048576, 4194304, and 16777216. Figure 7 shows that the average computational time is proportional to the total lattice size for all probabilities. On each lattice, the computational time at 0.5p_c is the shortest of the three probabilities. The computational time for the honeycomb lattice with N = 16777216 at 0.5p_c^hc is 5.8 ms, and that for the triangle lattice with N = 16777216 at 0.5p_c^tr is 7.2 ms. Moreover, the computational time for the honeycomb lattice with N = 16777216 at 1.5p_c^hc is about 1.4 times greater than at 0.5p_c^hc, and that for the triangle lattice is about 1.7 times greater than at 0.5p_c^tr. The differences between the computational times at different probabilities are due to the growth of the residual lists, and the difference between the two lattices is due to the coordination number.

Next, we treat bond percolation on the Bethe lattice as a substitute for a network structure.



Figure 8: An example of one realization in the standard position on the Bethe lattice with coordination number z = 3 at probability p = 1.5p_c. The number at each site is its index, and connections between sites in the same cluster are drawn in the same color.



Figure 9: An example of one realization in the random position on the Bethe lattice with coordination number z = 3 at probability p = 1.5p_c. The number at each site is its index, and connections between sites in the same cluster are drawn in the same color.


[Double logarithmic plot: average computational time (ms) versus N on the Bethe lattice (z = 3) for standard and random site positions at p = 0.5p_c, p_c, and 1.5p_c.]

Figure 10: Average computational times for a single realization on the Bethe lattice with coordination number z = 3 for standard site positions and random site positions at probabilities p = 0.5p_c, p_c, and 1.5p_c.

The Bethe lattice is a connected cycle-free graph in which each site is connected to z neighbors. The threshold value for bond percolation on the Bethe lattice is related to the coordination number z: it is 1/(z − 1). We usually assign site numbers starting from a center site, which we call the standard position in this paper. The generalized CCL algorithm performs extremely well with this numbering because each site then has at most one lower-numbered neighbor (its parent), so the residual lists of all sites are empty at all probabilities. We therefore treat two versions of the Bethe lattice with coordination number z = 3: the standard position, where the site numbers are assigned from a center site, and the random position, where the site numbers are assigned at random. Figure 8 shows an example of one realization in the standard position on the Bethe lattice with z = 3 at probability p = 1.5p_c, and Fig. 9 shows an example of one realization in the random position at the same probability. The threshold value of the Bethe lattice with coordination number z = 3 is p_c = 1/2. These examples are actual output of the algorithm.
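To make the standard-position numbering concrete, the following host-side sketch builds the NN list of a finite Bethe lattice (Cayley tree) with coordination number z, numbering sites breadth-first from the center. This is our own illustrative construction, not the paper's code; by construction, each site's only lower-numbered neighbor is its parent, so the residual lists are empty.

#include <vector>

std::vector<std::vector<int>> bethe_nn_list(int z, int shells)
{
    std::vector<std::vector<int>> nn(1);  // site 0 is the center
    std::vector<int> frontier = {0};
    for (int s = 0; s < shells; ++s) {
        std::vector<int> next;
        for (int parent : frontier) {
            // the center has z children; interior sites keep one bond to the parent
            int children = (parent == 0) ? z : z - 1;
            for (int c = 0; c < children; ++c) {
                int id = (int)nn.size();
                nn.push_back({parent});    // the child sees its parent...
                nn[parent].push_back(id);  // ...and the parent sees the child
                next.push_back(id);
            }
        }
        frontier.swap(next);
    }
    return nn;
}

A random position can then be obtained by applying a random permutation to the site indices of this list.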

We examine the computational times of the generalized CCL algorithm for bond percolation on the two Bethe lattices. Figure 10 shows a double logarithmic plot of the average time required for one realization on the two Bethe lattices at probabilities p = 0.5p_c, p_c, and 1.5p_c. We measure the average computational times for total lattice sizes N = 393214, 786430, 1572862, 3145726, 6291454, and 12582910. In the standard position, the computational times are almost independent of the probability, and the computational time is proportional to the total lattice size. In contrast, in the random position, the computational time depends on the probability. For the random position, the times for total lattice sizes N ≤ 786430 are proportional to the lattice size at all probabilities, while those for N ≥ 786430 deviate from proportionality at all probabilities; this change is related to the amount of L2 cache, which is 3 MB on the TITAN X. The computational time is shortest at 0.5p_c. For the random position with N = 12582910, the computational time is 7.1 ms at 0.5p_c and 21 ms at 1.5p_c; the latter is about 8.0 times greater than that for the standard position with the same N. Because the site numbers are assigned at random, the numbers of remaining sites in the residual lists are also random, and this randomness causes a load imbalance. Thus, the computational times for the random position are greater than those for the standard position. Nevertheless, these results show that the generalized CCL algorithm can be applied directly to any structure without reordering.

5. Summary and discussion

We have proposed a generalized GPU-based CCL algorithm that can be applied both to various lattices and to non-lattice environments. Because the algorithm in [7] is specialized to square and simple cubic lattices, we have extended our previous labeling algorithm [7], which does not use conventional iteration, to the generalized method. The proposed algorithm can be applied directly to any structure without reordering.

We chose the bond percolation problem as a test application. We first treated bond percolation on the honeycomb and triangle lattices to confirm the correctness of our algorithm: using the spanning probability as the measured quantity, we reproduced well-known results. Secondly, we measured the performance for bond percolation on the honeycomb and triangle lattices. Because the computational time depends on the probability p, we reported the average computational times at p = 0.5p_c, p_c, and 1.5p_c. The computational time for the honeycomb lattice with N = 16777216 at 0.5p_c is 5.8 ms, and that for the triangle lattice with N = 16777216 at 0.5p_c is 7.2 ms. Finally, we treated bond percolation on the Bethe lattice as a substitute for a network structure. Because the residual lists of all sites are empty at all probabilities when the site numbers are assigned starting from a center site, we treated two versions of the Bethe lattice with coordination number z = 3. The computational times for the random position, where the site numbers are assigned at random, are greater at all probabilities than those for the standard position, where the site numbers are assigned from a center site. Nevertheless, the generalized CCL algorithm can be applied directly to any structure without reordering.

Finally, we emphasize the efficiency of the proposed algorithm. It is very powerful: it can be adapted both to a variety of lattices and to non-lattice environments in a uniform fashion as long as an NN list is supplied. That is, it can be adapted to any structure by simply replacing the NN list. Moreover, the algorithm suits the GPU architecture, in which memory access latency can be hidden by computation instead of large data caches. The algorithm relies on indirect and random memory accesses and frequently triggers simultaneous updates with atomic functions. It is therefore unsuitable for traditional multiprocessor architectures, because it cannot be vectorized and frequently causes cache-coherence traffic. We hope that many researchers will take a greater interest in developing parallel algorithms from the standpoint of architecture.

Acknowledgments

This work was supported by KAKENHI grant 15K21623 from the Japan

Society for the Promotion of Science.

References

[1] K. Suzuki, I. Horiba, N. Sugie, Fast connected-component labeling based on sequential local operations in the course of forward raster scan followed by backward raster scan, Proceedings of the 15th International Conference on Pattern Recognition (2000) 434-437.

[2] L. He, Y. Chao, K. Suzuki, K. Wu, Fast connected-component labeling, Pattern Recognition 42 (2009) 1977-1987.

[3] K.A. Hawick, A. Leist, D. P. Playne, Parallel Graph Component Labelling

with GPUs and CUDA, Parallel Computing 36 (2010) 655-678.

[4] O. Kalentev, A. Rai, S. Kemnitz, R. Schneider, Connected component la-

beling on a 2D grid using CUDA, J. Parallel Distrib. Comput. 71 (2011)

615-620.

[5] Y. Komura, Y. Okabe, GPU-based Swendsen-Wang multi-cluster algorithm

for the simulation of two-dimensional classical spin systems, Comput. Phys.

Comm. 183 (2012) 1155-1161.

[6] Y. Komura, Y. Okabe, CUDA programs for the GPU computing of the

Swendsen-Wang multi-cluster spin ﬂip algorithm: 2D and 3D Ising, Potts,

and XY models, Comput. Phys. Comm. 185 (2014) 1038-1043.

[7] Y. Komura, GPU-based cluster-labeling algorithm without the use of con-

ventional iteration : Application to the Swendsen-Wang multi-cluster spin

ﬂip algorithm, Comput. Phys. Comm. 194 (2015) 54-58.


[8] Y. Komura, Y. Okabe, Improved CUDA programs for GPU computing of

Swendsen-Wang multi-cluster spin ﬂip algorithm: 2D and 3D Ising, Potts,

and XY models, Comput. Phys. Comm. 200 (2016) 400-401.

[9] A. Al-Futaisi, T. W. Patzek, Extension of Hoshen–Kopelman algorithm to non-lattice environments, Physica A: Statistical Mechanics and its Applications 321 (2003) 665-678.

[10] NVIDIA CUDA ZONE https://developer.nvidia.com/cuda-zone

[11] R.H. Swendsen, J.S. Wang, Nonuniversal critical dynamics in Monte Carlo

simulations, Phys. Rev. Lett. 58 (1987) 86-88.

[12] J. Hoshen, R. Kopelman, Percolation and cluster distribution. I. Cluster

multiple labeling technique and critical concentration algorithm, Phys. Rev.

B 14 (1976) 3438-3445.

[13] F. Wende, T. Steinke, Swendsen-Wang multi-cluster algorithm for the 2D/3D Ising model on Xeon Phi and GPU, in: Proceedings of SC '13: International Conference for High Performance Computing, Networking, Storage and Analysis, Article No. 83 (2013).

[14] Y. Komura, Multi-GPU-based Swendsen–Wang multi-cluster algorithm

with reduced data traﬃc, Comput. Phys. Comm. 195 (2015) 84-94.

[15] G. Marsaglia, Xorshift RNGs, Journal of Statistical Software 8 (2003) 1-6.

[16] cuRAND CUDA Toolkit Documentation,

http://docs.nvidia.com/cuda/curand/

[17] M. F. Sykes, J. W. Essam, Exact Critical Percolation Probabilities for Site

and Bond Problems in Two Dimensions, J. Math. Phys. 5 (1964) 1117-1127.
