
A generalized GPU-based connected component labeling algorithm

Yukihiro Komura*

RIKEN, Advanced Institute for Computational Science, 7-1-26 Minatojima-minami-machi, Chuo-ku, Kobe, Hyogo 650-0047, Japan

*Corresponding author. Email address: yukihiro.komura.ss@alum.riken.jp

Abstract

We propose a generalized GPU-based connected component labeling (CCL) algorithm that can be applied both to various lattices and to non-lattice environments in a uniform fashion. We extend our recent GPU-based CCL algorithm, which does not use conventional iteration, to this generalized method. As an application of the algorithm, we treat the bond percolation problem. We investigate bond percolation on the honeycomb and triangle lattices to confirm the correctness of the algorithm. Moreover, we treat bond percolation on the Bethe lattice as a substitute for a network structure, and demonstrate the performance of the algorithm on those lattices.

Keywords: connected component labeling, percolation theory, Bethe lattice, parallel computing, GPU

1. Introduction

The connected component labeling (CCL) algorithm has been used in image processing [1, 2, 3, 4], in applications in physics [5, 6, 7, 8], in percolation theory, and in problems involving porous rocks [9]. The majority of CCL algorithms were designed for lattice environments; only a few studies have considered CCL algorithms for non-lattice environments, in which the positions of the sites are arbitrary rather than being restricted to the discrete points of a regular lattice. Non-lattice environments arise not only in the percolation theory of disordered discs and spheres, but also in networks.

The performance of a single CPU core has remained almost unchanged for a decade. The number of cores, however, has increased each year, and recent advances in application performance have been achieved by exploiting multiple cores and many threads. Graphics accelerators are a common example of many-thread devices: they have evolved into highly parallel processors with very high memory bandwidth, and research has clearly shown that they can dramatically improve computing performance. The most widely used programming model for accelerators is CUDA [10], a parallel-computing platform and programming model developed by NVIDIA. It is essentially C/C++ with several extensions that allow functions to be executed directly on the NVIDIA GPU.

Connected component labeling algorithms for a single GPU have been proposed by many researchers. The majority of those CCL algorithms are designed for lattice environments and use a two-stage approach that divides the lattice into sub-blocks, which are treated independently and then merged. However, such a sub-block decomposition is difficult in non-lattice environments, so GPU-based CCL algorithms that do not rely on sub-block decomposition are needed for computation in non-lattice environments.

Several CCL algorithms for a single GPU that avoid the sub-block decomposition have been proposed. Hawick et al. [3] proposed a CCL algorithm called "label equivalence", and Kalentev et al. [4] improved it. Both of these algorithms rely on an iterative comparison with nearest-neighbor sites. More recently, the present author [7] proposed a single-GPU CCL algorithm that does not use this conventional iteration; its computation times proved to be about half those of the previous method [4] when applied to the Swendsen-Wang multi-cluster spin-flip algorithm [11].

The CCL algorithms described above focus on square and simple cubic lattices, and the algorithm in [7] is specialized to those cases. In this paper, we propose a generalized GPU-based CCL algorithm that can be applied both to various lattices and to non-lattice environments in a uniform fashion. The generalized method extends our recent GPU-based CCL algorithm [7], which does not use conventional iteration. To confirm the correctness and performance of this algorithm, we treat bond percolation problems. This paper is organized as follows. In Section 2, we briefly review the GPU-based CCL algorithm that serves as our starting point [7]. In Section 3, we describe the new generalized algorithm. In Section 4, we first show the results of bond percolation on the honeycomb and triangle lattices to confirm correctness; we then adapt the generalized algorithm to bond percolation on the Bethe lattice as a substitute for a network structure and report the resulting performance. Finally, a summary and discussion are presented in Section 5.

2. Connected Component Labeling

Connected component labeling algorithms assign a proper cluster label to each site on the basis of local connection information. The Hoshen-Kopelman algorithm [12] and the CCL algorithm proposed by Suzuki et al. [1] are the best-known single-CPU CCL algorithms. Most single-CPU CCL algorithms are specialized to square and simple cubic lattices, and they are realized by sequential computation. Sequential algorithms cannot be applied to GPU computation, so most single-CPU CCL algorithms cannot be ported directly to the GPU; a suitable new algorithm is required.

We briefly review the CCL algorithm without conventional iteration [7]. The cornerstone of this algorithm is the method of label reduction proposed by Wende et al. [13]; the procedure is illustrated in Fig. 1 of [7]. The labeling consists of four steps: (i) initialization, (ii) analysis, (iii) label reduction, and (iv) analysis. This method does not require an iterative comparison with nearest-neighbor sites: the number of such comparisons is 1 for a square lattice and 2 for a simple cubic lattice if periodic boundary conditions are not used.

site  NN list       site  NN list
0     3,10          11    4,20
1     4,5,15        12    16,19,21
2     17,18         13    5,15
3     0,8,9,21      14    19,21
4     1,11          15    1,13
5     1,13,20       16    12,21
6     8,9           17    2,18
7     20            18    2,17
8     3,6           19    12,14
9     3,6,10        20    5,7,11
10    0,9           21    3,12,14,16

Figure 1: Example of the nearest-neighbor (NN) list for a complex network. The number at each site is its index, and each site has an NN list.

In the initialization function, a label is assigned to each site based on its connections: each site has label[i] = min, where min is the lowest site number among the connected sites. The analysis function tracks the label from a given site to a new site determined by the value of the label at the given site: every site computes label[i] = label[label[i]] until label remains unchanged. In the label reduction step, we apply Algorithm 1 of [7] to all sites; the reduction is needed in only one direction for square lattices and in only two directions for simple cubic lattices. To resolve conflicts in the label update process, an atomic function is used in Algorithm 1 of [7], and each chain of labels, i.e., label[label], is constructed automatically. Finally, the analysis function is executed again. The cluster labeling algorithm in [7] uses a while loop within the label reduction step, but the number of sites for which the while loop is executed is kept to a minimum.
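To make the analysis step concrete, the following is a minimal CUDA sketch of it; the kernel name and one-thread-per-site launch configuration are our own illustration, not code from [7]. Each thread follows its site's label chain to the root and writes the root back.

__global__ void analysis(int *label, int N)
{
    // i is the site handled by this thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int l = label[i];
    // follow the chain label -> label[label] until it no longer changes
    while (l != label[l]) l = label[l];
    label[i] = l;  // store the root so that later lookups are short
}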

3. Generalized GPU-based cluster labeling algorithm

We now turn to the proposed algorithm. The algorithm in [7] is specialized to square and simple cubic lattices; we extend it here to a generalized algorithm that can be applied both to lattices and to non-lattice environments in a uniform fashion, as long as a nearest-neighbor (NN) list is supplied.


Left panel (NN list of Fig. 1):

site  NN list       site  NN list
0     3,10          11    4,20
1     4,5,15        12    16,19,21
2     17,18         13    5,15
3     0,8,9,21      14    19,21
4     1,11          15    1,13
5     1,13,20       16    12,21
6     8,9           17    2,18
7     20            18    2,17
8     3,6           19    12,14
9     3,6,10        20    5,7,11
10    0,9           21    3,12,14,16

Right panel (after initialization):

site  label  residual list    site  label  residual list
0     0      -                11    4      -
1     1      -                12    12     -
2     2      -                13    5      -
3     0      -                14    14     -
4     1      -                15    1      13
5     1      -                16    12     -
6     6      -                17    2      -
7     7      -                18    2      17
8     3      6                19    12     14
9     3      6                20    5      7,11
10    0      9                21    3      12,14,16

Figure 2: Initialization step of the generalized algorithm. The left panel shows the state of the NN list in Fig. 1. Each site generates its residual list from the NN list by keeping only those neighbors that are numbered lower than the site. Moreover, each site has its label set to the lowest value in the residual list, label[i] = min, and this number is removed from the residual list. If the residual list is empty, the site is labeled with its own site number, i.e., label[i] = i. The right panel is the state after the initialization step [step (i)].

Left panel (after the first analysis step):

site  label  residual list    site  label  residual list
0     0      -                11    1      -
1     1      -                12    12     -
2     2      -                13    1      -
3     0      -                14    14     -
4     1      -                15    1      13
5     1      -                16    12     -
6     6      -                17    2      -
7     7      -                18    2      17
8     0      6                19    12     14
9     0      6                20    1      7,11
10    0      9                21    0      12,14,16

Right panel (after label reduction):

site  label  residual list    site  label  residual list
0     0      -                11    1      -
1     1      -                12    0      -
2     2      -                13    1      -
3     0      -                14    0      -
4     1      -                15    1      13
5     1      -                16    12     -
6     0      -                17    2      -
7     1      -                18    2      17
8     0      6                19    12     14
9     0      6                20    1      7,11
10    0      9                21    0      12,14,16

Figure 3: Label reduction step of the generalized algorithm. The left panel shows the state after the end of the first analysis step [step (ii)]. The right panel shows the state after label reduction [step (iii)], in which each site executes Algorithm 1.

Left panel (after label reduction):

site  label  residual list    site  label  residual list
0     0      -                11    1      -
1     1      -                12    0      -
2     2      -                13    1      -
3     0      -                14    0      -
4     1      -                15    1      13
5     1      -                16    12     -
6     0      -                17    2      -
7     1      -                18    2      17
8     0      6                19    12     14
9     0      6                20    1      7,11
10    0      9                21    0      12,14,16

Right panel (after the second analysis step):

site  label  residual list    site  label  residual list
0     0      -                11    1      -
1     1      -                12    0      -
2     2      -                13    1      -
3     0      -                14    0      -
4     1      -                15    1      13
5     1      -                16    0      -
6     0      -                17    2      -
7     1      -                18    2      17
8     0      6                19    0      14
9     0      6                20    1      7,11
10    0      9                21    0      12,14,16

Figure 4: Final analysis step of the generalized algorithm. The left panel shows the state after the end of the label reduction step [step (iii)]. The right panel shows the state after the second analysis step [step (iv)].


Figure 1 presents an example of such a list: each site has a list of its immediate neighbors. In this paper, we prepare the NN-list array in advance. The total size of the NN-list array equals the number of sites multiplied by the largest NN-list size over all sites, and site i accesses its neighbors as NN_list[i], NN_list[i+N], NN_list[i+2N], ..., where N is the number of sites.
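As an illustration of this memory layout, the following host-side C++ sketch packs per-site neighbor lists into a single padded array in the order described above; the helper name and the use of -1 as an empty-slot marker are our own assumptions, not the paper's code.

#include <algorithm>
#include <vector>

std::vector<int> pack_nn_list(const std::vector<std::vector<int>> &nn)
{
    const int N = (int)nn.size();
    int max_deg = 0;  // largest NN-list size over all sites
    for (const auto &list : nn)
        max_deg = std::max(max_deg, (int)list.size());
    // total size = number of sites * largest NN-list size; -1 marks unused slots
    std::vector<int> NN_list((size_t)N * max_deg, -1);
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < (int)nn[i].size(); ++k)
            NN_list[i + (size_t)k * N] = nn[i][k];  // k-th neighbor of site i at i + k*N
    return NN_list;
}

With this layout, consecutive threads read consecutive addresses for each k, which yields coalesced global-memory accesses on the GPU.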

The generalized CCL algorithm consists of the same four steps: (i) initialization, (ii) analysis, (iii) label reduction, and (iv) analysis. In the initialization step, we generate the residual list of each site from its NN list by keeping only those neighbors that are numbered lower than the site. Each site then has its label set to the lowest site number in the residual list, label[i] = min, and this number is removed from the residual list. If the residual list is empty, the site is labeled with its own site number, i.e., label[i] = i. Figure 2 shows an example of this procedure. The analysis function tracks the label from a given site to a new site determined by the value of the label at the given site: every site computes label[i] = label[label[i]] until label remains unchanged. In the label reduction step, each site executes Algorithm 1; the calculation at each site uses the pairs formed by the site and the remaining sites in its residual list. The label reduction method allows each chain of labels, i.e., label[label], to be constructed automatically. Figure 3 illustrates this procedure with the actual output of our algorithm for the complex network in Fig. 1. Finally, each site is assigned the proper cluster label by executing the analysis function again, as illustrated in Fig. 4.
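For concreteness, a minimal CUDA sketch of the initialization step is given below. The array names, the -1 padding convention, and the column-major residual-list storage are our own assumptions; this is not the paper's published code.

__global__ void initialize(const int *NN_list, int *label,
                           int *residual, int *res_count, int N, int max_deg)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int min_site = i;   // default label: the site's own index
    int count = 0;      // length of the residual list of site i
    for (int k = 0; k < max_deg; ++k) {
        int j = NN_list[i + k * N];
        if (j < 0 || j >= i) continue;        // keep only lower-numbered neighbors
        if (j < min_site) {
            if (min_site != i)                // previous minimum goes back to the list
                residual[i + (count++) * N] = min_site;
            min_site = j;                     // new lowest neighbor becomes the label
        } else {
            residual[i + (count++) * N] = j;  // all other lower neighbors are kept
        }
    }
    label[i] = min_site;
    res_count[i] = count;
}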

Unlike CCL algorithms that use a single CPU, this method does not assign cluster labels sequentially: the label of each cluster is the minimum site number in that cluster. However, we can renumber the cluster labels into consecutive labels, for example by using the method shown in Fig. 4 of [14].
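One common GPU renumbering scheme, sketched below under our own assumptions (it is not necessarily the method of [14]), lets each cluster root draw the next free label with an atomic counter, after which every site copies its root's new label:

__global__ void assign_sequential(const int *label, int *new_label,
                                  int *counter, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    if (label[i] == i)                        // site i is the root of its cluster
        new_label[i] = atomicAdd(counter, 1); // draw the next sequential label
}

__global__ void apply_sequential(const int *label, const int *new_label,
                                 int *out, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    out[i] = new_label[label[i]];  // after the final analysis, label[i] is the root
}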


Algorithm 1: Pseudocode of the label reduction method used in this paper. Each site executes this algorithm in the label reduction step [step (iii)]. Here, i is a site, j is a nearest neighbor of site i, and label is the label at each site.

for j in the residual list of site i do
    label_1 <- label[i]
    while label_1 != label[label_1] do
        label_1 <- label[label_1]
    end
    label_2 <- label[j]
    while label_2 != label[label_2] do
        label_2 <- label[label_2]
    end
    flag <- true
    if label_1 != label_2 then
        flag <- false
    end
    if label_1 < label_2 then
        tmp <- label_1; label_1 <- label_2; label_2 <- tmp
    end
    while flag = false do
        label_3 <- atomicMin(&label[label_1], label_2)
        if label_3 = label_2 then
            flag <- true
        else if label_3 > label_2 then
            label_1 <- label_3
        else if label_3 < label_2 then
            label_1 <- label_2; label_2 <- label_3
        end
    end
end
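A direct CUDA realization of Algorithm 1 might look as follows. This is a sketch under the storage assumptions used earlier (residual lists stored column-major with per-site counts), not the paper's published code.

__global__ void label_reduction(int *label, const int *residual,
                                const int *res_count, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    for (int k = 0; k < res_count[i]; ++k) {
        int j = residual[i + k * N];  // j is a number in the residual list of site i
        int label_1 = label[i];
        while (label_1 != label[label_1]) label_1 = label[label_1];  // root of i
        int label_2 = label[j];
        while (label_2 != label[label_2]) label_2 = label[label_2];  // root of j
        bool flag = (label_1 == label_2);
        if (label_1 < label_2) { int tmp = label_1; label_1 = label_2; label_2 = tmp; }
        while (!flag) {
            // hang the larger root below the smaller one; atomicMin returns the old value
            int label_3 = atomicMin(&label[label_1], label_2);
            if (label_3 == label_2)      flag = true;        // link already in place
            else if (label_3 > label_2)  label_1 = label_3;  // retry with the displaced root
            else { label_1 = label_2; label_2 = label_3; }   // a smaller root appeared
        }
    }
}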


[Plot: spanning probability <R_span(p)> versus bond probability p in the range p = 0.64-0.66 for N = 262144, 1048576, 4194304, and 16777216.]

Figure 5: Spanning probability for bond percolation on the honeycomb lattice.

4. Results

As an application of the generalized CCL algorithm, we study bond percolation problems. Bond percolation creates bonds between neighboring sites with probability p, and one asks whether there is a cluster that spans the entire lattice. If such a spanning cluster exists, the system is regarded as percolating. There is a threshold value p_c that separates percolating from non-percolating behavior: as the system size approaches infinity, the probability that a spanning cluster is produced tends to zero for p < p_c and to unity for p > p_c.

We test the correctness and performance of the proposed code on an NVIDIA TITAN X machine with the CUDA version 7.5 compiler. For the random-number generator, we use the XORWOW pseudorandom generator [15] with double precision via the device API of the cuRAND library [16].
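As an illustration of drawing a bond configuration on the device with the cuRAND device API, the following sketch marks each bond open with probability p; the bond indexing and the kernel itself are our own illustration, not the paper's code (XORWOW is cuRAND's default device generator).

#include <curand_kernel.h>

__global__ void generate_bonds(unsigned char *open, int n_bonds,
                               double p, unsigned long long seed)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= n_bonds) return;
    curandState state;                // XORWOW state
    curand_init(seed, b, 0, &state);  // one subsequence per bond
    // curand_uniform_double returns a value in (0, 1]
    open[b] = (curand_uniform_double(&state) < p) ? 1 : 0;
}

The open bonds then define the NN list that is fed to the labeling algorithm.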

We first treat bond percolation on the honeycomb and triangle lattices to check the correctness of our algorithm. As the measured quantity, we use the spanning probability R_span(p).


[Plot: spanning probability <R_span(p)> versus bond probability p in the range p = 0.34-0.35 for N = 262144, 1048576, 4194304, and 16777216.]

Figure 6: Spanning probability for bond percolation on the triangle lattice.

[Double logarithmic plot: average computational time (ms) versus N for the honeycomb and triangle lattices at p = 0.5p_c, p_c, and 1.5p_c.]

Figure 7: Average computational time for a single realization on the honeycomb and triangle lattices for probabilities p = 0.5p_c, p_c, and 1.5p_c. We use the threshold value p_c^hc = 1 − 2 sin(π/18) for the honeycomb lattice and p_c^tr = 2 sin(π/18) for the triangle lattice.


Figure 5 shows the spanning probability R_span(p) for the honeycomb lattice, and Fig. 6 shows the spanning probability for the triangle lattice. The total lattice sizes N are 262144, 1048576, 4194304, and 16777216. These figures show that all curves intersect at one point. From the crossing point, we estimate the threshold value on the honeycomb lattice to be p_c^hc = 0.6527 ± 0.0002 and that on the triangle lattice to be p_c^tr = 0.3472 ± 0.0002. These estimates are compatible with the exact values from [17], namely p_c^hc = 1 − 2 sin(π/18) and p_c^tr = 2 sin(π/18). These results show that our algorithm works correctly.

Secondly, we demonstrate the performance for bond percolation on the honeycomb and triangle lattices. Figure 7 gives a double logarithmic plot of the average computational time for a single realization on each lattice. Because the computational time depends on the probability p, we show the average computational times at p = 0.5p_c, p_c, and 1.5p_c, using p_c^hc = 1 − 2 sin(π/18) as the threshold value for the honeycomb lattice and p_c^tr = 2 sin(π/18) for the triangle lattice; the total lattice sizes N are 262144, 1048576, 4194304, and 16777216. Figure 7 shows that the average computational time is proportional to the total lattice size for all probabilities. On each lattice, the computational time at 0.5p_c is the shortest of the three probabilities. The computational time for the honeycomb lattice with N = 16777216 at 0.5p_c^hc is 5.8 ms, and that for the triangle lattice with N = 16777216 at 0.5p_c^tr is 7.2 ms. Moreover, the computational time for the honeycomb lattice with N = 16777216 at 1.5p_c^hc is about 1.4 times greater than at 0.5p_c^hc, and that for the triangle lattice is about 1.7 times greater than at 0.5p_c^tr. The differences between the computational times at different probabilities are due to the growth of the residual lists, and the difference between the two lattices is due to the coordination number.

Next, we treat bond percolation on the Bethe lattice as a substitute for a network structure.



Figure 8: An example of one realization in the standard position on the Bethe lattice with coordination number z = 3 at probability p = 1.5p_c. The number at each site is its index, and connections between sites in the same cluster are drawn in the same color.



Figure 9: An example of one realization in the random position on the Bethe lattice with coordination number z = 3 at probability p = 1.5p_c. The number at each site is its index, and connections between sites in the same cluster are drawn in the same color.


[Double logarithmic plot: average computational time (ms) versus N on the Bethe lattice (z = 3) for standard and random site positions at p = 0.5p_c, p_c, and 1.5p_c.]

Figure 10: Average computational times for a single realization on the Bethe lattice with coordination number z = 3 for standard site positions and random site positions at probabilities p = 0.5p_c, p_c, and 1.5p_c.

The Bethe lattice is a connected cycle-free graph in which each site is connected to z neighbors. The threshold value for bond percolation on the Bethe lattice is related to the coordination number z: it is 1/(z − 1). We usually assign site numbers starting from a center site, which we call the standard position in this paper. The generalized CCL algorithm performs extremely well with this numbering because each site then has at most one lower-numbered neighbor (its parent), so the residual lists of all sites are empty at all probabilities. We therefore treat two versions of the Bethe lattice with coordination number z = 3: the standard position, where the site numbers are assigned from a center site, and the random position, where the site numbers are assigned at random. Figure 8 shows an example of one realization in the standard position on the Bethe lattice with z = 3 at probability p = 1.5p_c, and Fig. 9 shows an example of one realization in the random position at the same probability. The threshold value of the Bethe lattice with coordination number z = 3 is p_c = 1/2. These examples are actual output of the algorithm.
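To make the standard-position numbering concrete, the following host-side sketch builds the NN list of a finite Bethe lattice (Cayley tree) with coordination number z, numbering sites breadth-first from the center. This is our own illustrative construction, not the paper's code; by construction, each site's only lower-numbered neighbor is its parent, so the residual lists are empty.

#include <vector>

std::vector<std::vector<int>> bethe_nn_list(int z, int shells)
{
    std::vector<std::vector<int>> nn(1);  // site 0 is the center
    std::vector<int> frontier = {0};
    for (int s = 0; s < shells; ++s) {
        std::vector<int> next;
        for (int parent : frontier) {
            // the center has z children; interior sites keep one bond to the parent
            int children = (parent == 0) ? z : z - 1;
            for (int c = 0; c < children; ++c) {
                int id = (int)nn.size();
                nn.push_back({parent});    // the child sees its parent...
                nn[parent].push_back(id);  // ...and the parent sees the child
                next.push_back(id);
            }
        }
        frontier.swap(next);
    }
    return nn;
}

A random position can then be obtained by applying a random permutation to the site indices of this list.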

We examine the computational times of the generalized CCL algorithm for bond percolation on the two Bethe lattices. Figure 10 shows a double logarithmic plot of the average time required for one realization on the two Bethe lattices at probabilities p = 0.5p_c, p_c, and 1.5p_c. We measure the average computational times for total lattice sizes N = 393214, 786430, 1572862, 3145726, 6291454, and 12582910. In the standard position, the computational times are almost independent of the probability, and the computational time is proportional to the total lattice size. In contrast, in the random position, the computational time depends on the probability. For the random position, the times for total lattice sizes N ≤ 786430 are proportional to the lattice size at all probabilities, while those for N ≥ 786430 deviate from proportionality at all probabilities; this change is related to the amount of L2 cache, which is 3 MB on the TITAN X. The computational time is shortest at 0.5p_c. For the random position with N = 12582910, the computational time is 7.1 ms at 0.5p_c and 21 ms at 1.5p_c; the latter is about 8.0 times greater than that for the standard position with the same N. Because the site numbers are assigned at random, the numbers of remaining sites in the residual lists are also random, and this randomness causes a load imbalance. Thus, the computational times for the random position are greater than those for the standard position. Nevertheless, these results show that the generalized CCL algorithm can be applied directly to any structure without reordering.

5. Summary and discussion

We have proposed a generalized GPU-based CCL algorithm that can be applied both to various lattices and to non-lattice environments. Because the algorithm in [7] is specialized to square and simple cubic lattices, we have extended our previous labeling algorithm [7], which does not use conventional iteration, to the generalized method. The proposed algorithm can be applied directly to any structure without reordering.

We chose the bond percolation problem as a test application. We first treated bond percolation on the honeycomb and triangle lattices to confirm the correctness of our algorithm: using the spanning probability as the measured quantity, we reproduced well-known results. Secondly, we measured the performance for bond percolation on the honeycomb and triangle lattices. Because the computational time depends on the probability p, we reported the average computational times at p = 0.5p_c, p_c, and 1.5p_c. The computational time for the honeycomb lattice with N = 16777216 at 0.5p_c is 5.8 ms, and that for the triangle lattice with N = 16777216 at 0.5p_c is 7.2 ms. Finally, we treated bond percolation on the Bethe lattice as a substitute for a network structure. Because the residual lists of all sites are empty at all probabilities when the site numbers are assigned starting from a center site, we treated two versions of the Bethe lattice with coordination number z = 3. The computational times for the random position, where the site numbers are assigned at random, are greater at all probabilities than those for the standard position, where the site numbers are assigned from a center site. Nevertheless, the generalized CCL algorithm can be applied directly to any structure without reordering.

Finally, we emphasize the efficiency of the proposed algorithm. It is very powerful: it can be adapted both to a variety of lattices and to non-lattice environments in a uniform fashion as long as an NN list is supplied. That is, it can be adapted to any structure by simply replacing the NN list. Moreover, the algorithm suits the GPU architecture, in which memory access latency can be hidden by computation instead of large data caches. The algorithm relies on indirect and random memory accesses and frequently triggers simultaneous updates with atomic functions. It is therefore unsuitable for traditional multiprocessor architectures, because it cannot be vectorized and frequently causes cache-coherence traffic. We hope that many researchers will take a greater interest in developing parallel algorithms from the standpoint of architecture.

Acknowledgments

This work was supported by KAKENHI grant 15K21623 from the Japan

Society for the Promotion of Science.

References

[1] K. Suzuki, I. Horiba, N. Sugie, Fast connected-component labeling based on sequential local operations in the course of forward raster scan followed by backward raster scan, Proceedings of the 15th International Conference on Pattern Recognition (2000) 434-437.

[2] L. He, Y. Chao, K. Suzuki, K. Wu, Fast connected-component labeling, Pattern Recognition 42 (2009) 1977-1987.

[3] K.A. Hawick, A. Leist, D. P. Playne, Parallel Graph Component Labelling

with GPUs and CUDA, Parallel Computing 36 (2010) 655-678.

[4] O. Kalentev, A. Rai, S. Kemnitz, R. Schneider, Connected component la-

beling on a 2D grid using CUDA, J. Parallel Distrib. Comput. 71 (2011)

615-620.

[5] Y. Komura, Y. Okabe, GPU-based Swendsen-Wang multi-cluster algorithm

for the simulation of two-dimensional classical spin systems, Comput. Phys.

Comm. 183 (2012) 1155-1161.

[6] Y. Komura, Y. Okabe, CUDA programs for the GPU computing of the

Swendsen-Wang multi-cluster spin ﬂip algorithm: 2D and 3D Ising, Potts,

and XY models, Comput. Phys. Comm. 185 (2014) 1038-1043.

[7] Y. Komura, GPU-based cluster-labeling algorithm without the use of con-

ventional iteration : Application to the Swendsen-Wang multi-cluster spin

ﬂip algorithm, Comput. Phys. Comm. 194 (2015) 54-58.


[8] Y. Komura, Y. Okabe, Improved CUDA programs for GPU computing of

Swendsen-Wang multi-cluster spin ﬂip algorithm: 2D and 3D Ising, Potts,

and XY models, Comput. Phys. Comm. 200 (2016) 400-401.

[9] A. Al-Futaisi, T. W. Patzek, Extension of Hoshen–Kopelman algorithm to non-lattice environments, Physica A: Statistical Mechanics and its Applications 321 (2003) 665-678.

[10] NVIDIA CUDA ZONE https://developer.nvidia.com/cuda-zone

[11] R.H. Swendsen, J.S. Wang, Nonuniversal critical dynamics in Monte Carlo

simulations, Phys. Rev. Lett. 58 (1987) 86-88.

[12] J. Hoshen, R. Kopelman, Percolation and cluster distribution. I. Cluster

multiple labeling technique and critical concentration algorithm, Phys. Rev.

B 14 (1976) 3438-3445.

[13] F. Wende, T. Steinke, Swendsen-Wang multi-cluster algorithm for the 2D/3D Ising model on Xeon Phi and GPU, in: Proceedings of SC '13: International Conference for High Performance Computing, Networking, Storage and Analysis, Article No. 83 (2013).

[14] Y. Komura, Multi-GPU-based Swendsen–Wang multi-cluster algorithm

with reduced data traﬃc, Comput. Phys. Comm. 195 (2015) 84-94.

[15] G. Marsaglia, Xorshift RNGs, Journal of Statistical Software 8 (2003) 1-6.

[16] cuRAND CUDA Toolkit Documentation,

http://docs.nvidia.com/cuda/curand/

[17] M. F. Sykes, J. W. Essam, Exact Critical Percolation Probabilities for Site

and Bond Problems in Two Dimensions, J. Math. Phys. 5 (1964) 1117-1127.
