Counting Problems on Graphs:
GPU Storage and Parallel Computing Techniques
Amlan Chatterjee, Sridhar Radhakrishnan, and John K. Antonio
School of Computer Science
University of Oklahoma
{amlan, sridhar, antonio}@ou.edu
Abstract
The availability and utility of large numbers of Graphical Processing Units (GPUs) have enabled parallel computations using extensive multithreading. Sequential access to global memory and contention at the size-limited shared memory have been the main impediments to fully exploiting potential performance in architectures having a massive number of GPUs. We propose novel memory storage and retrieval techniques that enable parallel graph computations to overcome the above issues. More specifically, given a graph G = (V, E) and an integer k ≤ |V|, we provide both storage techniques and algorithms to count the number of: a) connected subgraphs of size k; b) k-cliques; and c) k-independent sets, all of which can be exponential in number. Our storage technique is based on creating a breadth-first search tree and storing it along with the non-tree edges in a novel way. The counting problems mentioned above have many uses, including the analysis of social networks.
1. Introduction
The continuous growth and availability of huge graphs for modelling social networks, the World Wide Web, and biological systems have rekindled interest in their analysis. Social networking sites such as Facebook with 750 million users [12], Twitter with 200 million users [13], and LinkedIn with over 100 million users [5] are a huge source of data for research in the fields of anthropology, social psychology, economics, and others. It is impractical to analyze very large graphs with a single CPU, even if multithreading is employed.
Recent advancements in computer hardware have led to the use of graphics processors for solving general-purpose problems. Compute Unified Device Architecture (CUDA) from Nvidia [6] and StreamSDK from AMD/ATI are interfaces for modern GPUs that enable the use of graphics cards as powerful co-processors. These systems enable the acceleration of various algorithms, including those involving graphs [10].
The graph problems that we focus on in this paper involve counting the number of subgraphs that satisfy a given property. We are interested in counting the number of: a) connected subgraphs of size k; b) cliques of size k; and c) independent sets of size k. A naive mechanism to perform this counting is to generate a combination of the nodes (node IDs), construct the induced subgraph, and check whether the desired property holds. Since the number of subgraphs that can be constructed by choosing k out of n nodes is C(n, k), this approach is combinatorially explosive. It is noted, however, that this naive approach lends itself to parallel execution using a large number of threads.
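As a point of reference, the naive scheme can be sketched sequentially in a few lines (the function names are ours, not the paper's; the connectivity check is a simple traversal of the induced subgraph):

```python
from itertools import combinations

def is_connected(adj, nodes):
    """Traversal check that the subgraph induced by `nodes` is connected."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for v in adj[stack.pop()]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes

def count_connected_subgraphs(adj, k):
    """Naive count: test every k-combination of the n node IDs."""
    return sum(1 for c in combinations(range(len(adj)), k)
               if is_connected(adj, c))

# Example: a 4-cycle 0-1-2-3-0.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
```

The parallel versions discussed later distribute exactly this combination space over threads; the sequential sketch only fixes the semantics of the count.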
The Nvidia Tesla C1060 system contains 30 modules, each containing 8 GPU processors and 16 memory banks, each of size 1 KB. These 240 processors have access both to the memory banks in the module they reside in and to the global memory. A global memory access incurs a larger latency (23 times more) than an access to the memory inside the module, referred to as the shared memory [3]. Contention results when two or more threads access memory (the same memory bank or the global memory) at the same time. Such concurrent requests are simply queued up and processed sequentially.
Using only the shared memory, the total space available in the Nvidia system mentioned previously is 30 × 16 KB = 480 KB. In practice, the entire 480 KB might not be available due to the storage of kernel parameters and other intrinsic values in it. However, this effect can be reduced by storing such data in the constant memory instead of the shared memory. Using a boolean adjacency matrix, a graph of size up to 1982 can be stored in 480 KB. For undirected
graphs, we need to store only the upper triangular matrix, and hence can store a graph of size up to 2804.

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
978-0-7695-4676-6/12 $26.00 © 2012 IEEE
DOI 10.1109/IPDPSW.2012.99

Figure 1: Memory hierarchy of the GPU for CUDA [6]
The main issue in storing a single graph in the shared memory is that certain solution combinations may contain nodes whose adjacency information is stored in different modules. The CUDA architecture (see Fig. 1) and the Nvidia implementation do not allow a processor in one module to access a memory bank belonging to another; as a result, the graph cannot be stored in an arbitrary fashion across all modules. Considering the shared memory in a single module, of size 16 KB, we can store graphs of size 360 (or 512) using the adjacency matrix (or strictly upper triangular matrix) representation. In this case, the information relating to the nodes in any combination remains in the same module, and the threads mapped to the associated GPUs can access it for processing.
Rather than generating all possible combinations, we have devised techniques to reduce the number of combinations by considering nodes in k-neighborhoods (nodes that are at most distance k from each other). This is accomplished by considering a breadth-first search (BFS) tree, constructed on the host CPU, and nodes in any k adjacent levels only. Additionally, this BFS-aided technique allows us to carefully split the graph in such a way that we can use the entire shared memory available across all the modules and process larger graphs, even those that have to be stored externally. Using all modules not only helps in processing larger graphs, but also allows the use of all the GPU processors, thereby decreasing the total execution time.
The outline of our paper is as follows. In Section 2, we present related work. Section 3 discusses the proposed data structures used to fit the adjacency information of graphs into the shared memory modules. In Section 4 we present information on using the BFS-tree and its properties to reduce the number of computations and the storage requirements. Algorithms for solving graphs that do not fit in the shared memory of a single module are discussed in Section 5. Section 6 discusses algorithms for other related problems, such as finding the total number of k-cliques and k-independent sets. Implementation results for various graph representations stored in both global and shared memory are presented in Section 7. Conclusions and future work are discussed in Section 8.
2. Related work

Using efficient data structures to store graphs for computation on both CPUs and GPUs has been studied extensively. Katz and Kider [11] and Buluç et al. [4] proposed storing graphs on the GPU by dividing the adjacency matrix into smaller blocks. Frishman and Tal [7] propose representing multi-level graphs using 2D arrays of textures. They propose partitioning the graph in a balanced way by identifying geometrically close nodes and putting them in the same partition; but partitioning the graph is itself a hard problem. For sparse matrices, the Compressed Sparse Row (CSR) representation is useful [8]. Bader and Madduri [1] proposed a technique where different representations are used depending on the degree of the vertices, which is relevant for representing graphs that exhibit the small-world network property. Harish and Narayanan [10] describe the use of a compacted adjacency list, where instead of using several lists the data is stored in a single list: pointers for each vertex index into a single one-dimensional array holding the adjacency data of the entire graph. Since this array can be too large for the shared memory, it is kept in the global memory in their implementation. Bordino et al. [2] have developed a technique for counting subgraphs of size 3 and 4 in large streaming graphs. Their counting algorithm is sequential in nature and can be used in many applications.
In this paper, in addition to using the BFS-tree to carefully split the graph for processing, we propose a data structure that stores nodes in contiguous levels of the BFS-tree together with their adjacency information. The modified data structure is similar to the one proposed by Harish and Narayanan [10], but with improvements, including the use of fewer bits in the general case and the use of more than one array to store the entire adjacency data of the graph, thereby adhering to stricter memory constraints.
3. Simple data structures for storing the
graph information
For a graph G = (V, E) with |V| = n, the size of the adjacency matrix is n² bits, where each edge is stored using a single bit. To fit the adjacency matrix in the shared memory, n² ≤ 131,072 (16 KB = 16 × 1024 × 8 bits = 131,072 bits), which gives n ≈ 360. Therefore, using the adjacency matrix representation, the size of the largest graph that can be kept in the shared memory is 360 (assuming all shared memories in the different modules contain identical data).
For undirected graphs, the values (i, j) and (j, i) are identical, so storing only the Upper Triangular Matrix (UTM) of the adjacency matrix is enough, which requires n(n+1)/2 bits. The largest graph that can be kept in the shared memory using the UTM representation is therefore 511. Since all the diagonal values (i, i) = 0, using the Strictly UTM representation (SUTM), i.e., without the data on the diagonal, the size of the largest graph that can be kept in the shared memory is 512. Although the shared memory spans 16 banks, it is preferable to store the data for any specific node within a single bank, thereby avoiding potential memory contention and reducing overall execution time. With this requirement, the number of nodes that can fit in the shared memory is reduced to 506, where data is kept in increasing order of the node numbers. Also, using the UTM representation, the number of nodes whose data resides in each bank (of size 8,192 bits) varies, as shown in Table 1.
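The capacity figures quoted above follow directly from the bit counts of each representation; a minimal sketch (the helper name is ours) recovers them. Note that the exact adjacency-matrix bound works out to 362, which the text rounds to ≈ 360; the UTM and SUTM bounds are 511 and 512 exactly.

```python
# Shared-memory capacity per module: 16 KB = 16 * 1024 * 8 bits.
BITS = 16 * 1024 * 8  # 131,072 bits

def max_nodes(bits_needed):
    """Largest n for which bits_needed(n) still fits in one module."""
    n = 1
    while bits_needed(n + 1) <= BITS:
        n += 1
    return n

adjacency = max_nodes(lambda n: n * n)         # full boolean matrix
utm = max_nodes(lambda n: n * (n + 1) // 2)    # upper triangular
sutm = max_nodes(lambda n: n * (n - 1) // 2)   # strictly upper triangular
```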
Table 1: Distribution of nodes in the banks

Bank #   Nodes in the Bank   # of Nodes   Space Required (bits)
0        0–15                16           7976
1        16–31               16           7720
2        32–48               17           7922
3        49–66               18           8073
4        67–85               19           8170
5        86–104              19           7809
6        105–124             20           7830
7        125–146             22           8151
8        147–169             23           8004
9        170–194             25           8100
10       195–221             27           8046
11       222–251             30           8085
12       252–285             34           8075
13       286–325             40           8020
14       326–378             53           8162
15       379–505             127          8128
In the previous approach, due to the unbalanced distribution of nodes, threads assigned to operate on banks with more nodes would have to do significantly more work than threads accessing banks with fewer nodes. On the other hand, if every thread accesses a constant number of nodes, the result is inefficient memory utilization, which limits the overall size of the graph that can be stored.
For load balancing, the distribution can be done as follows. Using the SUTM representation, different rows have different amounts of data (see Table 2). To make the structure rectangular (see Table 3), the space gained by not storing redundant information in any row in the upper part of the SUTM can be filled with a corresponding row from the lower part. When n is even, the space gained in row i is filled with data values from row n − i (see Table 3); when n is odd, the corresponding space in row i is filled with data from row n − (i + 1). In general, for any value of n, the number of rows of data is reduced from n to n/2. This is called the Balanced SUTM (BSUTM), where all rows have the same amount of data, each corresponding to that of 2 nodes. The method is similar to the “rectangular full packed” format in dense linear algebra [9]. With the desire to store an entire row of data in a single module, this scheme also ensures that all banks hold an equal number of nodes, thereby achieving load balancing. Using this scheme, the maximum number of rows that can be kept in a single bank is (1024 × 8)/511 ≈ 16. As each row holds the data of 2 nodes, the total number of nodes whose data resides in each bank is 32. Therefore, the total number of nodes that can be kept in this manner in the 16 banks is 32 × 16 = 512.
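The BSUTM packing for even n can be sketched as follows (function names are ours; with 0-indexed rows, row i is padded with row n − 1 − i, which matches Tables 2 and 3):

```python
def sutm_rows(mat):
    """Strictly upper triangular rows: row i keeps entries (i, i+1..n-1)."""
    return [mat[i][i + 1:] for i in range(len(mat))]

def balanced_sutm(mat):
    """Pack the SUTM into n/2 rows of n-1 entries each (even n only).

    Row 0 already holds n-1 entries; every other packed row pairs SUTM
    row i with row n-1-i, so each packed row carries two nodes' data."""
    n = len(mat)
    assert n % 2 == 0, "this sketch handles the even case only"
    rows = sutm_rows(mat)
    packed = [rows[0]]
    for i in range(1, n // 2):
        packed.append(rows[i] + rows[n - 1 - i])
    return packed

# Example: 8-node matrix with distinguishable entries (i, j).
mat = [[(i, j) for j in range(8)] for i in range(8)]
packed = balanced_sutm(mat)
```

For n = 8 this reproduces the 4 × 7 layout of Table 3, e.g. the second packed row is SUTM row 1 followed by the single entry of SUTM row 6.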
Table 2: SUTM for even number of nodes
– a b c d e f g
– – h i j k l m
– – – n o p q r
– – – – s t u v
– – – – – w x y
– – – – – – z φ
– – – – – – – ψ
– – – – – – – –
Table 3: SUTM with load balanced approach for even
number of nodes
a b c d e f g
h i j k l m ψ
n o p q r z φ
s t u v w x y
The number of simultaneous thread executions is limited by the number of GPU processors in a module. Each thread is allocated a set of combinations of nodes (with cardinality k), and the thread determines whether the desired property (e.g., do they form a connected subgraph?) holds. In order to use all available GPU processors in the other modules, we have to duplicate the graph and place it in the shared memories of all the other modules. Care must be taken to ensure each thread is given a unique set of combinations to test, to avoid duplicated work.

The sets of combinations of k nodes are allocated to the modules as follows. Since the shared memory in each module can store up to 512 nodes, it can be assumed that there are 512 sets of combinations, each starting with a unique node number. We let the first 29 modules operate on 17 unique sets of combinations each, and the last module operates on the remaining 19 sets (17 × 29 + 19 = 512). In each module, the unique sets are divided uniformly among the available threads for processing.
4. Using BFS-tree information to reduce the number of computations
Considering a breadth-first search (BFS) representation of the graph, nodes chosen in any combination must lie within k adjacent levels; otherwise the subgraph containing the nodes in the combination will not be connected. For example, let k = 3 and a combination contain the nodes 10, 12, and 14 at levels 4, 5, and 7, respectively, in the BFS-tree. It is possible for nodes 10 and 12 to be connected by an edge since they are in adjacent levels. For the sake of discussion, assume there is an edge (10, 12) in the graph. It follows that there cannot be an edge (10, 14), for otherwise the level of node 14 in the BFS-tree would have to be 3, 4, or 5. A similar argument can be made for the edge (12, 14), and hence the graph induced by those vertices is not connected. From the above reasoning it is evident that the graph's BFS-tree can be an effective tool for reducing the number of combinations to be tested. We examine this further in the following subsections.
4.1. BFS-tree node numbering

The nodes in the BFS-tree are numbered in the order they are visited during a breadth-first search of the graph. An arbitrary node is chosen as the starting or root node; it is numbered 0 and belongs to the first level. All nodes that are neighbors of the first node belong to the second level. Similarly, any unvisited node that is a neighbor of a node in the previous level belongs to the next level.
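A host-side sketch of this numbering (names are ours), which also records the level of every node for later use:

```python
from collections import deque

def bfs_levels(adj, root=0):
    """Renumber nodes in BFS visit order; also record each node's level."""
    order, level = {root: 0}, {root: 0}
    q, nxt = deque([root]), 1
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                order[v] = nxt
                nxt += 1
                q.append(v)
    return order, level

# Example: a 4-cycle rooted at node 0.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
order, level = bfs_levels(adj)
```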
4.2. BFS-tree properties and applications

The following are some of the properties of the BFS-tree that are useful in the study of graphs.
1) If two nodes are neighbors in the original graph, their level numbers cannot differ by more than one. Let vi and vj be adjacent nodes belonging to levels l(vi) and l(vj), respectively, in the BFS-tree. Then, from this property, |l(vi) − l(vj)| ≤ 1.
2) Any node whose parent is numbered α can also be neighbors with nodes numbered greater than α in the parent's level, but not with lower-numbered ones. Let node vi be the parent of node vj and Δj be the set of neighbors of vj. Then, for any node vt with l(vt) = l(vi), vt ∉ Δj for all t < i.
3) The structure and height of the BFS-tree depend on the choice of the starting or root node.
4.3. Reducing the number of combinations to be tested

In the case of a purely random distribution of nodes, all possible combinations must be tested. So, for n nodes and subgraphs of size k, the total number of combinations to be tested is C(n, k). If n = 360 and k = 10, for example, the corresponding value is C(360, 10) ≈ 8.88 × 10^18.
In the case where a BFS-tree of the graph fits in the shared memory, using its properties the number of combinations to be tested can be drastically reduced. The idea is to test combinations of nodes drawn from each window of k consecutive levels of the graph. Let n = 360 and k = 10, and let the number of nodes in each level of the graph be 20. Then the total number of levels L in the graph is 360/20 = 18. Therefore, the number of sets of k consecutive levels is L − k + 1 = 9. Now, for each of these sets of levels, k nodes are to be chosen. The number of combinations to be tested, taking different numbers of levels at a time, is given as follows.

Considering 1 level, i.e., for each of the different levels: C(20, 10) = 184,756 ≈ 1.84 × 10^5. Considering 2 levels: (C(20,1) × C(20,9) + C(20,2) × C(20,8) + C(20,3) × C(20,7) + C(20,4) × C(20,6)) × 2 + C(20,5) × C(20,5) = 847,291,016 ≈ 8.47 × 10^8. Considering 3 levels: 72,851,600,250 ≈ 7.28 × 10^10. Similarly, we can calculate the number of combinations taking 4 levels up to 10 levels at a time.
Considering the general case, there is a total of n nodes divided among L levels, such that each level consists of p = n/L nodes. Then the number of combinations to be checked for a k-node subgraph spanning j consecutive levels is the sum of C(p, n1) × C(p, n2) × ··· × C(p, nj) over all (n1, n2, ..., nj) such that ni ≥ 1 for all i and n1 + n2 + ··· + nj = k. Every level in the window must contribute at least one node, since a connected subgraph cannot skip a BFS level.
When taking 1 level at a time, the number of combinations to be tested is ≈ 1.84 × 10^5. But all L available levels will be tested for this many combinations when they are considered individually, so the total number of combinations over all single levels is ≈ 1.84 × 10^5 × L. Similarly, taking 2 levels at a time, the number of combinations to be tested per window is ≈ 8.47 × 10^8, and there are L − 1 windows of 2 consecutive levels, so the total contribution of all 2-level windows is ≈ 8.47 × 10^8 × (L − 1). There are correspondingly L − 2 windows of 3 consecutive levels, L − 3 windows of 4 levels, and so on; in general, there are L − (k − 1) windows of k consecutive levels. Therefore, the total number of combinations to be tested is: 1.84 × 10^5 × 18 + 8.47 × 10^8 × 17 + 7.28 × 10^10 × 16 + 1.35 × 10^12 × 15 + 9.82 × 10^12 × 14 + 3.5 × 10^13 × 13 + 6.9 × 10^13 × 12 + 7.6 × 10^13 × 11 + 4.3 × 10^13 × 10 + 1.02 × 10^13 × 9 ≈ 2.8 × 10^15, i.e., about 2,800 trillion. Compared with the ≈ 8.88 × 10^18 combinations of the naive approach, the number of combination tests decreases by ≈ 8.87 × 10^18.
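The per-window counts above can be reproduced with a small dynamic program (ours, not the paper's code) over the number of nodes taken from each level:

```python
from math import comb

def window_count(p, k, j):
    """k-node combinations drawn from j consecutive levels of p nodes
    each, with every level contributing at least one node (a connected
    subgraph cannot skip a BFS level)."""
    dp = [1] + [0] * k          # dp[t]: ways to place t nodes so far
    for _ in range(j):          # add one level at a time
        ndp = [0] * (k + 1)
        for t in range(k + 1):
            if dp[t]:
                for c in range(1, k - t + 1):
                    ndp[t + c] += dp[t] * comb(p, c)
        dp = ndp
    return dp[k]

def reduced_total(p, k, L):
    """Total tests over all window sizes: L - j + 1 windows of j levels."""
    return sum(window_count(p, k, j) * (L - j + 1) for j in range(1, k + 1))
```

With p = 20, k = 10, and L = 18 this recovers the figures in the text, including the ≈ 2.8 × 10^15 grand total.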
4.4. Storing the graph with BFS information

In order to process the reduced set of combinations described in the previous subsection, modules should generate only the relevant combinations, which requires knowing the number of nodes in each level and the BFS numbering of the nodes. The entire graph can be stored using the SUTM representation along with the number of nodes at each level. Instead, it may be beneficial in some cases to store adjacent levels of the BFS-tree separately: for each set of adjacent levels, except for the node in the root level, there is a SUTM data structure along with the starting node number and the total number of nodes. This representation is called SUTM-ADJ.

In addition to SUTM-ADJ, we have devised another data structure called the Parent Array Representation (PAR), wherein each node keeps information about its parent along with its neighboring nodes in both the parent's level and its own level as a list. By keeping this additional information we can further reduce the space requirements, as shown below.
In PAR, for each of the nodes contained in the k levels, the following information is stored: a) the parent node number; b) an identifier (0 or 1) specifying whether there are other neighboring nodes belonging to the same level (siblings) or the previous level (i.e., the parent's level); c) if the identifier value is 1, the number of such neighbors; and d) the node numbers identifying each of those neighbors. Fig. 2 shows an example of PAR for an arbitrary graph. The node numbers are not stored explicitly but are calculated from a value x, which gives the starting node number. As the node numbers are in strictly increasing order, the calculation is simple. Likewise, neighboring node numbers are not stored explicitly but are calculated from another value y, which is the parent number for x.

The space required is therefore reduced by storing only the differences in the numbering of the parents and neighbors of the nodes, which depend on the value δ (the degree of the graph). When storing the differences between the numberings, the worst-case graph, shown in Fig. 3, gives differences on the order of n. For example, if a level has p nodes, there are p parent nodes, and say there are q other neighboring nodes. If the node numbers are stored directly, it takes ((2 × p + q + p′) × log2 n + p) bits, where the extra p bits are the identifiers and p′ ≤ p is the number of nodes that have other neighbors and need an additional value indicating how many. If instead the differences in the numbering are used, it takes in the worst case ((p + q) × log2 δ + (2 + p′) × log2 n + p) bits.
Figure 2: PAR for an arbitrary graph
Figure 3: Worst case: δ is of order n
Interestingly, more than one type of storage mechanism can be used, depending upon the structure of the BFS-tree: each level of the BFS-tree uses either SUTM-ADJ or PAR. The graph is preprocessed on the CPU and, depending on the size of each representation, the smaller data structure is chosen.
4.5. Comparison of space requirements for the different data structures

An example graph of size 16 is provided for illustration. The structure of the original graph and a BFS-tree is shown in Fig. 4. Table 4 compares the space requirements of the different data structures.
5. Splitting for larger graphs

In the previous sections we have seen that, if the graph fits in the shared memory, we can use one of several techniques based on the BFS-tree. In this section we show how to process graphs whose BFS-tree does not fit in the shared memory.
Figure 4: Sample graph (top) and BFS-tree for comparing data structures (bottom)
Table 4: Comparison of space requirements

Data Structure          Space Required (bits)
Adjacency matrix        400
UTM                     210
SUTM                    190
SUTM-ADJ                174
PAR                     113
SUTM-ADJ/PAR hybrid     110
We make the following assumption: given a graph G, there exists a BFS-tree T such that the subgraph of G induced by the nodes in any k consecutive levels of T has connected components of size less than 512. Additionally, we assume the entire graph can fit in the 30 modules.

In each consecutive module i, the nodes in level k + i are added and the nodes in level i are removed. In this case, if the average number of nodes in each level is l_avg, then the total number of nodes that can fit in the shared memory is 512 + l_avg × 29, where l_avg ≤ 512. This is an improvement over the scheme presented in Section 3, wherein we could process at most 512 nodes even when using all the modules. In the worst case, if each of the first 29 levels of T contains a single node, then only 1 new node can be brought in when storing the next k consecutive levels, giving a maximum graph size of 512 + 29 = 541 nodes.
Now, let us assume there exist k consecutive levels of T that do not fit in the shared memory. In this case, finding the connected components of the graph induced by the nodes in those k levels can help. Since nodes in separate connected components cannot be part of the same connected subgraph, calculations can be done on each component separately. If there is more than one connected component, each with fewer than 512 nodes, then the calculations can proceed in the following manner; otherwise we have to use the global memory to store them. Any component with fewer than k nodes can be excluded from the calculations. Each remaining component can be kept in the shared memory of a module by itself, or together with other connected components, provided the total number of nodes does not exceed 512. All possible combinations of size k can then be tested from among the nodes of each connected component separately.
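The component-splitting step can be sketched sequentially as follows (names are ours; the capacity of 512 mirrors the per-module node limit):

```python
def components(adj, nodes):
    """Connected components of the subgraph induced by `nodes`."""
    nodes, comps = set(nodes), []
    while nodes:
        start = nodes.pop()
        comp, stack = {start}, [start]
        while stack:
            for v in adj[stack.pop()]:
                if v in nodes:
                    nodes.remove(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def usable_components(adj, nodes, k, capacity=512):
    """Drop components smaller than k; each survivor must fit a module."""
    comps = [c for c in components(adj, nodes) if len(c) >= k]
    assert all(len(c) <= capacity for c in comps), "fall back to global memory"
    return comps

# Example: two 2-node components and one isolated node.
adj = {0: [1], 1: [0], 2: [3], 3: [2], 4: []}
```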
Algorithm 1: Counting subgraphs of size k using all modules by splitting G horizontally using T
Input: BFS-tree T of graph G
Output: Total count of connected subgraphs of size k
begin
    {Lk} ← divIntoKLevelSets(T);
    {Lk}, {M} ← resetLevlsModls({Lk}, {M});
    TotalCount ← 0;
    while there are sets of levels marked as new do
        if the selected set is the last one then
            while levels are available in the set do
                if there are k levels in the memory then
                    Clear memory, mark module available;
                else
                    curLvl ← nextAvailableLvl;
                    TotalCount ← TotalCount + testCon(GenNxtComb(curLvl));
                    while there are previous levels do
                        curLvl ← curLvl ∪ prevLvls;
                        TotalCount ← TotalCount + testCon(GenNxtComb(curLvl));
        else
            TotalCount ← TotalCount + testCon(GenNxtComb(fstLvl));
            TotalCount ← TotalCount + testCon(GenNxtComb(allKLvls));
end
Algorithm 1 takes T as input and checks for connected subgraphs. The algorithm makes use of all 30 available modules in the GPU by dividing the work among them. If there are L levels in T, the number of sets of k consecutive levels is Q = L − k + 1. If Q > 30, the remaining sets are brought in from the global memory after the current round of executions completes. With this approach, even graphs stored in external memories can be processed.

Overview of Algorithm 1: T is divided into sets of k consecutive levels, and each set is processed by a module. The beginning level in the set is processed first; if it contains more than k nodes, then subgraphs of size k chosen from within that level are checked for connectedness. Then all the nodes in the k levels are considered together, and combinations are generated by the function GenNxtComb(Nodes), where each combination contains at least one node from the beginning level and one from the rest of the levels, to avoid redundant checking. The above procedure is applied to all but the last set of k levels. For the last set, each new level is first processed separately and then combined with all previous levels and checked for combinations, provided there is at least one node from both the previous levels and the new level. The function divIntoKLevelSets(T) divides T into sets of k consecutive levels; resetLevlsModls(Levels, Modules) resets the modules, marking all of them as available, and marks all the k-level sets as new and ready to be processed. The function testCon(Comb) checks whether the subgraph induced by the nodes in Comb is connected.
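The role of GenNxtComb can be illustrated with a small sequential stand-in (the name and interface are ours; the real function enumerates combinations across threads): it yields only combinations that touch both the window's first level and the later levels, so combinations lying entirely inside one level are left to the separate per-level pass and nothing is counted twice.

```python
from itertools import combinations

def gen_next_comb(first_level, rest, k):
    """Yield k-combinations with at least one node from `first_level`
    and at least one node from `rest` (the remaining k-1 levels)."""
    for i in range(1, k):
        for head in combinations(first_level, i):
            for tail in combinations(rest, k - i):
                yield head + tail
```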
6. Other related problems

Using approaches similar to the connectivity-testing algorithm, graph problems such as finding the total number of cliques of size k and the total number of independent sets of size k can also be solved. For finding the total number of k-cliques, only nodes in adjacent levels of T need to be considered, since all nodes of a clique are pairwise adjacent and their levels can therefore differ by at most one. For finding the total number of independent sets in G, its complement, say G', is taken, and a BFS is performed on G' to get T', which is processed as given in Algorithm 2. Finding cliques of size k in G' is equivalent to finding independent sets of size k in G, as given in Algorithm 3.
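The complement trick of Algorithm 3 can be checked on a small example with a brute-force stand-in for Algorithm 2 (all names are ours):

```python
from itertools import combinations

def count_k_cliques(adj, k):
    """Brute-force k-clique count (stand-in for Algorithm 2)."""
    return sum(1 for c in combinations(range(len(adj)), k)
               if all(v in adj[u] for u, v in combinations(c, 2)))

def complement(adj):
    """Complement graph G' of G (same node set, inverted edges)."""
    n = len(adj)
    return {u: [v for v in range(n) if v != u and v not in adj[u]]
            for u in range(n)}

def count_k_independent_sets(adj, k):
    """k-independent sets in G are exactly the k-cliques in G'."""
    return count_k_cliques(complement(adj), k)

# Example: the path 0-1-2-3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

On the path graph, the three non-edges {0,2}, {0,3}, {1,3} are its size-2 independent sets, and they appear as edges of the complement.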
7. Results

The problem of finding the number of connected subgraphs of size k in a graph of size n is solved on the GPU for different values of n and k, storing the adjacency information of the graph in both the shared memory and the global memory, and using both a single module and all available modules. The number of threads in each module is limited to 32, which is equal to the warp size [6]. Fig. 5 plots the timings for evaluating all the combinations with the graph kept in both the shared and the global memory, using both a single module and all available modules. The plots are as expected: the shared-memory timings are better than the global-memory ones. Also, using all the modules rather than a single module makes more threads available, which leads to better performance.

Fig. 6 plots the timings for the graph kept in both the shared and the global memory when the BFS-tree topology information is used, again for both a single module and all available modules. The number of computations, and the resulting computation times, are greatly reduced compared to the previous case (Fig. 5), where all combinations of the nodes are tested.

Fig. 7 plots the timings for the graph kept in the shared memory, using all available modules, evaluating both all combinations and the reduced number of combinations, thereby comparing the previous two cases. It is clear from Figs. 5–7 that calculations using the BFS-tree topology information, with the adjacency information kept in the shared memory and all available modules utilized for a large number of threads, are the most efficient approach.
Algorithm 2: Counting the number of k-cliques
Input: BFS-tree T of graph G
Output: Total count of cliques of size k
begin
    {Li} ← divIntoLevels(T);
    {Li} ← markLevelsNew({Li});
    TotalCount ← 0;
    TotalLevels ← 0;
    while Li.Status ∈ {Li} = New do
        curLvl ← Li;
        TotalCount ← TotalCount + testClique(GenNxtComb(curLvl));
        TotalLevels++;
        if TotalLevels > 1 then
            curLvl ← Li ∪ Li−1;
            TotalCount ← TotalCount + testClique(GenNxtComb(curLvl));
end
Figure 5: Evaluating all combinations for k = 3 with 32 threads in each module, for data stored on both shared and global memory
Algorithm 3: Counting independent sets of size k
Input: Graph G(V, E)
Output: Total count of independent sets of size k
begin
    G' ← FindComplement(G);
    T' ← BFSTreeGenerate(G');
    TotalCount ← Algorithm2(T');
end
8. Conclusion

In this paper, algorithms to solve several graph problems on a parallel GPU architecture are derived. The major focus is on utilizing the faster shared memory of the GPUs and devising data structures to represent graphs in these small memory modules. Methods to generate combinations that efficiently divide the work among threads belonging to both single and multiple modules are developed. In addition, techniques that reduce computations by using the breadth-first search tree and exploiting topology information are discussed. Our results show that the smallest computation times are indeed obtained for graphs stored in the shared memory with calculations using the BFS-tree information. Our future work involves the study of graph compression relevant to the problems under consideration.
References
[1] D. A. Bader and K. Madduri. SNAP, small-world network analysis and partitioning: An open-source parallel graph framework for the exploration of large-scale networks. In IEEE International Symposium on Parallel and Distributed Processing, pages 1–12, April 2008.
Figure 6: Evaluating reduced number of combinations using BFS-tree information for k = 3 with 32 threads in each module, for data stored on both shared and global memory
Figure 7: Evaluating all combinations and reduced combinations for k = 4 with 32 threads in each module, for data stored on shared memory
[2] I. Bordino, D. Donato, A. Gionis, and S. Leonardi. Mining large networks with subgraph counting. In Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on, pages 737–742, Dec. 2008.
[3] M. Boyer, K. Skadron, and W. Weimer. Automated dynamic analysis of CUDA programs. In Third Workshop on Software Tools for Multi-Core Systems, 2008.
[4] A. Buluç, J. R. Gilbert, and C. Budak. Solving path problems on the GPU. Parallel Computing, 36:241–253, June 2010.
[5] LinkedIn Press Center. http://press.linkedin.com/about, 2011.
[6] NVIDIA Corporation. NVIDIA CUDA C Programming Guide, Version 3.2, 2010.
[7] Y. Frishman and A. Tal. Multi-level graph layout on the GPU. IEEE Transactions on Visualization and Computer Graphics, 13:1310–1319, November 2007.
[8] M. Garland. Sparse matrix computations on manycore GPUs. In Design Automation Conference, 2008. DAC 2008. 45th ACM/IEEE, pages 2–6, June 2008.
[9] F. G. Gustavson, J. Waśniewski, J. J. Dongarra, and J. Langou. Rectangular full packed format for Cholesky's algorithm: factorization, solution, and inversion. ACM Trans. Math. Softw., 37:18:1–18:21, April 2010.
[10] P. Harish and P. J. Narayanan. Accelerating large graph algorithms on the GPU using CUDA. In Proc. of the IEEE Intl. Conf. on High Performance Computing, LNCS 4873, pages 197–208, 2007.
[11] G. J. Katz and J. T. Kider, Jr. All-pairs shortest-paths for large graphs on the GPU. In Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pages 47–55. Eurographics Association, 2008.
[12] Facebook Statistics. https://www.facebook.com/press/info.php?statistics, 2011.
[13] Twitter Statistics. http://www.geek.com/articles/news/twitter-reaches-200-million-users-and-110-million-tweets-per-day-20110120, 2011.