
Logarithmic Radix Binning and Vectorized Triangle Counting

Oded Green, James Fox, Alex Watkins, Alok Tripathy, Kasimir Gabert,

Euna Kim, Xiaojing An, Kumar Aatish, and David A. Bader

Computational Science and Engineering, Georgia Institute of Technology - USA

Abstract— Triangle counting is a building block for numerous graph applications and, given the fact that graphs continue to grow in size, its scalability is important. As such, numerous algorithms have been designed for triangle counting - some of which are compute-bound rather than memory-bound. Even for compute-bound algorithms, one of the key challenges is the limited control flow available on the processor. This is in part due to the high dependency between the control flow, input data, and limited utilization of vector instructions. Not surprisingly, compilers are not always able to detect these data dependencies and vectorize the algorithms. Using the branch-avoiding model, we show how to remove control flow restrictions by replacing branches with an equivalent set of arithmetic operations. More so, we show how these can be vectorized using Intel's AVX-512 instruction set and that our new vectorized algorithms are 2×-5× faster than their scalar counterparts. We also present a new load balancing method, Logarithmic Radix Binning (LRB), that ensures that threads and the vector data lanes execute a near equal amount of work at any given time. Altogether, our algorithm outperforms several 2017 HPEC Graph Challenge Champions, such as the KOKKOS framework and a GPU-based algorithm, by anywhere from 1.5× and up to 14×.

I. INTRODUCTION

Triangle counting and enumeration is a widely used kernel

for numerous applications. These include clustering coefﬁ-

cients, community detection, email spam detection, Jaccard

indices, and ﬁnding maximal k-trusses. The key building

block of triangle counting is adjacency list intersection. Nu-

merous algorithms have been developed for triangle counting

and these encapsulate a wide range of programming models:

vertex centric, edge centric, gather-apply-scatter (GAS), and

linear algebra. Some approaches require the adjacency arrays

to be sorted whereas other approaches do not. Most previous

algorithms focus on scalar execution as vectorizing these

algorithms is typically challenging. In this paper, we show several new algorithms for vectorizing the computational kernels of triangle counting.

We also propose a new load-balancing technique, which

we call Logarithmic Radix Binning (LRB), that ensures that

all the threads and vector units get a near equal amount

of work. LRB improves on previous techniques, which focus only on thread-level load-balancing, by also balancing work at the vector-lane granularity, thus improving vector unit utilization.

In this paper, we present several new algorithms for

triangle counting. These algorithms use techniques developed

for the branch-avoiding model discussed in Green et al. [9],

[7]. Speciﬁcally, Green et al. show that the cost of branch

mis-prediction is high for data dependent applications (such

as graph algorithms) and that these applications can typically

be implemented in an alternative manner that avoids branches

entirely. This eliminates mis-prediction and makes execution

times more consistent, at the cost of additional operations.

Algorithmic Contributions

•We develop a two-tiered binning mechanism for triangle counting. In the first tier, for each edge we decide on an intersection method that will be applied to that edge. Our current implementation uses two different kernels (though it can be extended to more); therefore, this tier consists of two bins. For the second tier, we present Logarithmic Radix Binning (LRB). For each intersection kernel, a 2D set of bins is maintained. Each of these bins stores edges with similar computational properties. Thus, our vectorized triangle counting algorithm can grab K edges, where K is the vector width, with similar computational requirements, allowing for good load-balancing and utilization at the vector lane granularity. In practice, this offers good scalability.

•We show how to increase the number of control flows in software. For a multi-threaded processor with P physical hardware threads and vector instructions K elements wide, we show how to execute up to P·K concurrent software threads, where each of these threads executes a different intersection. Thus, our new algorithm increases the control flow on the processor by a factor of K. For the Intel Knights Landing processor (discussed in Sec. V) used in our experiments, P = 272 and K = 16, allowing us to support up to 4352 concurrent software threads.

•We show two novel vectorized triangle counting algo-

rithms: 1) sorted set intersection, and 2) binary search. The

ﬁrst of these approaches ﬁnds common neighbors by scan-

ning across the two sorted adjacency arrays being intersected,

similar to a merge. The second approach ﬁnds common

neighbors by searching for elements of one array in the

other. We also show a runtime mechanism for deciding which algorithm should be selected. The LRB method is then used to ensure well-balanced execution.

•LRB is architecture agnostic and has been extended to the

GPU [5]. On the GPU, different bins are executed using

a different number of threads, thus ensuring good load-

balancing and trading off various overheads.


Performance Contributions

•We compare our new algorithm against several high

performing triangle counting algorithms, including [23], [31],

[2], [27]. Several of these were the fastest triangle counting

algorithms in the 2017 HPEC Graph Challenge [24]. Our algorithm is the fastest for all but a small number of instances across a wide range of test graphs. This includes outperforming KOKKOS [31] by an average of 4× and as much as 10×, and [2] by an average of 2× and as much as 6×.

•From a scalability perspective, we show that our algorithm

scales to large thread counts. This highlights the fact that LRB is frequently able to give each thread a near equal amount of work and ensure good workload balance.

•Our new vectorized algorithms offer a 2×-5× performance speedup over their scalar counterparts (which also use the LRB load balancing mechanism).

II. RELATED WORK

The applications in which triangle counting and triangle listing are used are broad. Triangle counting became an important metric to data scientists with the introduction of clustering coefficients

[30]. Other applications for triangle counting are: ﬁnding

transitivity [18], spam detection in email networks [1],

ﬁnding tightly knit communities [21], ﬁnding k-trusses [4],

[28], [10], and evaluating the quality of different community

detection algorithms [17], [32]. An extended discussion of

triangle counting applications can also be found in [3].

Triangle counting was also a key kernel for the HPEC Graph

Challenge [22].

a) Computational Approaches: Given a graph G = (V, E), where V is the set of vertices and E is the set of edges, the three simplest and most widely used approaches for counting triangles in a graph are [26]: enumerating over all node triplets, O(|V|^3); using linear algebra operations, O(|V|^ω) (where ω < 2.376); and adjacency list intersection. Adjacency list intersection can be completed in multiple ways: sorted set intersections, hash tables, or binary searches to look up values. The time complexity of each of these approaches is data dependent, yet upper bounds can be given. Triangle counting can also be completed using the Gather-Apply-Scatter (GAS) programming model [6], [14]. In this work we focus on the adjacency intersection approaches.

b) Algorithmic Optimization: Numerous computational

optimizations can be applied to triangle counting algo-

rithms for static graphs to help reduce the overall execution

time. For example, Green & Bader [8] present a combi-

natorial optimization that reduces the number of necessary

intersections–offering a better complexity bound. Green et al.

[12] show a scalable technique for load-balancing the triangle

counting on shared-memory systems. Shun & Tangwongsan

[23], Polak [20], and Pearce [19] show how to reduce the

computational requirements by ﬁnding triangles in a directed

graph rather than the undirected graph. Leist et al. [15],

Green et al. [13], Wang et al. [29], and Fox et al. [5]

show several different strategies for implementing triangle

counting on the GPU.

c) Vector Instruction Sets: Vector instructions have

been an integral part of commodity processors in the last

twenty years, though the history of Single Instruction Multiple Data (SIMD) programs goes back even further. With SIMD instructions, each datum is placed in a separate lane and each

lane executes the same instruction. Intel’s AVX-512 ISA can

operate on vectors of 512 bits. The AVX-512 instruction set

has numerous conditional instructions referred to as masks

in the AVX-512 ISA. Our algorithm makes extensive use of

these instructions.

III. LOGARITHMIC RADIX BINNING

In this section, we present Logarithmic Radix Binning (LRB for short), a method that effectively load balances the intersections across the threads for both intersection algorithms. This method is efficient and works well for both the scalar and vectorized algorithms. LRB works by placing edges into bins based on the logarithmic value of the estimated amount of work for that edge. For triangle counting, we initially group the edges into two unique bins, one bin for the sorted set intersection and one for the binary search. We then apply LRB to each of the edges and distribute them over a 2D grid of bins.

a) Initial binning: The intersection of the adjacency

arrays can be done in multiple ways. Two popular approaches

are 1) sorted set intersections and 2) using binary search. In

the sorted set intersection, common elements are found by

moving across the two sorted adjacency arrays using a merge-like access pattern. Sorted set intersection performs well

when the two adjacency arrays are of near equal length–

we call this a balanced intersection. However, when one

adjacency array is extremely large and the other is small,

or the intersection is imbalanced, binary search is more

efﬁcient. For binary search, each element in the smaller array

is looked up in the larger of the two arrays.
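The imbalanced case described above can be sketched in plain C. This is a hypothetical scalar sketch (the function names are illustrative, not from the paper's code): each element of the shorter sorted adjacency array is binary-searched in the longer one, giving roughly d_small · log(d_large) work.

```c
#include <stddef.h>

/* Return 1 if key appears in the sorted array arr[0..n), else 0. */
static int contains(const int *arr, size_t n, int key) {
    size_t low = 0, high = n;               /* search range [low, high) */
    while (low < high) {
        size_t mid = low + (high - low) / 2;
        if (arr[mid] == key) return 1;
        if (arr[mid] < key) low = mid + 1;
        else high = mid;
    }
    return 0;
}

/* Count common neighbors by looking up each element of the smaller
 * adjacency array in the larger one. */
size_t intersect_binary_search(const int *small_adj, size_t ds,
                               const int *large_adj, size_t dl) {
    size_t count = 0;
    for (size_t i = 0; i < ds; i++)
        count += contains(large_adj, dl, small_adj[i]);
    return count;
}
```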

To determine which bin to place an edge (u, v) ∈ E into, we use the following estimations:

IntersectionWork(u, v) = d_u + d_v    (1)

BinaryWork(u, v) = d_u · log(d_v)    (2)

Intersecting (u, v) and (v, u) will find the same triangles; as such, only one of these is necessary. For simplicity and performance reasons, we choose the edge (u, v) such that d_u < d_v. If d_u = d_v, we select the vertex with the smaller id. The intersection method selected is based on the minimum of Eq. (1) and Eq. (2).
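As a concrete sketch, the selection rule can be written as a few lines of C. The names are illustrative, and since the paper does not specify the base of the logarithm in Eq. (2), floor(log2) is assumed here:

```c
/* Hypothetical sketch of the kernel-selection rule: compute the two
 * work estimates of Eq. (1) and Eq. (2) and pick the cheaper kernel. */
enum kernel { SORTED_SET, BINARY_SEARCH };

static unsigned ilog2(unsigned long long x) {  /* floor(log2(x)) for x >= 1 */
    unsigned r = 0;
    while (x >>= 1) r++;
    return r;
}

enum kernel select_kernel(unsigned long long du, unsigned long long dv) {
    unsigned long long intersection_work = du + dv;           /* Eq. (1) */
    unsigned long long binary_work = du * ilog2(dv ? dv : 1); /* Eq. (2) */
    return intersection_work <= binary_work ? SORTED_SET : BINARY_SEARCH;
}
```

For a balanced intersection (e.g., two degree-100 vertices) Eq. (1) wins; for an imbalanced one (degree 4 against degree 2^20) Eq. (2) wins, matching the intuition above.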

b) Finer grain binning: Fig. 1 depicts an edge list with

the estimated amount of work for that edge. For each edge,

the method with the minimal amount of estimated work is

selected using Eq. 1 and Eq. 2. The yellow boxes denote

edges that use the sorted set intersection approach and the

blue boxes represent edges that use the binary search. The

second row represents an edge ordering where the edges are placed into two bins—one for each approach. For the vectorized algorithm, this can lead to significant workload imbalance.


Algorithm 2: Branch-avoiding with conditional instructions. Variants of the branch-based and branch-avoiding algorithms can be found in [7].

Tri-Counting-Branch-Avoiding-Conditionals()
  ai ← 0; bi ← 0; count ← 0;
  while (ai < |A| and bi < |B|) do
    CMP(A[ai], B[bi]);
    CADDEQ(count);   // Conditional-ADD if (A[ai] = B[bi])
    CADDLEQ(ai);     // Conditional-ADD if (A[ai] ≤ B[bi])
    CADDGEQ(bi);     // Conditional-ADD if (A[ai] ≥ B[bi])

Algorithm 3: Vectorized sorted intersection kernel

while (cond) {
  AVec = _mm512_i32gather_epi32(indexA, EArr, 4);
  BVec = _mm512_i32gather_epi32(indexB, EArr, 4);
  cmpLeVec = _mm512_mask_cmple_epi32_mask(cond, AVec, BVec);
  cmpGeVec = _mm512_mask_cmpge_epi32_mask(cond, AVec, BVec);
  cmpEqVec = _mm512_mask_cmpeq_epi32_mask(cond, AVec, BVec);
  tris   = _mm512_mask_add_epi32(tris, cmpEqVec, tris, miOne32);
  indexA = _mm512_mask_add_epi32(indexA, cmpLeVec, indexA, miOne32);
  indexB = _mm512_mask_add_epi32(indexB, cmpGeVec, indexB, miOne32);
  cond = _mm512_mask_cmpgt_epi32_mask(cond, indexAStop, indexA);
  cond = _mm512_mask_cmpgt_epi32_mask(cond, indexBStop, indexB);
}

the threads get an equal amount of work. This will become evident in Section V, where the scaling of the algorithm is almost perfectly linear.

f) Time Complexity Analysis: Phase 1: finding the proper bin takes O(|E|) steps for evaluating the |E| edges. Phase 2: O(B^2) steps to compute the prefix matrices. Phase 3: an additional O(|E|) steps to reorder the edge list. Phase 1 and Phase 3 are embarrassingly parallel and easily split across the P threads. Phase 2 can also be done in parallel; however, the cost of the prefix operation on arrays of size B^2 is relatively small in comparison to the other two phases, and a sequential implementation is sufficient. Recall that B ∈ {32, 64} and that B^2 ≪ |E|. The total time complexity is O(|E| + B^2) = O(|E|).
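The three phases above can be sketched as a single-threaded counting sort over the logarithmic bins. This is a minimal sketch under stated assumptions: B, the work[] estimates, and the function names are illustrative, and the 2D (per-kernel) grid of bins is collapsed to one row.

```c
#include <stddef.h>

#define B 32  /* number of logarithmic bins (illustrative) */

static unsigned log_bin(unsigned long long work) {
    unsigned b = 0;
    while (work >>= 1) b++;          /* floor(log2(work)) */
    return b < B - 1 ? b : B - 1;    /* clamp to the last bin */
}

/* Phase 1: count edges per bin.  Phase 2: prefix-sum the counts.
 * Phase 3: scatter edge ids into the reordered list. */
void lrb_reorder(const unsigned long long *work, const int *edges,
                 int *reordered, size_t m) {
    size_t counts[B] = {0}, offsets[B];
    for (size_t e = 0; e < m; e++)            /* Phase 1: O(|E|) */
        counts[log_bin(work[e])]++;
    size_t sum = 0;
    for (int b = 0; b < B; b++) {             /* Phase 2: prefix sum */
        offsets[b] = sum;
        sum += counts[b];
    }
    for (size_t e = 0; e < m; e++)            /* Phase 3: O(|E|) scatter */
        reordered[offsets[log_bin(work[e])]++] = edges[e];
}
```

Consecutive entries of the reordered list then carry near equal estimated work, which is what lets the vectorized kernels grab K similar edges at a time.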

g) Storage Complexity Analysis: We use CSR (compressed sparse row) to represent the original graph and assume that the adjacency arrays are sorted. For triangle counting, CSR requires two arrays: one for the offsets, of size O(|V|), and one for the indices (edges), of size O(|E|). The binning technique used by LRB stores the edges in a different order. We use an array of size O(|E|) to store these edges. While this new edge list does not increase the theoretical upper bound, from a practical perspective it does double the memory consumption. The edges in the new reordered edge list determine the order in which the edges will be intersected. However, the intersection process itself uses the sorted adjacency arrays in the CSR graph.
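A minimal sketch of this CSR layout in C follows; the field names are illustrative, and the offsets array is given |V|+1 entries (the standard convention) so that every vertex's adjacency is delimited.

```c
#include <stddef.h>

/* CSR as described above: offsets plus concatenated sorted adjacencies. */
typedef struct {
    size_t nv;          /* number of vertices */
    const size_t *off;  /* off[v] .. off[v+1] delimit v's adjacency */
    const int *ind;     /* per-vertex sorted adjacency arrays, concatenated */
} csr_graph;

size_t degree(const csr_graph *g, size_t v) {
    return g->off[v + 1] - g->off[v];
}

const int *neighbors(const csr_graph *g, size_t v) {
    return g->ind + g->off[v];
}
```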

IV. VECTORIZED ALGORITHMS

In this section we present our new branch-avoiding and

vectorized triangle counting algorithms. Alg. 2 depicts the

sorted list intersection using the branch-avoiding program-

ming model [9], [7]. Note, the control ﬂow for the branch-

avoiding algorithms is largely independent of input values–

this is a key enabling factor for our vectorized approach. See

Green et al. [9], [7] for additional discussion on the branch-

avoiding programming model.

Algorithm 4: Vectorized binary search kernel

while (cond) {
  sumVec = _mm512_add_epi32(low, high);
  middle = _mm512_maskz_srl_epi32(cond, sumVec, oneShifter);
  vals = _mm512_mask_i32gather_epi32(vals, cond, middle, EArr, 4);
  cmpEqVec = _mm512_mask_cmpeq_epi32_mask(cond, vals, keys);
  cmpLtVec = _mm512_mask_cmplt_epi32_mask(cond, vals, keys);
  cmpGtVec = _mm512_mask_cmpgt_epi32_mask(cond, vals, keys);
  tris = _mm512_mask_add_epi32(tris, cmpEqVec, tris, miOne32);
  low  = _mm512_mask_add_epi32(low, cmpLtVec, middle, miOne32);
  high = _mm512_mask_add_epi32(high, cmpGtVec, middle, miMOne32);
  cond = _mm512_mask_cmpge_epi32_mask(cond & ~cmpEqVec, high, low);
}

Alg. 2 depicts a branch-avoiding algorithm for list intersection—this algorithm uses conditional instructions. Such instructions do not exist in all architectures and are in fact designed for single-control-flow systems, where a single instruction is executed based on hardware flags (zero-flag, carry-flag, and overflow-flag). As such, a naïve vector implementation might be constrained to a single set of these flags or would require a single control flow. We show how to overcome this hardware constraint using the AVX-512 instruction set. Specifically, we show 1) how to increase the number of software control flows and 2) how to control the execution of each lane using masks (even though we do not have enough hardware flags).

Our experience with conditional instructions is that the

compiler is not able to ﬁgure out how to use them. We

strongly differentiate between compare and branch instruc-

tions. Compare instructions are used for evaluating different

values whereas a branch typically uses compare output to

decide on the next sequence of executable instructions.
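For reference, the compare-result pattern can be expressed in portable C, where each comparison yields 0 or 1 and is consumed arithmetically. This is a scalar sketch mirroring the CADDEQ/CADDLEQ/CADDGEQ structure of Alg. 2, not the paper's exact code:

```c
#include <stddef.h>

/* Branch-avoiding sorted-set intersection: the loop body contains no
 * data-dependent branch; the 0/1 comparison results are added directly
 * to the counter and to the two array indices. */
size_t intersect_branch_avoiding(const int *A, size_t lenA,
                                 const int *B, size_t lenB) {
    size_t ai = 0, bi = 0, count = 0;
    while (ai < lenA && bi < lenB) {
        int a = A[ai], b = B[bi];
        count += (a == b);   /* conditional add via compare result */
        ai += (a <= b);      /* advance A past smaller-or-equal element */
        bi += (a >= b);      /* advance B past smaller-or-equal element */
    }
    return count;
}
```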

a) Vectorized Intersections: We started off by describ-

ing how to increase the control ﬂow. The vectorized triangle

counting can be implemented in a variety of ways using the

branch avoiding model. In the ﬁrst approach the different

lanes work together on the same intersection (consisting of

two arrays). This approach was shown to be effective on the

GPU’s SIMT programming model [11], [13]; however, this

approach is signiﬁcantly more challenging to implement for

vector instructions. In the second approach, each lane in the

vector unit is responsible for a different intersection. Thus,

each vector unit requires 2·K different adjacency arrays for K different intersections.

We choose the second of the approaches–each lane is responsible for a different intersection. This also removes the overhead of the partitioning scheme found in [13]. Thus, the maximal number of concurrent intersections (software threads) that can be executed on a system is Concurrent = P·K.

The branch-avoiding algorithm found in Alg. 2 depicts

an initial “recipe” for implementing a vectorized triangle

counting algorithm. Note that the number of data depen-

dent branches has been signiﬁcantly reduced, yet there

still remains one condition in the control ﬂow that is

data dependent–the WHILE loop’s condition which checks

bounds of the two respective arrays. To vectorize the algo-

rithm, this condition also needs to be vectorized and this is

by no means trivial as this condition is responsible for the

entire WHILE loop. The vectorized version of the algorithm


TABLE I
NETWORKS USED IN OUR EXPERIMENTS.

Name               |V|          |E|
amazon0312         400,727      3,200,440
amazon0505         410,236      3,356,440
amazon0601         403,394      3,387,388
cit-HepTh          27,770       352,285
cit-Patents        3,774,768    16,518,947
email-EuAll        265,214      364,481
g500-s21-ef16      1,243,072    31,731,650
g500-s22-ef16      2,393,285    64,097,004
g500-s23-ef16      4,606,314    129,250,705
g500-s24-ef16      8,860,450    260,261,843
g500-s25-ef16      17,043,780   523,467,448
soc-Epinions1      75,879       405,740
soc-LiveJournal1   4,847,571    68,993,773
soc-Slashdot0811   77,360       469,180
soc-Slashdot0902   82,168       504,230

ensures that certain data lanes will be ignored if the bounds

of the indices for that lane are exceeded—each of the K

lanes is responsible for managing its own bounds. Similar

restrictions exist for the binary search based intersection

(Alg. 4).

Alg. 3 and Alg. 4 depict the vectorized code for the sorted

set intersection approach and for the binary search approach,

respectively. These algorithms show close-to-real vector code (using Intel's AVX-512 instruction set) rather than pseudo-code. This highlights the following:

•The vectorized algorithms require gathering the elements

for Kintersections (using 2·Karrays) instead of just two

arrays for a scalar execution. The introduction of efﬁcient

gather instructions has greatly simpliﬁed the process of

collecting elements from random locations in memory.

•The AVX-512 instruction set introduces masked-vector

instructions. These masks enable operating on a subset of the vector lanes and updating the counters for each lane.

Masked instructions are not conditional instructions. Specif-

ically, the masked instructions are always executed across

all the lanes; however, some data lanes might not be updated

based on the value of the mask. Another key difference is that

conditional instructions were designed for a single control

ﬂow (one per thread) whereas the masked operations allow

a vector-wide control ﬂow (for multiple control ﬂows). This

distinction is the reason that a single conditional instruction

is replaced with two masked instructions. For example,

the CADDEQ operation is replaced with a vector CMPEQ instruction followed by a masked vector add operation. While

this obviously incurs a performance penalty, it also enables

increasing the scalability of the algorithm across the vector.
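The masked semantics described above can be illustrated with a scalar emulation in C. Here K, the function name, and the mask width are illustrative stand-ins for a single AVX-512 masked-add instruction; the point is that all lanes are traversed, but only lanes whose mask bit is set update their destination.

```c
#define K 16  /* lanes per 512-bit vector of 32-bit integers */

/* Emulate a masked vector add: dst[lane] = a[lane] + b[lane] only where
 * the mask bit is set; unselected lanes keep their previous value. */
void mask_add_epi32(int dst[K], unsigned short mask,
                    const int a[K], const int b[K]) {
    for (int lane = 0; lane < K; lane++) {
        /* every lane is "executed"; the mask selects who commits */
        if ((mask >> lane) & 1)
            dst[lane] = a[lane] + b[lane];
    }
}
```

Pairing one compare-to-mask with one masked add is exactly how a single scalar conditional-add becomes two vector instructions, as described above.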

V. PERFORMANCE ANALYSIS

a) Experiment System: The experiments presented in

this paper are primarily executed on an Intel Xeon Phi 7250

processor with 96GB of DRAM memory (102.4 GB/s peak

bandwidth). This processor is part of the Xeon Knights Land-

ing series of processors. In addition to the main memory,

the processor has an additional 16GB of MCDRAM high bandwidth memory (400 GB/s peak bandwidth), which is used as our primary memory - if the graph fits into the MCDRAM, the lower-latency DRAM is not utilized. The Intel Xeon Phi 7250 has 68 cores with 272 threads (4-way SMT). These cores run at a 1.3 GHz clock and share a 32MB L2 cache. Given these system parameters and using our new algorithms, we are able to execute up to 4352

TABLE II
DIFFERENT PARALLEL VARIATIONS OF OUR TRIANGLE COUNTING ALGORITHMS. † DENOTES OUR FASTEST IMPLEMENTATION.

Algorithm name Description

Mixed-EdgeList Simple algorithm that selects intersection method based on edge properties.

lrb-scalar Scalar implementation of our LRB load-balancing.

lrb-scalar-dod Scalar implementation that includes the direction optimized graph.

lrb-vector Vectorized (branch-avoiding) implementation of our LRB load-balancing.

lrb-vector-dod †Vectorized (branch-avoiding) implementation including the direction optimized graph.

concurrent intersections.¹ We also provide results for a dual

Intel Xeon 8160 Skylake processor, with 48 cores (96 threads

with hyper-threading), 32 MB LLC, and 192GB of DDR4-

2400 memory. All code, on both systems, is compiled with

the Intel Compiler (icc) (version 2017).

b) Inputs: The algorithms are tested using real-world graphs and networks taken from SNAP [16] and the HPEC Graph Challenge [24]; see Table I. By default, all graphs are

treated as undirected. Directed graphs are transposed and

duplicate edges created in this phase are removed. Our al-

gorithm can also utilize the optimization of ﬁnding triangles

in a directed graph (where only half the edges exist). This

concept is used in [23], [20], [19] and is referred to as the

DOD graph in [19]—which is the terminology we use in this

paper.

c) LRB Analysis: Our algorithm implementation incor-

porates multiple optimizations. To capture the beneﬁts of

each of these optimizations, we execute our algorithm with

several different optimizations. Table II describes the various

optimizations we use.

Fig. 3 depicts various performance characteristics of our

new algorithms and the various optimizations for the soc-

LiveJournal1 graph - similar results were seen for other

graphs. Note the abscissa is log scale for all the sub-ﬁgures.

Fig. 3 (a) depicts the execution time as a function of the

number of threads and Fig. 3 (b) depicts the speedup for

each of these conﬁgurations in comparison with a sequential

execution of a speciﬁc algorithm. For all these conﬁgurations,

the parallel scalability is near linear all the way up to

68 threads which is the number of physical cores on the

KNL system used in our experiments. While there is some

performance improvement beyond 68 threads, the scaling is not linear. This is a well-known artifact of multiple threads

per core when resources are shared. Yet, it also shows that

LRB is successful as a load-balancing mechanism.

Fig. 3 (c) highlights the contributions of the different

optimizations of our algorithm. For each thread count, all

algorithms are normalized against the “Mixed edge-list”

implementation for a given thread count. The typical speedup

of going from the scalar execution to the vectorized execution

increases performance by roughly 2.5× for both the regular graph as well as the DOD graph. For other graphs, the

vectorization increased performance by as much as 5×.

Applying all these optimizations together greatly improves

performance over an already optimized algorithm (that se-

lects an ideal intersection kernel for each edge). Speciﬁcally,

for soc-LiveJournal this improves performance by an average

¹We note that parallelism may be limited in practice by the number of vector units. To the best of our knowledge, 4 threads (a single core) share 2 VPUs [25].


counting algorithms - including several HPEC Graph Challenge champions. Our algorithm outperformed KOKKOS, an SpMV-based HPEC Graph Challenge Champion implementation that uses vector instructions, by an average of 2.5×. Our new algorithm is also up to 4× faster than the fastest algorithm for the GPU (running on an NVIDIA P100 GPU). There are numerous instances where our new algorithm is also 5×-10× faster than these algorithms.

ACKNOWLEDGMENTS

Funding was provided in part by the Defense Advanced

Research Projects Agency (DARPA) under Contract Number

FA8750-17-C-0086. This work was partially funded by the

Doctoral Studies Program at Sandia National Laboratories.

Sandia National Laboratories is a multimission laboratory

managed and operated by National Technology & Engineer-

ing Solutions of Sandia, LLC, a wholly owned subsidiary

of Honeywell International Inc., for the U.S. Department

of Energy’s National Nuclear Security Administration under

contract DE-NA0003525. The content of the information in

this document does not necessarily reﬂect the position or

the policy of the Government, and no ofﬁcial endorsement

should be inferred. The U.S. Government is authorized to

reproduce and distribute reprints for Government purposes

notwithstanding any copyright notation here on. The authors

acknowledge the Texas Advanced Computing Center (TACC)

at The University of Texas at Austin for providing HPC re-

sources that have contributed to the research results reported

within this paper.

REFERENCES

[1] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis, “Efﬁcient Semi-

streaming Algorithms for Local Triangle Counting in Massive

Graphs,” in 14th ACM SIGKDD Int’l Conf. on Knowledge Discovery

and Data Mining, 2008, pp. 16–24.

[2] M. Bisson and M. Fatica, “Static graph challenge on gpu,” in High

Performance Extreme Computing Conference (HPEC), 2017 IEEE.

IEEE, 2017, pp. 1–8.

[3] S. Chu and J. Cheng, “Triangle listing in massive networks and its

applications,” in Proceedings of the 17th ACM SIGKDD Int’l Conf.

on Knowledge Discovery and Data Mining, 2011, pp. 672–680.

[4] J. Cohen, “Trusses: Cohesive Subgraphs for Social Network Analysis,”

National Security Agency Technical Report, p. 16, 2008.

[5] J. Fox, O. Green, K. Gabert, X. An, and D. Bader, “Fast and Adaptive

List Intersections on the GPU,” in IEEE Proc. High Performance

Extreme Computing (HPEC), Waltham, MA, 2018.

[6] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Pow-

erGraph: Distributed Graph-Parallel Computation on Natural Graphs,”

in OSDI, vol. 12, 2012.

[7] O. Green, “When Merging and Branch Predictors Collide,” in IEEE

Fourth Workshop on Irregular Applications: Architectures and Algo-

rithms, 2014, pp. 33–40.

[8] O. Green and D. Bader, “Faster Clustering Coefﬁcients Using Vertex

Covers,” in 5th ASE/IEEE International Conference on Social Com-

puting, ser. SocialCom, 2013.

[9] O. Green, M. Dukhan, and R. Vuduc, “Branch-Avoiding Graph Al-

gorithms,” in 27th ACM on Symposium on Parallelism in Algorithms

and Architectures, 2015, pp. 212–223.

[10] O. Green, J. Fox, E. Kim, F. Busato, N. Bombieri, K. Lakhotia,

S. Zhou, S. Singapura, H. Zeng, R. Kannan, V. Prasanna, and D. Bader,

“Quickly Finding a Truss in a Haystack,” in IEEE Proc. High

Performance Extreme Computing (HPEC), Waltham, MA, 2017.

[11] O. Green, R. McColl, and D. Bader., “GPU Merge Path: A GPU

Merging Algorithm,” in 26th ACM International Conference on Su-

percomputing, 2012, pp. 331–340.

[12] O. Green, L. Munguia, and D. Bader, “Load Balanced Clustering Co-

efﬁcients,” in ACM Workshop on Parallel Programming for Analytics

Applications (PPAA), Feb. 2014.

[13] O. Green, P. Yalamanchili, and L. Munguía, "Fast Triangle Counting on the GPU," in IEEE Fourth Workshop on Irregular Applications: Architectures and Algorithms, 2014, pp. 1–8.

[14] F. Khorasani, K. Vora, R. Gupta, and L. Bhuyan, “CuSha: Vertex-

Centric Graph Processing on GPUs,” in 23rd ACM Int’l Symp. on

High-Performance Parallel and Distributed Computing (HPDC), 2014,

pp. 239–252.

[15] A. Leist, K. Hawick, D. Playne, and N. S. Albany, “GPGPU and Multi-

Core Architectures for Computing Clustering Coefﬁcients of Irregular

Graphs,” in Int’l Conf. on Scientiﬁc Computing (CSC’11), 2011.

[16] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford Large Network

Dataset Collection,” http://snap.stanford.edu/data.

[17] J. Leskovec, K. J. Lang, and M. Mahoney, “Empirical comparison of

algorithms for network community detection,” in Proceedings of the

19th Int’l Conf. on World Wide Web. ACM, 2010, pp. 631–640.

[18] M. E. Newman, D. J. Watts, and S. H. Strogatz, “Random Graph

Models of Social Networks,” Proceedings of the National Academy of

Sciences, vol. 99, no. suppl 1, pp. 2566–2572, 2002.

[19] R. Pearce, “Triangle counting for scale-free graphs at scale in dis-

tributed memory,” in High Performance Extreme Computing Confer-

ence (HPEC), 2017 IEEE. IEEE, 2017, pp. 1–4.

[20] A. Polak, “Counting triangles in large graphs on GPU,” arXiv preprint

arXiv:1503.00576, 2015.

[21] A. Prat-Pérez, D. Dominguez-Sal, J. M. Brunat, and J.-L. Larriba-Pey, "Shaping Communities out of Triangles," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ser. CIKM '12, 2012, pp. 1677–1681.

[22] S. Samsi, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra,

P. Monticciolo, A. Reuther, S. Smith, W. Song, D. Staheli, and

J. Kepner, “Static Graph Challenge: Subgraph Isomorphism,” in IEEE

Proc. High Performance Extreme Computing (HPEC), Waltham, MA,

2017.

[23] J. Shun and K. Tangwongsan, “Multicore Triangle Computations

Without Tuning,” in IEEE Int’l Conf. on Data Engineering (ICDE),

2015.

[24] S. Siddharth, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra,

P. Monticciolo, A. Reuther, S. Smith, W. Song, D. Staheli, and

J. Kepner, “Static graph challenge: Subgraph isomorphism,” in IEEE

Proc. High Performance Embedded Computing Workshop (HPEC),

Waltham, MA, 2017.

[25] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, "Knights Landing: Second-Generation Intel Xeon Phi Product," IEEE Micro, vol. 36, no. 2, pp. 34–46, 2016.

[26] T. Schank and D. Wagner, “Finding, Counting and Listing All Tri-

angles in Large Graphs, an Experimental Study,” in Experimental &

Efﬁcient Algorithms. Springer, 2005, pp. 606–609.

[27] A. S. Tom, N. Sundaram, N. K. Ahmed, S. Smith, S. Eyerman,

M. Kodiyath, I. Hur, F. Petrini, and G. Karypis, “Exploring opti-

mizations on shared-memory platforms for parallel triangle counting

algorithms,” in High Performance Extreme Computing Conference

(HPEC), 2017 IEEE. IEEE, 2017, pp. 1–7.

[28] J. Wang and J. Cheng, “Truss Decomposition in Massive Networks,”

Proceedings of the VLDB Endowment, vol. 5, no. 9, pp. 812–823,

2012.

[29] L. Wang, Y. Wang, C. Yang, and J. D. Owens, “A comparative study

on exact triangle counting algorithms on the gpu,” in Proceedings of

the ACM Workshop on High Performance Graph Processing. ACM,

2016, pp. 1–8.

[30] D. J. Watts and S. H. Strogatz, “Collective Dynamics of ‘Small-World’

Networks,” Nature, vol. 393, no. 6684, pp. 440–442, 1998.

[31] M. M. Wolf, M. Deveci, J. W. Berry, S. D. Hammond, and S. Rajaman-

ickam, “Fast linear algebra-based triangle counting with kokkosker-

nels,” in High Performance Extreme Computing Conference (HPEC),

2017 IEEE. IEEE, 2017, pp. 1–7.

[32] J. Yang and J. Leskovec, “Deﬁning and evaluating network communi-

ties based on ground-truth,” in Data Mining (ICDM), 2012 IEEE 12th

International Conference on. IEEE, 2012, pp. 745–754.
