All content in this area was uploaded by Santosh Pandey on Aug 29, 2020


Multicore Triangle Computations Without Tuning

Julian Shun and Kanat Tangwongsan

Presentation is based on the paper published at the International Conference on Data Engineering (ICDE), 2015

Triangle Computations

•Triangle Counting

[Figure: example social network on Alice, Bob, Carol, David, Eve, Fred, Greg, and Hannah, with three triangles checked off; Count = 3]

•Other variants:
  •Triangle listing
  •Local triangle counting/clustering coefficients
  •Triangle enumeration
  •Approximate counting
  •Analogs on directed graphs

•Numerous applications…
  •Social network analysis, Web structure, spam detection, outlier detection, dense subgraph mining, 3-way database joins, etc.

Need fast triangle computation algorithms!

•Sequential algorithms for exact counting/listing
  •Naïve algorithm of trying all triplets: O(V^3) work
  •Node-iterator algorithm [Schank]: O(VE) work
  •Edge-iterator algorithm [Itai-Rodeh]: O(VE) work
  •Tree-lister [Itai-Rodeh], forward/compact-forward [Schank-Wagner, Latapy]: O(E^1.5) work

•Sequential algorithms via matrix multiplication
  •O(V^2.37) work to compute A^3, where A is the adjacency matrix
  •O(E^1.41) work [Alon-Yuster-Zwick]
  •These require superlinear space
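To make the first two entries of the list concrete, here is a minimal Python sketch of the naïve all-triplets approach and of a node-iterator style count. The function names and the small example graph are illustrative assumptions, not taken from the talk, and the code favors clarity over speed.

```python
from itertools import combinations


def count_naive(n, edges):
    """Try every vertex triplet and test its three edges: O(V^3) work."""
    adj = {frozenset(e) for e in edges}
    return sum(
        1
        for a, b, c in combinations(range(n), 3)
        if frozenset((a, b)) in adj
        and frozenset((b, c)) in adj
        and frozenset((a, c)) in adj
    )


def count_node_iterator(n, edges):
    """For each vertex, test every pair of its neighbors for adjacency."""
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    total = 0
    for v in range(n):
        for a, b in combinations(sorted(nbrs[v]), 2):
            if b in nbrs[a]:
                total += 1
    return total // 3  # each triangle is found once from each of its corners


# Hypothetical 5-vertex graph with triangles {0,1,2} and {1,2,3}.
EXAMPLE_EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
```

Both functions return the same count on the example graph; the node-iterator version only inspects pairs of neighbors, which is where its advantage over the triplet scan comes from on sparse graphs.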

Sequential Triangle Computation Algorithms

V = # vertices, E = # edges

What about parallel algorithms?

Source: “Algorithmic Aspects of Triangle-Based Network Analysis”, dissertation by Thomas Schank

Parallel Triangle Computation Algorithms

•Most designed for distributed memory
  •MapReduce algorithms [Cohen ’09, Suri-Vassilvitskii ’11, Park-Chung ’13, Park et al. ’14]
  •MPI algorithms [Arifuzzaman et al. ’13, GraphLab]

•What about shared-memory multicore?
  •Multicores are everywhere!
  •Node-iterator algorithm [Green et al. ’14]
    •O(VE) work in worst case

•Can we obtain an O(E^1.5) work shared-memory multicore algorithm?

Triangle Computation: Challenges for Shared Memory Machines

1. Irregular computation

2. Deep memory hierarchy

Cache Complexity Model

[Figure: two two-level memory diagrams. Cache model: CPU, cache of size M, main memory; accesses to the cache are free, and transferring a line of size B from main memory has unit cost. External memory model: CPU, main memory of size M, disk.]

Complexity = # cache misses (# disk accesses in the external memory model)

Cache-aware (external-memory) algorithms: have knowledge of M and B

Cache-oblivious algorithms: no knowledge of parameters

Cache Oblivious Model [Frigo et al. ’99]

[Figure: CPU, cache of size M with block size B, main memory]

•Algorithm works well regardless of cache parameters
•Works well on multi-level hierarchies
•Parallel Cache Oblivious Model for hierarchies of shared and private caches [Blelloch et al. ’11]

[Figure: multi-level hierarchy below the CPU: L1 cache (size M1, block size B1), L2 cache (size M2, block size B2), L3 cache (size M3, block size B3)]

Primitive            Work          Depth         Cache Complexity
Scan/filter/merge    O(n)          O(log n)      O(n/B)
Sort                 O(n log n)    O(log^2 n)    O((n/B) log_{M/B}(n/B))

External-Memory and Cache-Oblivious Triangle Computation

•All previous algorithms are sequential
•External-memory (cache-aware) algorithms
  •Natural-join: O(E^3/(M^2 B)) I/O’s
  •Node-iterator [Dementiev ’06]: O((E^1.5/B) log_{M/B}(E/B)) I/O’s
  •Compact-forward [Menegola ’10]: O(E + E^1.5/B) I/O’s
  •[Chu-Cheng ’11, Hu et al. ’13]: O(E^2/(MB) + #triangles/B) I/O’s
•External-memory and cache-oblivious
  •[Pagh-Silvestri ’14]: O(E^1.5/(M^0.5 B)) I/O’s or cache misses
•Parallel cache-oblivious algorithms?

Our Contributions

1. Parallel Cache-Oblivious Triangle Counting Algs

Algorithm              Work               Depth         Cache Complexity
TC-Merge               O(E^1.5)           O(log^2 E)    O(E + E^1.5/B)
TC-Hash                O(V log V + αE)    O(log^2 E)    O(sort(V) + αE)
Par. Pagh-Silvestri    O(E^1.5)           O(log^3 E)    O(E^1.5/(M^0.5 B))

V = # vertices, E = # edges, α = arboricity (at most E^0.5)
M = cache size, B = line size, sort(n) = (n/B) log_{M/B}(n/B)

2. Extensive Experimental Study

3. Extensions to Other Triangle Computations: Enumeration, Listing, Local Counting/Clustering Coefficients, Approx. Counting, Variants on Directed Graphs

Sequential Triangle Counting (Exact)

[Figure: example graph on vertices 0 to 4]

(Forward/compact-forward algorithm) [Schank-Wagner ’05, Latapy ’08]

Step 1: Rank vertices by degree (sorting). Return A[v] for all v, storing higher-ranked neighbors

Step 2:
for each vertex v:
  for each w in A[v]:
    count += intersect(A[v], A[w])

Work = O(E^1.5)

Gives all triangles (v, w, x) where rank(v) < rank(w) < rank(x)
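The two steps of the forward algorithm can be sketched sequentially as follows. This is a minimal illustration under stated assumptions: vertices are integers 0..n-1, edges are undirected pairs without duplicates, ties in degree are broken by vertex id, and the names `rank_and_orient`, `intersect_size`, and `count_triangles` are invented for this example.

```python
def rank_and_orient(n, edges):
    """Step 1: rank vertices by degree; A[v] keeps only higher-ranked neighbors."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    order = sorted(range(n), key=lambda v: (deg[v], v))  # ties broken by id
    rank = [0] * n
    for r, v in enumerate(order):
        rank[v] = r
    A = [[] for _ in range(n)]
    for u, v in edges:
        lo, hi = (u, v) if rank[u] < rank[v] else (v, u)
        A[lo].append(hi)
    for lst in A:
        lst.sort()  # sorted lists allow a merge-based intersect
    return A


def intersect_size(a, b):
    """Count common elements of two sorted lists with a two-pointer merge."""
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def count_triangles(n, edges):
    """Step 2: sum |A[v] ∩ A[w]| over all ranked edges (v, w)."""
    A = rank_and_orient(n, edges)
    return sum(intersect_size(A[v], A[w]) for v in range(n) for w in A[v])
```

Because every triangle has exactly one lowest-ranked vertex v and one middle-ranked vertex w, it is counted once, at the edge (v, w).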

Proof of O(E^1.5) work bound when intersect uses merging

Step 1: Rank vertices by degree (sorting). Return A[v] for all v, storing higher-ranked neighbors

Step 2:
for each vertex v:
  for each w in A[v]:
    count += intersect(A[v], A[w])

•Step 1: O(E + V log V) work
•Step 2:
  •For each edge (v,w), intersect does O(d+(v) + d+(w)) work, where d+(v) = |A[v]| is the number of higher-ranked neighbors of v
  •For all v, d+(v) ≤ E^0.5
  •If d+(v) > E^0.5, each of its higher-degree neighbors also has degree > E^0.5, so the total number of directed edges > E, a contradiction
  •Total work = E * O(E^0.5) = O(E^1.5)
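The counting argument can be written out as a short derivation. Here d(u) denotes the degree of u and d^+(v) = |A[v]|; reading E as the number of directed edges (each undirected edge counted in both directions, so that the degrees sum to E) is an assumption made to match the slide's phrase "total number of directed edges".

```latex
\begin{align*}
d^+(v) > \sqrt{E}
  &\implies v \text{ has more than } \sqrt{E} \text{ neighbors } u
     \text{ with } d(u) \ge d(v) \ge d^+(v) > \sqrt{E} \\
  &\implies \sum_{u} d(u) > \sqrt{E}\cdot\sqrt{E} = E,
     \text{ contradicting } \sum_{u} d(u) = E. \\[4pt]
\text{Hence } d^+(v) \le \sqrt{E} \text{ for all } v, \text{ and }\quad
\sum_{(v,w)} O\bigl(d^+(v) + d^+(w)\bigr)
  &\le E \cdot O\bigl(\sqrt{E}\bigr) = O\bigl(E^{1.5}\bigr).
\end{align*}
```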


Parallel Triangle Counting (Exact)

Step 1: Rank vertices by degree (sorting). Return A[v] for all v, storing higher-ranked neighbors
[parallel sort and filter]

Step 2:
parallel_for each vertex v:
  parallel_for each w in A[v]:
    count += intersect(A[v], A[w])
[count is accumulated with a parallel reduction; intersect uses a parallel merge (TC-Merge) or a parallel hash table (TC-Hash)]

Step 1
Work = O(E + V log V)
Depth = O(log^2 V)
Cache = O(E + sort(V))

[Figure: nested parallel loops on the example graph. The outer parfor over v ∈ V spawns one parfor over w ∈ A[v] for each of v = 0, ..., 4. The leaves are the calls intersect(A[0], A[1]), intersect(A[0], A[3]), intersect(A[2], A[1]), intersect(A[3], A[1]), intersect(A[4], A[1]), and intersect(A[4], A[3]), whose results are summed. All of them are safe to run in parallel.]
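The structure in the figure (independent intersect calls followed by a sum) can be sketched with Python's standard `concurrent.futures`. This only illustrates the shape of the computation: the talk's implementation uses Cilk Plus, and CPython threads will not reproduce its speedups. The names `parallel_count`, `intersect_size`, and `EXAMPLE_A` are illustrative, and `EXAMPLE_A` is inferred from the intersect calls listed in the figure rather than given explicitly in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def intersect_size(a, b):
    """Number of common elements; reads its inputs and writes nothing shared."""
    return len(set(a) & set(b))


def parallel_count(A, workers=4):
    """A[v] holds the higher-ranked neighbors of v (the output of step 1)."""
    # Flatten the two nested parallel loops into one list of (v, w) tasks.
    pairs = [(v, w) for v in range(len(A)) for w in A[v]]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda vw: intersect_size(A[vw[0]], A[vw[1]]), pairs)
        # The sum plays the role of the parallel reduction on count.
        return sum(partial)


# Adjacency lists consistent with the six intersect calls in the figure
# (an inference from the slide, not data stated in it).
EXAMPLE_A = [[1, 3], [], [1], [1], [1, 3]]
```

Every task only reads the adjacency lists and returns a number, which is why no locking is needed and all calls can run at once.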

TC-Merge and TC-Hash Details

Step 2:
parallel_for each vertex v:
  parallel_for each w in A[v]:
    count += intersect(A[v], A[w])
[count is accumulated with a parallel reduction; intersect uses a parallel merge (TC-Merge) or a parallel hash table (TC-Hash)]

Step 2: TC-Merge
Work = O(E^1.5)
Depth = O(log^2 E)
Cache = O(E + E^1.5/B)

Step 2: TC-Hash
Work = O(αE)
Depth = O(log E)
Cache = O(αE)

(α = arboricity (at most E^0.5))

•TC-Merge
  •Preprocessing: sort adjacency lists
  •Intersect: use a parallel and cache-oblivious merge based on divide-and-conquer [Blelloch et al. ’11]
•TC-Hash
  •Preprocessing: for each vertex, create a parallel hash table storing edges [Shun-Blelloch ’14]
  •Intersect: scan the smaller list, querying the hash table of the larger list in parallel
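The two intersect strategies can be sketched as below. This is a sequential illustration, not the cited parallel primitives: `intersect_dc` shows the divide-and-conquer shape (its two recursive calls touch disjoint ranges, so a parallel runtime could fork them), and `intersect_hash` uses a Python `set` as a stand-in for the per-vertex hash table. Both function names are assumptions, and list slicing is used for brevity where a real implementation would pass index ranges.

```python
from bisect import bisect_left


def intersect_dc(a, b):
    """Count common elements of two sorted, duplicate-free lists."""
    if not a or not b:
        return 0
    if len(a) < len(b):
        a, b = b, a  # always split the longer list
    mid = len(a) // 2
    pivot = a[mid]
    k = bisect_left(b, pivot)  # b[:k] < pivot <= b[k:]
    hit = 1 if k < len(b) and b[k] == pivot else 0
    # The two sub-problems are independent and could run in parallel.
    return intersect_dc(a[:mid], b[:k]) + hit + intersect_dc(a[mid + 1:], b[k + hit:])


def intersect_hash(list_v, list_w, table_v, table_w):
    """Scan the smaller list and probe the hash table of the larger one."""
    if len(list_v) <= len(list_w):
        return sum(1 for x in list_v if x in table_w)
    return sum(1 for x in list_w if x in table_v)
```

For example, with `a = [1, 3, 5, 7, 9]` and `b = [3, 4, 5, 9, 10]`, both `intersect_dc(a, b)` and `intersect_hash(a, b, set(a), set(b))` count the three shared values 3, 5, and 9.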

Comparison of Complexity Bounds

Algorithm                       Work                       Depth         Cache Complexity
TC-Merge                        O(E^1.5)                   O(log^2 E)    O(E + E^1.5/B) (oblivious)
TC-Hash                         O(V log V + αE)            O(log^2 E)    O(sort(V) + αE) (oblivious)
Par. Pagh-Silvestri             O(E^1.5)                   O(log^3 E)    O(E^1.5/(M^0.5 B)) (oblivious)
Chu-Cheng ’11, Hu et al. ’13    O(E log E + E^2/M + αE)    -             O(E^2/(MB) + #triangles/B) (aware)
Pagh-Silvestri ’14              O(E^1.5)                   -             O(E^1.5/(M^0.5 B)) (oblivious)
Green et al. ’14                O(VE)                      O(log E)      -

V = # vertices, E = # edges, α = arboricity (at most E^0.5)
M = cache size, B = line size, sort(n) = (n/B) log_{M/B}(n/B)


Extensions of Exact Counting Algorithms

•Triangle enumeration
  •Call emit function whenever triangle is found
  •Listing: add each triangle to a hash table; return its contents at the end
  •Local counting/clustering coefficients: atomically increment the counts of the three triangle endpoints
•Directed triangle counting/enumeration
  •Keep separate counts for different types of triangles
•Approximate counting
  •Use colorful triangle sampling scheme to create smaller sub-graph [Pagh-Tsourakakis ’12]
  •Run TC-Merge or TC-Hash on sub-graph with pE edges (0 < p < 1) and return #triangles/p^2 as estimate
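The enumeration, listing, and local-counting extensions all amount to plugging a different callback into the same loop. A minimal sequential sketch, assuming A[v] holds the higher-ranked neighbors of v as in the exact counting algorithm; the names `enumerate_triangles`, `list_triangles`, and `local_counts` are hypothetical. Since this version is sequential, plain increments suffice; a parallel version would need the atomic increments mentioned above.

```python
from collections import Counter


def enumerate_triangles(A, emit):
    """Call emit(v, w, x) once for every triangle found."""
    for v in range(len(A)):
        for w in A[v]:
            for x in sorted(set(A[v]) & set(A[w])):
                emit(v, w, x)


def list_triangles(A):
    """Listing: collect triangles in a hash-based set, return them at the end."""
    found = set()
    enumerate_triangles(A, lambda v, w, x: found.add((v, w, x)))
    return sorted(found)


def local_counts(A):
    """Local counting: bump the count of each of the three endpoints."""
    counts = Counter()

    def bump(v, w, x):
        counts[v] += 1
        counts[w] += 1
        counts[x] += 1

    enumerate_triangles(A, bump)
    return counts
```

On the complete graph on four vertices, oriented as `[[1, 2, 3], [2, 3], [3], []]`, listing yields four triangles and every vertex ends up with a local count of 3.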

Approximate Counting

•Colorful triangle counting [Pagh-Tsourakakis ’12]

Step 1: Assign a random color in {1, …, 1/p} to each vertex [parallel scan]
Step 2: Sampling: keep edges whose endpoints have the same color [parallel filter]
Step 3: Run exact triangle counting on the sampled graph, return Δsampled/p^2 [use TC-Merge or TC-Hash]

Sampling rate: 0 < p < 1
Expected # edges = pE

Steps 1 & 2
Work = O(E)
Depth = O(log E)
Cache = O(E/B)

Step 3: TC-Merge
Work = O((pE)^1.5)
Depth = O(log^2 E)
Cache = O(pE + (pE)^1.5/B)

Step 3: TC-Hash
Work = O(V log V + αpE)
Depth = O(log E)
Cache = O(sort(V) + pαE)
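The three sampling steps can be sketched as follows. This is a sequential illustration under assumptions: 1/p is treated as an integer number of colors, colors are numbered from 0 rather than 1, a simple edge-based counter stands in for TC-Merge or TC-Hash, and all function names are invented for the example.

```python
import random


def count_exact(n, edges):
    """Stand-in exact counter: each triangle is seen once per edge."""
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return sum(len(nbrs[u] & nbrs[v]) for u, v in edges) // 3


def colorful_sample(n, edges, p, seed=0):
    """Steps 1 and 2: color vertices, keep edges with same-colored endpoints."""
    rng = random.Random(seed)
    num_colors = round(1 / p)
    color = [rng.randrange(num_colors) for _ in range(n)]
    return [(u, v) for u, v in edges if color[u] == color[v]]


def estimate_triangles(n, edges, p, seed=0):
    """Step 3: count exactly on the sample and scale by 1/p^2."""
    sampled = colorful_sample(n, edges, p, seed)
    return count_exact(n, sampled) / p ** 2
```

A triangle survives sampling exactly when its three vertices share a color, which happens with probability p^2; dividing the sampled count by p^2 therefore gives an unbiased estimate of the true count.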


Experimental Setup


•Implementations using Intel Cilk Plus

•40-core Intel Nehalem machine (with 2-way hyper-threading)

•4 sockets, each with 30MB shared L3 cache, 256KB private L2 caches

•Sequential TC-Merge as baseline (faster than existing sequential implementations)

•Other multicore implementations: Green et al. and GraphLab

•Our parallel Pagh-Silvestri algorithm was not competitive

•Variety of real-world and artificial graphs

Both TC-Merge and TC-Hash scale well with # of cores:

[Figure: speedup versus number of cores on two graphs, LiveJournal (4M vertices, 34.6M edges) and Orkut (3M vertices, 117M edges); annotated speedups of ~27x and ~48x]

40-core (with hyper-threading) Performance

[Figure: bar chart of speedup over sequential TC-Merge (axis 0 to 50) for TC-Merge, TC-Hash, Green et al., and GraphLab on: random (V=100M, E=500M), rMat (V=134M, E=500M), 3D-grid (V=100M, E=300M), soc-LJ (V=5M, E=43M), Patents (V=3.7M, E=17M), com-LJ (V=4M, E=35M), Orkut (V=3M, E=117M)]

•TC-Merge always faster than TC-Hash (by 1.3-2.5x)
•TC-Merge always faster than Green et al. or GraphLab (by 2.1-5.2x)

Why is TC-Merge faster than TC-Hash?

[Figure: two bar charts, one for soc-LJ and one for Orkut, comparing TC-Merge and TC-Hash on running time, L3 cache misses, L2 cache misses, and # ops for intersect; all values normalized to TC-Merge (axis 0 to 2)]

•TC-Hash less cache-efficient than TC-Merge

•Running time more correlated with cache misses than work

Comparison to existing counting algs.

[Figure: bar chart of running time in minutes (axis 0 to 20) on the Twitter graph (41M vertices, 1.2B undirected edges, 34.8B triangles) for TC-Merge (40 cores), GraphLab (40 cores), GraphLab (MPI, 64 nodes, 1024 cores), PATRIC (MPI, 1200 cores), Park and Chung (MapReduce, 47 nodes), and Suri and Vassilvitskii (MapReduce, 1636 nodes); two bars run off the axis and are labeled 213 minutes and 423 minutes]

•Yahoo graph (1.4B vertices, 6.4B edges, 85.8B triangles) on 40 cores: TC-Merge takes 78 seconds
  •Approximate counting algorithm achieves 99.6% accuracy in 9.1 seconds

Approximate counting

p = 1/25                   Accuracy    T_approx     T_approx/T_exact
Orkut (V=3M, E=117M)       99.8%       0.067 sec    0.035
Twitter (V=41M, E=1.2B)    99.9%       2.4 sec      0.043
Yahoo (V=1.4B, E=6.4B)     99.6%       9.1 sec      0.117

[Figure: T_approx/T_exact (axis 0 to 0.5) as a function of p (0 to 0.5) for soc-LJ, com-LJ, and Orkut]

Conclusion

•Simple multicore algorithms for triangle computations are provably work-efficient, low-depth and cache-friendly
•Implementations require no load-balancing or tuning for cache
•Experimentally, they outperform existing multicore and distributed algorithms
•Future work: Design a practical parallel algorithm achieving O(E^1.5/(M^0.5 B)) cache complexity

Algorithm              Work               Depth         Cache Complexity
TC-Merge               O(E^1.5)           O(log^2 E)    O(E + E^1.5/B)
TC-Hash                O(V log V + αE)    O(log^2 E)    O(sort(V) + αE)
Par. Pagh-Silvestri    O(E^1.5)           O(log^3 E)    O(E^1.5/(M^0.5 B))