
PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations

U Kang

SCS, Carnegie Mellon University

ukang@cs.cmu.edu

Charalampos E. Tsourakakis

SCS, Carnegie Mellon University

ctsourak@cs.cmu.edu

Christos Faloutsos

SCS, Carnegie Mellon University

christos@cs.cmu.edu

Abstract—In this paper, we describe PEGASUS, an open

source Peta Graph Mining library which performs typical

graph mining tasks such as computing the diameter of the

graph, computing the radius of each node and finding the

connected components. As the size of graphs reaches several

Giga-, Tera- or Peta-bytes, the necessity for such a library

grows too. To the best of our knowledge, PEGASUS is the first

such library, implemented on top of the HADOOP platform,

the open source version of MAPREDUCE.

Many graph mining operations (PageRank, spectral cluster-

ing, diameter estimation, connected components etc.) are es-

sentially a repeated matrix-vector multiplication. In this paper

we describe a very important primitive for PEGASUS, called

GIM-V (Generalized Iterated Matrix-Vector multiplication).

GIM-V is highly optimized, achieving (a) good scale-up on the

number of available machines, (b) linear running time on the

number of edges, and (c) more than 5 times faster performance

over the non-optimized version of GIM-V.

Our experiments ran on M45, one of the top 50 supercom-

puters in the world. We report our findings on several real

graphs, including one of the largest publicly available Web

Graphs, thanks to Yahoo!, with ≈ 6.7 billion edges.

Keywords-PEGASUS; graph mining; hadoop

I. INTRODUCTION

Graphs are ubiquitous: computer networks, social net-

works, mobile call networks, the World Wide Web [1],

protein regulation networks, to name a few.

The large volume of available data, the low cost of storage

and the stunning success of online social networks and

Web 2.0 applications all lead to graphs of unprecedented

size. Typical graph mining algorithms silently assume that

the graph fits in the memory of a typical workstation, or

at least on a single disk; the above graphs violate these

assumptions, spanning multiple Giga-bytes, and heading to

Tera- and Peta-bytes of data.

A promising tool is parallelism, and specifically MAPRE-

DUCE [2] and its open source version, HADOOP. Based

on HADOOP, here we describe PEGASUS, a graph min-

ing package for handling graphs with billions of nodes

and edges. The PEGASUS code and several datasets are

at http://www.cs.cmu.edu/~pegasus. The contributions are

the following:

1) Unification of seemingly different graph mining tasks,

via a generalization of matrix-vector multiplication

(GIM-V).

2) The careful implementation of GIM-V, with several

optimizations, and several graph mining operations

(PageRank, Random Walk with Restart(RWR), diame-

ter estimation, and connected components). Moreover,

the method is linear on the number of edges, and scales

up well with the number of available machines.

3) Performance analysis, pinpointing the most successful

combination of optimizations, which leads to up to 5

times better speed than the naive implementation.

4) Analysis of large, real graphs, including one of the

largest publicly available graphs ever analyzed,

Yahoo’s web graph.

The rest of the paper is organized as follows. Sec-

tion II presents the related work. Section III describes our

framework and explains several graph mining algorithms.

Section IV discusses optimizations that allow us to achieve

significantly faster performance in practice. In Section V we

present timing results, and in Section VI our findings on

real-world, large-scale graphs. We conclude in Section VII.

II. BACKGROUND AND RELATED WORK

The related work forms two groups: graph mining and

HADOOP.

Large-Scale Graph Mining: There are a huge number

of graph mining algorithms, computing communities (e.g.,

[3], DENGRAPH [4], METIS [5]), subgraph discovery (e.g.,

GraphSig [6], [7], [8], [9], gPrune [10], gApprox [11],

gSpan [12], Subdue [13], HSIGRAM/VSIGRAM [14],

ADI [15], CSV [16]), finding important nodes (e.g., PageR-

ank [17] and HITS [18]), computing the number of tri-

angles [19], [20], computing the diameter [21], topic de-

tection [22], attack detection [23], with too-many-to-list

alternatives for each of the above tasks. Most of the previous

algorithms do not scale, at least directly, to several millions

and billions of nodes and edges.

For connected components, there are several algorithms,

using Breadth-First Search, Depth-First Search, “propagation”

([24], [25], [26]), or “contraction” [27]. These works

rely on a shared memory model which limits their ability to

handle large, disk-resident graphs.

MapReduce and Hadoop: MAPREDUCE is a programming

framework [2] [28] for processing huge amounts of

unstructured data in a massively parallel way. MAPREDUCE

has two major advantages: (a) the programmer is oblivious


of the details of the data distribution, replication, load bal-

ancing etc. and furthermore (b) the programming concept is

familiar, i.e., the concept of functional programming. Briefly,

the programmer needs to provide only two functions, a map

and a reduce. The typical framework is as follows [29]: (a)

the map stage sequentially passes over the input file and

outputs (key, value) pairs; (b) the shuffling stage groups

all values by key; (c) the reduce stage processes the values

with the same key and outputs the final result.

HADOOP is the open source implementation of MAPRE-

DUCE. HADOOP provides the Distributed File System

(HDFS) [30] and PIG, a high level language for data

analysis [31]. Due to its power, simplicity and the fact

that building a small cluster is relatively cheap, HADOOP

is a very promising tool for large scale graph mining

applications, something already reflected in academia, see

[32]. In addition to PIG, there are several high-level languages

and environments for advanced MAPREDUCE-like systems,

including SCOPE [33], Sawzall [34], and Sphere [35].

III. PROPOSED METHOD

How can we quickly find connected components, diameter, PageRank, and node proximities of very large graphs? We show that, even if they seem unrelated, we can unify them using the GIM-V primitive, standing for Generalized Iterative Matrix-Vector multiplication, which we describe next.

A. Main Idea

GIM-V, or ‘Generalized Iterative Matrix-Vector multiplication’, is a generalization of normal matrix-vector multiplication. Suppose we have an n by n matrix M and a vector v of size n. Let $m_{i,j}$ denote the (i,j)-th element of M. Then the usual matrix-vector multiplication is

$M \times v = v'$ where $v'_i = \sum_{j=1}^{n} m_{i,j} v_j$.

There are three operations in the previous formula, which, if customized separately, will give a surprising number of useful graph mining algorithms:

1) combine2: multiply $m_{i,j}$ and $v_j$.

2) combineAll: sum n multiplication results for node i.

3) assign: overwrite previous value of $v_i$ with new result to make $v'_i$.

In GIM-V, let’s define the operator $\times_G$, where the three operations can be defined arbitrarily. Formally, we have:

$v' = M \times_G v$

where $v'_i = \text{assign}(v_i, \text{combineAll}_i(\{x_j \mid j = 1..n, \text{ and } x_j = \text{combine2}(m_{i,j}, v_j)\}))$.

The functions combine2(), combineAll(), and assign() have the following signatures (generalizing the product, sum and assignment, respectively, that the traditional matrix-vector multiplication requires):

1) combine2($m_{i,j}$, $v_j$): combine $m_{i,j}$ and $v_j$.

2) combineAll_i($x_1, ..., x_n$): combine all the results from combine2() for node i.

3) assign($v_i$, $v_{new}$): decide how to update $v_i$ with $v_{new}$.

The ‘Iterative’ in the name of GIM-V denotes that

we apply the ×G operation until an algorithm-specific

convergence criterion is met. As we will see in a moment,

by customizing these operations, we can obtain different,

useful algorithms including PageRank, Random Walk with

Restart, connected components, and diameter estimation.

But first we want to highlight the strong connection of

GIM-V with SQL: When combineAll_i() and assign()

can be implemented by user defined functions, the operator

×G can be expressed concisely in terms of SQL. This

viewpoint is important when we implement GIM-V in large

scale parallel processing platforms, including HADOOP, if

they can be customized to support several SQL primitives

including JOIN and GROUP BY. Suppose we have an edge

table E(sid, did, val) and a vector table V(id,

val), corresponding to a matrix and a vector, respectively.

Then, ×G corresponds to the following SQL statement; we assume that we have (built-in or user-defined) functions combineAll_i() and combine2(), and we also assume that the resulting table/vector will be fed into the assign() function (omitted, for clarity):

SELECT E.sid, combineAll_{E.sid}(combine2(E.val, V.val))

FROM E, V

WHERE E.did = V.id

GROUP BY E.sid

In the following sections we show how we can customize

GIM-V, to handle important graph mining operations in-

cluding PageRank, Random Walk with Restart, diameter

estimation, and connected components.
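As a rough illustration of the join-then-group view above, the following Python sketch applies one ×G step to an in-memory edge list. The function name `gimv`, the dictionary representation of the vector, and the argument names are assumptions made for this example; they are not part of PEGASUS.

```python
from collections import defaultdict

def gimv(edges, vector, combine2, combine_all, assign):
    """One application of v' = M x_G v (illustrative sketch only).

    edges:  iterable of (sid, did, val) rows, like the edge table E
    vector: dict {id: val}, like the vector table V
    """
    # JOIN E and V on E.did = V.id, then GROUP BY E.sid
    groups = defaultdict(list)
    for sid, did, val in edges:
        if did in vector:
            groups[sid].append(combine2(val, vector[did]))
    # Apply combineAll per group, then assign against the old value
    result = {}
    for node, old in vector.items():
        result[node] = assign(old, combine_all(node, groups.get(node, [])))
    return result

# Ordinary matrix-vector multiplication as a special case:
# M = [[1, 2], [0, 3]], v = [10, 20]
edges = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
v = {0: 10.0, 1: 20.0}
product = gimv(
    edges, v,
    combine2=lambda m, x: m * x,          # multiply
    combine_all=lambda i, xs: sum(xs),    # sum
    assign=lambda old, new: new,          # overwrite
)
print(product)  # {0: 50.0, 1: 60.0}
```

Swapping the three callables is all that changes between the algorithms in the following subsections.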

B. GIM-V and PageRank

Our first application of GIM-V is PageRank, a famous

algorithm that was used by Google to calculate relative

importance of web pages [17]. The PageRank vector p of n

web pages satisfies the following eigenvector equation:

$p = (cE^T + (1 - c)U)p$

where c is a damping factor (usually set to 0.85), E is the

row-normalized adjacency matrix (source, destination), and

U is a matrix with all elements set to 1/n.

To calculate the eigenvector p we can use the power

method, which multiplies an initial vector with the matrix,

several times. We initialize the current PageRank vector $p^{cur}$

and set all its elements to 1/n. Then the next PageRank

$p^{next}$ is calculated by $p^{next} = (cE^T + (1 - c)U)p^{cur}$. We

continue to do the multiplication until p converges.
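The power method just described can be sketched in a few lines of Python. This is a single-machine illustration, not the HADOOP implementation; the function name `pagerank`, the tolerance, and the toy graph are assumptions, and every node is assumed to have at least one out-link so that the row normalization is well defined.

```python
def pagerank(out_links, c=0.85, tol=1e-10, max_iter=200):
    """Power method for p = (c E^T + (1 - c) U) p (illustrative sketch).

    out_links: dict {node: list of destination nodes}; assumes no
    dangling nodes (every node has at least one out-link).
    """
    nodes = list(out_links)
    n = len(nodes)
    p = {u: 1.0 / n for u in nodes}            # p_cur initialised to 1/n
    for _ in range(max_iter):
        # (1 - c) U p adds (1 - c)/n to every entry because sum(p) == 1
        nxt = {u: (1.0 - c) / n for u in nodes}
        for src, dsts in out_links.items():
            share = c * p[src] / len(dsts)     # row-normalised E, transposed
            for dst in dsts:
                nxt[dst] += share
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:                        # p has converged
            break
    return p

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Note that the dense matrix U is never materialized: its effect is the constant (1 − c)/n added to every entry.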

PageRank is a direct application of GIM-V. In this view, we first construct a matrix M by column-normalizing $E^T$ such that every column of M sums to 1. Then the next PageRank is calculated by $p^{next} = M \times_G p^{cur}$ where the three operations are defined as follows:

1) combine2($m_{i,j}$, $v_j$) = $c \times m_{i,j} \times v_j$

2) combineAll_i($x_1, ..., x_n$) = $\frac{(1-c)}{n} + \sum_{j=1}^{n} x_j$

3) assign($v_i$, $v_{new}$) = $v_{new}$

C. GIM-V and Random Walk with Restart

Random Walk with Restart (RWR) is an algorithm to measure the proximity of nodes in a graph [36]. In RWR, the proximity vector $r_k$ from node k satisfies the equation:

$r_k = cMr_k + (1 - c)e_k$

where $e_k$ is an n-vector whose k-th element is 1, and every other element is 0. c is a restart probability parameter which is typically set to 0.85 [36]. M is a column-normalized and transposed adjacency matrix, as in Section III-B. In GIM-V, RWR is formulated by $r_k^{next} = M \times_G r_k^{cur}$ where the three operations are defined as follows (I(x) is 1 if x is true, and 0 otherwise):

1) combine2($m_{i,j}$, $v_j$) = $c \times m_{i,j} \times v_j$

2) combineAll_i($x_1, ..., x_n$) = $(1 - c) I(i = k) + \sum_{j=1}^{n} x_j$

3) assign($v_i$, $v_{new}$) = $v_{new}$

D. GIM-V and Diameter Estimation

HADI [21] is an algorithm to estimate the diameter and radius of large graphs. The diameter of a graph is the maximum of the length of the shortest path between every pair of nodes. The radius of a node $v_i$ is the number of hops that we need to reach the farthest-away node from $v_i$. The main idea of HADI is as follows. For each node $v_i$ in the graph, we maintain the number of neighbors reachable from $v_i$ within h hops. As h increases, the number of neighbors increases until h reaches its maximum value. The diameter is h where the number of neighbors within h + 1 does not increase for every node. For further details and optimizations, see [21].

The main operation of HADI is updating the number of neighbors as h increases. Specifically, the number of neighbors within hop h reachable from node $v_i$ is encoded in a probabilistic bitstring $b_i^h$ which is updated as follows:

$b_i^{h+1} = b_i^h$ BITWISE-OR $\{b_k^h \mid (i,k) \in E\}$

In GIM-V, the bitstring update of HADI is represented by

$b^{h+1} = M \times_G b^h$

where M is an adjacency matrix, $b^{h+1}$ is a vector of length n which is updated by

$b_i^{h+1} = \text{assign}(b_i^h, \text{combineAll}_i(\{x_j \mid j = 1..n, \text{ and } x_j = \text{combine2}(m_{i,j}, b_j^h)\}))$,

and the three operations are defined as follows:

1) combine2($m_{i,j}$, $v_j$) = $m_{i,j} \times v_j$.

2) combineAll_i($x_1, ..., x_n$) = BITWISE-OR$\{x_j \mid j = 1..n\}$

3) assign($v_i$, $v_{new}$) = BITWISE-OR($v_i$, $v_{new}$).

The ×G operation is run iteratively until the bitstrings of all the nodes do not change.
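The bitstring update can be illustrated with a small Python sketch. As a deliberate simplification, it uses exact bitmasks (bit i set means node i is reachable) in place of HADI's probabilistic bitstrings, so it only works for small graphs; the function name `neighborhood_sizes` and the undirected toy graph are assumptions for this example.

```python
def neighborhood_sizes(n, edges):
    """Grow per-node reachability bitmasks hop by hop.

    Uses exact bitmasks rather than probabilistic bitstrings, so this is
    a simplified stand-in for HADI, not HADI itself.
    Returns (number of hops until nothing changes, reachable-node counts).
    """
    adj = [[] for _ in range(n)]
    for i, k in edges:                    # undirected edge (i, k)
        adj[i].append(k)
        adj[k].append(i)
    bits = [1 << i for i in range(n)]     # b_i^0: each node reaches itself
    hops = 0
    while True:
        new_bits = []
        for i in range(n):
            gathered = 0
            for k in adj[i]:
                gathered |= bits[k]       # combineAll: BITWISE-OR over neighbours
            new_bits.append(bits[i] | gathered)   # assign: OR with the old value
        if new_bits == bits:              # no bitstring changed: stop
            break
        bits = new_bits
        hops += 1
    return hops, [bin(b).count("1") for b in bits]

# Path graph 0 - 1 - 2 - 3, whose diameter is 3
diameter, sizes = neighborhood_sizes(4, [(0, 1), (1, 2), (2, 3)])
print(diameter, sizes)  # 3 [4, 4, 4, 4]
```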

E. GIM-V and Connected Components

We propose HCC, a new algorithm for finding connected components in large graphs. Like HADI, HCC is an application of GIM-V with custom functions. The main idea is as follows. For every node $v_i$ in the graph, we maintain a component id $c_i^h$ which is the minimum node id within h hops from $v_i$. Initially, $c_i^h$ of $v_i$ is set to its own node id: that is, $c_i^0 = i$. For each iteration, each node sends its current $c_i^h$ to its neighbors. Then $c_i^{h+1}$, the component id of $v_i$ at the next step, is set to the minimum value among its current component id and the received component ids from its neighbors. The crucial observation is that this communication between neighbors can be formulated in GIM-V as follows:

$c^{h+1} = M \times_G c^h$

where M is an adjacency matrix, $c^{h+1}$ is a vector of length n which is updated by

$c_i^{h+1} = \text{assign}(c_i^h, \text{combineAll}_i(\{x_j \mid j = 1..n, \text{ and } x_j = \text{combine2}(m_{i,j}, c_j^h)\}))$,

and the three operations are defined as follows:

1) combine2($m_{i,j}$, $v_j$) = $m_{i,j} \times v_j$.

2) combineAll_i($x_1, ..., x_n$) = MIN$\{x_j \mid j = 1..n\}$

3) assign($v_i$, $v_{new}$) = MIN($v_i$, $v_{new}$).

By repeating this process, component ids of nodes in a component are set to the minimum node id of the component. We iteratively do the multiplication until component ids converge. The upper bound of the number of iterations in HCC is determined by the following theorem.

Theorem 1 (Upper bound of iterations in HCC): HCC requires at most d iterations where d is the diameter of the graph.

Proof: The minimum node id is propagated to its neighbors at most d times.

Since the diameter of real graphs is relatively small, HCC completes after a small number of iterations.
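A minimal single-machine sketch of this min-label propagation follows. The function name `hcc`, the adjacency-list representation, and the toy graph are assumptions for illustration; the iteration counter counts only iterations in which some component id changed.

```python
def hcc(n, edges):
    """Min-label propagation in the style of HCC (illustrative sketch).

    Returns (component ids, number of iterations that changed something).
    """
    adj = [[] for _ in range(n)]
    for i, j in edges:                    # undirected edge (i, j)
        adj[i].append(j)
        adj[j].append(i)
    comp = list(range(n))                 # c_i^0 = i
    iterations = 0
    while True:
        # combineAll = MIN over neighbours, assign = MIN with the old value
        new_comp = [min([comp[i]] + [comp[j] for j in adj[i]]) for i in range(n)]
        if new_comp == comp:              # component ids converged
            return comp, iterations
        comp = new_comp
        iterations += 1

# Components {0, 1, 2}, {3, 4}, and the isolated node {5}
labels, iters = hcc(6, [(0, 1), (1, 2), (3, 4)])
print(labels, iters)  # [0, 0, 0, 3, 3, 5] 2
```

Here the largest component is a path of diameter 2, and the labels settle after 2 iterations, in line with Theorem 1.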

IV. FAST ALGORITHMS FOR GIM-V

How can we parallelize the algorithm presented in the

previous section? In this section, we first describe naive

HADOOP algorithms for GIM-V. After that we propose

several faster methods for GIM-V.

A. GIM-V BASE: Naive Multiplication

GIM-V BASE is a two-stage algorithm whose pseudo code is in Algorithm 1 and 2. The inputs are an edge file and a vector file. Each line of the edge file contains one (id_src, id_dst, mval) which corresponds to a non-zero cell in the adjacency matrix M. Similarly, each line of the vector file contains one (id, vval) which corresponds to an element in the vector V. Stage1 performs the combine2 operation by combining columns of the matrix (id_dst of M) with rows of the vector (id of V). The output of Stage1 is a set of (key, value) pairs where the key is the source node id of the matrix (id_src of M) and the value is the partially combined result (combine2(mval, vval)). This output of Stage1 becomes the input of Stage2. Stage2 combines all partial results from Stage1 and assigns the new vector to the old vector. The combineAll_i() and assign() operations are done in line 16 of Stage2, where the “self” and “others” tags in line 16 and line 21 of Stage1 are used to make $v_i$ and $v_{new}$ of GIM-V, respectively.

This two-stage algorithm is run iteratively until an application-specific convergence criterion is met. In Algorithm 1 and 2, Output(k, v) means to output data with the key k and the value v.

Algorithm 1: GIM-V BASE Stage 1.

Input: Matrix M = {(id_src, (id_dst, mval))}, Vector V = {(id, vval)}

Output: Partial vector V′ = {(id_src, combine2(mval, vval))}

 1  Stage1-Map(Key k, Value v);
 2  begin
 3      if (k, v) is of type V then
 4          Output(k, v);                      // (k: id, v: vval)
 5      else if (k, v) is of type M then
 6          (id_dst, mval) ← v;
 7          Output(id_dst, (k, mval));         // (k: id_src)
 8  end
 9  Stage1-Reduce(Key k, Value v[1..m]);
10  begin
11      saved_kv ← [ ];
12      saved_v ← [ ];
13      foreach v ∈ v[1..m] do
14          if (k, v) is of type V then
15              saved_v ← v;
16              Output(k, (“self”, saved_v));
17          else if (k, v) is of type M then
18              Add v to saved_kv              // (v: (id_src, mval))
19      end
20      foreach (id′_src, mval′) ∈ saved_kv do
21          Output(id′_src, (“others”, combine2(mval′, saved_v)));
22      end
23  end
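The two stages of GIM-V BASE can be imitated in memory to see how the “self” and “others” tags flow from Stage 1 to Stage 2. The sketch below is plain Python, not HADOOP code: `shuffle` and `gimv_base` are hypothetical helper names, the shuffle is a simple dictionary grouping, and every id_dst is assumed to have an entry in the vector.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group values by key, imitating the shuffling stage."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def gimv_base(matrix, vector, combine2, combine_all, assign):
    """One GIM-V BASE iteration; matrix rows are (id_src, id_dst, mval)."""
    # Stage 1 map: key vector entries by id and matrix entries by id_dst
    mapped = [(i, ("V", vval)) for i, vval in vector.items()]
    mapped += [(dst, ("M", (src, mval))) for src, dst, mval in matrix]
    # Stage 1 reduce: tag the old value "self", partial results "others"
    stage1 = []
    for key, values in shuffle(mapped).items():
        saved_v, saved_kv = None, []
        for kind, payload in values:
            if kind == "V":
                saved_v = payload
                stage1.append((key, ("self", saved_v)))
            else:
                saved_kv.append(payload)
        for src, mval in saved_kv:
            stage1.append((src, ("others", combine2(mval, saved_v))))
    # Stage 2 (identity map), reduce: combineAll the partials, then assign
    result = {}
    for key, values in shuffle(stage1).items():
        self_v, others_v = None, []
        for tag, val in values:
            if tag == "self":
                self_v = val
            else:
                others_v.append(val)
        result[key] = assign(self_v, combine_all(key, others_v))
    return result

# One HCC step on the path 0 - 1 - 2 (each edge stored in both directions)
matrix = [(0, 1, 1), (1, 0, 1), (1, 2, 1), (2, 1, 1)]
step = gimv_base(matrix, {0: 0, 1: 1, 2: 2},
                 combine2=lambda m, v: m * v,
                 combine_all=lambda i, xs: min(xs),
                 assign=min)
print(step)  # {0: 0, 1: 0, 2: 1}
```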

Algorithm 2: GIM-V BASE Stage 2.

Input: Partial vector V′ = {(id_src, vval′)}

Output: Result Vector V = {(id_src, vval)}

 1  Stage2-Map(Key k, Value v);
 2  begin
 3      Output(k, v);
 4  end
 5  Stage2-Reduce(Key k, Value v[1..m]);
 6  begin
 7      others_v ← [ ];
 8      self_v ← [ ];
 9      foreach v ∈ v[1..m] do
10          (tag, v′) ← v;
11          if tag == “self” then
12              self_v ← v′;
13          else if tag == “others” then
14              Add v′ to others_v;
15      end
16      Output(k, assign(self_v, combineAll_k(others_v)));
17  end

B. GIM-V BL: Block Multiplication

GIM-V BL is a fast algorithm for GIM-V which is based on block multiplication. The main idea is to group elements of the input matrix into blocks or submatrices of size b by b. Also we group elements of input vectors into blocks of length b. Here the grouping means we put all the elements in a group into one line of the input file. Each block contains only non-zero elements of the matrix or vector. The format of a matrix block with k nonzero elements is (row_block, col_block, row_elem1, col_elem1, mval_elem1, ..., row_elemk, col_elemk, mval_elemk). Similarly, the format of a vector block with k nonzero elements is (id_block, id_elem1, vval_elem1, ..., id_elemk, vval_elemk). Only blocks with at least one nonzero element are saved to disk. This block encoding forces nearby edges in the adjacency matrix to be closely located; it is different from HADOOP’s default behavior, which does not guarantee co-locating them. After grouping, GIM-V is performed on blocks, not on individual elements. GIM-V BL is illustrated in Figure 1.

Figure 1. GIM-V BL using 2 x 2 blocks. $B_{i,j}$ represents a matrix block, and $v_i$ represents a vector block. The matrix and vector are joined block-wise, not element-wise.

In our experiments in Section V, GIM-V BL is more than 5 times faster than GIM-V BASE. There are two main reasons for this speed-up.

• Sorting Time: Block encoding decreases the number of items to sort in the shuffling stage of HADOOP. We observed that the main bottleneck of programs in HADOOP is its shuffling stage where network transfer, sorting, and disk I/O happen. By encoding to blocks of width b, the number of lines in the matrix and the vector file decreases to $1/b^2$ and $1/b$ times of their original size, respectively, for full matrices and vectors.


• Compression: The size of the data decreases significantly by converting edges and vectors to block format. The reason is that in GIM-V BASE we need 4×2 bytes to save each (srcid, dstid) pair since we need 4 bytes to save a node id using Integer. However, in GIM-V BL we can specify each block using a block row id and a block column id with two 4-byte Integers, and refer to elements inside the block using $2 \times \log b$ bits. This is possible because we can use $\log b$ bits to refer to a row or column inside a block. By this block method we decreased the edge file size (e.g., by more than 50% for the YahooWeb graph in Section V).
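The block encoding and the size argument above can be made concrete with a short Python sketch. The helper names (`to_blocks`, `edge_bits_base`, `edge_bits_block`) are hypothetical, the bit counts cover only the node/block ids and in-block offsets (edge values and any file-format overhead are ignored), and log b is taken as ⌈log2 b⌉.

```python
import math
from collections import defaultdict

def to_blocks(edges, b):
    """Group (row, col, val) entries into b-by-b blocks.

    Keys are (row_block, col_block); each element keeps only its
    offsets inside the block. Empty blocks are never created.
    """
    blocks = defaultdict(list)
    for row, col, val in edges:
        blocks[(row // b, col // b)].append((row % b, col % b, val))
    return dict(blocks)

def edge_bits_base(num_edges):
    """Bits for ids in element format: two 4-byte ids per edge."""
    return num_edges * 2 * 32

def edge_bits_block(blocks, b):
    """Bits for ids in block format: two 4-byte block ids per block,
    plus 2 * ceil(log2 b) offset bits per element (an assumed rounding)."""
    bits_per_elem = 2 * math.ceil(math.log2(b))
    return sum(2 * 32 + len(elems) * bits_per_elem for elems in blocks.values())

edges = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1), (2, 3, 1)]
blocks = to_blocks(edges, b=2)
print(len(blocks))                  # 2 non-empty blocks
print(edge_bits_base(len(edges)))   # 320
print(edge_bits_block(blocks, 2))   # 138
```

The saving grows with the number of elements per block, which is why dense or well-clustered blocks compress best.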

C. GIM-V CL: Clustered Edges

When we use block multiplication, another advantage is that we can benefit from clustered edges. As can be seen from Figure 2, we can use a smaller number of blocks if input edge files are clustered. Clustered edges can be built if we can use heuristics in the data preprocessing stage so that edges are clustered, or by co-clustering (e.g., see [32]). The preprocessing for edge clustering needs to be done only once; however, the clustered edges can be used by every iteration of various applications of GIM-V. So we have two variants of GIM-V: GIM-V CL, which is GIM-V BASE with clustered edges, and GIM-V BL-CL, which is GIM-V BL with clustered edges. Be aware that clustered edges are only useful when combined with block encoding. If every element is treated separately, then clustered edges do not help the performance of GIM-V.

Figure 2. Clustered vs. non-clustered graphs with same topology. The edges are grouped into 2 by 2 blocks. The left graph uses only 3 blocks while the right graph uses 9 blocks.
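The effect of node ordering on the number of non-empty blocks can be checked with a small Python sketch. The toy graph (three disjoint node pairs), the relabeling, and the helper names `count_blocks` and `relabel` are assumptions for this example and do not reproduce the graphs of Figure 2.

```python
def count_blocks(edges, b):
    """Number of non-empty b-by-b blocks needed to store the edges."""
    return len({(src // b, dst // b) for src, dst in edges})

def relabel(edges, mapping):
    """Same topology, different node ids."""
    return [(mapping[s], mapping[d]) for s, d in edges]

# Three pairs of mutually linked nodes, with ids contiguous inside each pair
clustered = [(0, 1), (1, 0), (2, 3), (3, 2), (4, 5), (5, 4)]
# A relabeling that scatters each pair across different blocks
scattered = relabel(clustered, {0: 0, 1: 2, 2: 4, 3: 1, 4: 3, 5: 5})

print(count_blocks(clustered, b=2))  # 3
print(count_blocks(scattered, b=2))  # 6
```

With b = 1 every edge is its own block under either labeling, which matches the remark that clustering only helps together with block encoding.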

D. GIM-V DI: Diagonal Block Iteration

As mentioned in Section IV-B, the main bottleneck of

GIM-V is its shuffling and disk I/O steps. Since GIM-V

iteratively runs Algorithm 1 and 2, and each Stage requires

disk I/O and shuffling, we could decrease running time if we

decrease the number of iterations.

In HCC, it is possible to decrease the number of iterations.

The main idea is to multiply diagonal matrix blocks and

corresponding vector blocks as much as possible in one

iteration. Remember that multiplying a matrix and a vector

corresponds to passing node ids to one step neighbors in

HCC. By multiplying diagonal blocks and vectors until the

contents of the vectors do not change in one iteration, we

can pass node ids to neighbors located more than one step

away. This is illustrated in Figure 3.

Figure 3. Propagation of component id (=1) when block width is 4. Each element in the adjacency matrix of (a) represents a 4 by 4 block; each column in (b) and (c) represents the vector after each iteration. GIM-V DI finishes in 4 iterations while GIM-V BL requires 8 iterations.

We see that in Figure 3 (c) we multiply $B_{i,i}$ with $v_i$ several times until $v_i$ does not change in one iteration. For example, in the first iteration $v_0$ changed from {1,2,3,4} to {1,1,1,1} since it is multiplied by $B_{0,0}$ four times. GIM-V DI is especially useful in graphs with long chains.

The upper bound of the number of iterations in HCC DI with chain graphs is determined by the following theorem.

Theorem 2 (Upper bound of iterations in HCC DI): In a chain graph with length m, it takes at most $2 \lceil m/b \rceil - 1$ iterations in HCC DI with block size b.

Proof: The worst case happens when the minimum node id is at the beginning of the chain. It requires 2 iterations (one for propagating the minimum node id inside the block, another for passing it to the next block) for the minimum node id to move to an adjacent block. Since the farthest block is $\lceil m/b \rceil - 1$ steps away, we need $2 (\lceil m/b \rceil - 1)$ iterations. When the minimum node id reaches the farthest block, GIM-V DI requires one more iteration to propagate the minimum node id inside the last block. Therefore, we need $2 (\lceil m/b \rceil - 1) + 1 = 2 \lceil m/b \rceil - 1$ iterations.
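The bound of Theorem 2 can be checked empirically with a small simulation. The sketch below models one HCC DI iteration under the reading suggested by the proof: diagonal blocks are iterated to convergence and off-diagonal blocks pass values once, both starting from the vector at the beginning of the iteration. The function names `hcc_di_chain` and `bound`, and the choice of node ids 0..m−1 in chain order, are assumptions for illustration.

```python
def hcc_di_chain(m, b):
    """Count HCC DI iterations on the chain 0 - 1 - ... - (m-1).

    Only iterations that change some component id are counted.
    """
    comp = list(range(m))                 # c_i^0 = i, minimum id at the start
    iterations = 0
    while True:
        new_comp = comp[:]
        # Diagonal blocks: on a chain each block is itself connected, so
        # repeated multiplication converges to the block minimum.
        for start in range(0, m, b):
            end = min(start + b, m)
            block_min = min(comp[start:end])
            for i in range(start, end):
                new_comp[i] = min(new_comp[i], block_min)
        # Off-diagonal blocks: one ordinary step across each block boundary,
        # using the values from the start of the iteration.
        for boundary in range(b, m, b):
            left, right = boundary - 1, boundary
            new_comp[right] = min(new_comp[right], comp[left])
            new_comp[left] = min(new_comp[left], comp[right])
        if new_comp == comp:
            return iterations
        comp = new_comp
        iterations += 1

def bound(m, b):
    """Upper bound 2 * ceil(m / b) - 1 from Theorem 2."""
    return 2 * ((m + b - 1) // b) - 1

print(hcc_di_chain(8, 4), bound(8, 4))   # 3 3
print(hcc_di_chain(9, 4), bound(9, 4))   # 4 5
```

With b = 1 the diagonal blocks contain a single node each, so the simulation degenerates to plain HCC and needs m − 1 iterations on the chain.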