Page 1

arXiv:1006.0809v1 [cs.DS] 4 Jun 2010

Tight and simple Web graph compression

Szymon Grabowski and Wojciech Bieniecki

Computer Engineering Department, Technical University of ? L´ od´ z,

Al. Politechniki 11, 90–924 ? L´ od´ z, Poland

{sgrabow,wbieniec}@kis.p.lodz.pl

Abstract. Analysing Web graphs has applications in determining page ranks, fighting Web

spam, detecting communities and mirror sites, and more. This study is however hampered by

the necessity of storing a major part of huge graphs in the external memory, which prevents

efficient random access to edge (hyperlink) lists. A number of algorithm involving compression

techniques have thus been presented, to represent Web graphs succinctly but also providing

random access. Those techniques are usually based on differential encodings of the adjacency

lists, finding repeating nodes or node regions in the successive lists, more general grammar-

based transformations or 2-dimensional representations of the binary matrix of the graph.

In this paper we present two Web graph compression algorithms. The first can be seen as

engineering of the Boldi and Vigna (2004) method. We extend the notion of similarity between

link lists, and use a more compact encoding of residuals. The algorithm works on blocks of

varying size (in the number of input lines) and sacrifices access time for better compression

ratio, achieving more succinct graph representation than other algorithms reported in the

literature. The second algorithm works on blocks of the same size, in the number of input

lines, and its key mechanism is merging the block into a single ordered list. This method

achieves much more attractive space-time tradeoffs. Additionally, we show a simple idea for

2-dimensional graph representation which also achieves state-of-the-art compression ratio.

Key words: graph compression, random access

1Introduction

Development of succinct data structures is one of the most active research areas in

algorithmics in the last years. A succinct data structure shares the interface with

its classic (non-succinct) counterpart, but is represented in much smaller space, via

data compression. Successful examples along these lines include text indexes [16],

dictionaries, trees [15,11] and graphs [15]. Queries to succinct data structures are

usually slower (in practice, although not always in complexity terms) than using

non-compressed structures, hence the main motivation in using them is to allow to

deal with huge datasets in the main memory. For example, indexed exact pattern

matching in DNA would be limited to sequences shorter than 1 billion nucleotides

on a commodity PC with 4 GB of main memory, if the indexing structure were the

classic suffix array (SA), and even less than half of it, if SA were replaced with a

suffix tree. On the other hand, switching to some compressed full-text index (see [16]

for a survey) shifts the limit to over 10 billion nucleotides, which is more than enough

to handle the whole human genome.

Another huge object of significant interest seems to be the Web graph. This is a

directed unlabeled graph of connections between Web pages (i.e., documents), where

the nodes are individual HTML documents and the edges from a given node are the

outgoing links to other nodes. We assume that the order of hyperlinks in a document

is irrelevant. Web graph analyses can be used to rank pages, fight Web spam, detect

communities and mirror sites, etc.

Page 2

It was estimated that the graph of the Web index by Yahoo!, Google, Bing and

Ask has between 21 and 59 billion nodes (http://www.worldwidewebsize.com/, May

2010), but the top figure is more likely. Therefore assuming 50 billion nodes and 20

outgoing links per node, we have about 1 trillion links. Using plain adjacency lists,

representation of this graph would require about 8 TB, if the edges are represented

with 64-bit pointers (note that 32-bit pointers may simply be too small). In a slightly

less na¨ ıve variant, with 5-byte pointers (note that 40 bits are just enough to represent

1 trillion values, but cannot scale any longer), the space occupancy drops to 5 TB,

i.e., is still ways beyond the capacities of the current RAM memories. We believe that,

confronted with the given figures, the reader is now convinced about the necessity of

compression techniques for Web graph representation.

A shorter version of this manuscript was submitted to Prague Stringology Con-

ference 2010.

2Related work

We assume that a directed graph G = (V,E) is a set of n = |V | vertices and m = |E|

edges. The earliest works on graph compression were theoretical, and they usually

dealt with specific graph classes. For example, it is known that planar graphs can be

compressed into O(n) bits [19,12]. For dense enough graphs, it is impossible to reach

o(mlogn) bits of space, i.e., go below the space complexity of the trivial adjacency list

representation. Since the seminal Jacobson’s thesis [13] on succinct data structures,

there appear papers taking into account not only the space occupied by a graph, but

also access times.

There are several works dedicated to Web graph compression. Bharat et al. [3]

suggested to order documents according to their URL’s, to exploit the simple ob-

servation that most outgoing links actually point to another document within the

same Web site. Their Connectivity Server provided linkage information for all pages

indexed by the AltaVista search engine at that time. The links are merely represented

by the node numbers (integers) using the URL lexicographical order. We noted that

we assume the order of hyperlinks in a document irrelevant (like most works on Web

graph compression do), hence the link lists can be sorted, in ascending order. As the

successive numbers tend to be close, differential encoding may be applied efficiently.

Randall et al. [18] also use this technique (stating that for their data 80% of all

links are local), but they also note that commonly many pages within the same site

share large parts of their adjacency lists. To exploit this phenomenon, a given list may

be encoded with a reference to another list from its neighborhood (located earlier),

plus a set of additions and deletions to/from the referenced list. Their encoding, in

the most compact variant, encodes an outgoing link in 5.55 bits on average, a result

reported over a Web crawl consisting of 61 million URL’s and 1 billion links.

One of the most efficient compression schemes for Web graph was presented by

Boldi and Vigna [4] in 2003. Their method is likely to achieve around 3 bits per edge,

or less, at link access time below 1ms at their 2.4GHz Pentium4 machine. Of course,

the compression ratios vary from dataset to dataset. We are going to describe the

Boldi and Vigna algorithm in detail in the next section as this is the main inspiration

for our solution.

2

Page 3

Claude and Navarro [7,9] took a totally different approach of grammar-based

compression. In particular, they focus on Re-Pair [14] and LZ78 compression schemes,

getting close, and sometimes even below, the compression ratios of Boldi and Vigna,

while achieving much faster access times. To mitigate one of the main disadvantages

of Re-Pair, high memory requirements, they develop an approximate variant of this

algorithm.

When compression is at a premium, one may acknowledge the work of Asano et al.

[2] in which their present a scheme creating a compressed graph structure smaller by

about 20–35% than the BV scheme with extreme parameters (best compression but

also impractically slow). The Asano et al. scheme perceives the Web graph as a binary

matrix (1s stand for edges) and detects 2-dimensional redundancies in it, via finding

six types of blocks in the matrix: horizontal, vertical, diagonal, L-shaped, rectangular

and singleton blocks. The algorithm compresses the data of intra-hosts separately for

each host, and the boundaries between hosts must be taken from a separate source

(usually, the list of all URL’s in the graph), hence it cannot be justly compared to

other algorithms mentioned here. Worse, retrieval times per adjacency list are much

longer than for other schemes: on a order of a few milliseconds (and even over 28ms for

one of three tested datasets) on their Core2 Duo E6600 (2.40GHz) machine running

Java code. We note that 28ms is at least twice more than the access time of modern

hard disks, hence working with a na¨ ıve (uncompressed) external representation would

be faster for that dataset (on the other hand, excessive disk use from very frequent

random accesses to the graph can result in a premature disk failure). It seems that

the retrieval times can be reduced (and made more stable across datasets) if the

boundaries between hosts in the graph are set artificially, in more or less regular

distances, but then also the compression ratio is likely to drop.

Also excellent compression results were achieved by Buehrer and Chellapilla [6],

who used grammar-based compression. Namely, they replace groups of nodes appear-

ing in several adjacency lists with a single “virtual node” and iterate this procedure;

no access times were reported in that work, but according to findings in [8] they

should be rather competitive and at least much shorter than of the algorithm from

[2], with compression ratio worse only by a few percent.

Anh and Moffat [1] devised a scheme which seems to use grammar-based com-

pression in a local manner. They work in groups of h consecutive lists and perform

some operations to reduce their size (e.g., a sort of 2-dimensional RLE if a run of

successive integers appears on all the h lists). What remains in the group is then en-

coded statistically. Their results are very promising: graph representations by about

15–30% (or even more in some variant) smaller than the BV algorithm with practical

parameter choice (in particular, Anh and Moffat achieve 3.81bpe and 3.55bpe for the

graph EU) and reported comparable decoding speed. Details of the algorithm cannot

however be deduced from their 1-page conference poster.

Recent works focus on graph compression with support for bidirectional naviga-

tion. To this end, Brisaboa et al. [5] proposed the k2-tree, a spatial data structure,

related to the well-known quadtree, which performs a binary partition of the graph

matrix and labels empty areas with 0s and non-empty areas with 1s. The non-empty

areas are recursively split and labeled, until reaching the leaves (single nodes). An im-

portant component in their scheme is an auxiliary structure to compute rank queries

[13] efficiently, to navigate between tree levels. It is easy to notice that this elegant

3

Page 4

data structure supports handling both forward and reverse neighbors, which implies

from its symmetry. Experiments show that this approach uses significantly less space

(3.3–5.3 bits per link) than the Boldi and Vigna scheme applied for both direct and

transposed graph, at the average neighbor retrieval times of 2–15 microseconds (Pen-

tium4 3.0GHz).

Even more recently, Claude and Navarro [8] showed how Re-Pair can be used to

compress the graph binary relation efficiently, enabling also to extract the reverse

neighbors of any node. These ideas let them achieve a number of Pareto-optimal

space-time tradeoffs, usually competitive to those from the k2-tree.

3 The Boldi and Vigna scheme

Based on WebGraph datasets (http://webgraph.dsi.unimi.it/), Boldi and Vigna

noticed that similarity is strongly concentrated; typically, either two adjacency (edge)

lists have nothing or little in common, or they share large subsequences of edges. To

exploit this redudancy, one bit per entry on the referenced list could be used, to

denote which of its integers are copied to the current list, and which are not. Those

bit-vectors are dubbed copy lists. Still, Boldi and Vigna go further, noticing that

copy lists tend to contain runs of 0s and 1s, thus they compress them using a sort

of run-length encoding. They assume the first run consists of 1s (if the copy list

actually starts with 0s, the length of the first run is simply zero), and then it allows

to represent a copy list as only a sequence of run lengths, encoded e.g. with Elias

coding.

The integers on the current list which didn’t occur on the referenced list must be

stored too, and how to encode them is another novelty of the described algorithm.

They detect intervals of consecutive (i.e., differing by 1) integers and encode them

as pairs of the left boundary and the interval length; the left boundary of the next

interval on a given list will be encoded as the difference to the right boundary of the

previous interval minus two (this is because between the end of one interval and the

beginning of another there must be at least one integer). The numbers which do not

fall into any interval are called residuals and are also stored, encoded in a differential

manner.

Finally, the algorithm allows to select as the reference list one of several previous

lines; the size of the window is one of the parameters of the algorithm posing a

tradeoff between compression ratio and compression/decompression time and space.

Another parameter affecting the results is the maximum reference count, which is the

maximum allowed length of a chain of lists such that one cannot be decoded without

extracting its predecessor in the chain.

4 Our algorithms

We present two approaches to Web graph compression working locally, in small blocks;

the first one reaches higher compression ratios but the second seems to be more

practical, as being much faster.

4

Page 5

Alg. 1 GraphCompressSSL(G,BSIZE).

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

firstLine ← true

prev ← [ ]

outB ← [ ]

outF ← [ ]

for line ∈ G do

residuals ← line

if firstLine = false then

f[1...|prev|] ← [1,1,...,1]

for i ← 1 to |prev| do

if prev[i] ∈ line then f[i] ← 0

else if prev[i] + 1 ∈ line then f[i] ← 2

else if prev[i] + 2 ∈ line then f[i] ← 3

append(outF, f)

for i ← 1 to |prev| do

if f[i] ?= 1 then

remove(residuals, prev[i])

residuals′← RLE(diffEncode(residuals)) + [0]

append(outB, byteEncode(residuals′))

prev ← line

firstLine ← false

if |outB| ≥ BSIZE then

compress(outB)

compress(outF)

outB ← [ ]

outF ← [ ]

firstLine ← true

4.1 An algorithm based on similarity of successive lists

Our first algorithm (Alg. 1, SSL stands for “similarity of successive lists”) works in

blocks consisting of multiple adjacency lists. The blocks in their compact form are

approximately equal, which means that the number of adjacency lists per block varies;

for example, in graph areas with dominating short lists the number of lists per block

is greater than elsewhere.

We work in two phases: preprocessing and final compression, using a general-

purpose compression algorithm. The algorithm processes the adjacency lines one-by-

one and splits their data into two streams.

One stream holds copy lists, in an extended sense compared to the Boldi and Vigna

solution. Our copy lists are no longer binary but consist of four different flag symbols:

0 denotes an exact match (i.e., value j from the reference list occurs somewhere on

the current list), 2 means that the current list contains integer j + 1, 3 means that

the current list contains integer j +2, if the corresponding integer from the reference

list is j. Finally, the bits 1 correspond to the items from the reference list which have

not been earlier labeled with 0, 2 or 3.

Of course, several events may happen for a single element, e.g., the integer 34

from the reference list triggers three events if the current list contains 34, 35 and 36.

In such case, the flag with the smallest value is chosen (i.e., 0 in our example).

Moreover, we make things even simpler than in the Boldi–Vigna scheme and our

reference list is always the previous adjacency list.

The other stream stores residuals, i.e., the values which cannot be decoded with

flags 0, 2 or 3 on the copy lists. First differential encoding is applied and then an

RLE compressor for differences 1 only (with minimum run length set experimentally

to 5) is run. The resulting sequence is terminated with a unique value (0) and then

encoded using a byte code.

5

Page 6

For this last step, we consider two variants. One is similar to two-byte dense code

[17] in spending one bit flag in the first codeword byte to tell the length of the current

codeword. Namely, we choose between 1 and b bytes for encoding each number, where

b is the minimum integer such that 8b − 1 bits are enough to encode any node value

in a given graph. In practice it means that b = 3 for EU and b = 4 for the remaining

available datasets.

The second coding variant can be classified as a prelude code [10] in which two

bits in the first codeword byte tell the length of the current codeword; originally the

lengths are 1, 2, 3 and 4 but we take 1, 2 and b such that 8b − 2 bits are enough

to encode the largest value in the given graph (i.e., b could be 5 or 6 for really huge

graphs).

Once the residual buffer reaches at least BSIZE bytes, it is time to end the current

block and start a new one. Both residual and flag buffers and then (independently)

compressed (we used the well-known Deflate algorithm for this purpose) and flushed.

The code at Alg. 1 is slightly simplified; we omitted technical details serving for

finding the list boundaries in all cases (e.g., empty lines).

4.2An algorithm based on list merging

Our second algorithm (Alg. 2, LM stands for “list merging”) works in blocks having

the same number of lists, h (at least in this aspect our algorithm resembles the one

from [1]).

Given the block of h lists, the procedure converts it into two streams: one stores

one long list consisting of all integers on the h input lists, without duplicates, and

the other stores flags necessary to reconstruct the original lists. In other words, the

algorithm performs a reversible merge of all the lists in the block.

The long list is compacted in a manner similar to the previous algorithm: the list

is differentially encoded, zero-terminated and submitted to a byte coder (the variant

with 1, 2 and b bytes per codeword was only tried). Note we gave up the RLE phase

here.

The flags describe to which input lists a given integer on the output list belong; the

number of bits per each item on the output list is h, and in practical terms we assume

h being a multiple of 8 (and even additionally a power of 2, in the experiments to

follow). The flag sequence does not need any terminator since its length is defined by

the length of the long list, which is located earlier in the output stream. For example,

if the length of the long list is 91 and h = 32, the corresponding flag sequence has

364 bytes.

Those two sequences, the compacted long list and the (raw) flag sequence, are

concatenated and compressed with the Deflate algorithm.

One can see that the key parameter here is the block size, h. Using a larger h lets

exploit a wider range of similar lists but also has two drawbacks. The flag sequence

gets more and more sparse (for example, for h = 64 and the EU-2005 crawl, as

much as about 68% of its list indicators have only one set bit out of 64!), and the

Deflate compressor is becoming relatively inefficient on those data. Worse, decoding

(including decompression) larger blocks takes longer time.

6

Page 7

Alg. 2 GraphCompressLM(G,h).

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

outF ← [ ]

i ← 1

for linei,linei+1,...,linei+h−1∈ G do

tempLine1← linei∪ linei+1∪ ... ∪ linei+h−1

tempLine2← removeDuplicates(tempLine1)

longLine ← sort(tempLine2)

items ← diffEncode(longLine) + [0]

outB ← byteEncode(items)

for j ← 1 to |longLine| do

f[1...|longLine|] ← [0,0,...,0]

for k ← 1 to h do

if longLine[j] ∈ linei+k−1then f[k] ← 1

append(outF, bitPack(f))

compress(concat(outB, outF))

outF ← [ ]

i ← i + h

5 Experimental results

We conducted experimented on the crawls EU-2005 and Indochina-2004, downloaded

from the WebGraph project (http://webgraph.dsi.unimi.it/), using both direct

and transposed graphs. The main characteristics of those datasets are presented in

Table 1.

DatasetEU-2005

direct transposed

862664

19235140

22.30

8.31

6985

Indochina-2004

direct transposed

7414866

19235140

26.18

17.66

6985

Nodes

Edges

Edges / nodes

% of empty lists

Longest list length

0.000

68922

0.004

256425

Table 1. Selected characteristics of the datasets used in the experiments.

The main experiments (Sect. 5.1) were run on a machine equipped with an Intel

Core 2 Quad Q9450 CPU, 8 GB of RAM, running Microsoft Windows XP (64-bit).

Our algorithms were implemented in Java (JDK 6). A single CPU core was used by all

implementations. As seemingly accepted in most reported works, we measure access

time per edge, extracting many (100,000 in our case) randomly selected adjacency

lists and summing those times, and dividing the total time by the number of edges

on the required lists. The space is measured in bits per edge (bpe), dividing the total

space of the structure (including entry points to blocks) by the total number of edges.

Throughout this section by 1KB we mean 1000 bytes.

5.1Compression ratios and access times

Our first algorithm has three parameters: the number of flags used (either 2 or 4,

where 2 flags mimic the Boldi–Vigna scheme and 4 correspond to Alg. 1), the byte

encoding scheme (either using 2 or 3 codeword lengths), and the residual block size

threshold BSIZE. As for the last parameter, we initially set it to 8192, which means

that the residual block gets closed and is submitted to the Deflate compression once

it reaches at least 8192 bytes. Experiments with the block size are presented in the

next subsection. The remaining parameters constitute four variants:

7

Page 8

2a Two flags and two codeword lengths are used.

2b Two flags and three codeword lengths are used.

4a Four flags and two codeword lengths are used.

4b Four flags and three codeword lengths are used.

DatasetEU-2005Indochina-2004

direct transposed

1.101

1.062

0.936

0.909

direct transposed

2.286

2.199

1.735

1.696

2a

2b

4a

4b

2.345

2.290

1.809

1.782

1.087

1.065

0.903

0.890

Table 2. The algorithm based on similarity of successive lists, compression ratios in

bits per edge.

As expected, the compression ratios improve with using more flags and more dense

byte codes (Table 2). Tables 3 and 4 present the compression and access time results

for the two extreme variants: 2a and 4b. Here we see that using more aggressive

preprocessing is unfortunately slower (partly because of increased amount of flag

data per block) and the difference in speed between variants 2a and 4b is close to

50%. Translating the times per edge into times per neighbor list, we need from 410µs

to 550µs for 2a and from 620µs to 760µs for 4b. This is about 10 times less than the

access time of 10K or 15K RPM hard disks.

Our second algorithm has one parameter, h, the number of lines (lists) per block.

We conducted experiments for h = 16, 32, 64, the results are presented in the last

three rows of Tables 3 and 4, respectively. We see that even LM64 cannot reach the

compression of our 4b variant, but its list extraction is faster 14–27 times. The fastest

of the variants presented here, LM16, is 1.3 and 2.0 slower than BV (7,3), respectively,

with much better compression (we checked also LM8, only on EU-2005: the results

are 3.814bpe and 0.20µs per edge).

direct graph

bpe

transposed graph

bpe

–

2.345

1.782

2.576

2.233

2.016

time [µs] time [µs]

BV (7,3) 5.169

2a

4b

LM16

LM32

LM64

0.24

18.59

28.93

0.31

0.55

1.05

–

2.286

1.696

2.963

2.373

2.008

18.88

27.83

0.82

1.05

2.01

Table 3. EU-2005 dataset. Compression ratios (bpe) and access times per edge. To

the results of BV (7,3) the amount of 0.510bpe should be added, corresponding to

extra data required to access the graph in random order.

5.2Varying the block size in the algorithm based on similarity of

successive lists

Obviously, the block size should seriously affect the overall space used by the structure

and the access time. Larger blocks mean that the Deflate algorithm is more successful

8

Page 9

direct graph

bpe

2.063

1.101

0.909

1.668

1.320

1.097

transposed graph

bpe

–

1.087

0.890

1.411

1.228

1.093

time [µs] time [µs]

BV (7,3)

2a

4b

LM16

LM32

LM64

0.21

20.77

29.03

0.43

0.55

0.79

–

21.10

27.43

0.47

0.69

1.16

Table 4. Indochina-2004 dataset. Compression ratios (bpe) and access times per edge.

To the results of BV (7,3) the amount of 0.348bpe should be added, corresponding

to extra data required to access the graph in random order.

in finding longer matches and the overhead from encoding first lines in a block without

any reference is smaller. On the other hand, more lines have to be usually decoded

before extracting the queried adjacency list.

In this experiment we run the 2a algorithm (the same implementation in Java)

with each block of residuals terminated (and later Deflate-compressed) after reaching

BSIZE of 1024, 2048, 4096, 8192 and 16384 bytes, respectively. The test computer

had an Intel Pentium4 HT 3.0GHz CPU, 1GB of RAM, and was running Microsoft

Windows XP Home SP3 (32-bit). The results (Table 5) show that doubling the block

size implies space reduction by about 10% while the access time grows less than twice

(in particular, using 8K blocks is only 2.0–2.5 times slower than using 2K blocks). Still,

as the block size gets larger (compare the last two rows in the table), the improvement

in compression starts to drop while the slowdown grows. For a reference, the access

times of a practical Boldi–Vigna variant, BV (7,3), are 0.47µs and 0.42µs on the test

machine.

EU-2005Indochina-2004

bpe

1.485

1.292

1.172

1.101

1.061

bpe

3.398

2.869

2.513

2.286

2.129

time [µs] time [µs]

1024

2048

4096

8192

16384

6.50

8.91

15.93

27.60

48.77

8.99

12.05

17.87

29.83

57.39

Table 5. Compression ratios and access times in function of the block size. 2a variant

used. Tests run on the non-transposed graphs.

6Obtaining forward and reverse neighbors

Sometimes one is interested in grasping not only the (forward) neighbors of a given

node but also the nodes that point to the current node (also called its reverse neigh-

bors). A na¨ ıve solution to this problem is to store a twin data structure built for

the transposed graph, which more or less doubles the required space. Interestingly,

as pointed out in Sect. 2, more sophisticated ideas are already known, using 2D

structures that support bidirectional navigation over the graph.

In this section we propose two simple techniques for this problem scenario. One of

them reduces the size of the compressed transposed graph for the price of moderate

9

Page 10

increase in search time. Basically, the idea is to remove parts of some adjacency lists

from the transposed graph and refer to the compressed structure for the direct graph

when there is a need to extract those removed reverse neighbors. In our preliminary

experiments the transposed graph compressed component was reduced by less than

10% while for many lists the access time had to be approximately doubled (instead of

extracting one compressed block, two randomly accessed blocks had to be extracted).

Even if more can be done along these lines, we do not anticipate this approach being

competitive.

The other algorithm partitions the binary matrix of the EU graph into squares,

in the manner of the k2-tree, but without any hierarchy, i.e., using only one level of

blocks. Although seemingly very primitive, this idea let us attain the smallest space

ever reported in the literature, for the EU dataset, among the algorithms supporting

bidirectional navigation, namely 1.76bpe, but the average extraction time per adja-

cency list is now on the order of a few milliseconds, i.e., close to hard disk access

time. This is, in a way, an extreme result; a slower algorithm could already lose in

speed to a plain external representation.

In an experiment, we partitioned the binary matrix M of the EU graph (n =

862,664 nodes) into boxes (squares) of size B = 1024 (the boundary areas may

be rectangular). Each box is identified with a single bit (totalling 89KB) where 1s

stand for the non-empty boxes (those that contain at least one edge). The non-empty

boxes, obtained in a row-wise scan, are labeled with successive integers, which are

offsets in an array A[1...|A|] of pointers to the actual (compressed) content of the

corresponding boxes. Now we present how forward and reverse neighbors of a given

page are found.

To find the forward neighbors of page j, we must retrieve and decode all the non-

empty boxes overlapping the jth row of the matrix. Note that for efficient retrieval

we need only to find quickly in the array A the pointer to first (leftmost) such box as

all its successors will be pointed from the following cells of A. A trivial yet satisfying

solution is to store in an extra array the indexes in A of the leftmost non-empty boxes

for the following rows of boxes. This needs n/B indexes (about 3.3KB for the EU

graph if 4-byte indexes are used).

Finding the reverse neighbors is harder but we avoid the challenge and solve it

trivially, storing an array analogous to A, only built according to the column-wise

scan. For the EU graph and our choice of B, the number of non-empty blocks is

about 24,700, i.e., the extra cost in space is just above 100KB (4-byte pointers and

the 3.3KB of the auxiliary array).

As mentioned, the non-empty boxes are stored in compressed form. We (concep-

tually) flatten each box, writing it row after row, and encode the gaps between the

successive 1s. The gaps are represented with a byte code (1, 2 or 3 bytes per gap).

Finally, the sequence of encoded gaps is compressed with the Deflate algorithm. To

improve compression, for each non-empty box we check if transposing it results in a

smaller Deflate-compressed size and also if it is better not to compress it at all (data

expansion is typical if the box contains only a few items). This adds two extra bits

per non-empty box.

Note that accessing the (forward or reverse) neighbors of a given page requires

decoding many boxes, even those that have no item in common with the desired

neighbor list. For the EU graph a single list passes through about 29 non-empty

10

Page 11

boxes, on average. The average non-empty box occupies a little over 1.3KB before

Deflate compression (their size variance is however very large), which means that

retrieving the neighbor list requires extracting compressed data to about 39KB, on

average. This estimation is optimistic since decompressing a single chunk of data is

usually faster than of several chunks totalling the same size, because of the locality of

memory accesses. Moreover, as said, those average-case estimations are far from the

worst case. Yet another factor is that the decoded boxes must be filtered to return

only those values which belong to the desired list. Overall, we however believe that

one can retrieve the neighbor list in about 2–3ms (i.e., about 100 microseconds per

neighbor) on modern hardware in an average case.

The total size of the EU graph compressed in the presented way is about 4,225KB,

which translates to 1.76bpe. This contrasts with 3.93bpe presented as the most

succinct result in [8], for which graph representation the average reported direct

and reverse edge retrieval time is about 35 and 55 microseconds, respectively. As the

number of edges per adjacency list is about 22 for this graph, the times to extract the

whole list are close to 1ms which is not that far from our (very crudely estimated)

retrieval times. This leads us to the conclusion that even simple heuristics and off-

the-shelf tools (like the Deflate compression algorithm) may help one get close to the

state of the art and should encourage researchers to rethink the problem.

7 Conclusions

We presented three algorithms for Web graph compression, two of them encoding

blocks consisting of whole lines and the other working on boxes (squares) of the

graph binary matrix. All those algorithms achieve much better compression results

than those presented in the literature, although two of them for the price of relatively

slow access time. The most interesting algorithm, based on list merging, seems to

be rather competitive to the algorithms known from the literature ([4,9,1]) but in

experiments we could directly compare it only to the Boldi–Vigna algorithm. Our

approach lets achieve compression ratios not reported in the literature, for moderate

slow-down in list accesses (the best tradeoff here seems to be the variant LM32).

If even better compression ratios are welcome, then our 4b variant can be consid-

ered, being more than an order of magnitude slower. We point out that one extreme

tradeoff in succinct in-memory data structures is when accessing the structure is only

slightly faster than reading data from disk. The niche for such a solution is when

the given Web crawl cannot fit in RAM memory using less tight compressed rep-

resentation and the stronger compression is already enough. The disk transfer rate

is of relatively small imporantance here and what matters is the access time, which

is about 10ms or more for commodity 7200RPM hard disks. Our algorithms spend

significantly less time for extracting an average adjacency list, even if they are 1 or

2 orders of magnitude slower than the solutions from [4,7,8]. Another challenge is to

compete with SSD disks which are not much faster than conventional disks in reading

or writing sequential data but their access times are two orders of magniture smaller.

Here our LM variants are fast enough, though.

Our future work will focus on improving the access times in both approaches;

some possibilities lie in more aggressive reference list encoding via referring to sev-

eral (cf. [4]) rather than a single previous list, using smaller independently com-

11

Page 12

pacted blocks with backend compression applied over many of them, and replacing

Deflate with alternative compressors from LZ77 family, either stronger (e.g. LZMA,

http://www.7-zip.org/), or even faster than Deflate in the decompression.

References

1.

V. N. Anh and A. F. Moffat: Local modeling for webgraph compression, in DCC, J. A. Storer and

M. W. Marcellin, eds., IEEE Computer Society, 2010, p. 519.

Y. Asano, Y. Miyawaki, and T. Nishizeki: Efficient compression of web graphs, in COCOON, X. Hu

and J. Wang, eds., vol. 5092 of Lecture Notes in Computer Science, Springer, 2008, pp. 1–11.

K. Bharat, A. Z. Broder, M. R. Henzinger, P. Kumar, and S. Venkatasubramanian: The

Connectivity Server: Fast access to linkage information on the Web. Computer Networks, 30(1–7) 1998,

pp. 469–477.

P. Boldi and S. Vigna: The webgraph framework I: Compression techniques, in WWW, S. I. Feldman,

M. Uretsky, M. Najork, and C. E. Wills, eds., ACM, 2004, pp. 595–602.

N. Brisaboa, S. Ladra, and G. Navarro: K2-trees for compact web graph representation, in Proc.

16th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5721,

Springer, 2009, pp. 18–30.

G. Buehrer and K. Chellapilla: A scalable pattern mining approach to web graph compression with

communities, in WSDM, M. Najork, A. Z. Broder, and S. Chakrabarti, eds., ACM, 2008, pp. 95–106.

F. Claude and G. Navarro: Fast and compact Web graph representations, Tech. Rep. TR/DCC-

2008-3, Department of Computer Science, University of Chile, April 2008.

F. Claude and G. Navarro: Extended compact web graph representations, in Algorithms and Appli-

cations, T. Elomaa, H. Mannila, and P. Orponen, eds., vol. 6060 of Lecture Notes in Computer Science,

Springer, 2010, pp. 77–91.

F. Claude and G. Navarro: Fast and compact web graph representations. ACM Transactions on the

Web (TWEB), 2010, To appear.

10. J. S. Culpepper and A. Moffat: Enhanced byte codes with restricted prefix properties, in SPIRE,

M. P. Consens and G. Navarro, eds., vol. 3772 of Lecture Notes in Computer Science, Springer, 2005,

pp. 1–12.

11. R. F. Geary, N. Rahman, R. Raman, and V. Raman: A simple optimal representation for balanced

parentheses, in Combinatorial Pattern Matching, 15th Annual Symposium, CPM 2004, Istanbul,Turkey,

July 5-7, 2004, Proceedings, S. C. Sahinalp, S. Muthukrishnan, and U. Dogrus¨ oz, eds., vol. 3109 of

Lecture Notes in Computer Science, Springer–Verlag, 2004, pp. 159–172.

12. X. He, M.-Y. Kao, and H.-I. Lu: A fast general methodology for information-theoretically optimal

encodings of graphs. SIAM J. Comput., 30(3) 2000, pp. 838–846.

13. G. Jacobson: Succinct Static Data Structures, PhD thesis, 1989.

14. N. J. Larsson and A. Moffat: Off-line dictionary-based compression. Proceedings of the IEEE,

88(11) Nov. 2000, pp. 1722–1732.

15. J. I. Munro and V. Raman: Succinct representation of balanced parentheses, static trees and planar

graphs, in IEEE Symposium on Foundations of Computer Science (FOCS), 1997, pp. 118–126.

16. G. Navarro and V. M¨ akinen: Compressed full-text indexes. ACM Computing Surveys, 39(1) 2007,

p. article 2.

17. P. Proch´ azka and J. Holub: New word-based adaptive dense compressors, in IWOCA, J. Fiala,

J. Kratochv´ ıl, and M. Miller, eds., vol. 5874 of Lecture Notes in Computer Science, Springer, 2009,

pp. 420–431.

18. K. Randall, R. Stata, R. Wickremesinghe, and J. Wiener: The link database: Fast access to

graphs of the Web, 2001.

19. G. Tur´ an: On the succinct representation of graphs. Discrete Applied Math, 15(2) May 1984, pp. 604–

618.

2.

3.

4.

5.

6.

7.

8.

9.

12