# Merging Adjacency Lists for Efficient Web Graph Compression



Szymon Grabowski and Wojciech Bieniecki

**Abstract** Analysing Web graphs is hampered by the necessity of storing a major part of these huge graphs in external memory, which prevents efficient random access to the edge (hyperlink) lists. A number of algorithms involving compression techniques have therefore been proposed that represent Web graphs succinctly while still providing random access. Our algorithm belongs to this category. It works on contiguous blocks of adjacency lists, and its key mechanism is merging each block into a single ordered list. This method achieves compression ratios much better than most methods known from the literature, at rather competitive access times.

Key words: graph compression, random access

## 1 Introduction

The development of succinct data structures is one of the most active research areas in algorithmics in recent years. A succinct data structure shares the interface of its classic (non-succinct) counterpart, but is represented in much smaller space, via data compression. Successful examples along these lines include text indexes [15], dictionaries, trees and graphs [14]. Queries to succinct data structures are usually slower (in practice, although not always in complexity terms) than queries to non-compressed structures; hence the main motivation for using them is to handle huge datasets in main memory.

Szymon Grabowski
Computer Engineering Dept., Technical University of Łódź, al. Politechniki 11, 90-924 Łódź, e-mail: sgrabow@kis.p.lodz.pl

Wojciech Bieniecki
Computer Engineering Dept., Technical University of Łódź, al. Politechniki 11, 90-924 Łódź, e-mail: wbieniec@kis.p.lodz.pl

One particular huge object of significant interest is the Web graph. This is a directed unlabeled graph of connections between Web pages (i.e., documents), where the nodes are individual HTML documents and the edges from a given node are the outgoing links to other nodes. We assume that the order of hyperlinks in a document is irrelevant. Web graph analyses can be used to rank pages, fight Web spam, detect communities and mirror sites, etc.

As of Feb. 2011, it is estimated that Google's index contains about 28 billion webpages¹. Assuming 20 outgoing links per node, 5-byte links (4-byte indexes to other pages are simply too small) and a pointer to each adjacency list, we would need more than 2.8 TB of memory, way beyond the capacities of current RAM.
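This back-of-envelope estimate is easy to reproduce; Python is used here purely as a calculator, with the figures assumed above:

```python
# Memory estimate for an uncompressed Web graph, using the text's figures.
nodes = 28_000_000_000        # pages in Google's index (Feb. 2011 estimate)
links_per_node = 20           # assumed average out-degree
bytes_per_link = 5            # 4-byte indexes are too small for 28G nodes
bytes_per_pointer = 5         # one pointer per adjacency list

edges_bytes = nodes * links_per_node * bytes_per_link
pointer_bytes = nodes * bytes_per_pointer
total_tb = (edges_bytes + pointer_bytes) / 10**12

print(f"{total_tb:.2f} TB")  # 2.94 TB, i.e. more than 2.8 TB
```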

## 2 Related work

We assume that a directed graph G = (V, E) is a set of n = |V| vertices and m = |E| edges. The earliest works on graph compression were theoretical, and they usually dealt with specific graph classes (e.g., planar ones). The first papers dedicated to Web graph compression, which appeared around 2000, pointed out some redundancies in the graph, e.g., that successive adjacency lists tend to have nodes in common if they are sorted in URL lexicographical order, but they failed to achieve impressive compression ratios.

One of the most efficient compression schemes for Web graphs (and one of the most often used as a reference in newer works) was presented by Boldi and Vigna [5] in 2003. Their method (BV) typically achieves around 3 bpe (bits per edge), or less, at link access times below 1 ms on their 2.4 GHz Pentium 4 machine; of course, the compression ratios vary from dataset to dataset. Based on the WebGraph datasets [4], Boldi and Vigna noticed that similarity is strongly concentrated: typically, two adjacency (edge) lists either have little or nothing in common, or they share large subsequences of edges. To exploit this redundancy, one bit per entry of the referenced list denotes which of its integers are copied to the current list and which are not. Those bit-vectors tend to contain runs of 0s and 1s, and are thus compressed with a sort of RLE (run-length encoding). The integers of the current list which do not occur on the referenced list are stored too: intervals of consecutive integers are also encoded in an RLE manner, while the numbers which do not fall into any interval (residuals) are differentially encoded. Finally, the BV algorithm allows selecting as the reference list one of several previous lists; the size of this window is one of the parameters of the algorithm, posing a tradeoff between compression ratio and compression/decompression time

¹ http://www.worldwidewebsize.com/


and space. Another parameter affecting the results is the maximum reference count, i.e., the maximum allowed length of a chain of lists such that one cannot be decoded without first extracting its predecessor in the chain.
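The copy-list mechanism can be illustrated with a simplified sketch, assuming a toy encoder that keeps the copy bit-vector raw and skips BV's interval coding and reference-window selection (the function names are ours, not from the BV implementation):

```python
def bv_encode(reference, current):
    """Simplified BV-style encoding of `current` against `reference`:
    one copy bit per entry of the reference list, plus the leftover
    (residual) integers, differentially (gap) encoded. The real scheme
    also run-length-encodes the bits and extracts intervals."""
    cur = set(current)
    copy_bits = [1 if x in cur else 0 for x in reference]
    copied = {x for x in reference if x in cur}
    residuals = sorted(x for x in current if x not in copied)
    if residuals:
        gaps = [residuals[0]] + [b - a for a, b in zip(residuals, residuals[1:])]
    else:
        gaps = []
    return copy_bits, gaps

def bv_decode(reference, copy_bits, gaps):
    """Invert bv_encode: take the flagged reference entries, then
    cumulatively decode the residual gaps."""
    out = [x for x, b in zip(reference, copy_bits) if b]
    val = 0
    for i, g in enumerate(gaps):
        val = g if i == 0 else val + g
        out.append(val)
    return sorted(out)

ref = [13, 15, 16, 17, 20]
cur = [13, 15, 16, 17, 18, 19, 20, 21]
bits, gaps = bv_encode(ref, cur)
print(bits, gaps)  # [1, 1, 1, 1, 1] [18, 1, 2]
assert bv_decode(ref, bits, gaps) == cur
```

Note how a list sharing a long subsequence with its reference collapses to a cheap run of 1-bits plus a few residuals, which is exactly the concentration of similarity BV exploits.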

Claude and Navarro [11] took a totally different approach, based on grammar compression. In particular, they focus on the rule-based Re-Pair [13] and dictionary-based LZ78 compression schemes, getting close to, and sometimes even below, the compression ratios of Boldi and Vigna, while achieving much faster access times. To mitigate one of the main disadvantages of Re-Pair, its high memory requirements, they developed an approximate variant of this algorithm.
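As a toy illustration of the Re-Pair idea (a plain sketch over one integer sequence, not the approximate variant just mentioned): the most frequent pair of adjacent symbols is repeatedly replaced by a fresh nonterminal until no pair occurs twice.

```python
from collections import Counter

def repair(seq, next_symbol):
    """Toy Re-Pair: repeatedly replace the most frequent adjacent pair
    with a new symbol. Returns the reduced sequence and the grammar."""
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:            # no pair repeats: grammar is complete
            break
        rules[next_symbol] = pair
        out, i = [], 0
        while i < len(seq):      # left-to-right replacement pass
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules

seq, rules = repair([1, 2, 3, 1, 2, 3, 1, 2], next_symbol=4)
print(seq, rules)
```

On adjacency lists, repeated node groups become single grammar symbols, which is what makes this family of methods effective on Web graphs.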

When compression is at a premium, one may turn to the work of Asano et al. [3], who present a scheme producing a compressed graph structure about 20–35% smaller than the BV scheme with unbounded reference chains (the best compression of BV, but also impractically slow). The Asano et al. scheme perceives the Web graph as a binary matrix (1s stand for edges) and detects 2-dimensional redundancies in it by finding several types of blocks in the matrix. The algorithm compresses the intra-host data separately for each host, and the boundaries between hosts must be taken from a separate source (usually the list of all URLs in the graph), hence it cannot be justly compared to the other algorithms mentioned here. Worse, retrieval times per adjacency list are much longer than for other schemes, from 2.3 to 28.7 milliseconds (Core2 Duo E6600 2.40 GHz, Java implementation), depending on the dataset; the longest time even exceeds a hard disk access time! It seems that the retrieval times could be reduced (and made more stable across datasets) if the boundaries between hosts were set artificially, at more or less regular intervals, but then the compression ratio would also be likely to drop.

Excellent compression results were also achieved by Buehrer and Chellapilla [9], who used grammar-based compression: they replace groups of nodes appearing in several adjacency lists with a single “virtual node” and iterate this procedure. No access times were reported in that work, but according to the findings in [10] they should be rather competitive, and at least much shorter than those of the algorithm from [3], with compression ratio worse only by a few percent.
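The virtual-node transformation can be sketched as follows. Here the group of shared nodes is given explicitly and the helper name is ours; the actual method mines such groups with a scalable pattern-mining step and iterates the replacement.

```python
def apply_virtual_node(adj, group, virtual):
    """Replace every occurrence of `group` (a set of nodes contained in
    several adjacency lists) with the single virtual node `virtual`,
    which in turn points to the group's members."""
    group = set(group)
    new_adj = {}
    for node, neigh in adj.items():
        if group <= set(neigh):
            # the whole group occurs in this list: one edge replaces many
            new_adj[node] = sorted(set(neigh) - group) + [virtual]
        else:
            new_adj[node] = list(neigh)
    new_adj[virtual] = sorted(group)
    return new_adj

adj = {0: [3, 4, 5, 9], 1: [3, 4, 5], 2: [1, 3, 4, 5]}
new = apply_virtual_node(adj, {3, 4, 5}, virtual="v")
edges = lambda g: sum(len(v) for v in g.values())
print(edges(adj), "->", edges(new))  # 11 -> 8
```

Each group of size g shared by k lists costs k·g edges before and k + g edges after the rewrite, so dense shared groups shrink the edge set substantially.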

Apostolico and Drovandi [1] proposed an alternative Web graph ordering, reflecting a BFS traversal (starting from a random node) rather than the traditional URL-based order. They obtain quite impressive compressed graph structures, often 20–30% smaller than those of BV at comparable access speeds. Interestingly, the BFS ordering allows handling the link existential query (testing whether page i has a link to page j) almost twice as fast as returning the whole neighbor list. Still, we note that using a non-lexicographical ordering is probably harmful to compact storage of the webpage URLs themselves (a problem accompanying pure graph structure compression in most practical applications).

Anh and Moffat [2] devised a scheme which seems to use grammar-based compression in a local manner. They work on groups of h consecutive lists and perform operations to reduce their size (e.g., a sort of 2-dimensional RLE if a run of successive integers appears on all h lists). What remains in the group is then encoded statistically. Their results are very promising: graph representations about 15–30% (or even more in some variants) smaller than the BV algorithm with practical parameter choices (in particular, Anh and Moffat achieve 3.81 bpe and 3.55 bpe for the graph EU), with reportedly comparable decoding speed. The details of the algorithm cannot, however, be deduced from their 1-page conference poster.

Some recent works focus on graph compression with support for bidirectional navigation [8, 10]; experiments show that this approach uses significantly less space (3.3–5.3 bits per link) than the Boldi and Vigna scheme applied to both the direct and the transposed graph, at average neighbor retrieval times of 2–15 microseconds (Pentium 4 3.0 GHz).

The smallest compressed Web graph structures (with only EU-2005 and Indochina-2004 used in the experiments) were reported by Grabowski and Bieniecki [12]; their best results were about 1.7 bpe and 0.9 bpe (including offsets to compressed chunk beginnings) for those graphs, respectively, which is 2.5–3 times less than for the BV variant with fast access. The algorithm (called SSL, for “Similarity of Successive Lists”) exploits ideas similar to BV, but uses Deflate (zip) compression on chunks of byproducts in its last phase. Unfortunately, the price for those record-breaking compression ratios is a random list access time two orders of magnitude longer than in BV.

## 3 Our algorithm

We present an algorithm (Alg. 1; LM stands for “List Merging”) that works locally, in blocks comprising the same number of adjacency lists, h (at least in this aspect our algorithm resembles the one from [2]).

Given a block of h lists, the procedure converts it into two streams: one stores a single long list consisting of all integers on the h input lists, without duplicates, and the other stores the flags necessary to reconstruct the original lists. In other words, the algorithm performs a reversible merge of all the lists in the block.

```
Alg. 1 GraphCompressLM(G, h)

outF ← [ ]
i ← 1
for line_i, line_{i+1}, ..., line_{i+h−1} ∈ G do
    tempLine1 ← line_i ∪ line_{i+1} ∪ ... ∪ line_{i+h−1}
    tempLine2 ← removeDuplicates(tempLine1)
    longLine ← sort(tempLine2)
    items ← diffEncode(longLine) + [0]
    outB ← byteEncode(items)
    for j ← 1 to |longLine| do
        f[1...h] ← [0, 0, ..., 0]
        for k ← 1 to h do
            if longLine[j] ∈ line_{i+k−1} then f[k] ← 1
        append(outF, bitPack(f))
    compress(concat(outB, outF))
    outF ← [ ]
    i ← i + h
```

The long list is compacted: differentially encoded, zero-terminated and submitted to a byte coder using 1, 2 or b bytes per codeword, where b is the smallest number of bytes sufficient to handle any node number in a given graph (in practice this means b = 4, except for the smallest dataset, EU-2005, where b = 3 was enough).

The flags describe to which input lists a given integer on the output list belongs; the number of flag bits per item on the output list is h, and in practical terms we assume h is a multiple of 8 (and, in the experiments to follow, even a power of 2). The flag sequence does not need any terminator, since its length is determined by the length of the long list, which is located earlier in the output stream. For example, if the length of the long list is 91 and h = 32, the corresponding flag sequence occupies 364 bytes.

Those two sequences, the compacted long list and the (raw) flag sequence, are concatenated and compressed with the Deflate algorithm.

One can see that the key parameter here is the block size h. Using a larger h lets us exploit similarity across a wider range of lists, but it also has two drawbacks. The flag sequence gets more and more sparse (for example, for h = 64 and the EU-2005 crawl, as much as about 68% of its list indicators have only one set bit out of 64!), and the Deflate compressor becomes relatively inefficient on such data. Worse, decoding (including decompression) larger blocks takes longer.
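The per-block pipeline of Alg. 1 (merge, gap coding, byte coding, flag packing, Deflate) can be sketched in a few lines. This is a simplified stand-in; in particular, a varint-style byte code replaces the 1/2/b-byte coder described above, and the gaps are shifted by one so that 0 can serve as the terminator:

```python
import zlib

def lm_compress_block(lists):
    """Compress one block of h adjacency lists (LM sketch).
    Output (before Deflate): zero-terminated, gap- and varint-encoded
    merged list, followed by the raw flag sequence with h bits per
    merged item (h assumed to be a multiple of 8)."""
    h = len(lists)
    long_line = sorted(set().union(*lists))   # reversible merge
    gaps, prev = [], -1
    for x in long_line:                       # gaps are always >= 1,
        gaps.append(x - prev)                 # so 0 can terminate
        prev = x
    out_b = bytearray()
    for g in gaps + [0]:                      # varint byte code
        while g >= 0x80:
            out_b.append((g & 0x7F) | 0x80)
            g >>= 7
        out_b.append(g)
    out_f = bytearray()
    sets = [set(l) for l in lists]
    for x in long_line:                       # h flag bits per item
        flags = 0
        for k in range(h):
            if x in sets[k]:
                flags |= 1 << k
        out_f += flags.to_bytes(h // 8, "little")
    return zlib.compress(bytes(out_b + out_f))

blob = lm_compress_block([[2, 5, 9], [5, 9, 30], [], [2, 30, 31],
                          [9], [5], [2, 9], [31]])
print(len(blob), "bytes")
```

Decompression reverses the steps: inflate, read varints until the 0 terminator to rebuild the long list, then slice the remaining bytes into h-bit flag words to recover each original list.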

## 4 Experimental results

We conducted experiments on the crawls EU-2005, Indochina-2004 and UK-2002 [4], downloaded from the WebGraph project (http://webgraph.dsi.unimi.it/). The main characteristics of those datasets are presented in Table 1.

| Dataset | Nodes | Edges | Edges / nodes | % of empty lists | Longest list length |
|---|---|---|---|---|---|
| EU-2005 | 862,664 | 19,235,140 | 22.30 | 8.31 | 6985 |
| Indochina-2004 | 7,414,866 | 194,109,311 | 26.18 | 17.66 | 6985 |
| UK-2002 | 18,520,486 | 298,113,762 | 16.10 | 14.91 | 2450 |

Table 1 Selected characteristics of the datasets used in the experiments.


The main experiments were run on a machine equipped with an Intel Core 2 Quad Q9450 CPU and 8 GB of RAM, running Microsoft Windows XP (64-bit). Our algorithms were implemented in Java (JDK 6); a single CPU core was used by all implementations. As seemingly accepted in most reported works, we measure access time per edge: we extract many (100,000 in our case) randomly selected adjacency lists, sum those extraction times, and divide the total time by the number of edges on the required lists. The space is measured in bits per edge (bpe), dividing the total size of the structure (including entry points to blocks) by the total number of edges. Throughout this section, by 1 KB we mean 1000 bytes.

We test the following algorithms:

- the Boldi and Vigna algorithm [5], variant (7, 3), i.e., with sliding window size 7 and maximum reference count 3,
- the Apostolico and Drovandi algorithm [1], using the BFS webpage ordering, with parameter l (the number of nodes per compressed block) set to {4, 8, 16, 32, 1024} and parameter r (the root of the BFS) set to 0,
- the variant offering the strongest compression from our earlier work [12], SSL 4b,
- our algorithm (LM) from this work, with 8, 16, 32 and 64 lists per chunk.

We used the implementations publicly available from the authors of the respective algorithms. Note that all those implementations were written in Java, which makes the comparison fair.

| Algorithm | EU-2005 bpe | EU-2005 time [µs] | Indochina-2004 bpe | Indochina-2004 time [µs] | UK-2002 bpe | UK-2002 time [µs] |
|---|---|---|---|---|---|---|
| BV (7,3) | 5.679 | 0.181 | 2.411 | 0.227 | 3.567 | 0.262 |
| BFS, l4 | 4.325 | 0.147 | 2.331 | 0.242 | 3.369 | 0.307 |
| BFS, l8 | 3.561 | 0.179 | 1.860 | 0.227 | 2.627 | 0.260 |
| BFS, l16 | 3.169 | 0.264 | 1.615 | 0.351 | 2.242 | 0.343 |
| BFS, l32 | 2.969 | 0.420 | 1.488 | 0.617 | 2.042 | 0.542 |
| BFS, l1024 | 2.776 | 9.979 | 1.363 | 15.425 | 1.851 | 12.338 |
| SSL 4b | 1.692 | 23.521 | 0.907 | 22.276 | 1.678 | 23.654 |
| LM8 | 3.814 | 0.136 | 2.207 | 0.179 | 3.490 | 0.196 |
| LM16 | 2.963 | 0.166 | 1.668 | 0.265 | 2.733 | 0.253 |
| LM32 | 2.373 | 0.252 | 1.320 | 0.453 | 2.241 | 0.395 |
| LM64 | 2.008 | 0.429 | 1.097 | 0.815 | 1.925 | 0.654 |

Table 2 Comparison of Web graph compressors: compression ratios in bits per edge and average access times per edge. Offset data are included.

Several conclusions can be drawn from the results. BFS and LM seem to be the best choices considering the tradeoff between space and access time. When access time is at a premium, those two are comparable, with a slight advantage of LM (with 8 or 16 lists, confronted with BFS -l4 or -l8). When stronger compression is required, LM reaches bpe figures rather inaccessible to BFS, with the exception of the UK-2002 dataset; in the latter case, the BFS -l1024 archive is 4% smaller than the LM64 archive, at the price of a 19 times longer average access time.

By default, BFS is a randomized algorithm and there are minor yet noticeable differences in the sizes of the compressed graphs it produces. Fixing the parameter r makes the results deterministic; to avoid guessing, we simply set it to 0 in all experiments.

The oldest algorithm, BV, may seem slightly dated, but we note the work [6] from the same team, where they showed that a non-URL-based ordering can make compressed Web graphs about 10% smaller with their old baseline scheme (in a practical variant), and even up to 35% smaller in the case of transposed graphs. BFS plays in the same league, and reordering of nodes is its core feature. It should be stressed that using an ordering other than lexicographically arranged URLs may spoil the compression of the URLs themselves, a practically important but oft-neglected factor (a recent work pointing out this issue, with some solutions tested, is [7]).

As expected, our earlier algorithm, SSL 4b, remains the strongest but also definitely the slowest competitor; it also uses Deflate compression in its final phase. We note that the results of SSL 4b are nevertheless somewhat better here (mostly in access time, but also slightly in compression) than in our previous paper, which is due to the removal of some inefficiency in its Deflate invocation. The BV (7,3) timings are also better than in our previous tests on the same machine with the same methodology, a fact for which we have no good explanation; perhaps it is due to an update of the JDK 6 version.

Finally, we replaced Deflate in our LM algorithm with LZMA², known as one of the strongest LZ77-style algorithms. Unfortunately, we were disappointed: only with a chunk size of 64 lists did LZMA prove better than Deflate (by 4% to 6%), and even then the access times were more than 3 times longer. With smaller chunks, the Deflate algorithm was usually better in compression (the smaller the chunks, the greater its advantage), while the access times revealed the same pattern as above.

## 5 Conclusions

We presented a surprisingly simple yet effective Web graph compression algorithm, LM. By varying a single and very natural parameter (the chunk size, in number of lists) we can obtain several competitive space-time tradeoffs. As opposed to some other alternatives (in particular, BFS), LM does not reorder the graph. (Still, it could be quite interesting to run LM over a permuted graph, making use of the conclusions drawn in [6].)

² http://sourceforge.net/projects/sevenzip/files/LZMA%20SDK/lzma920.tar.bz2


Our algorithm works locally. In the future we are going to try to squeeze out some global redundancy when compressing the LM byproducts; a natural candidate for such experiments is the Re-Pair algorithm [13, 11]. Other lines of research we plan to follow are Web graph compression with bidirectional navigation and efficient compression of URLs.

**Acknowledgements** The work was partially supported by the Polish Ministry of Science and Higher Education under the project N N516 477338 (2010–2011).

## References

[1] A. Apostolico and G. Drovandi: Graph compression by BFS. Algorithms, 2(3) 2009, pp. 1031–1044.

[2] V. N. Anh and A. F. Moffat: Local modeling for webgraph compression, in DCC, J. A. Storer and M. W. Marcellin, eds., IEEE Computer Society, 2010, p. 519.

[3] Y. Asano, Y. Miyawaki, and T. Nishizeki: Efficient compression of web graphs, in COCOON, X. Hu and J. Wang, eds., vol. 5092 of Lecture Notes in Computer Science, Springer, 2008, pp. 1–11.

[4] P. Boldi, B. Codenotti, M. Santini, and S. Vigna: UbiCrawler: A scalable fully distributed Web crawler. Software: Practice & Experience, 34(8) 2004, pp. 711–726.

[5] P. Boldi and S. Vigna: The WebGraph framework I: Compression techniques, in WWW, S. I. Feldman, M. Uretsky, M. Najork, and C. E. Wills, eds., ACM, 2004, pp. 595–602.

[6] P. Boldi, M. Santini, and S. Vigna: Permuting web and social graphs. Internet Math., 6(3) 2010, pp. 257–283.

[7] N. Brisaboa, R. Cánovas, F. Claude, M. Martínez-Prieto, and G. Navarro: Compressed string dictionaries, in SEA, P. M. Pardalos and S. Rebennack, eds., vol. 6630 of Lecture Notes in Computer Science, Springer, 2011, pp. 136–147.

[8] N. Brisaboa, S. Ladra, and G. Navarro: K2-trees for compact web graph representation, in SPIRE, J. Karlgren, J. Tarhio, and H. Hyyrö, eds., vol. 5721 of Lecture Notes in Computer Science, Springer, 2009, pp. 18–30.

[9] G. Buehrer and K. Chellapilla: A scalable pattern mining approach to web graph compression with communities, in WSDM, M. Najork, A. Z. Broder, and S. Chakrabarti, eds., ACM, 2008, pp. 95–106.

[10] F. Claude and G. Navarro: Extended compact web graph representations, in Algorithms and Applications, T. Elomaa, H. Mannila, and P. Orponen, eds., vol. 6060 of Lecture Notes in Computer Science, Springer, 2010, pp. 77–91.

[11] F. Claude and G. Navarro: Fast and compact web graph representations. ACM Transactions on the Web (TWEB), 4(4): article 16, 2010.

[12] Sz. Grabowski and W. Bieniecki: Tight and simple Web graph compression, in PSC, J. Holub and J. Žďárek, eds., 2010, pp. 127–137.

[13] N. J. Larsson and A. Moffat: Off-line dictionary-based compression. Proceedings of the IEEE, 88(11) 2000, pp. 1722–1732.

[14] J. I. Munro and V. Raman: Succinct representation of balanced parentheses, static trees and planar graphs, in IEEE Symposium on Foundations of Computer Science (FOCS), 1997, pp. 118–126.

[15] G. Navarro and V. Mäkinen: Compressed full-text indexes. ACM Computing Surveys, 39(1) 2007, article 2.

[16] G. Turán: On the succinct representation of graphs. Discrete Applied Math., 15(2) 1984, pp. 604–618.
