Article

Tight and simple Web graph compression

Computing Research Repository (CoRR), 06/2010
Source: arXiv

ABSTRACT

Analysing Web graphs has applications in determining page ranks, fighting Web
spam, detecting communities and mirror sites, and more. Such studies are,
however, hampered by the necessity of storing a major part of huge graphs in
external memory, which prevents efficient random access to edge (hyperlink)
lists. A number of algorithms involving compression techniques have thus been
presented to represent Web graphs succinctly while still providing random
access. These techniques are usually based on differential encodings of the
adjacency lists, on finding repeating nodes or node regions in successive
lists, on more general grammar-based transformations, or on 2-dimensional
representations of the binary matrix of the graph. In this paper we present
two Web graph compression algorithms. The first can be seen as an engineering
of the Boldi and Vigna (2004) method: we extend the notion of similarity
between link lists and use a more compact encoding of residuals. The algorithm
works on blocks of varying size (in the number of input lists) and sacrifices
access time for a better compression ratio, achieving a more succinct graph
representation than other algorithms reported in the literature. The second
algorithm works on blocks of a fixed size (again in the number of input
lists), and its key mechanism is merging each block into a single ordered
list. This method achieves much more attractive space-time tradeoffs.
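
To make the first algorithm's starting point concrete, here is a minimal
sketch of the reference-plus-gaps scheme that BV-style compressors build on.
This is our illustration of the general technique in Python, not the paper's
exact codec; every function name and the crude cost model are ours.

# Sketch of BV-style adjacency-list compression: each sorted list is
# stored either as gaps between successive neighbours, or as a copy
# mask against a recent reference list plus gap-coded residuals.
# Illustrative only; a real codec would entropy-code these integers.

def gaps(xs):
    """Differential (gap) encoding of a sorted list of node ids."""
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])] if xs else []

def undo_gaps(gs):
    out, acc = [], 0
    for g in gs:
        acc += g
        out.append(acc)
    return out

def encode_list(current, reference):
    """Pick the cheaper of direct gap coding and reference coding."""
    direct = ('gaps', gaps(current))
    if not reference:
        return direct
    cur = set(current)
    mask = [1 if v in cur else 0 for v in reference]  # reference entries to copy
    residuals = sorted(cur.difference(reference))     # links absent from the reference
    # crude cost model: a mask bit is ~1/8 as expensive as an integer
    if len(mask) / 8 + len(residuals) < len(current):
        return ('ref', mask, gaps(residuals))
    return direct

def decode_list(enc, reference):
    if enc[0] == 'gaps':
        return undo_gaps(enc[1])
    _, mask, res_gaps = enc
    copied = [v for v, keep in zip(reference, mask) if keep]
    return sorted(copied + undo_gaps(res_gaps))

# Example: the second list copies most of the first.
prev, cur = [13, 15, 16, 17, 20], [13, 15, 16, 18, 20]
enc = encode_list(cur, prev)          # ('ref', [1, 1, 1, 0, 1], [18])
assert decode_list(enc, prev) == cur

The paper's contribution, per the abstract, lies in extending the notion of
similarity between lists and encoding the residuals more compactly than this
baseline does.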

Available from: Wojciech Bieniecki
  • Source
    • "Assuming 20 outgoing links per node, 5-byte links (4-byte indexes to other pages are simply too small) and pointers to each adjacency list we would need more than 5.2 TB of memory, ways beyond the capacities of the current RAM memories. Preliminary versions of this manuscript were published in [4] and [5] "
    ABSTRACT: Analyzing Web graphs has applications in determining page ranks, fighting Web spam, detecting communities and mirror sites, and more. Such studies are, however, hampered by the necessity of storing a major part of huge graphs in external memory, which prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus been presented to represent Web graphs succinctly while also providing random access. These techniques are usually based on differential encodings of the adjacency lists, on finding repeating nodes or node regions in successive lists, on more general grammar-based transformations, or on 2-dimensional representations of the binary matrix of the graph. In this paper we present three Web graph compression algorithms. The first can be seen as an engineering of the Boldi and Vigna (2004) [8] method: we extend the notion of similarity between link lists and use a more compact encoding of residuals. The algorithm works on blocks of varying size (in the number of input lists) and sacrifices access time for a better compression ratio, achieving a more succinct graph representation than other algorithms reported in the literature. The second algorithm works on blocks of a fixed size in the number of input lists; its key mechanism is merging the block into a single ordered list. This method achieves much more attractive space–time tradeoffs. Finally, we present an algorithm for bidirectional neighbor query support, which offers compression ratios better than those known from the literature.
    Full-text · Article · Jan 2014 · Discrete Applied Mathematics
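
    The snippet above leaves the page count implicit. As a hypothetical
    reconstruction (the n = 5 x 10^10 page count and 8-byte pointers are our
    assumptions; the 20 links per node, 5-byte links and per-list pointers
    are from the quote, with 1 TB = 10^12 bytes as in the paper), the figure
    works out as:

    \[
    \underbrace{5 \times 10^{10}}_{\text{pages (assumed)}} \times
    \underbrace{20}_{\text{links/page}} \times
    \underbrace{5\,\text{B}}_{\text{per link}}
    = 5 \times 10^{12}\,\text{B} = 5\,\text{TB},
    \]
    \[
    5\,\text{TB} +
    \underbrace{5 \times 10^{10} \times 8\,\text{B}}_{\text{per-list pointers (assumed width)}}
    \approx 5.4\,\text{TB} > 5.2\,\text{TB}.
    \]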
  • Source
    • "This suggests that we can gain much insight into complex networked systems by discovering and examining their underlaying communities. More importantly, since the community-level structure is exhibited by almost all studied real-world complex networks, an efficient algorithm for detecting communities would be useful to implement a pre-treatment step for a number of general complex operations such as computation distribution, huge graph visualization and large-scale graph compression [2]. A huge number of algorithms have been proposed for detecting communities in complex networks. "
    ABSTRACT: In this work we propose a new efficient algorithm for community construction based on the idea that a community is animated by a set of leaders that are followed by a set of nodes. A node can follow different leaders animating different communities. The algorithm is structured into two main steps: identifying the nodes in the network that play the role of leaders, then assigning the other nodes to some leaders. We provide a general framework for implementing such an approach. First experimental results, obtained by applying the algorithm to different real networks, show the effectiveness of the proposed approach.
    Full-text · Conference Paper · Oct 2011
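
    As a toy illustration of the two-step leader/follower idea quoted above
    (our sketch, not the cited paper's algorithm; the local-degree-maximum
    leader rule is an assumption):

    from collections import defaultdict

    def leader_communities(adj):
        """adj: dict node -> set of neighbours (undirected graph).
        Step 1: take local degree maxima as leaders.
        Step 2: every other node joins the community of each
        neighbouring leader, so communities may overlap."""
        deg = {u: len(vs) for u, vs in adj.items()}
        leaders = {u for u in adj if all(deg[u] >= deg[v] for v in adj[u])}
        communities = defaultdict(set)
        for u in leaders:
            communities[u].add(u)
        for u in adj:
            if u in leaders:
                continue
            for v in adj[u]:
                if v in leaders:      # follow every adjacent leader
                    communities[v].add(u)
        # nodes with no adjacent leader stay unassigned in this toy version
        return dict(communities)

    # Example: node 4 follows two leaders, so the communities overlap.
    g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
    print(leader_communities(g))   # {3: {1, 2, 3, 4}, 5: {4, 5, 6}}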
  • Source
    • "Throughout this section by 1 KB we mean 1000 bytes. We test the following algorithms: • The Boldi and Vigna algorithm [5], variant (7, 3), i.e., the sliding window size is 7 and the maximum reference count 3, • The Apostolico and Drovandi algorithm [1], using BFS webpage ordering , with parameter l (the number of nodes per compressed block) set to {4, 8, 16, 32, 1024} and parameter r (the root of the BFS) set to 0, • The variant offering strongest compression from our earlier work [12], SSL 4b, • Our algorithm (LM) from this work, with 8, 16, 32 and 64 lists per chunk. "
    ABSTRACT: Analysing Web graphs is hampered by the necessity of storing a major part of huge graphs in external memory, which prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus been presented to represent Web graphs succinctly while also providing random access. Our algorithm belongs to this category. It works on contiguous blocks of adjacency lists, and its key mechanism is merging the block into a single ordered list. This method achieves compression ratios much better than most methods known from the literature, at rather competitive access times. Keywords: graph compression, random access
    Full-text · Chapter · Aug 2011
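
    As a toy illustration of the merging mechanism described above (our
    sketch of the idea only; the real LM format, its flag packing and its
    entropy-coding stage are in the cited papers), the h lists of a chunk
    can be merged like this:

    def merge_chunk(lists):
        """Merge the h sorted adjacency lists of one chunk into a single
        ordered list of distinct targets, gap-encoded, plus an h-bit
        membership flag per target saying which lists it occurs in."""
        members = {}
        for i, lst in enumerate(lists):
            for v in lst:
                members[v] = members.get(v, 0) | (1 << i)  # set bit i for list i
        merged = sorted(members)
        flags = [members[v] for v in merged]
        gaps = [merged[0]] + [b - a for a, b in zip(merged, merged[1:])] if merged else []
        return gaps, flags

    def unmerge_chunk(gaps, flags, h):
        """Recover the chunk's h original adjacency lists."""
        merged, acc = [], 0
        for g in gaps:
            acc += g
            merged.append(acc)
        return [[v for v, f in zip(merged, flags) if f >> i & 1] for i in range(h)]

    # Example: three lists sharing most targets shrink to 5 gaps + 5 flags.
    chunk = [[1, 5, 9], [1, 5, 8, 9], [2, 5, 9]]
    gaps, flags = merge_chunk(chunk)   # gaps [1, 1, 3, 3, 1], flags [3, 4, 7, 2, 7]
    assert unmerge_chunk(gaps, flags, 3) == chunk

    Because consecutive lists of a Web graph share many targets, the merged
    list is barely longer than a single list, and its gaps are small, which
    is what makes the per-chunk representation compact.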