
Tight and simple Web graph compression

Computing Research Repository (CoRR), 06/2010
Source: arXiv

ABSTRACT Analysing Web graphs has applications in determining page ranks, fighting Web
spam, detecting communities and mirror sites, and more. This study is, however,
hampered by the necessity of storing a major part of huge graphs in
external memory, which prevents efficient random access to edge (hyperlink)
lists. A number of algorithms involving compression techniques have thus been
presented to represent Web graphs succinctly while still providing random access.
Those techniques are usually based on differential encodings of the adjacency
lists, finding repeating nodes or node regions in the successive lists, more
general grammar-based transformations, or 2-dimensional representations of the
binary matrix of the graph. In this paper we present two Web graph compression
algorithms. The first can be seen as an engineering refinement of the Boldi and
Vigna (2004) method: we extend the notion of similarity between link lists and
use a more compact encoding of residuals. The algorithm works on blocks of
varying size (in the number of input lists) and sacrifices access time for a
better compression ratio, achieving a more succinct graph representation than
other algorithms reported in the literature. The second algorithm works on
blocks of the same size, in the number of input lists, and its key mechanism is
merging the block into a single ordered list. This method achieves much more
attractive space-time tradeoffs.
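As a rough illustration of the differential-encoding idea mentioned above (not the paper's actual codec, and with illustrative function names), a sorted adjacency list can be stored as gaps: the first neighbor relative to the source node id, and each subsequent neighbor relative to its predecessor. Small gaps then compress well under variable-length integer codes.

```python
def gap_encode(neighbors, source):
    """Turn a sorted adjacency list into gaps (differences).

    The first gap is taken relative to the source node id (it may be
    negative); the remaining gaps are differences between consecutive
    neighbors, which are small for locally clustered links.
    """
    gaps = [neighbors[0] - source]
    gaps += [neighbors[i] - neighbors[i - 1] for i in range(1, len(neighbors))]
    return gaps

def gap_decode(gaps, source):
    """Invert gap_encode: rebuild the sorted adjacency list."""
    neighbors = []
    prev = source
    for g in gaps:
        prev += g
        neighbors.append(prev)
    return neighbors
```

For example, the adjacency list [13, 15, 16, 21] of node 10 becomes the gaps [3, 2, 1, 5]; an entropy coder would then spend few bits on such small values.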

Full-text available from: Wojciech Bieniecki, Jul 04, 2015
  • ABSTRACT: Analyzing Web graphs has applications in determining page ranks, fighting Web spam, detecting communities and mirror sites, and more. This study is, however, hampered by the necessity of storing a major part of huge graphs in external memory, which prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus been presented to represent Web graphs succinctly while still providing random access. Those techniques are usually based on differential encodings of the adjacency lists, finding repeating nodes or node regions in the successive lists, more general grammar-based transformations, or 2-dimensional representations of the binary matrix of the graph. In this paper we present three Web graph compression algorithms. The first can be seen as an engineering refinement of the Boldi and Vigna (2004) [8] method: we extend the notion of similarity between link lists and use a more compact encoding of residuals. The algorithm works on blocks of varying size (in the number of input lists) and sacrifices access time for a better compression ratio, achieving a more succinct graph representation than other algorithms reported in the literature. The second algorithm works on blocks of the same size in the number of input lists. Its key mechanism is merging the block into a single ordered list. This method achieves much more attractive space–time tradeoffs. Finally, we present an algorithm for bidirectional neighbor query support, which offers compression ratios better than those known from the literature.
    Discrete Applied Mathematics 01/2014; 163:298–306. DOI: 10.1016/j.dam.2013.05.028
  • ABSTRACT: Analysing Web graphs is hampered by the necessity of storing a major part of huge graphs in external memory, which prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus been presented to represent Web graphs succinctly while still providing random access. Our algorithm belongs to this category. It works on contiguous blocks of adjacency lists, and its key mechanism is merging the block into a single ordered list. This method achieves compression ratios much better than most methods known from the literature at rather competitive access times. Keywords: graph compression, random access
    Man-Machine Interactions 2, AISC 103, 08/2011, pages 385–392.
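The block-merging mechanism described in this abstract can be sketched as follows; this is a simplified illustration under the assumption that each adjacency list in a block is then represented by flags over the merged list, not a faithful reproduction of the paper's encoding:

```python
def merge_block(block):
    """Merge a block of adjacency lists into one sorted list.

    Returns the merged list plus, for each original list, a bit vector
    marking which merged entries it contains. A real compressor would
    store the merged list gap-encoded and the bit vectors entropy-coded.
    """
    merged = sorted(set().union(*map(set, block)))
    position = {v: i for i, v in enumerate(merged)}
    flags = []
    for lst in block:
        bits = [0] * len(merged)
        for v in lst:
            bits[position[v]] = 1
        flags.append(bits)
    return merged, flags

def restore_list(merged, bits):
    """Recover one adjacency list from the merged list and its bit vector."""
    return [v for v, b in zip(merged, bits) if b]
```

Shared neighbors across the block's lists are stored once in the merged list, which is where the compression gain comes from; random access to a list costs one scan of the block's bit vector.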
  • ABSTRACT: In this work we propose a new efficient algorithm for community construction, based on the idea that a community is animated by a set of leaders that are followed by a set of nodes. A node can follow different leaders animating different communities. The algorithm is structured into two main steps: identifying nodes in the network that play the role of leaders, then assigning the other nodes to some leaders. We provide a general framework for implementing such an approach. First experimental results, obtained by applying the algorithm on different real networks, show the effectiveness of the proposed approach.
    2011 IEEE Third International Conference on Privacy, Security, Risk and Trust (PASSAT) and IEEE Third International Conference on Social Computing (SocialCom), Boston, MA, USA, 9–11 Oct. 2011; 01/2011
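The two-step leader/follower scheme in that abstract can be sketched as below. This is only an illustrative instantiation: the paper treats leader identification as a pluggable component, and here we assume the simplest criterion (highest degree), which is not necessarily the one the authors use.

```python
def leader_communities(adj, k):
    """Build overlapping communities around k leaders.

    adj maps each node to the set of its neighbors. Step 1 picks the k
    highest-degree nodes as leaders (an assumed criterion); step 2 assigns
    every other node to each leader it follows, i.e. is adjacent to.
    A node may follow several leaders, so communities can overlap.
    """
    leaders = sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:k]
    leader_set = set(leaders)
    communities = {l: {l} for l in leaders}
    for v in adj:
        if v in leader_set:
            continue
        for l in leaders:
            if l in adj[v]:
                communities[l].add(v)
    return communities
```

On a small symmetric graph this produces overlapping node sets around the two busiest nodes, matching the abstract's point that one node can belong to several leaders' communities.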