Merging Adjacency Lists for Efficient Web Graph Compression

DOI: 10.1007/978-3-642-23169-8_42 In book: Man-Machine Interactions 2 AISC 103, pp.385-392

ABSTRACT Analysing Web graphs meets a difficulty in the necessity of storing a major part of huge graphs in the external memory, which
prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus
been presented, to represent Web graphs succinctly but also providing random access. Our algorithm belongs to this category.
It works on contiguous blocks of adjacency lists, and its key mechanism is merging the block into a single ordered list. This
method achieves compression ratios much better than most methods known from the literature at rather competitive access times.

Keywordsgraph compression–random access

Download full-text


Available from: Wojciech Bieniecki, Jul 06, 2015
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Analyzing Web graphs has applications in determining page ranks, fighting Web spam, detecting communities and mirror sites, and more. This study is however hampered by the necessity of storing a major part of huge graphs in the external memory which prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus been presented, to represent Web graphs succinctly, but also providing random access. Those techniques are usually based on differential encodings of the adjacency lists, finding repeating nodes or node regions in the successive lists, more general grammar-based transformations or 2-dimensional representations of the binary matrix of the graph. In this paper we present three Web graph compression algorithms. The first can be seen as engineering of the Boldi and Vigna (2004) [8] method. We extend the notion of similarity between link lists and use a more compact encoding of residuals. The algorithm works on blocks of varying size (in the number of input lists) and sacrifices access time for better compression ratio, achieving more succinct graph representation than other algorithms reported in the literature. The second algorithm works on blocks of the same size in the number of input lists. Its key mechanism is merging the block into a single ordered list. This method achieves much more attractive space–time tradeoffs. Finally, we present an algorithm for bidirectional neighbor query support, which offers compression ratios better than those known from the literature.
    Discrete Applied Mathematics 01/2014; 163:298–306. DOI:10.1016/j.dam.2013.05.028 · 0.68 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: Compressed representations have become effective to store and access large Web and social graphs, in order to support various graph querying and mining tasks. The existing representations exploit various typical patterns in those networks and provide basic navigation support. In this paper, we obtain unprecedented results by finding “dense subgraph” patterns and combining them with techniques such as node orderings and compact data structures. On those representations, we support out-neighbor and out/in-neighbor queries, as well as mining queries based on the dense subgraphs. First, we propose a compression scheme for Web graphs that reduces edges by representing dense subgraphs with “virtual nodes”; over this scheme, we apply node orderings and other compression techniques. With this approach, we match the best current compression ratios that support out-neighbor queries (i.e., nodes pointed from a given node), using 1.0–1.8 bits per edge (bpe) on large Web graphs, and retrieving each neighbor of a node in 0.6–1.0 microseconds ( $\upmu $ s). When supporting both out- and in-neighbor queries, instead, our technique generally offers the best time when using little space. If the reduced graph, instead, is represented with a compact data structure that supports bidirectional navigation, we obtain the most compact Web graph representations (0.9–1.5 bpe) that support out/in-neighbor navigation; yet, the time per neighbor extracted raises to around 5–20 $\upmu $ s. We also propose a compact data structure that represents dense subgraphs without using virtual nodes. It allows us to recover out/in-neighbors and answer other more complex queries on the dense subgraphs identified. This structure is not competitive on Web graphs, but on social networks, it achieves 4–13 bpe and 8–12 $\upmu $ s per out/in-neighbor retrieved, which improves upon all existing representations.
    Knowledge and Information Systems 08/2013; 40(2). DOI:10.1007/s10115-013-0648-4 · 2.64 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: The representation of large subsets of the World Wide Web in the form of a directed graph has been extensively used to analyze structure, behavior, and evolution of those so-called Web graphs. However, interesting Web graphs are very large and their classical representations do not fit into the main memory of typical computers, whereas the required graph algorithms perform inefficiently on secondary memory. Compressed graph representations drastically reduce their space requirements while allowing their efficient navigation in compressed form. While the most basic navigation operation is to retrieve the successors of a node, several important Web graph algorithms require support for extended queries, such as finding the predecessors of a node, checking the presence of a link, or retrieving links between ranges of nodes. Those are seldom supported by compressed graph representations. This paper presents the k2-tree, a novel Web graph representation based on a compact tree structure that takes advantage of large empty areas of the adjacency matrix of the graph. The representation not only retrieves successors and predecessors in symmetric fashion, but also it is particularly efficient to check for specific links between nodes, or between ranges of nodes, or to list the links between ranges. Compared to the best representations in the literature supporting successor and predecessor queries, our technique offers the least space usage (1–3 bits per link) while supporting fast navigation to predecessors and successors (2–8μs per neighbor retrieved) and sharply outperforming the others on the extended queries. The representation is also of general interest and can be used to compress other kinds of graphs and data structures.
    Information Systems 01/2014; 39:152–174. DOI:10.1016/ · 1.24 Impact Factor