Conference Paper

A distributed approach for graph-oriented multidimensional analysis

Abstract

The importance of graphs as the fundamental structure underpinning many real-world applications no longer needs to be demonstrated. Large graphs have emerged in various fields, such as biological, social, and transportation networks. The sheer volume of these networks poses challenges to traditional techniques for storing and analyzing graph data. In particular, OLAP analysis requires access to large portions of the data to extract key information and to feed strategic decision making. OLAP provides multilevel, multiperspective views of the data. Most current techniques are optimized for centralized graph processing. A distributed approach providing horizontal scalability is required to handle the analysis workload. In this paper, we focus on applying OLAP analysis to large, distributed graph data. We describe Distributed Graph Cube, our distributed framework for graph-based OLAP cube computation and aggregation. Experimental results on large, real-world datasets demonstrate that our method significantly outperforms its centralized counterparts. We also evaluate the performance of both Hadoop and Spark for distributed cube computation.

... There are two previous approaches to compute graph cubes in a distributed fashion: Distributed GraphCube [DGS13] that their implementations scale for the computation of the complete graph cube [Wa14] as well as single cuboids [DGS13]. ...
Conference Paper
Property graphs are an intuitive way to model, analyze and visualize complex relationships among heterogeneous data objects, for example, as they occur in social, biological and information networks. These graphs typically contain thousands or millions of vertices and edges and their entire representation can easily overwhelm an analyst. One way to reduce complexity is the grouping of vertices and edges to summary graphs. In this paper, we present an algorithm for graph grouping with support for attribute aggregation and structural summarization by user-defined vertex and edge properties. The algorithm is part of Gradoop, an open-source system for graph analytics. Gradoop is implemented on top of Apache Flink, a state-of-the-art distributed dataflow framework, and thus allows us to scale graph analytical programs across multiple machines. Our evaluation demonstrates the scalability of the algorithm on real-world and synthetic social network data.
... The next few studies focus on how to design graph data cubes for application on different types of networks, including Graph Cube [32], Hyper Graph Cube Model [20], and GOLAP Model [1], which are based on homogeneous networks, as well as those based on heterogeneous networks, such as HM GraphOLAP Model [26] and TSMH Graph Cube Model [18]. Some works also discuss how to perform GraphOLAP operations on parallel computing frameworks to support large-scale data [6], [18], [20], [24]. However, these frameworks cannot combine GraphOLAP operations and graph mining algorithms well to support more complex analysis. ...
... The idea behind these two proposals is basically the same: to efficiently compute all possible aggregations of a homogeneous graph. Also based on the notion of so-called graph cuboids (that is, graphs defined at different levels of aggregation), Distributed Graph Cube [7] is a distributed framework for graph cube computation and aggregation, implemented using Spark and Hadoop. Again, this proposal only supports homogeneous graphs. ...
Chapter
In current “Big Data” scenarios, graph databases are increasingly being used. Online Analytical Processing (OLAP) operations can expand the possibilities of graph analysis beyond the traditional graph-based computation. This paper studies graph databases as an alternative to implement star and snowflake schemas, the typical choices for data warehouse design. For this, the MusicBrainz database is used. A data warehouse for this database is designed, and implemented over a Postgres relational database. This warehouse is also represented as a graph, and implemented over the Neo4j graph database. A collection of typical OLAP queries is used to compare both implementations. The results reported here show that in ten out of thirteen queries tested, the graph implementation outperforms the relational one, in ratios that go from 1.3 to 26 times faster, and performs similarly to the relational implementation in the three remaining cases.
... In addition, such large-scale multidimensional graph data cannot be processed on a single machine. To address these challenges, there have been many studies on methods that use the distributed parallel processing framework MapReduce [29,30]. ...
Article
Graph OLAP is a technology that generates aggregates or summaries of a large-scale graph based on the properties (or dimensions) associated with its nodes and edges, and in turn enables interactive analyses of the statistical information contained in the graph. To efficiently support these OLAP functions, a graph cube is widely used, which maintains aggregate graphs for all dimensions of the source graph. However, computing the graph cube for a large graph requires an enormous amount of time. While previous approaches have used the MapReduce framework to cut down on this computation time, the recently developed Spark environment offers superior computational performance. To leverage the advantages of Spark, we propose the GraphNaïve and GraphTDC algorithms. GraphNaïve sequentially computes graph cuboids for all dimensions in a graph, while GraphTDC computes them after first creating an execution plan. We also propose the Generate Multi-Dimension Table method to efficiently create a multidimensional graph table to express the graph. Evaluation experiments demonstrated that the GraphTDC algorithm significantly outperformed Spark SQL’s built-in library DataFrame, as the size of graphs increased.
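The statement that a graph cube maintains aggregate graphs for all dimensions can be made concrete by listing the dimension subsets involved. The following is a minimal sketch, not the GraphNaïve or GraphTDC algorithm: the function name `cuboid_dimension_sets` and the three example dimensions are assumptions chosen for illustration.

```python
from itertools import combinations

def cuboid_dimension_sets(dimensions):
    """Yield every subset of the dimensions, from the full set down to the empty one.

    Each subset identifies one cuboid, so n dimensions give 2**n cuboids.
    """
    for size in range(len(dimensions), -1, -1):
        for subset in combinations(dimensions, size):
            yield subset

# Hypothetical dimensions of a social graph.
plan = list(cuboid_dimension_sets(("gender", "city", "age")))
```

With three dimensions the list holds eight entries, which illustrates why the computation time grows quickly with the number of dimensions.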
... The framework introduced the cuboid and crossboid queries for building and analyzing the different graph cubes. Distributed Graph Cube is a distributed framework for graph cubes computation and aggregation implemented using Spark and Hadoop [14]. Pagrol is a Map-Reduce framework for distributed OLAP analysis of homogeneous attributed graphs [5]. ...
Conference Paper
Graphs are widespread structures providing a powerful abstraction for modeling networked data. Large and complex graphs have emerged in various domains such as social networks, bioinformatics, and chemical data. However, current warehousing frameworks are not equipped to handle efficiently the multidimensional modeling and analysis of complex graph data. In this paper, we propose a novel framework for building OLAP cubes from graph data and analyzing the graph topological properties. The framework supports the extraction and design of the candidate multidimensional spaces in property graphs. Besides property graphs, a new database model tailored for multidimensional modeling and enabling the exploration of additional candidate multidimensional spaces is introduced. We present novel techniques for OLAP aggregation of the graph, and discuss the case of dimension hierarchies in graphs. Furthermore, the architecture and the implementation of our graph warehousing framework are presented and show the effectiveness of our approach.
Article
Research on Data Cube scalability is extensive, yet sparse. Scalable design patterns for Data Cube implementations are a trend as the technology shifts from centralized and fully materialized models to distributed and partially materialized ones. The implementations exploit cheaper and upgraded hardware in clusters of computer nodes. It is a common understanding that parallel and distributed hardware makes it possible to handle large amounts of multidimensional data for online analytical processing, up to billions of tuples or more, with increased performance and fault tolerance. However, the number of research works and their heterogeneity may overwhelm new initiatives in this field, as there is little discussion regarding the state of the art and ways for improvement. Moreover, the baseline for comparison in most works is often too limited and requires the reader to cross-check the information among many articles to identify possible gaps. In order to help identify these gaps, we analyzed the works on Data Cube scalability and elaborated a comparative study that provides directions for new research on parallel and distributed implementations of data cubes. We identified some features for comparison that include cube function, implementation technology, cube storage type, and various information about the experiments. We expect that the features and comparisons will help researchers identify research gaps and pave the way for future work in the field.
Many real systems produce network data or highly interconnected data, which can be called information networks. These information networks form a critical component of modern information infrastructure, constituting a large volume of graph data. The analysis of information network data covers several technological areas, among them OLAP technologies. OLAP is a technology that enables multidimensional and multilevel analysis of a large volume of data, providing aggregated data visualizations from different perspectives. This article presents a literature review of the main applications of OLAP technology in the analysis of information network data. To that end, it conducts a systematic review that lists the works applying OLAP technologies to graph data. It defines seven comparison criteria (Materialization, Network, Selection, Aggregation, Model, OLAP Operations, Analytics) to qualify the works found based on their functionalities. The works are analyzed according to each criterion and discussed to identify trends and challenges in the application of OLAP to information networks.
Article
Graphs are a fundamental structure that provides an intuitive abstraction for modeling and analyzing complex and highly interconnected data. Given the potential complexity of such data, some approaches proposed extending decision-support systems with multidimensional analysis capabilities over graphs. In this paper, we introduce TopoGraph, an end-to-end framework for building and analyzing graph cubes. TopoGraph extends the existing graph cube models by defining new types of dimensions and measures and organizing them within a multidimensional space that guarantees multidimensional integrity constraints. This results in defining three new types of graph cubes: property graph cubes, topological graph cubes, and graph-structured cubes. Afterwards, we define the algebraic OLAP operations for such novel cubes. We implement and experimentally validate TopoGraph with different types of real-world datasets.
Conference Paper
Observations of daily living (ODLs) are cues that people attend to in the course of their everyday life and that inform them about their health. In order to better understand ODLs, we propose a set of innovative multidimensional analysis concepts and methods. First, the ODLs are organized as directed graphs according to the “observation-property” relationships and the chronological order of observations, which represents all the information in a flexible way. Second, a novel concept, the structure dimension, is proposed and integrated into the traditional multidimensional analysis framework. From the structure dimension, which consists of three granularities (vertices, edges, and subgraphs), one can get a clearer view of the ODLs. Finally, the hierarchy of the ODLs cube is introduced, and the semantics of the OLAP operations Roll-up, Drill-down, and Slice/dice are redefined to accommodate the structure dimension. The proposed structure dimension and ODLs cube are effective for multidimensional analysis of ODLs.
Conference Paper
In current Big Data scenarios, traditional data warehousing and Online Analytical Processing (OLAP) operations on cubes are clearly not sufficient to address the current data analysis requirements. Nevertheless, OLAP operations and models can expand the possibilities of graph analysis beyond the traditional graph-based computation. In spite of this, there is not much work on the problem of taking OLAP analysis to the graph data model. In previous work we proposed a multidimensional (MD) data model for graph analysis, that considers not only the basic graph data, but background information in the form of dimension hierarchies as well. The graphs in our model are node- and edge-labelled directed multi-hypergraphs, called graphoids, defined at several different levels of granularity. In this paper we show how we implemented this proposal over the widely used Neo4J graph database, discuss implementation issues, and present a detailed case study to show how OLAP operations can be used on graphs.
Conference Paper
Aggregation and multidimensional analysis are well-known powerful tools for extracting useful knowledge, shaped in a summarized manner, which are being successfully applied to the annoying problem of managing and mining big data produced by large-scale scientific applications. Indeed, in the context of big data analytics, aggregation approaches allow us to provide meaningful descriptions of these data, otherwise impossible for alternative data-intensive analysis tools. On the other hand, multidimensional analysis methodologies introduce fortunate metaphors that significantly empathize the knowledge discovery phase from such huge amounts of data. Following this main trend, several big data aggregation and multidimensional analysis approaches have been proposed recently. The goal of this paper is to (i) provide a comprehensive overview of state-of-the-art techniques and (ii) depict open research challenges and future directions adhering to the reference scientific field.
Conference Paper
Graphs are ubiquitous data structures commonly used to represent highly connected data. Many real-world applications, such as social and biological networks, are modeled as graphs. To answer the surge in demand for graph data management, many graph database solutions were developed. These databases are commonly classified as NoSQL graph databases, and they provide better support for graph data management than their relational counterparts. However, each of these databases implements its own operational graph data model, which differs among the products. Further, there is no commonly agreed conceptual model for graph databases. In this paper, we introduce a novel conceptual model for graph databases. The aim of our model is to provide analysts with a set of simple, well-defined, and adaptable conceptual components to perform rich analysis tasks. These components take into account the evolving aspect of the graph. Our model is analytics-oriented, flexible, and incremental, enabling analysis over evolving graph data. The proposed model provides a typing mechanism for the underlying graph, and formally defines the minimal set of data structures and operators needed to analyze the graph.
Conference Paper
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
Conference Paper
Social Network Analysis has emerged as a key paradigm in modern sociology, technology, and information sciences. The paradigm stems from the view that the attributes of an individual in a network are less important than their ties (relationships) with other individuals in the network. Exploring the nature and strength of these ties can help understand the structure and dynamics of social networks and explain real-world phenomena, ranging from organizational efficiency to the spread of information and disease. In this paper, we examine the communication patterns of millions of mobile phone users, allowing us to study the underlying social network in a large-scale communication network. Our primary goal is to address the role of social ties in the formation and growth of groups, or communities, in a mobile network. In particular, we study the evolution of churners in an operator's network over a period of four months. Our analysis explores the propensity of a subscriber to churn out of a service provider's network depending on the number of ties (friends) that have already churned. Based on our findings, we propose a spreading activation-based technique that predicts potential churners by examining the current set of churners and their underlying social network. The efficiency of the prediction is expressed as a lift curve, which indicates the fraction of all churners that can be caught when a certain fraction of subscribers were contacted.
Conference Paper
Extracting and visualizing information from biochemical databases is one of the most important challenges in biochemical research. The huge quantity and high complexity of the data available force the biologist to use sophisticated tools for extracting and interpreting accurately the information extracted from the database. These tools must define a graphical semantics associated to the data semantics in accordance with biologist usages. The aim of these tools is to display complex biochemical networks in a readable and understandable way. In this paper we define the notion of customizable representation model, which allows the biologist to change the graphical semantics associated to the data semantics. The approach is also generic since our graphical semantics is common to several kinds of biochemical networks. We also defined adaptive graph layout algorithms taking into account the particular semantics of biochemical networks. We show how we implemented these notions in the BioMaze project.
Article
Information diffusion and virus propagation are fundamental processes taking place in networks. While it is often possible to directly observe when nodes become infected with a virus or publish the information, observing individual transmissions (who infects whom, or who influences whom) is typically very difficult. Furthermore, in many applications, the underlying network over which the diffusions and propagations spread is actually unobserved. We tackle these challenges by developing a method for tracing paths of diffusion and influence through networks and inferring the networks over which contagions propagate. Given the times when nodes adopt pieces of information or become infected, we identify the optimal network that best explains the observed infection times. Since the optimization problem is NP-hard to solve exactly, we develop an efficient approximation algorithm that scales to large datasets and finds provably near-optimal networks. We demonstrate the effectiveness of our approach by tracing information diffusion in a set of 170 million blogs and news articles over a one year period to infer how information flows through the online media space. We find that the diffusion network of news for the top 1,000 media sites and blogs tends to have a core-periphery structure with a small set of core media sites that diffuse information to the rest of the Web. These sites tend to have stable circles of influence with more general news media sites acting as connectors between them.
Conference Paper
Extracting and visualizing information from biochemical databases is one of the most important challenges in biochemical research. The huge quantity and high complexity of the data available force the biologist to use sophisticated tools for extracting and interpreting accurately the information extracted from the database. These tools must define a graphical semantics associated to the data semantics in accordance with biologist usages. The aim of these tools is to display complex biochemical networks in a readable and understandable way. In this paper we define the notion of customizable representation model, which allows the biologist to change the graphical semantics associated to the data semantics. The approach is also generic since our graphical semantics is common to several kinds of biochemical networks. We also defined adaptive graph layout algorithms taking into account the particular semantics of biochemical networks. We show how we implemented these notions in the BioMaze project.
Article
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.
Conference Paper
Data cube has been playing an essential role in fast OLAP (online analytical processing) in many multi-dimensional data warehouses. However, there exist data sets in applications like bioinformatics, statistics, and text processing that are characterized by high dimensionality, e.g., over 100 dimensions, and moderate size, e.g., around 10^6 tuples. No feasible data cube can be constructed with such data sets. In this paper we will address the problem of developing an efficient algorithm to perform OLAP on such data sets. Experience tells us that although data analysis tasks may involve a high dimensional space, most OLAP operations are performed only on a small number of dimensions at a time. Based on this observation, we propose a novel method that computes a thin layer of the data cube together with associated value-list indices. This layer, while being manageable in size, will be capable of supporting flexible and fast OLAP operations in the original high dimensional space. Through experiments we will show that the method has I/O costs that scale nicely with dimensionality. Furthermore, the costs are comparable to that of accessing an existing data cube when full materialization is possible.
Conference Paper
We consider extending decision support facilities toward large sophisticated networks, upon which multidimensional attributes are associated with network entities, thereby forming the so-called multidimensional networks. Data warehouses and OLAP (Online Analytical Processing) technology have proven to be effective tools for decision support on relational data. However, they are not well-equipped to handle the new yet important multidimensional networks. In this paper, we introduce Graph Cube, a new data warehousing model that supports OLAP queries effectively on large multidimensional networks. By taking account of both attribute aggregation and structure summarization of the networks, Graph Cube goes beyond the traditional data cube model involved solely with numeric value based group-by's, thus resulting in a more insightful and structure-enriched aggregate network within every possible multidimensional space. Besides traditional cuboid queries, a new class of OLAP queries, crossboid, is introduced that is uniquely useful in multidimensional networks and has not been studied before. We implement Graph Cube by combining special characteristics of multidimensional networks with the existing well-studied data cube techniques. We perform extensive experimental studies on a series of real world data sets and Graph Cube is shown to be a powerful and efficient tool for decision support on large multidimensional networks.
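The combination of attribute aggregation and structure summarization described above can be illustrated with a single cuboid. This is a minimal sketch under stated assumptions, not the Graph Cube implementation: the function `compute_cuboid`, the COUNT aggregate, the directed treatment of edges, and the sample data are all hypothetical choices made for illustration.

```python
from collections import Counter

def compute_cuboid(vertices, edges, dims):
    """Aggregate a multidimensional network along the chosen dimensions.

    vertices: dict mapping vertex id -> dict of attribute values
    edges: iterable of (source id, target id) pairs, treated as directed here
    dims: tuple of attribute names that define the cuboid
    """
    # Each vertex collapses into the group identified by its values on dims.
    group_of = {v: tuple(attrs[d] for d in dims) for v, attrs in vertices.items()}
    # Attribute aggregation: COUNT of original vertices per aggregate vertex.
    vertex_counts = Counter(group_of.values())
    # Structure summarization: COUNT of original edges per pair of groups.
    edge_counts = Counter((group_of[s], group_of[t]) for s, t in edges)
    return vertex_counts, edge_counts

# Hypothetical four-vertex network with two dimensions.
vertices = {
    1: {"gender": "F", "city": "Paris"},
    2: {"gender": "M", "city": "Paris"},
    3: {"gender": "F", "city": "Lyon"},
    4: {"gender": "M", "city": "Lyon"},
}
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]

nodes, links = compute_cuboid(vertices, edges, ("gender",))
```

Calling the function with `("gender", "city")` or `("city",)` instead yields other cuboids of the same network, each a smaller aggregate graph rather than a table of numeric group-by results.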
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
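The division of labor between a user-supplied map function and reduce function can be sketched in a few lines. The following runs on a single machine and only mimics the programming model; it performs none of the parallelization, fault handling, or scheduling that the runtime provides. The names `map_reduce`, `map_words`, and `reduce_counts` are assumptions introduced for illustration.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def map_words(line):
    # Emit one (word, 1) pair per word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_counts(word, counts):
    # Sum all partial counts emitted for the same word.
    return sum(counts)

lines = ["graph cube", "data cube", "Graph OLAP"]
word_counts = map_reduce(lines, map_words, reduce_counts)
```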
Article
We describe the techniques developed to gather and distribute in a highly compressed, yet accessible, form a series of twelve snapshots of the .uk web domain. Ad hoc compression techniques made it possible to store the twelve snapshots using just 1:9 bits per link, with constant-time access to temporal information. Our collection makes it possible to study the temporal evolution of link-based scores (e.g., PageRank), the growth of online communities, and in general time-dependent phenomena related to the link structure.
Article
Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. This paper provides an overview of data warehousing and OLAP technologies, with an emphasis on their new requirements. We describe back end tools for extracting, cleaning and loading data into a data warehouse; multidimensional data models typical of OLAP; front end client tools for querying and data analysis; server extensions for efficient query processing; and tools for metadata management and for managing the warehouse. In addition to surveying the state of the art, this paper also identifies some promising research issues, some of which are related to problems that the database research community has worked on for years, but others are only just beginning to be addressed. This overview is based on a tutorial that the authors presented at the VLDB Conference, 1996.
Article
Decision support applications involve complex queries on very large databases. Since response times should be small, query optimization is critical. Users typically view the data as multidimensional data cubes. Each cell of the data cube is a view consisting of an aggregation of interest, like total sales. The values of many of these cells are dependent on the values of other cells in the data cube. A common and powerful query optimization technique is to materialize some or all of these cells rather than compute them from raw data each time. Commercial systems differ mainly in their approach to materializing the data cube. In this paper, we investigate the issue of which cells (views) to materialize when it is too expensive to materialize all views. A lattice framework is used to express dependencies among views. We then present greedy algorithms that work off this lattice and determine a good set of views to materialize. The greedy algorithm performs within a small constant...
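The idea of greedily choosing views from a dependency lattice can be sketched as follows. This is one plausible greedy scheme under an assumed cost model, not necessarily the algorithm of the paper: a view is assumed to be answerable from any materialized view whose dimensions include its own, at a cost equal to that view's row count. The function `greedy_select` and the row counts are hypothetical.

```python
def greedy_select(sizes, k):
    """Pick up to k extra views to materialize; the top view is always kept.

    sizes: dict mapping frozenset of dimensions -> assumed row count
    """
    top = max(sizes, key=len)  # the view that groups by every dimension
    chosen = {top}

    def cost(view, materialized):
        # Cheapest materialized view whose dimensions include those of `view`.
        return min(sizes[m] for m in materialized if view <= m)

    for _ in range(k):
        best, best_benefit = None, 0
        for candidate in sizes:
            if candidate in chosen:
                continue
            # Total cost reduction over all views if candidate is added.
            benefit = sum(
                cost(v, chosen) - cost(v, chosen | {candidate}) for v in sizes
            )
            if benefit > best_benefit:
                best, best_benefit = candidate, benefit
        if best is None:
            break  # no remaining view reduces the total cost
        chosen.add(best)
    return chosen

# Hypothetical lattice over two dimensions with assumed row counts.
sizes = {
    frozenset({"a", "b"}): 100,
    frozenset({"a"}): 20,
    frozenset({"b"}): 90,
    frozenset(): 1,
}
chosen = greedy_select(sizes, 1)
```

In this example the view on `a` is selected first because it lowers the cost of two views from 100 to 20, a larger total saving than either alternative.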