An RDF Dataset Generator for the Social
Network Benchmark with Real-World Coherence
Mirko Spasić 1,2, Milos Jovanovik 1,3, and Arnau Prat-Pérez 4
1OpenLink Software, UK
2Faculty of Mathematics, University of Belgrade, Serbia
3Faculty of Computer Science and Engineering
Ss. Cyril and Methodius University in Skopje, Macedonia
4Sparsity Technologies, Spain
{mspasic, mjovanovik}@openlinksw.com, arnau@sparsity-technologies.com
Abstract. Synthetic datasets used in benchmarking need to mimic all characteristics of real-world datasets, in order to provide realistic benchmarking results. Synthetic RDF datasets usually show a significant discrepancy in the level of structuredness compared to real-world RDF datasets. This structural difference is important as it directly affects storage, indexing and querying. In this paper, we show that the synthetic RDF dataset used in the Social Network Benchmark is characterized by high structuredness, and therefore introduce modifications to the data generator so that it produces an RDF dataset with real-world structuredness.
Keywords: Data Generation, Social Network Benchmark, Synthetic Data,
Linked Data, Big Data, RDF, Benchmarks.
1 Introduction
Linked Data and RDF benchmarking require the use of real-world or synthetic
RDF datasets [2]. For better benchmarking, synthetic RDF datasets need to
comply with the general characteristics observed in their real-world counterparts,
such as the schema, structure, size, distributions, etc. As the authors of [3] show,
synthetic RDF datasets used in benchmarking exhibit a significant structural
difference from real datasets: real-world RDF datasets are less coherent, i.e. have
a lower degree of structuredness than synthetic RDF datasets. This structural
difference is important as it has direct consequences on the storage of the data,
as well as on the ways the data are indexed and queried.
In order to create the basis for more realistic RDF benchmarking, we modify the existing RDF data generator for the Social Network Benchmark so that the resulting synthetic dataset follows the structuredness observed in real-world RDF datasets. Additionally, we implement the coherence measurement for RDF datasets as a Virtuoso procedure, to simplify the process of measuring the structuredness of any given RDF dataset by using the RDF graph stored in the quad store, instead of using RDF files.
* The presented work was funded by the H2020 project HOBBIT (#688227).
2 Background and Related Work
2.1 Data Generator for the Social Network Benchmark
The Social Network Benchmark (SNB) [4] provides a synthetic data generator (Datagen)1, which models an online social network (OSN), like Facebook. The data contains different types of entities and relations, such as persons with friendship relations among them, posts, comments or likes. Additionally, it reproduces many of the structural characteristics observed in real OSNs, summarized below.
Attribute correlations. Real data is correlated by nature. For instance, the given names of persons are correlated with their country. Similarly, the textual content of a message is correlated with the interests of its creator. For the purpose of benchmarking and performance evaluation, reproducing these correlations is essential, since understanding them can be used to generate optimal execution plans or to properly lay out the data in memory. Datagen uses dictionaries extracted from DBpedia to create entities with correlated attribute values.
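To make the mechanism concrete, the following Python sketch (purely illustrative; the dictionaries and weights are hypothetical stand-ins for Datagen's DBpedia-derived ones) first picks a country and then draws a given name conditioned on that country, so that the two attributes remain correlated:

import random

# Hypothetical, tiny stand-ins for the DBpedia-derived dictionaries used by Datagen.
NAMES_BY_COUNTRY = {
    "Germany": ["Hans", "Anna", "Lukas"],
    "India": ["Amit", "Priya", "Rahul"],
}
COUNTRY_SHARE = {"Germany": 0.4, "India": 0.6}

def generate_person(rng):
    # Pick the country first, then pick a given name *conditioned* on it,
    # so firstName and isLocatedIn end up correlated.
    country = rng.choices(list(COUNTRY_SHARE), weights=list(COUNTRY_SHARE.values()))[0]
    return {"firstName": rng.choice(NAMES_BY_COUNTRY[country]), "isLocatedIn": country}

if __name__ == "__main__":
    rng = random.Random(42)
    print([generate_person(rng) for _ in range(3)])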
Degree distributions. The underlying graph structure of OSNs exhibits skewed degree distributions: most people have between 20 and 200 friends, while a few of them have over 5000 friends [12]. In a real database system, this skewness complicates the estimation of cardinalities required to produce optimal execution plans, as well as load balancing when executing parallel operators over adjacency lists. Also, nodes with a large degree, commonly known as hubs, must be handled carefully, especially in distributed systems, where traversing these nodes can incur large inter-node communication. Datagen takes the degree distribution of Facebook and empirically reproduces it.
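A minimal sketch of how such a skewed distribution can be reproduced (the bucket values below are made up, not the actual Facebook-derived tables) is to sample target degrees by inverse-CDF lookup over a discretized distribution:

import bisect
import random

# Hypothetical discretized degree distribution: (degree, probability); a real
# generator would use buckets fitted to the measured Facebook data [12].
DEGREE_BUCKETS = [(10, 0.25), (50, 0.40), (150, 0.25), (500, 0.08), (5000, 0.02)]

DEGREES = [d for d, _ in DEGREE_BUCKETS]
CDF = []
acc = 0.0
for _, p in DEGREE_BUCKETS:
    acc += p
    CDF.append(acc)

def sample_degree(rng):
    # Inverse-CDF sampling: most persons get modest degrees, a few become hubs.
    idx = bisect.bisect_left(CDF, rng.random())
    return DEGREES[min(idx, len(DEGREES) - 1)]

if __name__ == "__main__":
    rng = random.Random(0)
    degrees = [sample_degree(rng) for _ in range(10000)]
    print(sum(d >= 500 for d in degrees) / len(degrees))  # small fraction of hubs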
Structure-Attribute correlations. The homophily principle states that in a real social network, similar people have a higher probability of being connected, which leads to the formation of many transitive relations between connected people with similar characteristics. As a consequence, even though it is commonly accepted that graph data access patterns are usually random, in practice there is some degree of locality that can be exploited. For instance, people from a given country are more likely to be connected among themselves than to people from other countries. Thus, this information can be used to lay out data in memory wisely to improve graph traversal performance, for instance, by placing all people from a country closer in memory. Datagen generates person relationships based on different correlation dimensions (i.e. the interests of a person, the place that person studied, etc.). These correlation dimensions are used to sort persons in such a way that those that are more similar are placed close together. Then, edges are created between nearby persons, with a probability that decreases geometrically with their distance. As shown in [11], this approach successfully produces attribute-correlated networks with other desirable characteristics, such as a large clustering coefficient, a small diameter and a large largest connected component.
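The following Python sketch approximates this idea (the similarity key, window and probabilities are hypothetical parameters, not Datagen's actual ones): persons are sorted along one correlation dimension, and edges are then created between nearby persons with a probability that decays geometrically with their distance in the ordering.

import random

def correlated_edges(persons, similarity_key, rng, p0=0.8, decay=0.5, window=5):
    # Sort persons by a correlation dimension (e.g. interests or university),
    # then connect nearby persons with geometrically decreasing probability.
    ordered = sorted(persons, key=similarity_key)
    edges = set()
    for i, person in enumerate(ordered):
        for dist in range(1, window + 1):
            if i + dist >= len(ordered):
                break
            # Probability drops geometrically with the distance in the ordering,
            # so similar (nearby) persons are much more likely to be connected.
            if rng.random() < p0 * (decay ** (dist - 1)):
                edges.add((person["id"], ordered[i + dist]["id"]))
    return edges

if __name__ == "__main__":
    rng = random.Random(7)
    persons = [{"id": i, "university": rng.choice(["UB", "UKIM", "UPC"])} for i in range(20)]
    print(len(correlated_edges(persons, lambda p: p["university"], rng)))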
Spiky activity volume. In a real social network, the amount of activity is not uniform but reactive to real-world events. If a natural disaster occurs, we will observe people talking about it mostly after the time of the disaster, and the associated activity volume will decay as the hours pass. This translates to a spiky activity volume over the life of the social network, mixing moments of high load with situations where the load level is small. Also, this means that the amount of messages produced by people, and their topics, are correlated with given points in time, which complicates the work of a query optimizer when estimating cardinalities for its query plans. This can also make a system unable to cope with the load if it has not been properly overprovisioned to handle these spiky situations. Instead of generating posts and comments uniformly distributed along time, Datagen creates virtual events of different degrees of relevance, which are used to drive the generation of the user activity, producing a spiky distribution.

1 https://github.com/ldbc/ldbc_snb_datagen
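As an illustration (the event times, relevance values and decay rate below are invented for the example), an event-driven generator can draw message timestamps as exponentially decaying offsets after each virtual event, yielding the spiky volume described above:

import random

def spiky_timestamps(events, total_messages, rng, decay_hours=24.0):
    # events: list of (event_time_in_hours, relevance); messages cluster just
    # after an event and thin out exponentially, instead of being uniform.
    timestamps = []
    total_relevance = sum(rel for _, rel in events)
    for event_time, relevance in events:
        n = round(total_messages * relevance / total_relevance)
        for _ in range(n):
            # Exponentially distributed offset after the event => spiky volume.
            timestamps.append(event_time + rng.expovariate(1.0 / decay_hours))
    return sorted(timestamps)

if __name__ == "__main__":
    rng = random.Random(1)
    events = [(0.0, 1.0), (240.0, 3.0), (600.0, 0.5)]  # (hour, relevance)
    print(spiky_timestamps(events, total_messages=1000, rng=rng)[:5])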
2.2 Dataset Coherence
The comparison of data generated with existing RDF benchmarks (TPC-H, BSBM, LUBM, SP2Bench, etc.) and data found in widely used real RDF datasets (DBpedia, UniProt, etc.) shows that these two have significant structural differences [3]. In the same paper, the authors introduce a composite metric, called coherence of a dataset, in order to quantify the structuredness of a dataset D with respect to a type system $\mathcal{T}$, as follows:

$$CH(\mathcal{T}, D) = \sum_{T \in \mathcal{T}} WT(CV(T, D)) \cdot CV(T, D)$$
This is the weighted sum of the coverage CV(T, D) of the individual types T ∈ $\mathcal{T}$, where the weight coefficient WT(CV(T, D)) depends on the number of properties of a type T, the number of entities of type T in the dataset D, and their share in the totality of the dataset D among the other types. Its rationale is to give higher impact to types with more instances and properties. CV(T, D) represents the coverage of type T on the dataset D. It depends on whether the instances of the type T set a value for all its properties. If that is the case for all the instances, the coverage will be 1 (perfect structuredness); otherwise it will take a value from [0, 1). The conclusion of [3] is that there is a clear distinction in the structuredness, i.e. the coherence, between the datasets derived from the existing RDF benchmarks and the real-world RDF datasets. For the former, the values range between 0.79 (for SP2Bench) and 1 (for TPC-H), reflecting the characteristics of relational databases, while the coherence values for almost all real-world datasets are below or around 0.6.
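To make the metric concrete, the following Python sketch computes the coverage of each type over a small in-memory set of triples, and combines the per-type coverages with weights proportional to the type's number of properties and instances, as described above (the exact weight normalization used in [3] may differ in detail):

from collections import defaultdict

RDF_TYPE = "rdf:type"

def coverage(triples, t):
    # CV(T, D): fraction of (instance of T, property of T) pairs that are set.
    # 1.0 means every instance of T sets every property used by T.
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == t}
    props = defaultdict(set)          # property -> instances of T that set it
    for s, p, o in triples:
        if s in instances and p != RDF_TYPE:
            props[p].add(s)
    if not instances or not props:
        return 1.0
    set_pairs = sum(len(insts) for insts in props.values())
    return set_pairs / (len(props) * len(instances))

def coherence(triples):
    # CH(T, D): weighted sum of per-type coverages; here the weight grows with
    # the number of properties and instances of the type (paraphrasing [3]).
    types = {o for s, p, o in triples if p == RDF_TYPE}
    stats = {}
    for t in types:
        insts = {s for s, p, o in triples if p == RDF_TYPE and o == t}
        props = {p for s, p, o in triples if s in insts and p != RDF_TYPE}
        stats[t] = (coverage(triples, t), len(props) + len(insts))
    total = sum(weight for _, weight in stats.values())
    return sum(cv * weight / total for cv, weight in stats.values())

if __name__ == "__main__":
    data = [
        ("c1", RDF_TYPE, "Comment"), ("c1", "content", "ok"), ("c1", "isLocatedIn", "DE"),
        ("c2", RDF_TYPE, "Comment"), ("c2", "content", "lol"),  # no location => CV < 1
    ]
    print(coverage(data, "Comment"), coherence(data))  # 0.75 0.75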
3 Measuring RDF Dataset Coherence in Virtuoso
In order to make an RDF benchmark more realistic, the dataset should follow the nature of real data. Since the SNB dataset was developed to test not only RDF stores, but also graph database systems, graph programming frameworks, and relational and NoSQL database systems, we wanted to measure how suitable this dataset is for RDF benchmarks.
The authors of [3] propose a workflow to compute the coherence of a dataset in a few steps: assembling all triples into a single file, data cleaning and normalization, and generating several new files and sorting them in different orders, so that the corresponding metrics can be collected in a single pass over the sorted files. The disadvantages of this approach are the memory requirements for storing all files in non-compressed format, and the time required to sort them. Also, the sorting process may use additional temporary space, which is not negligible.
Here, we propose a new approach to compute the coherence of any dataset, using an efficient RDF store, such as Virtuoso [5], leaving it to the system to take care of efficiency and data compression. Virtuoso is a column store with good compression capabilities, thus we obtain a simpler and much more space- and time-efficient procedure for calculating the proposed metric. First, we load the dataset in question into a graph within Virtuoso, with a single command (ld_dir). Afterwards, we define a stored procedure in Virtuoso/PL for calculating the coherence, by selecting all types from a dataset, and calculating their individual coverage and the weighted sum coefficients. The procedure, along with its supporting procedures, is available on GitHub2. Here, we show the coverage() procedure:
create procedure coverage (in graph VARCHAR, in t LONG VARCHAR) {
  declare a, b, c bigint;
  -- a: number of (instance, property) pairs that are actually set, i.e. for each
  --    non-rdf:type predicate, the number of distinct instances of type t using it
  select sum(cnt) into a from (
    select t2.P as pred, count(distinct t1.S) as cnt
    from RDF_QUAD t1, RDF_QUAD t2
    where t1.S = t2.S and t2.P <> iri_to_id('rdf:type')
      and t1.G = iri_to_id(graph) and t2.G = iri_to_id(graph)
      and t1.P = iri_to_id('rdf:type') and t1.O = iri_to_id(t)
    group by pred
  ) tmp ;
  -- b: number of distinct properties (other than rdf:type) used by instances of type t
  select count(distinct t2.P) into b from RDF_QUAD t1, RDF_QUAD t2
  where t1.S = t2.S and t2.P <> iri_to_id('rdf:type')
    and t1.G = iri_to_id(graph) and t2.G = iri_to_id(graph)
    and t1.P = iri_to_id('rdf:type') and t1.O = iri_to_id(t) ;
  -- c: number of instances of type t in the given graph
  select count(distinct S) into c from RDF_QUAD
  where G = iri_to_id(graph)
    and P = iri_to_id('rdf:type') and O = iri_to_id(t) ;
  -- CV(t, D) = a / (b * c): the share of possible (instance, property) pairs that are set
  return cast (a as real) / (b * c);
}
Using the original SNB Datagen, we prepared the datasets whose sizes and coherence metrics are presented in Section 5. Since their coherence varies from 0.86 to 0.89, we can conclude that these datasets are much more structured than real-world RDF datasets, and thus not suitable for benchmarking RDF stores. Our intention is to make them mimic real-world Linked Data datasets, with a structuredness level of around 0.6.
2 https://github.com/ldbc/ldbc_snb_implementations
4 A Realistic RDF Dataset Generator for the Social
Network Benchmark
The original SNB Datagen (Section 2.1) reproduces the important structural characteristics observed in real online social networks. However, the authors of [3] show that structuredness is also important for RDF datasets used for benchmarking. Therefore, we modify the SNB Datagen so that we lower the structuredness, i.e. the coherence measure, from around 0.88 to around 0.6, to comply with the structuredness of real-world RDF datasets.
The authors of [3] propose a generic way of decreasing the coherence metric for any dataset, without using domain-specific knowledge. The consequence of their modification is a reduction of the dataset size. We decided to take a different approach and make changes to the initial data generator, introducing new predicates for some instances, as well as removing some triples from the initial dataset, all while taking into account the reality of the specific domain, i.e. a social network. This enrichment phase provides a more realistic and complex dataset, complying with the current state and features of real-world social networks. In Table 1, we present the most dominant types from the SNB dataset, along with their weights. We omit the other types, as their weights are not significant in this case.
Table 1. Entity types and their weights from the SNB dataset.

Type      Weight
Comment   0.6477
Post      0.3126
Forum     0.0282
Person    0.0031
The weight of a type mostly depends on the number of its instances in the dataset. In the SNB dataset, comments are the most numerous, followed by posts, and Table 1 shows their dominance in this regard: together they hold 96% of the dataset weight. In order to decrease the coherence metric of the dataset, CH($\mathcal{T}$, D), we should decrease the coverage CV(T, D) of each type T in the dataset. However, if we, for example, decrease the coverage of the type Person from 0.95 to 0 (which is not realistic), that will reduce the coherence measure by less than 0.3%. Bearing in mind that we have to decrease it much more than that, the only reasonable choices for modifications are the Comment and Post types.
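This bound follows directly from the weights in Table 1: a type's contribution to the coherence is limited by its weight multiplied by its coverage, so (ignoring the small renormalization of the weights) dropping the coverage of Person from 0.95 to 0 changes the coherence by at most

$$\Delta CH \le W_{Person} \cdot CV(Person, D) \approx 0.0031 \cdot 0.95 \approx 0.003,$$

whereas the coherence needs to be brought down from roughly 0.88 to around 0.6.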
The mutual predicates of these two types are browserUsed, content, creationDate, hasCreator, hasTag, id, isLocatedIn, length and locationIP, while Comment instances additionally have the replyOf predicate, and Post instances can have the language and imageFile properties if the Post instance is a photo. One way of decreasing the coverage of specific types is the removal of a large number of triples related to a specific property. However, taking into account the specific domain, we conclude that the only property that can be removed from a portion of the posts and comments is isLocatedIn. The initial purpose of this property was to specify the country from which the message was issued, and it was determined by the IP address of the location where the message had been created. However, since a lot of users access social networks using smartphones equipped with GPS receivers, social networks offer the possibility of adding locations to the messages. If we consider this property in that manner, we can remove the location from some messages, as not all messages contain location information. Research in the domain shows that users rarely share their location in posts: the authors of [7] show that only about 1.2% of Twitter posts contain explicit location information, while [9] shows that only around 6% of Instagram posts (photos) have a location tagged. Therefore, we remove the location information from 98% of comments and textual posts, and from 94% of photo posts, which significantly reduces the coverage of posts and comments.
Since it does not make sense to remove any other property, in order to achieve our goal we decided to introduce new ones. In the initial dataset, all of the comments are textual, while recently social networks added a predefined set of GIFs which can be used as comments [8, 6]. In the initial dataset, one third of all comments are long textual comments, while two thirds are short comments, e.g. “ok”, “great”, “cool”, “thx”, “lol”, “no way!”, “I see”, “maybe”, etc. In order to include GIFs as comments, we introduce the gifFile property, which we apply to 80% of the short comments as a replacement for their content property.
In the next step, we add one more property to posts and comments: mentions. Its purpose is to mention a person in a post or a comment. This modification is also in line with what we have on social networks such as Facebook, Twitter, Instagram, etc., where a user can mention another user in a post or a comment, usually to make the other person aware of it. An analysis we performed over the Twitter7 dataset [13] showed that 40% of the tweets contain at least one mention. Therefore, we add this property to 40% of posts and comments, which provides an additional drop in the coherence measure.
A significant issue in operating a social network is privacy. Facebook introduced the possibility for each author of a post/comment to determine its level of privacy: whether to share it publicly, with friends only, or with a specific group of people [1]. Therefore, we introduce the visibility predicate, which is set on a post/comment when it is posted with a privacy setting different from the default one for the user. Assuming that users generally keep their default privacy setting, we generate this property for 5% of all messages.
The final change we added to the data generator is the addition of the link property, which both textual posts and comments can have. This corresponds to the real-world activity of sharing a link in a post, in addition to the text. Based on an analysis of user behavior on social media [10], which found that 43% of Facebook posts contain links, we add the link property to that share of textual posts and comments. As a value for the link property, we use a random value from a predefined pool, similar to other properties filled in by Datagen. It will always be fetched at the end of query execution, without any filtering that would require cardinality estimation, so the actual value is irrelevant.
The new SNB Datagen, which generates RDF datasets with real-world coherence, is publicly available on GitHub3.
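The following Python sketch condenses the enrichment rules described in this section, using the probabilities stated above; the message representation, the kind marker for short comments and the helper pools are illustrative assumptions, not the actual Datagen code:

import random

# Probabilities taken from the text above; everything else is illustrative.
P_KEEP_LOCATION_TEXT = 0.02    # location kept on 2% of comments and textual posts
P_KEEP_LOCATION_PHOTO = 0.06   # location kept on 6% of photo posts
P_GIF_SHORT_COMMENT = 0.80     # short comments replaced by a GIF
P_MENTION = 0.40               # posts/comments that mention a person
P_VISIBILITY = 0.05            # messages with a non-default privacy setting
P_LINK = 0.43                  # textual posts/comments that share a link

def enrich_message(msg, rng, persons, link_pool):
    # msg is a dict of predicate -> value for one Post or Comment.
    is_photo = "imageFile" in msg
    keep_loc = P_KEEP_LOCATION_PHOTO if is_photo else P_KEEP_LOCATION_TEXT
    if rng.random() > keep_loc:
        msg.pop("isLocatedIn", None)                       # drop the location
    if msg.get("kind") == "short_comment" and rng.random() < P_GIF_SHORT_COMMENT:
        msg.pop("content", None)
        msg["gifFile"] = "gif" + str(rng.randrange(1000))  # GIF instead of text
    if rng.random() < P_MENTION:
        msg["mentions"] = rng.choice(persons)
    if rng.random() < P_VISIBILITY:
        msg["visibility"] = rng.choice(["friends", "group"])
    if not is_photo and rng.random() < P_LINK:
        msg["link"] = rng.choice(link_pool)                # the actual value is irrelevant
    return msg

if __name__ == "__main__":
    rng = random.Random(3)
    msg = {"kind": "short_comment", "content": "lol", "isLocatedIn": "RS"}
    print(enrich_message(msg, rng, persons=["person42"], link_pool=["http://example.org/a"]))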
5 Measurements
To assess the structuredness of the RDF datasets generated by the original data generator and by our modified version of it, we measured datasets of different sizes: 1, 3, 10, 30, and 100 GB. Table 2 and Table 3 present the results of the measurements: they show the number of triples (in millions) and the coherence metric for all versions of the datasets. The tables provide a good overview of the dependence of the structuredness measure on the dataset size and the number of instances, in both the original and the modified dataset.
Table 2. Coherence of the initial SNB datasets.

SF    #Triples    Coherence
1     46.9M       0.8599
3     142.6M      0.8702
10    480.8M      0.8808
30    1478.2M     0.8883
100   4804.3M     0.8943

Table 3. Coherence of the modified SNB datasets.

SF    #Triples    Coherence
1     45.4M       0.6025
3     136.5M      0.6049
10    464.1M      0.6086
30    1428.4M     0.6115
100   4645.7M     0.6139
6 Conclusion and Future Work
In this paper, we introduced modifications to the SNB data generator, to lower the coherence of the generated RDF dataset to a value of around 0.6, which corresponds better to real-world RDF datasets. We removed the location value from most posts and comments, and introduced new properties in the dataset: GIF-type comments, user mentions and links in posts/comments, as well as a visibility level for posts. We used general characteristics of real social networks, such as Twitter, Facebook and Instagram, to generate a dataset which mimics real social networks. With all changes combined, we manage to obtain an RDF dataset for the SNB with the desired structuredness, i.e. coherence value. Additionally, we introduce a set of Virtuoso procedures which can be used for calculating the coherence of any RDF dataset stored in the quad store. With this, we simplify the process of coherence calculation introduced by the authors of [3].
As future work, we plan to reduce the coverage of other types besides posts and comments. We will also address the correlations in the newly added parts of the dataset. This will not change the overall structuredness, but the dataset will further correspond to real-world RDF data. The changes and additions introduced in the dataset will also be reflected in the corresponding SNB queries.

3 https://github.com/mirkospasic/ldbc_snb_datagen
References
1. When I post something, how do I choose who can see it? https://www.facebook.com/help/120939471321735. Accessed: 2016-06-29.
2. Renzo Angles, Peter Boncz, Josep Larriba-Pey, Irini Fundulaki, Thomas Neumann, Orri Erling, Peter Neubauer, Norbert Martinez-Bazan, Venelin Kotsev, and Ioan Toma. The Linked Data Benchmark Council: A Graph and RDF Industry Benchmarking Effort. ACM SIGMOD Record, 43(1):27–31, 2014.
3. Songyun Duan, Anastasios Kementsietsidis, Kavitha Srinivas, and Octavian Udrea. Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 145–156. ACM, 2011.
4. Orri Erling, Alex Averbuch, Josep Larriba-Pey, Hassan Chafi, Andrey Gubichev, Arnau Prat, Minh-Duc Pham, and Peter Boncz. The LDBC Social Network Benchmark: Interactive Workload. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 619–630, New York, NY, USA, 2015. ACM.
5. Orri Erling and Ivan Mikhailov. Virtuoso: RDF Support in a Native RDBMS, pages 501–519. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
6. Kia Kokalitcheva. There's Now a Better Way to GIF on Twitter. http://fortune.com/2016/02/17/twitter-gif-button-finally/, 2016. Accessed: 2016-06-29.
7. Kalev Leetaru, Shaowen Wang, Guofeng Cao, Anand Padmanabhan, and Eric Shook. Mapping the Global Twitter Heartbeat: The Geography of Twitter. First Monday, 18(5), 2013.
8. Molly McHugh. You Can Finally, Actually, Really, Truly Post GIFs on Facebook. http://www.wired.com/2015/05/real-gif-posting-on-facebook/, 2015. Accessed: 2016-06-29.
9. Simply Measured. Quarterly Instagram Network Study (Q4 2014). 2014.
10. Amy Mitchell, Jocelyn Kiley, Jeffrey Gottfried, and Emily Guskin. The Role of News on Facebook. http://www.journalism.org/2013/10/24/the-role-of-news-on-facebook/, 2013. Accessed: 2016-06-29.
11. Minh-Duc Pham, Peter Boncz, and Orri Erling. S3G2: A Scalable Structure-Correlated Social Graph Generator. In Technology Conference on Performance Evaluation and Benchmarking, pages 156–172. Springer, 2012.
12. Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. The Anatomy of the Facebook Social Graph. arXiv preprint arXiv:1111.4503, 2011.
13. Jaewon Yang and Jure Leskovec. Patterns of Temporal Variation in Online Media. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 177–186. ACM, 2011.