Article

Virtuoso: RDF support in a native RDBMS

Authors:
  • Orri Erling, OpenLink Software, Burlington MA, United States
  • Ivan Mikhailov, OpenLink Software, Burlington MA, United States

Abstract

RDF (Resource Description Framework) is seeing rapidly increasing adoption, for example, in the context of the Linked Open Data (LOD) movement and diverse life sciences data publishing and integration projects. This paper discusses how we have adapted OpenLink Virtuoso, a general-purpose RDBMS, for this new type of workload. We discuss adapting Virtuoso's relational engine for native RDF support with dedicated data types, bitmap indexing and SQL optimizer techniques. We further discuss scaling out by running on a cluster of commodity servers, each with local memory and disk. We look at how this impacts query planning and execution and how we achieve high parallel utilization of multiple CPU cores on multiple servers. We present comparisons with other RDF storage models as well as other approaches to scaling out on server clusters. We present conclusions and metrics as well as a number of use cases, from DBpedia to bioinformatics and collaborative web applications.
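The bitmap indexing mentioned in the abstract can be illustrated with a minimal sketch. The following Python fragment is a hypothetical illustration of the general technique, not Virtuoso's actual data layout: it keeps one bitmap per dictionary-encoded predicate, with one bit per subject ID, so that a conjunction of triple patterns over the same subject reduces to a bitwise AND.

    # Sketch of bitmap indexing over dictionary-encoded term IDs.
    # Illustrative only; not Virtuoso's actual storage layout.
    triples = [
        ("ex:alice", "foaf:knows", "ex:bob"),
        ("ex:alice", "foaf:name", '"Alice"'),
        ("ex:bob", "foaf:name", '"Bob"'),
    ]

    ids = {}
    def encode(term):
        # Dictionary encoding: each distinct term gets a small integer ID.
        return ids.setdefault(term, len(ids))

    encoded = [(encode(s), encode(p), encode(o)) for s, p, o in triples]

    # One bitmap per predicate ID (a Python int serves as a bit vector);
    # bit i is set if the subject with ID i has that predicate.
    bitmaps = {}
    for s, p, o in encoded:
        bitmaps[p] = bitmaps.get(p, 0) | (1 << s)

    # Subjects that have both foaf:knows and foaf:name: bitwise AND.
    both = bitmaps[encode("foaf:knows")] & bitmaps[encode("foaf:name")]
    print([term for term, i in ids.items() if both & (1 << i)])  # ['ex:alice']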


... Many efforts have been made in recent years with regard to both general metadata, e.g. the Vocabulary of Interlinked Datasets (VoID) [3], the Vocabulary of a Friend (VOAF), the Data Catalog Vocabulary (DCAT) and DataID [9], and more model- and domain-specific cataloguing, e.g. the Semantic Web Applications in Neuromedicine Ontology (SWAN) by the Semantic Web Health Care and Life Sciences (HCLS) Interest Group; the Linguistic Metadata vocabulary (LIME) [28] for OntoLex; and the Meta-Share.owl ontology [57], a linked open data version of the XML-based META-SHARE [29]. ...
... In particular, the authors suggest for this element: corpus, lexical/conceptual resource, language description, and tool/service. ...
... The SPARQL language is the standard query language proposed by the W3C to query a collection of RDF triples [32]. Triple stores and RDF processing frameworks, such as Virtuoso [25], Jena [31], Eclipse RDF4J or RDFLib, usually offer a SPARQL interface. Users are able to query the triples on the Web because of the SPARQL protocol [27]: clients submit SPARQL queries through a specific HTTP interface and the server executes these queries and responds with the results. ...
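The protocol interaction described in this excerpt amounts to a single HTTP request. Below is a minimal Python sketch of a SPARQL protocol client using only the standard library; the DBpedia endpoint URL and the query are illustrative examples, not part of the cited work.

    import json
    import urllib.parse
    import urllib.request

    # Any SPARQL 1.1 protocol endpoint works the same way; DBpedia is
    # used here only as a well-known example.
    endpoint = "https://dbpedia.org/sparql"
    query = "SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/City> } LIMIT 3"

    # The query travels as an HTTP parameter; the Accept header asks the
    # server to respond in the standard SPARQL JSON results format.
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req) as resp:
        results = json.load(resp)

    for row in results["results"]["bindings"]:
        print(row["s"]["value"])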
Article
Full-text available
The need for reusable, interoperable, and interlinked linguistic resources in Natural Language Processing downstream tasks has been demonstrated by the increasing efforts to develop standards and metadata suitable to represent several layers of information. Nevertheless, despite these efforts, full compatibility for metadata in linguistic resource production is still far from being reached. Access to resources observing these standards is hindered by (i) lack of or incomplete information, (ii) inconsistent ways of coding their metadata, and (iii) lack of maintenance. In this paper, we offer a quantitative and qualitative analysis of descriptive metadata and resource availability in two main metadata repositories: LOD Cloud and Annohub. Furthermore, we introduce a metadata enrichment, which aims at improving resource information, and a metadata alignment to the META-SHARE ontology, suitable for easing the accessibility and interoperability of such resources.
... In particular, both research and industry are employing these systems to store and query knowledge bases augmented with semantic annotations [59,85]. Thus, a multitude of triplestore implementations are available, ranging from academic prototypes (e.g., RDF-3X [55], Hexastore [81]) and community projects (e.g., JENA TDB [58] and Rya [64]) to commercial products (e.g., Virtuoso [24], GraphDB [57], and Neptune [13]). RDF data differs from relational data in the complexity of its structure. ...
... Main triple data: when reusing the underlying infrastructure and technologies of relational databases, designers must define how the RDF structure is mapped into a relational structure. 3store [33], RDFLib [46], and Virtuoso [24] use a large triple table with a field for each s, p, o atom together with some auxiliary indexes. Other relational-based systems use dynamic subdivision to create a set of relational tables. ...
... <Prefix># representations provide the number of triples with a given prefix. The indexing mechanism is either B+ trees, which support range queries well, or hash tables, which are more efficient for single lookups. Some systems (e.g., Virtuoso [24], YARS2 [36]) employ block-level compression. This approach entails organizing the triples in memory blocks in a manner that allows compression using techniques such as Huffman encoding [39]. ...
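The large-triple-table mapping with auxiliary indexes described above can be sketched with any relational engine. The following Python fragment uses SQLite purely as a stand-in; the schema is a simplified illustration, not the actual schema of Virtuoso, 3store, or RDFLib.

    import sqlite3

    db = sqlite3.connect(":memory:")

    # A single triple table with one column per s, p, o atom, plus
    # auxiliary indexes covering different triple-pattern access paths.
    db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    db.execute("CREATE INDEX idx_spo ON triples (s, p, o)")
    db.execute("CREATE INDEX idx_pos ON triples (p, o, s)")

    db.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
        ("ex:alice", "foaf:knows", "ex:bob"),
        ("ex:alice", "foaf:name", '"Alice"'),
    ])

    # A pattern with bound p and o becomes a range scan on idx_pos.
    print(db.execute("SELECT s FROM triples WHERE p = ? AND o = ?",
                     ("foaf:knows", "ex:bob")).fetchall())  # [('ex:alice',)]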
Article
Full-text available
RDF triplestores' ability to store and query knowledge bases augmented with semantic annotations has attracted the attention of both research and industry. A multitude of systems offer varying data representation and indexing schemes. However, as recently shown for designing data structures, many design choices are biased by outdated considerations and may not result in the most efficient data representation for a given query workload. To overcome this limitation, we identify a novel three-dimensional design space. Within this design space, we map the trade-offs between different RDF data representations employed as part of an RDF triplestore and identify unexplored solutions. We complement the review with an empirical evaluation of ten standard SPARQL benchmarks to examine the prevalence of these access patterns in synthetic and real query workloads. We find some access patterns to be both prevalent in the workloads and under-supported by existing triplestores. This shows the capabilities of our model to be used by RDF store designers to reason about different design choices and allow a (possibly artificially intelligent) designer to evaluate the fit between a given system design and a query workload.
... This allows for quickly retrieving the full triples that match (e.g.) a given triple pattern. However, some RDF engines based on relational storage (e.g., Virtuoso [36]) rather use (or provide an option for) column-wise storage, where the values along a column are stored contiguously, often following a particular order. Such column-wise storage allows for better compression, and for quickly reading many values from a single column. ...
... Triple tables can be straightforwardly extended to quad tables in order to support SPARQL datasets [36,49]. ...
... Vertical partitioning can be used to store quads by adding a Graph column to each table [36,49]. ...
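The quad-table and vertical-partitioning schemes sketched in the excerpts above can be made concrete in a few lines. The following SQLite-based Python sketch is a hypothetical illustration of vertical partitioning with an added Graph column, not the schema of any particular engine:

    import sqlite3

    db = sqlite3.connect(":memory:")

    # Vertical partitioning: one (subject, object) table per property,
    # each extended with a graph column so quads can be stored.
    for prop in ("foaf_knows", "foaf_name"):
        db.execute(f"CREATE TABLE {prop} (s TEXT, o TEXT, g TEXT)")

    db.execute("INSERT INTO foaf_knows VALUES ('ex:alice', 'ex:bob', 'ex:g1')")
    db.execute("""INSERT INTO foaf_name VALUES ('ex:alice', '"Alice"', 'ex:g1')""")

    # A triple pattern with a fixed property touches exactly one,
    # typically small, table instead of one large triple table.
    print(db.execute("SELECT s, o FROM foaf_knows WHERE g = 'ex:g1'").fetchall())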
... This allows for quickly retrieving the full triples that match (e.g.) a given triple pattern. However, some RDF engines based on relational storage (e.g., Virtuoso [69]) rather use (or provide an option for) column-wise storage, where the values along a column are stored contiguously, often following a particular order. Such column-wise storage allows for better compression, and for quickly reading many values from a single column. ...
... Triple tables can be straightforwardly extended to quad tables in order to support SPARQL datasets [69,91]. ...
... Also, RDF graphs may have thousands of properties [233], which may lead to a schema with many relations. Vertical partitioning can be used to store quads by adding a Graph column to each table [69,91]. ...
Article
Full-text available
RDF has seen increased adoption in recent years, prompting the standardization of the SPARQL query language for RDF, and the development of local and distributed engines for processing SPARQL queries. This survey paper provides a comprehensive review of techniques and systems for querying RDF knowledge graphs. While other reviews on this topic tend to focus on the distributed setting, the main focus of the work is on providing a comprehensive survey of state-of-the-art storage, indexing and query processing techniques for efficiently evaluating SPARQL queries in a local setting (on one machine). To keep the survey self-contained, we also provide a short discussion on graph partitioning techniques used in the distributed setting. We conclude by discussing contemporary research challenges for further improving SPARQL query engines. This extended version also provides a survey of over one hundred SPARQL query engines and the techniques they use, along with twelve benchmarks and their features.
... This allows for quickly retrieving the full triples that match (e.g.) a given triple pattern. However, some RDF engines based on relational storage (e.g., Virtuoso [54]) rather use (or provide an option for) column-wise storage, where the values along a column are stored contiguously, often following a particular order. Such column-wise storage allows for better compression, and thus for caching more data in memory; it offers performance benefits when queries require reading many values from a particular column (e.g., for aggregations) but may be slower when queries need to match and retrieve entire triples. ...
... Such column-wise storage allows for better compression, and thus for caching more data in memory; it offers performance benefits when queries require reading many values from a particular column (e.g., for aggregations) but may be slower when queries need to match and retrieve entire triples. Triple tables can be straightforwardly extended to quad tables in order to support SPARQL datasets, as is done in a variety of engines [54,75]. ...
... Vertical partitioning can be used to store quads by adding a Graph column to each table [54,75]. ...
Preprint
Full-text available
Recent years have seen the growing adoption of non-relational data models for representing diverse, incomplete data. Among these, the RDF graph-based data model has seen ever-broadening adoption, particularly on the Web. This adoption has prompted the standardization of the SPARQL query language for RDF, as well as the development of a variety of local and distributed engines for processing queries over RDF graphs. These engines implement a diverse range of specialized techniques for storage, indexing, and query processing. A number of benchmarks, based on both synthetic and real-world data, have also emerged to allow for contrasting the performance of different query engines, often at large scale. This survey paper draws together these developments, providing a comprehensive review of the techniques, engines and benchmarks for querying RDF knowledge graphs.
... This allows for quickly retrieving the full triples that match (e.g.) a given triple pattern. However, some RDF engines based on relational storage (e.g., Virtuoso [54]) rather use (or provide an option for) column-wise storage, where the values along a column are stored contiguously, often following a particular order. Such column-wise storage allows for better compression, and thus for caching more data in memory; it offers performance benefits when queries require reading many values from a particular column (e.g., for aggregations) but may be slower when queries need to match and retrieve entire triples. ...
... Such column-wise storage allows for better compression, and thus for caching more data in memory; it offers performance benefits when queries require reading many values from a particular column (e.g., for aggregations) but may be slower when queries need to match and retrieve entire triples. Triple tables can be straightforwardly extended to quad tables in order to support SPARQL datasets, as is done in a variety of engines [54,75]. ...
... Vertical partitioning can be used to store quads by adding a Graph column to each table [54,75]. ...
Preprint
Full-text available
Recent years have seen the growing adoption of non-relational data models for representing diverse, incomplete data. Among these, the RDF graph-based data model has seen ever-broadening adoption, particularly on the Web. This adoption has prompted the standardization of the SPARQL query language for RDF, as well as the development of a variety of local and distributed engines for processing queries over RDF graphs. These engines implement a diverse range of specialized techniques for storage, indexing, and query processing. A number of benchmarks, based on both synthetic and real-world data, have also emerged to allow for contrasting the performance of different query engines, often at large scale. This survey paper draws together these developments, providing a comprehensive review of the techniques, engines and benchmarks for querying RDF knowledge graphs.
... Furthermore, using these logical fragments as physical structures could significantly improve the performance of centralized and distributed systems. The work of Pham et al. [PPEB15] showed that explicitly storing the data using an automatically discovered relational schema boosts the performance of Virtuoso [EM09], a relational-based triple store. Organizing the data into forward and backward graph fragments, regardless of the structure used to persist the data (e.g., tables, indexes or adjacency lists), avoids scanning the whole dataset many times for a single query, as is done by most of the systems storing the entire RDF graph in a single data structure. ...
... The loading module of RDF_QDAG is able to partition very large RDF graphs even when the entire dataset does not fit in the available main memory. The compared systems are: i) Virtuoso [EM09], ii) RDF-3X [NW08], iii) gStore [ZÖC+14] and iv) CliqueSquare [GKM+15] on a single node (henceforth CliqueSquareS). We evaluated them using the real and synthetic datasets described below. ...
... (i) Virtuoso 7 [EM09]: the relational-based system par excellence, storing the data in a triple table. ...
Thesis
The Resource Description Framework (RDF) and SPARQL are very popular graph-based standards initially designed to represent and query information on the Web. The flexibility offered by RDF motivated its use in other domains, and today RDF datasets are great information sources. They gather billions of triples in Knowledge Graphs that must be stored and efficiently exploited. The first generation of RDF systems was built on top of traditional relational databases. Unfortunately, the performance of these systems degrades rapidly as the relational model is not suitable for handling RDF data inherently represented as a graph. Native and distributed RDF systems seek to overcome this limitation. The former mainly use indexing as an optimization strategy to speed up queries. Distributed and parallel RDF systems resort to data partitioning. The logical representation of the database is crucial to the design of data partitions in the relational model. The logical layer defining the explicit schema of the database provides a degree of comfort to database designers. It lets them choose manually or automatically (through advisors) the tables and attributes to be partitioned. Besides, it allows the core partitioning concepts to remain constant regardless of the database management system. This design scheme is no longer valid for RDF databases, essentially because the RDF model does not explicitly enforce a schema, since RDF data is mostly implicitly structured. Thus, the logical layer is nonexistent and data partitioning depends strongly on the physical implementation of the triples on disk. This situation contributes to different partitioning logics depending on the target system, which is quite different from the relational model's perspective. In this thesis, we promote the novel idea of performing data partitioning at the logical level in RDF databases. Thereby, we first process the RDF data graph to support logical entity-based partitioning. After this preparation, we present a partitioning framework built upon these logical structures. This framework is accompanied by data fragmentation, allocation, and distribution procedures. The framework was incorporated into a centralized (RDF_QDAG) and a distributed (gStoreD) triple store. We conducted several experiments that confirmed the feasibility of integrating our framework into existing systems, improving their performance for certain queries. Finally, we designed a set of RDF data partitioning management tools, including a data definition language (DDL) and an automatic partitioning wizard.
... This allows for quickly retrieving the full triples that match (e.g.) a given triple pattern. However, some RDF engines based on relational storage (e.g., Virtuoso [54]) rather use (or provide an option for) column-wise storage, where the values along a column are stored contiguously, often following a particular order. Such column-wise storage allows for better compression, and thus for caching more data in memory; it offers performance benefits when queries require reading many values from a particular column (e.g., for aggregations) but may be slower when queries need to match and retrieve entire triples. ...
... Such column-wise storage allows for better compression, and thus for caching more data in memory; it offers performance benefits when queries require reading many values from a particular column (e.g., for aggregations) but may be slower when queries need to match and retrieve entire triples. Triple tables can be straightforwardly extended to quad tables in order to support SPARQL datasets, as is done in a variety of engines [54,75]. ...
... Vertical partitioning can be used to store quads by adding a Graph column to each table [54,75]. ...
Preprint
Full-text available
Recent years have seen the growing adoption of non-relational data models for representing diverse, incomplete data. Among these, the RDF graph-based data model has seen ever-broadening adoption, particularly on the Web. This adoption has prompted the standardization of the SPARQL query language for RDF, as well as the development of a variety of local and distributed engines for processing queries over RDF graphs. These engines implement a diverse range of specialized techniques for storage, indexing, and query processing. A number of benchmarks, based on both synthetic and real-world data, have also emerged to allow for contrasting the performance of different query engines, often at large scale. This survey paper draws together these developments, providing a comprehensive review of the techniques, engines and benchmarks for querying RDF knowledge graphs.
... For simplicity, let us assume all the triples given in Table 3 belong to the named graph G1 (for RDF named graphs, see https://www.w3.org/TR/rdf11-concepts/#section-dataset); the quad representation of the given TT is then shown in Table 5. The quad representation has been used in many well-known RDF engines such as Virtuoso [30] and 4store [43]. Please note that, as per the SPARQL specification, a SPARQL query may specify the RDF graph to be used for matching by using the FROM clause and the FROM NAMED clause to describe the RDF dataset. ...
... (See https://blazegraph.com/.) [flattened table of centralized and distributed RDF engines omitted] The default approach uses static analysis and fast cardinality estimation of access paths; the second approach uses runtime sampling of join graphs. To remove the scaling limit, Blazegraph employs dynamically partitioned key-range shards. ...
... Virtuoso [30] stores RDF data as a single table with four columns: Subject (s), Predicate (p), Object (o), and Graph (g), where s, p and g are IRIs, and o may be of any data type. Virtuoso makes use of dictionary encoding for the IRIs and literals used in the triples. ...
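The quad table plus dictionary encoding described in this excerpt is easy to sketch. The following Python fragment illustrates the general technique only; it is not Virtuoso's actual encoding scheme:

    # Dictionary encoding: map each IRI/literal to a compact integer ID
    # and store quads as 4-tuples of IDs.
    term2id, id2term = {}, []

    def encode(term):
        if term not in term2id:
            term2id[term] = len(id2term)
            id2term.append(term)
        return term2id[term]

    quads = []
    def add_quad(s, p, o, g):
        quads.append(tuple(encode(t) for t in (s, p, o, g)))

    add_quad("ex:alice", "rdf:type", "foaf:Person", "ex:g1")
    add_quad("ex:alice", "foaf:name", '"Alice"', "ex:g1")

    print(quads)                            # [(0, 1, 2, 3), (0, 4, 5, 3)]
    print([id2term[i] for i in quads[0]])   # decode IDs back to terms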
Preprint
Version 2 of the paper "Storage, Indexing, Query Processing, and Benchmarking in Centralized and Distributed RDF Engines: A Survey".
... For simplicity, let us assume all the triples given in Table 3 belong to the named graph G1 (for RDF named graphs, see https://www.w3.org/TR/rdf11-concepts/#section-dataset); the quad representation of the given TT is then shown in Table 5. The quad representation has been used in many well-known RDF engines such as Virtuoso [30] and 4store [43]. Please note that, as per the SPARQL specification, a SPARQL query may specify the RDF graph to be used for matching by using the FROM clause and the FROM NAMED clause to describe the RDF dataset. ...
... (See https://blazegraph.com/.) [flattened table of centralized and distributed RDF engines omitted] The default approach uses static analysis and fast cardinality estimation of access paths; the second approach uses runtime sampling of join graphs. To remove the scaling limit, Blazegraph employs dynamically partitioned key-range shards. ...
... Virtuoso [30] stores RDF data as a single table with four columns: Subject (s), Predicate (p), Object (o), and Graph (g), where s, p and g are IRIs, and o may be of any data type. Virtuoso makes use of dictionary encoding for the IRIs and literals used in the triples. ...
Preprint
The recent advancements of the Semantic Web and Linked Data have changed the working of the traditional web. There is significant adoption of the Resource Description Framework (RDF) format for saving web-based data. This massive adoption has paved the way for the development of various centralized and distributed RDF processing engines. These engines employ various mechanisms to implement critical components of the query processing engines such as data storage, indexing, language support, and query execution. All these components govern how queries are executed and can have a substantial effect on the query runtime. For example, the storage of RDF data in various ways significantly affects the data storage space required and the query runtime performance. The type of indexing approach used in RDF engines is critical for fast data lookup. The type of the underlying querying language (e.g., SPARQL or SQL) used for query execution is a crucial optimization component of the RDF storage solutions. Finally, query execution involving different join orders significantly affects the query response time. This paper provides a comprehensive review of centralized and distributed RDF engines in terms of storage, indexing, language support, and query execution.
... For simplicity, let us assume all the triples given in Table 3 belong to the named graph G1 (for RDF named graphs, see https://www.w3.org/TR/rdf11-concepts/#section-dataset); the quad representation of the given TT is then shown in Table 5. The quad representation has been used in many well-known RDF engines such as Virtuoso [30] and 4store [43]. Please note that, as per the SPARQL specification, a SPARQL query may specify the RDF graph to be used for matching by using the FROM clause and the FROM NAMED clause to describe the RDF dataset. ...
... (See https://blazegraph.com/.) [flattened table of centralized and distributed RDF engines omitted] Based on the type of triple, Bigdata creates three or six key-range partitioned B+ tree indexes. ...
... Virtuoso [30] stores RDF data as a single table with four columns: Subject (s), Predicate (p), Object (o), and Graph (g), where s, p and g are IRIs, and o may be of any data type. Virtuoso makes use of dictionary encoding for the IRIs and literals used in the triples. ...
Preprint
Full-text available
The recent advancements of the Semantic Web and Linked Data have changed the working of the traditional web. There is huge adoption of the Resource Description Framework (RDF) format for saving web-based data. This massive adoption has paved the way for the development of various centralized and distributed RDF processing engines. These engines employ different mechanisms to implement key components of the query processing engines such as data storage, indexing, language support, and query execution. All these components govern how queries are executed and can have a substantial effect on the query runtime. For example, the storage of RDF data in various ways significantly affects the data storage space required and the query runtime performance. The type of indexing approach used in RDF engines is key for fast data lookup. The type of the underlying querying language (e.g., SPARQL or SQL) used for query execution is a key optimization component of the RDF storage solutions. Finally, query execution involving different join orders significantly affects the query response time. This paper provides a comprehensive review of centralized and distributed RDF engines in terms of storage, indexing, language support, and query execution.
... These partitioning approaches cannot adapt to the dynamic nature of RDF streams. The main approaches to partitioning static RDF data are hash-oriented [25,26,27,28,29,30,31,32] or graph-oriented [33,34,35,36,37,38,39]. However, these approaches remain impractical for RDF streams and are almost impossible to apply in a setting where static and dynamic data are joined. ...
... If p = 1, as in the case of the transition from node 8 to node 7 of the graph G (Figure 5.3), the value of p is reduced to 0.99999999999 to prevent P(S_n = k) from being zero. Let SN be the set of subject nodes of the graph G and LN the set of leaf nodes of the graph G such that SN ∪ LN = L, the set of nodes of the graph G. The probability P(s) that the information leaves the subject node s and reaches all the leaf nodes of the graph G is defined as follows: [garbled algorithm listing omitted] ...
... Spark is a MapReduce-style cluster computing framework that provides a fault-tolerant collection of elements that can be operated on in parallel, called an RDD (Resilient Distributed Dataset) [26]. An RDD is divided into several partitions across the different nodes of a cluster, so that operations can be performed in parallel. ...
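The RDD model summarized in the excerpt above can be illustrated with a few lines of PySpark. This is a minimal sketch assuming a local Spark installation; the application name and slice count are arbitrary:

    from pyspark import SparkContext

    # A local "cluster" with 4 worker threads; in a real deployment the
    # partitions would be spread across the nodes of the cluster.
    sc = SparkContext("local[4]", "rdd-sketch")

    # The RDD is split into 8 partitions, so the filter and the count
    # run in parallel and tolerate the loss of individual partitions.
    rdd = sc.parallelize(range(1_000_000), numSlices=8)
    print(rdd.filter(lambda x: x % 2 == 0).count())  # 500000

    sc.stop()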
Thesis
Our daily use of the Internet and related technologies continuously generates, at high and variable rates, large amounts of heterogeneous data from sensor networks, logs of generic or specialized search engines, data from multimedia content sites, measurements from weather stations, geolocation, IoT (Internet of Things) applications, and so on. Processing such data in conventional databases (relational database management systems) can be very costly in time and memory resources. To respond effectively to needs and to support decision-making, these information flows require real-time processing. Data Stream Management Systems (DSMSs) pose and evaluate queries over the recent data of a stream within structures called windows. The input data of DSMSs arrive in various raw formats such as CSV, XML, RSS, or JSON. This heterogeneity bottleneck stems from the nature of data streams and must be overcome. To this end, several research groups have taken advantage of Semantic Web technologies (RDF and SPARQL) by proposing RDF data stream processing systems called RSPs. However, data volume, high input rates, concurrent queries, joins between RDF streams and large volumes of stored data, and expensive computations considerably degrade the performance of these systems. A new approach for reducing the processing load of RDF data streams is needed. In this thesis, we propose several solutions for reducing the stream processing load in a centralized setting. An on-the-fly sampling approach for RDF graph streams is proposed to reduce the data and stream processing load while preserving semantic links. This approach is deepened by adopting a graph-oriented summarization method to extract from RDF graphs the most relevant information, using centrality measures from Social Network Analysis. We also adopt a compressed RDF data format and propose an approach for querying compressed RDF data without a decompression phase. To ensure parallel and distributed stream management, this work proposes two additional solutions for reducing the processing load in a distributed setting: a parallel and distributed RDF graph stream processing engine, and an optimized approach for join operations between static and dynamic data.
... There has recently been some research work on subgraph matching queries over RDF data in a distributed environment. One category of methods is based on the relational schema [8,11,14,19,22], in which RDF data are modeled as a set of triples and stored in relational tables or a variant relational schema. None of these methods considers the inherent graph-like structure of RDF data. ...
... The two methods mentioned above do not employ any structural information of query graphs; thus a large number of join operations may incur expensive costs. Furthermore, Virtuoso [8], supporting RDF in a native RDBMS, also models RDF data as a set of triples. TriAD [11], using a custom MPI protocol, employs six SPO permutation indexes, partitions RDF triples into those indexes, and uses a locality-based summary graph to speed up queries. ...
... Similarly, after obtaining the vertex r with respect to the h value, a new star is generated (lines 10-11). This process (lines 8-11) repeats until the set Q is empty. ...
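The six SPO permutation indexes mentioned above (used by TriAD, and similarly by RDF-3X and Hexastore) can be sketched in memory as follows; a dict of sorted lists stands in for what a real engine would keep in B+ trees, so this is an illustration of the idea, not any engine's implementation:

    from itertools import permutations

    triples = [
        ("ex:alice", "foaf:knows", "ex:bob"),
        ("ex:bob", "foaf:knows", "ex:carol"),
    ]

    # One sorted index per permutation of (S, P, O): SPO, SOP, PSO,
    # POS, OSP, OPS. Sorted order lets any triple pattern with a bound
    # prefix be answered by a range scan.
    indexes = {}
    for perm in permutations((0, 1, 2)):
        key = "".join("SPO"[i] for i in perm)
        indexes[key] = sorted(tuple(t[i] for i in perm) for t in triples)

    # Pattern (?s, foaf:knows, ex:bob): the POS index groups entries by
    # predicate, then object, so the matches are contiguous.
    hits = [e for e in indexes["POS"]
            if e[0] == "foaf:knows" and e[1] == "ex:bob"]
    print(hits)  # [('foaf:knows', 'ex:bob', 'ex:alice')]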
Article
Full-text available
With the popularity of knowledge graphs growing rapidly, large amounts of RDF graphs have been released, which raises the need for addressing the challenge of distributed subgraph matching queries. In this paper, we propose an efficient distributed method to answer subgraph matching queries on big RDF graphs using MapReduce. In our method, query graphs are decomposed into a set of stars that utilize the semantic and structural information embedded in RDF graphs as heuristics. Two optimization techniques are proposed to further improve the efficiency of our algorithms. One, called RDF property filtering, filters out invalid input data to reduce intermediate results; the other improves query performance by postponing Cartesian product operations. Extensive experiments on both synthetic and real-world datasets show that our method outperforms the close competitors S2X and SHARD by an order of magnitude on average.
... Rather than partitioning data, data can also be replicated across partitions. This may vary from replicating the full graph on each machine, such that queries can be answered in full by any machine, increasing query throughput (used, e.g., by DREAM [21]), to replicating partitions that are in high demand (e.g., containing schema data, central nodes, etc.) so that more queries can be evaluated on individual machines and/or machines have equal workloads that avoid hot-spots (used, e.g., by Blazegraph [47] and Virtuoso [14]). [Figure 2: Example of optimal k-way partitioning (k = 4)] ...
... The architecture will use the Virtuoso Universal Server as an RDF storage solution. Virtuoso is a high-performance virtual database engine which provides transparent access to existing data sources, which are typically databases from different database vendors, or RDF knowledge graphs [13,14,12]. It supports data access via the standard SPARQL protocol, but also offers drivers for the Jena, Sesame, and Redland frameworks. ...
... These syntaxes are generally used for storing and exchanging RDF data; however, we address here the software storage solutions known from collaborative sharing and modification approaches [51][52][53]. ...
... The main feature of Sesame is that it provides query languages, including a subset of RQL which incorporates the RDF Schema semantics. The concrete data storage is implemented differently according to the underlying database system [52]: PostgreSQL and MySQL. Virtuoso [53] is a multi-protocol server providing access to relational data stored either within Virtuoso itself or in any combination of external relational databases. Virtuoso's initial storage solution is fairly conventional: a single table of four columns holds one quad, i.e. triple plus graph, per row. ...
Thesis
Access to the Web of Data is nowadays of real interest for research, mainly in the sense that the clients consuming or processing this data are more and more numerous and have various specificities (mobility, Internet, autonomy, storage, etc.). Tools such as Web applications, search engines, e-learning platforms, etc., exploit the Web of Data to offer end-users services that contribute to the improvement of daily activities. In this context, we are working on Web of Data access, considering constraints such as client mobility and intermittent availability of the Internet connection. We are interested in mobility as our work is oriented towards end-users with mobile devices such as smartphones, tablets, laptops, etc. The intermittency of the Internet connection refers herein to scenarios of unavailability of national or international links that make remote data sources inaccessible. We target a scenario where users form a peer-to-peer network such that anyone can generate information and make it available to everyone else on the network. Thus, we survey and compare several solutions (models, architectures, etc.) dedicated to Web of Data access by mobile contributors, discussed in relation to the underlying network architectures and data models considered. We present a conceptual study of peer-to-peer solutions based on gossip protocols dedicated to designing connected overlay networks, and present a detailed analysis of data replication systems whose general objective is to ensure a system's local data availability. On the basis of this work, we propose an architecture adapted to constrained environments, allowing mobile contributors to share an RDF dataset locally via a browser network. The architecture consists of 3 levels: single peers, super peers and remote sources. Two main axes are considered for the implementation of this architecture: first, the construction and maintenance of connectivity ensured by the gossip protocol, and second, the high availability of data ensured by a replication mechanism. Our approach has the particularity of considering the location of each participant's neighbours to increase the search perimeter, and of integrating super-peers on which the data graph is replicated, improving data availability. We finally carried out an experimental evaluation of our architecture through extensive simulation configured to capture key aspects of our motivating scenario of supporting data exchange between the participants of a local event.
... The SPARQL protocol [21] standardizes this interaction: clients send SPARQL queries through a specific HTTP interface, and the server attempts to execute these queries and responds with their results. Many triple stores, such as Virtuoso [22] and Jena TDB [23], offer a SPARQL interface. Exposing such a query interface on the public Web contrasts with most other Web APIs, whose main purpose is to expose a less powerful interface than the underlying database. ...
... and 7.1.1) [22] and Jena Fuseki [23] (TDB 1.0.1 and HDT 1.1.1), respectively; and for the TPF server we use an HDT [35] backend. ...
... Rather than partitioning data, data can also be replicated across partitions. This may vary from replicating the full graph on each machine, such that queries can be answered in full by any machine, increasing query throughput (used, e.g., by DREAM [21]), to replicating partitions that are in high demand (e.g., containing schema data, central nodes, etc.) so that more queries can be evaluated on individual machines and/or machines have equal workloads that avoid hot-spots (used, e.g., by Blazegraph [47] and Virtuoso [14]). [Figure 2: Example of optimal k-way partitioning (k = 4)] ...
... The architecture will use the Virtuoso Universal Server as an RDF storage solution. Virtuoso is a high-performance virtual database engine which provides transparent access to existing data sources, which are typically databases from different database vendors, or RDF knowledge graphs [13,14,12]. It supports data access via the standard SPARQL protocol, but also offers drivers for the Jena, Sesame, and Redland frameworks. ...
... The RDF data storage and SPARQL query execution is performed by the backend triple store. For example, the DBpedia SPARQL endpoint is powered by the Virtuoso [11] triple store. The Wikidata endpoint works on top of the BlazeGraph triple store. ...
... Furthermore, Verborgh et al. [28,17] perform evaluations based on average workload completion time for 50 clients and compare their system with brTPF [14], TPF [28] and Virtuoso [11]. Finally, Azzam et al. [4] compare SMART-KG with TPF, Virtuoso and SAGE using the performance metrics number of timeouts, execution time and resource consumption. ...
Chapter
Full-text available
With significant growth in RDF datasets, application developers demand online availability of these datasets to meet the end users' expectations. Various interfaces are available for querying RDF data using the SPARQL query language. Studies show that SPARQL endpoints may provide high query runtime performance at the cost of low availability. For example, it has been observed that only 32.2% of public endpoints have a monthly uptime of 99–100%. One possible reason for this low availability is the high workload experienced by these SPARQL endpoints. As complete query execution is performed on the server side (i.e., the SPARQL endpoint), this high query processing workload may result in performance degradation or even a service shutdown. We performed extensive experiments to show the query processing capabilities of well-known triple stores by using their SPARQL endpoints. In particular, we stressed these triple stores with multiple parallel requests from different querying agents. Our experiments revealed the maximum query processing capabilities of these triple stores, after which they lead to service shutdowns. We hope this analysis will help triple store developers to design workload-aware RDF engines to improve the availability of their public endpoints with high throughput.
... Therefore, we generate path queries based on frequent query patterns from real-world query logs. Competitors: at the time of writing, we are not aware of any previous approach available for comparison except Apache Jena [1] and Virtuoso [20]. Apache Jena is a widely used graph database to store and query RDF data. ...
... Then an α-RA expression tree is generated to guide the query processing. Virtuoso [20] is a representative RDF system as well as a relational database system, deploying an "α" operator to support property path queries. However, this α-RA method cannot compute transitive closures in a pipelined fashion. ...
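For context, a SPARQL 1.1 property path such as foaf:knows+ asks for the transitive closure of an edge relation. The sketch below evaluates such a path with a simple fixpoint iteration over an in-memory adjacency list; the data and the helper name are illustrative, and real engines (including Virtuoso's α operator) implement this inside the query executor:

    # SPARQL 1.1 property path  ex:alice foaf:knows+ ?x
    # evaluated as a breadth-first fixpoint over an adjacency list.
    edges = {
        "ex:alice": ["ex:bob"],
        "ex:bob": ["ex:carol"],
        "ex:carol": [],
    }

    def knows_plus(start):
        seen, frontier = set(), [start]
        while frontier:  # iterate until no new nodes are reached
            nxt = []
            for node in frontier:
                for target in edges.get(node, []):
                    if target not in seen:
                        seen.add(target)
                        nxt.append(target)
            frontier = nxt
        return seen

    print(knows_plus("ex:alice"))  # {'ex:bob', 'ex:carol'}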
Article
Full-text available
SPARQL 1.1 offers a type of navigational query for RDF systems, called the regular path query (RPQ). A regular path query allows for retrieving node pairs with the paths between them satisfying regular expressions. Regular path queries are difficult to evaluate efficiently because of the possibly large search space, and so far there has been no scalable and practical solution. In this paper, we present Leon+, an in-memory distributed framework, to address the RPQ problem in the context of knowledge graphs. To reduce the search space and mitigate mounting communication costs, Leon+ takes advantage of join-ahead pruning via a novel RDF summarization technique together with a path partitioning strategy. We also develop a subtle cost model to devise query plans that achieve high efficiency for complex RPQs. As there has been no available RPQ benchmark, we create micro-benchmarks on both synthetic and real-world datasets. A thorough experimental evaluation is presented between our approach and state-of-the-art RDF stores. The results show that our approach runs 5x faster than the competitors on single RPQs. For query workloads, it saves up to 1/2 of the time and 2/3 of the communication overhead over the baseline method.
... For simplicity, let us assume all the triples given in Table 3 belong to the named graph G1; the quad representation of the given TT is then shown in Table 5. The quad representation has been used in many well-known RDF engines such as Virtuoso [30] and 4store [43]. Please note that, as per the SPARQL specification, a SPARQL query may specify the RDF graph to be used for matching by using the FROM clause and the FROM NAMED clause to describe the RDF dataset. ...
... Virtuoso [30] stores RDF data as a single table with four columns: Subject (s), Predicate (p), Object (o), and Graph (g), where s, p and g are IRIs, and o may be of any data type. Virtuoso makes use of dictionary encoding for the IRIs and literals used in the triples. ...
Preprint
Full-text available
This paper is all about the different constructs of RDF data processing: it covers storage, indexing, query language, and query planning.
... For simplicity, let us assume all the triples given in Table 3 belong to the named graph G1; the quad representation of the given TT is then shown in Table 5. The quad representation has been used in many well-known RDF engines such as Virtuoso [34] and 4store [35]. Please note that, as per the SPARQL specification, a SPARQL query may specify the RDF graph to be used for matching by using the FROM clause and the FROM NAMED clause to describe the RDF dataset. ...
... Virtuoso [34] stores RDF data as a single table with four columns: Subject (s), Predicate (p), Object (o), and Graph (g), where s, p and g are IRIs, and o may be of any data type. Virtuoso makes use of dictionary encoding for the IRIs and literals used in the triples. ...
Preprint
The recent advancements of the Semantic Web and Linked Data have changed the working of the traditional web. There is huge adoption of the Resource Description Framework (RDF) format for saving web-based data. This massive adoption has paved the way for the development of various centralized and distributed RDF processing engines. These engines employ different mechanisms to implement key components of the query processing engines such as data storage, indexing, language support, and query execution. All these components govern how queries are executed and can have a substantial effect on the query runtime. For example, the storage of RDF data in various ways significantly affects the data storage space required and the query runtime performance. The type of indexing approach used in RDF engines is key for fast data lookup. The type of the underlying querying language (e.g., SPARQL or SQL) used for query execution is a key optimization component of the RDF storage solutions. Finally, query execution involving different join orders significantly affects the query response time. This paper provides a comprehensive review of centralized and distributed RDF engines in terms of storage, indexing, language support, and query execution.
... For simplicity, let us assume all the triples given in Table 3 belong to the named graph G1; the quad representation of the given TT is then shown in Table 5. The quad representation has been used in many well-known RDF engines such as Virtuoso [34] and 4store [35]. Please note that, as per the SPARQL specification, a SPARQL query may specify the RDF graph to be used for matching by using the FROM clause and the FROM NAMED clause to describe the RDF dataset. ...
... Virtuoso [34] stores RDF data as a single table with four columns: Subject (s), Predicate (p), Object (o), and Graph (g), where s, p and g are IRIs, and o may be of any data type. Virtuoso makes use of dictionary encoding for the IRIs and literals used in the triples. ...
Preprint
Full-text available
The recent advancements of the Semantic Web and Linked Data have changed the working of the traditional web. There is huge adoption of the Resource Description Framework (RDF) format for saving web-based data. This massive adoption has paved the way for the development of various centralized and distributed RDF processing engines. These engines employ different mechanisms to implement key components of the query processing engines such as data storage, indexing, language support, and query execution. All these components govern how queries are executed and can have a substantial effect on the query runtime. For example, the storage of RDF data in various ways significantly affects the data storage space required and the query runtime performance. The type of indexing approach used in RDF engines is key for fast data lookup. The type of the underlying querying language (e.g., SPARQL or SQL) used for query execution is a key optimization component of the RDF storage solutions. Finally, query execution involving different join orders significantly affects the query response time. This paper provides a comprehensive review of centralized and distributed RDF engines in terms of storage, indexing, language support, and query execution.
... The client then takes care of all other query processing tasks, such as joining, filtering, grouping, query optimization and decomposition, and sending triple pattern requests to the servers. The TPF interface has been evaluated against SPARQL endpoints based on Jena Fuseki and Virtuoso [6], using an instance of the Berlin SPARQL Benchmark (BSBM) dataset [3] that contains 100 million triples. The experiments show that the CPU load on the server is lower and the CPU load on the client is higher for TPF interfaces compared to SPARQL endpoints. ...
... We extended the Java TPF server with support for brTPF requests and additional SPARQL-based backends such as Virtuoso endpoints, and use that as the server component for TPF and brTPF. ...
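A triple pattern request against a TPF server, as described in these excerpts, is a plain HTTP call with the pattern's components as parameters. The sketch below follows the TPF interface in spirit; the fragment base URL is a placeholder, not a real service:

    import urllib.parse
    import urllib.request

    # A TPF server exposes fragments selected by subject/predicate/object
    # parameters and returns matching triples plus paging metadata.
    fragment_base = "http://example.org/fragments/dataset"  # placeholder
    params = {"predicate": "http://xmlns.com/foaf/0.1/knows"}

    url = fragment_base + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"Accept": "text/turtle"})
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))  # matching triples, in Turtle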
Preprint
The adoption of Semantic Web technologies, and in particular the Open Data initiative, has contributed to the steady growth of the number of datasets and triples accessible on the Web. Most commonly, queries over RDF data are evaluated over SPARQL endpoints. Recently, however, alternatives such as TPF have been proposed with the goal of shifting query processing load from the server running the SPARQL endpoint towards the client that issued the query. Although these interfaces have been evaluated against standard benchmarks and testbeds that showed their benefits over previous work in general, a fine-granular evaluation of what types of queries exploit the strengths of the different available interfaces has never been done. In this paper, we present the results of our in-depth evaluation of existing RDF interfaces. In addition, we also examine the influence of the backend on the performance of these interfaces. Using representative and diverse query loads based on the query log of a public SPARQL endpoint, we stress test the different interfaces and backends and identify their strengths and weaknesses.
... Other projects (e.g. Virtuoso Sponger [18], Bio2RDF [19]) resort to GitHub repositories for keeping LDW code. This LDW code can be cloned, then upgraded in a different GitHub branch, and finally a pull request can be sent to modify the master distribution. ...
... The evaluation focuses on three of the quality-in-use model characteristics proposed by the ISO/IEC 25010 standard [55], namely: ...
Article
Linked Data Wrappers (LDWs) turn Web APIs into RDF endpoints, leveraging the Linked Open Data cloud with current data. Unfortunately, LDWs are fragile upon upgrades on the underlying APIs, compromising LDW stability. Hence, for API-based LDWs to become a sustainable foundation for the Web of Data, we should recognize LDW maintenance as a continuous effort that outlives their breakout projects. This is not new in Software Engineering. Other projects in the past faced similar issues. The strategy: becoming open source and turning towards dedicated platforms. By making LDWs open, we permit others not only to inspect (hence, increasing trust and consumption), but also to maintain (to cope with API upgrades) and reuse (to adapt for their own purposes). Promoting consumption, adaptation and reuse might all help to increase the user base, and in so doing, might provide the critical mass of volunteers that current LDW projects lack. Drawing upon the Helping Theory, we investigate three enablers of volunteering applied to LDW maintenance: impetus to respond, positive evaluation of contributing, and increasing awareness. Insights are fleshed out through SYQL, an LDW platform on top of Yahoo's YQL. Specifically, SYQL capitalizes on the YQL community (i.e. impetus to respond), provides annotation overlays to ease participation (i.e. positive evaluation of contributing), and introduces a Health Checker (i.e. increasing awareness). Evaluation is conducted for 12 subjects involved in maintaining someone else's LDWs. Results indicate that both the Health Checker and the annotation overlays provide utility as enablers of awareness and contribution.
... The data is then published in RDF through the D2RQ server and can be queried via a D2RQ SPARQL endpoint. We also took an RDF dump from D2RQ into Virtuoso [28] to run federated queries. Figure 2 shows a detailed RDF representation of two patients. ...
Article
Full-text available
Background Next-generation sequencing provides comprehensive information about individuals' genetic makeup and is commonplace in precision oncology practice. Due to the heterogeneity of individual patients' disease conditions and treatment journeys, not all targeted therapies were initiated despite actionable mutations. To better understand and support the clinical decision-making process in precision oncology, there is a need to examine real-world associations between patients' genetic information and treatment choices. Methods To fill the gap of insufficient use of real-world data (RWD) in electronic health records (EHRs), we generated a single Resource Description Framework (RDF) resource, called PO2RDF (precision oncology to RDF), by integrating information regarding genes, variants, diseases, and drugs from genetic reports and EHRs. Results There are a total of 2,309,014 triples contained in the PO2RDF. Among them, 32,815 triples are related to Gene, 34,695 triples are related to Variant, 8,787 triples are related to Disease, and 26,154 triples are related to Drug. We performed two use case analyses to demonstrate the usability of the PO2RDF: (1) we examined real-world associations between EGFR mutations and targeted therapies to confirm existing knowledge and detect off-label use; (2) we examined differences in prognosis for lung cancer patients with/without TP53 mutations. Conclusions In conclusion, our work proposed to use RDF to organize and distribute clinical RWD that is otherwise inaccessible externally. Our work serves as a pilot study that will lead to new clinical applications and could ultimately stimulate progress in the field of precision oncology.
... In addition, GraphDB also provides some additional features for internal federation which, unfortunately, no longer comply with the SPARQL 1.1 standard at the time of writing of this article, in June 2021. Other solutions like RDF4J (Broekstra et al., 2002), Virtuoso (Erling and Mikhailov, 2010), Anzo or BlazeGraph extend the set of possible parameters. For example, RDF4J allows the practitioner to add timeout or limit; with explain, BlazeGraph enables a mode where the query results are extended with explanations of the query plan. ...
... A SPARQL endpoint uses the SPARQL protocol, which follows the first-come-first-served principle; therefore, queries have to wait in a queue for execution by the server. Many triple stores, like Virtuoso [10] and Jena TDB [1], offer a SPARQL interface. While serving queries concurrently, a SPARQL endpoint has to restrict the size of the result set returned to the end user or generate a time-out error message in cases where a query consumes too many resources [8]. Therefore, we may not get the complete result set that we should expect from the query. ...
Conference Paper
Full-text available
Over the years, the Web of Data has grown significantly. Various interfaces such as SPARQL endpoints, data dumps, and Triple Pattern Fragments (TPF) have been proposed to provide access to this data. Studies show that many of the SPARQL endpoints have availability issues. The data dumps do not provide live querying capabilities. The TPF solution aims to provide a trade-off between availability and performance by dividing the workload among TPF servers and clients. In this solution, the TPF server only performs the triple pattern execution of the given SPARQL query, while the TPF client performs the joins between the triple patterns to compute the final result set of the SPARQL query. High availability is achieved in TPF, but the increase in network bandwidth and query execution time lowers the performance. We want to propose a more intelligent SPARQL querying server to keep the high availability along with high query execution performance, while minimizing the network bandwidth. The proposed server will offer query execution services (single triple patterns or even join execution) according to the current status of the workload. If a server is free, it should be able to execute the complete SPARQL query. Thus, the server will offer execution services while avoiding going beyond the maximum query processing limit, i.e. the point after which the performance starts decreasing or the service even shuts down. Furthermore, we want to develop a more intelligent client, which keeps track of a server's processing capabilities and therefore avoids DoS attacks and crashes.
1 Problem Statement
The problem we want to address focuses on the trade-off between availability and performance in Linked Data interfaces. A large amount of Linked Data is available on the web and it keeps increasing day by day. According to LODStats (http://lodstats.aksw.org/), a total of ∼150 billion triples are available from 9,960 datasets. Querying this massive amount of data in a scalable way is particularly challenging. SPARQL is the primary query language to retrieve data from RDF linked datasets [28]. The true value of these datasets ...
... RDF-3X [9] and Hexastore [10] build tables on all six permutations of SPO. Built on a relational backbone, Virtuoso [11] uses a 4-column table for quads, and a combination of full and partial indexes. These methods work well for queries with small numbers of joins; however, they degrade with increasing data sizes, unbound variables, and numbers of joins. ...
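A toy relational layout in the spirit of the quad-table design described above can be built in a few lines; the column order and index choices below are illustrative, not Virtuoso's actual schema:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE rdf_quad (g TEXT, s TEXT, p TEXT, o TEXT,
                       PRIMARY KEY (g, s, p, o));
CREATE INDEX q_pogs ON rdf_quad (p, o, g, s);  -- a second "full" key order
CREATE INDEX q_so   ON rdf_quad (s, o);        -- "partial" (prefix) indexes
CREATE INDEX q_op   ON rdf_quad (o, p);
""")
con.execute("INSERT INTO rdf_quad VALUES (?, ?, ?, ?)",
            ("ex:g1", "ex:s1", "ex:p1", "ex:o1"))
# A lookup with predicate and object bound can be served by q_pogs
# rather than scanning the primary key order.
print(con.execute("SELECT s FROM rdf_quad WHERE p = ? AND o = ?",
                  ("ex:p1", "ex:o1")).fetchall())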
Article
Full-text available
Characteristic sets (CS) organize RDF triples based on the set of properties associated with their subject nodes. This concept was recently used in indexing techniques, as it can capture the implicit schema of RDF data. While most CS-based approaches yield significant improvements in space and query performance, they fail to perform well when answering complex query workloads in the presence of schema heterogeneity, i.e., when the number of CSs becomes very large, resulting in a highly partitioned data organization. In this paper, we address this problem by introducing a novel technique for merging CSs based on their hierarchical structure. Our method employs a lattice to capture the hierarchical relationships between CSs, identifies dense CSs, and merges them with their ancestors. We have implemented our algorithm on top of a relational backbone, where each merged CS is stored in a relational table; CS merging therefore results in a smaller number of tables to host the source triples of a data set. Moreover, we perform an extensive experimental study to evaluate the performance and impact of merging on the storage and querying of RDF datasets, indicating significant improvements. We also conduct a sensitivity analysis to identify the stability and possible weaknesses of our algorithm, and report on our results.
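The CS extraction step the abstract builds on is simple to state in code: group triples by subject and collect each subject's property set. This sketch shows only that step, with made-up triples; the paper's lattice-based merging is not reproduced:

from collections import defaultdict

triples = [
    ("s1", "rdf:type", "ex:Person"), ("s1", "ex:name", "Ann"),
    ("s2", "rdf:type", "ex:Person"), ("s2", "ex:name", "Bob"),
    ("s2", "ex:email", "b@x.org"),
]

# 1. Collect the property set of every subject.
props_by_subject = defaultdict(set)
for s, p, _ in triples:
    props_by_subject[s].add(p)

# 2. Each distinct property set is one characteristic set; count its subjects.
cs_support = defaultdict(int)
for props in props_by_subject.values():
    cs_support[frozenset(props)] += 1

for cs, n in cs_support.items():
    print(sorted(cs), "->", n, "subject(s)")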
... The data is then published in RDF through the D2RQ server, and can be queried via a D2RQ SPARQL endpoint. We also took an RDF dump from D2RQ into Virtuoso [28] to run federated queries. Figure 2 shows a detailed RDF representation of two patients. ...
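A federated query of the kind this excerpt alludes to can be sketched with SPARQL 1.1 SERVICE. Both endpoint URLs and the ex: predicates below are placeholders, assumed for illustration:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:8890/sparql")  # assumed Virtuoso endpoint
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX ex: <http://example.org/>
SELECT ?patient ?diagnosis WHERE {
  ?patient ex:hasVariant ?v .                # triples loaded from the dump
  SERVICE <http://localhost:2020/sparql> {   # assumed remote D2RQ endpoint
    ?patient ex:diagnosis ?diagnosis .
  }
}
""")
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["patient"]["value"], b["diagnosis"]["value"])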
Preprint
Full-text available
Background: Next-generation sequencing provides comprehensive information about individuals’ genetic makeup and is commonplace in precision oncology practice. Due to the heterogeneity of individual patients’ disease conditions and treatment journeys, targeted therapies were not always initiated despite actionable mutations. To better understand and support the clinical decision-making process in precision oncology, there is a need to examine real-world associations between patients’ genetic information and treatment choices. Methods: To fill the gap of insufficient use of real-world data (RWD) from electronic health records (EHRs), we generated a single Resource Description Framework (RDF) resource, called PO2RDF (precision oncology to RDF), by integrating information regarding genes, variants, diseases, and drugs from genetic reports and EHRs. Results: The PO2RDF resource contains a total of 2,309,014 triples: 32,815 relate to Gene, 34,695 to Variant, 8,787 to Disease, and 26,154 to Drug. We performed one use case analysis to demonstrate the usability of PO2RDF: we examined real-world associations between EGFR mutations and targeted therapies to confirm existing knowledge and detect off-label use. Conclusions: Our work proposes using RDF to organize and distribute clinical RWD that is otherwise inaccessible externally. It serves as a pilot study that will lead to new clinical applications and could ultimately stimulate progress in the field of precision oncology.
... Naturally, the design of native RDF storage is heavily influenced by traditional database design. For example, Virtuoso [43] and 4Store [44] store RDF statements in a table-like structure. In these systems, RDF data is represented as "RDF quads" consisting of four elements: subject, predicate, object, and graph id (or model in 4Store). ...
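The quad model described here maps directly onto rdflib's Dataset, which stores (subject, predicate, object, graph) quads; the graph name and terms below are illustrative:

from rdflib import Dataset, Literal, URIRef

ds = Dataset()
g1 = ds.graph(URIRef("http://example.org/graphs/g1"))  # the fourth element
g1.add((URIRef("http://example.org/s"),
        URIRef("http://example.org/p"),
        Literal("o")))

# Iterating the dataset yields (subject, predicate, object, graph) quads.
for s, p, o, g in ds.quads((None, None, None, None)):
    print(s, p, o, "in graph", g)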
Article
Full-text available
Semantic interoperability for the Internet of Things (IoT) is enabled by standards and technologies from the Semantic Web. As recent research suggests a move towards decentralised IoT architectures, we have investigated the scalability and robustness of RDF (Resource Description Framework) engines that can be embedded throughout the architecture, in particular at edge nodes. RDF processing at the edge facilitates the deployment of semantic integration gateways closer to low-level devices. Our focus is on how to enable scalable and robust RDF engines that can operate on lightweight devices. In this paper, we first carried out an empirical study of the scalability and behaviour of solutions for RDF data management on standard computing hardware that have been ported to run on lightweight devices at the network edge. The findings of our study show that these RDF store solutions have several shortcomings on commodity ARM (Advanced RISC Machine) boards that are representative of IoT edge node hardware. Consequently, this inspired us to introduce a lightweight RDF engine for edge devices, called RDF4Led, comprising an RDF store and a SPARQL processor. RDF4Led follows the RISC-style (Reduced Instruction Set Computer) design philosophy. The design constitutes a flash-aware storage structure, an indexing scheme, an alternative buffer management technique and a low-memory-footprint join algorithm that demonstrates improved scalability and robustness over competing solutions. With a significantly smaller memory footprint, we show that RDF4Led can handle 2 to 5 times more data than popular RDF engines such as Jena TDB (Tuple Database) and RDF4J, while consuming the same amount of memory. In particular, RDF4Led requires only 10%–30% of the memory of its competitors to operate on datasets of up to 50 million triples. On memory-constrained ARM boards, it can perform faster updates and can scale better than Jena TDB and Virtuoso. Furthermore, we demonstrate considerably faster query operations than Jena TDB and RDF4J.
... Virtuoso was originally a relational database that was later extended to support RDF data. The software first used a row-wise transaction scheme [33], but the latest version, Virtuoso 7, uses column-wise compressed storage with vectored execution [31]. The software is provided under both an open-source license for single machines and a commercial license for federated (distributed) storage and other additional functionality. ...
... S2X [19] first matches all triple patterns in the query, then gradually discards intermediate results through iterative computation, and finally joins the remaining matches. Virtuoso [20] processes RDF graph queries on top of a relational database, translating SPARQL into SQL and obtaining matches through join operations over relational tables. S2RDF [21], built on Spark [22], introduces a relational partitioning schema for RDF data called ExtVP, which uses semi-join-based preprocessing, akin to the concept of join indices in relational databases, to efficiently minimize query input size regardless of the query's pattern shape and diameter. ...
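The SPARQL-to-SQL translation attributed to Virtuoso here can be sketched for basic graph patterns: each triple pattern becomes a scan over a triples table, and shared variables become join conditions. This is a toy translator over an assumed single table named "triples"; real systems add cost-based optimization:

def bgp_to_sql(patterns):
    """patterns: list of (s, p, o); strings starting with '?' are variables."""
    select, where, var_cols = [], [], {}
    for i, (s, p, o) in enumerate(patterns):
        for col, term in (("s", s), ("p", p), ("o", o)):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in var_cols:            # shared variable -> join condition
                    where.append(f"{ref} = {var_cols[term]}")
                else:
                    var_cols[term] = ref
                    select.append(f"{ref} AS {term[1:]}")
            else:                               # constant -> filter
                where.append(f"{ref} = '{term}'")
    froms = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    return f"SELECT {', '.join(select)} FROM {froms} WHERE {' AND '.join(where)}"

print(bgp_to_sql([("?x", "ex:inGene", "ex:EGFR"),
                  ("?x", "ex:treatedWith", "?drug")]))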
Article
Full-text available
The open, collaborative nature of online encyclopedias and the large number of ambiguous entries lead to inappropriate classification of many Infobox knowledge triples, which calls for refining and denoising of large-scale knowledge to improve the precision of the Knowledge Base (KB). The enormous number of triples in KBs causes excessive serial computing time for knowledge denoising and disambiguation. Existing knowledge refinement and disambiguation techniques have limitations in scalability and time efficiency, and there is still little research on parallel knowledge refinement in distributed environments. Therefore, this paper proposes a novel parallel algorithm for large-scale Chinese knowledge refinement based on MapReduce, which further improves overall computing speed through parallel optimization of the serial algorithm. Building on the original serial refining algorithm, which enhances the precision of encyclopedia-oriented KBs, results show that the proposed parallel denoising algorithm provides good scalability and high speedup.
... Centralized RDF stores. Famous centralized RDF systems include Jena [14], Sesame [15], Virtuoso [16], RDF-3X [17], and RDFox [18] [19]. We introduce the distributed RDF stores in three categories: ...
Thesis
Real-time processing of data streams emanating from sensors is becoming a common task in industrial scenarios. In an Internet of Things (IoT) context, data are emitted by heterogeneous stream sources, i.e., coming from different domains and data models. This requires IoT applications to handle data integration mechanisms efficiently. The processing of RDF data streams has hence become an important research field. This trend enables a wide range of innovative applications where the real-time and reasoning aspects are pervasive. The key implementation goal of such applications consists in efficiently handling massive incoming data streams and supporting advanced data analytics services like anomaly detection. However, a modern RSP engine has to address the volume and velocity characteristics encountered in the Big Data era. In an ongoing industrial project, we found that a 24/7 available stream processing engine usually faces massive data volumes, dynamically changing data structures, and varying workload characteristics. These facts impact the engine's performance and reliability. To address these issues, we propose Strider, a hybrid adaptive distributed RDF stream processing engine that optimizes the logical query plan according to the state of the data streams. Strider has been designed to guarantee important industrial properties such as scalability, high availability, fault tolerance, high throughput and acceptable latency. These guarantees are obtained by building the engine's architecture on state-of-the-art Apache components such as Spark and Kafka. Moreover, an increasing number of processing jobs executed over RSP engines require reasoning mechanisms, usually at the cost of a trade-off between data throughput, latency and the computational cost of expressive inferences. Therefore, we extend Strider to support real-time RDFS+ (i.e., RDFS + owl:sameAs) reasoning capabilities. We combine Strider with a query rewriting approach for SPARQL that benefits from an intelligent encoding of the knowledge base. The system is evaluated along different dimensions and over multiple datasets to emphasize its performance. Finally, we have stepped further to exploratory RDF stream reasoning with a fragment of Answer Set Programming. This part of our research work is mainly motivated by the fact that more and more streaming applications require more expressive and complex reasoning tasks. The main challenge is to cope with the large volume and high velocity dimensions in a scalable and inference-enabled manner. Recent efforts in this area are still missing the aspect of system scalability for stream reasoning. Thus, we aim to explore the ability of modern distributed computing frameworks to process highly expressive knowledge inference queries over Big Data streams. To do so, we consider queries expressed as a positive fragment of LARS (a temporal logic framework based on Answer Set Programming) and propose solutions to process such queries, based on the two main execution models adopted by major parallel and distributed execution frameworks: Bulk Synchronous Parallel (BSP) and Record-at-A-Time (RAT). We implement our solution, named BigSR, and conduct a series of evaluations. Our experiments show that BigSR achieves high throughput beyond a million triples per second using a rather small cluster of machines.
... We used the Virtuoso software (Erling and Mikhailov, 2010) to hold our triplestore and provide a SPARQL endpoint to it. The only SPARQL commands that our interface uses are those that retrieve triples matching one to three given field values. ...
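The "one to three bound fields" retrieval pattern this excerpt describes corresponds directly to triple-pattern matching with wildcards, shown here with rdflib (the data file is assumed):

from rdflib import Graph, URIRef

g = Graph()
g.parse("data.ttl", format="turtle")  # assumed local data file

# Bind any one to three of the (s, p, o) fields; None acts as a wildcard.
label = URIRef("http://www.w3.org/2000/01/rdf-schema#label")
for s, _, o in g.triples((None, label, None)):
    print(s, "has label", o)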
Chapter
The escalating demand for the RDF format for knowledge representation and information management can be attributed to its flexible nature. The RDF data model is increasingly used for sharing and integrating information and knowledge across several domains, including bioinformatics and search engines. As the amount of RDF data continues to grow, managing such large volumes of data becomes challenging, and scalability becomes a major concern; it is therefore necessary to opt for scalable solutions when managing large RDF data. As a solution, many researchers turn to distributed data management systems. In this article, the authors provide a detailed analysis of the RDF data management techniques used to make RDF systems more scalable, together with a brief description of centralized and distributed RDF frameworks.
Thesis
With the development of knowledge bases in many industrial and academic domains, novice users face the need to formulate queries without necessarily mastering the query language, SPARQL, or the underlying data structure, generally described with the RDF language. These users may thus make mistakes when writing their queries and obtain results that are unexpected or difficult to process. Among the situations of user dissatisfaction, the problem of empty answers has been widely studied. Explaining the reasons for the absence of answers can either help the user improve their query writing or allow the queries to be corrected automatically. But the absence of answers is not the only possible source of dissatisfaction, and few existing works have addressed the identification of failure causes for other problems. We first address the problem of plethoric answers, i.e., when a query produces a very large number of answers that the user did not expect, so that the relevant information cannot be extracted. We show that the notions of failure causes and alternative queries introduced for the empty-answer problem can be extended to the plethoric-answer problem, and we introduce suitable computation algorithms. We then consider the specific features of SPARQL, using predicate cardinalities to improve the search algorithms and adapting our formalism to accept queries containing several operators specific to this language. Finally, the method is generalized to an arbitrary problem of user dissatisfaction with the obtained results. We show how to handle five elementary problems of unsatisfactory answers and how to combine them to describe more complex problems. Our contributions were validated experimentally using synthetic data and queries from WatDiv and real data and queries from DBpedia.
Chapter
The Resource Description Framework (RDF) is a flexible model for representing information about resources on the Web. As a World Wide Web Consortium (W3C) Recommendation, RDF has rapidly gained popularity. With the widespread acceptance of RDF on the Web and in the enterprise, a huge amount of RDF data is proliferating and becoming available. Efficient and scalable management of RDF data is therefore of increasing importance. This is creating a new set of data management requirements involving RDF, and RDF data management has attracted attention in the database and Semantic Web communities. In this chapter, RDF data and techniques for data management will be introduced. We start by providing a formal definition of RDF that includes the model and semantics, and then focus on the RDF query language SPARQL. We provide an algebraic syntax and a compositional semantics for this language. Finally, we introduce RDF data storage and present an overview of the current state of the art in RDF data storage strategies.
Article
This paper details RDF/OWL storage and management in two popular Relational Database Management Systems (RDBMSs): Oracle and Virtuoso. Popularity, sustainability, and conformance with the SPARQL language are the main reasons for choosing these systems. This work combines empirical and analytical studies guided by a comparative framework developed and motivated in the paper. Several dimensions are considered, including RDF data type preservation, SPARQL query and update processing, reasoning capabilities, custom inferences, blank node management, and other functional and non-functional features. Furthermore, a review of the performance assessments reported in the literature has been conducted. This study’s results identify the advantages and shortcomings of these RDBMSs for storing and managing RDF/OWL. They can help improve these systems or serve as a guide in choosing an appropriate system to use in a project context. Opportunities for further research efforts are also suggested.
Chapter
A quick and intuitive understanding of network reachability is of great significance for network optimization and network security management. In this paper, we propose a query engine called NREngine for network reachability that takes network security policies into account. NREngine constructs a knowledge graph from the network security policies and designs an algorithm over the graph for computing network reachability. Furthermore, to provide a user-friendly interface, we propose a structured query language named NRQL in NREngine for network reachability queries. The experimental results show that NREngine can efficiently support a variety of network reachability query services.
Chapter
Intelligent and smart health monitoring is prevalent nowadays, supported by advances in the Internet of Things, machine learning, and ontology-based decision support systems. Since a decision support system can analyze current patient vitals against historical data, effective representation of data from different sources in a common knowledge base is essential. Web semantics has an increasingly important role to play here, storing data according to an ontology for a more usable knowledge repository. The findings of the decision support system can be fed to the doctor's smartphone as a message, based on which the doctor may intervene in a specific scenario or validate his or her own diagnosis against the one provided by the system. As the comfort and convenience of the end users of remote healthcare are important, quality of experience is a concern in addition to quality of service, among other issues and challenges. This work focuses on several Machine Learning (ML) algorithms and ontology techniques to design and implement an intelligent decision support system for effective healthcare support satisfying quality-of-service and quality-of-experience requirements.
Thesis
Full-text available
The Internet of Things (IoT) is an emerging phenomenon in the public space. Users with accessibility needs could especially benefit from these “smart” devices if they were able to interact with them through speech. This thesis presents a Compositional Semantics and framework for developing extensible and expressive Natural Language Query Interfaces to the Semantic Web, addressing privacy and auditability needs in the process. This could be particularly useful in healthcare or legal applications, where confidentiality of information is a key concern.
Article
The Resource Description Framework (RDF) is increasingly being used to model data on the web. The RDF model was designed to support easy representation and exchange of information on the web. RDF is queried using SPARQL, a standard query language recommended by the W3C. The growing acceptance of the RDF format can be attributed to its flexible and reusable nature. The size of RDF data is steadily increasing as many government organizations and companies use RDF for data representation and exchange. This has resulted in the need for distributed RDF frameworks that can efficiently manage RDF data on a large scale, i.e., Big RDF data. Such scalable distributed RDF data management systems, competent enough to handle Big RDF data, can also be termed Big RDF frameworks. The proliferation of RDF data has made RDF data management a difficult task. In this survey, we provide an extensive literature review of Big RDF frameworks from the aspects of storage, partitioning, indexing, query optimization and processing. A taxonomy of the tools and technologies used for the storage and retrieval of Big RDF data in these systems is presented, together with a comparative evaluation of several Big RDF frameworks based on query performance and our observations from this evaluation. The research challenges identified during the study of these systems are elaborated to suggest promising directions for future research.
Article
Natural Language Query Interfaces (NLQIs) have once again captured the public imagination, but developing them for the Semantic Web has proven to be non-trivial. This is unfortunate, because the Semantic Web offers many opportunities for interacting with smart devices, including those connected to the Internet of Things. In this paper, we present an NLQI to the Semantic Web based on a Compositional Semantics (CS) that can accommodate many particularly tricky aspects of the English language, including nested n-ary transitive verbs, superlatives, chained prepositional phrases, and even ambiguity. Key to our approach is a new data structure that has proven useful in answering NL queries. As a consequence, our system is able to handle NL features that are often considered non-compositional. We also present a novel method to memoize sub-expressions of a query formed from the CS, drastically improving query execution times on large triplestores. Our approach is agnostic to any particular database query language. A live demonstration of our NLQI is available online.
Article
Full-text available
Integrating social network data into business promotion and marketing applications has been widely addressed by researchers. However, the isolation between the social network platforms managing such data has made this a challenging task for data scientists. In this respect, the present paper puts forward a semantic data integration approach whereby a unified representation of, and access to, social network data can be maintained. To this end, the novel SNOWL (Social Network OWL) ontology provides a new model of social network content, following the UPON Lite ontology-construction methodology. The ontology is not created from scratch; it is a continuation of previously devised ontologies, elaborated to integrate a selection of newly incorporated social entities, such as content and user popularity. Additionally, to take full advantage of the model, a mapping of social network data to the designed ontology was first implemented on the basis of the RML mapping language. Second, the SNOWL ontology is evaluated with the OOPS! pitfall tool. Finally, a set of SPARQL-based services has been designed on top of the SNOWL ontology to ensure unified access to the mapped social data.
Article
This article presents LinkZoo, a web-based, linked-data-enabled tool that supports collaborative management of information resources. LinkZoo addresses the modern needs of information-intensive collaboration environments to publish, manage, and share heterogeneous resources within user-driven contexts. Users create and manage diverse types of resources in common spaces, such as files, web documents, people, datasets, and calendar events. They can interlink them, annotate them, and share them with other users, thus enabling collaborative editing, and enrich them with links to external linked data resources. Resources are inherently modeled and published as Resource Description Framework (RDF) data and can be explicitly interlinked and dereferenced by external applications. LinkZoo supports the creation of dynamic communities that enable web-based collaboration through resource sharing and annotation, exposing objects on the Linked Data Cloud under controlled vocabularies and permissions. The authors demonstrate the applicability of the tool on a popular collaboration use case for sharing and organizing research resources.
Chapter
One of the key traits of Big Data is its complexity in terms of representation, structure, and formats. One existing way to deal with it is offered by Semantic Web standards. Among them, RDF – which models data as triples representing edges in a graph – has seen wide success, and the amount of semantically annotated data has grown steadily towards a massive scale. Therefore, there is a need for scalable and efficient query engines capable of retrieving such information. In this paper, we propose Sparklify: a scalable software component for efficient evaluation of SPARQL queries over distributed RDF datasets. It uses Sparqlify as a SPARQL-to-SQL rewriter to translate SPARQL queries into Spark executable code. Our preliminary results demonstrate that our approach is more extensible, efficient, and scalable than state-of-the-art approaches. Sparklify is integrated into the larger SANSA framework, where it serves as the default query engine, and it has been used in at least three external use scenarios.
Article
We propose to use DL-Lite_A techniques to reason over and query Web-scale Open Data (knowledge bases) described by Semantic Web standards like RDF and OWL, due to the low reasoning complexity and suitable expressivity of the language. When facing the real-life scalability challenge, the actual reasoning and query answering may become infeasible for the following two reasons. First, for both satisfiability checking and conjunctive query answering, a polynomial number of queries (w.r.t. the size of the schema knowledge) may need to be answered over the data layers of the corresponding knowledge bases (KBs). Second, for KBs with massive individual assertions, evaluating a single query over the data layers may be highly time-consuming. This impels us to seek a divide-and-conquer reasoning and query answering approach for DL-Lite_A, with the basic idea of partitioning both KBs and queries into smaller chunks and decomposing the original reasoning and query answering tasks into a group of independent sub-tasks, such that the overall performance can be improved by taking advantage of parallelization and distribution techniques. The challenge in designing such an approach lies in how to carry out partitioning and reasoning reduction in a sound and complete way. Motivated by hash partitioning of RDF graphs, we expect the smaller KB chunks to have the locality needed for both satisfiability checking and simple-query answering. Here, simple queries are the conjunctive queries whose query atoms share a common variable or individual. For query answering, we expect to partition a query into smaller simple queries and evaluate them over smaller KB chunks. Under these expectations, our divide-and-conquer approach is constructed from both theoretical and practical perspectives. Theoretically, definitions of KB partitions and query partitions are presented, and sufficient and necessary conditions are identified to determine whether a KB partition holds the desired features. Practically, based on the theoretical results, concrete ways of partitioning KBs and queries, as well as of evaluating query partitions over KB partitions, are described. Moreover, a strategy for optimizing the evaluation of query partitions over KB partitions is provided to improve overall query answering performance. To verify our approach, two Web-scale open datasets, DBpedia and the BTC 2012 dataset, were chosen. The empirical results indicate that the proposed approach opens new possibilities for realizing performance-critical applications on the Web with both high expressivity and scalability.
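The hash-partitioning idea this abstract builds on is easy to sketch: route each triple to a chunk by a hash of its subject, so star-shaped ("simple") queries touching one subject stay within one chunk. The chunk count and triples below are illustrative:

import hashlib

def chunk_of(subject: str, n_chunks: int) -> int:
    # Stable hash so the same subject always lands in the same chunk.
    digest = hashlib.md5(subject.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_chunks

N = 4
chunks = [[] for _ in range(N)]
triples = [("ex:a", "ex:p", "ex:b"), ("ex:a", "ex:q", "ex:c"),
           ("ex:d", "ex:p", "ex:e")]
for s, p, o in triples:
    chunks[chunk_of(s, N)].append((s, p, o))

# A star-shaped query on ex:a only needs the one chunk holding ex:a's triples.
for i, c in enumerate(chunks):
    print("chunk", i, ":", c)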
Article
Full-text available
As Semantic Web technologies are getting mature, there is a growing need for RDF applications to access the content of huge, live, non-RDF, legacy databases without having to replicate the whole database into RDF. In this poster, we present D2RQ, a declarative language to describe mappings between application-specific relational database schemata and RDFS/OWL ontologies. D2RQ allows RDF applications to treat non-RDF relational databases as virtual RDF graphs, which can be queried using RDQL.
Conference Paper
Full-text available
Integrating relational databases has recently been acknowledged as an important vision of Semantic Web research; however, there are not many well-implemented tools, nor many applications in large-scale real use. This paper introduces Dartgrid, an application development framework together with a set of semantic tools to facilitate the integration of heterogeneous relational databases using Semantic Web technologies. For example, DartMapping is a visualized mapping tool to help DBAs define semantic mappings from heterogeneous relational schemas to ontologies. DartQuery is an ontology-based query interface helping users construct semantic queries, capable of rewriting SPARQL semantic queries into a set of SQL queries. DartSearch is an ontology-based search engine enabling users to run full-text searches over all databases and to navigate across the search results semantically. It is also enriched with a concept-ranking mechanism enabling users to find more accurate and reliable results. This toolkit has been used to develop a currently in-use application for the China Academy of Traditional Chinese Medicine (CATCM), in which over 70 legacy relational databases are semantically interconnected by an ontology with over 70 classes and 800 properties, providing integrated semantically enriched query, search and navigation services to TCM communities.
Conference Paper
Full-text available
Wikis are established means for the collaborative authoring, versioning and publishing of textual articles. The Wikipedia project, for example, succeeded in creating by far the largest encyclopedia just on the basis of a wiki. Recently, several approaches have been proposed for extending wikis to allow the creation of structured and semantically enriched content. However, the means for creating semantically enriched structured content are already available and are, albeit unconsciously, already used by Wikipedia authors. In this article, we present a method for revealing this structured content by extracting information from template instances. We suggest ways to efficiently query the vast amount of extracted information (e.g., more than 8 million RDF statements for the English Wikipedia alone), leading to astonishing query answering possibilities (such as for the title question). We analyze the quality of the extracted content and propose strategies for quality improvements with only minor modifications of the wiki systems currently in use.
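The template-extraction step described above can be illustrated with a toy parser. The infobox text, page name, and dbr:/dbp: prefixes below are assumptions for illustration; real extraction handles far more wiki syntax:

import re

wikitext = """{{Infobox city
| name       = Leipzig
| country    = Germany
| population = 601866
}}"""

page = "Leipzig"
# Turn each "key = value" line of the template instance into one triple.
triples = [(f"dbr:{page}", f"dbp:{key}", value.strip())
           for key, value in re.findall(r"\|\s*(\w+)\s*=\s*(.+)", wikitext)]

for t in triples:
    print(t)  # e.g. ('dbr:Leipzig', 'dbp:population', '601866')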
Article
Full-text available
The SPARQL Query Language for RDF and the SPARQL Protocol for RDF are implemented by a growing number of storage systems and are used within enterprise and open Web settings. As SPARQL is taken up by the community, there is a growing need for benchmarks to compare the performance of storage systems that expose SPARQL endpoints via the SPARQL protocol. Such systems include native RDF stores as well as systems that rewrite SPARQL queries to SQL queries against non-RDF relational databases. This article introduces the Berlin SPARQL Benchmark (BSBM) for comparing the performance of native RDF stores with the performance of SPARQL-to-SQL rewriters across architectures. The benchmark is built around an e-commerce use case in which a set of products is offered by different vendors and consumers have posted reviews about products. The benchmark query mix emulates the search and navigation pattern of a consumer looking for a product. The article discusses the design of the BSBM benchmark and presents the results of a benchmark experiment comparing the performance of four popular RDF stores (Sesame, Virtuoso, Jena TDB, and Jena SDB) with the performance of two SPARQL-to-SQL rewriters (D2R Server and Virtuoso RDF Views) as well as the performance of two relational database management systems (MySQL and Virtuoso RDBMS).
Article
Full-text available
D2RQ is a complete solution to map and expose relational databases as RDF.
Article
We describe our method for benchmarking Semantic Web knowledge base systems with respect to use in large OWL applications. We present the Lehigh University Benchmark (LUBM) as an example of how to design such benchmarks. The LUBM features an ontology for the university domain, synthetic OWL data scalable to an arbitrary size, 14 extensional queries representing a variety of properties, and several performance metrics. The LUBM can be used to evaluate systems with different reasoning capabilities and storage mechanisms. We demonstrate this with an evaluation of two memory-based systems and two systems with persistent storage.
Conference Paper
Efficient management of RDF data is an important factor in realizing the Semantic Web vision. Performance and scalability issues are becoming increasingly pressing as Semantic Web technology is applied to real-world applications. In this paper, we examine the reasons why current data management solutions for RDF data scale poorly, and explore the fundamental scalability limitations of these approaches. We review the state of the art for improving performance for RDF databases and consider a recent suggestion, "property tables." We then discuss practically and empirically why this solution has undesirable features. As an improvement, we propose an alternative solution: vertically partitioning the RDF data. We compare the performance of vertical partitioning with prior art on queries generated by a Web-based RDF browser over a large-scale (more than 50 million triples) catalog of library data. Our results show that a vertically partitioned schema achieves similar performance to the property table technique while being much simpler to design. Further, if a column-oriented DBMS (a database architected specially for the vertically partitioned case) is used instead of a row-oriented DBMS, another order of magnitude performance improvement is observed, with query times dropping from minutes to several seconds.
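Vertical partitioning as evaluated above stores one narrow (subject, object) table per property instead of a single triples table. A minimal sketch, with simplified table naming and toy data:

import sqlite3

triples = [("s1", "type", "Person"), ("s1", "name", "Ann"),
           ("s2", "type", "Person"), ("s2", "name", "Bob")]

con = sqlite3.connect(":memory:")
for _, prop, _ in triples:
    # One narrow (subject, object) table per property.
    con.execute(f"CREATE TABLE IF NOT EXISTS prop_{prop} (s TEXT, o TEXT)")
for s, prop, o in triples:
    con.execute(f"INSERT INTO prop_{prop} VALUES (?, ?)", (s, o))

# A two-pattern query becomes a join of two narrow property tables.
rows = con.execute("""
    SELECT t.s, n.o FROM prop_type t JOIN prop_name n ON t.s = n.s
    WHERE t.o = 'Person'
""").fetchall()
print(rows)  # [('s1', 'Ann'), ('s2', 'Bob')]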