Chapter

Challenges in RDF validation

Authors:
Jose Emilio Labra Gayo, Herminio García-González, Daniel Fernández-Álvarez, Eric Prud'hommeaux

Abstract

The RDF data model forms a cornerstone of the Semantic Web technology stack. Although there have been different proposals for RDF serialization syntaxes, the underlying simple data model enables great flexibility, which allows it to be successfully employed in many different scenarios and to form the basis on which other technologies are developed. In order to apply an RDF-based approach in practice, it is necessary to communicate the structure of the data that is being stored or represented. Data quality is of paramount importance for the acceptance of RDF as a data representation language, and it must be supported by tools that can check whether data conforms to a specific structure. There have been several recent proposals for RDF validation languages, like ShEx and SHACL. In this chapter, we describe both proposals and enumerate some challenges and trends that we foresee with regard to RDF validation. We devote more space to what we consider one of the main challenges, which is to compare ShEx and SHACL and to understand their underlying foundations. To that end, we propose an intermediate language and show how ShEx and SHACL can be converted to it.
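For readers unfamiliar with the two languages compared in the chapter, the following sketch shows the same constraint (a schema:Person must have exactly one schema:name of type xsd:string) written both as a ShEx shape and as a SHACL node shape, and checks the SHACL version with the rdflib and pyshacl Python libraries. The vocabulary (schema:Person, schema:name) and the sample data are illustrative choices, not taken from the chapter, and the snippet is unrelated to the chapter's intermediate language.

# Minimal ShEx/SHACL comparison sketch (illustrative only).
# Requires: pip install rdflib pyshacl
from rdflib import Graph
from pyshacl import validate

# The constraint in ShEx compact syntax (shown for comparison, not executed here):
SHEX_SCHEMA = """
PREFIX schema: <http://schema.org/>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
<PersonShape> {
  schema:name xsd:string       # exactly one name
}
"""

# The corresponding SHACL shapes graph in Turtle:
SHACL_SHAPES = """
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix schema: <http://schema.org/> .
@prefix ex:     <http://example.org/> .

ex:PersonShape a sh:NodeShape ;
    sh:targetClass schema:Person ;
    sh:property [
        sh:path schema:name ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
"""

DATA = """
@prefix schema: <http://schema.org/> .
@prefix ex:     <http://example.org/> .

ex:alice a schema:Person ; schema:name "Alice" .
ex:bob   a schema:Person .                # missing name -> violation
"""

data_graph = Graph().parse(data=DATA, format="turtle")
shapes_graph = Graph().parse(data=SHACL_SHAPES, format="turtle")

conforms, _, report_text = validate(data_graph, shacl_graph=shapes_graph)
print("Conforms:", conforms)              # False: ex:bob has no schema:name
print(report_text)

One difference visible even at this scale is how nodes are selected for validation: the SHACL shape carries its own target declaration (sh:targetClass), whereas ShEx leaves the association of nodes to shapes to a separate shape map.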


... More details about ShEx and SHACL can be found in the book by Labra Gayo et al. [306]. A recently proposed language that can be used as a common basis for both ShEx and SHACL reveals their similarities and differences [305]. A similar notion of schema has been proposed by Angles [14] for property graphs. ...
... B.3.2 Validating schema. We define shapes following conventions used by Labra Gayo et al. [305]. ...
... The nodes that a shape targets can be selected manually, based on the type(s) of the nodes, based on the results of a graph query, etc. [104,305]. ...
Article
Full-text available
In this article, we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models, as well as languages used to query and validate knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We conclude with high-level future research directions for knowledge graphs.
... More details about ShEx and SHACL can be found in the book by Labra Gayo et al. [287]. A recently proposed language that can be used as a common basis for both ShEx and SHACL reveals their similarities and differences [286]. A similar notion of schema has been proposed by Angles [12] for property graphs. ...
... The nodes that a shape targets can be selected manually, based on the type(s) of the nodes, based on the results of a graph query, etc. [100,286]. ...
... The semantics we present here assumes that each node in the graph either satisfies or does not satisfy each shape labelled by the schema. More complex semantics (for example, based on Kleene's three-valued logic [100,286]) have been proposed that support partial shapes maps, where the satisfaction of some nodes for some shapes can be left undefined. Shapes languages in practice may support other forms of constraints, such as counting on paths [278]. ...
Article
Full-text available
In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After a general introduction, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.
... Formal definitions exist for both property graphs (e.g. [25]) and Wikibase graphs [62,37]. ... one based on the Wikidata toolkit, and another one based on the fs2 library. We tested it using a virtual server running on XCP-ng 8.2 (CentOS 8 stream) with 32 cores and 64 GB of RAM, and it was able to process the JSON Wikidata dump from 2021, which is 1,256.55 GB uncompressed, in 5 h 15 min. ...
... We consider that ShEx adapts better to describe data models than SHACL, which is more focused on constraint violations. A comparison between both is provided in [39] while in [37], a simple language is defined that can be used as a common subset of both. ...
Preprint
Full-text available
The initial adoption of knowledge graphs by Google, and later by other big companies, has increased their popularity. In this paper we present a formal model for three different types of knowledge graphs, which we call RDF-based graphs, property graphs and Wikibase graphs. In order to increase the quality of knowledge graphs, several approaches have appeared to describe and validate their contents. Shape Expressions (ShEx) has been proposed as a concise language for RDF validation. We give a brief introduction to ShEx and present two extensions that can also be used to describe and validate property graphs (PShEx) and Wikibase graphs (WShEx). One problem of knowledge graphs is the large amount of data they contain, which jeopardizes their practical application. In order to mitigate this problem, one approach is to create subsets of those knowledge graphs for some domains. We propose the following approaches to generate those subsets: entity matching, simple matching, ShEx matching, ShEx plus Slurp and ShEx plus Pregel, which are based on declaratively defining the subsets by either matching some content or by Shape Expressions. The last approach is based on a novel validation algorithm for ShEx, based on the Pregel algorithm, that can handle big data graphs and has been implemented on Apache Spark GraphX.
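As a rough illustration of the "simple matching" style of subset extraction described above (not the paper's actual implementation, which runs on Apache Spark GraphX), the following sketch keeps the triples whose subject belongs to a seed set of entities and expands the seeds one hop along object links; the entity IRIs and the input graph are made up for the example.

# Sketch of subset extraction by simple entity matching (assumed behaviour,
# not the paper's Spark/Pregel implementation). Requires: pip install rdflib
from rdflib import Graph, URIRef

def simple_matching_subset(graph: Graph, seeds: set, hops: int = 1) -> Graph:
    """Keep triples whose subject is a seed, expanding the seed set by
    following object links for a fixed number of hops."""
    frontier = set(seeds)
    subset = Graph()
    for _ in range(hops + 1):
        next_frontier = set()
        for s in frontier:
            for p, o in graph.predicate_objects(s):
                subset.add((s, p, o))
                if isinstance(o, URIRef):
                    next_frontier.add(o)
        frontier = next_frontier - {subj for subj, _, _ in subset}
    return subset

# Usage with a made-up graph:
g = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:q1 ex:instanceOf ex:Human ; ex:knows ex:q2 .
ex:q2 ex:instanceOf ex:Human .
ex:q3 ex:instanceOf ex:City .
""", format="turtle")

sub = simple_matching_subset(g, {URIRef("http://example.org/q1")})
print(len(sub), "triples in the subset")   # 3: ex:q3 is left out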
... Due to the growth of these publicly available KGs, the W3C has promoted a recommendation called SHACL (Shapes Constraint Language) to validate RDF graphs [2]. In recent years, KG validation by means of SHACL shapes has gained momentum and has become a relevant research topic [14]. ...
... In addition, the experiments relied on two ontologies developed in the context of the European projects VICINITY and DELTA, namely the VICINITY ontology and the DELTA ontology. All the shapes generated during both experiments using Astrea were manually validated in terms of syntactic correctness using the SHACL playground. ...
Conference Paper
Knowledge Graphs (KGs) that publish RDF data modelled using ontologies in a wide range of domains have populated the Web. The SHACL language is a W3C recommendation for encoding sets of either value or model data restrictions that aim at validating KG data, ensuring data quality. Developing shapes is a complex and time-consuming task that is not feasible to carry out manually. This article presents two resources that aim at automatically generating SHACL shapes for a set of ontologies: (1) Astrea-KG, a KG that publishes a set of mappings that encode the equivalent conceptual restrictions among ontology constraint patterns and SHACL constraint patterns, and (2) Astrea, a tool that automatically generates SHACL shapes from a set of ontologies by executing the mappings from the Astrea-KG. These two resources are openly available at Zenodo, GitHub, and a web application. In contrast to other proposals, these resources cover a large number of SHACL restrictions, producing both value and model data restrictions, whereas other proposals consider only a limited number of restrictions or focus only on value or model restrictions.
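To give a flavour of the kind of mapping Astrea encodes (the real tool works from a much larger catalogue of mappings published in Astrea-KG), the sketch below translates one common ontology pattern, an owl:Restriction with owl:onProperty and owl:minCardinality attached to a class, into a SHACL property shape using rdflib; the ontology snippet and the namespaces are invented for the example.

# Hypothetical sketch: deriving SHACL property shapes from OWL cardinality
# restrictions (one of many patterns a tool such as Astrea covers).
# Requires: pip install rdflib
from rdflib import Graph, Namespace, BNode, Literal, RDF, RDFS, URIRef
from rdflib.namespace import OWL

SH = Namespace("http://www.w3.org/ns/shacl#")

ontology = Graph().parse(data="""
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

ex:Person a owl:Class ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty ex:name ;
        owl:minCardinality "1"^^xsd:nonNegativeInteger
    ] .
""", format="turtle")

shapes = Graph()
shapes.bind("sh", SH)

for cls, restriction in ontology.subject_objects(RDFS.subClassOf):
    if (restriction, RDF.type, OWL.Restriction) not in ontology:
        continue
    prop = ontology.value(restriction, OWL.onProperty)
    min_card = ontology.value(restriction, OWL.minCardinality)
    shape = URIRef(str(cls) + "Shape")
    pshape = BNode()
    shapes.add((shape, RDF.type, SH.NodeShape))
    shapes.add((shape, SH.targetClass, cls))
    shapes.add((shape, SH.property, pshape))
    shapes.add((pshape, SH.path, prop))
    if min_card is not None:
        shapes.add((pshape, SH.minCount, Literal(int(min_card))))

print(shapes.serialize(format="turtle"))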
... Validating RDF data portals using SPARQL queries has been regularly proposed as an approach that gives great flexibility and expressiveness (Labra Gayo & Alvarez Rodríguez, 2013). However, academic literature is still far from revealing a consensus on methods and approaches to evaluate ontologies using this query language (Walisadeera, Ginige & Wikramanayake, 2016), and other approaches have been proposed for validation (Labra-Gayo et al., 2019). Currently, there is mostly an effort to normalize how to define SPARQL queries, particularly for knowledge graph validation purposes, to save runtime and ameliorate the completeness of the output of a query using a set of heuristics and axioms (Salas & Hogan, 2022). In Wikidata, the Wikidata Query Service (https://query.wikidata.org) ...
Article
Full-text available
Urgent global research demands real-time dissemination of precise data. Wikidata, a collaborative and openly licensed knowledge graph available in RDF format, provides an ideal forum for exchanging structured data that can be verified and consolidated using validation schemas and bot edits. In this research article, we catalog an automatable task set necessary to assess and validate the portion of Wikidata relating to the COVID-19 epidemiology. These tasks assess statistical data and are implemented in SPARQL, a query language for semantic databases. We demonstrate the efficiency of our methods for evaluating structured non-relational information on COVID-19 in Wikidata, and its applicability in collaborative ontologies and knowledge graphs more broadly. We show the advantages and limitations of our proposed approach by comparing it to the features of other methods for the validation of linked web data as revealed by previous research.
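A hedged sketch of the style of SPARQL check the article describes, run against the public Wikidata Query Service with the SPARQLWrapper library: it looks for "number of cases" statements that lack a "point in time" qualifier. The property choices (P1603 for number of cases, P585 for point in time) and the idea of treating a missing qualifier as a defect are illustrative assumptions, not the article's published task set.

# Illustrative validation-style SPARQL query over Wikidata (assumed check,
# not the article's task set). Requires: pip install sparqlwrapper
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
PREFIX p:  <http://www.wikidata.org/prop/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>

# Items with a 'number of cases' (P1603, assumed) statement that has no
# 'point in time' (P585, assumed) qualifier -- treated here as statements
# that cannot be placed on a timeline.
SELECT ?item ?statement WHERE {
  ?item p:P1603 ?statement .
  FILTER NOT EXISTS { ?statement pq:P585 ?date . }
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT, agent="rdf-validation-example/0.1")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["item"]["value"])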
... We consider that ShEx adapts better to describe data models than SHACL, which is more focused on constraint violations. A comparison between both is provided in [8] while in [6], a simple language is defined that can be used as a common subset of both. ...
Preprint
Full-text available
Wikidata is one of the most successful Semantic Web projects. Its underlying Wikibase data model departs from RDF with the inclusion of several features like qualifiers and references, built-in datatypes, etc. Those features are serialized to RDF for content negotiation, RDF dumps and in the SPARQL endpoint. Wikidata adopted the entity schemas namespace using the ShEx language to describe and validate the RDF serialization of Wikidata entities. In this paper we propose WShEx, a language inspired by ShEx that directly supports the Wikibase data model and can be used to describe and validate Wikibase entities. The paper presents the abstract syntax and semantics of the WShEx language.
... Besides the need for visualization of RDF vocabularies, public Knowledge Graphs (KGs) have been growing significantly since 2007. To deal with the quality and assessment of this large data cloud, the W3C has promoted a recommendation called the Shapes Constraint Language (SHACL). Since then, SHACL has become a relevant research topic [5]. ...
Preprint
Full-text available
In this paper, we present an early stage of an integrated environment allowing the user to edit an RDF vocabulary while visualizing the relations among classes, properties, and SHACL shapes. It also includes automatic generation of SHACL shapes based on the vocabulary's classes and properties, as well as validation of the instances according to the SHACL constraints. The proposed environment is integrated with a versioned triple store, which allows keeping multiple versions of the vocabulary while tracking provenance information. To the best of our knowledge, this is the first integrated environment for Visualizing, Validating, and Versioning RDF vocabularies. Accepted for publication at IEEE ICSC 2021 (https://www.ieee-icsc.org/)
... Most of the time, the shapes should be reviewed and edited by domain experts in order to ensure they fit the application requirements. The usability of tools that support proper editing of those schemas has been identified as a challenge [8]. Because consolidated constraint languages for RDF data are relatively young, this kind of tool has not been thoroughly investigated yet, but a list of desirable features has been proposed [3]. ...
Preprint
Full-text available
We present a method for the construction of SHACL or ShEx constraints for an existing RDF dataset. It has two components that are used conjointly: an algorithm for automatic schema construction, and an interactive workflow for editing the schema. The schema construction algorithm takes as input sets of sample nodes and constructs a shape constraint for every sample set. It can be parametrized by a schema pattern that defines structural requirements for the schema to be constructed. Schema patterns are also used to feed the algorithm with relevant information about the dataset coming from a domain expert or from some ontology. The interactive workflow provides useful information about the dataset, shows validation results with respect to the schema under construction, and offers schema editing operations that, combined with the schema construction algorithm, allow building a complex ShEx or SHACL schema.
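The core of the automatic construction step can be pictured as follows: for every sample set, count how often each property occurs on each sample node and turn the observed minimum and maximum into cardinality bounds. This is only a simplified sketch of that idea with rdflib, not the authors' algorithm (which additionally takes schema patterns and value information into account); the graph and sample nodes are invented.

# Simplified sketch of shape construction from a set of sample nodes:
# per-property cardinality bounds observed in the samples.
# Requires: pip install rdflib
from collections import Counter
from rdflib import Graph, URIRef

def infer_cardinalities(graph: Graph, samples: list) -> dict:
    """Return {property: (min_count, max_count)} over the sample nodes."""
    per_node = []
    for node in samples:
        per_node.append(Counter(p for p, _ in graph.predicate_objects(node)))
    properties = set().union(*per_node) if per_node else set()
    bounds = {}
    for prop in properties:
        counts = [c.get(prop, 0) for c in per_node]
        bounds[prop] = (min(counts), max(counts))
    return bounds

# Usage with a made-up graph and sample set:
g = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:a ex:name "A" ; ex:email "a@x" , "a@y" .
ex:b ex:name "B" .
""", format="turtle")

samples = [URIRef("http://example.org/a"), URIRef("http://example.org/b")]
for prop, (lo, hi) in infer_cardinalities(g, samples).items():
    print(prop, lo, hi)   # e.g. ex:email -> (0, 2), ex:name -> (1, 1)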
Conference Paper
Full-text available
The motivation for this work stems from an EU-funded project which focuses on leveraging digital tools for improving renovation processes. In particular, specific tools require Linked Building Data (LBD) that needs to fulfil application-specific exchange requirements. In this research, we focus on two different use cases to investigate how to validate a Linked Building Data model. First, we study how to minimise data loss and errors when data is converted and brought into an LBD data store; the usage of unit tests to improve conversion quality is introduced. The second use case focuses on how a Model View Definition (MVD) can be formed in LBD for evaluating the energy performance of renovation designs in energy simulation. This feasibility study shows that unit tests can be written for the conversion. Besides validation, the methods shown in the study can be used to create model views for LBD data using SHACL.
Poster
Full-text available
In order to perform any operation on an RDF graph, it is advisable to know the expected topology of the targeted information. Some technologies and syntaxes, such as ShEx and SHACL, have been developed in recent years to describe the expected shapes in an RDF graph. In general, a domain expert can use these syntaxes to define shapes in a graph, with two main purposes: data validation and documentation. However, there are some scenarios in which the schema cannot be predicted a priori, but emerges as the graph is filled with new information. In those cases, the shapes are latent in the current content. We have developed a prototype which is able to infer shapes of classes in a knowledge graph and have used it with classes of the DBpedia ontology. We serialize our results using ShEx.
Article
Full-text available
RDF and Linked Data have broad applicability across many fields, from aircraft manufacturing to zoology. Requirements for detecting bad data differ across communities, fields, and tasks, but nearly all involve some form of data validation. This book introduces data validation and describes its practical use in day-to-day data exchange. The Semantic Web offers a bold, new take on how to organize, distribute, index, and share data. Using Web addresses (URIs) as identifiers for data elements enables the construction of distributed databases on a global scale. Like the Web, the Semantic Web is heralded as an information revolution, and also like the Web, it is encumbered by data quality issues. The quality of Semantic Web data is compromised by the lack of resources for data curation, for maintenance, and for developing globally applicable data models. At the enterprise scale, these problems have conventional solutions. Master data management provides an enterprise-wide vocabulary, while constraint languages capture and enforce data structures. Filling a need long recognized by Semantic Web users, shapes languages provide models and vocabularies for expressing such structural constraints. This book describes two technologies for RDF validation: Shape Expressions (ShEx) and Shapes Constraint Language (SHACL), the rationales for their designs, a comparison of the two, and some example applications.
Article
Full-text available
In complex reasoning tasks, as expressible by Answer Set Programming (ASP), problems often permit for multiple solutions. In dynamic environments, where knowledge is continuously changing, the question arises how a given model can be incrementally adjusted relative to new and outdated information. This paper introduces Ticker, a prototypical engine for well-defined logical reasoning over streaming data. Ticker builds on a practical fragment of the recent rule-based language LARS which extends Answer Set Programming for streams by providing flexible expiration control and temporal modalities. We discuss Ticker's reasoning strategies: First, the repeated one-shot solving mode calls Clingo on an ASP encoding. We show how this translation can be incrementally updated when new data is streaming in or time passes by. Based on this, we build on Doyle's classic justification-based truth maintenance system (TMS) to update models of non-stratified programs. Finally, we empirically compare the obtained evaluation mechanisms. This paper is under consideration for acceptance in TPLP.
Conference Paper
Full-text available
We build on our earlier finding that more than 95 % of the triples in actual RDF triple graphs have a remarkably tabular structure, whose schema does not necessarily follow from explicit metadata such as ontologies, but which an RDF store can automatically derive by looking at the data using so-called “emergent schema” detection techniques. In this paper we investigate how computers, and in particular RDF stores, can take advantage of this emergent schema to more compactly store RDF data and more efficiently optimize and execute SPARQL queries. To this end, we contribute techniques for efficient emergent schema aware RDF storage and new query operator algorithms for emergent schema aware scans and joins. In all, these techniques allow RDF schema processors to fully catch up with relational database techniques in terms of rich physical database design options and efficiency, without requiring a rigid upfront schema structure definition.
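The "emergent schema" idea rests on grouping subjects by the exact set of properties they carry (their characteristic set); subjects that share a characteristic set form one of the largely tabular structures the paper exploits. Below is a small rdflib sketch of that grouping step only, not the storage layout or the query-operator machinery described in the paper; the sample graph is invented.

# Characteristic-set grouping: the detection step behind "emergent schema".
# Storage and join operators from the paper are out of scope.
# Requires: pip install rdflib
from collections import defaultdict
from rdflib import Graph

def characteristic_sets(graph: Graph) -> dict:
    """Map each distinct property set to the subjects that exhibit it."""
    props_per_subject = defaultdict(set)
    for s, p, _ in graph:
        props_per_subject[s].add(p)
    groups = defaultdict(list)
    for s, props in props_per_subject.items():
        groups[frozenset(props)].append(s)
    return groups

g = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:p1 ex:name "A" ; ex:age 30 .
ex:p2 ex:name "B" ; ex:age 41 .
ex:c1 ex:name "Oslo" ; ex:country ex:no .
""", format="turtle")

for props, subjects in characteristic_sets(g).items():
    print(sorted(props), "->", len(subjects), "subjects")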
Article
Full-text available
Access to hundreds of knowledge bases has been made available on the Web through public SPARQL endpoints. Unfortunately, few endpoints publish descriptions of their content (e.g., using VoID). It is thus unclear how agents can learn about the content of a given SPARQL endpoint or, relatedly, find SPARQL endpoints with content relevant to their needs. In this paper, the authors investigate the feasibility of a system that gathers information about public SPARQL endpoints by querying them directly about their own content. With the advent of SPARQL 1.1 and features such as aggregates, it is now possible to specify queries whose results would form a detailed profile of the content of the endpoint, comparable with a large subset of VoID. In theory it would thus be feasible to build a rich centralised catalogue describing the content indexed by individual endpoints by issuing them SPARQL (1.1) queries; this catalogue could then be searched and queried by agents looking for endpoints with content they are interested in. In practice, however, the coverage of the catalogue is bounded by the limitations of public endpoints themselves: some may not support SPARQL 1.1, some may return partial responses, some may throw exceptions for expensive aggregate queries, etc. The authors' goal in this paper is thus twofold: (i) using VoID as a bar, to empirically investigate the extent to which public endpoints can describe their own content, and (ii) to build and analyse the capabilities of a best-effort online catalogue of current endpoints based on the (partial) results collected.
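The kind of self-description query the authors issue can be approximated with a handful of SPARQL 1.1 aggregates. The sketch below sends a few VoID-like profile queries to a public endpoint (DBpedia is used only as an example, and the query set is an illustrative choice, not the paper's); as the paper stresses, real endpoints may time out, truncate results, or reject aggregate queries, so each query is wrapped in a try/except.

# Best-effort endpoint profiling with SPARQL 1.1 aggregates (VoID-like
# statistics). Endpoint and query set are illustrative.
# Requires: pip install sparqlwrapper
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

PROFILE_QUERIES = {
    "triples":             "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }",
    "distinct_subjects":   "SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s ?p ?o }",
    "distinct_predicates": "SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE { ?s ?p ?o }",
    "distinct_classes":    "SELECT (COUNT(DISTINCT ?c) AS ?n) WHERE { ?s a ?c }",
}

def profile(endpoint: str) -> dict:
    stats = {}
    for name, query in PROFILE_QUERIES.items():
        sparql = SPARQLWrapper(endpoint, agent="endpoint-profiler-example/0.1")
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        sparql.setTimeout(60)
        try:
            bindings = sparql.query().convert()["results"]["bindings"]
            stats[name] = int(bindings[0]["n"]["value"]) if bindings else None
        except Exception as exc:            # timeouts, partial responses, ...
            stats[name] = f"failed: {exc}"
    return stats

print(profile(ENDPOINT))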
Conference Paper
Full-text available
The Linked Data initiative continues to grow, making more datasets available; however, discovering the type of data contained in a dataset, its structure, and the vocabularies used remains a challenge hindering querying and reuse. VoID descriptions provide a starting point, but a more detailed analysis is required to unveil implicit vocabulary usage such as common data patterns. Such analysis helps the selection of datasets, the formulation of effective queries, or the identification of quality issues. Loupe is an online tool for inspecting datasets by looking at both implicit data patterns as well as explicit vocabulary definitions in data. This demo paper presents the dataset inspection capabilities of Loupe.
Article
Full-text available
The Resource Description Framework (RDF) is the W3C’s graph data model for Semantic Web applications. We study the problem of RDF graph summarization: given an input RDF graph G, find an RDF graph S_G which summarizes G as accurately as possible, while being possibly orders of magnitude smaller than the original graph. Our approach is query-oriented, i.e., querying a summary of a graph should reflect whether the query has some answers against this graph. The summaries are aimed as a help for query formulation and optimization. We introduce two summaries: a baseline which is compact and simple and satisfies certain accuracy and representativeness properties, but may oversimplify the RDF graph, and a refined one which trades some of these properties for more accuracy in representing the structure.
Conference Paper
Full-text available
Governments and public administrations started recently to publish large amounts of structured data on the Web, mostly in the form of tabular data such as CSV files or Excel sheets. Various tools and projects have been launched aiming at facilitating the lifting of tabular data to reach semantically structured and linked data. However, none of these tools supported a truly incremental, pay-as-you-go data publication and mapping strategy, which enables effort sharing between data owners, community experts and consumers. In this article, we present an approach for enabling the user-driven semantic mapping of large amounts of tabular data. We devise a simple mapping language for tabular data, which is easy to understand even for casual users, but expressive enough to cover the vast majority of potential tabular mapping use cases. We outline a formal approach for mapping tabular data to RDF. Default mappings are automatically created and can be revised by the community using a semantic wiki. The mappings are executed using a sophisticated streaming RDB2RDF conversion. We report about the deployment of our approach at the Pan-European data portal PublicData.eu, where we transformed and enriched almost 10,000 datasets accounting for 7.3 billion triples.
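At its core, the lifting step the paper automates turns each CSV row into a resource and each column into an RDF property according to a (user-revisable) mapping. The sketch below shows only that core with Python's csv module and rdflib; the column names, namespace, and mapping are invented, and the paper's actual mapping language, wiki-based revision workflow, and streaming RDB2RDF conversion are not reproduced.

# Minimal tabular-to-RDF lifting sketch (invented mapping; the paper's
# mapping language and streaming conversion are far richer).
# Requires: pip install rdflib
import csv, io
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")

CSV_DATA = """id,name,population
1,Oslo,709037
2,Bergen,291940
"""

# Column -> (property, datatype) mapping; a default mapping like this could
# later be revised by the community, as the paper proposes.
MAPPING = {
    "name": (EX.name, XSD.string),
    "population": (EX.population, XSD.integer),
}

g = Graph()
g.bind("ex", EX)
for row in csv.DictReader(io.StringIO(CSV_DATA)):
    subject = URIRef(EX["city/" + row["id"]])
    g.add((subject, RDF.type, EX.City))
    for column, (prop, datatype) in MAPPING.items():
        g.add((subject, prop, Literal(row[column], datatype=datatype)))

print(g.serialize(format="turtle"))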
Article
Full-text available
This article defines C-SPARQL, an extension of SPARQL whose distinguishing feature is the support of continuous queries, i.e. queries registered over RDF data streams and then continuously executed. Queries consider windows, i.e. the most recent triples of such streams, observed while data is continuously flowing. Supporting streams in RDF format guarantees interoperability and opens up important applications, in which reasoners can deal with evolving knowledge over time. C-SPARQL is presented by means of a full specification of the syntax, a formal semantics, and a comprehensive set of examples, relative to urban computing applications, that systematically cover the SPARQL extensions. The expression of meaningful queries over streaming data is strictly connected to the availability of aggregation primitives, thus C-SPARQL also includes extensions in this respect.
Article
Full-text available
RDF is a graph-based data model which is widely used for Semantic Web and Linked Data applications. In this paper we describe a Shape Expression definition language which enables RDF validation through the declaration of constraints on the RDF model. Shape Expressions can be used to validate RDF data, communicate expected graph patterns for interfaces and generate user interface forms. In this paper we describe the syntax and the formal semantics of Shape Expressions using inference rules. Shape Expressions can be seen as a domain-specific language to define shapes of RDF graphs based on regular expressions. Attached to Shape Expressions are semantic actions, which provide an extension point for validation or for arbitrary code execution, such as those in parser generators. Using semantic actions, it is possible to augment the validation expressiveness of Shape Expressions and to transform RDF graphs in an easy way. We have implemented several validation tools that check if an RDF graph matches against a Shape Expressions schema and infer the corresponding shapes. We have also implemented two extensions, called GenX and GenJ, that leverage the predictability of the graph traversal and create ordered, closed-content XML/JSON documents, providing a simple, declarative mapping from RDF data to XML and JSON documents.
Article
Full-text available
Sensor networks are increasingly being deployed in the environment for many different purposes. The observations that they produce are made available with heterogeneous schemas, vocabularies and data formats, making it difficult to share and reuse this data, for other purposes than those for which they were originally set up. The authors propose an ontology-based approach for providing data access and query capabilities to streaming data sources, allowing users to express their needs at a conceptual level, independent of implementation and language-specific details. In this article, the authors describe the theoretical foundations and technologies that enable exposing semantically enriched sensor metadata, and querying sensor observations through SPARQL extensions, using query rewriting and data translation techniques according to mapping languages, and managing both pull and push delivery modes.
Article
Full-text available
Rather than thinking of XML and RDF as competing alternatives, we regard XML and RDF as complementary technologies. This paper discusses bidirectional mapping between XML and RDF, the problem of lift. Rather than inventing a new mapping language, we use information already available in the XML schema. The result is a tool, Gloze that works within the Jena framework. This is illustrated by examples from an application to Model Based Automation.
Article
Full-text available
One promise of Semantic Web applications is to seamlessly deal with heterogeneous data. The Extensible Markup Language (XML) has become widely adopted as an almost ubiquitous interchange format for data, along with transformation languages like XSLT and XQuery to translate data from one XML format into another. However, the more recent Resource Description Framework (RDF) has become another popular standard for data representation and exchange, supported by its own query language SPARQL, that enables extraction and transformation of RDF data. Being able to work with XML and RDF using a common framework eliminates several unnecessary steps that are currently required when handling both formats side by side. In this paper we present the XSPARQL language that, by combining XQuery and SPARQL, allows to query XML and RDF data using the same framework and transform data from one format into the other. We focus on the semantics of this combined language and present an implementation, including discussion of query optimisations along with benchmark evaluation.
Article
Full-text available
The term "Linked Data" refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions-the Web of Data. In this article, the authors present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
Conference Paper
Full-text available
The role of metadata is gaining importance due to today's growth of multimedia content. Currently, XML is the standard for data interchange. However, as XML Schemas do not express semantics but rather the document structure, there is a lack of semantic interoperability regarding current (XML-based) metadata standards. By using Semantic Web technologies, ontologies can be created to describe the semantics of a particular metadata format. A problem is that the existing XML data (compliant with a particular metadata format) cannot be used by an ontology, implying the need for a conversion of XML data to RDF instances. In this paper, a generic approach is proposed for the transformation of XML data into RDF instances in an ontology-dependent way. By means of a mapping document, the link is described between an XML Schema (describing the structure of particular XML data) and an OWL ontology. Our approach is illustrated by applying it to the DIG35 specification, which is an XML-based metadata standard for the description of digital images.
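Reduced to its essence, the ontology-dependent conversion consists of walking the XML tree and emitting RDF triples according to a mapping from XML elements to ontology classes and properties. The short sketch below uses xml.etree and rdflib with an invented mini-document and mapping; it stands in for the paper's mapping-document approach and DIG35 use case without reproducing either.

# Sketch: mapping-driven XML-to-RDF conversion (invented XML and mapping;
# the paper's mapping document links an XML Schema to an OWL ontology).
# Requires: pip install rdflib
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/")

XML_DOC = """
<people>
  <person id="p1">
    <name>Alice</name>
    <email>alice@example.org</email>
  </person>
</people>
"""

# element tag -> ontology term (class for container elements, property for leaves)
CLASS_MAP = {"person": EX.Person}
PROPERTY_MAP = {"name": EX.name, "email": EX.email}

g = Graph()
g.bind("ex", EX)
root = ET.fromstring(XML_DOC)
for element in root:
    if element.tag not in CLASS_MAP:
        continue
    subject = URIRef(EX[element.get("id", "node")])
    g.add((subject, RDF.type, CLASS_MAP[element.tag]))
    for child in element:
        if child.tag in PROPERTY_MAP and child.text:
            g.add((subject, PROPERTY_MAP[child.tag], Literal(child.text)))

print(g.serialize(format="turtle"))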
Conference Paper
Full-text available
In this paper we address the problem of scalable, native and adaptive query processing over Linked Stream Data integrated with Linked Data. Linked Stream Data consists of data generated by stream sources, e.g., sensors, enriched with semantic descriptions, following the standards proposed for Linked Data. This enables the integration of stream data with Linked Data collections and facilitates a wide range of novel applications. Currently available systems use a “black box” approach which delegates the processing to other engines such as stream/event processing engines and SPARQL query processors by translating to their provided languages. As the experimental results described in this paper show, the need for query translation and data transformation, as well as the lack of full control over the query execution, pose major drawbacks in terms of efficiency. To remedy these drawbacks, we present CQELS (Continuous Query Evaluation over Linked Streams), a native and adaptive query processor for unified query processing over Linked Stream Data and Linked Data. In contrast to the existing systems, CQELS uses a “white box” approach and implements the required query operators natively to avoid the overhead and limitations of closed system regimes. CQELS provides a flexible query execution framework with the query processor dynamically adapting to the changes in the input data. During query execution, it continuously reorders operators according to some heuristics to achieve improved query execution in terms of delay and complexity. Moreover, external disk access on large Linked Data collections is reduced with the use of data encoding and caching of intermediate query results. To demonstrate the efficiency of our approach, we present extensive experimental performance evaluations in terms of query execution time, under varied query types, dataset sizes, and number of parallel queries. These results show that CQELS outperforms related approaches by orders of magnitude.
Conference Paper
Full-text available
The purpose of this paper is to provide a rigorous comparison of six query languages for RDF. We outline and categorize features that any RDF query language should provide and compare the individual languages along these features. We describe several practical usage examples for RDF queries and conclude with a comparison of the expressiveness of the particular query languages. The use cases, sample data and queries for the respective languages are available on the web.
Article
Full-text available
A generic transformation of XML data into the Resource Description Framework (RDF) and its implementation by XSLT transformations is presented. It was developed by the grid integration project for robotic telescopes of AstroGrid-D to provide network communication through the Remote Telescope Markup Language (RTML) to its RDF based information service. The transformation's generality is explained by this example. It automates the transformation of XML data into RDF and thus solves this problem of semantic computing. Its design also permits the inverse transformation but this is not yet implemented.
Conference Paper
Full-text available
Interpreting legacy XML documents is a great challenge for realizing the vision of the Semantic Web (SW). This paper presents an algorithm to automatically transform XML data into RDF, the foundation language of the SW. Our approach maps element definitions stored in an XML schema to an RDF Schema ontology, where the ontology is used to describe the meaning and relationships between XML elements. The RDF results, which contain the XML data at the semantic level while retaining their nesting structure, make huge XML data sources on the current Web available for the SW.
Conference Paper
Full-text available
Interpreting the XML data on the current Web into sources that can be used by the Semantic Web has received great attention in recent years. In this paper, we propose a procedure for transforming valid XML documents into RDF by using vocabularies of RDF Schema. The first objective is to obtain classes and properties from the labels in an XML document by accessing its DTD. After that, we can interpret the XML data as RDF triples by using some vocabularies of RDF Schema. The main advantage of our approach is that it ensures the integrity of the structure and meaning of the original XML documents while transforming them into RDF. This procedure can be used for any kind of valid XML document.
Article
RDF validation is a field where the Semantic Web community is currently focusing attention. In other communities, like XML or databases, data validation and quality are considered a key part of their ecosystems. Besides, there is a recent trend to migrate data from different sources to Semantic Web formats. These transformations and mappings between different technologies have an economic and technological impact on society. In order to facilitate this transformation, we propose a set of mappings that can be used to convert from XML Schema to Shape Expressions (ShEx), a validation language for RDF. We also present a prototype that implements a subset of the proposed mappings, and an example application to obtain a ShEx schema from an XML Schema. We consider that this work and the development of other format mappings could lead to improved data interoperability by reducing the technological gap.
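One of the simplest mappings of this kind, turning an xs:element with a built-in simple type into a ShEx triple constraint, can be pictured with a few lines of Python; the XSD fragment and the direct xs:string-to-xsd:string correspondence shown here are illustrative assumptions and cover only a tiny fraction of the mappings the article discusses.

# Illustrative XSD-to-ShEx mapping for the simplest case only:
# xs:element with a built-in type becomes a ShEx triple constraint.
import xml.etree.ElementTree as ET

XSD_FRAGMENT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="Person">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="age"  type="xs:integer" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""

XS = "{http://www.w3.org/2001/XMLSchema}"

root = ET.fromstring(XSD_FRAGMENT)
for ctype in root.iter(XS + "complexType"):
    constraints = []
    for elem in ctype.iter(XS + "element"):
        if elem.get("type") is None:
            continue
        datatype = elem.get("type").replace("xs:", "xsd:")
        cardinality = " ?" if elem.get("minOccurs") == "0" else ""
        constraints.append(f"  :{elem.get('name')} {datatype}{cardinality}")
    shape = "<{}Shape> {{\n{}\n}}".format(ctype.get("name"), " ;\n".join(constraints))
    print(shape)
# Prints a ShEx shape roughly like:
# <PersonShape> {
#   :name xsd:string ;
#   :age xsd:integer ?
# }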
Conference Paper
In this paper, we propose a novel data-driven schema for large-scale heterogeneous knowledge graphs inspired by Formal Concept Analysis (FCA). We first extract the sets of properties associated with individual entities; these property sets (aka characteristic sets) are annotated with cardinalities and used to induce a lattice based on set-containment relations, forming a natural hierarchical structure describing the knowledge graph. We then propose an algebra over such schema lattices, which allows computing diffs between lattices (for example, to summarise the changes from one version of a knowledge graph to another), adding lattices (for example, to project future changes), and so forth. While we argue that this lattice structure (and associated algebra) may have various applications, we currently focus on the use case of modelling and predicting the dynamic behaviour of knowledge graphs. Along those lines, we instantiate and evaluate our methods for analysing how versions of the Wikidata knowledge graph have changed over a period of 11 weeks. We propose algorithms for constructing the lattice-based schema from Wikidata, and evaluate their efficiency and scalability. We then evaluate the use of the resulting schema(ta) for predicting how the knowledge graph will evolve in future versions.
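The lattice construction can be illustrated in a few lines: extract the characteristic set of each entity, order the distinct sets by set containment, and a diff between two knowledge-graph versions is then just the symmetric difference of their characteristic-set collections. The sketch below shows those three steps on plain Python sets; it deliberately ignores the cardinality annotations, the transitive reduction, and the Wikidata-scale engineering discussed in the paper.

# Sketch of the lattice-style schema idea: characteristic sets ordered by
# containment, plus a naive version diff. Cardinality annotations and
# efficiency concerns from the paper are omitted.
from collections import defaultdict

def characteristic_sets(triples) -> set:
    props = defaultdict(set)
    for s, p, _ in triples:
        props[s].add(p)
    return {frozenset(ps) for ps in props.values()}

def containment_edges(charsets) -> list:
    """Edges of the containment order (without transitive reduction)."""
    return [(a, b) for a in charsets for b in charsets if a < b]

def diff(old, new):
    """Characteristic sets added and removed between two versions."""
    return new - old, old - new

v1 = characteristic_sets([("e1", "name", "A"), ("e1", "age", 3),
                          ("e2", "name", "B")])
v2 = characteristic_sets([("e1", "name", "A"), ("e1", "age", 3),
                          ("e2", "name", "B"), ("e2", "age", 5)])

print(containment_edges(v1))   # one edge: {name} is contained in {name, age}
print(diff(v1, v2))            # added: set(), removed: {frozenset({'name'})}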
Article
We give an account of stable reasoning, a recent and novel approach to problem solving from a formal, logical point of view. We describe the underlying logic of stable reasoning and illustrate how it is used to model different domains and solve practical reasoning problems. We discuss some of the main differences with respect to reasoning in classical logic and we examine an ongoing research programme for the rational reconstruction of human knowledge that may be considered a successor to the logical empiricists’ programme of the mid-twentieth century.
Conference Paper
Although the link prediction problem, where missing relation assertions are predicted, has been widely researched, error detection did not receive as much attention. In this paper, we investigate the problem of error detection in relation assertions of knowledge graphs, and we propose an error detection method which relies on path and type features used by a classifier for every relation in the graph exploiting local feature selection. We perform an extensive evaluation on a variety of datasets, backed by a manual evaluation on DBpedia and NELL, and we propose and evaluate heuristics for the selection of relevant graph paths to be used as features in our method.
Conference Paper
We present a formal semantics and proof of soundness for shapes schemas, an expressive schema language for RDF graphs that is the foundation of Shape Expressions Language 2.0. It can be used to describe the vocabulary and the structure of an RDF graph, and to constrain the admissible properties and values for nodes in that graph. The language defines a typing mechanism called shapes against which nodes of the graph can be checked. It includes an algebraic grouping operator, a choice operator and cardinality constraints for the number of allowed occurrences of a property. Shapes can be combined using Boolean operators, and can use possibly recursive references to other shapes.
Conference Paper
An XML schema specifies the structural properties of XML documents generated from the schema and, thus, is useful to manage XML data efficiently. However, there are often XML documents without a valid schema or with an incorrect schema in practice. This leads us to study the problem of inferring a Relax NG schema from a set of XML documents that are presumably generated from a specific XML schema. Relax NG is an XML schema language developed for the next generation of XML schema languages such as document type definitions (DTDs) and XML Schema Definitions (XSDs). Regular hedge grammars accept regular tree languages and the design of Relax NG is closely related with regular hedge grammars. We develop an XML schema inference system using hedge grammars. We employ a genetic algorithm and state elimination heuristics in the process of retrieving a concise Relax NG schema. We present experimental results using real-world benchmark.
Article
This paper introduces the LOD Laundromat meta-dataset, a continuously updated RDF meta-dataset that describes the documents crawled, cleaned and (re)published by the LOD Laundromat. This meta-dataset of over 110 million triples contains structural information for more than 650,000 documents (and growing). Dataset meta-data is often not provided alongside published data, is incomplete, or is incomparable given the way it was generated. The LOD Laundromat meta-dataset provides a wide range of structural dataset properties, such as the number of triples in LOD Laundromat documents, the average degree in documents, and the distinct number of Blank Nodes, Literals and IRIs. This makes it a particularly useful dataset for data comparison and analytics, as well as for the global study of the Web of Data. This paper presents the dataset, its requirements, and its impact.
Conference Paper
Recently, Semantic Web Technologies (SWT) have been introduced and adopted to address the problem of enterprise data integration (e.g., to solve the problem of terminological and conceptual heterogeneity within large organizations). One of the challenges of adopting SWT for enterprise data integration is to provide the means to define and validate structural constraints over Resource Description Framework (RDF) graphs. This is difficult since RDF graph axioms behave like implications instead of structural constraints. SWT researchers and practitioners have proposed several solutions to address this challenge (e.g., SPIN and Shape Expressions). However, to the best of our knowledge, none of them provides an integrated solution within open source ontology editors (e.g., Protégé). We identified this gap and developed SHACL4P, a Protégé plugin for defining and validating Shapes Constraint Language (SHACL) constraints, the upcoming W3C standard for constraint validation, within the Protégé ontology editor.
Conference Paper
The original SPARQL proposal was often criticized for its inability to navigate through the structure of RDF documents. For this reason property paths were introduced in SPARQL 1.1, but to date there are no theoretical studies examining how their addition to the language affects main computational tasks such as query evaluation, query containment, and query subsumption. In this paper we tackle all of these problems and show that although the addition of property paths has no impact on query evaluation, they do make the containment and subsumption problems substantially more difficult.
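As a quick reminder of what property paths add, the sketch below evaluates a transitive path (ex:knows+) with rdflib's SPARQL 1.1 engine over a toy graph; the data and vocabulary are invented, and the complexity results of the paper are of course not visible at this scale.

# Evaluating a SPARQL 1.1 property path (transitive closure) with rdflib.
# Toy data; illustrates the feature the paper studies, not its results.
# Requires: pip install rdflib
from rdflib import Graph

g = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob .
ex:bob   ex:knows ex:carol .
""", format="turtle")

QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?x WHERE { ex:alice ex:knows+ ?x }
"""

for row in g.query(QUERY):
    print(row.x)        # ex:bob and ex:carol (one hop and two hops away)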
Article
In approximate computing, programs gain efficiency by allowing occasional errors. Controlling the probabilistic effects of this approximation remains a key challenge. We propose a new approach where programmers use a type system to communicate high-level constraints on the degree of approximation. A combination of type inference, code specialization, and optional dynamic tracking makes the system expressive and convenient. The core type system captures the probability that each operation exhibits an error and bounds the probability that each expression deviates from its correct value. Solver-aided type inference lets the programmer specify the correctness probability on only some variables (program outputs, for example) and automatically fills in other types to meet these specifications. An optional dynamic type helps cope with complex run-time behavior where static approaches are insufficient. Together, these features interact to yield a high degree of programmer control while offering a strong soundness guarantee. We use existing approximate-computing benchmarks to show how our language, DECAF, maintains a low annotation burden. Our constraint-based approach can encode hardware details, such as finite degrees of reliability, so we also use DECAF to examine implications for approximate hardware design. We find that multi-level architectures can offer advantages over simpler two-level machines and that solver-aided optimization improves efficiency.
Article
Without any doubt, XML is currently a de facto standard for data representation. Its popularity is due to the fact that it is well-defined, easy to use and, at the same time, sufficiently expressive. However, most XML documents are still created without a description of their structure, i.e. an XML schema. Since knowledge of the XML schema is crucial for various data processing approaches as well as a standard optimization tool, approaches to (semi-)automatic inference of XML schemas have become an interesting research problem. Recently, several approaches have been introduced that improve particular steps of the inference process. The problem is how to compare such approaches, how to find the optimal ones and how to join them to get the best possible result. In this paper, we introduce jInfer, a general framework for XML schema inference. It represents an easily extensible tool that enables one to implement, test and compare new modules of the inference process. Since the compulsory parts of the process, such as parsing of XML data, visualization of automata, transformation of automata to XML schema languages, etc., are implemented, the user can focus purely on the research and the improved aspect of the inference process. We describe not only the framework, but the area of schema inference in general, including related work and open problems.
Conference Paper
Despite the significant number of existing tools, incorporating data from multiple sources and different formats into the Linked Open Data cloud remains complicated. No mapping formalisation exists to define how to map such heterogeneous sources into RDF in an integrated and interoperable fashion. This paper introduces the RML mapping language, a generic language based on an extension over R2RML, the W3C standard for mapping relational databases into RDF. Broadening RML’s scope, the language becomes source-agnostic and extensible, while facilitating the definition of mappings of multiple heterogeneous sources. This leads to higher integrity within datasets and richer interlinking among resources.
Article
In this paper we thoroughly cover the issue of blank nodes, which have been defined in RDF as ‘existential variables’. We first introduce the theoretical precedent for existential blank nodes from first order logic and incomplete information in database theory. We then cover the different (and sometimes incompatible) treatment of blank nodes across the W3C stack of RDF-related standards. We present an empirical survey of the blank nodes present in a large sample of RDF data published on the Web (the BTC–2012 dataset), where we find that 25.7% of unique RDF terms are blank nodes, that 44.9% of documents and 66.2% of domains featured use of at least one blank node, and that aside from one Linked Data domain whose RDF data contains many “blank node cycles”, the vast majority of blank nodes form tree structures that are efficient to compute simple entailment over. With respect to the RDF-merge of the full data, we show that 6.1% of blank-nodes are redundant under simple entailment. The vast majority of non-lean cases are isomorphisms resulting from multiple blank nodes with no discriminating information being given within an RDF document or documents being duplicated in multiple Web locations. Although simple entailment is NP-complete and leanness-checking is coNP-complete, in computing this latter result, we demonstrate that in practice, real-world RDF graphs are sufficiently “rich” in ground information for problematic cases to be avoided by non-naive algorithms.
Article
Scripting the Semantic Web requires accessing and transforming RDF data. We present XSLT+SPARQL, a set of extension functions for XSLT which allow stylesheets to directly access RDF data, independently of any serialization syntax, by means of SPARQL queries. Using these functions, XSLT stylesheets can retrieve, query, merge and transform data from the Semantic Web. We illustrate the functionality of our proposal with an example script which creates XHTML pages from the contents of DBpedia.
Conference Paper
XML vocabularies can be characterized as those designed for the convenience of authors or software developers, called colloquial, and those designed to have a trivial mapping to a non-XML data structure, which we call non-colloquial. Mapping colloquial vocabularies into other formats (e.g., symbolic logic or RDF) is a powerful tool for making colloquial XML tractable. Specifying this mapping is a way of documenting what the elements and attributes are supposed to mean and how they are to be used. If this is done only in English prose, humans can make use of it, but not machines. If machine-readable syntax is used to specify a mapping from the XML vocabulary into some well-known target syntax, the mapping can benefit both humans and machines. Simple examples illustrate how mappings can be defined using XSLT and how they can be attached to the schema defining the XML vocabulary.