Inference of Latent Shape Expressions
Associated to DBpedia Ontology
Daniel Fernández-Álvarez¹, Herminio García-González¹, Johannes Frey²,
Sebastian Hellmann², and José Emilio Labra Gayo¹
¹ Department of Computer Science, University of Oviedo,
Oviedo 33007, Spain
{danifdezalvarez,herminiogg}@gmail.com, labra@uniovi.es
² Agile Knowledge Engineering and Semantic Web, University of Leipzig,
04109 Leipzig, Germany
{frey,hellmann}@informatik.uni-leipzig.de
Abstract. Before performing any operation on an RDF graph, it is
advisable to know the expected topology of the targeted information.
Several technologies and syntaxes, such as ShEx and SHACL, have been
developed in recent years to describe the expected shapes in an RDF
graph. In general, a domain expert can use these syntaxes to define
shapes in a graph, with two main purposes: data validation and
documentation. However, there are scenarios in which the schema cannot
be predicted a priori; instead, it emerges as the graph is filled with
new information. In those cases, the shapes are latent in the current
content. We have developed a prototype that infers shapes of classes in
a knowledge graph and have applied it to classes of the DBpedia
ontology. We serialize our results using ShEx.
Keywords: RDF · ShEx · Inference · DBpedia · Schema
1 Introduction
The most common way to query a Resource Description Framework (RDF)
store is SPARQL. To write an effective SPARQL query against a Knowledge
Graph (KG), one may need to know the expected topology of the KG. A
wrong choice of properties, datatypes or classes may cause a query to
ignore relevant information or to update data in a way that does not
fit the current topology of the KG.
Ontologies define the meaning and the correct usage of properties and
classes, but they are not intended to specify the expected shape of a
group of nodes in the context of a specific KG. Other languages, such
as the Extensible Markup Language (XML), have technologies for defining
the expected shape of the elements in a given dataset, including
RelaxNG and XML Schema. In the RDF world, Shape Expressions (ShEx) [5]
and the Shapes Constraint Language (SHACL) [4] have been proposed for
describing and validating RDF.
Usually, the topology of a given KG can be designed or predicted by
domain experts in controlled scenarios. However, there are situations
in which a KG does not have a planned schema, and the shapes emerge as
the content keeps growing. Illustrative examples are community-driven
projects such as DBpedia or Wikidata. In those cases, as suggested in
[2], discovering the latent schemata associated with classes by
applying inference can be useful in several ways:
– Guideline for users. Knowing the shape associated with a class allows
for effective querying of its instances, since it shows which
properties are used to describe the knowledge in their immediate
neighborhood.
– Measure of data quality. The inference process may produce shapes
with different levels of trustworthiness, depending on how
homogeneously the knowledge is represented. That trustworthiness may be
used as a data quality measure.
We implemented a prototype that infers Shape Expressions associated
with the classes in a KG and applied it to the English chapter of
DBpedia³. Our prototype calculates a score of how trustworthy each
constraint inferred in a shape is, i.e., what proportion of the
instances actually conform to it. Then, it serializes the results using
ShEx. Other works have already studied emergent schemata in RDF sources
[3] and the serialization or visualization of this information [1]. The
novelty of our approach lies in the use of ShEx.
2 Shape Inference
Our prototype receives as input a list of class URIs selected from a KG. Then,
it infers a shape for each class in that list. It works in three main stages:
1. Instance tracking. Find all the instances of the target classes.
2. Class profiling. For each class, explore all the triples in the
graph whose subject is one of its instances, and use them to build a
profile of the class. A profile consists of a list of triple
constraints⁴ that can be induced from the tracked triples.
3. Serialization to ShEx. Turn each class profile into a ShEx shape
associated with the class. During this stage we include configurable
features to filter some triple constraints.
Listing 1.1 shows an example of a small RDF graph about countries, and
Listing 1.2 presents the shape that our prototype infers from it⁵.
³ The source code used, as well as an extended explanation of our
experiments, is available at
https://github.com/DaniFdezAlvarez/dbpedia-shexer
⁴ Triple constraints are the basic building block in ShEx. They are
composed of a property, a node constraint and a cardinality.
⁵ The prefixes employed in this paper are the common ones that can be
resolved by the service http://prefix.cc/
Listing 1.1. RDF example graph
dbr:Spain rdf:type dbo:Country ;
    dbp:capital dbr:Madrid ;
    rdfs:label "Spain" ;
    rdfs:label "Kingdom of Spain" .
dbr:France rdf:type dbo:Country ;
    dbp:capital dbr:Paris ;
    rdfs:label "France" .
Listing 1.2. Example Country Shape
:Country
{
    rdf:type [dbo:Country] ;    # 100%
    dbp:capital IRI ;           # 100%
    rdfs:label xsd:string +     # 100%
                                # 50% have cardinality {1}
}
As can be seen in Listing 1.2, every triple constraint induced in a
shape is associated with a percentage that indicates what proportion of
the instances of the target class conform to the constraint. In real
scenarios, it is common that not all of the instances conform to a
given constraint, with the exception of the constraint
rdf:type [:nameOfTheClass], which they all share.
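The percentages in Listing 1.2 can be reproduced with a short Python
sketch of the instance tracking and class profiling stages. It is only
an illustration and not the prototype's code: the (kind, value)
encoding of objects and the names node_constraint and profile_class are
assumptions made for this example, and each instance simply votes for
the exact cardinality it exhibits and for the + closure.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"

# Toy graph of Listing 1.1. Objects are (kind, value) pairs, where kind is
# "iri" or the literal's datatype; this encoding is an assumption of the sketch.
TRIPLES = [
    ("dbr:Spain", "rdf:type", ("iri", "dbo:Country")),
    ("dbr:Spain", "dbp:capital", ("iri", "dbr:Madrid")),
    ("dbr:Spain", "rdfs:label", ("xsd:string", "Spain")),
    ("dbr:Spain", "rdfs:label", ("xsd:string", "Kingdom of Spain")),
    ("dbr:France", "rdf:type", ("iri", "dbo:Country")),
    ("dbr:France", "dbp:capital", ("iri", "dbr:Paris")),
    ("dbr:France", "rdfs:label", ("xsd:string", "France")),
]


def node_constraint(predicate, obj):
    """Map the object of a triple to a node constraint (hypothetical helper)."""
    kind, value = obj
    if predicate == RDF_TYPE:
        return f"[{value}]"  # value set holding the actual class
    return "IRI" if kind == "iri" else kind  # literals keep their datatype


def profile_class(triples, target_class):
    """Return {(predicate, node constraint, cardinality): % of instances}."""
    # Stage 1, instance tracking: subjects typed with the target class.
    instances = {s for s, p, o in triples
                 if p == RDF_TYPE and o == ("iri", target_class)}
    # Stage 2, class profiling: per instance, count the triples that share
    # the same predicate and node constraint.
    per_instance = defaultdict(Counter)
    for s, p, o in triples:
        if s in instances:
            per_instance[s][(p, node_constraint(p, o))] += 1
    votes = Counter()
    for counts in per_instance.values():
        for (p, nc), n in counts.items():
            votes[(p, nc, "{%d}" % n)] += 1  # exact cardinality
            votes[(p, nc, "+")] += 1         # one-or-more closure
    return {key: 100.0 * hits / len(instances) for key, hits in votes.items()}


profile = profile_class(TRIPLES, "dbo:Country")
for (p, nc, card), pct in sorted(profile.items()):
    print(f"{p} {nc} {card}  # {pct:.0f}%")
```

For the toy graph, both countries fit rdfs:label xsd:string +, while
only dbr:France has exactly one label, which is where the 100% and 50%
figures in Listing 1.2 come from.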
Listing 1.3 shows the Country shape inferred by analyzing the actual
content of DBpedia.
Listing 1.3. Country Shape in DBpedia (minimum trustworthiness of 80%)
:Country {
    rdf:type [dbo:Country] ;             # 100.0%
    dbo:wikiPageID xsd:integer ;         # 97.108%
    owl:sameAs IRI + ;                   # 96.933%
    foaf:name xsd:string + ;             # 96.758%
    dcterms:subject IRI + ;              # 96.028%
    dbo:dissolutionYear xsd:gYear + ;    # 83.148%
                                         # 82.593% have cardinality {1}
    dbo:foundingYear xsd:gYear + ;       # 82.009%
                                         # 81.454% have cardinality {1}
    dbp:continent rdf:langString +       # 80.607%
                                         # 80.344% have cardinality {1}
}
The main features of our prototype are the following:
– Trustworthiness score. Every inferred triple constraint is associated
with the relative number of instances that fit it. We provide that
information in a comment. This allows for sorting the constraints by
trustworthiness, as well as filtering out constraints that are not
frequent enough. The threshold to accept or reject a constraint
according to its trustworthiness can be configured.
– Literal and IRI recognition. All kinds of literals are recognized and
treated separately when inferring the constraints. If a literal is not
explicitly associated with a type in the original KG, xsd:string is
assumed. If the object of a triple is an IRI, the macro IRI is used to
represent it in the inferred constraints.
– Special treatment of rdf:type. The only exception to the previous
feature occurs when analyzing triples whose predicate is rdf:type. In
those cases, we create a triple constraint whose object is a value set
containing a single element, which is the actual object of the original
triple. This is shown in Listing 1.2 and Listing 1.3. Future versions
of this prototype could improve this behavior by allowing users to
customize which properties point to value sets.
– Cardinality management. Some of the triples of a given instance may
fit an infinite number of triple constraints with the same predicate
and object but different cardinalities. For example, if a given
instance has a single label specified by rdfs:label, it fits infinitely
many triple constraints of the form {rdfs:label xsd:string C}, where C
can be any cardinality that includes the possibility of a single
occurrence: {1}, +, {1,2}, {1,3}, ... Currently, our prototype
considers rules with exact cardinality or + closure.
When serializing the shapes, our prototype can be configured to
prioritize the least specific cardinality, or the most specific one if
its trustworthiness is high enough. Information about cardinality that
is not given in the constraint itself is provided through comments.
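The interplay between the trustworthiness threshold and the choice of
cardinality can be illustrated with another short Python sketch. It is
written under stated assumptions and does not describe the prototype's
actual interface: the PROFILE dictionary is hand-derived from the toy
graph of Listing 1.1, and choose_constraints, to_shex, threshold and
prefer_specific are hypothetical names. An exact cardinality of {1} is
omitted in the output because it is the default cardinality in ShEx.

```python
from collections import defaultdict

# Hypothetical profile for the toy graph of Listing 1.1:
# (predicate, node constraint, cardinality) -> % of conforming instances.
PROFILE = {
    ("rdf:type", "[dbo:Country]", "{1}"): 100.0,
    ("rdf:type", "[dbo:Country]", "+"): 100.0,
    ("dbp:capital", "IRI", "{1}"): 100.0,
    ("dbp:capital", "IRI", "+"): 100.0,
    ("rdfs:label", "xsd:string", "{1}"): 50.0,
    ("rdfs:label", "xsd:string", "{2}"): 50.0,
    ("rdfs:label", "xsd:string", "+"): 100.0,
}


def choose_constraints(profile, threshold=80.0, prefer_specific=True):
    """Pick one cardinality per (predicate, node constraint) pair."""
    grouped = defaultdict(dict)
    for (pred, nc, card), pct in profile.items():
        grouped[(pred, nc)][card] = pct
    chosen = []
    for (pred, nc), cards in grouped.items():
        exact = {c: p for c, p in cards.items() if c != "+"}
        # Most frequent exact cardinality; ties keep the first one seen.
        best_card, best_pct = max(exact.items(), key=lambda kv: kv[1],
                                  default=(None, 0.0))
        plus_pct = cards.get("+", 0.0)
        if prefer_specific and best_card is not None and best_pct >= threshold:
            chosen.append((pred, nc, best_card, best_pct, None))
        elif plus_pct >= threshold:
            note = None
            if best_card is not None:
                note = f"{best_pct:g}% have cardinality {best_card}"
            chosen.append((pred, nc, "+", plus_pct, note))
        # Otherwise the constraint is not frequent enough and is dropped.
    # Most trustworthy constraints first (stable for equal scores).
    return sorted(chosen, key=lambda c: -c[3])


def to_shex(shape_name, constraints):
    """Serialize the chosen constraints as ShEx text with score comments."""
    lines = [f"{shape_name} {{"]
    for i, (pred, nc, card, pct, note) in enumerate(constraints):
        suffix = "" if card == "{1}" else f" {card}"  # {1} is the ShEx default
        sep = " ;" if i < len(constraints) - 1 else ""
        lines.append(f"    {pred} {nc}{suffix}{sep}  # {pct:g}%")
        if note:
            lines.append(f"    # {note}")
    lines.append("}")
    return "\n".join(lines)


print(to_shex(":Country", choose_constraints(PROFILE)))
```

With the default settings of this sketch, rdf:type and dbp:capital keep
their exact cardinality, whereas rdfs:label falls back to + because no
exact cardinality reaches the threshold; calling choose_constraints
with prefer_specific=False selects + for every constraint instead.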
3 Conclusions and Future Work
We have applied automatic inference over DBpedia to discover latent
shapes associated with classes of the DBpedia ontology using a
statistical approach. We have serialized the latent shapes using ShEx,
which can be useful as a guideline on how to manipulate the data. Our
approach associates a trustworthiness score with each rule, so it can
also be used as a metric of the homogeneity of the dataset.
This research is a work in progress. The algorithm underlying our
prototype can be extended with extra features, including more complex
inferences, such as inter-shape referencing or more precise
cardinalities; regular expressions for some literals; or the generation
of serializations other than ShEx, such as SHACL or example SPARQL
queries associated with each class.
Acknowledgments. This work is partially funded by the Spanish Ministry of
Economy and Competitiveness (Society challenges: TIN2017-88877-R).
References
1. Dudáš, M., Svátek, V., Mynarz, J.: Dataset summary visualization with
LODSight. In: International Semantic Web Conference. pp. 36–40.
Springer (2015)
2. Fernández-Álvarez, D., Labra-Gayo, J.E., García-González, H.:
Inference and serialization of latent graph schemata using ShEx. In:
SEMAPRO 2016, The Tenth International Conference on Advances in
Semantic Processing. IARIA
3. González, L., Hogan, A.: Modelling dynamics in Semantic Web knowledge
graphs with formal concept analysis. In: Proceedings of the 2018 World
Wide Web Conference on World Wide Web. pp. 1175–1184. International
World Wide Web Conferences Steering Committee (2018)
4. Knublauch, H. (TopQuadrant, Inc.), Kontokostas, D. (University of
Leipzig): Shapes Constraint Language (SHACL). W3C Recommendation (2017)
5. Prud'hommeaux, E., Labra Gayo, J.E., Solbrig, H.: Shape Expressions:
an RDF validation and transformation language. In: Proceedings of the
10th International Conference on Semantic Systems. pp. 32–40. ACM (2014)