Conference Paper

Verbalizing the Evolution of Knowledge Graphs with Formal Concept Analysis


Abstract

Question Answering and Verbalization over Knowledge Graphs (KGs) are gaining momentum as they provide natural interfaces to knowledge harvested from a myriad of data sources. KGs are dynamic: new facts are added and removed over time, producing multiple versions, each representing a knowledge snapshot at a point in time. Verbalizing a report of the evolution of entities is useful in many scenarios, e.g., reporting digital twins’ evolution in manufacturing or healthcare. We envision a method to verbalize a graph summary capturing the temporal evolution of entities across different versions of a KG. Technically, our approach considers revisions of a graph over time and converts them into RDF molecules. Formal Concept Analysis is then performed on them to generate summary information. Finally, a verbalization pipeline generates a report in natural language.
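To make the pipeline concrete, the following minimal Python sketch (an illustration, not the authors' implementation) groups the triples of two hypothetical KG versions into RDF molecules and verbalizes the per-entity differences with simple templates; all entity and predicate names are invented for the example.

from collections import defaultdict

# Two toy versions of a KG, each a set of (subject, predicate, object) triples.
V1 = {("ex:Alice", "ex:worksFor", "ex:AcmeCorp"),
      ("ex:Alice", "ex:role", "Engineer")}
V2 = {("ex:Alice", "ex:worksFor", "ex:AcmeCorp"),
      ("ex:Alice", "ex:role", "Manager"),
      ("ex:Alice", "ex:location", "Berlin")}

def molecules(triples):
    """Group triples by subject: one RDF molecule per entity."""
    mol = defaultdict(set)
    for s, p, o in triples:
        mol[s].add((p, o))
    return mol

def verbalize(m1, m2):
    """Yield one template-based sentence per change of each entity."""
    for entity in sorted(set(m1) | set(m2)):
        added = m2.get(entity, set()) - m1.get(entity, set())
        removed = m1.get(entity, set()) - m2.get(entity, set())
        for p, o in sorted(added):
            yield f"{entity} gained the fact ({p}, {o})."
        for p, o in sorted(removed):
            yield f"{entity} lost the fact ({p}, {o})."

for sentence in verbalize(molecules(V1), molecules(V2)):
    print(sentence)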
Article
Purpose: Enterprise knowledge graphs (EKGs) in the resource description framework (RDF) consolidate and semantically integrate heterogeneous data sources into a comprehensive dataspace. However, to make an external relational data source accessible through an EKG, an RDF view of the underlying relational database, called an RDB2RDF view, must be created. The RDB2RDF view should be materialized in situations where live access to the data source is not possible, or the data source imposes restrictions on the type of query forms and the number of results. In this case, a mechanism for keeping the materialized view data up to date is also required. The purpose of this paper is to address the problem of the efficient maintenance of externally materialized RDB2RDF views.
Design/methodology/approach: This paper proposes a formal framework for the incremental maintenance of externally materialized RDB2RDF views, in which the server computes and publishes changesets indicating the difference between two states of the view. The EKG system can then download the changesets and synchronize the externally materialized view. The changesets are computed based solely on the update and the source database state and require no access to the content of the view.
Findings: The central result of this paper shows that changesets computed according to the formal framework correctly maintain the externally materialized RDB2RDF view. The experiments indicate that the proposed strategy supports live synchronization of large RDB2RDF views and that the time taken to compute the changesets with the proposed approach was almost three orders of magnitude smaller than with partial rematerialization and three orders of magnitude smaller than with full rematerialization.
Originality/value: The main idea that differentiates the proposed approach from previous work on incremental view maintenance is to explore the object-preserving property of typical RDB2RDF views so that the solution can deal with views with duplicates. The algorithms for the incremental maintenance of relational views with duplicates published in the literature require querying the materialized view data to precisely compute the changesets. By contrast, the approach proposed in this paper requires no access to the view data. This is important when the view is maintained externally, because accessing a remote data source may be too slow.
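The following short Python sketch illustrates the client side of this idea under simplified assumptions: a changeset is modelled as a pair of triple sets (to delete, to insert) published by the server, and the externally materialized view is synchronized without any access to the source database. The triples and names are hypothetical, and the sketch is not the paper's formal framework.

# Externally materialized RDB2RDF view held by the EKG system.
materialized_view = {
    ("emp:1", "foaf:name", "Ada"),
    ("emp:1", "ex:dept", "dept:42"),
}

# Changeset published by the server after a source update.
changeset = {
    "delete": {("emp:1", "ex:dept", "dept:42")},
    "insert": {("emp:1", "ex:dept", "dept:7")},
}

def apply_changeset(view, changeset):
    """Synchronize the materialized view: deletions first, then insertions."""
    return (view - changeset["delete"]) | changeset["insert"]

materialized_view = apply_changeset(materialized_view, changeset)
print(sorted(materialized_view))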
Conference Paper
Full-text available
This paper presents DELTA-LD, an approach that detects and classifies the changes between two versions of a linked dataset. It contributes to the state of the art: firstly, by proposing a classification that distinctly identifies the resources that have had both their IRIs and representation changed and the resources that have had only their IRI changed; secondly, by automatically selecting the appropriate resource properties to identify the same resources in different versions of a linked dataset with different IRIs and similar representations. The paper also presents the DELTA-LD change model to represent the detected changes. This model captures the information of both changed resources and changed triples in linked datasets during their evolution, bridging the gap between resource-centric and triple-centric views of changes. As a result, a single change detection mechanism can support several diverse use cases like interlink maintenance and replica synchronization. The paper, in addition, describes an experiment conducted to examine the accuracy of DELTA-LD in detecting changes between the person snapshots of DBpedia. The result indicates that DELTA-LD outperforms state-of-the-art approaches by up to 4% in terms of F-measure. It is demonstrated that the proposed classification of changes helped to identify up to 1529 additional updated resources compared to the existing classification of resource-level changes. By means of a case study, we also demonstrate the automatic repair of broken interlinks using the changes detected by DELTA-LD and represented in the DELTA-LD change model, showing how 100% of the broken interlinks were repaired between DBpedia person snapshot 3.7 and Freebase.
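A much simplified sketch of this kind of change detection is shown below; it classifies resources present in both versions as unchanged or updated and uses a Jaccard similarity over property-value pairs to guess that a removed IRI was renamed. The threshold and data are invented, and DELTA-LD's actual classification and property selection are more elaborate.

from collections import defaultdict

def by_resource(triples):
    """Group (predicate, object) pairs per subject IRI."""
    res = defaultdict(set)
    for s, p, o in triples:
        res[s].add((p, o))
    return res

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def classify(v1, v2, rename_threshold=0.7):
    r1, r2 = by_resource(v1), by_resource(v2)
    changes = {}
    for iri in set(r1) & set(r2):
        changes[iri] = "unchanged" if r1[iri] == r2[iri] else "updated"
    removed, added = set(r1) - set(r2), set(r2) - set(r1)
    for old in removed:
        # A removed IRI whose description closely matches a new IRI is a likely rename.
        best = max(added, key=lambda new: jaccard(r1[old], r2[new]), default=None)
        if best and jaccard(r1[old], r2[best]) >= rename_threshold:
            changes[old] = f"renamed to {best}"
        else:
            changes[old] = "removed"
    # Resources appearing only in v2 (pure additions) are omitted here for brevity.
    return changes

v1 = {("ex:p1", "name", "Ann")}
v2 = {("ex:person1", "name", "Ann")}
print(classify(v1, v2))  # ex:p1 is reported as likely renamed to ex:person1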
Conference Paper
Full-text available
Knowledge graphs are dynamic in nature: new facts about an entity are added or removed over time. Therefore, multiple versions of the same knowledge graph exist, each of which represents a snapshot of the knowledge graph at some point in time. Entities within the knowledge graph undergo evolution as new facts are added or removed. Automatically generating a summary out of different versions of a knowledge graph is a long-studied problem. However, most of the existing approaches are limited to pair-wise version comparison, making it difficult to capture the complete evolution across several versions of the same graph. To overcome this limitation, we envision an approach to create a summary graph capturing the temporal evolution of entities across different versions of a knowledge graph. The entity summary graphs may then be used for documentation generation, profiling or visualization purposes. First, we take different temporal versions of a knowledge graph and convert them into RDF molecules. Secondly, we perform Formal Concept Analysis on these molecules to generate summary information. Finally, we apply a summary fusion policy in order to generate a compact summary graph which captures the evolution of entities.
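As an illustration of the FCA step (a hedged sketch, not the envisioned system), the formal context below uses entity/version pairs as objects and the facts of their RDF molecules as attributes; a naive enumeration then groups the objects that share exactly the same facts, which is the kind of summary information a fusion policy would compact. All data are invented.

from itertools import chain, combinations

# Objects are (entity, version) pairs; attributes are the facts of their molecules.
context = {
    ("ex:Alice", "v1"): {("ex:role", "Engineer"), ("ex:worksFor", "ex:AcmeCorp")},
    ("ex:Alice", "v2"): {("ex:role", "Manager"), ("ex:worksFor", "ex:AcmeCorp")},
    ("ex:Bob", "v2"):   {("ex:role", "Manager")},
}

def concepts(context):
    """Naively enumerate formal concepts as (extent, intent) pairs."""
    objects = list(context)
    seen, result = set(), []
    for subset in chain.from_iterable(combinations(objects, r)
                                      for r in range(1, len(objects) + 1)):
        intent = frozenset.intersection(*(frozenset(context[o]) for o in subset))
        extent = frozenset(o for o in objects if intent <= context[o])
        if (extent, intent) not in seen:
            seen.add((extent, intent))
            result.append((extent, intent))
    return result

for extent, intent in concepts(context):
    print(sorted(extent), "share", sorted(intent))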
Article
Full-text available
In this work, we focus on the task of generating natural language descriptions from a structured table of facts containing fields (such as nationality, occupation, etc.) and values (such as Indian, actor, director, etc.). One simple choice is to treat the table as a sequence of fields and values and then use a standard seq2seq model for this task. However, such a model is too generic and does not exploit task-specific characteristics. For example, while generating descriptions from a table, a human would attend to information at two levels: (i) the fields (macro level) and (ii) the values within the field (micro level). Further, a human would continue attending to a field for a few timesteps until all the information from that field has been rendered and then never return to this field (because there is nothing left to say about it). To capture this behavior we use (i) a fused bifocal attention mechanism which exploits and combines this micro- and macro-level information and (ii) a gated orthogonalization mechanism which tries to ensure that a field is remembered for a few timesteps and then forgotten. We experiment with a recently released dataset which contains fact tables about people and their corresponding one-line biographical descriptions in English. In addition, we also introduce two similar datasets for French and German. Our experiments show that the proposed model gives a 21% relative improvement over a recently proposed state-of-the-art method and a 10% relative improvement over basic seq2seq models. The code and the datasets developed as a part of this work are publicly available.
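The following numpy sketch is a heavily simplified illustration of the bifocal idea, not the published architecture: attention scores are computed over fields (macro) and over values (micro) and fused by re-weighting each value with the score of its field. The learned projections of the real model are replaced by plain dot products, and all dimensions and data are arbitrary.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = np.random.rand(8)                        # decoder state
fields = np.random.rand(3, 8)                # macro level: one embedding per field
values = np.random.rand(5, 8)                # micro level: one embedding per value
field_of_value = np.array([0, 0, 1, 2, 2])   # which field each value belongs to

macro = softmax(fields @ h)                  # attention over fields
micro = softmax(values @ h)                  # attention over values
fused = micro * macro[field_of_value]        # bifocal fusion of the two levels
fused = fused / fused.sum()                  # renormalize to a distribution

context_vector = fused @ values              # attended representation for the decoder
print(fused, context_vector)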
Conference Paper
Full-text available
In the last two decades a new part of the Web has grown significantly, namely the Semantic Web. It contains many Knowledge Bases (KBs) about different areas like music, books, publications, life science and many more. Question Answering (QA) over KBs is seen as the most promising approach to bring this data to end-users. We describe WDAqua-core1, a QA service for querying RDF knowledge bases. It is multilingual, it supports different RDF knowledge bases and it understands both full natural language questions and keyword questions.
Conference Paper
Full-text available
The nature of the RDF data model allows for numerous descriptions of the same entity. For example, different RDF vocabularies may be utilized to describe pharmacogenomic data, and the same drug or gene is represented by different RDF graphs in DBpedia or Drugbank. To provide a unified representation of the same real-world entity, RDF graphs need to be semantically integrated. Semantic integration requires the management of knowledge encoded in RDF vocabularies to determine the relatedness of different RDF representations of the same entity, e.g., axiomatic definitions of vocabulary properties or resource equivalences. We devise MINTE, an integration technique that relies on both the knowledge stated in RDF vocabularies and semantic similarity measures to merge semantically equivalent RDF graphs, i.e., graphs corresponding to the same real-world entity. MINTE follows a two-fold approach to solve the problem of integrating RDF graphs. In the first step, MINTE implements a 1-1 weighted perfect matching algorithm to identify semantically equivalent RDF entities in different graphs. Then, MINTE relies on different fusion policies to merge triples from these semantically equivalent RDF entities. We empirically evaluate the performance of MINTE on data from DBpedia, Wikidata, and Drugbank. The experimental results suggest that MINTE is able to accurately integrate semantically equivalent RDF graphs.
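As an illustration of the first step only (a sketch under simplified assumptions, not MINTE itself), the snippet below builds a similarity matrix from a Jaccard measure over (predicate, object) pairs and computes a 1-1 weighted matching with SciPy's Hungarian-algorithm implementation; the entity descriptions are invented, and the fusion policies of the second step are not shown.

import numpy as np
from scipy.optimize import linear_sum_assignment

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

graph_a = {"dbpedia:Aspirin": {("type", "Drug"), ("target", "COX-1")}}
graph_b = {"drugbank:DB00945": {("type", "Drug"), ("target", "COX-1"), ("mass", "180.16")}}

ents_a, ents_b = list(graph_a), list(graph_b)
similarity = np.array([[jaccard(graph_a[a], graph_b[b]) for b in ents_b] for a in ents_a])

# linear_sum_assignment minimizes cost, so negate the similarity matrix.
rows, cols = linear_sum_assignment(-similarity)
matches = [(ents_a[i], ents_b[j]) for i, j in zip(rows, cols)]
print(matches)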
Conference Paper
Context-specific descriptions of entities, expressed in RDF, pose challenges during data-driven tasks, e.g., data integration, and context-aware entity matching represents a building block for these tasks. However, existing approaches only consider inter-schema mappings of data sources and are not able to manage several contexts during entity matching. We devise COMET, an entity matching technique that relies on both the knowledge stated in RDF vocabularies and context-based similarity metrics to match contextually equivalent entities. COMET executes a novel 1-1 perfect matching algorithm for matching contextually equivalent entities based on the combined scores of semantic similarity and context similarity. COMET employs the Formal Concept Analysis algorithm in order to compute the context similarity of RDF entities. We empirically evaluate the performance of COMET on a testbed from DBpedia. The experimental results suggest that COMET is able to accurately match equivalent RDF graphs in a context-dependent manner.
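A minimal, non-authoritative sketch of the scoring idea follows: a semantic similarity and a context similarity are combined into one matching score via a weighted sum. In COMET the context score comes from Formal Concept Analysis over entity contexts; here both scores are plain Jaccard coefficients and the weight alpha is arbitrary.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_score(entity_a, entity_b, alpha=0.5):
    """Weighted sum of a semantic score and a context score for a candidate pair."""
    semantic = jaccard(entity_a["facts"], entity_b["facts"])
    contextual = jaccard(entity_a["context"], entity_b["context"])
    return alpha * semantic + (1 - alpha) * contextual

e1 = {"facts": {("type", "City"), ("country", "Ireland")}, "context": {"travel", "geography"}}
e2 = {"facts": {("type", "City"), ("country", "Ireland")}, "context": {"geography"}}
print(combined_score(e1, e2))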
Article
Knowledge graphs offer a versatile knowledge representation, and have been studied under different forms, such as conceptual graphs or RDF graphs in the Semantic Web. A challenge is to discover conceptual structures in those graphs, in the same way as Formal Concept Analysis (FCA) discovers conceptual structures in tables. FCA has been successful for analysing, mining, learning, and exploring tabular data, and our aim is to help transpose those results to graph-based data. Several previous FCA approaches have already addressed relational data, hence graphs, but with various limits. We propose Graph-FCA as an extension of FCA where a dataset is a hypergraph instead of a binary table. We show that it can be formalized simply by replacing objects with tuples of objects. This leads to the notion of “n-ary concept”, whose extent is an n-ary relation of objects, and whose intent is a “projected graph pattern”. In this paper, we formally reconstruct the fundamental results of FCA for knowledge graphs. We describe in detail the representation of hypergraphs, and the operations on them, as they are much more complex than the sets of attributes that they extend. We also propose an algorithm based on a notion of “pattern basis” to generate and display n-ary concepts in a more efficient and more compact way. We explore two use cases in order to study the feasibility and usefulness of Graph-FCA: workflow patterns in cooking recipes and linguistic structures from parse trees. In addition, we report on experiments about quantitative aspects of the approach.
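For readers unfamiliar with FCA, the sketch below shows the classical derivation operators that Graph-FCA generalizes from binary tables to hypergraphs: a set of objects is mapped to its shared attributes and back, and a formal concept is a fixed point of this closure. The toy context is invented; the n-ary concepts and projected graph patterns of Graph-FCA are not shown here.

# A tiny binary context: objects (recipes) and the attributes they have.
context = {
    "recipe1": {"uses:flour", "uses:egg"},
    "recipe2": {"uses:flour", "uses:sugar"},
    "recipe3": {"uses:flour", "uses:egg", "uses:sugar"},
}

def intent(objects):
    """Attributes shared by all given objects (derivation operator on object sets)."""
    if not objects:
        return {a for attrs in context.values() for a in attrs}
    return set.intersection(*(context[o] for o in objects))

def extent(attributes):
    """Objects possessing all given attributes (derivation operator on attribute sets)."""
    return {o for o, attrs in context.items() if attributes <= attrs}

objs = {"recipe1", "recipe3"}
print(intent(objs))           # shared attributes, e.g. {'uses:flour', 'uses:egg'}
print(extent(intent(objs)))   # closure of objs; (extent, intent) is a formal concept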
Conference Paper
In this paper, we propose a novel data-driven schema for large-scale heterogeneous knowledge graphs inspired by Formal Concept Analysis (FCA). We first extract the sets of properties associated with individual entities; these property sets (aka characteristic sets) are annotated with cardinalities and used to induce a lattice based on set-containment relations, forming a natural hierarchical structure describing the knowledge graph. We then propose an algebra over such schema lattices, which makes it possible to compute diffs between lattices (for example, to summarise the changes from one version of a knowledge graph to another), to add lattices (for example, to project future changes), and so forth. While we argue that this lattice structure (and associated algebra) may have various applications, we currently focus on the use case of modelling and predicting the dynamic behaviour of knowledge graphs. Along those lines, we instantiate and evaluate our methods for analysing how versions of the Wikidata knowledge graph have changed over a period of 11 weeks. We propose algorithms for constructing the lattice-based schema from Wikidata, and evaluate their efficiency and scalability. We then evaluate the use of the resulting schema(ta) for predicting how the knowledge graph will evolve in future versions.
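A hedged sketch of the characteristic-set idea (not the paper's lattice algebra) is given below: each entity is mapped to the set of properties describing it, the resulting sets are counted to obtain cardinality annotations, and two versions are compared by diffing those counts. The lattice induced by set containment is not built here, and all data are invented.

from collections import Counter, defaultdict

def characteristic_sets(triples):
    """Count entities per characteristic set (the frozenset of properties they use)."""
    props = defaultdict(set)
    for s, p, o in triples:
        props[s].add(p)
    return Counter(frozenset(ps) for ps in props.values())

def diff(cs_old, cs_new):
    """Change in the number of entities per characteristic set between two versions."""
    return {cs: cs_new.get(cs, 0) - cs_old.get(cs, 0)
            for cs in set(cs_old) | set(cs_new)}

v1 = {("q1", "label", "x"), ("q1", "population", "5"), ("q2", "label", "y")}
v2 = {("q1", "label", "x"), ("q2", "label", "y"), ("q2", "population", "9"), ("q2", "area", "12")}
print(diff(characteristic_sets(v1), characteristic_sets(v2)))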
Conference Paper
There is an emerging demand for efficiently archiving and (temporally) querying different versions of evolving Semantic Web data. As novel archiving systems are starting to address this challenge, foundations and standards for benchmarking RDF archives are needed to evaluate their storage space efficiency and the performance of different retrieval operations. To this end, we provide theoretical foundations on the design of data and queries to evaluate emerging RDF archiving systems. Then, we instantiate these foundations along a concrete set of queries on the basis of a real-world evolving dataset. Finally, we perform an empirical evaluation of various current archiving techniques and querying strategies on this data. Our work comprises, to the best of our knowledge, the first benchmark for querying evolving RDF data archives.