Conference Paper

# Evolutionary Discovery of Multi-relational Association Rules from Ontological Knowledge Bases

Authors:
• Thang Long University, Hanoi, Vietnam

## Abstract

In the Semantic Web, OWL ontologies play the key role of domain conceptualizations, while the corresponding assertional knowledge is given by the heterogeneous Web resources referring to them. However, being strongly decoupled, ontologies and assertional knowledge can be out of sync. In particular, an ontology may be incomplete, noisy, and sometimes inconsistent with the actual usage of its conceptual vocabulary in the assertions. Despite such problematic situations, we aim at discovering hidden knowledge patterns from ontological knowledge bases, in the form of multi-relational association rules, by exploiting the evidence coming from the (evolving) assertional data. The final goal is to use such patterns to (semi-)automatically enrich and complete existing ontologies. To this end, we propose an evolutionary search method applied to populated ontological knowledge bases. The method mines intensional and assertional knowledge by exploiting problem-aware genetic operators, echoing the refinement operators of inductive logic programming, and by taking intensional knowledge into account, which makes it possible to restrict the search space and direct the evolutionary process. The discovered rules are represented in SWRL, so that they can be straightforwardly integrated within the ontology, thus enriching its expressive power and augmenting the assertional knowledge that can be derived from it. Discovered rules may also suggest new (schema) axioms to be added to the ontology. We performed experiments on publicly available ontologies, validating the performance of our approach and comparing it with the main state-of-the-art systems.

## No full-text available

... Combining ILP with Association Rule Mining. A hybrid approach that involves both reasoning and association rule mining is presented in [26]. The intended application of this type of rule mining is suggesting new axioms to be added to the ontology [27]. ...
... The intended application of this type of rule mining is suggesting new axioms to be added to the ontology [27]. Despite the advantages of the hybrid approach, the use of a reasoner in combination with association rule mining is likely to exceed what one can consider a "reasonable time" [26]. On the other hand, the advantage of approaches involving reasoning is that they generate rules, which are not necessarily closed, while AMIE+/RDFRules outputs only closed rules. ...
... For our experiments, we mainly used the YAGO2 core dataset and YAGO3 core samples of yagoLiteralFacts, yagoFacts and yagoDBpediaInstances available from the Max Planck Institute website. For more time-consuming tasks we used DBpedia 3.8 with person data and mapping-based properties. ...
Article
AMIE+ is a state-of-the-art algorithm for learning rules from RDF knowledge graphs (KGs). Based on association rule learning, AMIE+ constituted a breakthrough in terms of speed on large data compared to the previous generation of ILP-based systems. In this paper we present several algorithmic extensions to AMIE+ that make it faster, together with support for data pre-processing and model post-processing, which provides more comprehensive coverage of the linked data mining process than the original AMIE+ implementation. The main contributions are related to performance improvement: (1) the top-k approach, which addresses the problem of combinatorial explosion often resulting from a hand-set minimum support threshold, (2) a grammar for defining fine-grained patterns, which reduces the size of the search space, and (3) a faster projection binding, which reduces the number of repetitive calculations. Other enhancements include the possibility to mine across multiple graphs, support for discretization of continuous values, and the selection of the most representative rules using proven rule pruning and clustering algorithms. Benchmarks show reductions in mining time of up to several orders of magnitude compared to AMIE+. An open-source implementation is available under the name RDFRules at https://github.com/propi/rdfrules.
... This is the goal of level-wise generate-and-test methods proposed in inductive logic programming (ILP) [18,9,17] and in the SW community [15,14,22,12], which exploit just the assertional evidence of ontological KBs, and, more recently, of approaches that also exploit the reasoning capabilities of the SW, like [7]. Even more recently, approaches that take advantage of the exploration capabilities of evolutionary algorithms jointly with the reasoning capabilities of ontologies have been proposed: this is the case of EDMAR [8,21], an evolutionary inductive programming approach capable of discovering hidden knowledge patterns in the form of multi-relational association rules (ARs) coded in SWRL [13], which can be added to the ontology, thus enriching its expressive power and increasing the assertional knowledge that can be derived from it. Additionally, discovered rules may suggest new axioms to be added to the ontology, such as transitivity and symmetry of a role, as well as concept/role inclusion. ...
... The best metrics could be considered and used in future research. We applied our evolutionary algorithm to the same populated ontological KBs used in [8]: Financial, describing the banking domain; the Biological Pathways Exchange (BioPAX) Level 2 Ontology, describing biological pathway data; and the New Testament Names Ontology (NTNMerged), describing named entities (people, places, and other classes) in the New Testament, as well as their attributes and relationships. Details on these ontologies are reported in Table 1. ...
... To test the capability of the discovered rules to predict new assertional knowledge for each examined ontological KB, stratified versions of each ontology have been constructed (as described in [8]) by randomly removing, respectively, 20%, 30%, and 40% of the concept assertions, while the full ontology versions have been used as a testbed. We performed 30 runs of the EA described in Sect. 3 for each stratified version and for each choice of fitness function using the following parameter setting: where the number of predictions is the number of predicted assertions and precision is defined in Def. 17. ...
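The hold-out step described in this excerpt can be sketched as follows. This is only an illustration under stated assumptions: concept assertions are modelled as hypothetical `(individual, concept)` pairs, the helper name `hold_out` is invented here, and the removal is plain uniform sampling, so the stratification procedure of the cited work is not reproduced.

```python
import random

def hold_out(assertions, fraction, seed=0):
    """Split assertions into (kept, removed), removing `fraction` of them at random."""
    rng = random.Random(seed)          # fixed seed so that a run can be repeated
    items = sorted(assertions)         # sort first: set iteration order is not stable
    n_removed = round(len(items) * fraction)
    removed = set(rng.sample(items, n_removed))
    kept = set(items) - removed
    return kept, removed

# Hypothetical concept assertions: (individual, concept) pairs.
full_kb = {(f"ind{i}", "Person") for i in range(10)}

for fraction in (0.2, 0.3, 0.4):
    kept, removed = hold_out(full_kb, fraction)
    # Rules would be mined on `kept`; their predictions would be checked against `full_kb`.
    print(fraction, len(kept), len(removed))
```

In such a setup, the removed assertions act as the hidden ground truth that the mined rules are expected to recover.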
... However, none of those approaches takes advantage of the exploration capabilities of EAs jointly with the reasoning capabilities of ontologies. Recently, a first attempt in this novel direction to discover hidden knowledge patterns in ontological KBs has been made [5]. However, that proposal suffers from some limitations, which our solution manages to overcome. ...
... Only connected [9] and non-redundant [12] rules satisfying the safety condition [10] are considered. In the following, notations and formal definitions for the listed properties are reported. ...
... Since the goal is to discover rules capable of making (possibly a large number of) accurate predictions, the fitness of a pattern is the sum of the head coverage and of the PCA confidence of the corresponding rule, which measures its overall quality. The EA we propose may be regarded as an improvement of a previous proposal by three of us [5] and at the same time as an alternative and complementary approach to level-wise generate-and-test algorithms for discovering relational ARs from RDF datasets [9] and recent proposals that take into account terminological axioms and deductive reasoning capabilities [4]. ...
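The fitness described in this excerpt (head coverage plus PCA confidence) can be illustrated with a minimal sketch. It assumes the AMIE-style definitions computed over plain sets of `(subject, object)` pairs; the cited work may define both measures with respect to what a reasoner can derive from the ontology, which is not modelled here. The function names and the toy facts are hypothetical.

```python
def support(body_pairs, head_facts):
    """Number of pairs predicted by the rule body that are also known head facts."""
    return len(body_pairs & head_facts)

def head_coverage(body_pairs, head_facts):
    """Support divided by the number of known facts of the head relation."""
    return support(body_pairs, head_facts) / len(head_facts) if head_facts else 0.0

def pca_confidence(body_pairs, head_facts):
    """Support divided by the body pairs whose subject has at least one head fact.

    Under the partial completeness assumption, a prediction (x, y) only counts
    as a counter-example when the KB already knows some head fact for x.
    """
    subjects_with_head = {x for x, _ in head_facts}
    pca_body = sum(1 for x, _ in body_pairs if x in subjects_with_head)
    return support(body_pairs, head_facts) / pca_body if pca_body else 0.0

def fitness(body_pairs, head_facts):
    """Sum of head coverage and PCA confidence, as stated in the excerpt."""
    return head_coverage(body_pairs, head_facts) + pca_confidence(body_pairs, head_facts)

# Toy rule: bornIn(x, c) AND officialLanguage(c, l) => speaks(x, l)
speaks = {("ann", "fr"), ("bob", "en"), ("cat", "de")}   # known head facts
predicted = {("ann", "fr"), ("bob", "fr"), ("dan", "it")}  # pairs satisfying the body

print(head_coverage(predicted, speaks))
print(pca_confidence(predicted, speaks))
print(fitness(predicted, speaks))
```

In the toy data, the prediction for `dan` is not penalised by PCA confidence because no `speaks` fact is known for `dan`, whereas the prediction for `bob` counts as a counter-example.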
Conference Paper
In the Semantic Web context, OWL ontologies represent the conceptualization of domains of interest while the corresponding assertional knowledge is given by RDF data referring to them. Because of its open, distributed, and collaborative nature, such knowledge can be incomplete, noisy, and sometimes inconsistent. By exploiting the evidence coming from the assertional data, we aim at discovering hidden knowledge patterns in the form of multi-relational association rules while taking advantage of the intensional knowledge available in ontological knowledge bases. An evolutionary search method applied to populated ontological knowledge bases is proposed for finding rules with a high inductive power. The proposed method, EDMAR, uses problem-aware genetic operators, echoing the refinement operators of ILP, and takes the intensional knowledge into account, which allows it to restrict and guide the search. Discovered rules are coded in SWRL, and as such they can be straightforwardly integrated within the ontology, thus enriching its expressive power and augmenting the assertional knowledge that can be derived. Additionally, discovered rules may suggest new axioms to be added to the ontology. We performed experiments on publicly available ontologies, validating the performance of our approach and comparing it with the main state-of-the-art systems.
... The problem of knowledge completion aims to find information missing from the knowledge base. An indicative method that tackles this problem is AMIE [26,27]. AMIE aims to mine logic rules from RDF knowledge bases in order to predict new assertions. ...
... AMIE aims to mine logic rules from RDF knowledge bases in order to predict new assertions. In a recent work [27], the system aims to mine SWRL rules from an OWL ontology. Methods for learning disjointness axioms aim to discover, from the data, axioms that were overlooked during the modelling process, an omission that leads to misunderstanding the negative knowledge of the target domain. ...
Article
Full-text available
Remarkable progress in research has shown the efficiency of Knowledge Graphs (KGs) in extracting valuable external knowledge in various domains. A Knowledge Graph (KG) can illustrate high-order relations that connect two objects with one or multiple related attributes. The emerging Graph Neural Networks (GNN) can extract both object characteristics and relations from KGs. This paper presents how Machine Learning (ML) meets the Semantic Web and how KGs are related to Neural Networks and Deep Learning. The paper also highlights important aspects of this area of research, discussing open issues such as the bias hidden in KGs at different levels of graph representation.
... The basic element of this environment is Linked Data, which connects related data across the Web by reusing HTTP Internationalized Resource Identifiers (IRIs) and by using the Resource Description Framework (RDF) to publicly share semi-structured data on ... [Figure 1: The diagram of the Linked Open Data Cloud] More and more applications are devised to help humans explore [62] this huge knowledge network and to perform data analysis and data mining tasks that extract new knowledge, which may complete or correct the KBs. Very interesting proposals have recently been published on this topic [30,36]; they are currently tested on only one source but can be expected to become even more powerful when they deal with several linked open data sets. An essential requirement for these tasks is safe and sound data retrieved from the Web of Linked Open Data, which is usually obtained through data collection and preprocessing steps. ...
... Thus, in the next execution of step (a), the PIs that were not applicable to the atoms of the query q in the previous step may become applicable to the atoms in the new query. ...
Thesis
The term Linked Open Data (LOD) was first proposed by Tim Berners-Lee in 2006. He suggested principles for connecting and publishing data on the Web so that both humans and machines can effectively retrieve and process them. Since then, LOD has evolved impressively, with thousands of datasets on the Web of Data, which has raised a number of challenges for the research community in retrieving and processing LOD. In this thesis, we focus on the problem of the quality of data retrieved from various sources of the LOD, and we propose a context-driven querying system that guarantees the quality of answers with respect to the quality context defined by users. We define a fragment of constraints and propose two approaches, the naive and the rewriting, which allow us to filter valid answers dynamically at query time instead of validating them at the data source level. The naive approach performs the validation process by generating and evaluating sub-queries for each candidate answer with respect to each constraint. The rewriting approach, in contrast, uses constraints as rewriting rules to reformulate the query into a set of auxiliary queries, such that the answers to the rewritten queries are not only answers to the query but also valid answers with respect to all integrated constraints. The proof of the correctness and completeness of our rewriting system is presented after formalizing the notion of a valid answer with respect to a context. These two approaches have been evaluated and have shown the feasibility of our system. This is our main contribution: we extend the set of well-known query-rewriting systems (Chase, Chase & Backchase, PerfectRef, Xrewrite, etc.) with a new, effective solution for the new purpose of filtering query results based on constraints in the user context. Moreover, we also enlarge the trigger condition of the constraint compared with other works by using the notion of one-way MGU.
... The basic element of this environment is Linked Data, which connects related data across the Web by reusing HTTP Internationalized Resource Identifiers (IRIs) and by using the Resource Description Framework (RDF) to publicly share semi-structured data on ... [Figure 1: The diagram of the Linked Open Data Cloud] More and more applications are devised to help humans explore [62] this huge knowledge network and to perform data analysis and data mining tasks that extract new knowledge, which may complete or correct the KBs. Very interesting proposals have recently been published on this topic [30,36]; they are currently tested on only one source but can be expected to become even more powerful when they deal with several linked open data sets. An essential requirement for these tasks is safe and sound data retrieved from the Web of Linked Open Data, which is usually obtained through data collection and preprocessing steps. ...
... Thus, in the next execution of step (a), the PIs that were not applicable to the atoms of the query q in the previous step may become applicable to the atoms in the new query. ...
Thesis
The term Linked Open Data (LOD) was first introduced by Tim Berners-Lee in 2006. Since then, LOD has evolved considerably. Today, thousands of datasets are available on the Web of Data. As a result, the research community has faced a number of challenges concerning the retrieval and processing of linked data. In this thesis, we address the problem of the quality of data extracted from various LOD sources, and we propose a context-driven querying system that guarantees the quality of answers with respect to a context specified by the user. We define a framework for expressing constraints and propose two approaches, a naive one and a rewriting one, which dynamically filter the valid answers obtained from possibly invalid sources, at query time rather than by trying to validate them in the data sources. The naive approach carries out the validation process by generating and evaluating sub-queries for each candidate answer with respect to each constraint. The rewriting approach, by contrast, uses the constraints as rewriting rules to reformulate the query into a set of auxiliary queries, such that the answers to these rewritten queries are not only answers to the initial query but also valid answers with respect to all the integrated constraints. The proof of the correctness and completeness of our rewriting system is presented after formalizing the notion of a valid answer with respect to a context. Both approaches have been evaluated and have shown the practicality of our system. This is our main contribution: we extend the set of known rewriting systems (Chase, C&BC, PerfectRef, Xrewrite, etc.) with a new, efficient solution for this new challenge of filtering results according to a user context. We also generalize the triggering conditions of constraints compared with existing solutions, by using the notion of one-way MGU.
... Alternatively, where available, ontologies can be used to derive logically certain negative edges under OWA through, for example, disjointness axioms. The system proposed by d'Amato et al. [114,115] leverages ontologically-entailed negative edges for determining the confidence of rules generated through an evolutionary algorithm. ...
... However, these latter approaches do not use the semantics of ontology in the exploration of the search space. Other approaches such as [2] can be guided by the ontology semantics to avoid constructing semantically redundant rules. However, the author has shown that the exploitation of reasoning capabilities during the learning process does not allow mining rules on large KGs. ...
Chapter
Full-text available
In the context of the work conducted at CSTB (French Scientific and Technical Center for Building), the need for a tool providing assistance in the identification of asbestos-containing materials in buildings was identified. To this end, we have developed an approach, named CRA-Miner, that mines logical rules from a knowledge graph that describes buildings and asbestos diagnoses. Since the specific product used is not defined, CRA-Miner considers temporal data, product types, and contextual information to find a set of candidate rules that maximizes the confidence. These rules can then be used to identify building elements that may contain asbestos and those that are asbestos-free. The experiments conducted on an RDF graph provided by the CSTB show that the proposed approach is promising and a satisfactory accuracy can be obtained.
... Alternatively, where available, ontologies can be used to derive logically certain negative edges under OWA through, for example, disjointness axioms. The system proposed by d'Amato et al. [114,115] leverages ontologically-entailed negative edges for determining the confidence of rules generated through an evolutionary algorithm. ...
Article
Full-text available
In this article, we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models, as well as languages used to query and validate knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We conclude with high-level future research directions for knowledge graphs.
... Alternatively, where available, ontologies can be used to derive logically certain negative edges under OWA through, for example, disjointness axioms. The system proposed by d'Amato et al. [109,110] leverages ontologically-entailed negative edges for determining the confidence of rules generated through an evolutionary algorithm. ...
Article
Full-text available
In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After a general introduction, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.
... Alternatively, where available, ontologies can be used to derive logically certain negative edges under OWA through, for example, disjointness axioms. The system proposed by d'Amato et al. [109,110] leverages ontologically-entailed negative edges for determining the confidence of rules generated through an evolutionary algorithm. ...
Preprint
Full-text available
In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After a general introduction, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.
... degrees, clustering coefficients) in order to make it exploitable by learning methods [4]. Inductive logic programming and rule discovery approaches [1,3,6,7] can detect rules such as "If a person was born in a country, they probably speak the language of that country". These rules can then be used to predict new information. ...
Conference Paper
Full-text available
Since 1997, the production, import and sale of asbestos (naturally occurring mineral fibres that were used for their insulating properties) have been banned in France. However, there are still millions of tons scattered in factories, buildings, or hospitals. In this paper we propose a method for predicting the presence of asbestos products in buildings based on temporal data that describes the probability of the presence of asbestos in marketed products.
... Related work on learning rules from KGs can be found in the literature. This is mostly done by exploiting rule mining methods [34,31]. There, rules are automatically discovered from KGs and represented in SWRL, whilst we propose to use SHACL for representing constraints that are learned from KGs and that are ultimately used for validating (possibly also new) statements of a KG. ...
Preprint
Full-text available
Linked Open Data (LOD) is the publicly available RDF data on the Web. Each LOD entity is identified by a URI and accessible via HTTP. LOD encodes global-scale knowledge potentially available to any human as well as to any artificial intelligence that may want to benefit from it as background knowledge for supporting its tasks. LOD has emerged as the backbone of applications in diverse fields such as Natural Language Processing, Information Retrieval, Computer Vision, Speech Recognition, and many more. Nevertheless, regardless of the specific tasks that LOD-based tools aim to address, the reuse of such knowledge may be challenging for diverse reasons, e.g. semantic heterogeneity, provenance, and data quality. As aptly stated by Heath et al., "Linked Data might be outdated, imprecise, or simply wrong": there arises a need to investigate the problem of linked data validity. This work reports a collaborative effort performed by nine teams of students, guided by an equal number of senior researchers, attending the International Semantic Web Research School (ISWS 2018), towards addressing this investigation from different perspectives, coupled with different approaches to tackle the issue.
... Applications are needed to help humans explore this huge knowledge network and perform data analysis and data mining tasks. Promising recent proposals have so far been tested on only one Semantic Web data source [7,10], and these processes can be expected to be even more helpful when they deal with several linked open data sets. One crucial point for such applications, and in particular for data mining algorithms, is that the data collection and pre-processing steps have to be safe and sound. ...
Chapter
The Semantic Web (SW) is characterized by the availability of a vast amount of semantically annotated data collections. Annotations are provided by exploiting ontologies acting as shared vocabularies. Additionally, ontologies are endowed with deductive reasoning capabilities, which make it possible to make explicit the knowledge that is formalized implicitly. Over the years, a large number of data collections have been developed and interconnected, as testified by the Linked Open Data Cloud. Currently, seminal examples are represented by the numerous Knowledge Graphs (KGs) that have been built, either as enterprise KGs or as open KGs that are freely available. All of them are characterized by very large data volumes, but also by incompleteness and noise. These characteristics have made the exploitation of deductive reasoning services less feasible from a practical viewpoint, opening the way to alternative solutions, grounded in Machine Learning (ML), for mining knowledge from the vast amount of information available. Indeed, ML methods have been exploited in the SW for solving several problems such as link and type prediction, ontology enrichment and completion (both at terminological and assertional level), and concept learning. Whilst symbol-based solutions were mostly targeted initially, numeric-based approaches have recently been receiving major attention because of the need to scale to very large data volumes. Nevertheless, data collections in the SW have peculiarities that can hardly be found in other fields. As such, the application of ML methods for solving the targeted problems is not straightforward. This paper extends [20] by surveying the most representative symbol-based and numeric-based solutions and related problems, with a special focus on the main issues that need to be considered and solved when ML methods are adopted in the SW field, as well as by analyzing the main peculiarities and drawbacks of each solution.

Keywords: Semantic Web, Machine learning, Symbol-based methods, Numeric-based methods
Article
Constructing large-scale knowledge bases has encountered a bottleneck because of the limitations of natural language processing. Many approaches have been put forward to infer new facts based on existing knowledge. Graph feature models mine rule-like patterns from a knowledge base and use them to predict missing edges. These models take the graph structure information into account and can reasonably explain the existence of a fact. Existing models only describe local interactions between entities, whereas modelling co-relationships among facts globally remains a tough problem. In this paper, we develop an efficient model which uses association rules to make inferences. First, we use a rule mining model to detect simple association rules and use them to produce large amounts of evidence. Second, based on all the produced evidence and the connections among them, we construct a factor graph which represents the inference space. Then, we develop an EM inference model: in the E-step we use Belief Propagation to calculate the marginal distribution of candidate edges, and in the M-step we propose a Generalized Iterative Proportional Fitting algorithm to re-learn the confidence of soft rules. Experiments show that our approach outperforms state-of-the-art approaches in knowledge base completion (KBC) tasks.
Thesis
Given the danger of disinformation and the proliferation of fake news on the Web, the notion of data veracity is a crucial issue. In this context, it becomes essential to develop models that automatically assess the veracity of information. This assessment is already very difficult for a human, notably because of confirmation bias, which prevents an objective evaluation of the reliability of information. Moreover, the amount of information available on the Web makes this task almost impossible. Substantial computing power and methods capable of automating the task are therefore needed. In this thesis, we focus on truth discovery models. These approaches analyze the claims made by different sources in order to determine which one is the most reliable and trustworthy. This step is crucial in a knowledge extraction process, for example to build high-quality knowledge bases on which various downstream processes (decision support, recommendation, reasoning, etc.) can rely. More precisely, the models in the literature are unsupervised models based on one postulate: accurate information is mainly provided by reliable sources, and reliable sources provide accurate information. Existing approaches have so far disregarded a priori knowledge of a domain. In this contribution, we show how knowledge models (domain ontologies) can be used to advantage to improve truth discovery processes. We focus mainly on two approaches: taking into account the hierarchy of the ontology's concepts, and identifying patterns in the knowledge which, by exploiting certain association rules, make it possible to reinforce confidence in certain claims. In the first case, two different values are no longer necessarily considered contradictory; they may in fact represent the same concept at different levels of detail. To integrate this component into existing approaches, we rely on the mathematical models associated with partial orders. In the second case, we consider recurrent patterns (modelled using association rules) that can be derived from ontologies and existing knowledge bases. This additional information can reinforce confidence in certain values when certain recurrent patterns are observed. Each approach is validated on different datasets, which are made available to the community, as is the source code for both approaches.
Article
In recent years, there has been a growing interest from the digital humanities in knowledge graphs as a data modelling paradigm. Already, this has led to the creation of many such knowledge graphs, many of which are now available as part of the Linked Open Data cloud. This presents new opportunities for data mining. In this work, we develop, implement, and evaluate (both data-driven and user-driven) an end-to-end pipeline for user-centric pattern mining on knowledge graphs in the humanities. This pipeline combines constrained generalized association rule mining with natural language output and facet rule browsing to allow for transparency and interpretability, two key domain requirements. Experiments in the archaeological domain show that domain experts were positively surprised by the range of patterns that were discovered and were overall optimistic about the future potential of this approach.
Thesis
The notion of data veracity is attracting increasing attention because of the problem of misinformation and fake news. As more and more information is published online, it is becoming essential to develop models that automatically evaluate information veracity. Evaluating data veracity is very difficult for humans: they are affected by confirmation bias, which prevents them from objectively evaluating the reliability of information. Moreover, the amount of information available nowadays makes this task time-consuming, so computational power is required and it is critical to develop methods that can automate the task. In this thesis we focus on Truth Discovery models. These approaches address the data veracity problem when conflicting values about the same properties of real-world entities are provided by multiple sources. They aim to identify which claims are true among the set of conflicting ones. More precisely, they are unsupervised models based on the rationale that true information is provided by reliable sources and reliable sources provide true information. The main contribution of this thesis consists in improving Truth Discovery models by considering a priori knowledge expressed in ontologies. This knowledge may facilitate the identification of true claims. Two particular aspects of ontologies are considered. First, we explore the semantic dependencies that may exist among different values, i.e. the ordering of values through certain conceptual relationships. Indeed, two different values are not necessarily conflicting: they may represent the same concept at different levels of detail. In order to integrate this kind of knowledge into existing approaches, we use the mathematical models of partial orders. Second, we consider recurrent patterns that can be derived from ontologies.
This additional information reinforces the confidence in certain values when certain recurrent patterns are observed. In this case, we model recurrent patterns using rules. Experiments conducted on both synthetic and real-world datasets show that a priori knowledge enhances existing models and paves the way towards a more reliable information world. Source code as well as synthetic and real-world datasets are freely available.
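The rationale stated above (true information comes from reliable sources, and reliable sources provide true information) is circular, so it is usually resolved by iterating between the two estimates. The following Python sketch shows one generic fixed-point scheme under that assumption; it is not the thesis's model, it ignores ontologies and partial orders, and the function name `truth_discovery`, the sources and the claims are hypothetical.

```python
from typing import Dict, List, Tuple

Claim = Tuple[str, str, str]  # (source, data item, claimed value)

def truth_discovery(claims: List[Claim], iterations: int = 20):
    """Alternate between claim confidence and source trust (generic sketch).

    Assumes a non-empty list of claims.
    """
    sources = {s for s, _, _ in claims}
    trust: Dict[str, float] = {s: 1.0 for s in sources}
    confidence: Dict[Tuple[str, str], float] = {}
    for _ in range(iterations):
        # A value is as credible as the summed trust of the sources claiming it.
        confidence = {}
        for s, item, value in claims:
            confidence[(item, value)] = confidence.get((item, value), 0.0) + trust[s]
        top = max(confidence.values())
        confidence = {k: v / top for k, v in confidence.items()}
        # A source is as trustworthy as the summed confidence of its claims.
        trust = {s: 0.0 for s in sources}
        for s, item, value in claims:
            trust[s] += confidence[(item, value)]
        top = max(trust.values())
        trust = {s: v / top for s, v in trust.items()}
    # Keep the most credible value for each data item.
    best: Dict[str, str] = {}
    for (item, value), score in confidence.items():
        if item not in best or score > confidence[(item, best[item])]:
            best[item] = value
    return trust, best

# Hypothetical toy claims: sources A and B agree, source C disagrees.
claims = [
    ("A", "capital_fr", "Paris"), ("A", "capital_it", "Rome"),
    ("B", "capital_fr", "Paris"), ("B", "capital_it", "Rome"),
    ("C", "capital_fr", "Lyon"), ("C", "capital_it", "Milan"),
]
trust, best = truth_discovery(claims)
```

On this toy input the agreeing sources end up with the highest trust and their values are selected; real models differ in how they normalize and combine the two scores.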
Article
Recent advances in information extraction have led to huge knowledge bases (KBs), which capture knowledge in a machine-readable format. Inductive logic programming (ILP) can be used to mine logical rules from these KBs, such as “If two persons are married, then they (usually) live in the same city.” While ILP is a mature field, mining logical rules from KBs is difficult, because KBs make an open-world assumption. This means that absent information cannot be taken as counterexamples. Our approach AMIE (Galárraga et al. in WWW, 2013) has shown how rules can be mined effectively from KBs even in the absence of counterexamples. In this paper, we show how this approach can be optimized to mine even larger KBs with more than 12M statements. Extensive experiments show how our new approach, AMIE+, extends to areas of mining that were previously beyond reach.
Conference Paper
Recent advances in information extraction have led to huge knowledge bases (KBs), which capture knowledge in a machine-readable format. Inductive Logic Programming (ILP) can be used to mine logical rules from the KB. These rules can help deduce and add missing knowledge to the KB. While ILP is a mature field, mining logical rules from KBs is different in two respects. First, current rule mining systems are easily overwhelmed by the amount of data (state-of-the-art systems cannot even run on today's KBs). Second, ILP usually requires counterexamples. KBs, however, implement the open world assumption (OWA), meaning that absent data cannot be used as counterexamples. In this paper, we develop a rule mining model that is explicitly tailored to support the OWA scenario. It is inspired by association rule mining and introduces a novel measure for confidence. Our extensive experiments show that our approach outperforms state-of-the-art approaches in terms of precision and coverage. Furthermore, our system, AMIE, mines rules orders of magnitude faster than state-of-the-art approaches.
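The abstract does not define its confidence measure. As a hedged illustration of why the OWA matters, the Python sketch below contrasts standard confidence, which treats every missing head fact as a counterexample, with a variant that counts a counterexample only when the subject already has some fact for the head relation (a partial completeness assumption). The function `confidences`, the relation names and the toy facts are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)

def confidences(kb: Set[Fact], body_rel: str, head_rel: str) -> Tuple[float, float]:
    """Score the rule body_rel(x, y) => head_rel(x, y).

    Returns (standard confidence, confidence under partial completeness).
    """
    body_pairs = {(s, o) for s, r, o in kb if r == body_rel}
    head_pairs = {(s, o) for s, r, o in kb if r == head_rel}
    head_subjects = {s for s, _ in head_pairs}
    support = len(body_pairs & head_pairs)
    # Standard confidence: every body pair without the head fact counts against the rule.
    standard = support / len(body_pairs) if body_pairs else 0.0
    # Partial completeness: only subjects with at least one known head fact are counted.
    known = sum(1 for s, _ in body_pairs if s in head_subjects)
    partial = support / known if known else 0.0
    return standard, partial

# Hypothetical toy KB: nothing is known about where carl lives.
kb = {
    ("anna", "bornIn", "paris"), ("anna", "livesIn", "paris"),
    ("bob", "bornIn", "rome"), ("bob", "livesIn", "oslo"),
    ("carl", "bornIn", "lima"),
}
standard, partial = confidences(kb, "bornIn", "livesIn")
```

Here the missing `livesIn` fact for carl lowers the standard confidence but is ignored by the second measure, which only penalizes the rule for bob, whose residence is known and differs.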
Article
An important characteristic of all natural systems is the ability to acquire knowledge through experience and to adapt to new situations. Learning is the single unifying theme of all natural systems. One of the basic ways of gaining knowledge is through examples of some concepts. For instance, we may learn how to distinguish a dog from other creatures after we have seen a number of creatures and someone (a teacher, or supervisor) has told us which creatures are dogs and which are not. This way of learning is called supervised learning. Inductive Concept Learning (ICL) constitutes a central topic in machine learning. The problem can be formulated in the following manner: given a description language used to express possible hypotheses, background knowledge, a set of positive examples, and a set of negative examples, one has to find a hypothesis that covers all positive examples and none of the negative ones. This is a supervised way of learning, since a supervisor has already classified the examples of the concept into positive and negative examples. The learned concept can then be used to classify previously unseen examples. In general, deriving general conclusions from specific observations is called induction. Thus, in ICL, concepts are induced because they are obtained from the observation of a limited set of training examples. The process can be seen as a search process: starting from an initial hypothesis, the space of possible hypotheses is searched for one that fits the given set of examples. A representation language has to be chosen in order to represent concepts, examples and the background knowledge. This is an important choice, because it may limit the kind of concept we can learn. With a representation language of low expressive power we may not be able to represent some problem domains, because they are too complex for the language adopted.
On the other hand, a highly expressive language may let us represent all problem domains. However, it may also give us too much freedom, in the sense that concepts can be built in too many different ways, which could make it impossible to find the right concept. We are interested in learning concepts expressed in a fragment of first-order logic (FOL). This subject is known as Inductive Logic Programming (ILP), where the knowledge to be learned is expressed by Horn clauses, which are used in programming languages based on logic programming, such as Prolog. Learning systems that use a representation based on first-order logic have been successfully applied to relevant real-life problems, e.g., learning a specific property related to carcinogenicity. Learning first-order hypotheses is a hard task, due to the huge search space one has to deal with. The approach used by the majority of ILP systems tries to overcome this problem by using specific search strategies, like the top-down and the inverse resolution mechanisms. However, the greedy selection strategies adopted to reduce the computational effort often render techniques based on this approach incapable of escaping from local optima. An alternative approach is offered by genetic algorithms (GAs). GAs have proved to be successful in solving comparatively hard optimization problems, as well as problems like ICL. GAs represent a good approach when the problems to solve are characterized by a high number of variables, when there is interaction among variables, when there are mixed types of variables, e.g., numerical and nominal, and when the search space presents many local optima. Moreover, it is easy to hybridize GAs with other techniques that are known to be good for solving some classes of problems.
Another appealing feature of GAs is their intrinsic parallelism and their use of exploration operators, which give them the possibility of escaping from local optima. However, this latter characteristic of GAs is also responsible for their rather poor performance on learning tasks that are easy to tackle with algorithms that use specific search strategies. These observations suggest that the two approaches described above, i.e., standard ILP strategies and GAs, are applicable to partly complementary classes of learning problems. More importantly, they indicate that a system incorporating features from both approaches could profit from the benefits of each. This motivates the aim of this thesis, which is to develop a system based on GAs for ILP that incorporates search strategies used in successful ILP systems. Our approach is inspired by memetic algorithms, a population-based search method for combinatorial optimization problems. In evolutionary computation, memetic algorithms are GAs in which individuals can be refined during their lifetime.
Conference Paper
A search approach is presented, based on a novel algorithm called QG (Quick Generalisation). QG carries out a random-restart stochastic bottom-up search which efficiently generates a consistent clause on the fringe of the refinement graph search without needing to explore the graph in detail. We use a Genetic Algorithm (GA) to evolve and re-combine clauses generated by QG. Initial experiments with QG/GA indicate that this approach can be more efficient than standard refinement-graph searches, while generating similar or better solutions.
Conference Paper
In a previous paper we introduced a framework for combining Genetic Algorithms with ILP which included a novel representation for clauses and relevant operators. In this paper we complete the proposed framework by introducing a fast evaluation mechanism. In this evaluation mechanism individuals can be evaluated at genotype level (i.e. bit-strings) without mapping them into corresponding clauses. This is intended to replace the complex task of evaluating clauses (which usually needs repeated theorem proving) with simple bitwise operations. In this paper we also provide an experimental evaluation of the proposed framework. The results suggest that this framework could lead to significantly increased efficiency in problems involving complex target theories.
Conference Paper
We tackle the problem of statistical learning in the standard knowledge base representations for the Semantic Web, which are ultimately expressed in Description Logics. Specifically, in our method a kernel function for the $\mathcal{ALCN}$ logic is integrated with a support vector machine, which enables the use of statistical learning with reference representations. Experiments were performed in which kernel classification is applied to the tasks of resource retrieval and query answering on OWL ontologies.
Article
The breeder genetic algorithm (BGA) models artificial selection as performed by human breeders. The science of breeding is based on advanced statistical methods. In this paper a connection between genetic algorithm theory and the science of breeding is made. We show how the response to selection equation and the concept of heritability can be applied to predict the behavior of the BGA. Selection, recombination, and mutation are analyzed within this framework. It is shown that recombination and mutation are complementary search operators. The theoretical results are obtained under the assumption of additive gene effects. For general fitness landscapes, regression techniques for estimating the heritability are used to analyze and control the BGA. The method of decomposing the genetic variance into an additive and a nonadditive part connects the case of additive fitness functions with the general case.
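For orientation, the response to selection equation mentioned above can be written in its usual quantitative-genetics form. The notation below is an assumption chosen for illustration and is not taken from the abstract: $M(t)$ is the mean fitness of the population at generation $t$, $M_s(t)$ the mean fitness of the selected parents, and $b_t$ the realized heritability.

$$
R(t) = M(t+1) - M(t), \qquad S(t) = M_s(t) - M(t), \qquad R(t) = b_t \, S(t)
$$

Under this reading, the gain in mean fitness from one generation to the next is the selection differential $S(t)$ scaled by how much of the parents' advantage is inherited by their offspring.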
Article
Onto-Relational Learning is an extension of Relational Learning aimed at accounting for ontologies in a clear, well-founded and elegant manner. The system $\mathcal{AL}$-QuIn supports a variant of the frequent pattern discovery task by following the Onto-Relational Learning approach. It takes taxonomic ontologies into account during the discovery process and produces descriptions of a given relational database at multiple granularity levels. The functionalities of the system are illustrated by means of examples taken from a Semantic Web Mining case study concerning the analysis of relational data extracted from the on-line CIA World Fact Book.
Chapter
We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm.
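The abstract does not spell out the algorithm, but association rules over transactions are conventionally scored by support and confidence. The Python sketch below computes both for a single rule on a toy transaction list; the helper `rule_stats`, the items and the transactions are hypothetical, and no buffer management, estimation or pruning is attempted.

```python
from typing import FrozenSet, List, Tuple

def rule_stats(transactions: List[FrozenSet[str]],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent."""
    # Transactions containing every item of the antecedent.
    n_body = sum(1 for t in transactions if antecedent <= t)
    # Transactions containing both the antecedent and the consequent.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions) if transactions else 0.0
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence

# Hypothetical toy data: four customer visits.
transactions = [frozenset(t) for t in (
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
)]
support, confidence = rule_stats(
    transactions, frozenset({"bread"}), frozenset({"butter"})
)
```

A mining algorithm would enumerate candidate rules and keep those whose support and confidence exceed user-chosen thresholds; this sketch only evaluates one given rule.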
Article
The Breeder Genetic Algorithm (BGA) models artificial selection as performed by human breeders. The science of breeding is based on advanced statistical methods. In this paper a connection between genetic algorithm theory and the science of breeding is made. We show how the response to selection equation and the concept of heritability can be applied to predict the behavior of the BGA. Selection, recombination and mutation are analyzed within this framework. It is shown that recombination and mutation are complementary search operators. The theoretical results are obtained under the assumption of additive gene effects. For general fitness landscapes, regression techniques for estimating the heritability are used to analyze and control the BGA. The method of decomposing the genetic variance into an additive and a nonadditive part connects the case of additive fitness functions with the general case.
Article
We propose a new method for mining frequent patterns in a language that combines both Semantic Web ontologies and rules. In particular we consider the setting of using a language that combines description logics with DL-safe rules. This setting is important for the practical application of data mining to the Semantic Web. We focus on the relation of the semantics of the representation formalism to the task of frequent pattern discovery, and for the core of our method, we propose an algorithm that exploits the semantics of the combined knowledge base. We have developed a proof-of-concept data mining implementation of this. Using this we have empirically shown that using the combined knowledge base to perform semantic tests can make data mining faster by pruning useless candidate patterns before their evaluation. We have also shown that the quality of the set of patterns produced may be improved: the patterns are more compact, and there are fewer patterns. We conclude that exploiting the semantics of a chosen representation formalism is key to the design and application of (onto-)relational frequent pattern discovery methods. Note: To appear in Theory and Practice of Logic Programming (TPLP) Comment: 40 pages, 6 figures, 6 tables
Article
Inductive learning in First-Order Logic (FOL) is a hard task due to both the prohibitive size of the search space and the computational cost of evaluating hypotheses.
Conference Paper
In the Semantic Web context, OWL ontologies represent the conceptualization of domains of interest while the corresponding assertional knowledge is given by the heterogeneous Web resources referring to them. Being strongly decoupled, ontologies and assertions can be out of sync. An ontology can be incomplete, noisy and sometimes inconsistent with regard to the actual usage of its conceptual vocabulary in the assertions. Data mining can support the discovery of hidden knowledge patterns in the data, to enrich the ontologies. We present a method for discovering multi-relational association rules, coded in SWRL, from ontological knowledge bases. Unlike state-of-the-art approaches, the method is able to take the intensional knowledge into account. Furthermore, since discovered rules are represented in SWRL, they can be straightforwardly integrated within the ontology, thus (i) enriching its expressive power and (ii) augmenting the assertional knowledge that can be derived. Discovered rules may also suggest new axioms to be added to the ontology. We performed experiments on publicly available ontologies, validating the performance of our approach.
Article
Inductive logic programming (ILP) algorithms are classification algorithms that construct classifiers represented as logic programs. ILP algorithms have a number of attractive features, notably the ability to make use of declarative background (user-supplied) knowledge. However, ILP algorithms deal poorly with large data sets ($>10^4$ examples) and their widespread use of the greedy set-covering algorithm renders them susceptible to local maxima in the space of logic programs. This paper presents a novel approach to address these problems based on combining the local search properties of an inductive logic programming algorithm with the global search properties of an evolutionary algorithm. The proposed algorithm may be viewed as an evolutionary wrapper around a population of ILP algorithms. The evolutionary wrapper approach is evaluated on two domains. The chess-endgame (KRK) problem is an artificial domain that is a widely used benchmark in inductive logic programming, and Part-of-Speech Tagging is a real-world problem from the field of Natural Language Processing. In the latter domain, data originates from excerpts of the Wall Street Journal. Results indicate that significant improvements in predictive accuracy can be achieved over a conventional ILP approach when data is plentiful and noisy.
Conference Paper
Both OWL-DL and function-free Horn rules are decidable logics with interesting, yet orthogonal expressive power: from the rules perspective, OWL-DL is restricted to tree-like rules, but provides both existentially and universally quantified variables and full, monotonic negation. From the description logic perspective, rules are restricted to universal quantification, but allow for the interaction of variables in arbitrary ways. Clearly, a combination of OWL-DL and rules is desirable for building Semantic Web ontologies, and several such combinations have already been discussed. However, such a combination might easily lead to the undecidability of interesting reasoning problems. Here, we present a decidable such combination which is, to the best of our knowledge, more general than similar decidable combinations proposed so far. Decidability is obtained by restricting rules to so-called DL-safe ones, requiring each variable in a rule to occur in a non-DL-atom in the rule body. We show that query answering in such a combined logic is decidable, and we discuss its expressive power by means of a non-trivial example. Finally, we present an algorithm for query answering in $\mathcal{SHIQ}(\mathbf{D})$ extended with DL-safe rules based on the reduction to disjunctive datalog.
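The DL-safety condition stated above is purely syntactic, so it can be checked mechanically. The following Python sketch tests whether every variable of a rule occurs in at least one body atom whose predicate is not a DL predicate. The atom encoding, the `?` prefix for variables, the function `is_dl_safe` and the example predicates (including the non-DL predicate `O`) are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # (predicate, arguments); variables start with "?"

def variables(atoms: List[Atom]) -> Set[str]:
    """Collect the variables occurring in a list of atoms."""
    return {arg for _, args in atoms for arg in args if arg.startswith("?")}

def is_dl_safe(head: List[Atom], body: List[Atom], dl_predicates: Set[str]) -> bool:
    """True if each variable of the rule occurs in a non-DL-atom of the body."""
    non_dl_body = [atom for atom in body if atom[0] not in dl_predicates]
    return variables(head + body) <= variables(non_dl_body)

# Hypothetical vocabulary: hasParent and sibling come from the ontology, O does not.
dl_predicates = {"hasParent", "sibling"}
head = [("sibling", ("?x", "?y"))]
unsafe_body = [("hasParent", ("?x", "?z")), ("hasParent", ("?y", "?z"))]
safe_body = unsafe_body + [("O", ("?x",)), ("O", ("?y",)), ("O", ("?z",))]

unsafe_result = is_dl_safe(head, unsafe_body, dl_predicates)
safe_result = is_dl_safe(head, safe_body, dl_predicates)
```

In this example the first rule is rejected because its variables appear only in DL-atoms, while adding the `O(...)` atoms binds every variable through a non-DL predicate and makes the rule pass the check.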
Article
In this paper, we present a brief overview of Pellet: a complete OWL-DL reasoner with acceptable to very good performance, extensive middleware, and a number of unique features. Pellet is the first sound and complete OWL-DL reasoner with extensive support for reasoning with individuals (including nominal support and conjunctive query), user-defined datatypes, and debugging support for ontologies. It implements several extensions to OWL-DL including a combination formalism for OWL-DL ontologies, a non-monotonic operator, and preliminary support for OWL/Rule hybrid reasoning. Pellet is written in Java and is open source.
Conference Paper
Most search techniques within ILP require the evaluation of a large number of inconsistent clauses. However, acceptable clauses typically need to be consistent, and are only found at the "fringe" of the search space. A search approach is presented, based on a novel algorithm called QG (Quick Generalization). QG carries out a random-restart stochastic bottom-up search which efficiently generates a consistent clause on the fringe of the refinement graph search without needing to explore the graph in detail. We use a Genetic Algorithm (GA) to evolve and re-combine clauses generated by QG. In this QG/GA setting, QG is used to seed a population of clauses processed by the GA. Experiments with QG/GA indicate that this approach can be more efficient than standard refinement-graph searches, while generating similar or better solutions.
Conference Paper
While the realization of the Semantic Web as once envisioned by Tim Berners-Lee remains in a distant future, the Web of Data has already become a reality. Billions of RDF statements on the Internet, facts about a variety of different domains, are ready to be used by semantic applications. Some of these applications, however, crucially hinge on the availability of expressive schemas suitable for logical inference that yields non-trivial conclusions. In this paper, we present a statistical approach to the induction of expressive schemas from large RDF repositories. We describe in detail the implementation of this approach and report on an evaluation that we conducted using several data sets including DBpedia.
Article
We present a novel reasoning calculus for the description logic $\mathcal{SHOIQ}^+$, a knowledge representation formalism with applications in areas such as the Semantic Web. Unnecessary nondeterminism and the construction of large models are two primary sources of inefficiency in the tableau-based reasoning calculi used in state-of-the-art reasoners. In order to reduce nondeterminism, we base our calculus on hypertableau and hyperresolution calculi, which we extend with a blocking condition to ensure termination. In order to reduce the size of the constructed models, we introduce anywhere pairwise blocking. We also present an improved nominal introduction rule that ensures termination in the presence of nominals, inverse roles, and number restrictions, a combination of DL constructs that has proven notoriously difficult to handle. Our implementation shows significant performance improvements over state-of-the-art reasoners on several well-known ontologies.
Article
This document contains a proposal for a Semantic Web Rule Language (SWRL) based on a combination of the OWL DL and OWL Lite sublanguages of the OWL Web Ontology Language with the Unary/Binary Datalog RuleML sublanguages of the Rule Markup Language. SWRL includes a high-level abstract syntax for Horn-like rules in both the OWL DL and OWL Lite sublanguages of OWL. A model-theoretic semantics is given to provide the formal meaning for OWL ontologies including rules written in this abstract syntax. An XML syntax based on RuleML and the OWL XML Presentation Syntax as well as an RDF concrete syntax based on the OWL RDF/XML exchange syntax are also given, along with several examples.
Article
Although the OWL Web Ontology Language adds considerable expressive power to the Semantic Web, it does have expressive limitations, particularly with respect to what can be said about properties. We present ORL (OWL Rules Language), a Horn clause rules extension to OWL that overcomes many of these limitations. ORL extends OWL in a syntactically and semantically coherent manner: the basic syntax for ORL rules is an extension of the abstract syntax for OWL DL and OWL Lite; ORL rules are given formal meaning via an extension of the OWL DL model-theoretic semantics; ORL rules are given an XML syntax based on the OWL XML presentation syntax; and a mapping from ORL rules to RDF graphs is given based on the OWL RDF/XML exchange syntax. We discuss the expressive power of ORL, showing that the ontology consistency problem is undecidable, provide several examples of ORL usage, and discuss how reasoning support for ORL might be provided.
Learning with kernels in description logics
• N Fanizzi
• C Amato
• F Esposito
N. Fanizzi, C. d'Amato, and F. Esposito. Learning with kernels in description logics. In F. Zelezný and N. Lavrac, editors, Proceedings of the 18th Int. Conf. on Inductive Logic Programming (ILP 2008), volume 5194 of LNCS, pages 210-225. Springer, 2008.
Genetic Relational Search for Inductive Concept Learning: A Memetic Algorithm for ILP
• Federico Divina
Federico Divina. Genetic Relational Search for Inductive Concept Learning: A Memetic Algorithm for ILP. LAP LAMBERT Academic Publishing, 2010.
Statistical schema induction
• J Völker
• M Niepert
J. Völker and M. Niepert. Statistical schema induction. In ESWC'11 Proc., volume 6643 of LNCS, pages 124-138. Springer, 2011.