Boris Motik’s research while affiliated with University of Oxford and other places

Publications (192)


Goal-Driven Query Answering over First- and Second-Order Dependencies with Equality
  • Preprint
  • File available

December 2024

·

2 Reads

·

Boris Motik

Query answering over data with dependencies plays a central role in most applications of dependencies. The problem is commonly solved by using a suitable variant of the chase algorithm to compute a universal model of the dependencies and the data and thus explicate all knowledge implicit in the dependencies. After this preprocessing step, an arbitrary conjunctive query over the dependencies and the data can be answered by evaluating it over the computed universal model. If, however, the query to be answered is fixed and known in advance, computing the universal model is often inefficient as many inferences made during this process can be irrelevant to a given query. In such cases, a goal-driven approach, which avoids drawing unnecessary inferences, promises to be more efficient and thus preferable in practice. In this paper we present what we believe to be the first technique for goal-driven query answering over first- and second-order dependencies with equality reasoning. Our technique transforms the input dependencies so that applying the chase to the output avoids many inferences that are irrelevant to the query. The transformation proceeds in several steps, which comprise the following three novel techniques. First, we present a variant of the singularisation technique by Marnette [60] that is applicable to second-order dependencies and that corrects an incompleteness of a related formulation by ten Cate et al. [74]. Second, we present a relevance analysis technique that can eliminate from the input those dependencies that provably do not contribute to query answers. Third, we present a variant of the magic sets algorithm [19] that can handle second-order dependencies with equality reasoning. We also present the results of an extensive empirical evaluation, which show that goal-driven query answering can be orders of magnitude faster than computing the full universal model.
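
The abstract contrasts goal-driven answering with the standard chase-then-query pipeline. The following is a minimal, purely illustrative Python sketch of that standard pipeline under strong simplifying assumptions (no existential variables, no equality reasoning, no second-order dependencies; all predicates and rules are hypothetical), not the paper's algorithm: dependencies are applied up to a fixpoint, and the conjunctive query is then evaluated over the resulting model.

```python
# Minimal chase-then-query sketch (illustration only, not the paper's method).
# Facts are tuples ("pred", arg1, ...); a rule is (body_atoms, head_atom),
# with variables written as strings starting with "?".

def match(atom, fact, subst):
    """Try to extend substitution `subst` so that `atom` matches `fact`."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    subst = dict(subst)
    for a, f in zip(atom[1:], fact[1:]):
        if a.startswith("?"):
            if subst.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return subst

def match_body(body, facts, subst):
    """Enumerate all substitutions matching every body atom against `facts`."""
    if not body:
        yield subst
        return
    for fact in facts:
        s = match(body[0], fact, subst)
        if s is not None:
            yield from match_body(body[1:], facts, s)

def chase(facts, rules):
    """Apply all rules up to a fixpoint, computing a (universal) model."""
    facts = set(facts)
    while True:
        new_facts = set()
        for body, head in rules:
            for subst in match_body(body, facts, {}):
                derived = tuple([head[0]] + [subst.get(t, t) for t in head[1:]])
                if derived not in facts:
                    new_facts.add(derived)
        if not new_facts:
            return facts
        facts |= new_facts

# Hypothetical dependencies: "related" is symmetric and transitive.
rules = [
    ([("related", "?x", "?y")], ("related", "?y", "?x")),
    ([("related", "?x", "?y"), ("related", "?y", "?z")], ("related", "?x", "?z")),
]
data = {("related", "a", "b"), ("related", "b", "c")}
model = chase(data, rules)
query = [("related", "a", "?z")]                             # conjunctive query
print(sorted({s["?z"] for s in match_body(query, model, {})}))  # ['a', 'b', 'c']
```

The goal-driven techniques in the paper aim to avoid materialising the parts of such a model that a fixed, known query can never touch.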

Rewriting the Infinite Chase for Guarded TGDs

September 2024

·

2 Reads

ACM Transactions on Database Systems

·

Maxime Buron

·

Stefano Germano

·

[...]

·

Boris Motik

Guarded tuple-generating dependencies (GTGDs) are a natural extension of description logics and referential constraints. It has long been known that queries over GTGDs can be answered by a variant of the chase, a quintessential technique for reasoning with dependencies. However, there has been little work on concrete algorithms and even less on implementation. To address this gap, we revisit Datalog rewriting approaches to query answering, where a set of GTGDs is transformed to a Datalog program that entails the same base facts on each base instance. We show that a rewriting consists of “shortcut” rules that circumvent certain chase steps, we present several algorithms that compute a rewriting by deriving such “shortcuts” efficiently, and we discuss important implementation issues. Finally, we show empirically that our techniques can process complex GTGDs derived from synthetic and real benchmarks and are thus suitable for practical use.
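
As a toy illustration of the "shortcut" idea described above (hypothetical dependencies, not taken from the paper), consider two GTGDs whose chase passes through an existentially quantified atom; the only new base fact they produce can also be derived directly by a single Datalog rule:

```python
# Hand-worked toy example of a "shortcut" rule (hypothetical dependencies).
#
#   GTGD 1:  R(x, y) -> exists z . S(y, z)
#   GTGD 2:  S(y, z) -> T(y)
#
# Chase on the base instance {R(a, b)}:
#   GTGD 1 derives S(b, n1)   (n1 is a fresh labelled null)
#   GTGD 2 derives T(b)
#
# T(b) is the only new base fact and it never mentions the null, so the two
# chase steps can be circumvented by the single Datalog "shortcut" rule
#   T(y) :- R(x, y)
facts = {("R", "a", "b")}
derived = {("T", y) for (pred, x, y) in facts if pred == "R"}  # apply shortcut
print(derived)   # {('T', 'b')} -- the same base fact as the two chase steps
```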


Accurate Sampling-Based Cardinality Estimation for Complex Graph Queries

August 2024

·

7 Reads

ACM Transactions on Database Systems

Accurately estimating the cardinality (i.e., the number of answers) of complex queries plays a central role in database systems. This problem is particularly difficult in graph databases, where queries often involve a large number of joins and self-joins. Recently, Park et al. [55] surveyed seven state-of-the-art cardinality estimation approaches for graph queries. The results of their extensive empirical evaluation show that a sampling method based on the WanderJoin online aggregation algorithm [47] consistently offers superior accuracy. We extended the framework by Park et al. [55] with three additional datasets and repeated their experiments. Our results showed that WanderJoin is indeed very accurate, but it can often take a large number of samples and thus be very slow. Moreover, when queries are complex and data distributions are skewed, it often fails to find valid samples and estimates the cardinality as zero. Finally, complex graph queries often go beyond simple graph matching and involve arbitrary nesting of relational operators such as disjunction, difference, and duplicate elimination. None of the methods considered by Park et al. [55] is applicable to such queries. In this paper we present a novel approach for estimating the cardinality of complex graph queries. Our approach is inspired by WanderJoin, but, unlike all approaches known to us, it can process complex queries with arbitrary operator nesting. Our estimator is strongly consistent, meaning that the average of repeated estimates converges with probability one to the actual cardinality. We present optimisations of the basic algorithm that aim to reduce the chance of producing zero estimates and improve accuracy. We show empirically that our approach is both accurate and quick on complex queries and large datasets. Finally, we discuss how to integrate our approach into a simple dynamic programming query planner, and we confirm empirically that our planner produces high-quality plans that can significantly reduce end-to-end query evaluation times.
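
To make the sampling idea concrete, here is a self-contained sketch of a WanderJoin-style estimator, simplified by assumption to a single two-way join rather than the complex queries with nested operators that the paper targets (the relations and figures are made up): each random walk through the join is weighted by the inverse of its sampling probability, and the average of these weights is an unbiased cardinality estimate.

```python
# WanderJoin-style sampling sketch (illustration only) for the cardinality of
# a two-way join: edges1(a, b) joined with edges2(b, c) on the middle node b.
import random
from collections import defaultdict

def estimate_join_cardinality(edges1, edges2, samples=10_000):
    # Index the second relation by its join attribute for O(1) walk extension.
    by_source = defaultdict(list)
    for b, c in edges2:
        by_source[b].append((b, c))

    total = 0.0
    for _ in range(samples):
        a, b = random.choice(edges1)            # step 1: probability 1/|edges1|
        candidates = by_source.get(b, [])
        if not candidates:
            continue                            # dead-end walk contributes 0
        random.choice(candidates)               # step 2: probability 1/|candidates|
        prob = (1 / len(edges1)) * (1 / len(candidates))
        total += 1 / prob                       # Horvitz-Thompson weight
    return total / samples

# Toy data: the exact join cardinality is 2*2 + 1*1 = 5.
edges1 = [(1, 2), (1, 3), (4, 2)]
edges2 = [(2, 5), (2, 6), (3, 5)]
print(estimate_join_cardinality(edges1, edges2))   # close to 5.0
```

Dead-end walks contribute zero, which is the failure mode the abstract points out: on complex queries over skewed data, most walks may fail to find valid samples, pushing the estimate towards zero.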


On the Correspondence Between Monotonic Max-Sum GNNs and Datalog

September 2023

·

9 Reads

·

2 Citations

Although there has been significant interest in applying machine learning techniques to structured data, the expressivity (i.e., a description of what can be learned) of such techniques is still poorly understood. In this paper, we study data transformations based on graph neural networks (GNNs). First, we note that the choice of how a dataset is encoded into a numeric form processable by a GNN can obscure the characterisation of a model's expressivity, and we argue that a canonical encoding provides an appropriate basis. Second, we study the expressivity of monotonic max-sum GNNs, which cover a subclass of GNNs with max and sum aggregation functions. We show that, for each such GNN, one can compute a Datalog program such that applying the GNN to any dataset produces the same facts as a single round of application of the program's rules to the dataset. Monotonic max-sum GNNs can sum an unbounded number of feature vectors which can result in arbitrarily large feature values, whereas rule application requires only a bounded number of constants. Hence, our result shows that the unbounded summation of monotonic max-sum GNNs does not increase their expressive power. Third, we sharpen our result to the subclass of monotonic max GNNs, which use only the max aggregation function, and identify a corresponding class of Datalog programs.
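
As a rough, hand-built illustration of this correspondence (not the construction from the paper; the graph, weights, and threshold are hypothetical), a single max-aggregation layer over a canonical 0/1 encoding of a dataset can be read as one round of applying a Datalog rule:

```python
# Toy "monotonic max" GNN layer over 0/1 node features, compared with one
# round of applying the Datalog rule  B(x) :- edge(y, x), A(y).
# Feature position 0 encodes the unary predicate A, position 1 encodes B.
import numpy as np

edges = [(0, 1), (1, 2)]                  # directed graph on nodes 0, 1, 2
X = np.zeros((3, 2))
X[0, 0] = 1.0                             # the dataset contains A(0)

def max_gnn_layer(X, edges):
    """y_v = step(W_self @ x_v + W_nbr @ max over in-neighbours of v)."""
    W_self = np.eye(2)                    # keep already-derived facts
    W_nbr = np.array([[0.0, 0.0],         # B(v) becomes true whenever some
                      [1.0, 0.0]])        # in-neighbour u satisfies A(u)
    Y = np.zeros_like(X)
    for v in range(X.shape[0]):
        nbrs = [X[u] for (u, w) in edges if w == v]
        agg = np.max(nbrs, axis=0) if nbrs else np.zeros(X.shape[1])
        Y[v] = (W_self @ X[v] + W_nbr @ agg >= 1.0).astype(float)
    return Y

print(max_gnn_layer(X, edges))
# Node 1 gets B because its in-neighbour 0 satisfies A; nodes 0 and 2 do not,
# exactly as one application of the rule B(x) :- edge(y, x), A(y) would derive.
```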


On the Correspondence Between Monotonic Max-Sum GNNs and Datalog

May 2023

·

17 Reads

Although there has been significant interest in applying machine learning techniques to structured data, the expressivity (i.e., a description of what can be learned) of such techniques is still poorly understood. In this paper, we study data transformations based on graph neural networks (GNNs). First, we note that the choice of how a dataset is encoded into a numeric form processable by a GNN can obscure the characterisation of a model's expressivity, and we argue that a canonical encoding provides an appropriate basis. Second, we study the expressivity of monotonic max-sum GNNs, which cover a subclass of GNNs with max and sum aggregation functions. We show that, for each such GNN, one can compute a Datalog program such that applying the GNN to any dataset produces the same facts as a single round of application of the program's rules to the dataset. Monotonic max-sum GNNs can sum an unbounded number of feature vectors which can result in arbitrarily large feature values, whereas rule application requires only a bounded number of constants. Hence, our result shows that the unbounded summation of monotonic max-sum GNNs does not increase their expressive power. Third, we sharpen our result to the subclass of monotonic max GNNs, which use only the max aggregation function, and identify a corresponding class of Datalog programs.


Rewriting the Infinite Chase

December 2022

·

15 Reads

Guarded tuple-generating dependencies (GTGDs) are a natural extension of description logics and referential constraints. It has long been known that queries over GTGDs can be answered by a variant of the chase, a quintessential technique for reasoning with dependencies. However, there has been little work on concrete algorithms and even less on implementation. To address this gap, we revisit Datalog rewriting approaches to query answering, where GTGDs are transformed to a Datalog program that entails the same base facts on each base instance. We show that the rewriting can be seen as containing "shortcut" rules that circumvent certain chase steps, we present several algorithms that compute the rewriting by simulating specific types of chase steps, and we discuss important implementation issues. Finally, we show empirically that our techniques can process complex GTGDs derived from synthetic and real benchmarks and are thus suitable for practical use.


Rewriting the infinite chase

July 2022

·

4 Reads

·

7 Citations

Proceedings of the VLDB Endowment

Guarded tuple-generating dependencies (GTGDs) are a natural extension of description logics and referential constraints. It has long been known that queries over GTGDs can be answered by a variant of the chase, a quintessential technique for reasoning with dependencies. However, there has been little work on concrete algorithms and even less on implementation. To address this gap, we revisit Datalog rewriting approaches to query answering, where GTGDs are transformed to a Datalog program that entails the same base facts on each base instance. We show that the rewriting can be seen as containing "shortcut" rules that circumvent certain chase steps, we present several algorithms that compute the rewriting by simulating specific types of chase steps, and we discuss important implementation issues. Finally, we show empirically that our techniques can process complex GTGDs derived from synthetic and real benchmarks and are thus suitable for practical use.


Faithful Approaches to Rule Learning

July 2022

·

6 Reads

·

6 Citations

Rule learning involves developing machine learning models that can be applied to a set of logical facts to predict additional facts, as well as providing methods for extracting from the learned model a set of logical rules that explain symbolically the model's predictions. Existing approaches of this kind, however, do not formally describe the relationship between the model's predictions and the derivations of the extracted rules; rather, it is often claimed without justification that the extracted rules 'approximate' or 'explain' the model, and rule quality is evaluated by manual inspection. In this paper, we study the formal properties of Neural-LP, a prominent rule learning approach. We show that the rules extracted from Neural-LP models can be both unsound and incomplete: on the same input dataset, the extracted rules can derive facts not predicted by the model, and the model can make predictions not derived by the extracted rules. We also propose a modification to the Neural-LP model that ensures that the extracted rules are always sound and complete. Finally, we show that, on several prominent benchmarks, the classification performance of our modified model is comparable to that of the standard Neural-LP model. Thus, faithful learning of rules is feasible from both a theoretical and practical point of view.
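
The soundness and completeness notions discussed here amount to simple set comparisons between the facts a model predicts and the facts the extracted rules derive on the same dataset; the facts below are invented purely for illustration and are not related to Neural-LP's actual output.

```python
# Illustrative check of (un)soundness and (in)completeness of rule extraction.
model_predictions = {("cites", "p1", "p2"), ("cites", "p2", "p3")}
rule_derivations  = {("cites", "p1", "p2"), ("cites", "p3", "p1")}

unsound    = rule_derivations - model_predictions   # rules derive too much
incomplete = model_predictions - rule_derivations   # rules derive too little
print(unsound)      # {('cites', 'p3', 'p1')} -> the extraction is unsound
print(incomplete)   # {('cites', 'p2', 'p3')} -> the extraction is incomplete
```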


The Dow Jones Knowledge Graph

May 2022

·

19 Reads

·

1 Citation

Lecture Notes in Computer Science

Dow Jones is a leading provider of market, industry and portfolio intelligence, serving a wide range of financial applications including asset management, trading, analysis and bankruptcy/restructuring. The information needed to provide such intelligence comes from a variety of heterogeneous data sources. Integrating this information and answering complex queries over it presents both conceptual and computational challenges. To address these challenges, Dow Jones has used the RDFox system to integrate the various sources in a large RDF knowledge graph. The knowledge graph is being used to power an expanding range of internal processes and market intelligence products.


Modular materialisation of Datalog programs

April 2022

·

16 Reads

·

2 Citations

Artificial Intelligence

Answering queries over large datasets extended with Datalog rules plays a key role in numerous data management applications, and it has been implemented in several highly optimised Datalog systems in both academic and commercial contexts. Many systems implement reasoning via materialisation, which involves precomputing all consequences of the rules and the dataset in a preprocessing step. Some systems also use incremental reasoning algorithms, which can update the materialisation efficiently when the input dataset changes. Such techniques allow queries to be processed without any reference to the rules, so they are often used in applications where the performance of query answering is critical. Existing materialisation and incremental reasoning techniques enumerate all possible ways to apply rules to the data in order to derive all relevant consequences. This, however, can be inefficient because derivations of rules commonly used in practice are redundant; for example, rules axiomatising a binary predicate as symmetric and transitive can have a cubic number of applications, yet they can derive at most a quadratic number of facts. Such redundancy can be a significant source of overhead in practice and can prevent Datalog systems from successfully processing large datasets. To address this issue, in this paper we present a novel framework for modular materialisation and incremental reasoning. Our key idea is that, for certain combinations of rules commonly used in practice, all consequences can be derived using specialised procedures that do not necessarily enumerate all possible rule applications. Thus, our framework supports materialisation and incremental reasoning via a collection of modules. Each module is responsible for deriving consequences of a subset of the program, by using either standard rule application or proprietary algorithms. We prove that such an approach is complete as long as each module satisfies certain properties. Our formalisation of a module is very general, and in fact it allows modules to keep arbitrary auxiliary information. We also show how to realise custom procedures for four types of modules: transitivity, symmetry–transitivity, chain rules, and sequencing elements of a total order. Finally, we demonstrate empirically that using our custom procedures can speed up materialisation and incremental reasoning by several orders of magnitude on several well-known benchmarks. Thus, our technique has the potential to significantly improve the scalability of Datalog reasoners.
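
To illustrate what such a specialised procedure can look like (my own simplification, not the paper's framework), a transitivity module can materialise the closure of a binary predicate by reachability search, visiting each derived pair once instead of enumerating every application of the rule T(x, z) :- T(x, y), T(y, z):

```python
# Sketch of a "transitivity module" (illustration only): materialise the
# transitive closure by breadth-first search instead of exhaustive rule
# application, avoiding the cubic enumeration of redundant derivations.
from collections import defaultdict, deque

def transitivity_module(pairs):
    succ = defaultdict(set)
    for x, y in pairs:
        succ[x].add(y)
    closure = set()
    for start in list(succ):
        seen, queue = set(), deque(succ[start])
        while queue:                          # plain BFS from each source node
            y = queue.popleft()
            if y in seen:
                continue
            seen.add(y)
            closure.add((start, y))
            queue.extend(succ[y])
    return closure

print(sorted(transitivity_module({("a", "b"), ("b", "c"), ("c", "d")})))
# all six reachable pairs, from ('a', 'b') up to ('a', 'd')
```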


Citations (79)


... The success of GNNs in applications has stimulated lively research into their theoretical properties such as expressive power. A landmark result is due to Barceló et al. [5], which was among the first to characterize the expressive power of GNNs in terms of logic; see [11, 8, 6, 23] and references therein for related results. More precisely, Barceló et al. show that a basic GNN model with a constant number of iterations has exactly the same expressive power as graded modal logic GML in restriction to properties definable in first-order logic FO. ...

Reference:

Logical Characterizations of Recurrent Graph Neural Networks with Reals and Floats
On the Correspondence Between Monotonic Max-Sum GNNs and Datalog
  • Citing Conference Paper
  • September 2023

... More recently, such reductions for Horn description logics have been implemented and evaluated (Carral, González, and Koopmann 2019). Such Datalog rewritings have also been studied for existential rules, for guarded (Benedikt et al. 2022), nearly guarded (Gottlob, Rudolph, and Simkus 2014), warded (Berger et al. 2022) and shy (Leone et al. 2019) rule sets. ...

Rewriting the infinite chase
  • Citing Article
  • July 2022

Proceedings of the VLDB Endowment

... The canonical transformation T_N induced by a max GNN can only derive new unary facts, while KG completion requires also the derivation of binary facts. To address this limitation, we use an alternative encoding/decoding scheme (Liu et al. 2021; Tena Cucala et al. 2022) where binary facts are also encoded in feature vector components and edges in the graph correspond to different types of possible joins between unary and binary atoms. As shown in (Tena Cucala et al. 2023), such a scheme can be captured by fixed encoding and decoding programs P_enc and P_dec so that the overall transformation is given by T_{P_dec}(T_N(T_{P_enc}(D))), which in turn coincides with T_{P_dec}(T^{L+2}_{P_N}(T_{P_enc}(D))) by Theorem 7. Benchmarks, metrics, and baselines: We used the inductive KG completion benchmarks by Teru, Denis, and Hamilton (2020), based on the FB15K-237 (Bordes et al. 2013), NELL-995 (Xiong, Hoang, and Wang 2017), and WN18RR (Dettmers et al. 2018) KGs. ...

Faithful Approaches to Rule Learning
  • Citing Conference Paper
  • July 2022

... RSP-QL [29] is a reference model that unifies the semantics of the existing RSP approaches. RSP has been extended to support ROT in various ways: (i) solutions incorporating efficient incremental maintenance of materializations of the windowed ontology streams [10, 44, 54, 73], (ii) solutions for expressive Description Logics (DL) [47, 68], and (iii) a solution for Answer Set Programming (ASP) [52]. More central to ROT is the logic-based framework for analyzing reasoning over streams (LARS) [13] that extends ASP for analytical reasoning over data streams. ...

Incremental Update of Datalog Materialisation: the Backward/Forward Algorithm
  • Citing Article
  • February 2015

Proceedings of the AAAI Conference on Artificial Intelligence

... In this approach, no redundancy is allowed, i.e. updated views reflect only currently true logical assertions: this differs from the overgrounding idea, which aims to materialize bigger portions of logic programs that can possibly support true logical assertions. Hu et al. (2022) extended the idea further by proposing a general method in which modular parts of a Datalog view can be attached to ad-hoc incremental maintenance algorithms. For instance, one can plug into the general framework a special incremental algorithm for updating transitive closure patterns, etc. ...

Modular materialisation of Datalog programs
  • Citing Article
  • April 2022

Artificial Intelligence

... As parallelization has been successfully employed in some rule-based ER systems (Deng et al. 2022), another promising but non-trivial direction would be to see how parallel algorithms can be integrated into ASPEN. For this, we hope to build upon existing work on parallelization of Datalog reasoning (Perri, Ricca, and Sirianni 2013; Ajileye and Motik 2022) and ASP solving. ...

Materialisation and data partitioning algorithms for distributed RDF systems
  • Citing Article
  • March 2022

Journal of Web Semantics

... They have indeed been studied to enrich the expressive power of rules expressed in Datalog [Consens and Mendelzon, 1993], or in disjunctive Datalog with a notable implementation in the DLV system [Dell'Armi et al., 2003]. More recently, restrictions of Datalog_Z, an extension of Datalog which captures many data aggregation tasks by allowing arithmetic functions over integers at the cost of undecidability, have been studied to regain decidability, resulting in the fragment Limit Datalog_Z [Cuenca Grau et al., 2020] whose expressive power has been further studied [Kaminski et al., 2021]. ...

The Complexity and Expressive Power of Limit Datalog
  • Citing Article
  • February 2022

Journal of the ACM

... Due to data richness and high coverage, the MAKG has been employed in several research fields and scenarios, including bibliometrics and scientific impact [27][28][29], recommender systems [30], data analytics (i.e., Nesta business intelligence tools), and benchmarking [31]. Further, the MAG, from which the MAKG originally derives, has been extensively investigated [32][33][34] and used for scientific ethics and mobility networks [24,25], and in COVID-19 related studies [35,36]. ...

Streaming Partitioning of RDF Graphs for Datalog Reasoning
  • Citing Chapter
  • May 2021

Lecture Notes in Computer Science

... They have indeed been studied to enrich the expressive power of rules expressed in Datalog [Consens and Mendelzon, 1993], or in disjunctive Datalog with a notable implementation in the DLV system [Dell'Armi et al., 2003]. More recently, restrictions of Datalog_Z, an extension of Datalog which captures many data aggregation tasks by allowing arithmetic functions over integers at the cost of undecidability, have been studied to regain decidability, resulting in the fragment Limit Datalog_Z [Cuenca Grau et al., 2020] whose expressive power has been further studied [Kaminski et al., 2021]. ...

Limit Datalog: A Declarative Query Language for Data Analysis
  • Citing Article
  • February 2020

ACM SIGMOD Record

... Specifically, we first use the pp-ocrv3 model for invoice text recognition in the Chinese setting; the extracted information is then classified into commodity categories using the BERT-TextCNN model. We next construct a knowledge graph and rules for financial reimbursement, and finally the compliance of the invoice is judged via reasoning over the knowledge graph [5]. The experimental results show that our Chinese invoice audit framework can accurately and efficiently replace manual invoice auditing. ...

Datalog Reasoning over Compressed RDF Knowledge Bases
  • Citing Conference Paper
  • November 2019